Prediction Warnings when predicting a fitted caret SVM model - r

I'm hoping to get some pointers as to why I'm getting:
Warning message: In method$predict(modelFit = modelFit, newdata =
newdata, submodels = param) : kernlab class prediction calculations
failed; returning NAs
When I print out the prediction:
svmRadial_Predict
[1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>....
The code I wrote to perform the SVM fitting:
#10-fold cross validation in 3 repetitions
control = trainControl(seeds = s, method="repeatedcv", number=10,
repeats=3, savePredictions = TRUE, classProbs = TRUE)
The SVM model for the fitting is like this:
svmRadial_model = train(y=modelTrain$Emotion,
x=modelTrain[c(2:4)],
method ='svmRadial',
trControl = control,
data=modelTrain,
tuneLength = 3
)
And the code I wrote to perform the prediction looks like this:
svmRadial_Predict <- predict(svmRadial_model,
newdata = modelTest[c(2:4)], probability = TRUE )
I've checked the data, and there are no NA values in the training or testing set. The y value is a factor and the x values are numeric, if that makes a difference. Any tips to debug this would be very much appreciated!
As the model trains I can see warnings like this:
line search fails -1.407938 -0.1710936 2.039448e-05
which I had assumed was just the model being unable to fit a hyperplane for particular observations in the data. I'm using the svmRadial kernel.
The data I'm trying to fit was already centred and scaled using the R scale() function.
Further work leads me to believe it's something to do with the
classProbs = TRUE flag. If I leave it out, no warnings are printed.
I've kicked off another run of my code, SVM seems to take ages to complete on my laptop for this task but I'll report the results as soon as that completes.
As a final edit, the model fitting completed without error, and I can use that model just fine for prediction, calculating the confusion matrix, etc. I don't understand why including classProbs = TRUE breaks it, but maybe it's related to the extra cross validation that option triggers, combined with the cross validation I had requested in my trainControl.
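For anyone debugging the same thing, the behaviour can be probed outside caret as well. As far as I can tell, classProbs = TRUE makes caret request kernlab's prob.model = TRUE for svmRadial, which fits an additional internal cross-validated probability model. A minimal sketch (reusing the objects from the question; parameter values are only illustrative):
library(kernlab)
# Sketch: fit kernlab directly with the probability model enabled; the internal
# cross-validation used to fit that probability model is a plausible source of
# the "line search fails" warnings.
fit <- ksvm(x = as.matrix(modelTrain[c(2:4)]), y = modelTrain$Emotion,
            kernel = "rbfdot", C = 1, prob.model = TRUE)
predict(fit, as.matrix(modelTest[c(2:4)]), type = "probabilities")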

Your problem is part of the peculiarities of the caret package.
There are two potential reasons why your prediction fails with kernlab svm methods called by caret:
The x, y interface returns a caret::train object which the predict function cannot use.
Solution: Simply replace by the formula interface.
train(form = Emotion ~ . , data = modelTrain, ...
The iterative search within the support vector machine algorithm doesn't converge.
Solution 2a) Set different seeds before the train() call until it converges.
set.seed(xxx)
train(form = Emotion ~ . , data = modelTrain, ...)
Solution 2b)
Decrease the parameter minstep, as suggested by @catastrophic-failure here. For this solution there is no corresponding argument in the ksvm function, so you need to change the source code at line 2982, minstep <- 1e-10, to a lower value and then compile the source code yourself. No guarantee it will help, though.
Try out Solution 1 first, as it is the most likely!
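Putting Solution 1 together with the settings from the question, a sketch of the full call might look like this (assuming modelTrain holds Emotion plus the three predictor columns; subset the data frame first if it contains extra columns):
# Sketch only: same trainControl object and tuning settings as in the question.
set.seed(123)                                  # any fixed seed; see Solution 2a
svmRadial_model <- train(form = Emotion ~ .,
                         data = modelTrain,
                         method = "svmRadial",
                         trControl = control,
                         tuneLength = 3)
svmRadial_Predict <- predict(svmRadial_model, newdata = modelTest)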

My solution to this was just to leave out the classProbs = TRUE parameter of the trainControl function. Once I did that, everything worked. I'd guess it's related to what's happening with cross validation under the hood but I'm not certain of that.
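For reference, that just means dropping the flag from the trainControl call in the question:
# Same control object as before, minus classProbs (s is the seeds list from the
# question). Class predictions and the confusion matrix still work; you only lose
# predict(..., type = "prob").
control = trainControl(seeds = s, method = "repeatedcv", number = 10,
                       repeats = 3, savePredictions = TRUE)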

Related

caret with NAs and Factors

I feel like I'm stuck in a bit of a circular error here.
I have some columns with NA (still trialling whether to impute or omit) and a few categorical/factor columns too.
If I use the formula method I can run my model, but then I get issues when trying to predict, as the factors are dummified.
train(sales~.,
data=df,
method="glmnet",
preProcess=c('center', 'scale', 'zv'),
trControl=trainControl(method="repeatedcv", number=5, repeats=2),
na.action = na.omit)
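For context, the dummification comes from the formula interface running the predictors through model.matrix(), whereas model.frame() keeps factors as single columns; that difference is why the x/y interface avoids the renamed-column problem. A quick illustration (using the df from the question; the dummy column names depend on your factor levels):
head(model.matrix(sales ~ ., data = df))   # factors expanded into dummy columns
head(model.frame(sales ~ ., data = df))    # factors kept as single columns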
This answer suggests using the non-formula method:
https://stackoverflow.com/a/30169022/10291291
train(
x = model.frame(formula( sales~.), df)[,-1],
y = model.frame(formula( sales~.), df)[,1],
method="glmnet",
preProcess=c('center', 'scale', 'zv'),
trControl=trainControl(method="repeatedcv", number=5, repeats=2),
na.action = na.omit)
However, when I try that I get issues with the NAs, and this post suggests going back to formulas:
https://stackoverflow.com/a/48230658/10291291
For reference, I'll likely be sticking with xgboost and glmnet.
So I'm a little lost, but I can't imagine this is that irregular, so I'm hoping I've perhaps missed something obvious.

bartMachine in caret train error : incorrect number of dimensions

I encounter a strange problem when trying to train a model in R using caret :
> bart <- train(x = cor_data, y = factor(outcome), method = "bartMachine")
Error in tuneGrid[!duplicated(tuneGrid), , drop = FALSE] :
nombre de dimensions incorrect
(i.e., "incorrect number of dimensions")
However, when using rf, xgbTree, glmnet, or svmRadial instead of bartMachine, no error is raised.
Moreover, dim(cor_data) and length(outcome) return [1] 3056 134 and [1] 3056 respectively, which indicates that there is indeed no issue with the dimensions of my dataset.
I have tried changing the tuneGrid parameter in train, which resolved the problem but caused this issue instead:
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-89-thread-1"
My dataset includes no NA, and all variables are either numerical or binary.
My goal is to extract the most important variables in the bart model. For example, for random forests I use:
rf <- train(x = cor_data, y = factor(outcome), method = "rf")
rfImp <- varImp(rf)
rf_select <- row.names(rfImp$importance[order(- rfImp$importance$Overall)[1:43], , drop = FALSE])
Thank you in advance for your help.
Since your goal is to extract the most important variables in the bart model, I will assume you are willing to bypass the caret wrapper and use the bartMachine package directly, which is the only way I could successfully run it.
For my system, solving the memory issue required two further things:
Restart R and, before loading anything, allocate 8 GB of memory like so:
options(java.parameters = "-Xmx8g")
When running bartMachineCV, turn off mem_cache_for_speed:
library(bartMachine)
set_bart_machine_num_cores(16)
bart <- bartMachineCV(X = cor_data, y = factor(outcome), mem_cache_for_speed = F)
This will iterate through 3 values of k (2, 3 and 5) and 2 values of m (50 and 200) running 5 cross-validations each time, then builds a bartMachine using the best hyperparameter combination. You may also have to reduce the number of cores depending on your system, but this took about an hour on a 20,000 observation x 12 variable training set on 16 cores. You could also reduce the number of hyperparameter combinations it tests using the k_cvs and num_tree_cvs arguments.
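For example, a sketch of a reduced search (argument names as in bartMachine; the values here are only illustrative):
# Try a single k and a single tree count instead of the full default grid,
# which cuts the number of cross-validated fits considerably.
bart_small <- bartMachineCV(X = cor_data, y = factor(outcome),
                            k_cvs = c(2), num_tree_cvs = c(50),
                            mem_cache_for_speed = FALSE)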
Then to get the variable importance:
vi <- investigate_var_importance(bart, num_replicates_for_avg = 20)
print(vi)
You can also use it as a predictive model with predict(bart, new_data = new), similar to the object normally returned by caret::train(). This worked on R 4.0.5, bartMachine_1.2.6 and rJava_1.0-4.

How to retrieve elastic net coefficients?

I am using the caret package to train an elastic net model on my dataset modDat. I take a grid search approach paired with repeated cross validation to select the optimal values of the lambda and fraction parameters required by the elastic net function. My code is shown below.
library(caret)
library(elasticnet)
grid <- expand.grid(
lambda = seq(0.5, 0.7, by=0.1),
fraction = seq(0, 1, by=0.1)
)
ctrl <- trainControl(
method = 'repeatedcv',
number = 5, #folds
repeats = 10, #repeats
classProbs = FALSE
)
set.seed(1)
enetTune <- train(
y ~ .,
data = modDat,
method = 'enet',
metric = 'RMSE',
tuneGrid = grid,
verbose = FALSE,
trControl = ctrl
)
I can get predictions using y_hat <- predict(enetTune, modDat), but I cannot view the coefficients underlying the predictions.
I have tried coef(enetTune$finalModel) but the only thing returned is NULL. I am suspecting that I have to give the coef() function more information but not sure how to do this.
In addition, I would like to produce a box plot of the 50 sets of coefficients (10 repeats of 5 folds) associated with the optimal lambda and fraction parameters.
To see the coefficients, use predict:
predict(enetTune$finalModel, type = "coefficients")
See ?predict.enet for more information on how to get specific coefficients.
Following on from the answer by @Weihuang Wong, you can get the coefficients from the final model using the following code:
predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
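If you only want the variables the model actually keeps, a small follow-on sketch (same objects as above) is to drop the zero entries:
# Coefficients at the tuned fraction, then filter out the zeros.
cf <- predict.enet(enetTune$finalModel,
                   s = enetTune$bestTune[1, "fraction"],
                   type = "coef", mode = "fraction")$coefficients
cf[cf != 0]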
To me what works best is stats::predict, as in @Weihuang Wong's answer. However, as the OP pointed out in a comment, that provides a list of coefficients for every value of lambda tested.
The important thing to understand here is that when you are using predict, your intention is precisely to predict the value of the parameters, and not really to retrieve them. You should be aware of that and explore the options available.
In this case, you could use the same function with the argument s for the penalty parameter lambda. Remember that you are still predicting, but this time you will get the coefficients you are looking for.
stats::predict(enetTune$finalModel, type = "coefficients", s = enetTune$bestTune$lambda)

Issues with predict function when building a CART model via CrossValidation using the train command

I am trying to build a CART model via cross validation using the train function of the "caret" package.
My data is a 4500 x 110 data frame, where all the predictor variables (except the first two, UserId and YOB (Year of Birth), which I am not using for model building) are factors with 2 levels, except the dependent variable, which is of type integer (although it has only two values, 1 and 0). Gender is one of the independent variables.
When I ran the rpart command to get a CART model (using the package "rpart"), I didn't have any problem with the predict function. However, I wanted to improve the model via cross validation, and so used the train function from the package "caret" with the following command:
tr = train(y ~ ., data = subImpTrain, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
This built the model with the following warning:
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
But it did give me a final model (best.tree). However, when I try to run the predict function using the following command:
best.tree.pred = predict(best.tree, newdata = subImpTest)
on the test data, it is giving me the following error:
Error in eval(expr, envir, enclos) : object 'GenderMale' not found
The Gender variable has two values: Female, Male
Can anybody help me understand the error?
As @lorelai suggested, caret dummy-codes your variables if you supply it a formula. An alternative is to provide it the variables themselves, like so:
tr = train(y = subImpTrain$y, x = subImpTrain[, names(subImpTrain) != "y"],
method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
More importantly, however, you shouldn't use predict.rpart and instead use predict.train, like so:
predict(tr, subImpTest)
In which case it would work just fine with the formula interface.
I have had a similar problem in the past, although concerning another algorithm.
Basically, some algorithms transform the factor variables into dummy variables and rename them accordingly.
My solution was to create my own dummies and leave them in numerical format.
I read that decision trees manage to work properly even so.
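A sketch of that "build your own dummies" approach using caret's dummyVars() (object names borrowed from the question, purely illustrative):
library(caret)
# Build the dummy coding once, then apply it to both training and test data so
# the column names match at prediction time.
dv <- dummyVars(y ~ ., data = subImpTrain)
train_x <- data.frame(predict(dv, newdata = subImpTrain))
test_x  <- data.frame(predict(dv, newdata = subImpTest))
# tr.control and cp.grid are the objects from the question
tr <- train(x = train_x, y = subImpTrain$y,
            method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
predict(tr, test_x)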

Using r and weka. How can I use meta-algorithms along with nfold evaluation method?

Here is an example of my problem:
library(RWeka)
iris <- read.arff("iris.arff")
Perform n-fold cross validation to obtain the proper accuracy of the classifier:
m<-J48(class~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(e)
The results provided here are obtained by building the model with one part of the dataset and testing it with another part, and therefore give a realistic estimate of precision.
Now I perform AdaBoost to optimize the parameters of the classifier:
m2 <- AdaBoostM1(class ~ ., data = iris, control = Weka_control(W = list(J48, M = 30)))
summary(m2)
The results provided here are obtained by using the same dataset both to build the model and to evaluate it, so the accuracy is not representative of real-life precision, where the model is evaluated on instances it has not seen. Nevertheless, this procedure is helpful for optimizing the model that is built.
The main problem is that I cannot optimize the model and, at the same time, test it with data that was not used to build it, or simply use an n-fold validation method to obtain the proper accuracy.
I guess you misinterpret the function of evaluate_Weka_classifier. In both cases, evaluate_Weka_classifier only does the cross-validation based on the training data; it doesn't change the model itself. Compare the confusion matrices of the following code:
m<-J48(Species~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(m)
e
m2 <- AdaBoostM1(Species ~. , data = iris ,
control = Weka_control(W = list(J48, M = 30)))
e2 <- evaluate_Weka_classifier(m2,numFolds = 5)
summary(m2)
e2
In both cases, the summary gives you the evaluation based on the training data, while the function evaluate_Weka_classifier() gives you the correct cross-validation. Neither for J48 nor for AdaBoostM1 does the model itself get updated based on the cross-validation.
Now regarding the AdaBoost algorithm itself: in fact, it does use some kind of "weighted cross-validation" to come to the final classifier. Wrongly classified items are given more weight in the next building step, but the evaluation is done using equal weights for all observations. So using cross-validation to optimize the result doesn't really fit the general idea behind the adaptive boosting algorithm.
If you want a true cross-validation using a training set and an evaluation set, you could do the following:
id <- sample(1:length(iris$Species),length(iris$Species)*0.5)
m3 <- AdaBoostM1(Species ~. , data = iris[id,] ,
control = Weka_control(W = list(J48, M=5)))
e3 <- evaluate_Weka_classifier(m3,numFolds = 5)
# true crossvalidation
e4 <- evaluate_Weka_classifier(m3,newdata=iris[-id,])
summary(m3)
e3
e4
If you want a model that gets updated based on a cross-validation, you'll have to go to a different algorithm, e.g. randomForest() from the randomForest package. That collects a set of optimal trees based on cross-validation. It can be used in combination with the RWeka package as well.
Edit: corrected the code for a true cross-validation. Using the subset argument has an effect in evaluate_Weka_classifier() as well.
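If you do switch to randomForest as mentioned above, a minimal sketch reusing the train/test split from this answer (ntree is just an illustrative value):
library(randomForest)
# Fit on the training half and evaluate on the held-out half; id is the index
# vector defined earlier in this answer.
rf <- randomForest(Species ~ ., data = iris[id, ], ntree = 500)
table(predict(rf, newdata = iris[-id, ]), iris$Species[-id])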
