how to debug errors like: "dim(x) must have a positive length" with caret - r

I'm running a predict over a fit similar to what is found in the caret guide:
Caret Measuring Performance
predictions <- predict(caretfit, testing, type = "prob")
But I get the error:
Error in apply(x, 1, paste, collapse = ",") :
dim(X) must have a positive length
I would like to know 1) the general way to diagnose these errors that are the result of bad inputs into functions like this or 2) why my code is failing.
1)
So looking at the error, it's something to do with 'X'. Which argument is X? Obviously the first one in 'apply', but which argument of predict is eventually passed to apply? Looking at traceback():
10: stop("dim(X) must have a positive length")
9: apply(x, 1, paste, collapse = ",")
8: paste(apply(x, 1, paste, collapse = ","), collapse = "\n")
7: makeDataFile(x = newdata, y = NULL)
6: predict.C5.0(modelFit, newdata, type = "prob")
5: predict(modelFit, newdata, type = "prob") at C5.0.R#59
4: method$prob(modelFit = modelFit, newdata = newdata, submodels = param)
3: probFunction(method = object$modelInfo, modelFit = object$finalModel,
newdata = newdata, preProc = object$preProcess)
2: predict.train(caretfit, testing, type = "prob")
1: predict(caretfit, testing, type = "prob")
Now, this problem would be easy to solve if I could follow the code through and understand it, as opposed to staring at these generic errors. Using the traceback I can trace the code to C5.0.R#59. (It looks like there's no way to get line numbers on every frame?) I can follow this code as far as that line 59 and then (I think) into the predict function on line 44:
Github Caret C5.0 source
But after this I'm not sure where the logic flows. I don't see 'makeDataFile' anywhere in the caret source, or, if it's in another package, how it got there. I've also tried RStudio debugging, debug() and browser(). None provide the stack trace I would expect from other languages. Any suggestions on how to follow the code when you don't know what an error message means?
2) As for my particular inputs, 'caretfit' is simply the result of a caret fit and the testing data is 3 million rows by 59 columns:
fitcontrol <- trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = TRUE,
summaryFunction = custom.summary,
allowParallel = TRUE)
fml <- as.formula(paste("OUTVAR ~",paste(colnames(training[,1:(ncol(training)-2)]),collapse="+")))
caretfit <- train(fml,
data = training[1:200000,],
method = "C5.0",
trControl = fitcontrol,
verbose = FALSE,
na.action = na.pass)

1 Debugging Procedure
You can pinpoint the problem using a couple of functions.
Although there still doesn't seem to be any way to get a full stack trace with line numbers (Boo!), you can take the function names you do get from the traceback and use getAnywhere() to search for the one you are looking for. So, for example, you can do:
getAnywhere(makeDataFile)
to see the location and source. (This also works great on Windows, where the libraries are often bundled up as binaries.) Then you have to use the source or GitHub to find the specific line numbers or to trace through the logic of the code.
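For example, once getAnywhere() has told you the function lives in the C50 namespace, you can view and step through it directly (a minimal sketch using the names from my traceback):
C50:::makeDataFile                          # print the body of the unexported function
debugonce(C50:::makeDataFile)               # open a browser inside it on its next call
predict(caretfit, testing, type = "prob")   # re-run the failing call and step through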
In my particular problem if I run:
newdata <- testing
caseString <- C50:::makeDataFile(x = newdata, y = NULL)
(Note the three ":".) I can see that this step completes when called directly, so it appears as if something is happening to my data somewhere along the way.
So by using getAnywhere() and GitHub over and over through my traceback, I can find the line numbers manually (Boo!):
1. In caret/R/predict.train.R, predict.train (defined on line 108) calls probFunction on line 153.
2. In caret/R/probFunction, probFunction (defined on line 3) calls the method$prob function, which is a function stored in the fit object, caretfit$modelInfo$prob, and can be inspected by entering that into the console (see the snippet after this list). This is the same function found in caret/models/files/C5.0.R on line 58, which calls 'predict' on line 59.
3. Something in caret knows to use C50/R/predict.C5.0.R, which you can see by searching with getAnywhere().
4. That function runs makeDataFile on line 25 (part of the C50 package).
5. makeDataFile calls paste, which calls apply, which dies with stop.
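For reference, the snippet mentioned above for inspecting the stored function and the C50 method it dispatches to:
caretfit$modelInfo$prob      # the prob function caret stored with the fit
getAnywhere(predict.C5.0)    # locate the C50 predict method it eventually calls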
2 Particular Problem with caret's predict
As for my problem, I kept inspecting the code and feeding my inputs in at different levels, and each step would complete successfully on its own. What happens is that some modification to my dataset inside predict.train.R causes the failure. It turns out that I wasn't including the 'na.action' argument, which I had set to 'na.pass' when training my tree-based model. If I include this argument:
prediction <- predict(caretfit, testing, type = "prob", na.action = na.pass)
it works as expected. Line 126 of predict.train uses this argument to decide whether to include non-complete cases in the prediction. My data has no complete cases, so everything was dropped and the call failed complaining about needing a matrix with a positive length.
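A quick way to confirm this (complete.cases() is base R, nothing caret-specific):
sum(complete.cases(testing))   # 0 for my data: every row has at least one NA
nrow(testing)                  # ~3 million rows, all dropped by the default na.omit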
Now, how one would know that this apply error is due to a missing na.action argument is not obvious at all, hence the need for a good debugging procedure. If anyone knows of other ways to debug (keeping in mind that on Windows, stepping through library source in RStudio doesn't work very well), please answer or comment.
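For what it's worth, one general-purpose technique that would also have helped here is setting the error option to recover, which drops you into a browser at whichever frame of the failing call you pick:
options(error = recover)                                   # on error, choose a frame to browse
predictions <- predict(caretfit, testing, type = "prob")
# pick the frame for makeDataFile / apply, then inspect with str(newdata), dim(x), etc.
options(error = NULL)                                      # restore the default afterwards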

Related

Prediction Warnings when predicting a fitted caret SVM model

I'm hoping to get some pointers as to why I'm getting:
Warning message: In method$predict(modelFit = modelFit, newdata =
newdata, submodels = param) : kernlab class prediction calculations
failed; returning NAs
When I print out the prediction:
svmRadial_Predict
[1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>....
The code I wrote to perform the SVM fitting:
#10-fold cross validation in 3 repetitions
control = trainControl(seeds = s, method="repeatedcv", number=10,
repeats=3, savePredictions = TRUE, classProbs = TRUE)
The SVM model for the fitting is like this:
svmRadial_model = train(y=modelTrain$Emotion,
x=modelTrain[c(2:4)],
method ='svmRadial',
trControl = control,
data=modelTrain,
tuneLength = 3
)
And the code I wrote to perform the prediction looks like this:
svmRadial_Predict <- predict(svmRadial_model,
newdata = modelTest[c(2:4)], probability = TRUE )
I've checked the data, and there are no NA values in the training or testing set. The y value is a factor and the x values are numeric, if that makes a difference. Any tips to debug this would be very much appreciated!
As the model trains I can see warnings like this:
line search fails -1.407938 -0.1710936 2.039448e-05
which I had assumed was just the model being unable to fit a hyperplane for particular observations in the data. I'm using the svmRadial kernel.
The data I'm trying to fit was already centred and scaled using the R scale() function.
Further work leads me to believe it's something to do with the
classProbs = TRUE flag. If I leave it out, no warnings are printed.
I've kicked off another run of my code; SVM seems to take ages to complete on my laptop for this task, but I'll report the results as soon as it finishes.
As a final edit, the model fitting completed without error, and I can use that model just fine for prediction, calculating the confusion matrix, etc. I don't understand why including classProbs = TRUE breaks it, but maybe it's related to the combination of the cross-validation that it does with the cross-validation I had requested in my trainControl.
Your problem is down to some peculiarities of the caret package.
There are two potential reasons why your prediction fails with kernlab svm methods called by caret:
1. The x, y interface returns a caret::train object which the predict function cannot use.
Solution 1: Simply replace it with the formula interface.
train(form = Emotion ~ . , data = modelTrain, ...
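For example, something along these lines (a sketch only, reusing the control object and tuneLength from the question and assuming Emotion plus columns 2:4 are all that modelTrain contains):
svmRadial_model <- train(Emotion ~ .,
                         data = modelTrain,
                         method = "svmRadial",
                         trControl = control,
                         tuneLength = 3)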
2. The iterative search within the support vector machine algorithm doesn't converge.
Solution 2a) Set different seeds before the train() call until it converges.
set.seed(xxx)
train(form = Emotion ~ . , data = modelTrain, ...)
Solution 2b) Decrease the minstep parameter, as suggested by #catastrophic-failure here. There is no such parameter exposed in the ksvm function, so you need to change the source code at line 2982 (minstep <- 1e-10) to a lower value and then compile the package yourself. No guarantee it will help, though.
Try out Solution 1 first, as it is the most likely!
My solution to this was just to leave out the classProbs = TRUE parameter of the trainControl function. Once I did that, everything worked. I'd guess it's related to what's happening with cross validation under the hood but I'm not certain of that.
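In other words, the same control object with that one flag dropped (a sketch; the seeds object s is as in the question):
control = trainControl(seeds = s, method = "repeatedcv", number = 10,
                       repeats = 3, savePredictions = TRUE)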

Predict() with regsubsets

I'm trying to replicate the results from An Introduction to Statistical Learning with Applications in R. Specifically, the Lab in section 6.5.3. I have followed the code in the lab exactly:
library("ISLR")
library("leaps")
set.seed(1)
train = sample(c(TRUE, FALSE), nrow(Hitters), rep = TRUE)
test = (!train)
regfit.best = regsubsets(Salary ~., data = Hitters[train,], nvmax= 19)
test.mat = model.matrix(Salary~., data = Hitters[test,])
val.errors = rep(NA, 19)
for (i in 1:19){
coefi= coef(regfit.best, id = i)
pred=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((Hitters$Salary[test]-pred)^2)
}
When I run this I still get the following error:
Warning message:
In Hitters$Salary[test] - pred :
longer object length is not a multiple of shorter object length
Error in mean((Hitters$Salary[test] - pred)^2) :
error in evaluating the argument 'x' in selecting a method for function 'mean': Error: dims [product 121] do not match the length of object [148]
And val.errors is a vector of 19 NAs.
I'm still relatively new to R and to the validation approach, so I'm not sure exactly why the dimensions of these are different.
It was actually an issue with not carrying over a step from the previous subsection, which omitted incomplete entries.
You need to remove rows with missing data. Run "Hitters = na.omit(Hitters)" at the beginning.
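So the start of the lab would look like this (a sketch; the na.omit() step is the one carried over from the earlier subsection):
library("ISLR")
library("leaps")
Hitters <- na.omit(Hitters)   # drop rows with missing values (e.g. missing Salary)
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), rep = TRUE)
test <- !train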

Errors while performing caret tuning in R

I am building a predictive model with caret/R and I am running into the following problems:
When trying to execute the training/tuning, I get this error:
Error in if (tmps < .Machine$double.eps^0.5) 0 else tmpm/tmps :
missing value where TRUE/FALSE needed
After some research it appears that this error occurs when there are missing values in the data, which is not the case in this example (I confirmed that the data set has no NAs). However, I also read somewhere that missing values may be introduced during the re-sampling routine in caret, which I suspect is what's happening.
In an attempt to solve problem 1, I tried "pre-processing" the data during the re-sampling in caret by removing zero-variance and near-zero-variance predictors and automatically imputing missing values using caret's knn imputation method, preProcess(c('zv','nzv','knnImpute')), but now I get the following error:
Error: Matrices or data frames are required for preprocessing
Needless to say, I checked and confirmed that the input data sets are indeed matrices, so I don't understand why I get this second error.
The code follows:
x.train <- predict(dummyVars(class ~ ., data = train.transformed),train.transformed)
y.train <- as.matrix(select(train.transformed,class))
vbmp.grid <- expand.grid(estimateTheta = c(TRUE,FALSE))
adaptive_trctrl <- trainControl(method = 'adaptive_cv',
number = 10,
repeats = 3,
search = 'random',
adaptive = list(min = 5, alpha = 0.05,
method = "gls", complete = TRUE),
allowParallel = TRUE)
fit.vbmp.01 <- train(
x = (x.train),
y = (y.train),
method = 'vbmpRadial',
trControl = adaptive_trctrl,
preProcess(c('zv','nzv','knnImpute')),
tuneGrid = vbmp.grid)
The only difference between the code for problem (1) and (2) is that in (1), the pre-processing line in the train statement is commented out.
In summary,
- There are no missing values in the data
- Both x.train and y.train are definitely matrices
- I tried using a standard 'repeatedcv' method instead of 'adaptive_cv' in trainControl, with the same exact outcome
- Forgot to mention that the outcome class has 3 levels
Anyone has any suggestions as to what may be going wrong?
As always, thanks in advance
I had the same problem with my data. After some digging I found that I had some Inf (infinite) values in one of the columns.
After taking them out (df <- df %>% filter(!is.infinite(variable))) the computation ran without error.
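To check every column at once rather than one variable at a time, something like this works on a numeric matrix such as x.train (a base-R sketch):
colSums(!is.finite(x.train))   # per-column counts of Inf, -Inf, NaN and NA; all should be 0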

H2O: Deep learning object not found in function 'predict' for argument 'model'

I'm just testing out h2o, in particular its deep learning capabilities, since I've heard great things about it. So far I've been using the following code:
library(h2o)
library(caret)
data("iris")
# Initiate H2O --------------------
h2o.removeAll() # Clean up. Just in case H2O was already running
h2o.init(nthreads = -1, max_mem_size="22G") # Start an H2O cluster with all threads available
# Get training and tournament data -------------------
a <- createDataPartition(iris$Species, list=FALSE)
training <- iris[a,]
test <- iris[-a,]
# Convert target to factor -------------------
target <- as.factor(iris$Species)
feature_names <- names(train)[1:(ncol(train)-1)]
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
prob <- test[, "id", drop = FALSE]
model_dl <- h2o.deeplearning(x = feature_names, y = "target", training_frame = train_h2o, stopping_metric = "logloss")
h2o.logloss(model_dl)
pred_dl <- predict(model_dl, newdata = tourn_h2o)
prob <- cbind(prob, as.data.frame(pred_dl$p1, col.names = "dl"))
write.table(prob[, c("id", "dl")], paste0(model_dl@model_id, ".csv"), sep = ",", row.names = FALSE, col.names = c("id", "probability"))
The relevant part is really that last line, where I got the following error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Object 'DeepLearning_model_R_1494350691427_70' not found in function: predict for argument: model
Has anyone come across this before? Are there any easy solutions to this that I might be missing? Thanks in advance.
EDIT: With the updated code I get the error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Illegal argument(s) for DeepLearning model: DeepLearning_model_R_1494428751150_1. Details: ERRR on field: _train: Training data must have at least 2 features (incl. response).
ERRR on field: _stopping_metric: Stopping metric cannot be logloss for regression.
I assume this has to do with the way the Iris dataset is being read in.
Answer to the first question: Your original error message sounds like one you can get when things get out of sync. E.g. maybe you had two sessions running at once and removed the model in one session; the other session wouldn't know its variables are now out of date. H2O allows multiple connections, but they have to be co-operative. (Flow - see the next paragraph - counts as a second session.)
Unless you can make a reproducible example, shrug, put it down to gremlins, and start a new session. Or go and look at the data/models in Flow (a web server always running on 127.0.0.1:54321), and see if something is no longer there.
For your EDIT question, your code is building a regression model, but you are trying to use logloss, which means you thought you were doing classification. This is caused by not having set the target variable to be a factor. Your current as.factor() line is on the wrong data, in the wrong place. It should go after your as.h2o() lines:
train_h2o <- as.h2o(training) #Typo fix
test_h2o <- as.h2o(test)
feature_names <- names(training)[1:(ncol(training)-1)] #typo fix
y = "Species" #The column we want to predict
train_h2o[,y] <- as.factor(train_h2o[,y])
test_h2o[,y] <- as.factor(test_h2o[,y])
And then make the model with:
model_dl <- h2o.deeplearning(x = feature_names, y = y, training_frame = train_h2o, stopping_metric = "logloss")
Get predictions with:
pred_dl <- predict(model_dl, newdata = test_h2o) #Typo fix
And compare the predictions with the correct answers using:
cbind(test[, y], as.data.frame(pred_dl$predict))
(BTW, H2O always detects the Iris data set columns as numeric vs. factor perfectly, so the above as.factor() lines are not needed; your error message must've been on your original data.)
StackOverflow advice: test your reproducible example, in full, and copy and paste in that exact code, with the exact error message that code is giving you. Your code had numerous small typos. E.g. train in places, training in others. createDataPartition() was not given; I assumed a = sample(nrow(iris), 0.8*nrow(iris)). test has no "id" column.
Other H2O advice:
Run h2o.removeAll() after h2o.init(). It gives an error message if run before, as in your code. (Personally I avoid that function - it is the kind of thing that gets left in a production script by mistake...)
Consider importing your data into h2o earlier, and using h2o.splitFrame() to split it. I.e. avoid doing things in R that H2O can easily handle.
Avoid having your data in R, at all, if you can. Prefer importFile() over as.h2o().
The thinking behind both of the last points is that H2O will scale beyond the memory of one machine, while R won't. It is also less confusing than trying to keep track of the same thing in two places.
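A sketch of that workflow (the CSV path here is hypothetical):
library(h2o)
h2o.init(nthreads = -1)
iris_h2o <- h2o.importFile("iris.csv")            # load straight into the H2O cluster
iris_h2o$Species <- as.factor(iris_h2o$Species)   # make sure the target is a factor
splits <- h2o.splitFrame(iris_h2o, ratios = 0.8, seed = 1)
train_h2o <- splits[[1]]
test_h2o  <- splits[[2]]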
I had the same issue but could resolve it quite easily.
My error occurred because I read in an h2o object before initialising the h2o cluster: I had trained an h2o model, saved it, shut down the cluster, loaded in the model, and only then initialized the cluster again.
Before reading in the h2o object, you should have already initialized the cluster (h2o.init()).
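As a sketch (the saved-model path is hypothetical):
library(h2o)
h2o.init(nthreads = -1)                                    # start the cluster first
model_dl <- h2o.loadModel("/tmp/DeepLearning_model_R_1")   # then load the saved model
pred_dl  <- predict(model_dl, newdata = test_h2o)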

R predicts fewer instances than the dataset

I am creating a predictive model in R with "caret" and I don't understand how the function "predict" works.
I have a dataset, testing, with 222 instances, but when I execute the following command:
j48Probs <- predict(j48Model3x10cv, newdata = testing, type = "prob")
j48Probs has 178 elements, and when I try to get the confusion matrix, I get the following error:
j48Classes <- predict(j48Model3x10cv, newdata = testing, type = "raw")
confusionMatrix(data=j48Classes, testing$Survived)
Error in table(data, reference, dnn = dnn, ...) :
all arguments must have the same length
What can be happening?
Thank you so much!
There have to be missing values in your testing set. One of the options in predict is na.action, and it defaults to na.omit, which means any record containing missing data in one of the predictors will be ignored.
Take your testing data set and do nrow(na.omit(testing)). It will show you how many rows will be used for prediction. In your case this probably returns 178 records.
And this issue rolls over into confusionMatrix: you are trying to compare 178 predictions against 222 labels.
You could set na.action = NULL, but this will return NA predictions. That might or might not make sense. In some cases not being able to predict a classification is also information. You could also try imputing the missing data.
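If you go the imputation route, caret's preProcess() can do it; a sketch assuming the predictors with NAs are numeric (the predictor selection and the training data frame here are assumptions, not from the question):
library(caret)
predictors <- setdiff(names(testing), "Survived")                  # hypothetical predictor set
pp <- preProcess(training[, predictors], method = "medianImpute")  # learn medians on the training data
testing_imputed <- predict(pp, testing[, predictors])
j48Classes <- predict(j48Model3x10cv, newdata = testing_imputed, type = "raw")
confusionMatrix(data = j48Classes, testing$Survived)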
If the problem is that many instances are omitted from the predictions because of missing values, I see two solutions:
1.- Do not omit them.
2.- Omit them from the testing set too.
1.- I solved it this way:
j48Probs <- predict(j48Model3x10cv, newdata = testing, type = "prob", na.action = na.pass)
j48Classes <- predict(j48Model3x10cv, newdata = testing, type = "raw", na.action = na.pass)
confusionMatrix(data=j48Classes, testing$Survived)
2.- I solved it this way:
testing <- na.omit(testing)
