R predicts less instances that the dataset - r

I am creating a predictive model in R with "caret" and I don't understand how the function "predict" works.
I have a dataset testing with 222 instances, but when I execute the next command:
j48Probs <- predict(j48Model3x10cv, newdata = testing, type = "prob")
j48probs have 178 elements, and when I try to get the confusion matrix, I get the next error:
j48Classes <- predict(j48Model3x10cv, newdata = testing, type = "raw")
confusionMatrix(data=j48Classes, testing$Survived)
Error in table(data, reference, dnn = dnn, ...) :
all arguments must have the same length
What can be happening?
Thank you so much!

There have to be missing values in your testing set. One of the options in predict is na.action and this is standard set to na.omit. Which means any record containing missing data in one of the predictors will be ignored.
Take your testing data set and do nrow(na.omit(testing). It will show you how many rows will be used for prediction. In your case this probably returns 177 records.
And this issue rolls over into the confusionmatrix, you are trying to compare 177 predictions against 222 labels.
You could set na.action = NULL, but this will return NA predictions. That might or might not make sense. In some cases not being able to predict a classification is also information. You could also try imputing the missing data.

If the problem is that in the predictions there are many instances omited because of the missing values, I consider two solutions:
1.- not to omit them.
2.- omit them in the testing set too.
1.- I solved that this way:
j48Probs <- predict(j48Model3x10cv, newdata = testing, type = "prob", na.action = na.pass)
j48Classes <- predict(j48Model3x10cv, newdata = testing, type = "raw", na.action = na.pass)
confusionMatrix(data=j48Classes, testing$Survived)
2.- I solved this way:
testing <- (na.omit(testing)

Related

Errors while performing caret tuning in R

I am building a predictive model with caret/R and I am running into the following problems:
When trying to execute the training/tuning, I get this error:
Error in if (tmps < .Machine$double.eps^0.5) 0 else tmpm/tmps :
missing value where TRUE/FALSE needed
After some research it appears that this error occurs when there missing values in the data, which is not the case in this example (I confirmed that the data set has no NAs). However, I also read somewhere that the missing values may be introduced during the re-sampling routine in caret, which I suspect is what's happening.
In an attempt to solve problem 1, I tried "pre-processing" the data during the re-sampling in caret by removing zero-variance and near-zero-variance predictors, and automatically inputting missing values using a carets knn automatic imputing method preProcess(c('zv','nzv','knnImpute')), , but now I get the following error:
Error: Matrices or data frames are required for preprocessing
Needless to say I checked and confirmed that the input data set are indeed matrices, so I dont understand why I get this second error.
The code follows:
x.train <- predict(dummyVars(class ~ ., data = train.transformed),train.transformed)
y.train <- as.matrix(select(train.transformed,class))
vbmp.grid <- expand.grid(estimateTheta = c(TRUE,FALSE))
adaptive_trctrl <- trainControl(method = 'adaptive_cv',
number = 10,
repeats = 3,
search = 'random',
adaptive = list(min = 5, alpha = 0.05,
method = "gls", complete = TRUE),
allowParallel = TRUE)
fit.vbmp.01 <- train(
x = (x.train),
y = (y.train),
method = 'vbmpRadial',
trControl = adaptive_trctrl,
preProcess(c('zv','nzv','knnImpute')),
tuneGrid = vbmp.grid)
The only difference between the code for problem (1) and (2) is that in (1), the pre-processing line in the train statement is commented out.
In summary,
-There are no missing values in the data
-Both x.train and y.train are definitely matrices
-I tried using a standard 'repeatedcv' method in instead of 'adaptive_cv' in trainControl with the same exact outcome
-Forgot to mention that the outcome class has 3 levels
Anyone has any suggestions as to what may be going wrong?
As always, thanks in advance
reyemarr
I had the same problem with my data, after some digging i found that I had some Inf (infinite) values in one of the columns.
After taking them out (df <- df %>% filter(!is.infinite(variable))) the computation ran without error.

how to debug errors like: "dim(x) must have a positive length" with caret

I'm running a predict over a fit similar to what is found in the caret guide:
Caret Measuring Performance
predictions <- predict(caretfit, testing, type = "prob")
But I get the error:
Error in apply(x, 1, paste, collapse = ",") :
dim(X) must have a positive length
I would like to know 1) the general way to diagnose these errors that are the result of bad inputs into functions like this or 2) why my code is failing.
1)
So looking at the error It's something to do with 'X'. Which argument is x? Obviously the first one in 'apply', but which argument in predict is eventually passed to apply? Looking at traceback():
10: stop("dim(X) must have a positive length")
9: apply(x, 1, paste, collapse = ",")
8: paste(apply(x, 1, paste, collapse = ","), collapse = "\n")
7: makeDataFile(x = newdata, y = NULL)
6: predict.C5.0(modelFit, newdata, type = "prob")
5: predict(modelFit, newdata, type = "prob") at C5.0.R#59
4: method$prob(modelFit = modelFit, newdata = newdata, submodels = param)
3: probFunction(method = object$modelInfo, modelFit = object$finalModel,
newdata = newdata, preProc = object$preProcess)
2: predict.train(caretfit, testing, type = "prob")
1: predict(caretfit, testing, type = "prob")
Now, this problem would be easy to solve if I could follow the code through and understand the problem as opposed to these general errors. I can trace the code using this traceback to the code at C5.0.R#59. (It looks like there's no way to get line numbers on every trace?) I can follow this code as far as this line 59 and then (I think) the predict function on line 44:
Github Caret C5.0 source
But after this I'm not sure where the logic flows. I don't see 'makeDataFile' anywhere in the caret source or, if it's in another package, how it got there. I've also tried Rstudio debugging, debug() and browser(). None provide the stacktrace I would expect from other languages. Any suggestion on how to follow the code when you don't know what an error msg means?
2) As for my particular inputs, 'caretfit' is simply the result of a caret fit and the testing data is 3million rows by 59 columns:
fitcontrol <- trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = TRUE,
summaryFunction = custom.summary,
allowParallel = TRUE)
fml <- as.formula(paste("OUTVAR ~",paste(colnames(training[,1:(ncol(training)-2)]),collapse="+")))
caretfit <- train(fml,
data = training[1:200000,],
method = "C5.0",
trControl = fitcontrol,
verbose = FALSE,
na.action = na.pass)
1 Debuging Procedure
You can pinpoint the problem using a couple of functions.
Although there still doesn't seem to be anyway to get a full stacktrace with line numbers in code (Boo!), you can use the functions you do get from the traceback and use the function getAnywhere() to search for the function you are looking for. So for example, you can do:
getAnywhere(makeDataFile)
to see the location and source. (Which also works great in windows when the libraries are often bundled up in binaries.) Then you have to use source or github to find the specific line numbers or to trace through the logic of the code.
In my particular problem if I run:
newdata <- testing
caseString <- C50:::makeDataFile(x = newdata, y = NULL)
(Note the three ":".) I can see that this step completes at this level, So it appears as if something is happening to my training dataset along the way.
So using gitAnywhere() and github over and over through my traceback I can find the line number manually (Boo!)
in caret/R/predict.train.R, predict.train (defined on line 108)
calls probFunction on line 153
in caret/R/probFunction, probFunction
(defined on line 3) calls method$prob function which is a stored
function in the fit object caretfit$modelInfo$prob which can be
inspected by entering this into the console. This is the same
function found in caret/models/files/C5.0.R on line 58 which calls
'predict' on line 59
something in caret knows to use
C50/R/predict.C5.0.R which you can see by searching with
getAnywhere()
this function runs makeDataFile on line 25 (part of
the C50 package)
which calls paste, which calls apply, which dies
with stop
2 Particular Problem with caret's predict
As for my problem, I kept inspecting the code, and adding inputs at different levels and it would complete successfully. What happens is that some modification happens to my dataset in predict.train.R which causes it to fail. Well it turns out that I wasn't including my 'na.action' argument, which for my tree-based data, used 'na.pass'. If I include this argument:
prediction <- predict(caretfit, testing, type = "prob", na.action = na.pass)
it works as expected. line 126 of predict.train makes use of this argument to decide whether to include non-complete cases in the prediction. My data has no complete cases and so it failed complaining of needing a matrix of some positive length.
Now how one would be able to know the answer to this apply error is due to a missing na.action argument is not obvious at all, hence the need for a good debugging procedure. If anyone knows of other ways to debug (keeping in mind that in windows, stepping through library source in Rstudio doesnt work very well), please answer or comment.

Predict function from Caret package give an Error

I am doing just a regular logistic regression using the caret package in R. I have a binomial response variable coded 1 or 0 that is called a SALES_FLAG and 140 numeric response variables that I used dummyVars function in R to transform to dummy variables.
data <- dummyVars(~., data = data_2, fullRank=TRUE,sep="_",levelsOnly = FALSE )
dummies<-(predict(data, data_2))
model_data<- as.data.frame(dummies)
This gives me a data frame to work with. All of the variables are numeric. Next I split into training and testing:
trainIndex <- createDataPartition(model_data$SALE_FLAG, p = .80,list = FALSE)
train <- model_data[ trainIndex,]
test <- model_data[-trainIndex,]
Time to train my model using the train function:
model <- train(SALE_FLAG~. data=train,method = "glm")
Everything runs nice and I get a model. But when I run the predict function it does not give me what I need:
predict(model, newdata =test,type="prob")
and I get an ERROR:
Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
On the other hand when I replace "prob" with "raw" for type inside of the predict function I get prediction but I need probabilities so I can code them into binary variable given my threshold.
Not sure why this happens. I did the same thing without using the caret package and it worked how it should:
model2 <- glm(SALE_FLAG ~ ., family = binomial(logit), data = train)
predict(model2, newdata =test, type="response")
I spend some time looking at this but not sure what is going on and it seems very weird to me. I have tried many variations of the train function meaning I didn't use the formula and used X and Y. I used method = 'bayesglm' as well to check and id gave me the same error. I hope someone can help me out. I don't need to use it since the train function to get what I need but caret package is a good package with lots of tools and I would like to be able to figure this out.
Show us str(train) and str(test). I suspect the outcome variable is numeric, which makes train think that you are doing regression. That should also be apparent from printing model. Make it a factor if you want to do classification.
Max

Issues with predict function when building a CART model via CrossValidation using the train command

I am trying to build a CART model via cross validation using the train function of "caret" package.
My data is 4500 x 110 data frame, where all the predictor variables (except the first two, UserId and YOB (Year of Birth) which I am not using for model building) are factors with 2 levels except the dependent variable which is of type integer (although has only two values 1 and 0). Gender is one of the independent variables.
When I ran rpart command to get CART model (using the package "rpart"), i didn't have any problem with the predict function. However, I wanted to improve the model via cross validation, and so used the train function from the package "caret" with the following command:
tr = train(y ~ ., data = subImpTrain, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
This build the model with the following warning
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
But it did give me a final model (best.tree). However, when I am trying to run the predict function using the following command:
best.tree.pred = predict(best.tree, newdata = subImpTest)
on the test data, it is giving me the following error:
Error in eval(expr, envir, enclos) : object 'GenderMale' not found
The Gender variable has two values: Female, Male
Can anybody help me understand the error
As #lorelai suggested, caret dummy-codes your variables if you supply it a formula. An alternative is to provide it the variables themselves, like so:
tr = train(y = subImpTrain$y, x = subImpTrain[, -subImpTrain$y],
method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
More importantly, however, you shouldn't use predict.rpart and instead use predict.train, like so:
predict(tr, subImpTest)
In which case it would work just fine with the formula interface.
I have had a similar problem in the past, although concerning another algorithm.
Basically, some algorithms transform the factor variables into dummy variables and rename them accordingly.
My solution was to create my own dummies and leave them in numerical format.
I read that decision trees manage to work properly even so.

Evaluating weka classifier J48 with missing values in test set, R RWeka

I have an error when evaluating a simple test set with evaluate_Weka_classifier. Trying to learn how the interface works from R to Weka with RWeka, but I still don't get this.
library("RWeka")
iris_input <- iris[1:140,]
iris_test <- iris[-(1:140),]
iris_fit <- J48(Species ~ ., data = iris_input)
evaluate_Weka_classifier(iris_fit, newdata = iris_test, numFolds=5)
No problems here, as we would assume (It is ofcourse a stupit test, no random holdout data etc). But now I want to simulate missing data (alot). So i set Petal.Width as missing:
iris_test$Petal.Width <- NA
evaluate_Weka_classifier(iris_fit, newdata = iris_test, numFolds=5)
Which gives the error:
Error in .jcall(evaluation, "S", "toSummaryString", complexity) :
java.lang.IllegalArgumentException: Can't have more folds than instances!
Edit: This error should tell me that I have not enough instances, but I have 10
Edit: If I use write.arff, it can be exported and read in by Weka. Change Petal.Width {} into Petal.Width numeric to make the two files exactly the same. Then it works in Weka.
Is this a thinking error? When reading Machine Learning, Practical machine learning tools and techniques it seems to be legit. Maybe I just have to tell RWeka that I want to use fractions when a split uses a missing variable?
Thnx!
The issue is that you need to tell J48() what to do with missing values.
library(RWeka)
?J48()
#pertinent output
J48(formula, data, subset, na.action,
control = Weka_control(), options = NULL)
na.action tells R what to do with missing values. When following up on na.action you will find that "The ‘factory-fresh’ default is na.omit". Under this setting of course there are not enough instances!
Instead of leaving na.action as the default omit, I have changed it as follows,
iris_fit<-J48(Species~., data = iris_input, na.action=NULL)
and it works like a charm!

Resources