Cannot extractPrediction using caret in R

I'm totally stuck on a random forest classification model, since I cannot extract predictions. And I'm really out of clues, because:
predict(forest.model1, titanic.final.test)
works like a charm, while
extractPrediction(list(forest.model1), testX=titanic.final.test[,-2], testY=titanic.final.test[,2])
which should be equivalent, gives me this error:
Error in predict.randomForest(modelFit, newdata) :
variables in the training data missing in newdata
Here's my trainControl:
forest.fitControl <- trainControl(method = "repeatedcv", repeats = 5,
                                  summaryFunction = twoClassSummary, classProbs = TRUE,
                                  returnData = TRUE, seeds = NULL,
                                  savePredictions = TRUE, returnResamp = "all")
Any ideas?

Test and train need to have the same structure (i.e. all the same columns), so my only guess is that negating the second column results in a different structure than the data used to train the model. Hard to know without seeing the structure of the training vs. test data.frames.
Edit After Looking at Code:
Recreated this from your repo... Are you sure it shouldn't be the first column you pull out for testX and use for testY? Something like:
extractPrediction(list(forest.model1), testX=titanic.final.test[,-1], testY=titanic.final.test[,1])
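If the indexes look right and the error persists, a quick way to check for a structural mismatch (a sketch; titanic.final.train is a hypothetical name for the data frame the model was trained on):
# any column name returned here exists in the training data but not in the
# test data, which is exactly what the error message complains about
setdiff(names(titanic.final.train), names(titanic.final.test))
# factor vs. character mismatches between the two frames can cause similar
# failures, so comparing structures is also worth doing
str(titanic.final.test)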

Related

Format finalModel from caret

I have an annoying problem when pulling the finalModel from the caret package. I need the final logistic regression from a repeated cross-validation run, but the list object has backticks added around model terms that need to be evaluated in the formula (e.g., `I(x^2)`). This renders the object unusable by any other package.
I have tried: 1. stating the model formula in the function call, 2. using poly() instead of I() for the terms, 3. using lapply on the model object with a gsub() function to replace the character, 4. using grep with lapply on the model object just to find the character and going through manually with gsub. But basically, lapply doesn't dig down through all the nested list elements, and doesn't return a model object.
#Here's the problem
library(caret)
dat <- data.frame(y = as.factor(rbinom(1000, 1, prob = 0.5)),
                  x = rnorm(1000, 10, 1),
                  w = rnorm(1000, 100, 1),
                  z = rnorm(1000, 1000, 1))
levels(dat$y) <- c("A","P")
Train <- createDataPartition(dat$y, p=0.7, list=FALSE)
train<- dat[ Train, ]
test <- dat[ -Train, ]
lof <- as.formula(y ~ x + I(x^2) + w + I(w^2) + z + I(z^2))
m1 <- glm(lof, family = "binomial", data = train)
m1
pred1 <- predict(m1, newdata = test, type = "response")
ctrl <- trainControl(
    method = "repeatedcv",
    repeats = 3,
    classProbs = TRUE,
    summaryFunction = twoClassSummary
)
m2 <- train(lof, family = "binomial", data = train,
            method = "glm",
            trControl = ctrl,
            metric = "ROC")
m3 <- m2$finalModel
m3
# the backtick-quoted terms break evaluation of I(x^2) etc. in this call
pred2 <- predict(m3, newdata = test, type = "response")
# attempted manual cleanup: find and strip the backticks
res1 <- lapply(m3, function(x) grepl('\\`I', x))
m3$terms[[3]] <- gsub('\\`I', "", m3$terms[[3]])
m3$terms[[3]] <- gsub(')\\`', "", m3$terms[[3]])
m3$terms
res <- lapply(m3, function(x) grepl('\\`I', names(x)))
names(m3$effects)[res$effects == TRUE] <- gsub('\\`I', "", names(m3$effects)[res$effects == TRUE])
names(m3$effects)[res$effects == TRUE] <- gsub(')\\`', "", names(m3$effects)[res$effects == TRUE])
pred3 <- predict(m3, newdata = test, type = "response")
An attempted fix with lapply finds some instances, but not all, and cannot easily be used to modify the original object, so even a manual fix is stymied. Obviously the easiest solution is to figure out how to stop caret from doing this in the first place rather than trying to edit the object.
I should probably note that I'm not trying to extract the finalModel to use with predict; I am aware I can predict with caret on the entire m2 object. I want it for use with the dismo package.
No replies, but what I ended up doing as a workaround was to create new variables (x2 = x^2, etc.). That is fine as far as it goes, but for my use it means I ALSO had to create new raster layers with x^2 etc., which is a waste of time and space.
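For reference, a minimal sketch of that workaround on the toy data above (the squared columns are added by hand so the formula contains no I() terms for caret to quote; m4 and pred4 are just new names for the refit):
# precompute the squared terms in both train and test
train$x2 <- train$x^2; train$w2 <- train$w^2; train$z2 <- train$z^2
test$x2 <- test$x^2; test$w2 <- test$w^2; test$z2 <- test$z^2
lof2 <- y ~ x + x2 + w + w2 + z + z2
m4 <- train(lof2, family = "binomial", data = train,
            method = "glm",
            trControl = ctrl,
            metric = "ROC")
# the finalModel now contains only plain column names, so no backticks appear
pred4 <- predict(m4$finalModel, newdata = test, type = "response")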

debugging caret with SMOTE in R

I'm trying to use SMOTE in R within the trainControl function in caret. Following the author's example, I do as follows:
#first, create an imbalanced data set
set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test <- twoClassSim(10000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
Class1 Class2
9411 589
I want to use the SMOTE algorithm to oversample my minority class. However, this has to be done carefully: for instance, we shouldn't oversample before doing cross-validation, since that would lead to optimistic estimates of the generalization error.
#create my folds (5 in this case)
folds <- createFolds(factor(imbal_train$Class), k = 5, list = TRUE, returnTrain = TRUE)
#trainControl to set up my training phase.
ctrl <- trainControl(method = "cv", index = folds,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = "all",
                     ## new option here:
                     sampling = "smote")
#train the model
set.seed(5627)
smote_inside <- train(Class ~ ., data = imbal_train,
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)
It runs without error. I now want to see the training and testing sets used in each iteration. I need to make sure that before oversampling the training folds, one fold was held out and no new synthetic records were created inside it.
Looking into the objects output by train, I could see that smote_inside$control may have some information. Concretely, it has index and index_out: these are the row indexes for the training and testing data in each CV iteration. However, when I do:
lista <- smote_inside$control
dd <- imbal_train[lista$index$Fold1, ]  # training data, first CV iteration
table(dd$Class)
Class1 Class2
7529 471
You can see that it is still imbalanced. SMOTE is supposed to create some synthetic records from the minority class. Maybe this information is saved in another place?
Questions:
How can I see the new training records that were created using smote to balance the data?
How can I be sure that the testing fold wasn't contaminated by the oversampling?
Where can I find what caret is doing with SMOTE? Pointers to the source code would help.
Some answers:
It does not retain that information.
It is designed not to contaminate the holdout data. If you want proof (beyond what is shown in the link that you reference), look at createModel to see how it does the sampling and at predictionFunction for how the data are handled prior to prediction.
The package sources are available basically everywhere. The two functions above (along with probFunction) do the work.
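If you want to see what the synthetic records would look like, one option is to redo the per-fold sampling yourself using the folds created above. A minimal sketch, assuming caret's sampling = "smote" behaves like DMwR::SMOTE (which it has historically wrapped); the exact SMOTE parameters caret uses may differ:
library(DMwR)  # provides SMOTE(); assumed here to match what caret calls internally
for (f in names(folds)) {
  train_fold <- imbal_train[folds[[f]], ]        # rows used for fitting this fold
  smoted <- SMOTE(Class ~ ., data = train_fold)  # synthetic minority rows added here
  print(table(smoted$Class))                     # now roughly balanced
  # rows outside folds[[f]] form the holdout and are never passed to SMOTE
}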

R Caret using Recipe - Unable to create model using recipe functionality of caret package

I have a dataframe with 1560 samples (rows), four feature columns, and one column with the class (TRUE/FALSE).
Unfortunately the dataframe is too large to give you a reproducible sample. Any general help would be appreciated though!
When I now run the caret train() function,
lr_original <- train(original_data$class, original_data[,1:4],
                     method = 'glm', metric = 'Accuracy',
                     trControl = trainControl(method = 'cv', savePredictions = TRUE))
I get the error Error in table(y) : attempt to make a table with >= 2^31 elements
I already tried different resampling methods (LOOCV and none) as well as different classification methods (knn and svm) - always the same error.
Is 1560 rows too much for the train function? Is there any way around it?
Thank you for your help
It turned out I had interchanged x and y in the train function.
With
lr_original <- train(original_data[,1:4], original_data$class,
                     method = 'glm', metric = 'Accuracy',
                     trControl = trainControl(method = 'cv', savePredictions = TRUE))
it works :)
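In hindsight the error message also makes sense: with the arguments swapped, caret presumably treats the four numeric feature columns as the outcome and tries to cross-tabulate them, and cross-tabulating four near-unique numeric columns from 1560 rows requests far more cells than a table can hold:
# rough upper bound on the cells such a cross-tabulation would request
1560^4 > 2^31  # TRUE, hence "attempt to make a table with >= 2^31 elements"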

R session is aborted using random forest of "caret" package

The table has 560,304 rows and 10 columns. I used the caret package's random forest to train my model. Before training, I defined the types of the time and classification columns; the other columns do not need explicit types. I used ROSE sampling because the classes are imbalanced. I did not use the first column because it is an identifier. Here is my code:
require(openxlsx)
library(chron)
library(caret)
library(randomForest)
set.seed(9560)
record <- read.xlsx('_3.xlsx', sheet = 1, colNames = TRUE,
                    rowNames = FALSE, skipEmptyRows = TRUE)
#Define column types
record$TIME <- chron(times=record$TIME)
record$Is.Fraud <- factor(record$Is.Fraud)
MyControl <- trainControl(method = "repeatedcv", repeats = 5,
                          sampling = "rose")
#train model
rfModel <- train(record$Is.Fraud ~ ., data = record[,2:10], method = 'rf',
                 trControl = MyControl)
However, the R session is aborted and crashed.
When I upsample the data, train two separate random forests, and combine them, there is no problem, even using upsampling instead of ROSE. The following code works fine:
upSmp <- upSample(x = record[, 2:9], y = record$Is.Fraud)
upSmp <- upSmp[sample(nrow(upSmp)), ]
rdFrst1 <- randomForest(Class ~ ., data = upSmp[1:(nrow(upSmp)/2), ])
rdFrst2 <- randomForest(Class ~ ., data = upSmp[((nrow(upSmp)/2) + 1):nrow(upSmp), ])
rf.combined <- combine(rdFrst1, rdFrst2)
I think I am running out of memory when I use repeatedcv. What is my problem? Can I handle it in the caret train function by combining two random forests, the way I did with the randomForest function above?
I have attached the step it stopped at.

r caret predict returns fewer outputs than inputs

I used caret to train an rpart model below.
trainIndex <- createDataPartition(d$Happiness, p=.8, list=FALSE)
dtrain <- d[trainIndex, ]
dtest <- d[-trainIndex, ]
fitControl <- trainControl(## 10-fold CV
    method = "repeatedcv", number = 10, repeats = 10)
fitRpart <- train(Happiness ~ ., data = dtrain, method = "rpart",
                  trControl = fitControl)
testRpart <- predict(fitRpart, newdata=dtest)
dtest contains 1296 observations, so I expected testRpart to produce a vector of length 1296. Instead it's 1077 long, i.e. 219 short.
When I ran the prediction on the first 220 rows of dtest, I got a predicted result of length 1, so it's consistently 219 short.
Any explanation on why this is so, and what I can do to get a consistent output to the input?
Edit: d can be loaded from here to reproduce the above.
I downloaded your data and found what explains the discrepancy.
If you simply remove the missing values from your dataset, the lengths of the outputs match:
testRpart <- predict(fitRpart, newdata = na.omit(dtest))
Note nrow(na.omit(dtest)) is 1103, and length(testRpart) is 1103. So you need a strategy to address missing values. See ?predict.rpart and the options for the na.action parameter to choose what you want.
Similar to what Josh mentioned, if you need to generate predictions using predict.train from caret, simply pass na.action = na.pass:
testRpart <- predict(fitRpart, newdata = dtest, na.action = na.pass)
Note: moving this to a separate answer based on Ricky's comment on Josh's answer above for visibility.
I had a similar issue using "newx" instead of "newdata" in the predict function. Using "newdata" (or nothing) solved my problem; I hope it helps someone else who used newx and hit the same issue.
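For what it's worth, the likely mechanism behind that silent failure: predict methods generally absorb unrecognized arguments such as newx through ..., so newdata stays NULL and you get predictions for the stored training data rather than for your new data. A quick sketch against the rpart fit above:
p_bad <- predict(fitRpart, newx = dtest)  # newx is silently ignored
length(p_bad)                             # reflects the training data, not dtest
p_good <- predict(fitRpart, newdata = dtest, na.action = na.pass)
length(p_good)                            # matches nrow(dtest)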
