R session aborts when using random forest from the caret package

The table has 560,304 rows and 10 columns. I used the caret package's random forest to train my model. Before training, I defined the types of the time and classification columns; the other columns do not need explicit type definitions. Because the classes are imbalanced, I used ROSE sampling. I did not use the first column because it is an identifier. Here is my code:
require(openxlsx)
library(chron)
library(caret)
library(randomForest)
set.seed(9560)
record <- read.xlsx('_3.xlsx', sheet = 1, colNames = TRUE,
                    rowNames = FALSE, skipEmptyRows = TRUE)
#Define column types
record$TIME <- chron(times = record$TIME)
record$Is.Fraud <- factor(record$Is.Fraud)
#repeated CV (note the lowercase "repeatedcv") with ROSE sampling inside each resample
MyControl <- trainControl(method = "repeatedcv", repeats = 5,
                          sampling = "rose")
#train model; column 1 (the identifier) is dropped, and the response is
#named in the formula rather than passed as record$Is.Fraud
rfModel <- train(Is.Fraud ~ ., data = record[, 2:10], method = 'rf',
                 trControl = MyControl)
However, the R session aborts and crashes.
When I instead upsample the data, train two separate random forests, and combine them, there is no problem, even though that approach uses upsampling rather than ROSE. The following code works fine:
#upsample the minority class, then shuffle the rows
upSmp <- upSample(x = record[, 2:9], y = record$Is.Fraud)
upSmp <- upSmp[sample(nrow(upSmp)), ]
#train a random forest on each half of the data, then combine them
rdFrst1 <- randomForest(Class ~ ., data = upSmp[1:(nrow(upSmp)/2), ])
rdFrst2 <- randomForest(Class ~ ., data = upSmp[((nrow(upSmp)/2) + 1):nrow(upSmp), ])
rf.combined <- combine(rdFrst1, rdFrst2)
I think I am running out of memory when I use repeatedcv. What is my problem? Can I handle it within caret's train function by combining two random forests, the way I did with the randomForest function?
I have attached a screenshot of the step at which it stops.
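One way to shrink train's memory footprint is to stop it from storing copies of the data and predictions; a minimal sketch using real trainControl options (returnData, returnResamp, savePredictions, trim), with illustrative fold, repeat, and tree counts:
#a leaner control object: fewer repeats, no stored copies of data or predictions
LeanControl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                            sampling = "rose",
                            returnData = FALSE,        #don't keep the training data in the fit
                            returnResamp = "final",    #keep resample results for the final model only
                            savePredictions = "none",  #don't store hold-out predictions
                            trim = TRUE)               #drop components not needed for predict()
#fewer trees also reduce memory; ntree is passed through to randomForest
rfModelLean <- train(Is.Fraud ~ ., data = record[, 2:10], method = "rf",
                     trControl = LeanControl, ntree = 100)
Whether this avoids the crash still depends on available RAM; it is a sketch, not the poster's solution.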

Related

R training and tuning random forest classifier with hardware challenge

I'm sorry if this is not appropriate to ask here, but please forgive a newbie.
I'm training a random forest multi-class (8 classes) classifier with R caret on my experimental data, using my desktop with 32 GB of RAM and a 4-core CPU. However, RStudio constantly complains that it cannot allocate a vector of 9 GB, so I have to reduce the training set to 1% of the data just to run k-fold CV and some grid search. As a result, my model accuracy is ~50% and the selected features aren't very good at all; only 2 of the 8 classes are distinguished somewhat reliably. Of course, it could be that I simply don't have any good features, but I would at least like to train and tune my model on a decent amount of training data first. What solutions could help? Is there anywhere I can upload my data and train remotely? I'm so new that I don't know whether something cloud-based could help me. Pointers will be appreciated.
Edit: I have uploaded the data table and my code, so maybe it is my bad coding that screwed things up.
Here is a link to the data:
https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing
Here are my codes:
#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)
#read the data in
df.raw <- fread("CLL_merged_sampled_same_ctrl_40percent.csv", header = TRUE, data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw, select = c(18:131))
df <- subset(df.1, select = -c(2:4))
#As I want to build a RF model to classify drug treatments,
#make treatmentsum a factor
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum
#find near-zero-variance features
#I did not remove them, just flagged them
nzv <- nearZeroVar(df[-1], saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged, "Near Zero Features flagged.CSV", row.names = TRUE)
#identify correlated features
df.Cor <- cor(df[-1])
highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
highlyCor <- findCorrelation(df.Cor, cutoff = .99,verbose = TRUE)
#Get rid of strongly correlated features
filtered.df<- df[ ,-highlyCor]
str(filtered.df)
#identify linear dependencies
linear.combo <- findLinearCombos(filtered.df[-1])
linear.combo #no linear ones detected
#split data into training and test sets
#Here is my problem: I want to use 80% of the data for training,
#but on my computer I can only use 0.002
set.seed(123)
split <- sample.split(filtered.df$treatmentsum, SplitRatio = 0.8)
training_set <- subset(filtered.df, split==TRUE)
test_set <- subset(filtered.df, split==FALSE)
#scaling numeric data
#leave the first column labels out
training_set[-1] = scale(training_set[-1])
test_set[-1] = scale(test_set[-1])
training_set[1]
#build RF
#use Cross validation for model training
#I can't use repeated CV as it fails on my machine
#I set a grid search for tuning
control <- trainControl(method="cv", number=10,verboseIter = TRUE, search = 'grid')
#default mtry below is around 10
#mtry <- sqrt(ncol(training_set))
#I used mtry 1:12 before; I wanted to test more, but am again limited by the machine
tunegrid <- expand.grid(.mtry = (1:20))
#use the x/y interface only; the data= and type= arguments are not needed here
model <- train(x = training_set[, -1], y = as.factor(training_set[, 1]),
               method = "rf", trControl = control, metric = "Accuracy",
               maximize = TRUE, importance = TRUE, ntree = 800,
               tuneGrid = tunegrid)
print(model)
plot(model)
prediction2 <- predict(model, test_set[, -1])
#positive= only applies to two-class problems, so it is dropped for this 8-class model
cm <- confusionMatrix(prediction2, as.factor(test_set[, 1]))
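If randomForest itself is the memory bottleneck, a commonly suggested alternative (a sketch, not part of the original post) is caret's ranger backend, a faster and more memory-efficient random forest implementation; note that ranger's tuning grid needs mtry, splitrule, and min.node.size:
library(caret)
library(ranger)  #assumed installed; memory-efficient random forest backend
control.lean <- trainControl(method = "cv", number = 10, verboseIter = TRUE,
                             returnData = FALSE,        #don't store the training set in the fit
                             savePredictions = "none",  #don't store hold-out predictions
                             trim = TRUE)
#ranger tunes mtry, splitrule, and min.node.size
tunegrid.ranger <- expand.grid(mtry = 1:20,
                               splitrule = "gini",
                               min.node.size = 1)
model.ranger <- train(x = training_set[, -1], y = as.factor(training_set[, 1]),
                      method = "ranger", trControl = control.lean,
                      metric = "Accuracy", num.trees = 800,
                      tuneGrid = tunegrid.ranger)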

debugging caret with SMOTE in R

I'm trying to use SMOTE in R within the trainControl function in caret. Following the author's example, I do the following:
#first, create an imbalanced data set
set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test <- twoClassSim(10000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
Class1 Class2
  9411    589
I want to use the SMOTE algorithm to oversample my minority class. However, this has to be done carefully. For instance, we shouldn't oversample before doing cross-validation; that would give us an optimistic estimate of the generalization error.
#create my folds (5 in this case)
folds <- createFolds(factor(imbal_train$Class), k = 5, list = TRUE, returnTrain = TRUE)
#trainControl to set up my training phase
ctrl <- trainControl(method = "cv", index = folds,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = "all",
                     ## new option here:
                     sampling = "smote")
#train the model
set.seed(5627)
smote_inside <- train(Class ~ ., data = imbal_train,
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)
It runs without error. I now want to see the training and test sets used in each iteration, and I need to make sure that before the training folds were oversampled, one fold was held out with no new synthetic records created inside it.
Looking into the objects output by train, I could see that smote_inside$control may have some information. Specifically, it has index and index_out: these are the row indexes for training and testing in each CV iteration. However, when I do:
lista <- smote_inside$control
dd <- imbal_train[lista$index$Fold1, ] #training data, first CV iteration
table(dd$Class)

Class1 Class2
  7529    471
You can see that it is still imbalanced. SMOTE is supposed to create synthetic records from the minority class. Maybe this information is saved somewhere else?
Questions:
How can I see the new training records that were created using SMOTE to balance the data?
How can I be sure that the held-out fold wasn't contaminated by the oversampling?
Where can I find what caret is doing with SMOTE? Pointers to the source code?
Some answers:
It does not retain that information.
It is designed not to contaminate the holdout data. If you want proof (beyond what is shown in the link that you reference), look at createModel to see how it does the sampling and at predictionFunction for how the data are handled prior to prediction.
The package sources are available basically everywhere. The two functions above (along with probFunction) do the work.
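To inspect the synthetic rows yourself, you can mimic what caret does inside each resample by applying SMOTE to a single fold's training rows; a sketch assuming the DMwR package, whose SMOTE() caret's sampling = "smote" wrapped (DMwR has since been archived on CRAN, so treat this as illustrative):
library(DMwR)  #archived on CRAN; install from the archive, or adapt to a maintained SMOTE implementation

#take the training rows of the first CV fold, exactly as caret does
fold1_train <- imbal_train[smote_inside$control$index$Fold1, ]
table(fold1_train$Class)   #still imbalanced: SMOTE happens after this split

#apply SMOTE to the fold's training rows only; the held-out fold is untouched
fold1_smoted <- SMOTE(Class ~ ., data = fold1_train)
table(fold1_smoted$Class)  #roughly balanced, now containing synthetic minority rows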

R Caret using Recipe - Unable to create model using recipe functionality of caret package

I have a dataframe with 1560 samples (rows), four features (columns), and one column with the class (TRUE/FALSE).
Unfortunately, the dataframe is too large to give you a reproducible sample. Any general help would be appreciated, though!
When I now run the caret train() function,
lr_original <- train(original_data$class, original_data[,1:4], method = 'glm', metric = 'Accuracy',
                     trControl = trainControl(method = 'cv', savePredictions = TRUE))
I get the error: Error in table(y) : attempt to make a table with >= 2^31 elements
I have already tried different resampling methods (LOOCV and none) as well as different classification methods (knn and svm); I always get the same error.
Are 1560 rows too many for the train function? Is there any way around this?
Thank you for your help.
I had interchanged x and y in the train function. With
lr_original <- train(original_data[,1:4], original_data$class, method = 'glm', metric = 'Accuracy',
                     trControl = trainControl(method = 'cv', savePredictions = TRUE))
it works :)
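To make the argument order harder to get wrong, train's x/y interface also accepts named arguments; the same call, as a minimal sketch (assuming class should be a factor for classification):
#naming x (predictors) and y (outcome) avoids swapping them by position
lr_original <- train(x = original_data[, 1:4],
                     y = as.factor(original_data$class),
                     method = 'glm', metric = 'Accuracy',
                     trControl = trainControl(method = 'cv', savePredictions = TRUE))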

Plot in SVM model (e1071 Package) using DocumentTermMatrix

I'm trying to create a plot for my model, built using svm from the e1071 package.
My code to build the model, predict, and compute the confusion matrix is:
ptm <- proc.time()
svm.classifier <- svm(x = train.set.list[[0.999]][["0_0.1"]],
                      y = train.factor.list[[0.999]][["0_0.1"]],
                      kernel = "linear")
pred <- predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] <- proc.time() - ptm
confmatrix <- confusionMatrix(pred, test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contain the train and test sets for several conditions, and train.factor.list and test.factor.list hold the true labels for each set. Both the train and test sets are DocumentTermMatrix objects.
Then I tried to plot my data with
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but I got the message:
Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
  missing formula.
What am I doing wrong? The confusion matrix looks good to me, even without using the formula parameter in the svm function.
Without code to run, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula: formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. You should specify the two dimensions to visualize in your plot call:
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)
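As a self-contained illustration (using the built-in iris data rather than the poster's DocumentTermMatrix), the formula picks the two plotted dimensions and slice fixes the remaining predictors:
library(e1071)

#four predictors, so plot.svm needs a formula to choose two of them
iris.svm <- svm(Species ~ ., data = iris, kernel = "linear")

#visualize the decision regions in the Petal.Width x Petal.Length plane,
#holding the other two predictors at fixed values
plot(iris.svm, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 5))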

Cannot extractPrediction using caret in R

I'm totally stuck on a random forest classification model, since I cannot extract predictions, and I'm really out of clues:
predict(forest.model1, titanic.final.test)
works like a charm, while
extractPrediction(list(forest.model1), testX = titanic.final.test[,-2], testY = titanic.final.test[,2])
which should be equivalent, gives me this error:
Error in predict.randomForest(modelFit, newdata) :
  variables in the training data missing in newdata
Here's my trainControl:
forest.fitControl <- trainControl(method = "repeatedcv", repeats = 5,
                                  summaryFunction = twoClassSummary, classProbs = TRUE,
                                  returnData = TRUE, seeds = NULL,
                                  savePredictions = TRUE, returnResamp = "all")
Any ideas?
Test and train sets need to have the same structure (i.e., all the same columns), so my only guess is that negating the second column results in a different structure than the data used to train the model. It's hard to know without seeing the structure of the training vs. test data frames.
Edit after looking at the code:
I recreated this from your repo... Are you sure it shouldn't be the first column that you pull out for testX and use as testY? Something like:
extractPrediction(list(forest.model1), testX = titanic.final.test[,-1], testY = titanic.final.test[,1])
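To confirm a column mismatch like this, one quick check (a generic debugging sketch, not from the original answer) is to compare the predictors caret trained on against the columns being passed as testX:
library(caret)

#predictors() lists the predictor names a train() model was fit with
trained.vars <- predictors(forest.model1)

#any trained predictor missing from testX triggers exactly this error
setdiff(trained.vars, names(titanic.final.test[, -1]))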
