caret package classifiers are not responding - r

I am trying to fit a model to my training data with the caret package's classifiers, but it does not finish for a very long time (I have waited for 2 hours). It works fine on other datasets.
Here is the link to my training data: http://www.htmldersleri.org/train.csv (it is the well-known Reuters-21578 data set)
And the command I am using is:
model <- train(class ~ ., data = train, method = "knn")
Note: with any other method (e.g. svm, naive Bayes, etc.), it gets stuck anyway.
Note 2: With the e1071 package, the naiveBayes classifier works, but with only 0.08% accuracy!
Can anyone tell me what can be the problem? Thanks in advance.

This seems to be a multiclass classification problem. I'm not sure whether caret supports that. However, I can show you how you would do the same thing with the mlr package:
library(mlr)
x <- read.csv("http://www.htmldersleri.org/train.csv")
tsk <- makeClassifTask(data = x, target = 'class')
# Assess the performance with 10-fold cross-validation
crossval('classif.knn', tsk)
If you want to know which learners are integrated in mlr that support this kind of task, type
listLearners(tsk)

Related

R caret randomforest

Using the defaults of train in the caret package, I am trying to train a random forest model on the dataset xtr2 (dim(xtr2): 765 9408). The problem is that it takes unbelievably long (more than one day for one training run) to fit. As far as I know, train by default uses bootstrap resampling (25 repetitions) and tries three random values of mtry, so why should it take so long?
Please note that I need to train the random forest three times in each run (because I need to average the results of different random forest models on the same data), which takes about three days, and I need to run the code for 10 different samples, so it would take me 30 days to get the results.
My question is: how can I make it faster?
Can changing the defaults of train reduce the running time, for example by using CV for resampling?
Can parallel processing with the caret package help? If yes, how can it be done?
Can tuneRF from the randomForest package reduce the time?
This is the code:
rffit <- train(xtr2, ytr2, method = "rf", ntree = 500)

rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE,
                       mtry = rffit$bestTune$mtry)
Thank you,
My thoughts on your questions:
Yes! But don't forget you also have control over the search grid caret uses for the tuning parameters; in this case, mtry. I'm not sure what the default search grid is for mtry, but try the following:
ctrl <- trainControl("cv", number = 5, verboseIter = TRUE)
set.seed(101) # for reproducibility
rffit <- train(xtr2, ytr2, method = "rf", trControl = ctrl, tuneLength = 5)
Yes! See the caret website: http://topepo.github.io/caret/parallel-processing.html
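Roughly, the usual pattern is to register a parallel backend before calling train. A sketch, assuming the doParallel package and reusing xtr2/ytr2 from your question (4 workers is an arbitrary choice; match it to your machine):
library(doParallel)   # also attaches the parallel package
library(caret)

cl <- makePSOCKcluster(4)          # 4 workers is an assumption; use your core count
registerDoParallel(cl)

ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
rffit <- train(xtr2, ytr2, method = "rf", trControl = ctrl, tuneLength = 5)

stopCluster(cl)
registerDoSEQ()                    # return to sequential processing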
Yes and No! tuneRF simply uses the OOB error to find an optimal value of mtry (the only tuning parameter in randomForest). Using cross-validation tends to work better and produce a more honest estimate of model performance. tuneRF can take a long time but should be quicker than k-fold cross-validation.
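If you do want to try it, here is a rough sketch (reusing xtr2/ytr2 from the question; the stepFactor/improve values are arbitrary starting points, not recommendations):
library(randomForest)
set.seed(101)
tuned <- tuneRF(x = xtr2, y = as.factor(ytr2),
                ntreeTry = 100,     # fewer trees per try keeps the search fast
                stepFactor = 1.5,   # how much mtry is inflated/deflated at each step
                improve = 0.01,     # minimum relative OOB improvement to keep searching
                trace = TRUE, plot = FALSE)
tuned                               # matrix of mtry values and their OOB errors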
Overall, the online manual for caret is quite good: http://topepo.github.io/caret/index.html.
Good luck!
You use train only for determining mtry. I would skip the train step and stay with the default mtry:
rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE)
I strongly doubt that 3 different runs are a good idea.
If you do 10-fold cross-validation (I am not sure it is needed at all, since validation is already built into random forests via the OOB error), 10 folds is too many if you are short on time; 5 folds would be enough.
Finally, the running time of randomForest is proportional to ntree. Set ntree = 100, and your program will run 5 times faster.
I would also just add that if the main issue is speed, there are several other random forest implementations available through caret, and many of them are much faster than the original randomForest, which is notoriously slow. I've found ranger to be a nice alternative that suited my very simple needs.
Here is a nice summary of the random forest packages in R. Many of these are available in caret already.
Also for consideration, here's an interesting study of the performance of ranger vs rborist, where you can see how performance is affected by the tradeoff between sample size and features.
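For example, an untested sketch of ranger through caret, reusing xtr2/ytr2 from the question (num.trees = 100 is just to keep it quick):
library(caret)
library(ranger)

ctrl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
set.seed(101)
rngfit <- train(xtr2, ytr2, method = "ranger",
                trControl = ctrl, tuneLength = 3,
                num.trees = 100)    # passed through to ranger(); fewer trees = faster
rngfit$bestTune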

Error using NB model in textmodel() of quanteda package

I am trying to fit a model to a dfm I created using quanteda. I am getting the following error. Any ideas?
tModel <- textmodel(udfm1,model = "NB", smooth=1)
Error in textmodel(udfm1, model = "NB", smooth = 1) :
model NB not implemented.
P.S. I am creating a model to predict the next word for a mobile application. I only know naive Bayes and am not familiar with the other models in this package, so feel free to recommend one.
Apologies for this: while the ?textmodel indicates that "NB" is an available model, in fact as of quanteda v0.9.1-7 it's not yet implemented. I have code that implements multinomial and Bernoulli Naive Bayes as a textmodel type but we moved it to a development branch pending more testing. (But coming soon.)
For predicting the next word, that sounds like a question for the text-mining tag of Cross-Validated. Nothing directly in quanteda yet for this, but you should be able to use the dfm directly with most classifiers and regression models.
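For example, a rough, untested sketch of feeding the dfm to e1071's naiveBayes. Here labels1 is a hypothetical vector of class labels for the documents (not from your question), and densifying a large dfm may not fit in memory:
library(quanteda)
library(e1071)

x <- as.matrix(udfm1)                            # densify the dfm (watch memory on large corpora)
nb <- naiveBayes(x = x, y = as.factor(labels1))  # labels1 is an assumed label vector
pred <- predict(nb, newdata = x)                 # note: counts are treated as Gaussian features here
table(pred, labels1)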

Applying k-fold Cross Validation model using caret package

Let me start by saying that I have read many posts on cross-validation and it seems there is much confusion out there. My understanding is that it is simply this:
Perform k-fold Cross Validation i.e. 10 folds to understand the average error across the 10 folds.
If acceptable then train the model on the complete data set.
I am attempting to build a decision tree using rpart in R and taking advantage of the caret package. Below is the code I am using.
# load libraries
library(caret)
library(rpart)
# define training control
train_control <- trainControl(method = "cv", number = 10)
# train the model
model <- train(resp ~ ., data = mydat, trControl = train_control, method = "rpart")
# make predictions
predictions <- predict(model, mydat)
# append predictions
mydat <- cbind(mydat, predictions)
# summarize results
confusionMatrix <- confusionMatrix(mydat$predictions, mydat$resp)
I have one question regarding the caret train application. I have read the train section of A Short Introduction to the caret Package, which states that during the resampling process the "optimal parameter set" is determined.
In my example, have I coded it up correctly? Do I need to define the rpart parameters within my code, or is my code sufficient?
When you perform k-fold cross-validation you are already making a prediction for each sample, just over 10 different models (presuming k = 10).
There is no need to make predictions on the complete data set, as you already have the predictions from the k different models.
What you can do is the following:
train_control<- trainControl(method="cv", number=10, savePredictions = TRUE)
Then
model<- train(resp~., data=mydat, trControl=train_control, method="rpart")
if you want to see the observed and predictions in a nice format you simply type:
model$pred
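If you also want a cross-validated confusion matrix, one way (a sketch; the merge keeps only the predictions made with the selected tuning parameter) is:
# keep only the rows of model$pred that used the winning tuning parameter(s)
cv_preds <- merge(model$pred, model$bestTune)
confusionMatrix(cv_preds$pred, cv_preds$obs)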
Also, for the second part of your question: caret should handle all the parameter tuning. You can manually specify the tuning parameters to try if you desire, as sketched below.
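A minimal sketch of a manual grid for rpart's single tuning parameter cp (the grid values here are arbitrary):
grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
model <- train(resp ~ ., data = mydat, trControl = train_control,
               method = "rpart", tuneGrid = grid)
model$bestTune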
An important thing to note here is not to confuse model selection with model error estimation.
You can use cross-validation to estimate the model hyper-parameters (regularization parameter for example).
Usually that is done with 10-fold cross-validation, because it is a good choice for the bias-variance trade-off (2-fold could produce models with high bias; leave-one-out CV can produce models with high variance/over-fitting).
After that, if you don't have an independent test set, you could estimate an empirical distribution of some performance metric using cross-validation: once you have found the best hyper-parameters, you can use them to estimate the CV error.
Note that in this step the hyper-parameters are fixed, but the model parameters may differ across the cross-validation models.
On the first page of the short introduction document for the caret package, it is mentioned that the optimal model is chosen across the parameters.
As a starting point, one must understand that cross-validation is a procedure for selecting the best modelling approach rather than the model itself (CV - Final model selection). caret provides a grid search option via tuneGrid, where you can supply a list of parameter values to test. The final model will have the optimized parameters after training is done.

pruning tree with caret library returns complex trees

I'm using the caret package for a tree model. I understood that caret uses CV to find the optimal tuning parameter for pruning the tree.
This is the code I use:
id2 <- sample(1:nrow(data),2/3*nrow(data))
#learn
app <- data[id2,]
#test
test <- data[-id2,]
ctrl <- trainControl(method = "cv", number = 8, classProbs = TRUE, summaryFunction = twoClassSummary)
mod0 <- train(class ~ ., data = app, method = "rpart", trControl = ctrl, metric = "ROC")
plot(mod0)
plot(mod0$finalModel, uniform = TRUE, margin = 0.1); text(mod0$finalModel, cex = 0.8)
Here is my data: https://drive.google.com/open?id=1xrCXTLqKvGiGeo2X0Y1DvoSKvzbYFnyccLimceDIbZg
But every time I run the code I get trees of different complexity (because of CV?), and the tree is not really pruned but very complex, with a lot of terminal nodes.
How can I get a less complex tree?
You need to set the seed prior to calling train to get reproducible results. Also, if you are running in parallel, set the seeds option in trainControl.
As for "complex trees"... that is pretty subjective. Why do you expect them to be more simplistic?
One difference between the results of train and rpart is that the latter uses the "one SE" method for pruning while train prunes to the depth with the best performance. You can use a "one SE" method with train too (see the package website) but I've always found that it tends to be conservative (which was the original point).
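A rough sketch of that, reusing the control object from the question (see the package website for details):
ctrl <- trainControl(method = "cv", number = 8, classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     selectionFunction = "oneSE")   # simplest model within one SE of the best
mod1 <- train(class ~ ., data = app, method = "rpart",
              trControl = ctrl, metric = "ROC")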
Max

R. How to boost the SVM model

I have made an SVM model using the svm() function in R for a classification problem. I got only 87% accuracy, but random forest produces around 92.4%.
fit.svm <- svm(modelformula, data = training, gamma = 0.01, cost = 1, cross = 5)
I would like to use boosting for tuning this SVM model. Can someone help me tune this SVM model?
What are the best parameters I can provide for the SVM method?
An example of boosting for an SVM model would be appreciated.
To answer your first question.
The e1071 library in R has a built-in tune() function to perform CV. This will help you select the optimal parameters (cost, gamma, kernel). You can also fit an SVM in R with the kernlab package. You may get different results from the two libraries. Let me know if you need any examples.
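For instance, a minimal sketch with tune(), reusing modelformula and training from your question (the gamma/cost grids are arbitrary starting points, not recommendations):
library(e1071)
set.seed(101)
tuned <- tune(svm, modelformula, data = training,
              ranges = list(gamma = 10^(-4:0), cost = 10^(0:3)),
              tunecontrol = tune.control(cross = 5))   # 5-fold CV over the grid
summary(tuned)
best.svm <- tuned$best.model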
You may want to look into the caret package. It allows you to both pick various kernels for SVM (model list) and also run parameter sweeps to find the best model.
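An untested sketch of such a sweep with an RBF kernel, again reusing modelformula and training from the question:
library(caret)
ctrl <- trainControl(method = "cv", number = 5)
set.seed(101)
svmfit <- train(modelformula, data = training, method = "svmRadial",
                trControl = ctrl, tuneLength = 8,      # sweeps the cost; sigma is estimated
                preProcess = c("center", "scale"))
svmfit$bestTune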
