Is there a way to use caret for survival analysis? I really like how easy it is to use. I tried fitting a random survival forest using the party package, which is on caret's list of supported models.
This works:
library(survival)
library(caret)
library(party)
fitcforest <- cforest(Surv(futime, death) ~ sex+age, data=flchain,
controls = cforest_classical(ntree = 1000))
but using caret I get an error:
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           repeats = 2)
cforestfit <- train(Surv(futime, death) ~ sex + age, data = flchain, method = "cforest", trControl = fitControl)
I get this error:
Error: nrow(x) == length(y) is not TRUE
Is there a way to make these Surv objects work with caret?
Can I use other survival-analysis-oriented packages with caret?
Thanks
Not yet. That is one of two major updates that should be coming soon (the other expands pre-processing).
Contact me offline if you are interested in helping the development and/or testing of those features.
Thanks,
Max
I have found no way to train survival models with caret. As an alternative, the mlr framework (1) has a set of survival learners (2). I have found mlr to be extremely user-friendly and useful.
mlr: http://mlr-org.github.io/mlr-tutorial/release/html/
survival learners in mlr: http://mlr-org.github.io/mlr-tutorial/release/html/integrated_learners/index.html#survival-analysis-15
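For example, a minimal mlr sketch (my own addition, not from the answer above), using the flchain data from the question; surv.coxph is just one of the survival learners listed at the second link:
library(survival)
library(mlr)
# keep only the columns used in the question; mlr expects a time column and an event column
surv_data <- flchain[, c("futime", "death", "sex", "age")]
surv_data$death <- surv_data$death == 1            # event indicator as logical
task <- makeSurvTask(data = surv_data, target = c("futime", "death"))
learner <- makeLearner("surv.coxph")
# 10-fold cross-validation, scored with mlr's default survival measure (C-index)
rdesc <- makeResampleDesc("CV", iters = 10)
res <- resample(learner, task, rdesc)
res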
There is an increasing number of packages in R that model survival data. Examples:
For lasso and elastic nets: BioSpear.
For random forest: randomForestSRC.
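For instance, a minimal random survival forest with randomForestSRC (my sketch, reusing the flchain data from the question above):
library(survival)
library(randomForestSRC)
fit_rfsrc <- rfsrc(Surv(futime, death) ~ sex + age, data = flchain, ntree = 1000)
print(fit_rfsrc)   # the print method includes an out-of-bag error estimate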
Best, Loic
Related
I am rather new to machine learning and I am currently trying to implement a random forest classification using the caret and randomForest packages in R. I am using the trainControl function with repeated cross-validation. Maybe it is a stupid question, but as far as I understand, random forest usually uses bagging to split the training data into different subsets with replacement, using roughly 1/3 of the observations as a validation set on which the OOB error is calculated. But what happens if you specify that you want to use k-fold cross-validation? From the caret documentation, I assumed that it uses only cross-validation for the resampling. But if it only used cross-validation, why do you still get an OOB error? Or is bagging still used for the creation of the model and cross-validation for the performance evaluation?
TrainingControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE, classProbs = TRUE, search = "grid")
train(x ~ ., data = training_set,
method = "rf",
metric = "Accuracy",
trControl = TrainingControl,
ntree = 1000,
importance = TRUE
)
Trying to address your questions:
random forest usually uses bagging to split the training data into
different subsets with replacement using 1/3 as a validation set based
on which the OOB is calculated on
Yes. caret uses randomForest() from the randomForest package; more specifically, it bootstraps the training data and grows multiple decision trees that are then bagged to reduce overfitting. From the Wikipedia article:
This bootstrapping procedure leads to better model performance because
it decreases the variance of the model, without increasing the bias.
This means that while the predictions of a single tree are highly
sensitive to noise in its training set, the average of many trees is
not, as long as the trees are not correlated.
So if you request k-fold cross-validation from caret, it simply runs randomForest() on the different training folds. Therefore the answer to this:
But what happens if you specify that you want to use k-fold
cross-validation? From the caret documentation, I assumed that it uses
only cross-validation for the resampling, But if it only used
cross-validation, why do you still get an OOB error?
is that the sampling and bagging are still performed, because they are part of the random forest algorithm itself. caret simply repeats this on different training sets and estimates the error on the respective held-out folds. The OOB error reported by randomForest() stays regardless. The difference is that you also get truly "unseen" data that can be used to evaluate your model.
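To see both estimates side by side, a quick sketch (assuming the train() call from the question is assigned to an object, e.g. rf_fit <- train(...)):
rf_fit$finalModel   # the underlying randomForest object; its print-out includes the OOB error
rf_fit$resample     # Accuracy/Kappa measured on the held-out folds of the repeated cross-validation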
I want to tune a classification algorithm that predicts probabilities using caret.
Since my data set is highly unbalanced, caret's default Accuracy metric does not seem very helpful, according to this post: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification.
In my specific case, I want to determine the optimal mtry parameter of a random forest that predicts probabilities. I have 3 classes and a balance ratio of 98.7% - 0.45% - 0.85%. A reproducible example (which sadly has no unbalanced data set) is given by:
library(caret)
data(iris)
control = trainControl(method = "CV", number = 5, verboseIter = TRUE, classProbs = TRUE)
grid = expand.grid(mtry = 1:3)
rf_gridsearch = train(y = iris[, 5], x = iris[-5], method = "ranger", num.trees = 2000, tuneGrid = grid, trControl = control)
rf_gridsearch
So my two questions basically are:
What alternative summary metrics besides Accuracy do I have?
(Using multiROC is not my favourite, due to: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification. I am thinking of something like a Brier score.)
How do I implement them?
Many thanks!
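Not an authoritative answer, but one option is to supply a custom summaryFunction to trainControl, since caret passes the per-class probabilities to it when classProbs = TRUE. The sketch below (names such as brierSummary are mine) computes a multiclass Brier score and tunes on it; lower is better, hence maximize = FALSE. Newer caret versions expect all three ranger tuning parameters in the grid.
library(caret)
# custom summary function: multiclass Brier score from the class probabilities
brierSummary <- function(data, lev = NULL, model = NULL) {
  obs <- model.matrix(~ obs - 1, data = data)      # one-hot encode the observed classes
  colnames(obs) <- lev
  probs <- as.matrix(data[, lev, drop = FALSE])    # predicted class probabilities
  c(Brier = mean(rowSums((probs - obs)^2)))
}
control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = brierSummary)
grid <- expand.grid(mtry = 1:3, splitrule = "gini", min.node.size = 1)
rf_brier <- train(y = iris[, 5], x = iris[-5], method = "ranger", num.trees = 2000,
                  tuneGrid = grid, metric = "Brier", maximize = FALSE, trControl = control)
rf_brier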
I'm working with caret and the method avNNet. I would like to try all subsets of variables while doing cross-validation, so I can determine the best predictors and parameters (like a brute-force approach).
I have used stepAIC with glm; is there something similar?
In the caret manual you will find the "pcaNNet" method, which is Neural Networks with Feature Extraction.
An example using it:
library(caret)
# define training control
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 10, classProbs = TRUE)
# train the model
model <- train(Status ~ ., data = My_data, trControl = train_control, method = "pcaNNet", metric = "Kappa")
# summarize results
print(model)
# confusion matrix of the resampled predictions
confusionMatrix(model)
I am interested in a multi-class problem with imbalanced classes. I have been rather happy with the caret package so far, but I have some practical issues with multiclass classification:
Is it possible to use a metric other than Accuracy and Kappa? I can't find how to use the F-measure or G-mean. It seems that the only choice is between Accuracy and Kappa.
Here is an example of my code for some method and its relevant grid.
myControl <- trainControl(method = 'cv', number = 5, classProbs = TRUE, sampling = 'up')
tune <- train(data.train[, -1], data.train[, 1], tuneGrid = grid, method = method, metric = 'Kappa', trControl = myControl)
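Not a definitive answer, but one option to sketch: caret ships a multiClassSummary() summary function that reports several multiclass metrics beyond Accuracy and Kappa (including a mean F1), which you can then tune on via metric=. The exact metric names, and whether the probability-based ones need the pROC/MLmetrics packages, should be checked on your caret version; "Mean_F1" is assumed here.
library(caret)
# reuse the setup from the question, but compute the multiclass summary metrics
myControl <- trainControl(method = 'cv', number = 5, classProbs = TRUE, sampling = 'up',
                          summaryFunction = multiClassSummary)
tune <- train(data.train[, -1], data.train[, 1], tuneGrid = grid, method = method,
              metric = 'Mean_F1', trControl = myControl)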
I am currently wondering how to set 10 trees when using the random forest algorithm from the caret package, and hope assistance can be obtained:
Below is my syntax:
tr <- trainControl(method = "repeatedcv",number = 20)
fit <- train(y ~ ., method = "rf", data = example, trControl = tr)
Following research on http://www.inside-r.org/packages/cran/randomForest/docs/randomForest, setting ntree = 10 as an argument in randomForest() (or n.trees in the case of gbm) would have helped, but I am interested in the caret package.
Any feedback would be very appreciated.
Thanks
Caret's train() uses the randomForest() function when you specify method = "rf" in the train call.
You simply need to pass ntree = 10 to train which then will be passed on to randomForest().
Therefore, your call would look like this:
fit <- train(y ~ ., method = "rf", data = example, trControl = tr, ntree = 10)
For anyone in my position who landed here while using the ranger method of random forest (Google still directed me here when specifying "ranger" in my search term): use num.trees.
num.trees = 20
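For example, a sketch reusing the tr control object from the earlier answer:
fit_ranger <- train(y ~ ., data = example, method = "ranger", trControl = tr, num.trees = 20)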
I think ntree is the parameter you are looking for.