I am very new to machine learning and was asked to run a series of methods to predict a variable in my study. I am trying to predict a variable using "ranger", "ctree" and "xgbTree", setting up trainControl ahead of time with method = "cv".
library(randomForest)
library(caret)
library(ranger)

ind   <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train <- iris[ind == 1, ]
test  <- iris[ind == 2, ]

set.seed(222)
ctrl   <- trainControl(method = "cv", number = 10)
new.rf <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl)
My issue is that I want the method to use the same parameters/conditions each time my data gets a new split. I only want the "ind" variable to be randomized; everything else in trainControl and the subsequent train and predict calls should stay consistent. I have included a sample of my code using the iris data built into R. I plan to split the data into 70% training and 30% validation.
#### This is how I was checking whether the training models were consistent
ctrl1 <- trainControl(method = "cv", number = 10)
ctrl2 <- trainControl(method = "cv", number = 10)
new.rf1 <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl1)
new.rf2 <- caret::train(Sepal.Length ~ ., data = train, method = "ranger", trControl = ctrl2)
rf.pred1 <- predict(new.rf1, test)
rf.pred2 <- predict(new.rf2, test)
rf.pred1
rf.pred2
I am sure my wording is not correct in many, if not all, parts. The reason for this task is to see how much the outcomes change with different random splits of my sample data.
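One way to set this up (a minimal sketch, not from the original post; run_once is a hypothetical helper) is to wrap the split, fit, and hold-out evaluation in one function, reuse the same trainControl object every time, and vary only the seed that drives the split:

library(caret)
library(ranger)

# Fixed resampling settings, reused for every split
ctrl <- trainControl(method = "cv", number = 10)

# Hypothetical helper: re-randomize only the 70/30 split, keep everything else fixed
run_once <- function(seed) {
  set.seed(seed)
  ind       <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
  train_dat <- iris[ind == 1, ]
  test_dat  <- iris[ind == 2, ]
  fit <- caret::train(Sepal.Length ~ ., data = train_dat,
                      method = "ranger", trControl = ctrl)
  postResample(predict(fit, test_dat), test_dat$Sepal.Length)  # RMSE, R^2, MAE on the hold-out
}

# Compare outcomes across a few different random splits
sapply(c(1, 2, 3), run_once)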
Related
In R, I have a logistic regression model as follows:
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is less than n (e.g. 1), so I can automate removing those variables and rerunning the model?
I was unable to get this information from the 'calculatedVarImp' variable by subsetting the 'Overall' value:
lowVarImp <- subset(calculatedVarImp, importance$Overall < 1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It iteratively adds or drops predictors and stops at the model with the lowest AIC.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets, but it may be worth reading about the downsides of stepwise selection before relying on it.
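As for the original subsetting attempt: assuming the standard layout of the object returned by varImp (an $importance data frame whose row names are the variable names), something along these lines should work; treat it as a sketch rather than tested code:

# varImp(logit_Model, scale = FALSE) returns an object whose $importance slot
# is a data frame with one row per predictor and an "Overall" column
imp_df <- calculatedVarImp$importance

# Names of variables with overall importance below the threshold (here 1)
low_vars <- rownames(imp_df)[imp_df$Overall < 1]
low_vars

# These could then be dropped before refitting, e.g.
# df_reduced <- df[, !(names(df) %in% low_vars)]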
Below are my scripts for running a KNN model using cross-validation.
## cross-validation
library(caret)
cvroc <- trainControl(method = "repeatedcv",
                      number = 10,    # number of folds
                      repeats = 3,
                      classProbs = TRUE,
                      summaryFunction = twoClassSummary)

# KNN model
set.seed(222)
fit_roc <- train(admit ~ .,
                 data = training,
                 method = 'knn',
                 tuneLength = 20,     # ignored here, because tuneGrid is supplied below
                 trControl = cvroc,
                 preProc = c("center", "scale"),
                 metric = "ROC",
                 tuneGrid = expand.grid(k = 1:60))
fit_roc
(KNN model output screenshot omitted.)
Question:
My question is: how can I convert the output from the model into a data.frame?
I used the command below, but it gives an error.
aa <- data.frame(fit_roc)
Thanks!
Depending on which part of the output you want, you can extract it directly from the fitted object; the point is fit_roc$<component>. For example, the resampling results:
fit_roc$results
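fit_roc$results is already a data frame, with one row per value of k and the ROC, sensitivity and specificity columns produced by twoClassSummary. A few other components of a caret train object are often useful as well (the CSV file name below is just an example):

aa <- fit_roc$results    # tuning results as a data frame
class(aa)                # "data.frame"

fit_roc$bestTune         # the chosen value of k
fit_roc$resample         # per-fold performance of the final model

# Write the tuning results out if needed
write.csv(fit_roc$results, "knn_cv_results.csv", row.names = FALSE)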
I have been delving into the R package caret recently, and have a question about reproducibility and comparison of models during training that I haven't quite been able to pin down.
My intention is that each train call, and thus each resulting model, uses the same cross-validation splits, so that the stored cross-validation results (the out-of-sample estimates calculated during model building) are comparable across models.
One method I've seen is that you can specify the seed prior to each train call as such:
set.seed(1)
model <- train(..., trControl = trainControl(...))
set.seed(1)
model2 <- train(..., trControl = trainControl(...))
set.seed(1)
model3 <- train(..., trControl = trainControl(...))
However, does sharing a trainControl object between the train calls mean that they use the same resampling indexes, or do I have to explicitly pass the seeds argument into the function? Does the trainControl object invoke random number generation when it is used, or is everything set at declaration?
My current method has been:
set.seed(1)
train_control <- trainControl(method="cv", ...)
model1 <- train(..., trControl = train_control)
model2 <- train(..., trControl = train_control)
model3 <- train(..., trControl = train_control)
Are these train calls going to use the same splits and be comparable, or do I have to take further steps to ensure that, e.g. specifying seeds when the trainControl object is made, or calling set.seed before each train? Or both?
Hopefully this makes some sense and isn't a complete load of rubbish. Any help would be appreciated.
The code project I'm asking about is linked here; it might be easier to read it to see what I mean.
The CV folds are not created when trainControl is defined unless you explicitly supply them via the index argument, which is what I recommend. They can be created using one of the specialized caret functions:
createFolds
createMultiFolds
createTimeSlices
groupKFold
That being said, simply setting a seed before the trainControl definition will not result in the same CV folds across train calls.
Example:
library(caret)
library(tidyverse)

set.seed(1)
trControl <- trainControl(method = "cv",
                          returnResamp = "final",
                          savePredictions = "final")
Create two models:
knnFit1 <- train(iris[, 1:4], iris[, 5],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trControl)

ldaFit2 <- train(iris[, 1:4], iris[, 5],
                 method = "lda",
                 tuneLength = 10,
                 trControl = trControl)
Check whether the same row indexes ended up in the same folds:
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
#FALSE
If you set the same seed prior to each train call:
set.seed(1)
knnFit1 <- train(iris[, 1:4], iris[, 5],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trControl)

set.seed(1)
ldaFit2 <- train(iris[, 1:4], iris[, 5],
                 method = "lda",
                 tuneLength = 10,
                 trControl = trControl)

set.seed(1)
rangerFit3 <- train(iris[, 1:4], iris[, 5],
                    method = "ranger",
                    tuneLength = 10,
                    trControl = trControl)
knnFit1$pred %>%
left_join(ldaFit2$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
knnFit1$pred %>%
left_join(rangerFit3$pred, by = "rowIndex") %>%
mutate(same = Resample.x == Resample.y) %>%
{all(.$same)}
the same indexes will be used in the folds. However, I would not rely on this method when using parallel computation. To ensure the same data splits are used, it is best to define them manually via the index/indexOut arguments to trainControl.
When you set the index argument manually, the folds will be the same; however, this does not guarantee that models built by the same method will be identical, since most methods involve some stochastic process. So, to be fully reproducible, it is advisable to also set the seed prior to each train call. When running in parallel, the seeds argument to trainControl needs to be set to obtain fully reproducible models.
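Putting that together, a minimal sketch of fully pre-defined resampling could look like this (the numbers are placeholders; the seeds list must have length B + 1, with each of the first B elements at least as long as the number of tuning combinations evaluated per resample):

library(caret)

# 1. Create the folds once, outside trainControl, so every model sees the same splits
set.seed(1)
cv_folds <- createFolds(iris[, 5], k = 10, returnTrain = TRUE)

# 2. Pre-generate seeds so results are reproducible even under parallel execution:
#    a list of length B + 1 (B = 10 resamples); each of the first B elements holds
#    one integer per candidate tuning combination (25 is a generous upper bound here),
#    and the last element is a single integer for the final model fit
seeds <- vector(mode = "list", length = 11)
for (i in 1:10) seeds[[i]] <- sample.int(10000, 25)
seeds[[11]] <- sample.int(10000, 1)

trControl <- trainControl(method = "cv",
                          index = cv_folds,
                          seeds = seeds,
                          savePredictions = "final")

# Both models are now trained and evaluated on exactly the same folds
knnFit <- train(iris[, 1:4], iris[, 5], method = "knn", tuneLength = 10, trControl = trControl)
ldaFit <- train(iris[, 1:4], iris[, 5], method = "lda", trControl = trControl)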
I have a model like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = T,
summaryFunction = twoClassSummary
)
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
How do I plot the ROC curve for this model? As I understand it, the probabilities must be saved (which I did in trainControl), but because of the random sampling that bootstrapping uses to generate a 'test' set, I am not sure how caret calculates the ROC value or how to generate a curve.
To isolate the class probabilities for the best performing parameters, I am doing:
for (a in 1:length(model$bestTune)) {
  model$pred <-
    model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ]
}
Please advise.
Thanks!
First, an explanation:
If you are not going to check how each possible hyperparameter combination predicted on each sample in each re-sample, you can set savePredictions = "final" in trainControl to save space:
fitControl <-
trainControl(
method = "boot632",
number = 10,
classProbs = T,
savePredictions = "final",
summaryFunction = twoClassSummary
)
after running the model:
model <- train(
Class ~ .,
data = my_data,
method = "xgbTree",
trControl = fitControl,
metric = "ROC"
)
The results of interest are in model$pred.
Here you can check how many samples were tested in each re-sample:
nrow(model$pred[model$pred$Resample == "Resample01",])
#83
caret always provides predictions from rows not used to build the model.
nrow(my_data) #208
83/208 ≈ 0.4, which makes sense for the held-out (out-of-bag) samples under boot632: on average about 36.8% of rows are left out of each bootstrap sample.
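To see the held-out counts for every re-sample at once (a small addition, not part of the original answer):

# number of out-of-bag (held-out) rows per bootstrap re-sample
table(model$pred$Resample)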
Now, to build the ROC curve, you have several options:
- average the probability for each sample across re-samples and use that (this is usual for CV, since every sample is held out the same number of times, but it can also be done with bootstrapping; a sketch of this is at the end of this answer)
- plot everything as is, without averaging
- plot a ROC curve for each re-sample
I will show you the second approach:
Create a data frame of class probabilities and true outcomes:
for_lift <- data.frame(Class = model$pred$obs, xgbTree = model$pred$R)  # R = predicted probability of class "R"
plot ROC:
pROC::plot.roc(pROC::roc(response = for_lift$Class,
predictor = for_lift$xgbTree,
levels = c("M", "R")),
lwd=1.5)
You can also do this with ggplot2. To do so, I find it easiest to make a lift object using the caret function lift:
lift_obj <- lift(Class ~ xgbTree, data = for_lift, class = "R")
The class argument specifies which class the probabilities refer to.
library(ggplot2)
ggplot(lift_obj$data)+
geom_line(aes(1-Sp , Sn, color = liftModelVar))+
scale_color_discrete(guide = guide_legend(title = "method"))
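For completeness, here is a sketch of the first option: averaging the predicted probability of class "R" for each row across re-samples and building a single ROC curve from the averages. It reuses the same model$pred object and only base R aggregation:

# Average the class "R" probability per held-out row across all re-samples
avg_pred <- aggregate(R ~ rowIndex + obs, data = model$pred, FUN = mean)

# One ROC curve from the averaged probabilities
pROC::plot.roc(pROC::roc(response = avg_pred$obs,
                         predictor = avg_pred$R,
                         levels = c("M", "R")),
               lwd = 1.5)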
I am currently learning how to implement logistic regression in R.
I have taken a data set and split it into a training and test set and wish to implement forward selection, backward selection and best subset selection using cross validation to select the best features.
I am using caret to implement cross-validation on the training data set and then testing the predictions on the test data.
I have seen the rfe control in caret and have also looked at the documentation on the caret website, as well as following the links in the question How to use wrapper feature selection with algorithms in R?. It isn't apparent to me how to change the type of feature selection, as it seems to default to backward selection. Can anyone help me with my workflow? Below is a reproducible example.
library("caret")
# Create an Example Dataset from German Credit Card Dataset
mydf <- GermanCredit
# Create Train and Test Sets 80/20 split
trainIndex <- createDataPartition(mydf$Class, p = .8,
list = FALSE,
times = 1)
train <- mydf[ trainIndex,]
test <- mydf[-trainIndex,]
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     savePredictions = TRUE)

mod_fit <- train(Class ~ ., data = train,
                 method = "glm",
                 family = "binomial",
                 trControl = ctrl,
                 tuneLength = 5)
# Check out Variable Importance
varImp(mod_fit)
summary(mod_fit)
# Test the model on new, unseen data
pred = predict(mod_fit, newdata=test)
accuracy <- table(pred, test$Class)
sum(diag(accuracy))/sum(accuracy)
You can do this directly within the train call. For backward stepwise selection, the code below is sufficient:
trControl <- trainControl(method = "cv",
                          number = 5,
                          savePredictions = TRUE,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary)

caret_model <- train(Class ~ .,
                     data = train,
                     method = "glmStepAIC",   # This method fits the best model stepwise.
                     family = "binomial",
                     direction = "backward",  # Direction of the stepwise search
                     trControl = trControl)
Note that in trControl:
method = "cv",                     # no need for repeated CV here; number defines the k in k-fold
classProbs = TRUE,
summaryFunction = twoClassSummary  # returns the ROC, sensitivity and specificity of the chosen model
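To round off the workflow from the question, the stepwise-selected model can then be inspected and checked on the held-out test set (a sketch; it assumes the train/test split created earlier):

# Variables retained by the stepwise search
summary(caret_model$finalModel)

# Evaluate on the unseen test data
pred <- predict(caret_model, newdata = test)
confusionMatrix(pred, test$Class)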