I created a random forest model using the train function from caret with the following train control parameters:
ctrl <- trainControl(sampling = "down")
set.seed(123)
rf_fit <- train(Dep ~ .,
                data = dt_train,
                method = "rf",
                metric = "Kappa",
                trControl = ctrl,
                importance = TRUE)
My forest is performing well, and now I need to explain it in simpler terms. I am using the inTrees package for that, but to proceed I first need the undersampled data that train generated.
Is there any way to get the undersampled data out of the train function?
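Not a full answer, but possibly useful: with sampling = "down" the downsampling happens inside each resampling iteration, so as far as I know train does not store one single undersampled data set you can extract afterwards. A hedged workaround is to build one yourself with caret::downSample(), the helper the sampling option relies on; note that the rows will not necessarily be the same ones train drew internally:
# Sketch only: recreate a downsampled copy of the training data outside train.
# The column name "Dep" and the seed are taken from the question; adjust as needed.
library(caret)

set.seed(123)
dt_down <- downSample(x = dt_train[, setdiff(names(dt_train), "Dep")],
                      y = dt_train$Dep,
                      yname = "Dep")
table(dt_down$Dep)  # classes should now be balanced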
I am using the train function to train a Random Forest model:
fitControl = trainControl(method="oob")
tuneGrid = expand.grid(.mtry=c(15))
rfmod = train(p ~ x + y + z,
              method = "rf",
              data = train_df,
              tuneGrid = tuneGrid,
              trControl = fitControl,
              importance = TRUE,
              allowParallel = TRUE)
This is simplified code showing the structure of my model; I am training it on the data train_df.
I want the prediction from each individual tree. After some searching, I tried this:
preds <- predict(rfmod, newdata = test_df[1, ], predict.all = TRUE)
I just used the first row of my test data test_df. After this, when I check preds it still gives me only one prediction value, while I want the predictions from all the trees.
How can I get the predictions from all the individual trees?
Thanks in advance!!
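One approach worth trying (a sketch, not tested on your data): caret's predict.train ignores predict.all, so call predict on the underlying randomForest object stored in rfmod$finalModel instead; the randomForest predict method does honour predict.all = TRUE and returns the per-tree predictions under $individual:
# Sketch: bypass caret's predict wrapper and use the randomForest method,
# which supports predict.all. Assumes test_df contains the predictors x, y, z.
tree_preds <- predict(rfmod$finalModel,
                      newdata = test_df[1, , drop = FALSE],
                      predict.all = TRUE)

tree_preds$aggregate   # the usual ensemble prediction
tree_preds$individual  # one prediction per tree (one column per tree)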
In R, I have a logistic regression model as follows:
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result ~ ., data = df,
                     trControl = train_control,
                     method = "glm",
                     family = binomial(link = "logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables whose overall importance is less than n (e.g. 1), so I can automate removing those variables and rerunning the model?
I was unable to get this information from the calculatedVarImp variable by subsetting on the Overall value:
lowVarImp <- subset(calculatedVarImp , importance$Overall <1)
Also, is there a better way of doing variable selection?
Thanks in advance
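In case the subsetting itself is the sticking point: varImp() on a train object keeps the scores in a data frame under $importance, with the variable names as row names, so something along these lines should work (a sketch; the threshold of 1 is just your example value):
# Sketch: pull the importance table and keep the names of low-scoring variables.
imp <- calculatedVarImp$importance
low_vars <- rownames(imp)[imp$Overall < 1]
low_vars  # candidates to drop before refitting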
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It performs a stepwise search, iteratively adding or dropping predictors, and stops at the model with the lowest AIC.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y ~ ., data = train_data,
                     trControl = train_control,
                     method = "glmStepAIC",
                     family = binomial(link = "logit"),
                     na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets, but it may be worth reading up on the downsides of stepwise selection before relying on it.
I currently have this Random Forest model; I'm just seeing how well it predicts whether someone is diabetes positive or diabetes negative.
The model is fitted using the caret workflow.
When looking at variable importance, I was told to use the code
randomForest::importance(model$finalModel)
What is the purpose of $finalModel? What is $finalModel compared to just the original model? Shouldn't it just be the original model passed in as the argument to view variable importance?
Example below:
library(tidyverse)
library(mlbench)
library(caret)
library(car)
library(glmnet)
library(rpart.plot)
library(rpart)
data("PimaIndiansDiabetes2")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples,]
test.data <- PimaIndiansDiabetes2[-training.samples,]
model_rf <- caret::train(
  diabetes ~ .,
  data = train.data,
  method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE)
model_rf
model_rf$bestTune
model_rf$finalModel
# variable importance here
importance(model_rf$finalModel)
From the documentation:
finalModel A fit object using the best parameters
Most of the time with train you pass several candidate values for the hyper-parameters, to find the combination that achieves the best performance (with resampling controlled via trainControl).
Inside model_rf, under finalModel, you will find the model built with those best parameters.
FYI caret also has a function for variable importance plotting: varImp.
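For instance (a small sketch building on the model above), varImp() can be called on the train object directly and its result plotted:
# Sketch: caret's (scaled) importance from the train object, plus a plot.
vi <- varImp(model_rf)
vi
plot(vi, top = 8)  # top = 8 is just an illustrative choice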
I am trying to do cross validation of a linear model in R using cv.lm. I have tried capturing the output from cv.lm in a separate variable using something like:
cvOutput <- cv.lm(.....)
However, I cannot extract the predicted values from every fold as cvOutput seems to have no information about folds. Is there any way of extracting this?
Try this out (I used the Caravan dataset from the ISLR package as an example).
First load the required packages and partition the data:
library(caret)  # createDataPartition, trainControl, train
library(ISLR)   # the Caravan data

df <- Caravan
inTrain <- createDataPartition(df$Purchase,
                               p = 0.8,
                               list = FALSE)
training <- df[ inTrain, ]
testing  <- df[-inTrain, ]
Then you choose the resampling method:
fitControl <- trainControl(method = "cv", number = 10)
Then you can fit your cross-validated model:
fit <- train(Purchase ~ .,
             data = training,
             method = "lm",
             trControl = fitControl)
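To get what the original question actually asked for, the held-out predictions from each fold, you can additionally tell trainControl to keep them and then read them back from the fitted object (a sketch; savePredictions is a standard trainControl argument):
# Sketch: store the held-out predictions from every CV fold.
fitControl <- trainControl(method = "cv",
                           number = 10,
                           savePredictions = "final")

# refit the model with this fitControl, then:
head(fit$pred)                      # obs, pred, rowIndex, Resample (the fold)
split(fit$pred, fit$pred$Resample)  # predictions grouped by fold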
I am using caret train function with the preProcess option:
fit <- train(form,
             data = train,
             preProcess = c("YeoJohnson", "center", "scale", "bagImpute"),
             method = model,
             metric = "ROC",
             tuneLength = tune,
             trControl = fitControl)
This preprocesses the training data. However, when I predict, observations with NAs are omitted even though I have bagImpute as an option. I know there is an na.action parameter on predict.train, but I can't get it to work.
predict.train(model, newdata=test, na.action=???)
Is it correct to assume that the predict function automatically preprocesses the new data because the model was trained using the preProcess option? If so, shouldn't the new data be imputed and processed the same way as the training data? What am I doing wrong?
Thanks for any help.
You would use na.action = na.pass. The problem is, while making a working example, I found a bug that occurs with the formula method for train and imputation. Here is an example without the formula method:
library(caret)
set.seed(1)
training <- twoClassSim(100)
testing <- twoClassSim(100)
testing$Linear05[4] <- NA
fitControl <- trainControl(classProbs = TRUE,
                           summaryFunction = twoClassSummary)
set.seed(2)
fit <- train(x = training[, -ncol(training)],
             y = training$Class,
             preProcess = c("YeoJohnson", "center", "scale", "bagImpute"),
             method = "lda",
             metric = "ROC",
             trControl = fitControl)
predict(fit, testing[1:5, -ncol(testing)], na.action = na.pass)
The bug will be fixed on the next release of the package.
Max