I am using the random forest algorithm to predict a target variable "Y" that has 4 classes.
The following code is used to create the model:
control <- trainControl(method="repeatedcv", number=2, repeats=1, search="random")
seed <- 7
metric <- "Accuracy"
set.seed(seed)
mtry <- sqrt(ncol(train))
model <- train(Target~., data=complete, method="rf", metric=metric, tuneLength=15, trControl=control)
But when I test the trained model on the test dataset, it only gives accuracy close to 50%. Is there any way to increase the accuracy to 70% or above?
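A minimal sketch of one common next step is shown below: tune mtry explicitly over a grid and score the held-out test set rather than the resampled training data. The data frames train and test and the factor column Target are assumptions based on the code above, not the asker's actual objects.
library(caret)
# Tune mtry on the training split only
control  <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
tunegrid <- expand.grid(mtry = c(2, 4, 6, 8))   # candidate mtry values; keep <= ncol(train) - 1
set.seed(7)
rf_fit <- train(Target ~ ., data = train, method = "rf",
                metric = "Accuracy", tuneGrid = tunegrid, trControl = control)
# Score the held-out test set rather than the training data
pred <- predict(rf_fit, newdata = test)
confusionMatrix(pred, test$Target)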
I'm trying to create a boxplot showing the distribution of RMSE over all resamples. The mean of the resample RMSEs equals the model's reported RMSE, so it would be interesting to show how that number is calculated. How can I obtain the predicted RMSE for each of the model's resamples? For example, with 5-fold CV:
Model RMSE: 5
Folds 1, 2, 3, 4, 5 = 5.02, 5.01, 5, 4.99, 4.98
# Load packages
library(mlbench)
library(caret)
# Load data
data(BostonHousing)
#Dividing the data into train and test set
set.seed(1)
sample <- createDataPartition(BostonHousing$medv, p=0.75, list = FALSE)
train <- BostonHousing[sample,]
test <- BostonHousing[-sample,]
control <- trainControl(method='repeatedcv', number=10, repeats=3, savePredictions=TRUE)
metric <- 'RMSE'
# some random model
set.seed(1)
example <- train(medv~., data=train, method='example', metric=metric,
preProc=c('center', 'scale'), trControl=control)
I know one can obtain this for the resampling results on the training data with example$resample.
Is there a similar built-in way to get this for the predictions of each resample?
Appreciate all help, thanks.
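A sketch of one way to do this, using method = "lm" as a stand-in for the placeholder "example" above: with savePredictions, train() keeps every held-out prediction in fit$pred, so the fold-level RMSE can be recomputed by hand and plotted, and it matches what caret already stores in fit$resample.
library(mlbench)
library(caret)
data(BostonHousing)
set.seed(1)
idx      <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
training <- BostonHousing[idx, ]
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = "final")   # keep only the best tune's predictions
set.seed(1)
fit <- train(medv ~ ., data = training, method = "lm", metric = "RMSE",
             preProc = c("center", "scale"), trControl = control)
# Held-out RMSE recomputed for each resample from the saved predictions
by_fold      <- split(fit$pred, fit$pred$Resample)
rmse_by_fold <- sapply(by_fold, function(d) sqrt(mean((d$pred - d$obs)^2)))
boxplot(rmse_by_fold, ylab = "Held-out RMSE per resample")
mean(rmse_by_fold)   # matches the RMSE reported for the model
fit$resample$RMSE    # caret already stores the same per-resample values here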
Is there any way to create multiple random forest models by fine-tuning hyperparameters on the training data, evaluate the test-data performance of all the models, and store the results in a CSV file?
For example: I have one model where mtry is 6 and nodesize is 3, and another model where mtry is 10 and nodesize is 4. What I need to do is test these two models' performance on the test data and store the key model metrics like the confusion matrix, sensitivity, and specificity.
I have tried the following code:
library(randomForest)   # for randomForest()
library(caret)          # for confusionMatrix()

train_performance <- data.frame()   # one row of metrics per model
modellist <- list()

for (mtry in c(6, 11)) {
  for (nodesize in c(2, 3)) {
    fit_model <- randomForest(DV ~ ., data = train_final, mtry = mtry,
                              importance = TRUE, nodesize = nodesize,
                              sampsize = ceiling(0.8 * nrow(train_final)),
                              proximity = TRUE, na.action = na.omit, ntree = 500)
    Key_col <- paste0(mtry, "-", nodesize)
    modellist[[Key_col]] <- fit_model

    # Training-set confusion matrix and derived metrics for this model
    pred_train <- predict(fit_model, train_final)
    cf <- confusionMatrix(pred_train, train_final$DV, mode = 'everything', positive = '1')
    train_performance <- rbind(train_performance,
                               data.frame(TN = cf$table[1], FP = cf$table[2],
                                          FN = cf$table[3], TP = cf$table[4],
                                          accuracy = cf$overall[1],
                                          kappa = cf$overall[2],
                                          sensitivity = cf$byClass[1],
                                          specificity = cf$byClass[2],
                                          key = Key_col,
                                          row.names = NULL))
  }
}
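A sketch of one way to score each stored model on the test data and write the metrics to a CSV file follows; test_final, the factor column DV, the positive class '1', and the TN/FP/FN/TP cell order all follow the naming in the question and are assumptions about the actual objects.
test_performance <- do.call(rbind, lapply(names(modellist), function(key) {
  pred <- predict(modellist[[key]], newdata = test_final)
  cf   <- confusionMatrix(pred, test_final$DV, mode = 'everything', positive = '1')
  data.frame(key         = key,
             TN          = cf$table[1], FP = cf$table[2],
             FN          = cf$table[3], TP = cf$table[4],
             accuracy    = cf$overall["Accuracy"],
             kappa       = cf$overall["Kappa"],
             sensitivity = cf$byClass["Sensitivity"],
             specificity = cf$byClass["Specificity"],
             row.names   = NULL)
}))
write.csv(test_performance, "rf_test_performance.csv", row.names = FALSE)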
Below is a sample approach using the caret package to tune and train your random forest model; it reports the accuracy metrics for all the models:
library(randomForest)
library(mlbench)
library(caret)
# Load Dataset
data(Sonar)
dataset <- Sonar
x <- dataset[,1:60]
y <- dataset[,61]
# Create model with default parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
set.seed(seed)
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)
output:
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.8138384 0.6209924 0.0747572 0.1569159
Tune Using Caret:
Random Search:
One search strategy that we can use is to try random values within a range.
# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(seed)
mtry <- sqrt(ncol(x))
rf_random <- train(Class~., data=dataset, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
plot(rf_random)
output:
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
11 0.8218470 0.6365181 0.09124610 0.1906693
14 0.8140620 0.6215867 0.08475785 0.1750848
17 0.8030231 0.5990734 0.09595988 0.1986971
24 0.8042929 0.6002362 0.09847815 0.2053314
30 0.7933333 0.5798250 0.09110171 0.1879681
34 0.8015873 0.5970248 0.07931664 0.1621170
45 0.7932612 0.5796828 0.09195386 0.1887363
47 0.7903896 0.5738230 0.10325010 0.2123314
49 0.7867532 0.5673879 0.09256912 0.1899197
50 0.7775397 0.5483207 0.10118502 0.2063198
60 0.7790476 0.5513705 0.09810647 0.2005012
Grid Search:
Another search strategy is to define a grid of algorithm parameters to try.
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
set.seed(seed)
tunegrid <- expand.grid(.mtry=c(1:15))
rf_gridsearch <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
plot(rf_gridsearch)
output:
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
1 0.8377273 0.6688712 0.07154794 0.1507990
2 0.8378932 0.6693593 0.07185686 0.1513988
3 0.8314502 0.6564856 0.08191277 0.1700197
4 0.8249567 0.6435956 0.07653933 0.1590840
5 0.8268470 0.6472114 0.06787878 0.1418983
6 0.8298701 0.6537667 0.07968069 0.1654484
7 0.8282035 0.6493708 0.07492042 0.1584772
8 0.8232828 0.6396484 0.07468091 0.1571185
9 0.8268398 0.6476575 0.07355522 0.1529670
10 0.8204906 0.6346991 0.08499469 0.1756645
11 0.8073304 0.6071477 0.09882638 0.2055589
12 0.8184488 0.6299098 0.09038264 0.1884499
13 0.8093795 0.6119327 0.08788302 0.1821910
14 0.8186797 0.6304113 0.08178957 0.1715189
15 0.8168615 0.6265481 0.10074984 0.2091663
There are many other ways to tune your random forest model and store the results; the two above are the most widely used methods.
Moreover, you can also set these parameters manually and then train and tune the model yourself.
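For example, here is a minimal sketch of manual tuning that re-uses the x, dataset, metric, seed, and control objects defined above and varies ntree, a parameter caret does not tune for "rf"; the ntree values are arbitrary.
# Manual tuning: hold mtry fixed through tuneGrid and fit one model per ntree value
modellist <- list()
tunegrid  <- expand.grid(.mtry = sqrt(ncol(x)))
for (ntree in c(500, 1000, 1500)) {
  set.seed(seed)   # same seed so the resampling folds match across models
  fit <- train(Class ~ ., data = dataset, method = "rf", metric = metric,
               tuneGrid = tunegrid, trControl = control, ntree = ntree)
  modellist[[as.character(ntree)]] <- fit
}
# Compare the manually tuned models on their resampled accuracy
results <- resamples(modellist)
summary(results)
dotplot(results)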
I'm using the caret package in R to fit a LASSO regression model. My code runs fine, however I would like to extract the Intercept for the final model so I can build a scoring key using the selected predictors and coefficients.
For example, if "Extraversion" is the variable I am trying to model using survey items, I would like to produce the following scoring key:
Intercept + Survey_Item_1*Slope + Survey_Item_2*Slope + and so on
FWIW, I am able to extract the coefficients for the predictors.
My code for reference:
##Create Training & test set
set.seed(9808)
ind <- sample(0:1, nrow(df), replace=T, prob=c(.75,.25))
train <- df[ind==0,]
test <- df[ind==1,]
ctrl <- trainControl(method = "repeatedcv", number=5, repeats = 5)
##Train Lasso model
fit.lasso <- train(Extraversion ~ ., data=train, method="lasso", preProc=c('scale','center','nzv'), trControl=ctrl)
fit.lasso
predict.enet(fit.lasso$finalModel, type='coefficients', s=fit.lasso$bestTune$fraction, mode='fraction')
##Fit models to test data
lasso_test<- predict(fit.lasso, newdata=test, na.action="na.pass")
postResample(pred = lasso_test, obs = test[,c(1)])
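One possible workaround, sketched under the assumption that switching estimation packages is acceptable: fit the same lasso through method = "glmnet" with alpha = 1, because coef() on a glmnet fit returns an "(Intercept)" row along with the slopes. The lambda grid below is an arbitrary assumption.
library(glmnet)
set.seed(9808)
fit.glmnet <- train(Extraversion ~ ., data=train, method="glmnet",
                    preProc=c('scale','center','nzv'),
                    tuneGrid=expand.grid(alpha=1,               # alpha = 1 -> lasso penalty
                                         lambda=10^seq(-4, 0, length.out=20)),
                    trControl=ctrl)
# Intercept and slopes at the selected lambda; note they are on the
# centered/scaled predictor scale because of preProc
coef(fit.glmnet$finalModel, s=fit.glmnet$bestTune$lambda)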
I'm trying to run a caret method that does not require tuning parameters, such as lda. The example below uses "lvq", which needs 2 parameters (size and k):
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# design the parameter tuning grid
grid <- expand.grid(size=c(5,10,20,50), k=c(1,2,3,4,5))
# train the model
model <- train(Species~., data=iris, method="lvq", trControl=control, tuneGrid=grid)
# summarize the model
print(model)
plot(model)
I tried to work around it by assigning tuneGrid=NULL:
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# design the parameter tuning grid
grid <- expand.grid(size=c(5,10,20,50), k=c(1,2,3,4,5))
# train the model
model <- train(Species~., data=iris, method="lda", trControl=control, tuneGrid=NULL)
# summarize the model
print(model)
plot(model)
But I get the error
There are no tuning parameters for this model
caret contains a number of LDA methods, for example:
method = "lda" involves no tuning parameters.
method = "lda2" allows you to tune dimen (the number of discriminant vectors).
If you want to tune a parameter (and for LDA that can only be the number of discriminant vectors), you must use "lda2". "lda" does not allow tuning, so to run it you must drop the tuneGrid argument. Dropping tuneGrid only skips the tuning step; the cross-validation specified in trControl still runs.
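For example, a minimal sketch of tuning "lda2" on iris; with 3 classes, dimen can only be 1 or 2.
library(caret)
data(iris)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(7)
model_lda2 <- train(Species ~ ., data=iris, method="lda2",
                    trControl=control, tuneGrid=expand.grid(dimen=1:2))
print(model_lda2)
plot(model_lda2)   # accuracy versus number of discriminant vectors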
I'll answer my own question: I think that just removing the tuneGrid argument altogether works fine.
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# design the parameter tuning grid
grid <- expand.grid(size=c(5,10,20,50), k=c(1,2,3,4,5))
# train the model
model <- train(Species~., data=iris, method="lda", trControl=control)
# summarize the model
print(model)
I am using the caret package to train a k-nearest neighbors algorithm. For this, I am running this code:
Control <- trainControl(method="cv", summaryFunction=twoClassSummary, classProb=T)
tGrid=data.frame(k=1:100)
trainingInfo <- train(Formula, data=trainData, method = "knn",tuneGrid=tGrid,
trControl=Control, metric = "ROC")
As you can see, I am interested in obtaining the AUC of the ROC curve. This code works well, but it returns the testing error (which the algorithm uses to tune the k parameter of the model) as the mean of the error over the cross-validation folds. In addition to the testing error, I would also like to obtain the training error (the mean across folds of the error obtained on the training data). How can I do this?
Thank you
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.
To do this, you will need to run train again and modify the index and indexOut objects:
library(caret)
set.seed(1)
dat <- twoClassSim(200)
set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)
Control <- trainControl(method="cv",
summaryFunction=twoClassSummary,
classProb=T,
index = folds,
indexOut = folds)
tGrid=data.frame(k=1:100)
set.seed(3)
a_bad_idea <- train(Class ~ ., data=dat,
method = "knn",
tuneGrid=tGrid,
trControl=Control, metric = "ROC")
Max
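As a follow-up sketch (re-using the dat and tGrid objects from the answer above), the inflated apparent ROC can be put next to an ordinary cross-validated run for comparison:
# Same model with ordinary 10-fold CV, i.e. without forcing index and indexOut
# onto the same rows
honest_control <- trainControl(method="cv",
                               summaryFunction=twoClassSummary,
                               classProbs=TRUE)
set.seed(3)
honest_fit <- train(Class ~ ., data=dat,
                    method = "knn",
                    tuneGrid=tGrid,
                    trControl=honest_control, metric = "ROC")
head(a_bad_idea$results[, c("k", "ROC")])   # apparent (resubstitution) ROC per k
head(honest_fit$results[, c("k", "ROC")])   # honest cross-validated ROC per k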