Fitting models with class probabilities using caret in R?

I'm working on making some predictions with stacked ML algorithms in R, and I have successfully prepared the sub-models (see working code below):
library(caret)          # train(), trainControl(), resamples()
library(caretEnsemble)  # caretList(), caretStack()

trainSet <- read.csv("train.csv")
testSet <- read.csv("test.csv")
trainSet$Survived <- as.factor(trainSet$Survived)
algorithmList <- c('lda', 'rpart', 'glm', 'knn', 'svmRadial')
# create submodels
control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
seed <- 7  # assumed value; the original post uses 'seed' without defining it
set.seed(seed)
models <- caretList(Survived ~ Pclass + Sex + Fare, data=trainSet, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)
but when I actually go to stack the sub-models:
# stack using glm
stackControl <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
set.seed(seed)
stack.glm <- caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)
print(stack.glm)
It gives me the error message:
Error in check_caretList_model_types(list_of_models) :
The following models were fit by caret::train with no class probabilities: lda, rpart, glm, knn, svmRadial.
Please re-fit them with trainControl(classProbs=TRUE)
But, as you can see, I believe I actually did fit them with classProbs=TRUE (see my 'control' variable), so I don't understand why I'm getting this error message. Any ideas?
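A likely cause, worth checking first: with classProbs=TRUE, caret requires the outcome's factor levels to be valid R variable names, and a 0/1-coded Survived column fails that check, so the sub-models can end up without class probabilities despite the trainControl setting. A minimal sketch of the relabeling (assuming the usual 0/1 coding):
# give the outcome levels that are valid R names ("X0"/"X1"),
# which classProbs=TRUE needs when building the probability columns
trainSet$Survived <- factor(make.names(trainSet$Survived))
levels(trainSet$Survived)  # "X0" "X1"
Then re-run caretList() and caretStack() with the same trainControl.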

Related

Decision Trees with Logloss and SMOTE

I am working through some decision trees with the data from the Kaggle Walmart competition and I am running into a couple of errors. I was successful last week running the trees in rpart, but now I am using caret to incorporate log loss and SMOTE for the class imbalance. Below are my code and the respective errors:
library(caret)  # trainControl(), train(), mnLogLoss()

set.seed(1234)
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.8, 0.2))
train <- data[ind==1,]
test <- data[ind==2,]
########################
#Building a new DT with logloss and CV
########################
ctrl <- trainControl(method="cv", number=5, classProbs=TRUE,
summaryFunction=mnLogLoss)
ll_tree <- train(TripType~., data=train, method="rpart", metric="logLoss",
trControl=ctrl)
Error in ctrl$summaryFunction(testOutput, lev, method) :
'data' should have columns consistent with 'lev'
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
cannnot compute class probabilities for regression
###################
#Using SMOTE
###################
ctrl2 <- trainControl(method="cv", number=5, classProbs=TRUE,
summaryFunction=mnLogLoss, sampling = "smote")
smote_tree <- train(TripType~., data=train, trControl=ctrl2, method="rpart")
Error: sampling methods are only implemented for classification problems
Any help would be appreciated as this is my first time trying this.
Thanks
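Both errors likely share one root cause: the warning "cannnot compute class probabilities for regression" means caret is treating TripType as a numeric outcome and fitting a regression, for which neither mnLogLoss nor sampling="smote" applies. A minimal sketch of the probable fix (assuming TripType holds class labels that were read in as numbers):
# make the outcome a factor with valid R level names so caret
# runs classification and can compute class probabilities
data$TripType <- factor(make.names(data$TripType))
# then rebuild train/test from 'data' and re-run both train() calls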

Extract the Intercept from a Caret LASSO Model

I'm using the caret package in R to fit a LASSO regression model. My code runs fine; however, I would like to extract the intercept for the final model so I can build a scoring key using the selected predictors and coefficients.
For example, if "Extraversion" is the variable I am trying to model using survey items, I would like to produce the following scoring key:
Intercept + Survey_Item_1*Slope + Survey_Item_2*Slope + and so on
FWIW, I am able to extract the coefficients for the predictors.
My code for reference:
library(caret)
library(elasticnet)  # supplies predict.enet() for the lasso final model

##Create Training & test set
set.seed(9808)
ind <- sample(0:1, nrow(df), replace=T, prob=c(.75,.25))
train <- df[ind==0,]
test <- df[ind==1,]
ctrl <- trainControl(method = "repeatedcv", number=5, repeats = 5)
##Train Lasso model
fit.lasso <- train(Extraversion ~ ., data=train, method="lasso", preProc=c('scale','center','nzv'), trControl=ctrl)
fit.lasso
predict.enet(fit.lasso$finalModel, type='coefficients', s=fit.lasso$bestTune$fraction, mode='fraction')
##Fit models to test data
lasso_test<- predict(fit.lasso, newdata=test, na.action="na.pass")
postResample(pred = lasso_test, obs = test[,c(1)])
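One workaround, offered as a sketch rather than the definitive answer: method="lasso" fits via elasticnet, which centers the data and does not expose an intercept directly, whereas method="glmnet" returns a final model whose coefficient vector includes the intercept (note the tuning parameters become glmnet's alpha/lambda rather than lasso's fraction):
##Alternative fit whose coefficients include the intercept
fit.glmnet <- train(Extraversion ~ ., data=train, method="glmnet",
                    preProc=c('scale','center','nzv'), trControl=ctrl)
coef(fit.glmnet$finalModel, s=fit.glmnet$bestTune$lambda)  # intercept + slopes
To force a pure lasso rather than a mixed elastic net, fix alpha at 1, e.g. tuneGrid=expand.grid(alpha=1, lambda=seq(0.0001, 0.1, length=20)).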

Caret - Factor issue in multi class classification

I want to perform a multi-class classification in the caret package. Below is a minimal example.
library(caret)
library(randomForest)
x <- data.frame("A"=seq(1,100), "B"=seq(1,100), "C"="class1")
x[,"C"] <- as.character(x[,"C"])
x[1,"C"] <- "class2"
x[2,"C"] <- "class3"
x[3,"C"] <- "class4"
x[4,"C"] <- "class5"
x[5,"C"] <- "class6"
x[6,"C"] <- "class7"
x[7,"C"] <- "class8"
x[8,"C"] <- "class9"
x[9,"C"] <- "class10"
x[10,"C"] <- "class11"
x[11,"C"] <- "class12"
x[,"C"] <- as.factor(x[,"C"])
control <- trainControl(method="repeatedcv", number=10, repeats=1, search="grid")
set.seed(5)
tunegrid <- expand.grid(.mtry=c(1:2))
metric <- "Accuracy"  # assumed value; the original post uses 'metric' without defining it
fit <- train(x=x[,1:2], y=x$C, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(fit)
plot(fit)
When running the code I get an error stating:
1: model fit failed for Fold2.Rep1: mtry=1 Error in randomForest.default(x, y, mtry = param$mtry, ...) :
Can't have empty classes in y.
Related posts suggest that this happens because some factor levels of the response end up with no observations in a resample, which the resampling does not account for. Typically one runs into this problem when there are many classes to predict and few observations.
Is there any workaround to change the caret package such that the missing factors are removed in the resampling methods (e.g., by droplevels())?
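One possible workaround, sketched on the assumption that caret's custom model interface behaves as documented: clone the built-in "rf" specification with getModelInfo() and call droplevels() inside its fit function, so each resample trains only on the classes it actually contains:
# clone caret's "rf" spec and drop empty classes per resample
rf_drop <- getModelInfo("rf", regex=FALSE)[[1]]
rf_drop$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  y <- droplevels(y)  # remove levels with no observations in this fold
  randomForest::randomForest(x, y, mtry=param$mtry, ...)
}
fit <- train(x=x[,1:2], y=x$C, method=rf_drop, metric=metric, tuneGrid=tunegrid, trControl=control)
Bear in mind that a held-out fold can still contain classes the fold's model never saw, so those rows can never be predicted correctly; with single-observation classes like these, merging the rare classes is often the more honest fix.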

Random Forest Accuracy

I am using the random forest algorithm to predict a target variable "Y" that has 4 values.
The syntax below is used to create the model:
control <- trainControl(method="repeatedcv", number=2, repeats=1, search="random")
seed <- 7
metric <- "Accuracy"
set.seed(seed)
mtry <- sqrt(ncol(train))
model <- train(Target~., data=complete, method="rf", metric=metric, tuneLength=15, trControl=control)
But when I test the trained model on the test dataset, it only gives accuracy close to 50%. Is there any way to increase the accuracy to 70% or above?
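With a 4-level target, the chance baseline is roughly 25% (assuming balanced classes), so 50% is already well above chance; beyond that, accuracy is bounded by the signal in the features, but a more thorough tune than number=2 cross-validation is a cheap first step. A sketch (the grid values are assumptions, not from the original post):
grid <- expand.grid(mtry=c(2, 4, 6, 8))  # explicit grid instead of random search
control <- trainControl(method="repeatedcv", number=5, repeats=3)
set.seed(seed)
model <- train(Target~., data=complete, method="rf", metric=metric, tuneGrid=grid, trControl=control)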

extract coefficients within R caret

library(caret)
data(iris)
train_control <- trainControl(method="repeatedcv", number=10, repeats=10)
model <- train(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data=iris, trControl=train_control, method="lm")
I can get the coefficients of the final selected model with model$finalModel$coefficients. Is there any way to get the coefficients for all models?
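The per-resample fits are not kept, but train() stores the resampling indices, so one can refit the same lm on each fold and collect its coefficients; a minimal sketch (assuming your caret version stores the fold row indices in model$control$index, as current versions do):
# refit on each CV fold's training rows; one coefficient column per fold
fold_coefs <- sapply(model$control$index, function(idx)
  coef(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris[idx, ])))
fold_coefs[, 1:3]  # inspect the first few resamples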
