How to stack machine learning models in R

I am new to machine learning and R.
I know that there is an R package called caretEnsemble, which can conveniently stack models in R. However, this package seems to have some problems when dealing with multi-class classification tasks.
For now, I wrote some code to stack the models manually. Here is the example I worked on:
adData = data.frame(diagnosis, predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3 / 4)[[1]]
training = adData[inTrain,]
testing = adData[-inTrain,]
modelFitRF <- train(diagnosis ~ ., data = training, method = "rf")
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm",verbose=F)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda")
predRF <- predict(modelFitRF,newdata=testing)
predGBM <- predict(modelFitGBM, newdata = testing)
prefLDA <- predict(modelFitLDA, newdata = testing)
confusionMatrix(predRF, testing$diagnosis)$overall[1]
confusionMatrix(predGBM, testing$diagnosis)$overall[1]
confusionMatrix(prefLDA, testing$diagnosis)$overall[1]
Now I have three models (modelFitRF, modelFitGBM and modelFitLDA) and three prediction vectors, one per model, computed on the test set.
Then I create a data frame containing these prediction vectors and the original dependent variable from the test set:
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
I then used this data frame as a new training set to build a stacked model:
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
Since stacking usually improves prediction accuracy, I would like to believe this is a valid way to stack the models. However, I have doubts, because predDF is built from the three models' predictions on the test set.
I am not sure whether I should train the stacked model on test-set predictions and then apply it back to the same test set to get the final predictions.
(I am referring to the block below:)
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
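One way to avoid fitting and scoring the stacked model on the same rows is a three-way split: the base models see one part, the stacked model is trained on base-model predictions for a second part, and the third part is used only for the final evaluation. The following is a minimal sketch of that idea, not the caret workflow from the question: it assumes lda from MASS and rpart as stand-in learners and uses iris as a multi-class example; the names base_set, meta_set, test_set and meta_features are hypothetical.

```r
library(MASS)   # lda
library(rpart)  # rpart

set.seed(1)
n <- nrow(iris)
# Three-way split: base learners, meta-learner, final evaluation
part <- sample(rep(c("base", "meta", "test"), length.out = n))
base_set <- iris[part == "base", ]
meta_set <- iris[part == "meta", ]
test_set <- iris[part == "test", ]

# Level-0 models are fitted on base_set only
fit_lda   <- lda(Species ~ ., data = base_set)
fit_rpart <- rpart(Species ~ ., data = base_set, method = "class")

# Helper (hypothetical name): turn base-model predictions into meta-features
meta_features <- function(newdata) {
  data.frame(
    predLDA   = predict(fit_lda, newdata)$class,
    predRPART = predict(fit_rpart, newdata, type = "class")
  )
}

# Level-1 model is trained on predictions for rows the base models never saw
meta_train <- cbind(meta_features(meta_set), Species = meta_set$Species)
fit_stack  <- rpart(Species ~ ., data = meta_train, method = "class")

# The test set is touched once, for evaluation only
stack_pred <- predict(fit_stack, meta_features(test_set), type = "class")
stack_acc  <- mean(stack_pred == test_set$Species)
stack_acc
```

With caret, the same separation could presumably be obtained from out-of-fold predictions saved during resampling, but that variant is not shown here.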


How to calibrate probabilities in R?

I am trying to calibrate the probabilities that I get from the predict function in R.
In my case I have two classes and multiple predictors. I used the iris dataset as an example so that you can reproduce the problem.
my_data <- iris %>% #reducing the data to have two classes only
dplyr::filter((Species =="virginica" | Species == "versicolor") ) %>% dplyr::select(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)
my_data <- droplevels(my_data)
index <- createDataPartition(y=my_data$Species,p=0.6,list=FALSE)
#creating train and test set for machine learning
Train <- my_data[index,]
Test <- my_data[-index,]
#machine learning based on Train data partition with glmnet method
classCtrl <- trainControl(method = "repeatedcv", number=10,repeats=5,classProbs = TRUE,savePredictions = "final")
glmnet_ML <- train(Species~., Train, method= "glmnet", trControl=classCtrl)
#probabilities to assign each row of data to one class or the other on Test
predTestprob <- predict(glmnet_ML,Test,type="prob")
#trying out calibration following "Applied predictive modeling" book from Max Kuhn p266-273
predTrainprob <- predict(glmnet_ML,Train,type="prob")
predTest <- predict(glmnet_ML,Test)
predTestprob <- predict(glmnet_ML,Test,type="prob")
Test$PredProb <- predTestprob[,"versicolor"]
Test$Pred <- predTest
Train$PredProb <- predTrainprob[,"versicolor"]
#logistic regression to calibrate
sigmoidalCal <- glm(relevel(Species, ref= "virginica") ~ PredProb,data = Train,family = binomial)
#predicting calibrated scores
sigmoidProbs <- predict(sigmoidalCal,newdata = Test[,"PredProb", drop = FALSE],type = "response")
Test$CalProb <- sigmoidProbs
#plotting to see if it works
calCurve2 <- calibration(Species ~ PredProb + CalProb, data = Test)
xyplot(calCurve2,auto.key = list(columns = 2))
In my opinion, the plot does not look good, which suggests a mistake in the calibration: the CalProb curve should follow the diagonal, but it does not.
Has anyone done anything similar?
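For comparison, here is a minimal base R sketch of the same sigmoidal calibration idea, with the calibrator fitted on a separate calibration split instead of on the training rows. Everything in it is an assumption made for illustration: the data are simulated, the raw score is made overconfident on purpose, and the names raw_prob, calib and calibrator are hypothetical. It does not diagnose the plot above; with only about 40 test rows, as in the iris example, a calibration curve can look ragged for reasons of sample size alone.

```r
set.seed(42)
n <- 900
x <- rnorm(n)
y <- factor(ifelse(rbinom(n, 1, plogis(2 * x)) == 1, "event", "nonevent"),
            levels = c("nonevent", "event"))
dat  <- data.frame(x, y)
part <- sample(rep(c("train", "calib", "test"), length.out = n))

# Base model fitted on the training rows only
fit <- glm(y ~ x, family = binomial, data = dat[part == "train", ])

# Hypothetical overconfident score: the linear predictor is sharpened on purpose
raw_prob <- function(newdata) plogis(3 * predict(fit, newdata, type = "link"))

calib <- data.frame(y = dat$y[part == "calib"],
                    PredProb = raw_prob(dat[part == "calib", ]))
test  <- data.frame(y = dat$y[part == "test"],
                    PredProb = raw_prob(dat[part == "test", ]))

# Calibrator: logistic regression of the outcome on the raw probability
calibrator <- glm(y ~ PredProb, family = binomial, data = calib)
test$CalProb <- predict(calibrator, newdata = test, type = "response")

# Brier score (lower is better) before and after calibration
is_event <- as.numeric(test$y == "event")
brier <- c(raw = mean((test$PredProb - is_event)^2),
           calibrated = mean((test$CalProb - is_event)^2))
brier
```

The Brier scores are printed only as a rough check; the sketch makes no claim about how large the difference should be.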

Making a Prediction from a qda function in r

I am attempting to build a QDA model in R. My code for the model is below, and the model works (it makes predictions for the training data and produces a working confusion matrix).
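Model3 <- qda(TARGET_FLAG ~ KIDSDRIV + PARENT1 + MSTATUS + CAR_USE + TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + SQRT_TRAVTIME + SQRT_BLUEBOOK + SQRT_INCOME + EDUCATION + JOB, data = train)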
predmodel.train.qda = predict(Model3, data=train)
table(Predicted=predmodel.train.qda$class, TARGET_FLAG=train$TARGET_FLAG)
predmodel.test.qda = predict(Model3, newdata=modtest)
table(Predicted=predmodel.test.qda$class, TARGET_FLAG=modtest$TARGET_FLAG)
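Model3 <- qda(TARGET_FLAG ~ KIDSDRIV + PARENT1 + MSTATUS + CAR_USE + TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + SQRT_TRAVTIME + SQRT_BLUEBOOK + SQRT_INCOME + EDUCATION + JOB, data = data)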
Model3Prediction <- predict(Model3, type = "response")
confusionMatrix(data$Model3Prediction, data$TARGET_FLAG)
This produces the desired result, but when I apply the model to the test data I get the following error:
"Error in $<*tmp*, P_TARGET_FLAG, value = list(class = c(1L, :
replacement has 2 rows, data has 2141"
test$P_TARGET_FLAG <- predict(Model3, newdata = test, type = "response")
How do I get the model to predict the value of my test data?
I hope you are already splitting your data into train and test sets:
trainset = (data)
test = Data[!trainset,]
Once that is done, try the code below.
Model3 <- qda(TARGET_FLAG ~ KIDSDRIV + PARENT1 + MSTATUS + CAR_USE + TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + SQRT_TRAVTIME + SQRT_BLUEBOOK + SQRT_INCOME + EDUCATION + JOB, data = data, subset = trainset)
qda.preds <- predict(Model3, newdata = test)
cm.f <- table(test$predictor, qda.preds$class)
cm.f
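The "replacement has 2 rows" message is consistent with predict returning a two-element list for a qda fit (class and posterior) rather than a vector, so the predicted labels have to be taken from the class element before they are stored in a column. A minimal sketch on iris, since the original data are not available; the names train_df, test_df and P_Species are hypothetical:

```r
library(MASS)

set.seed(7)
in_train <- sample(nrow(iris), 100)
train_df <- iris[in_train, ]
test_df  <- iris[-in_train, ]

fit_qda <- qda(Species ~ ., data = train_df)

# predict() on a qda fit returns a list, not a vector
pred <- predict(fit_qda, newdata = test_df)
names(pred)   # "class" "posterior"

# Store only the predicted labels as a new column
test_df$P_Species <- pred$class
table(Predicted = test_df$P_Species, Actual = test_df$Species)
```

The posterior element holds one column of class probabilities per level, in case probabilities are needed instead of labels.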

Retrain best model on full dataset in R

I have two models to select from, and I choose one of the two using some criterion. (The example below is only illustrative; I know it doesn't make much sense.)
sample_dat= sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
test_res = sapply(models, function(x) accuracy(predict(x, test, type = "response"), test$dist)[2]) #Getting the RMSE for each model
best_model = models[which.min(test_res)]
How can I retrain the best model using the full dataset (train + test)? I checked the update and update.formula functions but these don't seem to be updating the data part.
update(best_model[[1]],data = rbind(train,test))
You do not want to change the formula, since that is the best model, but rather update the data.
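A short check of that suggestion, as a sketch: update re-evaluates the stored call with the new data argument, so the formula stays the same and only the number of observations changes. Here a plain lm stands in for whichever model was selected.

```r
set.seed(3)
sample_dat <- sample(seq_len(nrow(cars)), 5)
train <- cars[-sample_dat, ]
test  <- cars[sample_dat, ]

# Stand-in for the selected model, fitted on the training rows
best_model <- lm(dist ~ speed, data = train)

# Refit the same specification on train + test
full_model <- update(best_model, data = rbind(train, test))

nobs(best_model)     # 45
nobs(full_model)     # 50
formula(full_model)  # unchanged: dist ~ speed
```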
In base R, following your own logic, first create a list mirroring the models list:
sample_dat = sample(1:nrow(cars), 5)
train = cars[-sample_dat, ]
test = cars[sample_dat, ]
models = list(lm(dist ~ speed, train), glm(dist ~ speed, train, family = "poisson"))
# Unevaluated calls that refit each candidate on the full dataset, in the same order as models
model_application = list(quote(lm(dist ~ speed, cars)),
quote(glm(dist ~ speed, cars, family = "poisson")))
# Store a function to calculate the RMSE: rmse => function
rmse <- function(actual_vec, pred_vec){sqrt(mean((pred_vec - actual_vec)**2))}
# Getting the RMSE for each model on the test rows: numeric vector
test_res = sapply(models, function(x) rmse(test$dist, predict(x, newdata = test, type = "response")))
best_model = models[[which.min(test_res)]]
# Refit the winning specification on the full dataset
applied_model <- eval(model_application[[which.min(test_res)]])

How to make a new prediction using rfcv in R

I used the randomForest (RF) package in R to run random forest cross-validation on protein data with the rfcv function.
How can I make predictions for new protein data using the object returned by rfcv?
rfcv cross-validates the model against some data.
To predict values for other data, you need to use the predict function.
Given a forest rf and some new data newdata, call
predict(rf, newdata)
The detailed docs give this as an example:
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])
table(observed = iris[ind==2, "Species"], predicted = iris.pred)
## Get prediction for all trees.
predict(iris.rf, iris[ind == 2,], predict.all = TRUE)

Simple Way to Combine Predictions from Multiple Models for Subset Data in R

I would like to build separate models for the different segments of my data. I have built the models like so:
log1 <- glm(y ~ ., family = "binomial", data = train, subset = x1==0)
log2 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2<10)
log3 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2>=10)
If I run the predictions on the training data, R remembers the subsets, and each prediction vector has the length of its respective subset.
However, if I run the predictions on the testing data, each prediction vector has the length of the whole dataset, not that of the subset.
My question is whether there is a simpler way than first subsetting the testing data, running the predictions on each subset, concatenating the predictions, rbinding the subset data, and appending the concatenated predictions, like this:
T1 <- subset(Test, x1==0)
T2 <- subset(Test, x1==1 & x2<10)
T3 <- subset(Test, x1==1 & x2>=10)
log1pred <- predict(log1, newdata = T1, type = "response")
log2pred <- predict(log2, newdata = T2, type = "response")
log3pred <- predict(log3, newdata = T3, type = "response")
allpred <- c(log1pred, log2pred, log3pred)
TAll <- rbind(T1, T2, T3)
TAll$allpred <- allpred
I'd like to think I am being stupid and there is an easier way to accomplish this: many models on small subsets of the data. How can I combine them to get the predictions on the full testing data?
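First, here's some sample data:
train <- data.frame(x1=sample(0:1, 100, replace=T),
x2=rpois(100, 10), # assumed: the x2 column is missing from this listing
y=sample(0:1, 100, replace=T))
test <- data.frame(x1=sample(0:1, 10, replace=T),
x2=rpois(10, 10)) # assumed, as above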
Now we can fit the models. Here I place them in a list to make it easier to keep them together, and I also remove x1 from the model since it will be fixed for each subset:
fits <- list(
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==0),
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==1 & x2<10),
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==1 & x2>=10)
)
Now, for the test data, I create an indicator which specifies which group each observation falls into. I do this by looking at the subset= parameter of each of the calls and evaluating those conditions in the test data.
whichsubset <- as.vector(sapply(fits, function(x) {
subsetparam <- x$call$subset
eval(subsetparam, test)
})%*% matrix(1:length(fits), ncol=1))
You'll want to make sure your groups are mutually exclusive because this code does not check. Then you can use the indicator with a split/unsplit strategy for making your predictions:
unsplit(
Map(function(a,b) predict(a,b),
fits, split(test, whichsubset)
), whichsubset
)
An even easier strategy would have been to create the segregating factor in the first place. This would make the model fitting easier as well.
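That last suggestion could look roughly like the sketch below. It is only an illustration under assumed data: make_data and segment_of are hypothetical helpers, x3 is an invented extra predictor, and the segment labels a, b and c stand for the three conditions from the question. Defining the factor with fixed levels keeps the fitted models and the test splits in the same order.

```r
set.seed(10)
# Hypothetical sample data with the columns used in the question plus x3
make_data <- function(n) {
  data.frame(x1 = sample(0:1, n, replace = TRUE),
             x2 = rpois(n, 10),
             x3 = rnorm(n),
             y  = sample(0:1, n, replace = TRUE))
}
train <- make_data(300)
test  <- make_data(60)

# One segmenting factor, defined once and applied to both data sets
segment_of <- function(d) {
  factor(ifelse(d$x1 == 0, "a", ifelse(d$x2 < 10, "b", "c")),
         levels = c("a", "b", "c"))
}
train$seg <- segment_of(train)
test$seg  <- segment_of(test)

# Fit one logistic regression per segment
fits <- lapply(split(train, train$seg),
               function(d) glm(y ~ x2 + x3, family = binomial, data = d))

# Predict per segment, then restore the original row order
preds <- Map(function(fit, d) predict(fit, newdata = d, type = "response"),
             fits, split(test, test$seg))
test$allpred <- unsplit(preds, test$seg)
head(test)
```

Because split and unsplit use the same factor, the predictions land back on the rows they belong to without any rbind or manual reordering.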
