caret: different RMSE on the same data

I think my problem is quite strange. When I use the RMSE metric for selecting the best model with the train function, I obtain a different RMSE value from the one computed by my own function on the same data. Where is the problem? Is my function wrong?
library(caret)
library(car)
library(nnet)

data(oil)
ztest <- fattyAcids[81:96, ]      # hold-out rows
fit  <- list(r1 = 1:80)           # rows used to fit the model
pred <- list(r1 = 81:96)          # rows used to assess the model

ctrl <- trainControl(method = "LGOCV", index = fit, indexOut = pred)

model <- train(Palmitic ~ Stearic + Oleic + Linoleic + Linolenic + Eicosanoic,
               fattyAcids,
               method = "nnet",
               linout = TRUE,
               trace = FALSE,
               maxit = 10000,
               skip = FALSE,
               metric = "RMSE",
               tuneGrid = expand.grid(.size = c(10, 11, 12, 9),
                                      .decay = c(0.005, 0.001, 0.01)),
               trControl = ctrl,
               preProcess = c("range"))
model

forecast <- predict(model, ztest)

# RMSE of a forecast ("Blad" is Polish for "error")
Blad <- function(zmienna, prognoza) {
  RMSE <- sqrt(sum((zmienna - prognoza)^2) / length(zmienna))
  estymatory <- c(RMSE)
  names(estymatory) <- c("RMSE")
  estymatory
}

Blad(ztest$Palmitic, forecast)

The resampled estimates shown in the output of train are calculated using rows 81:96. Once train figures out the right tuning parameter settings, it refits using all the data (1:96). The model from that data is used to make the new predictions.
For this reason, the resampled performance

> getTrainPerf(model)
  TrainRMSE TrainRsquared method
1 0.9230175     0.8364212   nnet

is worse than the error computed on the other predictions:

> Blad(ztest$Palmitic, forecast)
     RMSE
0.3355387
The predictions in forecast come from a model that was fit to those same data points, which is why the error looks better.
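To make the two numbers comparable, a minimal sketch (not part of the original answer; the name honest_fit is just for illustration, and it reuses the split above plus the size/decay values that train selected) is to refit on rows 1:80 only and score the true hold-out rows:

# Sketch: refit on the training rows only, using the tuning values chosen by train,
# then compute the hold-out RMSE on rows 81:96. The result should be in the same
# range as getTrainPerf(model), up to nnet's random initialisation.
honest_fit <- train(Palmitic ~ Stearic + Oleic + Linoleic + Linolenic + Eicosanoic,
                    fattyAcids[1:80, ],
                    method = "nnet",
                    linout = TRUE, trace = FALSE, maxit = 10000, skip = FALSE,
                    tuneGrid = model$bestTune,
                    trControl = trainControl(method = "none"),
                    preProcess = c("range"))
Blad(ztest$Palmitic, predict(honest_fit, ztest))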
Max

Related

Get predictions from each tree in a random forest model in R (using train for training and predict for prediction)

I am using the train function to train a Random Forest model:
fitControl = trainControl(method = "oob")
tuneGrid = expand.grid(.mtry = c(15))
rfmod = train(p ~ x + y + z,
              method = "rf",
              data = train_df,
              tuneGrid = tuneGrid,
              trControl = fitControl,
              importance = TRUE,
              allowParallel = TRUE)
This is simplified code showing the structure of my model, and I am using the training data train_df.
I want to get the prediction from each individual tree. I searched around and tried this:
preds <- predict(rfmod, newdata = test_df[1, ], predict.all = TRUE)
I just used the first row of my test data test_df. After this, when I check preds, it still gives me only one prediction value, while I want a prediction from every tree.
How can I get the predictions from all of the trees?
Thanks in advance!!
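One possible approach (a sketch, not an answer from the original thread): caret's predict method does not pass predict.all through to randomForest, but the fitted forest is stored in rfmod$finalModel, and randomForest's own predict method does accept predict.all:

# Sketch: query the underlying randomForest object directly.
# This assumes no preProcess step was used in train(); otherwise the new data
# would need the same preprocessing applied before calling predict.
preds_all <- predict(rfmod$finalModel, newdata = test_df[1, ], predict.all = TRUE)
preds_all$aggregate    # the usual ensemble prediction
preds_all$individual   # one column per tree (500 by default)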

How to calculate R-squared after using the bagging function to develop CART decision trees?

I am using the following bagging function with ipred to bootstrap the sample 500 times in R in order to develop decision trees:
baggedsample <- bagging(p ~ ., data, nbagg = 500, coob = TRUE,
                        control = list(minbucket = 5))
After this, I would like to know the R-squared.
I notice that if I do the bagging with the caret train function, R-squared is automatically calculated, as follows:
# Specify 10-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)

# CV bagged model
baggedsample <- train(
  p ~ .,
  data,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE
)

# assess results
baggedsample
##     RMSE  Rsquared      MAE
## 36477.25 0.7001783 24059.85
Appreciate any guidance on this issue, thanks.
Since you do not provide any data, I will illustrate using the built-in iris data.
You can simply compute R-squared from the formula.
library(ipred)   # provides bagging()
attach(iris)
BAG = bagging(Sepal.Length ~ ., data = iris)
R2 = 1 - sum((Sepal.Length - predict(BAG))^2) /
         sum((Sepal.Length - mean(Sepal.Length))^2)
R2
[1] 0.824782
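As a side note (not part of the original answer): the value above is an in-sample R-squared. With coob = TRUE, ipred stores the out-of-bag root mean squared error in the err component of the fit, so an out-of-bag R-squared can be approximated along these lines:

# Sketch, assuming BAG_oob$err holds the out-of-bag RMSE (as documented for
# regression bagging with coob = TRUE).
BAG_oob <- bagging(Sepal.Length ~ ., data = iris, coob = TRUE)
R2_oob  <- 1 - BAG_oob$err^2 /
               mean((iris$Sepal.Length - mean(iris$Sepal.Length))^2)
R2_oob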

Pooled Regression Results using mice, caret, and glmnet

Not sure if this is more of a statistics question, but the closest similar problem I could find is here, although I couldn't get it to work for my case.
I am trying to develop a pooled, penalized logistic regression model. I used mice to create a mids object and then fit a model to each imputed dataset using caret repeated cross-validation with elastic net regression (glmnet) to tune the parameters. The fitted object is not of class "mira", but I think I fixed that by changing the object class and supplying the right list items. The major issue is that glmnet does not have an associated vcov method, which is required by pool().
I would like to use penalized regression because of the number of variables and the uncertainty over which ones are the best predictors. My data consist of 4 numeric variables and 9 categorical variables of varying levels, and I anticipate including interactions.
Does anyone know how I might be able to create my own vcov method or otherwise address this issue? I am not sure if this is possible.
Example data and code are enclosed; note that I am not able to share the actual data.
library(mice)
library(caret)

dat <- as.data.frame(list(
  time   = c(4,3,1,1,2,2,3,5,2,4,5,1,4,3,1,1,2,2,3,5,2,4,5,1),
  status = c(1,1,1,0,2,2,0,0,NA,1,2,0,1,1,1,NA,2,2,0,0,1,NA,2,0),
  x      = c(0,2,1,1,NA,NA,0,1,1,2,0,1,0,2,1,1,NA,NA,0,1,1,2,0,1),
  sex    = c("M","M","M","M","F","F","F","F","M","F","F","M",
             "F","M","M","M","F","F","M","F","M","F","M","F")))

imp <- mice(dat, m = 5, seed = 192)

control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        verboseIter = FALSE)

mod <- list(analyses = vector("list", imp$m))
for (i in 1:imp$m) {
  mod$analyses[[i]] <- train(sex ~ .,
                             data = complete(imp, i),
                             method = "glmnet",
                             family = "binomial",
                             trControl = control,
                             tuneLength = 10,
                             metric = "Kappa")
}

obj <- as.mira(mod)
obj <- list(call = mod$analyses[[1]]$call, call1 = imp$call,
            nmis = imp$nmis, analyses = mod$analyses)
oldClass(obj) <- "mira"

pool(obj)
Produces:
Error in pool(obj) : Object has no vcov() method.
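The thread as captured here does not include a fix, so the following is only a hedged workaround sketch: bypass pool() and combine the penalized coefficients manually. finalModel and bestTune are standard components of a caret fit, but the simple averaging step is an illustration, not full Rubin's-rules pooling (without a vcov() method there is no within-imputation variance to combine):

# Sketch: pull the elastic net coefficients from each imputed-data fit at the
# tuning values caret selected, then average them across the m imputations.
coef_list <- lapply(mod$analyses, function(fit) {
  as.matrix(coef(fit$finalModel, s = fit$bestTune$lambda))
})
pooled_coef <- Reduce(`+`, coef_list) / length(coef_list)
pooled_coef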

Obtaining the training error using the caret package in R

I am using the caret package to train a K-Nearest Neighbors model. For this, I am running this code:
Control <- trainControl(method = "cv",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE)
tGrid <- data.frame(k = 1:100)
trainingInfo <- train(Formula, data = trainData, method = "knn",
                      tuneGrid = tGrid, trControl = Control, metric = "ROC")
As you can see, I am interested in obtaining the AUC of the ROC curve. The code works well, but it returns the testing error (which the algorithm uses to tune the k parameter of the model) as the mean of the error across the cross-validation folds. In addition to the testing error, I would like it to return the training error (the mean across the folds of the error obtained on the training data). How can I do that?
Thank you
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.
To do this, you will need to run train again and modify the index and indexOut objects:
library(caret)

set.seed(1)
dat <- twoClassSim(200)

set.seed(2)
folds <- createFolds(dat$Class, returnTrain = TRUE)

Control <- trainControl(method = "cv",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = folds,
                        indexOut = folds)
tGrid <- data.frame(k = 1:100)

set.seed(3)
a_bad_idea <- train(Class ~ ., data = dat,
                    method = "knn",
                    tuneGrid = tGrid,
                    trControl = Control,
                    metric = "ROC")
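As an illustrative follow-up (a sketch on the same simulated data, not part of the original answer), comparing this to an ordinary cross-validation run makes the over-estimation visible:

# Sketch: the same model tuned with honest 10-fold CV, where the assessment
# fold is held out of training; its ROC should be noticeably lower than the
# "training" ROC reported by a_bad_idea.
honest_control <- trainControl(method = "cv",
                               summaryFunction = twoClassSummary,
                               classProbs = TRUE)
set.seed(3)
honest_knn <- train(Class ~ ., data = dat,
                    method = "knn",
                    tuneGrid = tGrid,
                    trControl = honest_control,
                    metric = "ROC")
getTrainPerf(a_bad_idea)   # training-set ROC (optimistic)
getTrainPerf(honest_knn)   # cross-validated ROC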
Max

How to track progress while building a model with the caret package?

I am trying to build a model using the train function from the caret package:
model <- train(training$class ~ ., data = training, method = "nb")
The training set contains about 20K observations, and each observation has over 100 variables. I would like to know whether building a model from that dataset will take hours or days.
How can I estimate the time needed to train the model on these data? How can I track the progress of the training process when using functions from the caret package?
Assuming that you are training the model with an expanded grid of tuning parameters (all combinations of the tuning parameters) and a resampling technique of your choice (cross-validation, bootstrap, etc.), you could set
trainctrl <- trainControl(verboseIter = TRUE)
and pass it to the trControl argument of the train function to track the training progress:
model <- train(training$class ~ ., data = training, method = 'nb', trControl = trainctrl)
This prints the progress to the console at each resampling stage and allows you to gauge how far along the training/parameter tuning is.
To estimate the total running time, you could run the model once to see how long a single fit takes, and then estimate the total time by multiplying according to your resampling scheme and the number of parameter combinations. This can be done by setting trainControl to skip resampling and setting tuneLength to 1:
trainctrl <- trainControl(method = 'none')
model <- train(training$class ~ ., data = training, method = 'nb', trControl = trainctrl, tuneLength = 1)
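A minimal sketch of that multiplication (not from the original answer; the resample and grid counts are placeholders, and the explicit one-row tuneGrid uses the parameter names modelLookup("nb") reports in recent caret versions, since method = "none" expects exactly one parameter combination):

# Sketch: time a single fit with no resampling, then scale up by the number of
# resamples times the number of tuning-parameter combinations.
one_fit <- system.time(
  train(training$class ~ ., data = training, method = "nb",
        trControl = trainControl(method = "none"),
        tuneGrid = data.frame(fL = 0, usekernel = FALSE, adjust = 1))
)["elapsed"]

n_resamples <- 10   # e.g. 10-fold CV (assumption)
n_params    <- 2    # e.g. number of rows in the tuning grid (assumption)
one_fit * n_resamples * n_params   # rough total, plus one final refit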
Hope this helps! :)

Resources