caret::train: specify training data parameters - r

I am designing a neural network model that estimates the van Genuchten water retention parameters (theta_r, theta_s, alpha, n) from limited to more extended input data such as texture, bulk density, and one or two measured water retention points. While investigating neural networks in R I found the RSNNS package, and I used it to create and train multiple multi-layer perceptrons (MLPs), tuning the number of hidden units and the learning rate. The overall performance, characterized by training and testing RMSEs, is really poor and essentially random. I used log-transformed values of the alpha and n parameters to reduce bias and account for their approximately lognormal distributions, but this does not help much. I was advised to work with the nnet and caret packages, but I've had trouble adapting my code and I don't know what I'm doing wrong. Any suggestions?
#input dataset
basic <- read.table(url("https://dl.dropboxusercontent.com/s/m8qe4k5swz1m3ij/basic.txt?dl=1&token_hash=AAH6Z3d6fWTLoQZYi04Ys72sdufdERE5gm4v7eF0cgMlkQ"), header=T, sep=" ")
#output dataset
fitted <- read.table(url("https://dl.dropboxusercontent.com/s/rjx745ej80osbbu/fitted.txt?dl=1&token_hash=AAHP1zcPQyw4uSe8rw8swVm3Buqe3TP7I1j-4_SOeeUTvw"), header=T, sep=" ")
# Use log-transformed values of alpha and n output parameters
fitted$alpha <- log(fitted$alpha)
fitted$n <- log(fitted$n)
#Fit model with caret package
library(caret)
model <- train(x = basic, y = fitted, method = 'nnet', linout = TRUE, trace = FALSE,
               # Grid of tuning parameters to try:
               tuneGrid = expand.grid(.size = c(1, 5, 10), .decay = c(0, 0.001, 0.1)))

caret is just a wrapper around the algorithms it calls, so you can specify any parameter of the underlying algorithm even if it is not an option in caret's tuning grid. This is accomplished via the "..." in caret's train() function, which basically means you can pass any extra parameters through to the method you are calling. I'm not sure which parameters you want to adjust in your nnet call (and I'm getting errors accessing your Dropbox data), so here is a trivial example passing in specific values for maxit and Hess:
> library(caret)
> m1 <- train(Species ~ ., data = iris, method = 'nnet', linout = TRUE, trace = FALSE, trControl = trainControl("cv"))
> # this time pass in values for maxit and Hess
> m2 <- train(Species ~ ., data = iris, method = 'nnet', linout = TRUE, trace = FALSE, trControl = trainControl("cv"), maxit = 10, Hess = TRUE)
> m1$finalModel$call
nnet.formula(formula = modFormula, data = data, size = tuneValue$.size,
    decay = tuneValue$.decay, linout = TRUE, trace = FALSE)
> m2$finalModel$call
nnet.formula(formula = modFormula, data = data, size = tuneValue$.size,
    decay = tuneValue$.decay, linout = TRUE, trace = FALSE, maxit = 10,
    Hess = ..4)
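Applied to the original question, the same "..." mechanism lets you pass extra nnet arguments such as maxit through train() while still tuning size and decay. A minimal sketch, assuming the basic and fitted data frames loaded above and fitting one response at a time, since train() expects y to be a single numeric vector (maxit = 500 is just an illustrative value):
library(caret)
# one model per van Genuchten parameter; log(alpha) (transformed above) used as an example
model_alpha <- train(x = basic, y = fitted$alpha,
                     method = "nnet", linout = TRUE, trace = FALSE,
                     maxit = 500,                     # extra nnet argument passed via '...'
                     trControl = trainControl(method = "cv", number = 10),
                     tuneGrid = expand.grid(size = c(1, 5, 10),
                                            decay = c(0, 0.001, 0.1)))
model_alpha$results   # cross-validated RMSE for each size/decay combination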

Related

Rpart vs Caret Cross Validation for Complexity Parameter

I have a few questions about the difference between rpart and caret (using rpart):
When using rpart to fit a decision tree, calling dt$cptable displays a table of complexity parameters and their associated cross-validation errors. When pruning a tree, we would want to select the CP with the lowest cross-validation error. How are these cross-validation errors calculated? From reading rpart's vignette, it seems that rpart does the following:
a) Fits the full tree based on the user-specified parameters. As the tree is being built, the algorithm calculates the complexity parameter at each split.
b) The algorithm then splits the data into k folds and, for each CP, essentially performs cross-validation using these folds. It then calculates the average error across all of the folds to get the 'xerror' output we see in dt$cptable.
If we were to use caret with cross-validation to find the optimal tree, how does it run? Is it splitting the dataset into k folds, then calling rpart, with each call of rpart doing the same thing described in steps (a) and (b) above? In other words, is it using cross-validation within cross-validation, whereas rpart uses cross-validation just once?
Below is some code, even though I'm asking more about how the algorithm functions, maybe it will be useful:
library(rpart)
library(rpart.plot)
library(caret)
set.seed(100)
data.class <- data.termlife[, 2:ncol(data.termlife)]
data.class$TERM_FLAG <- as.factor(data.class$TERM_FLAG)
train.indices <- createDataPartition(data.class$TERM_FLAG, p = .8, list = FALSE)
data.class.t <- data.class[train.indices, ]
data.class.v <- data.class[-train.indices, ]
#Using Rpart
rpart.ctrl <- rpart.control(minsplit = 5, minbucket = 5, cp = .01)
f <- as.formula(paste0("TERM_FLAG ~ ", paste0(names(data.class.t)[2:9], collapse = "+")))
dt <- rpart(formula = f, data = data.class.t, control = rpart.ctrl, parms = list(split = "gini"))
cp.best.rpart <- dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"]
#Using Caret
train.ctrl <- trainControl(method = "cv", number = 10)
tGrid <- expand.grid(cp = seq(0, .02, .0001))
dt.caret <- train(form = f, data = data.class.t, method = "rpart", metric = "Accuracy", trControl = train.ctrl, tuneGrid = tGrid)
cp.best.caret <- dt.caret$bestTune$cp
print(paste("Rpart's best CP: ", cp.best.rpart))
print(paste("Caret's best CP: ", cp.best.caret))
[1] "Rpart's best CP: 0.0194444444444444"
[1] "Caret's best CP: 0.02"
The results are very similar, so when would you ever want to use caret with rpart? Thank you!
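One practical difference in the code above is that the rpart call uses rpart.control(minsplit = 5, minbucket = 5) while the caret call does not, so the two searches are not growing identical trees. If it helps the comparison, the same control settings can be passed through train()'s "..." (caret manages cp itself from the tuning grid); a minimal sketch reusing the objects defined above:
library(caret)
library(rpart)
# pass the same rpart.control through caret so both approaches grow trees
# with identical minsplit/minbucket settings (cp still comes from tuneGrid)
dt.caret2 <- train(form = f, data = data.class.t, method = "rpart",
                   metric = "Accuracy",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneGrid = expand.grid(cp = seq(0, .02, .0001)),
                   control = rpart.control(minsplit = 5, minbucket = 5))
dt.caret2$bestTune$cp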

Pooled Regression Results using mice, caret, and glmnet

Not sure if this is more of a statistics question, but the closest similar problem I could find is here, although I couldn't get it to work for my case.
I am trying to develop a pooled, penalized logistic regression model. I used mice to create a mids object and then fit a model to each imputed dataset using caret's repeated cross-validation with elastic net regression (glmnet) to tune parameters. The fitted object is not of class "mira", but I think I fixed that by rebuilding the object with the right list items and setting its class. The major issue is that glmnet does not have an associated vcov method, which is required by pool().
I would like to use penalized regression because of the number of variables and uncertainty over which ones are the best predictors. My data consist of four numeric variables and nine categorical variables of varying levels, and I anticipate including interactions.
Does anyone know how I might be able to create my own vcov method or otherwise address this issue? I am not sure if this is possible.
Example data and code are enclosed, noting that I am not able to share the actual data.
library(mice)
library(caret)
dat <- as.data.frame(list(
  time   = c(4,3,1,1,2,2,3,5,2,4,5,1,4,3,1,1,2,2,3,5,2,4,5,1),
  status = c(1,1,1,0,2,2,0,0,NA,1,2,0,1,1,1,NA,2,2,0,0,1,NA,2,0),
  x      = c(0,2,1,1,NA,NA,0,1,1,2,0,1,0,2,1,1,NA,NA,0,1,1,2,0,1),
  sex    = c("M","M","M","M","F","F","F","F","M","F","F","M",
             "F","M","M","M","F","F","M","F","M","F","M","F")))
imp <- mice(dat, m = 5, seed = 192)
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        verboseIter = FALSE)
mod <- list(analyses = vector("list", imp$m))
for (i in 1:imp$m) {
  mod$analyses[[i]] <- train(sex ~ .,
                             data = complete(imp, i),
                             method = "glmnet",
                             family = "binomial",
                             trControl = control,
                             tuneLength = 10,
                             metric = "Kappa")
}
obj <- as.mira(mod)
obj <- list(call = mod$analyses[[1]]$call, call1 = imp$call,
            nmis = imp$nmis, analyses = mod$analyses)
oldClass(obj) <- "mira"
pool(obj)
Produces:
Error in pool(obj) : Object has no vcov() method.
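Since glmnet never produces a covariance matrix, Rubin's rules cannot be applied through pool() directly. One rough workaround (a sketch only; it pools the point estimates and ignores the between-imputation variance) is to extract the coefficients at each fit's selected lambda and average them across the imputed datasets:
# coefficients at the tuned lambda from each caret/glmnet fit
coef_list <- lapply(mod$analyses, function(fit) {
  as.matrix(coef(fit$finalModel, s = fit$bestTune$lambda))
})
# average across the m imputations (point estimates only, no pooled variance)
pooled_coefs <- Reduce(`+`, coef_list) / length(coef_list)
pooled_coefs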

Calculate Prediction Intervals of a predicted value using Caret package of R

I used different neural network packages within the caret package for my predictions. The code used with the nnet package is:
library(caret)
# training model using nnet method
data <- na.omit(data)
xtrain <- data[,c("temperature","prevday1","prevday2","prev_instant1","prev_instant2","prev_2_hour")]
ytrain <- data$power
train_model <- train(x = xtrain, y = ytrain, method = "nnet", linout=TRUE, na.action = na.exclude,trace=FALSE)
# prediction using training model created
pred_ob <- predict(train_model, newdata=dframe,type="raw")
The predict function simply returns the predicted values. But I also need prediction intervals (2-sigma). Searching around, I found a relevant answer at a stackoverflow link, but it does not give the result I need. The solution suggests using the finalModel variable as
predict(train_model$finalModel, newdata = dframe, interval = "confidence", type = "raw")
Is there any other way to calculate prediction intervals? The training data used is the dput() of my previous question at stackoverflow link and the dput() of my prediction dataframe (test data) is
dframe <- structure(list(temperature = 27, prevday1 = 1607.69296666667,
prevday2 = 1766.18103333333, prev_instant1 = 1717.19306666667,
prev_instant2 = 1577.168915, prev_2_hour = 1370.14983583333), .Names = c("temperature",
"prevday1", "prevday2", "prev_instant1", "prev_instant2", "prev_2_hour"
), class = "data.frame", row.names = c(NA, -1L))
UPDATE
I used the nnetpredint package as suggested at the link. To my surprise it results in an error, which I find difficult to debug. Here is my updated code so far:
library(nnetpredint)
nnetPredInt(train_model, xTrain = xtrain, yTrain = ytrain,newData = dframe)
It results in the following error:
Error: Number of observations for xTrain, yTrain, yFit are not the same
[1] 0
I can check that xtrain, ytrain and dframe have the correct dimensions, but I have no idea about yFit. According to the examples in the nnetpredint vignette, I shouldn't need it.
caret doesn't generate prediction intervals; that relies on the individual package. If that package cannot do it, then neither can the resulting train objects. I agree that nnetPredInt is the appropriate way to go.
Two other notes:
you most likely should center and scale your data if you have not already.
using the finalModel object is somewhat dangerous since it has no idea what was done to the data (e.g. dummy variables, centering and scaling, or other preprocessing methods) before it was created.
Max
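Picking up the centering/scaling note: because nnetPredInt works on the raw model object and raw matrices, any transformation has to be applied identically to the training data and the new data by hand. A minimal sketch, assuming the xtrain, ytrain and dframe objects from the question (whether the interval calculation supports linear-output networks is discussed in the package author's answer below):
library(caret)
library(nnetpredint)
# center/scale the training predictors and keep the parameters
xtrain_sc <- scale(xtrain)
# apply exactly the same centering/scaling to the new data
dframe_sc <- scale(dframe,
                   center = attr(xtrain_sc, "scaled:center"),
                   scale  = attr(xtrain_sc, "scaled:scale"))
# refit on the scaled predictors, then compute intervals in the same scaled space
train_model_sc <- train(x = xtrain_sc, y = ytrain, method = "nnet",
                        linout = TRUE, trace = FALSE)
nnetPredInt(train_model_sc$finalModel, xTrain = xtrain_sc, yTrain = ytrain,
            newData = dframe_sc)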
Thanks for your question. A simple answer to your problem is: right now the nnetPredInt function only supports the S3 objects "nnet", "nn" and "rsnns", produced by the corresponding neural network packages, whereas the train function in the caret package returns a "train" object. That's why nnetPredInt doesn't find the yFit vector (the fitted values of the training data) in your train_model.
1. A quick way to use the model from the caret package:
Get the finalModel result from the 'train' object:
nnetObj = train_model$finalModel # return the 'nnet' model which the caret package has found.
yPredInt = nnetPredInt(nnetObj, xTrain = xtrain, yTrain = ytrain,newData = dframe)
For example, use the iris dataset and the 'nnet' method from the caret package for regression prediction.
library(caret)
library(nnetpredint)
# Setosa 0 and Versicolor 1
ird <- data.frame(rbind(iris3[,,1], iris3[,,2]), species = c(rep(0, 50), rep(1, 50)))
samp = sample(1:100, 80)
xtrain = ird[samp,][1:4]
ytrain = ird[samp,]$species
# Training
train_model <- train(x = xtrain, y = ytrain, method = "nnet", linout = FALSE, na.action = na.exclude,trace=FALSE)
class(train_model) # [1] "train"
nnetObj = train_model$finalModel
class(nnetObj) # [1] "nnet.formula" "nnet"
# Constructing Prediction Interval
xtest = ird[-samp,][1:4]
ytest = ird[-samp,]$species
yPredInt = nnetPredInt(nnetObj, xTrain = xtrain, yTrain = ytrain,newData = xtest)
# Compare Results: ytest and yPredInt
ytest
yPredInt
2. The hard way
Use the generic nnetPredInt function and pass all the neural-network-specific parameters to it:
nnetPredInt(object = NULL, xTrain, yTrain, yFit, node, wts, newData,alpha = 0.05 , lambda = 0.5, funName = 'sigmoid', ...)
xTrain # Training Dataset
yTrain # Training Target Value
yFit # Fitted Value of the training data
node # Structure of your network, like c(4,5,5,1)
wts # Specific order of weights parameters found by your neural network
newData # New Data for prediction
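For instance, with a plain nnet fit those pieces can be read straight off the model object; a sketch following the generic signature above, assuming the xtrain, ytrain and xtest objects from the iris example:
library(nnet)
library(nnetpredint)
fit <- nnet(x = xtrain, y = ytrain, size = 5, trace = FALSE)
yPredInt <- nnetPredInt(object = NULL,
                        xTrain = xtrain,
                        yTrain = ytrain,
                        yFit   = fit$fitted.values,  # fitted values of the training data
                        node   = fit$n,              # layer structure, e.g. c(4, 5, 1)
                        wts    = fit$wts,            # weights in nnet's ordering
                        newData = xtest)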
Tips:
Right now the nnetpredint package only supports standard multilayer neural network regression with an activated (sigmoid) output, not a linear output.
It will support more types of models in the future.
You can use the nnetPredInt function {package:nnetpredint}. Check out the function's help page here
If you are open to writing your own implementation, there is another option. You can get prediction intervals from a trained net using the same implementation you would write for standard non-linear regression (assuming back-propagation was used to do the estimation).
This paper goes through the methodology and is fairly straightforward: http://www.cis.upenn.edu/~ungar/Datamining/Publications/yale.pdf
There are, as with everything, some cons to this approach (outlined in the paper), but it is definitely worth knowing about as an option.

Pre-Processing Data in Caret and Making Predictions on an Unknown Data Set

I am using the caret package's train function to fit a model and then predict() to generate predictions for an unknown data set (on which I then get feedback, so I know the quality of my predictions). I'm having problems, and I'm convinced it has to do with preprocessing the unknown data.
Briefly and simply, this is what I'm doing:
Pre-Process Training Data:
preproc = preProcess(train_num,method = c("center", "scale"))
train_standardized <- predict(preproc, train_num)
Train the Model:
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = c(100, 500),
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
train.boost = train(x = train_standardized[, -length(train_standardized)],
                    y = train_standardized$response,
                    method = "gbm",
                    metric = "ROC",
                    maximize = FALSE,
                    tuneGrid = gbmGrid,
                    trControl = trainControl(method = "cv",
                                             number = 5,
                                             classProbs = TRUE,
                                             verboseIter = TRUE,
                                             summaryFunction = twoClassSummary,
                                             savePredictions = TRUE))
Prepare unknown data for predictions:
...
unknown_standardized <- predict(preproc, unknown_num)
...
Make the actual prediction on the unknown data:
preds <- predict(train.boost,newdata=unknown_standardized,type="prob")
Note that the "preproc" object is the same one produced from the training set and used to create the centered/scaled data on which the model was trained.
When I get my evaluation back on the unknown data, it is substantially worse than what the training set suggested (ROC estimated from the training data via cross-validation is about .83; ROC on the unknown data, as reported by the evaluating party, is about .70).
Do I have the process right? What am I doing wrong?
Thanks in advance.
In one sense, you are not doing anything wrong at all.
A predictor is likely to do better on the training sample because it has used that data to build the model.
The whole point of scoring against held-out data is to see how well the model generalizes. It will "overfit" the training data to a greater or lesser extent and therefore do somewhat worse on new data.
At least once you have your score against new data, you know the true accuracy of the model. If that accuracy is sufficient for your purposes, then the model is usable and (because you have done the train/test exercise) robust to new data.
Now, it is possible that the model could be better if it were trained on a wider variety of data. So to increase real accuracy, it might be worth using cross-validation to train it on multiple slices of the data, i.e. k-fold cross-validation. caret has a nice facility for that: http://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/
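On the preprocessing side, an alternative to calling preProcess() yourself is to hand the centering/scaling to train() via its preProcess argument; the resulting train object then re-applies the same transformation automatically whenever predict() is called on new data. A minimal sketch, assuming the objects from the question (train_num holding the raw predictors plus a factor column response, unknown_num the raw unseen data, gbmGrid as defined above):
library(caret)
# let train() own the preprocessing so it is re-applied inside predict()
train.boost2 <- train(x = train_num[, setdiff(names(train_num), "response")],
                      y = train_num$response,
                      method = "gbm",
                      metric = "ROC",
                      preProcess = c("center", "scale"),
                      tuneGrid = gbmGrid,
                      trControl = trainControl(method = "cv", number = 5,
                                               classProbs = TRUE,
                                               summaryFunction = twoClassSummary))
# new data goes in raw; centering/scaling happens internally
preds2 <- predict(train.boost2, newdata = unknown_num, type = "prob")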

Plot learning curves with caret package and R

I would like to study the optimal tradeoff between bias and variance for model tuning. I'm using caret for R, which lets me plot the performance metric (AUC, accuracy, ...) against the model's hyperparameters (mtry, lambda, etc.) and automatically chooses the maximum. This typically returns a good model, but if I want to dig further and choose a different bias/variance tradeoff I need a learning curve, not a performance curve.
For the sake of simplicity, let's say my model is a random forest, which has just one hyperparameter, 'mtry'.
I would like to plot the learning curves of both training and test sets. Something like this:
(red curve is the test set)
On the y axis I put an error metric (number of misclassified examples or something like that); on the x axis 'mtry' or alternatively the training set size.
Questions:
Does caret have the functionality to iteratively train models on training set folds of different sizes? If I have to code this by hand, how can I do that?
If I want to put the hyperparameter on the x-axis, I need all the models trained by caret::train, not just the final one (the one with maximum performance obtained after CV). Are these "discarded" models still available after train()?
caret will iteratively test lots of CV models for you if you set up the trainControl() function and specify the parameters (e.g. mtry) in a tuneGrid(). Both of these are then passed as control options to the train() function. The specifics of the tuneGrid parameters (e.g. mtry, ntree) differ for each model type.
Yes, the final trainFit model will contain the error rate (however you specified it) for all folds of your CV.
So you could specify, for example, a 10-fold CV times a grid with 10 values of mtry, which would be 100 iterations. You might want to go get a cup of tea or possibly lunch.
If this sounds complicated... there is a very good example here - caret being one of the best documented packages around.
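A minimal sketch of that setup, using hypothetical names (a data frame df with a factor outcome y), showing where the per-fold and per-mtry results end up:
library(caret)
set.seed(42)
ctrl <- trainControl(method = "cv", number = 10, savePredictions = "all")
grid <- expand.grid(mtry = 1:10)        # 10 candidate values of mtry
rf_fit <- train(y ~ ., data = df, method = "rf",
                trControl = ctrl, tuneGrid = grid)
rf_fit$results    # mean resampled performance for every mtry value
rf_fit$resample   # per-fold results for the selected mtry
plot(rf_fit)      # performance over the tuning grid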
Here's my code showing how I approached plotting a learning curve in R while using the caret package to train the model. I use the Motor Trend Car Road Tests data (mtcars) for illustrative purposes. To begin, I randomize and split the mtcars dataset into training and test sets: 21 records for training and 13 records for the test set. The response feature is mpg in this example.
library(caret)

# set seed for reproducibility
set.seed(7)
# randomize mtcars
mtcars <- mtcars[sample(nrow(mtcars)), ]
# split mtcars data into training and test sets
mtcarsIndex <- createDataPartition(mtcars$mpg, p = .625, list = F)
mtcarsTrain <- mtcars[mtcarsIndex, ]
mtcarsTest <- mtcars[-mtcarsIndex, ]
# create empty data frame
learnCurve <- data.frame(m = integer(21),
                         trainRMSE = integer(21),
                         cvRMSE = integer(21))
# test data response feature
testY <- mtcarsTest$mpg
# Run algorithms using 10-fold cross validation with 3 repeats
trainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metric <- "RMSE"
# loop over training set sizes
for (i in 3:21) {
    learnCurve$m[i] <- i
    # train learning algorithm with size i
    fit.lm <- train(mpg ~ ., data = mtcarsTrain[1:i, ], method = "lm", metric = metric,
                    preProc = c("center", "scale"), trControl = trainControl)
    learnCurve$trainRMSE[i] <- fit.lm$results$RMSE
    # use trained parameters to predict on test data
    prediction <- predict(fit.lm, newdata = mtcarsTest[, -1])
    rmse <- postResample(prediction, testY)
    learnCurve$cvRMSE[i] <- rmse[1]
}
pdf("LinearRegressionLearningCurve.pdf", width = 7, height = 7, pointsize = 12)
# plot learning curves of training set size vs. error measure
# for training set and test set
plot(log(learnCurve$trainRMSE), type = "o", col = "red", xlab = "Training set size",
     ylab = "Error (RMSE)", main = "Linear Model Learning Curve")
lines(log(learnCurve$cvRMSE), type = "o", col = "blue")
legend('topright', c("Train error", "Test error"), lty = c(1, 1), lwd = c(2.5, 2.5),
       col = c("red", "blue"))
dev.off()
(The output is a plot of training and test RMSE against training set size.)
At some point, probably after this question was asked, the caret package added the learning_curve_dat function which helps assess model performance across a range of training set sizes.
Here is the example from the function documentation:
library(caret)
set.seed(1412)
class_dat <- twoClassSim(1000)

set.seed(29510)
lda_data <- learning_curve_dat(dat = class_dat,
                               outcome = "Class",
                               test_prop = 1/4,
                               ## `train` arguments:
                               method = "lda",
                               metric = "ROC",
                               trControl = trainControl(classProbs = TRUE,
                                                        summaryFunction = twoClassSummary))

ggplot(lda_data, aes(x = Training_Size, y = ROC, color = Data)) +
  geom_smooth(method = loess, span = .8)
The performance metric(s) are found for each Training_Size and saved in lda_data along with the Data variable ("Resampling", "Training", and optionally "Testing").
Here is a link to the function documentation: https://rdrr.io/cran/caret/man/learning_curve_dat.html
To be clear, this answers the first part of the question but not the second part.
NOTE: Until at least August 2020 there was a typo in the caret package code and documentation: the function was called learing_curve_dat before it was corrected to learning_curve_dat. I've updated my answer to reflect this change. Make sure you are using a recent version of the caret package.
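A quick way to check which spelling your installed version exposes (plain base R, nothing assumed beyond the two candidate names):
library(caret)
packageVersion("caret")       # installed caret version
exists("learning_curve_dat")  # TRUE on recent versions
exists("learing_curve_dat")   # TRUE only on old versions with the typo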
