R: how does caret choose default tuning ranges?

When using R caret to compare multiple models on the same data set, caret is smart enough to select different tuning ranges for different models if the same tuneLength is specified for all models and no model-specific tuneGrid is specified.
For example, the tuning ranges chosen by caret for one particular data set are:
earth(nprune): 2, 5, 8, 11, 14
gamSpline(df): 1, 1.5, 2, 2.5, 3
rpart(cp): 0.010, 0.054, 0.116, 0.123, 0.358
Does anyone know how caret determines these default tuning ranges? I have been searching through the documentation but still haven't pinned down the algorithm it uses to choose the ranges.

It depends on the model. For rpart and a few others, it fits an initial model to get a sense of what reasonable values should be. In other cases, it is less intelligent. For example, for gamSpline it is expand.grid(df = seq(1, 3, length = len)).
You can see what it does per model using getModelInfo:
> getModelInfo("earth")[[1]]$grid
function(x, y, len = NULL) {
  dat <- if (is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  mod <- earth(.outcome ~ ., data = dat, pmethod = "none")
  maxTerms <- nrow(mod$dirs)
  maxTerms <- min(200, floor(maxTerms * .75) + 2)
  data.frame(nprune = unique(floor(seq(2, to = maxTerms, length = len))),
             degree = 1)
}
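For completeness, here is a minimal sketch (using the same getModelInfo() mechanism, with the model names from the question) that prints the default grid function for each model:
library(caret)
for (m in c("earth", "gamSpline", "rpart")) {
  cat("----", m, "----\n")
  # each entry's $grid is the function caret calls to build the default tuning grid
  print(getModelInfo(m)[[1]]$grid)
}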
Max

Related

R SVM algorithm

Can someone please explain this line of code: what are svm, mpglevel ~ ., data, kernel, and ranges?
tune.out <- tune(svm,
                 mpglevel ~ .,
                 data = Auto,
                 kernel = "linear",
                 ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000)))
"svm" => the support vector classifier
"mpgelevel ~. " => classify the dependent variable mpglevel over the rest independent variables of the dataset
"data = Auto" => the variable that holds the data set
"kernel = 'linear'," =>The function of kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions. These functions can be different types. For example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
The basic kernel "linear" means you believe there is a straight line that separates your dataset into 2 classes. If a straight line cannot split your dataset, then you need to transform the data into a different dimension; for example, this dataset will not have a straight line that splits it into 2 classes.
That is when you choose other kernels. The kernel functions are explained, for example, here.
"ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000)))" => is different values of "cost of classification" you want to measure the svm against. or how much an SVM should be allowed to “bend” with the data. For a low cost, you aim for a smooth decision surface and for a higher cost.

Why are the grpreg and gglasso libraries in R giving different results for group LASSO?

I have been trying to do unsupervised feature selection using LASSO (by removing the class column). The dataset includes categorical (factor) and continuous (numeric) variables. Here is the link. I built a design matrix using model.matrix(), which creates dummy variables for each level of the categorical variables.
library(openxlsx)  # for read.xlsx()
dataset <- read.xlsx("./hepatitis.data.xlsx", sheet = "hepatitis", na.strings = "")
names_df <- names(dataset)
formula_LASSO <- as.formula(paste("~ 0 +", paste(names_df, collapse = " + ")))
LASSO_df <- model.matrix(object = formula_LASSO, data = dataset,
                         contrasts.arg = lapply(dataset[, sapply(dataset, is.factor)],
                                                contrasts, contrasts = FALSE))
### Group LASSO using gglasso package
library(gglasso)
gglasso_group <- c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10,
                   11, 11, 12, 12, 13, 13, 14, 15, 16, 17, 17)
fit <- gglasso(x = LASSO_df, y = y_k, group = gglasso_group, loss = "ls",
               intercept = FALSE, nlambda = 100)
# Cross validation
fit.cv <- cv.gglasso(x = LASSO_df, y = y_k, group = gglasso_group, nfolds = 10)
# Best lambda
best_lambda_fit.cv <- fit.cv$lambda.1se
# Final coefficients of variables
coefs <- coef(fit, s = best_lambda_fit.cv)
### Group LASSO with grpreg package
library(grpreg)
group_lasso <- grpreg(X = LASSO_df, y = y_k, group = gglasso_group,
                      penalty = "grLasso")
plot(group_lasso)
cv_group_lasso <- cv.grpreg(X = LASSO_df, y = y_k, group = gglasso_group,
                            penalty = "grLasso", se = "quick")
# Best lambda
best_lambda_group_lasso <- cv_group_lasso$lambda.min
coef_mat_group_lasso <- as.matrix(coef(cv_group_lasso))
If you check coefs and coef_mat_group_lasso, you will realize that they are not the same. Also, the best lambda values are not the same. I am not sure which one to choose for feature selection.
Any idea how to remove the intercept in the grpreg() function? intercept = FALSE is not working.
Any help is appreciated. Thanks in advance.
Please refer to the gglasso paper and the grpreg paper.
Different objective functions. On page 175 of the grpreg paper, the author performs a step called group standardization, which normalizes the feature matrix within each group by right-multiplying it by an orthonormal matrix and a non-negative diagonal matrix. After the group lasso step with group standardization, the estimated coefficients are left-multiplied by the same matrices so that we recover the coefficients of the original linear model. With group standardization, however, the group lasso penalty is not equivalent to the penalty without it. See page 175 for the detailed discussion.
Different algorithms. grpreg uses block coordinate descent, while gglasso uses an algorithm called groupwise-majorization-descent. It is natural to see small numerical differences when the algorithms are not the same.
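Because each package also constructs its own lambda path, comparing coefficients at each package's own "best lambda" compounds these differences. A hedged sketch that evaluates both fits at one common, purely illustrative lambda value (LASSO_df, y_k, and gglasso_group as in the question), so any remaining gap reflects the objective and algorithm rather than the path:
library(gglasso)
library(grpreg)
lam <- 0.05  # illustrative value, not tuned
fit_gg <- gglasso(x = LASSO_df, y = y_k, group = gglasso_group,
                  loss = "ls", lambda = lam)
fit_gr <- grpreg(X = LASSO_df, y = y_k, group = gglasso_group,
                 penalty = "grLasso", lambda = lam)
# both coefficient vectors start with an intercept entry
cbind(gglasso = as.vector(coef(fit_gg)), grpreg = as.vector(coef(fit_gr)))
As for the intercept: as far as I can tell, grpreg always fits an unpenalized intercept and exposes no intercept argument, which would explain why intercept = FALSE has no effect.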

Use R's NeuralNetTools library to Plot the Network Structure of an H2O Deep Neural Network

I want to be able to use R's NeuralNetTools library to plot the network layout of an H2O deep neural network. Below is sample code that plots the network layout of a model from the neuralnet package.
library(NeuralNetTools)
library(neuralnet)
data(neuraldat)
wts_in <- neuralnet(Y1 ~ X1 + X2 + X3, data = neuraldat,
                    hidden = c(4), rep = 1)
plotnet(wts_in)
I want to do the same thing but with an H2O deep learning model. The code below shows how to generate a layout knowing only the layer structure and the weights.
library(NeuralNetTools)
# B1-H11, I1-H11, I2-H11, B1-H12, I1-H12, I2-H12, B2-H21, H11-H21, H12-H21,
# B2-H22, H11-H22, H12-H22, B3-O1, H21-O1, H22-O1
wts_in <- c(1.12, 1.49, 0.16, -0.11, -0.19, -0.16, 0.5, 0.2, -0.12, -0.1,
            0.89, 0.9, 0.56, -0.52, 0.81)
struct <- c(2, 2, 2, 1)  # two inputs, two hidden layers (two nodes each), one output
x_names <- c("No", "Yes")  # input variable names
y_names <- c("maybe")      # output variable name
plotnet(wts_in, struct = struct)
Below is the same neuralnet model, but generated with H2O. I'm stumped on how to get the number of layers.
library(h2o)
h2o.init()
neuraldat.hex <- as.h2o(neuraldat)
h2o_neural_model <- h2o.deeplearning(x = 1:4, y = 5,
                                     training_frame = neuraldat.hex,
                                     hidden = c(2, 3),
                                     epochs = 10,
                                     model_id = NULL)
h2o_neural_model@model
I can use the weights function h2o.weights(object, matrix_id = 1) and the bias function h2o.biases(object, vector_id = 1) to build the structure, but I need a way to determine the number of layers. I know I can specify the number of layers when building the model, but I sometimes write code that determines the number of layers programmatically, so I need a function that extracts the layer structure and weights for the plotnet() call below.
plotnet(wts_in, struct=struct)
As an alternative, it would be nice if I had a ggplot2 function instead of the plotnet() function.
Any help is greatly appreciated.
I know it's been 8 months and it's likely that you've already figured it out. However, I will post my solution for those who run into the same problem.
The key here is the parameter export_weights_and_biases of h2o.deeplearning(), together with the h2o.weights(model) and h2o.biases(model) functions, which give the parameters you're looking for.
All that's left is ordering the data.
# Load your data
neuraldat.hex <- as.h2o(neuraldat)
h2o_neural_model <- h2o.deeplearning(x = 1:4, y = 5,
                                     training_frame = neuraldat.hex,
                                     hidden = c(2, 3),
                                     epochs = 10,
                                     model_id = NULL,
                                     export_weights_and_biases = TRUE)  # notice this parameter!
# for each layer, starting from left hidden layer,
# append bias and weights of each node in layer to
# numeric vector.
wts <- c()
for (l in 1:(length(h2o_neural_model@allparameters$hidden) + 1)) {
  wts_in <- h2o.weights(h2o_neural_model, l)
  biases <- as.vector(h2o.biases(h2o_neural_model, l))
  for (i in 1:nrow(wts_in)) {
    wts <- c(wts, biases[i], as.vector(wts_in[i, ]))
  }
}
# generate struct from column 'units' in model_summary
struct <- h2o_neural_model@model$model_summary$units
# plot it
plotnet(wts, struct = struct)
The H2O object returned by the deep learning function is quite complex, and one can get lost in the documentation.
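A quick sanity check, assuming the struct and wts assembled above: plotnet() expects, for every node, one bias followed by one weight per incoming connection, so the total parameter count is fixed by struct.
# each layer transition l contributes (struct[l] + 1) * struct[l + 1] parameters:
# one bias per node plus one weight per incoming connection
expected <- sum((head(struct, -1) + 1) * tail(struct, -1))
stopifnot(length(wts) == expected)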

stop xgboost based on eval_metric

I am trying to run xgboost for a problem with very noisy features, and I am interested in stopping the training at a number of rounds determined by a custom eval_metric that I have defined.
Based on domain knowledge I know that when the eval_metric (evaluated on the training data) goes above a certain value xgboost is overfitting. And I would like to just take the fitted model at that specific number of rounds and not proceed further.
What would be the best way to achieve this ?
It would be somewhat in line with the early stopping criteria but not exactly.
Alternatively, is there a possibility to get the model from an intermediate round?
Here is an example to better explain my question (using the toy example that comes with the xgboost help docs and the default eval_metric).
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2,
                     eta = 1, nthread = 2, nround = 5,
                     objective = "binary:logistic")
Here is the output
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
Now let's say that, from domain knowledge, I know that once the train error goes below 0.015 (the third round in this case), any further rounds only lead to overfitting. How would I stop the training process after the third round and get hold of the trained model to use it for prediction on a different dataset?
I need to run the training process over many different datasets, and I have no sense of how many rounds it might take to get the error below a fixed number, hence I can't set the nrounds argument to a predetermined value. The only intuition I have is that once the training error goes below a number, I need to stop further training rounds.
In the absence of any code you have tried or any data you are using, try something like this:
require(xgboost)
library(Metrics)  # for rmse() to calculate errors
# Assume you have a training set db.train and have some
# feature indices of interest and a test set db.test
predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])
# you have some response you are interested in
outcomeName <- "myLabel"
# you may like to include for testing some other parameters like:
# eta, gamma, colsample_bytree, min_child_weight
# here we look at depths from 1 to 4 and rounds 1 to 100, but set your own values
smallestError <- 100  # set to some sensible value depending on your eval metric
for (depth in seq(1, 4, 1)) {
  for (rounds in seq(1, 100, 1)) {
    # train
    bst <- xgboost(data = as.matrix(db.train[, predictors]),
                   label = db.train[, outcomeName],
                   max.depth = depth,
                   nround = rounds,
                   eval_metric = "logloss",
                   objective = "binary:logistic",
                   verbose = TRUE)
    gc()
    # predict
    predictions <- as.numeric(predict(bst, as.matrix(db.test[, predictors]),
                                      outputmargin = TRUE))
    err <- rmse(as.numeric(db.test[, outcomeName]), as.numeric(predictions))
    if (err < smallestError) {
      smallestError <- err
      print(paste(depth, rounds, err))
    }
  }
}
You could adapt this code to your particular evaluation metric and adjust the printed output to suit your situation. Similarly, you could introduce a break in the code once some specified number of rounds is reached or some condition you seek is satisfied.
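For the specific goal in the question (stop as soon as the training error crosses a known threshold), a hedged alternative sketch is to train one round at a time, continuing the same booster through xgb.train()'s xgb_model argument; shown here on the toy data with the 0.015 threshold from the question:
library(xgboost)
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
params <- list(max_depth = 2, eta = 1, nthread = 2,
               objective = "binary:logistic")
bst <- NULL
for (round in 1:50) {
  # one boosting round, continuing from the previous booster
  bst <- xgb.train(params, dtrain, nrounds = 1, xgb_model = bst)
  pred <- as.numeric(predict(bst, dtrain) > 0.5)
  train_err <- mean(pred != getinfo(dtrain, "label"))
  cat(sprintf("[%d] train-error: %f\n", round, train_err))
  if (train_err < 0.015) break  # the domain-knowledge threshold
}
# bst now holds the model as of the stopping round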

xgboost Random Forest with sparse matrix data and multinomial Y

I'm not sure whether xgboost's many nice features can be combined in the way that I need, but what I'm trying to do is run a random forest with sparse predictors on a multi-class dependent variable.
I know that xgboost can do any one of these things:
Random Forest via tweaking xgboost parameters:
bst <- xgboost(data = train$data, label = train$label, max.depth = 4,
               num_parallel_tree = 1000, subsample = 0.5,
               colsample_bytree = 0.5, nround = 1,
               objective = "binary:logistic")
Sparse matrix predictors
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
               eta = 1, nthread = 2, nround = 10,
               objective = "binary:logistic")
Multinomial (multiclass) dependent variable models via multi:softmax or multi:softprob
xgboost(data = data, label = multinomial_vector, max.depth = 4,
        eta = 1, nthread = 2, nround = 10,
        num_class = length(unique(multinomial_vector)),  # multi:* objectives require num_class
        objective = "multi:softmax")
However, I run into an error regarding non-conforming length when I try to do all of them at once:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
Y <- train$TripType
bst <- xgboost(data = sparse_matrix, label = Y, max.depth = 4,
               num_parallel_tree = 100, subsample = 0.5,
               colsample_bytree = 0.5, nround = 1,
               objective = "multi:softmax")
Error in xgb.setinfo(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
length(Y)
[1] 647054
length(sparse_matrix)
[1] 66210988200
nrow(sparse_matrix)
[1] 642925
The length error I'm getting is comparing the length of my single multi-class dependent vector (let's call it n) to the length of the sparse matrix index, which I believe is j * n for j predictors.
The specific use case here is the Kaggle.com Walmart competition (the data is public, but very large by default -- about 650,000 rows and several thousand candidate features). I've been running multinomial RF models on it via H2O, but it sounds like a lot of other folks have been using xgboost, so I wonder if this is possible.
If it's not possible, then I wonder if one could/should estimate each level of the dependent variable separately and try to combine the results?
Here is what is happening:
When you do this:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
you are losing rows from your data.
sparse.model.matrix cannot deal with NAs by default; when it sees one, it drops the row.
As it happens, there are exactly 4129 rows that contain NAs in the original data.
This is the difference between these two numbers:
length(Y)
[1] 647054
nrow(sparse_matrix)
[1] 642925
The reason this works in the previous examples is as follows.
In the binomial case:
it is recycling the Y vector to complete the missing labels. (this is BAD)
In the random forest case:
(I think) it's because random forest never uses the predictions from previous trees, so this error goes unseen. (this is BAD)
Takeaway:
Neither of the previous examples that "work" will train well.
Because sparse.model.matrix drops rows with NAs, you are losing rows from your training data; this is a big problem and needs to be addressed. See the sketch below.
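A hedged sketch of one common fix, using the train and TripType names from the question: keep the NA rows when building the model matrix so rows and labels stay aligned, and let xgboost treat the NAs as missing values:
library(Matrix)
library(xgboost)
# stop model.matrix from silently dropping rows that contain NAs
old_na <- options(na.action = "na.pass")
sparse_matrix <- sparse.model.matrix(TripType ~ . - 1, data = train)
options(old_na)
stopifnot(nrow(sparse_matrix) == length(train$TripType))  # rows and labels now align
# xgboost handles NA entries natively via its missing-value support;
# multi:softmax needs integer labels in 0..(num_class - 1)
bst <- xgboost(data = sparse_matrix, label = as.integer(factor(train$TripType)) - 1,
               missing = NA, max.depth = 4, nround = 10,
               num_class = length(unique(train$TripType)),
               objective = "multi:softmax")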
Good luck!
