How to specify offset_column in h2o.stackedEnsemble() in R

I am running GBM and GLM base learners with an offset_column in H2O. My response variable is binary and the offset column is a positive constant. The base learners trained without issue. Here is the code:
train["offset"]<-train["log_hazard"] # offset column in the training set
my_gbm <- h2o.gbm(x = x, y = y, training_frame = train,
fold_column = "fold_id",
keep_cross_validation_predictions = TRUE,
offset_column = "offset",
seed = 1)
my_glm <- h2o.glm(x = x, y = y, training_frame = train,
fold_column = "fold_id",
keep_cross_validation_predictions = TRUE,
offset_column = "offset",
seed = 1,family = "binomial")
Then I am passing the offset_column to h2o.stackedEnsemble() through metalearner_params. Here is the code:
stack_model <- h2o.stackedEnsemble(x = x,
                                   y = y,
                                   training_frame = train,
                                   base_models = list(my_gbm, my_glm),
                                   metalearner_params = list(offset_column = "offset"))
But I received the following error:
ERRR on field: _offset_column: Offset column 'offset' not found in the training frame
The offset_column is in the training data. I am not sure why I am receiving this error message.
Then I tried running h2o.stackedEnsemble() without the metalearner_params option. Here is the code:
stack_model <- h2o.stackedEnsemble(x = x,
                                   y = y,
                                   training_frame = train,
                                   base_models = list(my_gbm, my_glm))
and received the following warning message:
Warning message:
In .h2o.startModelJob(algo, params, h2oRestApiVersion) :
Dropping bad and constant columns: [offset].
I am not sure whether it ran properly. Can anyone please help me with this issue?

If you read the H2O docs for h2o.stackedEnsemble() carefully, you will see that the metalearner does not need an offset parameter: it is trained on the cross-validated predictions of the base models rather than on the original features. That is also why your constant offset column triggers the "Dropping bad and constant columns" warning, which is harmless here. Specify the offset only in the base learners:
my_gbm <- h2o.gbm(x = x, y = y, training_frame = train,
                  fold_column = "fold_id",
                  keep_cross_validation_predictions = TRUE,
                  offset_column = "offset",
                  seed = 1)
my_glm <- h2o.glm(x = x, y = y, training_frame = train,
                  fold_column = "fold_id",
                  keep_cross_validation_predictions = TRUE,
                  offset_column = "offset",
                  seed = 1, family = "binomial")
stack_model <- h2o.stackedEnsemble(x = x,
                                   y = y,
                                   training_frame = train,
                                   base_models = list(my_gbm, my_glm))
h2o.performance(my_gbm, newdata = test)
h2o.performance(my_glm, newdata = test)
h2o.performance(stack_model, newdata = test)
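To confirm the ensemble trained properly, you can also inspect the fitted metalearner. A minimal sketch, assuming the stack_model object from above; the metalearner's model id is stored on the model slot in recent H2O releases, though the exact slot layout can vary by version:
# Retrieve the metalearner GLM that was trained on the base models' CV predictions
metalearner <- h2o.getModel(stack_model@model$metalearner$name)
summary(metalearner)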

Related

Prediction on new data with GLMNET and CARET - The number of variables in newx must be X

I have a dataset on which I am doing k-fold cross-validation.
In each fold, I split the data into a train and test dataset.
To train on the dataset X, I run the following code:
cv_glmnet <- caret::train(x = as.data.frame(X[curtrainfoldi, ]), y = y[curtrainfoldi, ],
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
I check the class of 'cv_glmnet', and 'train' is returned.
I then want to use this model to predict values in the test dataset, which is a matrix with the same number of variables (columns):
# predicting on test data
yhat <- predict.train(cv_glmnet, newdata = X[curtestfoldi, ])
However, I keep running into the following error:
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt, type = "response") :
The number of variables in newx must be 210
I noticed that the predict.train documentation states the following:
newdata an optional set of data to predict on. If NULL, then the
original training data are used but, if the train model used a recipe,
an error will occur.
I am confused as to why I am running into this error. Is it related to how I am defining newdata? My data has the right number of variables/columns (the same as the train dataset), so I have no idea what is causing the error.
You get the error because your column names change when you pass as.data.frame(X). If your matrix doesn't have column names, as.data.frame() creates them (V1, V2, ...), and the model expects those names when it predicts. If the matrix does have column names, some of them could still be altered:
library(caret)
X = matrix(runif(50 * 20), ncol = 20)
y = rnorm(50)
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt) :
The number of variables in newx must be 20
If you have column names, it works:
colnames(X) = paste0("column", 1:ncol(X))
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                          method = "glmnet",
                          preProcess = NULL,
                          trControl = trainControl(method = "cv", number = 10),
                          tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
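An alternative fix for the unnamed-matrix case is to run newdata through the same as.data.frame() conversion as the training data, so both sides get identical auto-generated names (V1, V2, ...). A minimal sketch, assuming the first cv_glmnet fit from above:
# The conversion reproduces the V1..V20 names that the model saw at training time
yhat <- predict.train(cv_glmnet, newdata = as.data.frame(X))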

ensemble_glmnet: could not find function "predict.cv.glmnet"

I am trying to run the ensemble_glmnet function, but I am receiving an error that it cannot find predict.cv.glmnet. I have loaded the glmnet and glmnetUtils libraries.
I'm running RStudio 1.2.5033 and R version 3.6.2.
library(BuenaVista)
library(glmnet)
library(glmnetUtils)
data <- iris[sample(1:150, size = 150, replace = FALSE), ]
data <- derive_variables(dataset = data, type = "dummy", integer = TRUE, return_dataset = TRUE)
data$Species_setosa <- as.factor(data$Species_setosa)
test <- data[101:150, c(1, 2, 3, 4, 6, 7)]
data <- data[, c(5, 1, 2, 3, 4, 6, 7)]
ensemble_glmnet(y_index = 1, train = data, valid_size = 50, n = 10,
                alpha = 1, family = "binomial", type = "class")
Error in predict.cv.glmnet(object = cv.glmnet(x = X, y = Y, nfolds =
nfolds, : could not find function "predict.cv.glmnet"
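For reference, predict.cv.glmnet is an unexported S3 method in the glmnet package: it is reached through the generic predict() rather than called by name, which is a plausible source of this error (an assumption, since the failure happens inside BuenaVista's ensemble_glmnet). A minimal sketch of the usual pattern:
library(glmnet)
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- rnorm(100)
fit <- cv.glmnet(x = X, y = y, nfolds = 5)
# Dispatch through the generic; calling predict.cv.glmnet() by name only
# works via the internal glmnet:::predict.cv.glmnet
yhat <- predict(fit, newx = X, s = "lambda.min")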

Error while running h2o.deeplearning algorithm in R

I am facing an error while running this command in H2O Deep Learning in R:
model <- h2o.deeplearning(x = x, y = y, seed = 1234,
                          training_frame = as.h2o(trainDF),
                          nfolds = 3,
                          stopping_rounds = 7,
                          epochs = 400,
                          overwrite_with_best_model = TRUE,
                          activation = "Tanh",
                          input_dropout_ratio = .1,
                          hidden = c(10, 10),
                          l1 = 6e-4,
                          loss = "automatic",
                          distribution = 'AUTO',
                          stopping_metric = "MSE")
The error is below:
Error in h2o.deeplearning(x = x, y = y, seed = 1234, training_frame = as.h2o(trainDF), :
unused arguments (training_frame = as.h2o(trainDF), stopping_rounds = 7, overwrite_with_best_model = TRUE, distribution = "AUTO", stopping_metric = "MSE")
I was not able to reproduce your specific error, but I was able to get the code to work on my end by updating loss = "automatic" to loss = "Automatic" (note that loss is case-sensitive).
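For clarity, here is a sketch of the corrected call, identical to the question's code except for the capitalized loss value (x, y, and trainDF are assumed from the question):
model <- h2o.deeplearning(x = x, y = y, seed = 1234,
                          training_frame = as.h2o(trainDF),
                          nfolds = 3,
                          stopping_rounds = 7,
                          epochs = 400,
                          overwrite_with_best_model = TRUE,
                          activation = "Tanh",
                          input_dropout_ratio = 0.1,
                          hidden = c(10, 10),
                          l1 = 6e-4,
                          loss = "Automatic",  # capitalized: this value is case-sensitive
                          distribution = "AUTO",
                          stopping_metric = "MSE")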

H2o.ensemble: family <- "gaussian", error with family = gamma requires a positive response

Error in h2o.ensemble(x = x, y = y, training_frame = train, family = family, : family = gamma requires a positive respone
Traceback:
h2o.ensemble(x = x, y = y, training_frame = train, family = family,
. learner = learner, metalearner = metalearner, cvControl = list(V = 5,
. shuffle = TRUE))
stop("family = gamma requires a positive respone")
reponse "y" is with both negative and positive values.`
code:
## Load required packages
library(h2o)
library(h2oEnsemble)
h2o.init(nthreads = -1, max_mem_size = "8G")
data <- h2o.importFile('./input/df_train.csv')
# Partition the data into train and test sets
splits <- h2o.splitFrame(data, seed = 1)
train <- splits[[1]]
test <- splits[[2]]
# Identify response and predictor variables
y <- "logerror"
x <- setdiff(colnames(data), c(y, "parcelid", "transactiondate"))
print(x)
# Specify the base learner library & the metalearner
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
"h2o.xgboost.wrapper",
"h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"
family <- "gaussian"
# Train the ensemble using 5-fold CV to generate level-one data
fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train,
                    family = family,
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5, shuffle = TRUE))
# Evaluate performance on a test set
perf <- h2o.ensemble_performance(fit, newdata = test)
perf
This was a bug in h2oEnsemble v0.2.0 that was introduced when I added support for the extra family values (gamma, poisson, etc.). I have fixed the bug and released h2oEnsemble v0.2.1; you can find a link to download the new package here, or use the R command below:
install.packages("https://h2o-release.s3.amazonaws.com/h2o-ensemble/R/h2oEnsemble_0.2.1.tar.gz", repos = NULL)
On a separate note, your code attempts to include an XGBoost model by using a wrapper, "h2o.xgboost.wrapper" -- there is no built-in wrapper for XGBoost in the h2oEnsemble package yet, so that won't work. I will add the XGBoost wrapper after h2o 3.14.0.1 is released. That should happen in the next week or two.
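In the meantime, a sketch of the learner vector with the unsupported XGBoost wrapper removed (everything else in the question's code can stay as it is):
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
             "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")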

Tuning two parameters for random forest in Caret package

When I only used the mtry parameter in the tuning grid, it worked, but when I added the ntree parameter I got the error Error in train.default(x, y, weights = w, ...): The tuning parameter grid should have columns mtry. The code is below:
require(RCurl)
require(prettyR)
library(caret)
url <- "https://raw.githubusercontent.com/gastonstat/CreditScoring/master/CleanCreditScoring.csv"
cs_data <- getURL(url)
cs_data <- read.csv(textConnection(cs_data))
classes <- cs_data[, "Status"]
predictors <- cs_data[, -match(c("Status", "Seniority", "Time", "Age", "Expenses",
                                 "Income", "Assets", "Debt", "Amount", "Price", "Finrat", "Savings"),
                               colnames(cs_data))]
train_set <- createDataPartition(classes, p = 0.8, list = FALSE)
set.seed(123)
cs_data_train = cs_data[train_set, ]
cs_data_test = cs_data[-train_set, ]
# Define the tuned parameter
grid <- expand.grid(mtry = seq(4, 16, 4), ntree = c(700, 1000, 2000))
ctrl <- trainControl(method = "cv", number = 10, summaryFunction = twoClassSummary, classProbs = TRUE)
rf_fit <- train(Status ~ ., data = cs_data_train,
                method = "rf",
                preProcess = c("center", "scale"),
                tuneGrid = grid,
                trControl = ctrl,
                family = "binomial",
                metric = "ROC")  # define which metric to optimize, e.g. metric = "RMSE" for regression
rf_fit
You have to create a custom RF method using the randomForest package and include the parameters that you want to tune.
customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("mtry", "ntree"),
                                  class = rep("numeric", 2),
                                  label = c("mtry", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[, 1]), ]
customRF$levels <- function(x) x$classes
customRF
Then you can pass customRF as the method in the train() function.
You should also change the grid definition to caret's dot-prefixed tuneGrid convention:
grid <- expand.grid(.mtry = seq(4, 16, 4), .ntree = c(700, 1000, 2000))
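Putting the pieces together, a hedged usage sketch reusing cs_data_train, ctrl, and the grid from the question (family = "binomial" is dropped because randomForest() has no family argument):
library(randomForest)
set.seed(123)
rf_fit <- train(Status ~ ., data = cs_data_train,
                method = customRF,  # pass the custom model list, not "rf"
                preProcess = c("center", "scale"),
                tuneGrid = grid,
                trControl = ctrl,
                metric = "ROC")
rf_fit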
