Error while estimating xgboost in h2o after update to 3.18 - r

I ran into the known issue of not being able to save an xgboost model and load it later to obtain predictions. That problem was present in h2o 3.16 and was supposedly fixed in 3.18. I updated the package from h2o's website (downloadable zip), and now a model that previously trained without problems gives the following error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix, :
Unexpected CURL error: Failed to connect to localhost port 54321: Connection refused
This happens only with xgboost (binary classification); the other models I use work fine. h2o is initialised, and a model trained just before it estimates without problems. Does anyone have an idea what the issue could be?
EDIT: Here is a reproducible example (based on Erin's answer) that produces the error:
library(h2o)
library(caret)
library(dplyr)  # provides %>% and mutate(), used in version 1 below
h2o.init()
# Import a sample binary outcome train set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# Assigning fold column
set.seed(1)
cv_folds <- createFolds(as.data.frame(train)$response,
k = 5,
list = FALSE,
returnTrain = FALSE)
# version 1
train <- train %>%
as.data.frame() %>%
mutate(fold_assignment = cv_folds) %>%
as.h2o()
# version 2
train <- h2o.cbind(train, as.h2o(cv_folds))
names(train)[dim(train)[2]] <- c("fold_assignment")
# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
xgb <- h2o.xgboost(x = x,
y = y,
seed = 1,
training_frame = train,
fold_column = "fold_assignment",
keep_cross_validation_predictions = TRUE,
eta = 0.01,
max_depth = 3,
sample_rate = 0.8,
col_sample_rate = 0.6,
ntrees = 500,
reg_lambda = 0,
reg_alpha = 1000,
distribution = 'bernoulli')
Both versions of creating the train data.frame result in the same error.

You didn't say whether you have re-trained the models using 3.18. In general, H2O only guarantees model compatibility within a major version of H2O. If you have not retrained the models, that's probably the reason XGBoost is not working properly. If you have re-trained the models with 3.18 and XGBoost is still not working, then please post a reproducible example and we will check it out further.
EDIT:
I am adding a reproducible example (the only difference between your code and this code is that I am not using fold_column here). This runs fine on 3.18.0.2. Without a reproducible example that produces an error, I can't help you any further.
library(h2o)
h2o.init()
# Import a sample binary outcome train set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
xgb <- h2o.xgboost(x = x,
y = y,
seed = 1,
training_frame = train,
keep_cross_validation_predictions = TRUE,
eta = 0.01,
max_depth = 3,
sample_rate = 0.8,
col_sample_rate = 0.6,
ntrees = 500,
reg_lambda = 0,
reg_alpha = 1000,
distribution = 'bernoulli')


How do I set the upper range mtry tuning value in mlr3, when I also conduct automated feature selection?

Date: 2022-08-17. R Version: 4.0.3. Platform: x86_64-apple-darwin17.0 (64-bit)
Problem: In mlr3 (classif.task, learner: random forest), I use automated hyperparameter optimization (HPO; mtry in the range between 1 and the number of features in the data), and automated feature selection (single criterion: msr = classif.auc).
I run into this ranger error message:
'mtry can not be larger than number of variables in data. Ranger will EXIT now.'
I am fairly sure the error occurs when a subset of features has been selected and HPO then tries an mtry value larger than the number of features in that subset. If this is true, how do I set the upper limit for the mtry parameter in such a case (see reprex below)?
# Make data with binary outcome.
set.seed(123); n <- 500
for(i in 1:9) {
assign(paste0("x", i), rnorm(n=n, mean = 0, sd = sample(1:6,1)))
}
z <- 0 + (.02*x1) + .03*x2 - .06*x3 + .03*x4 + .1*x5 + .08*x6 + .09*x7 - .008*x8 + .045*x9
pr = 1/(1+exp(-z))
y = rbinom(n, 1, pr)
dat <- data.frame(y=factor(y), x1, x2, x3, x4, x5, x6, x7, x8, x9)
#
library(mlr3verse)
tskclassif <- TaskClassif$new(id="rangerCheck", backend=dat, target="y")
randomForest <- lrn("classif.ranger", predict_type = "prob")
# Question: How do I set the upper range limit for the mtry parameter, in order to not get the error message?
searchSpaceRANDOMFOREST <- ps(mtry=p_int(lower = 1, upper = (ncol(dat)-1)))
# Hyperparameter optimization
resamplingTuner <- rsmp("cv", folds=4)
atRANDOMFOREST <- AutoTuner$new(
learner=randomForest,
resampling = resamplingTuner,
measure = msr("classif.auc"),
search_space = searchSpaceRANDOMFOREST,
terminator = trm("evals", n_evals = 10),
tuner = tnr("random_search"))
# Feature selection
instance = FSelectInstanceSingleCrit$new(
task = tskclassif,
learner = atRANDOMFOREST,
resampling = rsmp("holdout", ratio = .8),
measure = msr("classif.auc"),
terminator = trm("evals", n_evals = 20)
)
fselector <- fs("random_search")
fselector$optimize(instance)
# Error message:
# Error: mtry can not be larger than number of variables in data. Ranger will EXIT now.
# Error in ranger::ranger(dependent.variable.name = task$target_names, data = task$data(), : User interrupt or internal error.
# This happened PipeOp classif.ranger.tuned's $train()
You should be able to use the mtry.ratio parameter (see https://mlr3learners.mlr-org.com/reference/mlr_learners_classif.ranger.html) instead of mtry. It is expressed as a fraction of the available features, so the feature count is selected dynamically during tuning and never exceeds the number of features the learner receives.
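As a rough illustration of why a ratio avoids the error: according to the mlr3learners documentation, the learner derives mtry from mtry.ratio as max(ceiling(mtry.ratio * n_features), 1). The base-R sketch below only mimics that mapping. mtry_from_ratio is a hypothetical helper, not part of mlr3, and the commented search space line is an assumed replacement for the one in the question.

```r
# Hypothetical helper mimicking the documented mapping from mtry.ratio to mtry.
mtry_from_ratio <- function(ratio, n_features) {
  max(ceiling(ratio * n_features), 1)
}

# Feature selection may hand the learner anywhere from 1 to 9 features.
subset_sizes <- 1:9
ratios <- c(0, 0.25, 0.5, 1)

# Rows are ratios, columns are feature-subset sizes.
mtry_grid <- outer(ratios, subset_sizes, Vectorize(mtry_from_ratio))
dimnames(mtry_grid) <- list(ratio = ratios, n_features = subset_sizes)
print(mtry_grid)

# Assumed replacement for the search space in the question (needs mlr3verse):
# searchSpaceRANDOMFOREST <- ps(mtry.ratio = p_dbl(lower = 0, upper = 1))
```

Every entry of the grid is at least 1 and at most the number of features in its column, which is the property the fixed integer range 1 to 9 does not have once features are removed.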

How do you run nonlinear moderation using the nlsem package in R?

I'm just trying to learn how to use the nlsem package in R to fit nonlinear SEMM, but I keep running into the error "Posterior probability could not be calculated properly. Choose different starting parameters" when I try to create the res object. I'm trying to estimate a nonlinear model where latent variable tas predicts latent variable cts, moderated by latent variable ams. I'm still pretty new to R and very new to nonlinear analyses, so any help at all would be appreciated!
My code so far:
## Nonlinear SEM
library(nlsem)
#Select data
FPerpSEMM<-subset(FPerp,
select=(c("tas1", "tas3", "tas6", "tas7", "tas9", "tas13","tas14", "AMSEscalate",
"AMSNegAttribution", "AMSSelfAware", "AMSCalming", "cts_5", "cts_25",
"cts_29", "cts_35", "cts_49", "cts_65", "cts_67", "cts_69")))
FPerpSEMM$x1<-FPerpSEMM$tas1
FPerpSEMM$x2<-FPerpSEMM$tas3
FPerpSEMM$x3<-FPerpSEMM$tas6
FPerpSEMM$x4<-FPerpSEMM$tas7
FPerpSEMM$x5<-FPerpSEMM$tas9
FPerpSEMM$x6<-FPerpSEMM$tas13
FPerpSEMM$x7<-FPerpSEMM$tas14
FPerpSEMM$x8<-FPerpSEMM$AMSEscalate
FPerpSEMM$x9<-FPerpSEMM$AMSNegAttribution
FPerpSEMM$x10<-FPerpSEMM$AMSSelfAware
FPerpSEMM$x11<-FPerpSEMM$AMSCalming
FPerpSEMM$y1<-FPerpSEMM$cts_5
FPerpSEMM$y2<-FPerpSEMM$cts_25
FPerpSEMM$y3<-FPerpSEMM$cts_29
FPerpSEMM$y4<-FPerpSEMM$cts_35
FPerpSEMM$y5<-FPerpSEMM$cts_49
FPerpSEMM$y6<-FPerpSEMM$cts_65
FPerpSEMM$y7<-FPerpSEMM$cts_67
FPerpSEMM$y8<-FPerpSEMM$cts_69
FPerpSEMMr1<-subset(FPerpSEMM,
select=(c("x1","x2","x3","x4","x5","x6","x7","x8","x9","x10","x11",
"y1","y2","y3","y4","y5","y6","y7","y8")))
#Create dataframe containing only complete cases
FPerpSEMMcc<-na.omit(FPerpSEMMr1)
# load data
dat <- as.matrix(FPerpSEMMcc[, c(12:19, 1:7, 8:11)])
# specify model of class SEMM
model<- specify_sem(num.x = 11, num.y = 8, num.xi = 2, num.eta = 1,
xi = "x1-x7,x8-x11", eta = "y1-y8",
num.classes = 3, interaction = "xi1:xi2", rel.lat = "eta1~xi1+xi2",
constraints = "direct1")
class(model)
#fit model
dat <- as.matrix(FPerpSEMMcc[, c(12:19, 1:7, 8:11)])
set.seed(911)
pars.start <- runif(count_free_parameters(model))
res <- em(model, dat, pars.start, convergence = 0.1, max.iter = 200)
summary(res)
plot(res)

Reproducing the Airlines Delay h2o flow example with h2o package does not match

The following script reproduces a problem equivalent to the one in the h2o Help (Help -> View Example Flow, or Help -> Browse Installed packs.. -> examples -> Airlines Delay.flow, download), but uses the h2o R package and a fixed seed (123456):
library(h2o)
# Use all available cores
h2o.init(max_mem_size = "12g", nthreads = -1)
IS_LOCAL_FILE = switch(1, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
# Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id="glm_model", seed=123456, training_frame=allyears2k.hex,
ignore_const_cols = T, y = response,
family="binomial", solver="IRLSM",
alpha=0.5,lambda=0.00001, lambda_search=F, standardize=T,
non_negative=F, score_each_iteration=F,
max_iterations=-1, link="family_default", intercept=T, objective_epsilon=0.00001,
beta_epsilon=0.0001, gradient_epsilon=0.0001, prior=-1, max_active_predictors=-1
)
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
This is the Confusion Matrix for the training set:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
        NO    YES     Error            Rate
NO       0  20887  1.000000    =20887/20887
YES      0  23091  0.000000        =0/23091
Totals   0  43978  0.474942    =20887/43978
And the metrics:
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2473858
RMSE: 0.4973789
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5550138
Gini: 0.1100276
R^2: 0.007965165
Residual Deviance: 60504.04
AIC: 60516.04
By contrast, the h2o Flow run performs better (screenshots of its metrics and of the confusion matrix for the max F1 threshold are not included here).
The h2o Flow performance is much better than running the same algorithm with the equivalent R-package function.
Note: For the sake of simplicity I am using the Airlines Delay problem, which is a well-known h2o example, but I have seen similarly significant differences in other situations using the glm algorithm.
Any thoughts about why these significant differences occur?
Appendix A: Using default model parameters
Following the suggestion from @DarrenCook's answer, just using the default model-building parameters except for the excluded columns and the seed:
h2o flow
Now the buildModel is invoked like this:
buildModel 'glm', {"model_id":"glm_model-default",
"seed":"123456","training_frame":"allyears2k.hex",
"ignored_columns":
["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],
"response_column":"IsDepDelayed","family":"binomial"
}
and the results and training metrics are as follows (screenshots not included):
Running R-Script
The following script allows an easy switch to the default configuration (via the IS_DEFAULT_MODEL variable) while also keeping the configuration as stated in the Airlines Delay example:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # Use all available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(
x = predictors, model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex, y = response, family = "binomial"
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
)
}
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
It produces the following results:
MSE: 0.2473859
RMSE: 0.497379
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5549898
Gini: 0.1099796
R^2: 0.007964984
Residual Deviance: 60504.04
AIC: 60516.04
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
        NO    YES     Error            Rate
NO       0  20887  1.000000    =20887/20887
YES      0  23091  0.000000        =0/23091
Totals   0  43978  0.474942    =20887/43978
Some metrics are close, but the Confusion Matrix is quite different: the R script predicts all flights as delayed.
Appendix B: Configuration
Package: h2o
Version: 3.18.0.4
Type: Package
Title: R Interface for H2O
Date: 2018-03-08
Note: I tested the R-Script also under 3.19.0.4231 with the same results
This is the cluster information after running the R script:
> h2o.init(max_mem_size = "12g", nthreads = -1)
R is connected to the H2O cluster:
H2O cluster version: 3.18.0.4
...
H2O API Extensions: Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.3 (2017-03-06)
Troubleshooting Tip: build the all-defaults model first:
mDef = h2o.glm(predictors, response, allyears2k.hex, family="binomial")
This takes 2 seconds and gives almost exactly the same AUC and confusion matrix as in your Flow screenshots.
So, we now know the problem you see is due to all the model customization you have done...
...except when I build your fit1 I get basically the same results as my default model:
          NO    YES     Error            Rate
NO      4276  16611  0.795279    =16611/20887
YES     1573  21518  0.068122     =1573/23091
Totals  5849  38129  0.413479    =18184/43978
This was using your script exactly as given, so it fetched the remote csv file. (Oh, I removed the max_mem_size argument, as I don't have 12g on this notebook!)
Assuming you can get exactly your posted results, running exactly the code you posted (and in a fresh R session, with a newly started H2O cluster), one possible explanation is you are using 3.19.x, but the latest stable release is 3.18.0.2? (My test was with 3.14.0.1)
Finally, I guess this is the explanation: both runs use the same parameter configuration for building the model (that is not the problem), but the H2O Flow applies a specific parsing customization that converts some variables into Enum, which the R script did not specify.
The Airlines Delay problem, as specified in the h2o Flow example, uses the following predictor variables (the flow defines the ignored_columns):
"Year", "Month", "DayOfWeek", "UniqueCarrier",
"FlightNum", "Origin", "Dest", "Distance"
All of these predictors except Distance should be parsed as Enum. Therefore the R script needs to convert those columns from numeric or character into factor.
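To see why the column type matters so much for a GLM, here is a small base-R analogy. It uses model.matrix on hypothetical toy data rather than H2O's own internal encoding, so treat it as a sketch of the idea only: a numeric column contributes a single slope, while a factor contributes one coefficient per level beyond the baseline.

```r
# Toy data (hypothetical): month stored as a number, as a CSV parser would see it.
toy <- data.frame(Month = rep(1:12, times = 2))

# Numeric column: the intercept plus one slope for Month.
X_numeric <- model.matrix(~ Month, data = toy)

# Factor column: the intercept plus one indicator per non-baseline level.
X_factor <- model.matrix(~ factor(Month), data = toy)

ncol(X_numeric)  # 2 columns
ncol(X_factor)   # 12 columns
```

With columns such as FlightNum, Origin and Dest treated as categorical, the model has far more coefficients to work with than when they are read as plain numbers or strings.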
Executing using h2o R-package
Here is the updated R script:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # Use all available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = T)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime",
"ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
column.asFactor <- c("Year", "Month", "DayofMonth", "DayOfWeek",
"UniqueCarrier", "FlightNum", "Origin", "Dest", response)
# Coercing as factor (equivalent to Enum from h2o Flow)
# Note: Using lapply does not work, see the answer of this question
# https://stackoverflow.com/questions/49393343/how-to-coerce-multiple-columns-to-factors-at-once-for-h2oframe-object
for (col in column.asFactor) {
allyears2k.hex[col] <- as.factor(allyears2k.hex[col])
}
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(x = predictors, y = response,
training_frame = allyears2k.hex,
family = "binomial", seed = 123456
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T,
objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1,
max_active_predictors = -1
)
}
# Analysis
print("Confusion Matrix for training dataset")
confMatrix <- h2o.confusionMatrix(fit1)
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
Here is the result of running the R script under the default configuration (IS_DEFAULT_MODEL = TRUE):
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2001145
RMSE: 0.4473416
LogLoss: 0.5845852
Mean Per-Class Error: 0.3343562
AUC: 0.7570867
Gini: 0.5141734
R^2: 0.1975266
Residual Deviance: 51417.77
AIC: 52951.77
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           NO    YES     Error            Rate
NO      10337  10550  0.505099    =10550/20887
YES      3778  19313  0.163614     =3778/23091
Totals  14115  29863  0.325799    =14328/43978
Executing under h2o flow
Now executing the flow: Airlines_Delay_GLMFixedSeed, we can obtain the same results. Here the detail about the flow configuration:
The parseFiles function:
parseFiles
paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
destination_frame: "allyears2k.hex"
parse_type: "CSV"
separator: 44
number_columns: 31
single_quotes: false
column_names:
["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","UniqueCarrier","FlightNum","TailNum","ActualElapsedTime",
"CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin","Dest",
"Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
"Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed",
"IsDepDelayed"]
column_types ["Enum","Enum","Enum","Enum","Numeric","Numeric",
"Numeric","Numeric", "Enum","Enum","Enum","Numeric",
"Numeric", "Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Numeric","Numeric","Numeric","Enum","Enum"]
delete_on_done: true
check_header: 1
chunk_size: 4194304
where the following predictor columns are converted to Enum: "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest"
Now invoking the buildModel function as follows, using the default parameters except for ignored_columns and seed:
buildModel 'glm', {"model_id":"glm_model-default","seed":"123456",
"training_frame":"allyears2k.hex",
"ignored_columns":["DayofMonth","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],"response_column":"IsDepDelayed",
"family":"binomial"}
and finally we get the following result (screenshot not included), with these Training Output Metrics:
model glm_model-default
model_checksum -2438376548367921152
frame allyears2k.hex
frame_checksum -2331137066674151424
description ·
model_category Binomial
scoring_time 1521598137667
predictions ·
MSE 0.200114
RMSE 0.447342
nobs 43978
custom_metric_name ·
custom_metric_value 0
r2 0.197527
logloss 0.584585
AUC 0.757084
Gini 0.514168
mean_per_class_error 0.334347
residual_deviance 51417.772427
null_deviance 60855.951538
AIC 52951.772427
null_degrees_of_freedom 43977
residual_degrees_of_freedom 43211
Comparing both results
The training metrics are almost identical in their first four significant digits:
                        R-Script      H2o Flow
MSE:                    0.2001145     0.200114
RMSE:                   0.4473416     0.447342
LogLoss:                0.5845852     0.584585
Mean Per-Class Error:   0.3343562     0.334347
AUC:                    0.7570867     0.757084
Gini:                   0.5141734     0.514168
R^2:                    0.1975266     0.197527
Residual Deviance:      51417.77      51417.772427
AIC:                    52951.77      52951.772427
The Confusion Matrix is slightly different (taking YES as the positive class):
             TN      TP      FP     FN
R-Script  10337   19313   10550   3778
H2o Flow  10341   19309   10546   3782
          Error
R-Script  0.325799
H2o Flow  0.3258
My understanding is that the differences are within an acceptable threshold (around 0.0001), so we can say that both interfaces provide the same result.
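The comparison can be checked with a few lines of base R. This is only a sketch: the numbers are copied from the two runs above, and the 1e-4 tolerance is the threshold mentioned in the text, not anything H2O defines.

```r
# Metrics copied from the R-script run and the Flow run above.
metrics <- data.frame(
  metric   = c("MSE", "RMSE", "LogLoss", "Mean Per-Class Error",
               "AUC", "Gini", "R^2"),
  r_script = c(0.2001145, 0.4473416, 0.5845852, 0.3343562,
               0.7570867, 0.5141734, 0.1975266),
  h2o_flow = c(0.200114, 0.447342, 0.584585, 0.334347,
               0.757084, 0.514168, 0.197527)
)

tolerance <- 1e-4  # threshold named in the text
metrics$abs_diff   <- abs(metrics$r_script - metrics$h2o_flow)
metrics$within_tol <- metrics$abs_diff < tolerance
print(metrics)

# Error rate recomputed from the off-diagonal confusion matrix counts.
err_r    <- (10550 + 3778) / 43978
err_flow <- (10546 + 3782) / 43978
c(r_script = err_r, h2o_flow = err_flow)
```

Both error rates come from the same total of 14328 misclassified rows, so they coincide even though the individual cells differ by four observations.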

Using XGBoost in R for regression based model

I'm trying to use XGBoost as a replacement for gbm.
The scores I'm getting are rather odd, so I suspect I'm doing something wrong in my code.
My data contains several factor variables; all the others are numeric.
The response variable is continuous and represents a house price.
I understand that in order to use XGBoost I need to one-hot encode the factor variables. I'm doing so with the following code:
Xtest <- test.data
Xtrain <- train.data
XSalePrice <- Xtrain$SalePrice
Xtrain$SalePrice <- NULL
# Combine data
Xall <- data.frame(rbind(Xtrain, Xtest))
# Get categorical features names
ohe_vars <- names(Xall)[which(sapply(Xall, is.factor))]
# Convert them
dummies <- dummyVars(~., data = Xall)
Xall_ohe <- as.data.frame(predict(dummies, newdata = Xall))
# Replace factor variables in data with OHE
Xall <- cbind(Xall[, -c(which(colnames(Xall) %in% ohe_vars))], Xall_ohe)
After that, I'm splitting the data back to the test & train set:
Xtrain <- Xall[1:nrow(train.data), ]
Xtest <- Xall[-(1:nrow(train.data)), ]
And then building a model, and printing the RMSE & Rsquared:
# Model
xgb.fit <- xgboost(data = data.matrix(Xtrain), label = XSalePrice,
booster = "gbtree", objective = "reg:linear",
colsample_bytree = 0.2, gamma = 0.0,
learning_rate = 0.05, max_depth = 6,
min_child_weight = 1.5, n_estimators = 7300,
reg_alpha = 0.9, reg_lambda = 0.5,
subsample = 0.2, seed = 42,
silent = 1, nrounds = 25)
xgb.pred <- predict(xgb.fit, data.matrix(Xtrain))
postResample(xgb.pred, XSalePrice)
The problem is that I'm getting a very poor RMSE and R-squared:
        RMSE     Rsquared
1.877639e+05 5.308910e-01
These are VERY far from the results I get when using GBM.
I think I'm doing something wrong. My best guess is that it is the one-hot encoding step, which I'm unfamiliar with, so I used code I found online and adjusted it to my data.
Can someone point out what I am doing wrong and how to fix it?
UPDATE:
After reviewing @Codutie's answer, my code has some errors:
Xtrain <- sparse.model.matrix(SalePrice ~. , data = train.data)
XDtrain <- xgb.DMatrix(data = Xtrain, label = "SalePrice")
xgb.DMatrix produces:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
train.data is a data frame with 1453 rows. The label SalePrice also contains 1453 values (no missing values).
Thanks
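A likely reading of that error, sketched in base R with hypothetical toy data standing in for train.data: the UPDATE code passes the string "SalePrice" as the label, which has length 1, rather than the column of values. The commented lines are an assumed fix and need the Matrix and xgboost packages.

```r
# Hypothetical toy data standing in for train.data.
train.data <- data.frame(
  SalePrice = c(208500, 181500, 223500),
  LotArea   = c(8450, 9600, 11250)
)

label_as_string <- "SalePrice"           # what the UPDATE code passes
label_as_vector <- train.data$SalePrice  # one value per row

length(label_as_string) == nrow(train.data)  # FALSE: a single string
length(label_as_vector) == nrow(train.data)  # TRUE

# Assumed fix (needs the Matrix and xgboost packages):
# Xtrain  <- sparse.model.matrix(SalePrice ~ ., data = train.data)
# XDtrain <- xgb.DMatrix(data = Xtrain, label = train.data$SalePrice)
```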
# Assumes the response is stored in the last column of `dat`
train <- dat[train_ind, ]
train.y <- train[, ncol(train)]
xgboost(data = data.matrix(train[, -ncol(train)]),
label = train.y,
objective = "reg:linear",
eval_metric = "rmse",
max.depth = 15,
eta = 0.1,
nrounds = 15,
subsample = 0.5,
colsample_bytree = 0.5,
# num_class dropped: it applies to multi-class objectives, not regression
nthread = 3
)
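Two things to check when using XGBoost for regression:
1) eta: a small eta makes each boosting round more conservative, so the model needs more rounds (nrounds) to fit well and underfits if there are too few; a large eta tends to overfit.
2) eval_metric: I am not sure whether xgb allows users to supply their own eval_metric, but RMSE is not very useful when the quantitative dependent variable contains outliers. Check whether XGBoost supports a Huber loss function.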
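A back-of-the-envelope sketch of how eta and the number of rounds interact. It assumes squared-error loss and the idealized case in which every tree fits the current residual perfectly, so each round closes a fraction eta of the remaining gap; real trees do worse than this. remaining_gap is a hypothetical helper name.

```r
# Fraction of the initial residual still left after `nrounds` rounds,
# under the idealized assumption that each tree fits the residual exactly.
remaining_gap <- function(eta, nrounds) (1 - eta)^nrounds

remaining_gap(eta = 0.05, nrounds = 25)   # about 0.28 of the gap is left
remaining_gap(eta = 0.05, nrounds = 500)  # essentially zero
remaining_gap(eta = 0.1,  nrounds = 15)   # about 0.21 of the gap is left
```

With learning_rate = 0.05 and nrounds = 25 as in the question, even this best case leaves more than a quarter of the distance between the starting prediction and the house prices uncovered, which may be one contributor to the very large RMSE. As far as I know it is nrounds that sets the number of boosting rounds in the R interface, so raising it (or eta) would be the first thing to try.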

R Caret's rfe [Error in { : task 1 failed - "rfe is expecting 184 importance values but only has 2"]

I am using Caret's rfe for a regression application. My data (in data.table) has 176 predictors (including 49 factor predictors). When I run the function, I get this error:
Error in { : task 1 failed - "rfe is expecting 176 importance values but only has 2"
Then I used model.matrix( ~ . - 1, data = as.data.frame(train_model_sell_single_bid)) to convert the factor predictors to dummy variables. However, I got a similar error:
Error in { : task 1 failed - "rfe is expecting 184 importance values but only has 2"
I'm using R version 3.1.1 on Windows 7 (64-bit), Caret version 6.0-41. I also have Revolution R Enterprise version 7.3 (64-bit) installed.
But the same error was reproduced on Amazon EC2 (c3.8xlarge) Linux instance with R version 3.0.1 and Caret version 6.0-24.
Datasets used (to reproduce my error):
https://www.dropbox.com/s/utuk9bpxl2996dy/train_model_sell_single_bid.RData?dl=0
https://www.dropbox.com/s/s9xcgfit3iqjffp/train_model_bid_outcomes_sell_single.RData?dl=0
My code:
library(caret)
library(data.table)
library(bit64)
library(doMC)
load("train_model_sell_single_bid.RData")
load("train_model_bid_outcomes_sell_single.RData")
subsets <- seq(from = 4, to = 184, by= 4)
registerDoMC(cores = 32)
set.seed(1015498)
ctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
repeats = 1,
#saveDetails = TRUE,
verbose = FALSE)
x <- as.data.frame(train_model_sell_single_bid[,!"security_id", with=FALSE])
y <- train_model_bid_outcomes_sell_single[,bid100]
lmProfile_single_bid100 <- rfe(x, y,
sizes = subsets,
preProc = c("center", "scale"),
rfeControl = ctrl)
It seems that you might have highly correlated predictors.
Prior to feature selection you should run:
correlations <- cor(x)  # assumes all columns of x are numeric
crrltn <- findCorrelation(correlations, cutoff = .90)
if (length(crrltn) != 0)
x <- x[, -crrltn]
If the problem persists after this, it might be related to high correlation among the predictors within the automatically generated folds. You can try to control the generated folds with:
set.seed(12213)
index <- createFolds(y, k = 10, returnTrain = T)
and then give these as arguments to the rfeControl function:
lmctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
index = index,
verbose = TRUE)
set.seed(111333)
lrprofile <- rfe(x, y,
sizes = subsets,
rfeControl = lmctrl)
If you keep having the same problem, check whether there are highly correlated predictors within each fold:
for(i in 1:length(index)){
crrltn = cor(x[index[[i]],])
findCorrelation(crrltn, cutoff = .90, names = T, verbose = T)
}
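For intuition about what the correlation check is looking for, here is a base-R sketch on hypothetical toy data. It simply lists pairs whose absolute correlation exceeds the cutoff; it is not caret's findCorrelation algorithm, which additionally decides which column of each pair to drop.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
x_toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),  # nearly a copy of a
  c = rnorm(n)                 # unrelated to a and b
)

cor_mat <- cor(x_toy)

# Use the upper triangle only so that each pair is reported once.
high <- which(abs(cor_mat) > 0.90 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

Running the same check on x[index[[i]], ] for each fold shows whether a pair only becomes near-collinear inside particular resamples.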
