Date: 2022-08-17. R Version: 4.0.3. Platform: x86_64-apple-darwin17.0 (64-bit)
Problem: In mlr3 (classif.task, learner: random forest), I use automated hyperparameter optimization (HPO; mtry ranging from 1 to the number of features in the data) and automated feature selection (single criterion: msr = classif.auc).
I run into this ranger error message:
'mtry can not be larger than number of variables in data. Ranger will EXIT now.'
I am fairly sure about what happens: once a subset of features has been selected, HPO tries an mtry value larger than the number of selected features, and that produces the error. If this is true, how do I set the upper limit of the mtry range in HPO for such a case (see reprex below)?
# Make data with binary outcome.
set.seed(123); n <- 500
for(i in 1:9) {
assign(paste0("x", i), rnorm(n=n, mean = 0, sd = sample(1:6,1)))
}
z <- 0 + (.02*x1) + .03*x2 - .06*x3 + .03*x4 + .1*x5 + .08*x6 + .09*x7 - .008*x8 + .045*x9
pr = 1/(1+exp(-z))
y = rbinom(n, 1, pr)
dat <- data.frame(y=factor(y), x1, x2, x3, x4, x5, x6, x7, x8, x9)
#
library(mlr3verse)
tskclassif <- TaskClassif$new(id="rangerCheck", backend=dat, target="y")
randomForest <- lrn("classif.ranger", predict_type = "prob")
# Question: How do I set the upper range limit for the mtry parameter, in order to not get the error message?
searchSpaceRANDOMFOREST <- ps(mtry=p_int(lower = 1, upper = (ncol(dat)-1)))
# Hyperparameter optimization
resamplingTuner <- rsmp("cv", folds=4)
atRANDOMFOREST <- AutoTuner$new(
learner=randomForest,
resampling = resamplingTuner,
measure = msr("classif.auc"),
search_space = searchSpaceRANDOMFOREST,
terminator = trm("evals", n_evals = 10),
tuner = tnr("random_search"))
# Feature selection
instance = FSelectInstanceSingleCrit$new(
task = tskclassif,
learner = atRANDOMFOREST,
resampling = rsmp("holdout", ratio = .8),
measure = msr("classif.auc"),
terminator = trm("evals", n_evals = 20)
)
fselector <- fs("random_search")
fselector$optimize(instance)
# Error message:
# Error: mtry can not be larger than number of variables in data. Ranger will EXIT now.
# Error in ranger::ranger(dependent.variable.name = task$target_names, data = task$data(), : User interrupt or internal error.
# This happened PipeOp classif.ranger.tuned's $train()
You should be able to use the mtry.ratio parameter (see https://mlr3learners.mlr-org.com/reference/mlr_learners_classif.ranger.html) instead of mtry. It selects the feature count dynamically during tuning, as a ratio of the currently available features, so it can never exceed the number of features the feature selector has kept.
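For example, here is a minimal sketch under the setup above (mtry.ratio is a fraction of the currently available features, which mlr3learners converts to a valid integer mtry internally):
# Tune the fraction of features instead of an absolute count, so the
# resulting mtry is always valid for whatever feature subset is selected.
searchSpaceRANDOMFOREST <- ps(mtry.ratio = p_dbl(lower = 0.1, upper = 1))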
I'm just trying to learn how to use the nlsem package in R to fit a nonlinear SEMM, but I keep running into the error "Posterior probability could not be calculated properly. Choose different starting parameters" when I try to create the res object. I'm trying to estimate a nonlinear model in which the latent variable tas predicts the latent variable cts, moderated by the latent variable ams. I'm still pretty new to R and very new to nonlinear analyses, so any help at all would be appreciated!
My code so far:
##nonlinear SEM
#Select data
FPerpSEMM<-subset(FPerp,
select=(c("tas1", "tas3", "tas6", "tas7", "tas9", "tas13","tas14", "AMSEscalate",
"AMSNegAttribution", "AMSSelfAware", "AMSCalming", "cts_5", "cts_25",
"cts_29", "cts_35", "cts_49", "cts_65", "cts_67", "cts_69")))
# Rename the indicators to the generic x/y names used in the model
# (the columns are already in the tas -> AMS -> cts order selected above)
FPerpSEMMr1 <- FPerpSEMM
names(FPerpSEMMr1) <- c(paste0("x", 1:11), paste0("y", 1:8))
#Create dataframe containing only complete cases
FPerpSEMMcc<-na.omit(FPerpSEMMr1)
# load data
dat <- as.matrix(FPerpSEMMcc[, c(12:19, 1:7, 8:11)])
# specify model of class SEMM
model<- specify_sem(num.x = 11, num.y = 8, num.xi = 2, num.eta = 1,
xi = "x1-x7,x8-x11", eta = "y1-y8",
num.classes = 3, interaction = "xi1:xi2", rel.lat = "eta1~xi1+xi2",
constraints = "direct1")
class(model)
#fit model
set.seed(911)
pars.start <- runif(count_free_parameters(model))
res <- em(model, dat, pars.start, convergence = 0.1, max.iter = 200)
summary(res)
plot(res)
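Since the error message itself suggests trying different starting parameters, one hedged option (a sketch, not a guaranteed fix; em() in nlsem is quite sensitive to starting values) is to retry the fit over several random starts and keep the first run that succeeds:
# Sketch: retry em() with different random starting values; the seeds are arbitrary.
res <- NULL
for (s in 1:20) {
  set.seed(s)
  pars.start <- runif(count_free_parameters(model))
  res <- tryCatch(em(model, dat, pars.start, convergence = 0.1, max.iter = 200),
                  error = function(e) NULL)
  if (!is.null(res)) break  # keep the first fit that converges
}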
The following script reproduces a problem equivalent to the one stated in the h2o Help example (Help -> View Example Flow, or Help -> Browse Installed Packs -> examples -> Airlines Delay.flow, download), but using the h2o R package and a fixed seed (123456):
library(h2o)
# To use all available cores
h2o.init(max_mem_size = "12g", nthreads = -1)
# switch(1, FALSE, TRUE) picks the first value (FALSE): use the remote file
IS_LOCAL_FILE = switch(1, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
# Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id="glm_model", seed=123456, training_frame=allyears2k.hex,
ignore_const_cols = T, y = response,
family="binomial", solver="IRLSM",
alpha=0.5,lambda=0.00001, lambda_search=F, standardize=T,
non_negative=F, score_each_iteration=F,
max_iterations=-1, link="family_default", intercept=T, objective_epsilon=0.00001,
beta_epsilon=0.0001, gradient_epsilon=0.0001, prior=-1, max_active_predictors=-1
)
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
This is the Confusion Matrix for the training set:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
NO YES Error Rate
NO 0 20887 1.000000 =20887/20887
YES 0 23091 0.000000 =0/23091
Totals 0 43978 0.474942 =20887/43978
And the metrics:
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2473858
RMSE: 0.4973789
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5550138
Gini: 0.1100276
R^2: 0.007965165
Residual Deviance: 60504.04
AIC: 60516.04
In contrast, the h2o Flow run performs much better; its metrics and the confusion matrix for the max-F1 threshold (shown as screenshots in the original post) are far better than those from the same algorithm run through the equivalent R-package function.
Note: For the sake of simplicity I am using the Airlines Delay problem, a well-known h2o example, but I have seen the same kind of significant difference in other, similar situations using the glm algorithm.
Any thoughts on why these significant differences occur?
Appendix A: Using default model parameters
Following the suggestion from @DarrenCook's answer, I now use only the default build parameters, except for the excluded columns and the seed:
h2o flow
Now the buildModel is invoked like this:
buildModel 'glm', {"model_id":"glm_model-default",
"seed":"123456","training_frame":"allyears2k.hex",
"ignored_columns":
["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],
"response_column":"IsDepDelayed","family":"binomial"
}
The results and training metrics appear as screenshots in the original post.
Running R-Script
The following script makes it easy to switch to the default configuration (via the IS_DEFAULT_MODEL variable) while also keeping the configuration as stated in the Airlines Delay example:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # To use all available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(
x = predictors, model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex, y = response, family = "binomial"
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
)
}
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
It produces the following results:
MSE: 0.2473859
RMSE: 0.497379
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5549898
Gini: 0.1099796
R^2: 0.007964984
Residual Deviance: 60504.04
AIC: 60516.04
Confusion Matrix (vertical: actual; across: predicted)
for F1-optimal threshold:
NO YES Error Rate
NO 0 20887 1.000000 =20887/20887
YES 0 23091 0.000000 =0/23091
Totals 0 43978 0.474942 =20887/43978
Some metrics are close, but the Confusion Matrix is quite different: the R script predicts all flights as delayed.
Appendix B: Configuration
Package: h2o
Version: 3.18.0.4
Type: Package
Title: R Interface for H2O
Date: 2018-03-08
Note: I also tested the R script under 3.19.0.4231, with the same results.
This is the cluster information after running the R code:
> h2o.init(max_mem_size = "12g", nthreads = -1)
R is connected to the H2O cluster:
H2O cluster version: 3.18.0.4
...
H2O API Extensions: Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.3 (2017-03-06)
Troubleshooting Tip: build the all-defaults model first:
mDef = h2o.glm(predictors, response, allyears2k.hex, family="binomial")
This takes 2 seconds and gives almost exactly the same AUC and confusion matrix as in your Flow screenshots.
So, we now know the problem you see is due to all the model customization you have done...
...except when I build your fit1 I get basically the same results as my default model:
NO YES Error Rate
NO 4276 16611 0.795279 =16611/20887
YES 1573 21518 0.068122 =1573/23091
Totals 5849 38129 0.413479 =18184/43978
This was using your script exactly as given, so it fetched the remote csv file. (Oh, I removed the max_mem_size argument, as I don't have 12g on this notebook!)
Assuming you can get exactly your posted results, running exactly the code you posted (and in a fresh R session, with a newly started H2O cluster), one possible explanation is that you are using 3.19.x, while the latest stable release is 3.18.0.2? (My test was with 3.14.0.1.)
Finally, I guess this is the explanation: both use the same parameter configuration for building the model (that is not the problem), but h2o Flow applies a parsing customization that converts some variables to Enum, which the R script did not specify.
The Airlines Delay problem, as specified in the h2o Flow example, uses these predictor variables (the flow marks the rest as ignored_columns):
"Year", "Month", "DayOfWeek", "UniqueCarrier",
"FlightNum", "Origin", "Dest", "Distance"
All of these predictors should be parsed as Enum, except Distance. Therefore the R script needs to convert those columns from numeric or character to factor.
Executing using the h2o R-package
Here is the updated R script:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # To use all available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = T)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime",
"ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
column.asFactor <- c("Year", "Month", "DayofMonth", "DayOfWeek",
"UniqueCarrier", "FlightNum", "Origin", "Dest", response)
# Coercing as factor (equivalent to Enum from h2o Flow)
# Note: Using lapply does not work, see the answer of this question
# https://stackoverflow.com/questions/49393343/how-to-coerce-multiple-columns-to-factors-at-once-for-h2oframe-object
for (col in column.asFactor) {
allyears2k.hex[col] <- as.factor(allyears2k.hex[col])
}
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(x = predictors, y = response,
training_frame = allyears2k.hex,
family = "binomial", seed = 123456
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T,
objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1,
max_active_predictors = -1
)
}
# Analysis
print("Confusion Matrix for training dataset")
confMatrix <- h2o.confusionMatrix(fit1)
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
Here is the result of running the R script under the default configuration (IS_DEFAULT_MODEL = T):
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2001145
RMSE: 0.4473416
LogLoss: 0.5845852
Mean Per-Class Error: 0.3343562
AUC: 0.7570867
Gini: 0.5141734
R^2: 0.1975266
Residual Deviance: 51417.77
AIC: 52951.77
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
NO YES Error Rate
NO 10337 10550 0.505099 =10550/20887
YES 3778 19313 0.163614 =3778/23091
Totals 14115 29863 0.325799 =14328/43978
Executing under h2o flow
Now, executing the flow Airlines_Delay_GLMFixedSeed, we obtain the same results. Here are the details of the flow configuration:
The parseFiles function:
parseFiles
paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
destination_frame: "allyears2k.hex"
parse_type: "CSV"
separator: 44
number_columns: 31
single_quotes: false
column_names:
["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","UniqueCarrier","FlightNum","TailNum","ActualElapsedTime",
"CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin","Dest",
"Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
"Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed",
"IsDepDelayed"]
column_types ["Enum","Enum","Enum","Enum","Numeric","Numeric",
"Numeric","Numeric", "Enum","Enum","Enum","Numeric",
"Numeric", "Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Numeric","Numeric","Numeric","Enum","Enum"]
delete_on_done: true
check_header: 1
chunk_size: 4194304
where the following predictor columns are converted to Enum: "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest"
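Equivalently, the coercion can be pushed into the import step itself; here is a hedged sketch, assuming h2o.importFile's col.types argument in its by.col.name/types list form:
# Sketch: ask the importer to parse the Enum columns directly,
# instead of coercing with as.factor() after import.
enumCols <- c("Year", "Month", "DayOfWeek", "UniqueCarrier",
              "FlightNum", "Origin", "Dest")
allyears2k.hex <- h2o.importFile(
  path = airlinesPath, destination_frame = "allyears2k.hex",
  col.types = list(by.col.name = enumCols,
                   types = rep("enum", length(enumCols))))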
Now invoking the buildModel function as follows, using the default parameters except for ignored_columns and seed:
buildModel 'glm', {"model_id":"glm_model-default","seed":"123456",
"training_frame":"allyears2k.hex",
"ignored_columns":["DayofMonth","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],"response_column":"IsDepDelayed",
"family":"binomial"}
and finally we get the following Training Output Metrics:
model glm_model-default
model_checksum -2438376548367921152
frame allyears2k.hex
frame_checksum -2331137066674151424
description ·
model_category Binomial
scoring_time 1521598137667
predictions ·
MSE 0.200114
RMSE 0.447342
nobs 43978
custom_metric_name ·
custom_metric_value 0
r2 0.197527
logloss 0.584585
AUC 0.757084
Gini 0.514168
mean_per_class_error 0.334347
residual_deviance 51417.772427
null_deviance 60855.951538
AIC 52951.772427
null_degrees_of_freedom 43977
residual_degrees_of_freedom 43211
Comparing both results
The training metrics agree to the first four significant digits:
                       R-Script     H2o Flow
MSE:                   0.2001145    0.200114
RMSE:                  0.4473416    0.447342
LogLoss:               0.5845852    0.584585
Mean Per-Class Error:  0.3343562    0.334347
AUC:                   0.7570867    0.757084
Gini:                  0.5141734    0.514168
R^2:                   0.1975266    0.197527
Residual Deviance:     51417.77     51417.772427
AIC:                   52951.77     52951.772427
The Confusion Matrix is slightly different:
           TP     TN     FP     FN    Error
R-Script   10337  19313  10550  3778  0.325799
H2o Flow   10341  19309  10546  3782  0.3258
My understanding is that the differences are within an acceptable tolerance (around 0.0001), therefore we can say that both interfaces provide the same result.
I'm trying to use XGBoost as a replacement for gbm.
The scores I'm getting are rather odd, so I'm thinking maybe I'm doing something wrong in my code.
My data contains several factor variables; all others are numeric.
The response variable is continuous, indicating a house price.
I understand that in order to use XGBoost, I need to one-hot encode those factor variables. I'm doing so by using the following code:
library(caret)  # provides dummyVars() and postResample()
Xtest <- test.data
Xtrain <- train.data
XSalePrice <- Xtrain$SalePrice
Xtrain$SalePrice <- NULL
# Combine data
Xall <- data.frame(rbind(Xtrain, Xtest))
# Get categorical features names
ohe_vars <- names(Xall)[which(sapply(Xall, is.factor))]
# Convert them
dummies <- dummyVars(~., data = Xall)
Xall_ohe <- as.data.frame(predict(dummies, newdata = Xall))
# Replace factor variables in data with OHE
Xall <- cbind(Xall[, -c(which(colnames(Xall) %in% ohe_vars))], Xall_ohe)
After that, I'm splitting the data back to the test & train set:
Xtrain <- Xall[1:nrow(train.data), ]
Xtest <- Xall[-(1:nrow(train.data)), ]
And then building a model, and printing the RMSE & Rsquared:
# Model
library(xgboost)
xgb.fit <- xgboost(data = data.matrix(Xtrain), label = XSalePrice,
booster = "gbtree", objective = "reg:linear",
colsample_bytree = 0.2, gamma = 0.0,
learning_rate = 0.05, max_depth = 6,
min_child_weight = 1.5, n_estimators = 7300,
reg_alpha = 0.9, reg_lambda = 0.5,
subsample = 0.2, seed = 42,
silent = 1, nrounds = 25)
xgb.pred <- predict(xgb.fit, data.matrix(Xtrain))
postResample(xgb.pred, XSalePrice)
Problem is I'm getting RMSE & Rsquared values that are very off:
RMSE Rsquared
1.877639e+05 5.308910e-01
These are VERY far from the results I get when using GBM.
I'm thinking I'm doing something wrong; my best guess is that it's in the One Hot Encoding phase, which I'm unfamiliar with, so I used googled code with adjustments for my data.
Can someone point out what I am doing wrong and how to fix it?
UPDATE:
After reviewing @Codutie's answer, my code still has some errors:
Xtrain <- sparse.model.matrix(SalePrice ~. , data = train.data)
XDtrain <- xgb.DMatrix(data = Xtrain, label = "SalePrice")
xgb.DMatrix produces:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
train.data is a data frame with 1453 rows, and the label SalePrice also contains 1453 values (no missing values).
Thanks
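For the record, a hedged sketch of the likely fix: label expects the numeric response vector itself, not the column name as a string, which is why a length-1 label is being compared against 1453 rows:
library(Matrix)   # for sparse.model.matrix()
library(xgboost)
Xtrain <- sparse.model.matrix(SalePrice ~ ., data = train.data)
# Pass the response vector, not the string "SalePrice"
XDtrain <- xgb.DMatrix(data = Xtrain, label = train.data$SalePrice)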
train <- dat[train_ind, ]
train.y <- train[, ncol(train)]  # the response is assumed to be the last column
# num_class applies only to multiclass objectives, so it is omitted here
xgboost(data = data.matrix(train[, -ncol(train)]),  # drop the response from the predictors
        label = train.y,
        objective = "reg:linear",
        eval_metric = "rmse",
        max.depth = 15,
        eta = 0.1,
        nround = 15,
        subsample = 0.5,
        colsample_bytree = 0.5,
        nthread = 3
)
Two clues for controlling XGB in regression:
1) eta: if eta is too large, the model tends to overfit; smaller eta values are more conservative but need more boosting rounds.
2) eval_metric: xgboost does let you supply your own evaluation metric (a sketch follows below). Note that RMSE is not a useful metric when the quantitative dependent variable contains outliers; check whether XGB supports the Huber loss function.
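On 2), here is a hedged sketch of a custom evaluation metric passed to xgb.train() via feval; MAE is used because it is less sensitive to outliers in the response than RMSE (the function name is illustrative):
library(xgboost)
# Custom evaluation metric: mean absolute error on the training labels
mae_eval <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  list(metric = "mae", value = mean(abs(preds - labels)))
}
# Usage: xgb.train(params, dtrain, nrounds = 15, feval = mae_eval)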