partial dependency plot using pdp package for classification xgboost - r

I use the pdp package to compute partial dependence for an xgboost regression model, and it works perfectly without any warnings. But when I switch to a classification (logistic) label for xgboost, partial() warns that the partial dependence is based on a regression model, as shown below. Do I need to revise the code somehow so that the classification xgboost object is fed in correctly and the partial dependence is right, or can I ignore the warning message because the result is already correct? I know randomForest is straightforward, without any warning messages.
# Load required packages
library(pdp)
library(xgboost)

# Simulate training data (ten million records, then subsample 500) and binarize the response
set.seed(101)
trn <- as.data.frame(mlbench::mlbench.friedman1(n = 1e+07, sd = 1))
trn <- trn[sample(nrow(trn), 500), ]
trn$y <- ifelse(trn$y > 16, 1, 0)

# Fit an XGBoost classification (logistic) model
set.seed(102)
bst <- xgboost(data = data.matrix(subset(trn, select = -y)),
               label = trn$y,
               objective = "reg:logistic",
               nrounds = 100,
               max_depth = 2,
               eta = 0.1)
# Partial dependence plot
pd <- partial(bst$handle,
              pred.var = c("x.1"),
              grid.resolution = 10,
              train = data.matrix(subset(trn, select = -y)),
              prob = TRUE,
              plot = FALSE,
              .progress = "text")
Warning message:
In superType.default(object) :
`type` could not be determined; assuming `type = "regression"`

In this case, you can safely ignore the warning; however, it did lead me to a small bug in the pdp package for which I will push a fix shortly. Thanks for reporting!
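In the meantime, a minimal sketch of one way to sidestep the type detection entirely (an illustration, not the package author's fix): supply a custom pred.fun so partial() never has to guess whether the model is regression or classification. This reuses bst and trn from the question above.

pd <- partial(bst,
              pred.var = "x.1",
              pred.fun = function(object, newdata) {
                # predict() on an xgb.Booster fit with a logistic objective
                # returns probabilities; the mean over the training rows is
                # the partial dependence value for one grid point
                mean(predict(object, newdata = data.matrix(newdata)))
              },
              grid.resolution = 10,
              train = data.matrix(subset(trn, select = -y)),
              plot = FALSE)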

Related

What's the difference between lgb.train() and lightgbm() in R?

I'm trying to build a regression model in R using LightGBM,
and I'm getting a bit confused about some functions and when/how to use them.
The first one is what I've written in the title: what's the difference between lgb.train() and lightgbm()?
The documentation (https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is a 'Simple interface for training a LightGBM model', while both return an lgb.Booster, a trained model.
One difference I've found is that lgb.train() does not work with valids =, while lightgbm() does.
The second one is about the function lgb.cv(), for cross-validation in LightGBM. How do you apply the output of lgb.cv() to a model?
As I understand from the documentation linked above, it seems the output of both lgb.cv and lgb.train is a model.
Is it correct to use it like the example below?
lgbcv <- lgb.cv(params,
                lgbtrain,
                nrounds = 1000,
                nfold = 5,
                early_stopping_rounds = 100,
                learning_rate = 1.0)
lgbcv <- lightgbm(params,
                  lgbtrain,
                  nrounds = 1000,
                  early_stopping_rounds = 100,
                  learning_rate = 1.0)
Thank you in advance!
what's the difference between lgb.train() and lightgbm()?
These functions both train a LightGBM model, they're just slightly different interfaces. The biggest difference is in how training data are prepared. LightGBM training requires a special LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.
Choose whichever method you feel has a more friendly interface...both will produce a single trained LightGBM model (class "lgb.Booster").
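As a minimal sketch of the two interfaces side by side (this uses the bundled agaricus data rather than your lgbtrain object, purely for illustration):

library(lightgbm)
data(agaricus.train, package = "lightgbm")
X <- agaricus.train$data
y <- agaricus.train$label
params <- list(objective = "regression", metric = "l2")

# lgb.train(): you construct the Dataset yourself first
dtrain <- lgb.Dataset(X, label = y)
fit1 <- lgb.train(params = params, data = dtrain, nrounds = 10L)

# lightgbm(): pass the raw matrix and labels; the Dataset is built for you
fit2 <- lightgbm(data = X, label = y, params = params, nrounds = 10L)

Both fit1 and fit2 are "lgb.Booster" objects and can be used with predict() in the same way.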
that lgb.train() does not work with valids =, while lightgbm() does.
This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
How do you apply the output of lgb.cv() to a model?
I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(
params = params
, data = dtrain
, nrounds = 5L
, nfold = 3L
, min_data = 1L
, learning_rate = 1.0
)
This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).
You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).
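A minimal sketch of that workflow, reusing params and dtrain from the example above (field names such as best_iter assume a reasonably recent lightgbm version): use lgb.cv() only to choose the number of boosting rounds, then refit on the full data with lgb.train().

cv <- lgb.cv(
  params = params
  , data = dtrain
  , nrounds = 1000L
  , nfold = 5L
  , early_stopping_rounds = 100L
)
best_n <- cv$best_iter   # boosting rounds chosen by cross-validation
final_model <- lgb.train(params = params, data = dtrain, nrounds = best_n)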

Grid tuning xgboost with missing data

It seems the expected way to grid-tune an xgboost model is via the caret package, as shown here: https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees
However, I struggle to make sense of the case with missing data. When creating the model without caret, I set missing to NA.
dtrain = xgb.DMatrix(data = data.matrix(train$data), label = train$label, missing = NA)
That allows me to create the model like so:
bst = xgboost(data = dtrain, depth = 4, eta = .3, nthread = 2,
              nround = 43, print.every.n = 5,
              objective = "binary:logistic", eval_metric = "auc", verbose = TRUE)
This works very nicely, however, caret does not take this kind of object.
This is what I'm trying:
xgbtrain = train(x = train$data, y = as.factor(make.names(train$label)),
                 trControl = trControl, tuneGrid = my_grid, method = "xgbTree")
But for every iteration it is telling me this:
Error in xgb.DMatrix(as.matrix(x), label = y) : can not open file "NA"
That's the same error message I was getting before in regular xgboost when I didn't set missing to NA. The xgb.DMatrix is not a subsettable object that I could take the data from, and it is also not possible to convert it to a data frame. How do I get around this?
EDIT
Figured it out. In the end, it had nothing to do with missing data, but with having factors in the dataset. Instead of using xgboost's function to convert to a sparse matrix, I used regular model.matrix() and was able to successfully plug the resulting matrix into caret's train function.
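A minimal sketch of that workaround (the names train_df, train_label, trControl, and my_grid here stand in for the asker's unshown data and settings):

library(caret)

# Expand factor columns into dummy variables; the result is a plain numeric matrix
x_mat <- model.matrix(~ . - 1, data = train_df)
y_fac <- as.factor(make.names(train_label))

xgbtrain <- train(x = x_mat, y = y_fac,
                  trControl = trControl, tuneGrid = my_grid, method = "xgbTree")

Note that model.matrix() drops rows containing NA by default, so genuinely missing values would still need separate handling.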

xgboost using R xgb.importance throws error

I am using xgboost package from CRAN for the first time.
Creating a model as:
bst <- xgb.train(data = dtrain, booster = "gblinear",
                 objective = "reg:linear", max.depth = 5, nround = 2, watchlist = watchlist)
importance_matrix <- xgb.importance(model = bst)
When I call xgb.importance I get an error:
Error in readLines(filename_dump) : 'con' is not a connection
Any ideas why?
xgb.importance works fine for booster = "gbtree".
I did not find any documentation on this, but it looks like xgb.importance is valid for the tree booster only.
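A minimal sketch, reusing the asker's dtrain and watchlist: the same call with the tree booster yields a model for which xgb.importance() can report split-based feature importance.

bst_tree <- xgb.train(data = dtrain, booster = "gbtree",
                      objective = "reg:linear", max_depth = 5,
                      nrounds = 2, watchlist = watchlist)
importance_matrix <- xgb.importance(model = bst_tree)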

h2oensemble Error in value[[3L]](cond) : argument "training_frame" must be a valid H2O H2OFrame or id

While trying to run the example on H2OEnsemble found on http://learn.h2o.ai/content/tutorials/ensembles-stacking/index.html from within Rstudio, I encounter the following error:
Error in value[[3L]](cond) :
argument "training_frame" must be a valid H2O H2OFrame or id
after defining the ensemble
fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train,
                    family = family,
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5, shuffle = TRUE))
I installed the latest version of both h2o and h2oEnsemble but the issue remains. I have read here ("`h2o.cbind` accepts only of H2OFrame objects - R") that the naming convention in h2o changed over time, but I assume that by installing the latest version of both this should no longer be the issue.
Any suggestions?
library(readr)
library(h2oEnsemble)  # Requires version >= 0.0.4 of h2oEnsemble
library(cvAUC)        # Used to calculate test set AUC (requires version >= 1.0.1 of cvAUC)
localH2O <- h2o.init(nthreads = -1)  # Start an H2O cluster with nthreads = num cores on your machine

# Import a sample binary outcome train/test set into R
train <- h2o.importFile("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv")
test <- h2o.importFile("http://www.stat.berkeley.edu/~ledell/data/higgs_test_5k.csv")
y <- "C1"
x <- setdiff(names(train), y)
family <- "binomial"

# For binary classification, the response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])

# Specify the base learner library & the metalearner
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
             "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.deeplearning.wrapper"

# Train the ensemble using 5-fold CV to generate level-one data
# More CV folds will take longer to train, but should increase performance
fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train,
                    family = family,
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5, shuffle = TRUE))
This bug was recently introduced by a bulk find/replace change of a class name made to the h2o R code. The change was inadvertently applied to the ensemble code folder as well (where we currently have manual instead of automatic tests -- soon to be automatic to prevent this sort of thing). I've fixed the bug.
To fix, reinstall the h2oEnsemble package from GitHub:
library(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
Thanks for the report! For a quicker response, post bugs and questions here: https://groups.google.com/forum/#!forum/h2ostream

Fatal error with train() in caret on Windows 7, R 3.0.2, caret 6.0-21

I am trying to use train() in caret to fit a classification model, but I'm hitting some kind of unhandled exception and my R session crashes before outputting any error information in the R console.
Windows error:
R for Windows terminal front-end has stopped working
I am running Windows 7, R 3.0.2, caret 6.0-21, and have tried running this on both 32/64 versions of R, in R Studio and also directly in the R console, and am getting the same results each time.
Here is my call to train:
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
data <- data.frame(predictors, diagnosis)
tuneGrid <- expand.grid(interaction.depth = 1:2, n.trees = 100, shrinkage = 0.1)
trainControl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
gbmFit <- train(diagnosis ~ ., data = data, method = "gbm", trControl = trainControl, tuneGrid = tuneGrid)
There are no more errors using this parameter grid instead:
tuneGrid <- expand.grid(interaction.depth = 1, n.trees = 100:101, shrinkage = 0.1)
However, I am still getting all NaNs in the ValidDeviance column. Is this normal?
Note: My original problem is resolved, and this is a continuation from the comments section. Formatting blocks of code in the comments section is unreadable so I'm posting it up here. This is no longer a question regarding caret, but gbm instead.
I am still having issues, however, with direct calls to gbm using a single predictor with cv.folds specified. Here is the code:
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
diagnosis <- as.numeric(diagnosis)
diagnosis[diagnosis == 1] <- 0
diagnosis[diagnosis == 2] <- 1
data <- data.frame(diagnosis, predictors[, 1])
gbmFit <- gbm(diagnosis ~ ., data = data, cv.folds = 5)
Again, this works without specifying cv.folds, but with it, it returns an error:
Error in checkForRemoteErrors(val) : 5 nodes produced errors; first error: incorrect number of dimensions
It is a bug that occurs when method = 'gbm' is used with a single model (i.e. nrow(tuneGrid) == 1). I'm about to release a new version, so I will fix this in that version.
One side note... it looks like you want to do classification. In that case, y should be a factor (and you shouldn't use bare integers as the classes); otherwise it will be doing regression. These changes will work for now:
y <- factor(paste("Class", y, sep = ""))
and
tuneGrid <- expand.grid(interaction.depth = 1,
                        n.trees = 100:101,
                        shrinkage = 0.1)
Thanks,
Max
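Putting the two fragments together, a minimal sketch of how they slot into the caret call (an illustration assembled from the advice above, not code from the answer; it reuses the 0/1 diagnosis vector and the AlzheimerDisease objects from the question):

# Factor outcome with valid R level names, so caret treats this as classification
y <- factor(paste("Class", diagnosis, sep = ""))
dat <- data.frame(predictors, y)

# Multi-row tuning grid, which avoids the single-model bug mentioned above
tuneGrid <- expand.grid(interaction.depth = 1,
                        n.trees = 100:101,
                        shrinkage = 0.1)
trainControl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)

gbmFit <- train(y ~ ., data = dat, method = "gbm",
                trControl = trainControl, tuneGrid = tuneGrid)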
