I have multiple lightgbm models in R for which I want to validate and extract the variable names used during the fit. This is really simple with a glm, but I can't manage to find a way (if it is even possible) to do it with lightgbm models.
Here is a reproducible example to make everything clearer:
I use the data from the lightgbm package:
library(lightgbm)
data(agaricus.train, package = "lightgbm")
I first run the basic lgbm model:
# formatting the data
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
# running the model
model_lgbm <- lgb.train(
params = params
, data = dtrain
, nrounds = 10L
, valids = valids
, min_data = 1L
, learning_rate = 1.0
, early_stopping_rounds = 5L
)
Now, I can do the same thing for a glm:
## preparing the data
dd <- data.frame(label = train$label, as(train$data, "matrix")[,1:10])
## making the model
model_glm <- glm(label ~ ., data=dd, family="binomial")
From the glm, there are lots of ways to quickly find the variables used for the modeling, for example the most obvious one:
variable.names(model_glm)
[1] "(Intercept)" "cap.shape.bell" "cap.shape.conical" "cap.shape.convex"
[5] "cap.shape.flat" "cap.shape.knobbed" "cap.shape.sunken" "cap.surface.fibrous"
[9] "cap.surface.grooves" "cap.surface.scaly"
This function is not implemented in lightgbm:
variable.names(model_lgbm)
NULL
And trying to get into the model object with str is not helpful:
str(model_lgbm)
Classes 'lgb.Booster', 'R6' <lgb.Booster>
Public:
add_valid: function (data, name)
best_iter: 3
best_score: 0
current_iter: function ()
dump_model: function (num_iteration = NULL, feature_importance_type = 0L)
eval: function (data, name, feval = NULL)
eval_train: function (feval = NULL)
eval_valid: function (feval = NULL)
finalize: function ()
initialize: function (params = list(), train_set = NULL, modelfile = NULL,
lower_bound: function ()
predict: function (data, start_iteration = NULL, num_iteration = NULL,
raw: NA
record_evals: list
reset_parameter: function (params, ...)
rollback_one_iter: function ()
save: function ()
save_model: function (filename, num_iteration = NULL, feature_importance_type = 0L)
save_model_to_string: function (num_iteration = NULL, feature_importance_type = 0L)
set_train_data_name: function (name)
to_predictor: function ()
update: function (train_set = NULL, fobj = NULL)
upper_bound: function ()
Private:
eval_names: l2
get_eval_info: function ()
handle: 8.19470876878865e-316
higher_better_inner_eval: FALSE
init_predictor: NULL
inner_eval: function (data_name, data_idx, feval = NULL)
inner_predict: function (idx)
is_predicted_cur_iter: list
name_train_set: training
name_valid_sets: list
num_class: 1
num_dataset: 2
predict_buffer: list
set_objective_to_none: FALSE
train_set: lgb.Dataset, R6
train_set_version: 1
valid_sets: list
The only way I have managed to access the variable names used is through the lgb.importance() function, but it's less than ideal: calculating variable importance can be slow for big models, and I'm not even sure it reports all the variables:
lgb.importance(model_lgbm)$Feature
[1] "odor=none" "stalk-root=club"
[3] "stalk-root=rooted" "spore-print-color=green"
[5] "odor=almond" "odor=anise"
[7] "bruises?=bruises" "stalk-surface-below-ring=scaly"
[9] "gill-size=broad" "cap-surface=grooves"
[11] "cap-shape=conical" "gill-color=brown"
[13] "cap-shape=bell" "cap-shape=flat"
[15] "cap-surface=scaly" "cap-color=white"
[17] "population=clustered"
Is there a way to access only the variable names used in the lightgbm model? Thanks.
The comment "and I'm not even sure it reports all the variables" has me a bit confused about what you're asking for when you say "variable names used during the fit", so I've answered both interpretations here.
Both answers assume this slightly-smaller version of your reproducible example.
library(lightgbm)
data(agaricus.train, package = "lightgbm")
# formatting the data
dtrain <- lgb.Dataset(
agaricus.train$data
, label = agaricus.train$label
)
data(agaricus.test, package = "lightgbm")
params <- list(
objective = "regression"
, metric = "l2"
)
# running the model
model_lgbm <- lgb.train(
params = params
, data = dtrain
, nrounds = 10L
, min_data = 1L
, learning_rate = 1.0
)
Feature Names of the Input Dataset
If you want to know the names of all features in the input dataset that was passed to LightGBM, regardless of whether or not all those columns were chosen for splits, you can examine the dumped model.
parsed_model <- jsonlite::fromJSON(
model_lgbm$dump_model()
)
parsed_model$feature_names
[1] "cap-shape=bell"
[2] "cap-shape=conical"
[3] "cap-shape=convex"
[4] "cap-shape=flat"
[5] "cap-shape=knobbed"
[6] "cap-shape=sunken"
[7] "cap-surface=fibrous"
[8] "cap-surface=grooves"
[9] "cap-surface=scaly"
[10] "cap-surface=smooth"
Features Chosen for Splits
If you want to know which features were actually used in splits chosen by LightGBM, you can use either lgb.model.dt.tree() or the feature importance example you gave above.
modelDT <- lgb.model.dt.tree(model_lgbm)
modelDT$split_feature
lgb.model.dt.tree() returns a data.table representation of the trained model. One row in the table corresponds to either one side of a split or to one leaf node. Rows that refer to a leaf node have NA for $split_feature.
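For example, a minimal sketch (using the modelDT object created above) that collapses that column to just the unique feature names actually used in splits:
# keep only rows describing splits (leaf rows have NA split_feature)
used_features <- unique(modelDT$split_feature[!is.na(modelDT$split_feature)])
used_features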
If you have suggestions for making this easier, PRs and issues are welcome at https://github.com/microsoft/LightGBM.
Related
I am trying to do k-fold cross-validation on a model that predicts the joint distribution of the proportion of tree species basal area from satellite imagery. This requires the use of the DirichletReg::DirichReg() function, which in turn requires that the response variables be prepared as a matrix using the DirichletReg::DR_data() function. I originally tried to accomplish this with the caret:: package, but I found out that caret:: does not support multivariate responses. I have since tried to implement this in the tidymodels:: suite of packages.
Following the documentation on how to register a new model in the parsnip:: package (I appreciate Max Kuhn's vegetable humor), I created a "DREG" model and a "DR" engine. My registered model works when I simply call it on a single training dataset, but my goal is to do k-fold cross-validation using vfold_cv(), a workflow(), and the fit_resamples() function. With the code I currently have I get the warning message:
Warning message:
All models failed. See the `.notes` column.
Those notes state:
Error in get(resp_char, environment(oformula)): object 'cbind(PSME, TSHE, ALRU2)' not found
This, I believe, is due to the use of DR_data() to preprocess the response variables into the format necessary for DirichletReg::DirichReg() to run properly. I think the solution I need to implement involves getting this pre-processing to happen in either the recipe() call or in the set_fit() call when I register this model with parsnip::. I have tried to use the step_mutate() function when specifying the recipe, but that performs a function on each column as opposed to applying the function with the columns as inputs. This leads to the following error in the "notes" from the output of fit_resamples():
Must subset columns with a valid subscript vector.
Subscript has the wrong type `quosures`.
It must be numeric or character.
Is there a way to get the recipe to transform several columns to a DirichletRegData class using the DR_data() function, either with a step_*() function or using the pre= argument in set_fit() and set_pred()?
Below is my reproducible example:
##Loading Necessary Packages##
library(tidymodels)
library(DirichletReg)
##Creating Fake Data##
set.seed(88)#For reproducibility
#Response variables#
PSME_BA<-rnorm(100,50, 15)
TSHE_BA<-rnorm(100,40,12)
ALRU2_BA<-rnorm(100,20,0.5)
Total_BA<-PSME_BA+TSHE_BA+ALRU2_BA
#Predictor variables#
B1<-runif(100, 0, 2000)
B2<-runif(100, 0, 1800)
B3<-runif(100, 0, 3000)
#Dataset for modeling#
DF<-data.frame(PSME=PSME_BA/Total_BA, TSHE=TSHE_BA/Total_BA, ALRU2=ALRU2_BA/Total_BA,
B1=B1, B2=B2, B3=B3)
##Modeling the data using Dirichlet regression with repeated k-folds cross validation##
#Registering the model to parsnip::#
set_new_model("DREG")
set_model_mode(model="DREG", mode="regression")
set_model_engine("DREG", mode="regression", eng="DR")
set_dependency("DREG", eng="DR", pkg="DirichletReg")
set_model_arg(
model = "DREG",
eng = "DR",
parsnip = "param",
original = "model",
func = list(pkg = "DirichletReg", fun = "DirichReg"),
has_submodel = FALSE
)
DREG <-
function(mode = "regression", param = NULL) {
# Check for correct mode
if (mode != "regression") {
rlang::abort("`mode` should be 'regression'")
}
# Capture the arguments in quosures
args <- list(sub_classes = rlang::enquo(param))
# Save some empty slots for future parts of the specification
new_model_spec(
"DREG",
args=args,
eng_args = NULL,
mode = mode,
method = NULL,
engine = NULL
)
}
set_fit(
model = "DREG",
eng = "DR",
mode = "regression",
value = list(
interface = "formula",
protect = NULL,
func = c(pkg = "DirichletReg", fun = "DirichReg"),
defaults = list()
)
)
set_encoding(
model = "DREG",
eng = "DR",
mode = "regression",
options = list(
predictor_indicators = "none",
compute_intercept = TRUE,
remove_intercept = TRUE,
allow_sparse_x = FALSE
)
)
set_pred(
model = "DREG",
eng = "DR",
mode = "regression",
type = "numeric",
value = list(
pre = NULL,
post = NULL,
func = c(fun = "predict.DirichletRegModel"),
args =
list(
object = expr(object$fit),
newdata = expr(new_data),
type = "response"
)
)
)
##Running the Model##
DF$Y<-DR_data(DF[,c(1:3)]) #Preparing the response variables
dreg_spec<-DREG(param="alternative") %>%
set_engine("DR")
dreg_mod<-dreg_spec %>%
fit(Y~B1+B2+B3, data = DF)#Model works when simply run on single dataset
##Attempting Crossvalidation##
#First attempt - simply call Y as the response variable in the recipe#
kfolds<-vfold_cv(DF, v=10, repeats = 2)
rcp<-recipe(Y~B1+B2+B3, data=DF)
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
#second attempt - use step_mutate_at()#
rcp<-recipe(~B1+B2+B3, data=DF) %>%
step_mutate_at(fn=DR_data, var=vars(PSME, TSHE, ALRU2))
dreg_fit<- workflow() %>%
add_model(dreg_spec) %>%
add_recipe(rcp)
dreg_rsmpl<-dreg_fit %>%
fit_resamples(kfolds)#Throws warning about all models failing
This works, but I'm not sure if it's what you were expecting.
First -- getting the data set up for CV and DR_data()
I don't know of any package that has built what would essentially be a translation for CV and DirichletReg. Therefore, that part is manually done. You might be surprised to find it's not all that complicated.
Using the data you created and the modeling objects you created for tidymodels (those prefixed with set_), I created the CV structure that you were trying to use.
df1 <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA,
ALRU2=ALRU2_BA/Total_BA, B1, B2, B3)
set.seed(88)
kDf2 <- kDf1 <- vfold_cv(df1, v=10, repeats = 2)
For each of the 20 subset data frames identified in kDf2, I used DR_data to set the data up for the models.
# convert to DR_data (each folds and repeats)
df2 <- map(1:20,
.f = function(x){
in_ids = kDf1$splits[[x]]$in_id
dd <- kDf1$splits[[x]]$data[in_ids, ] # filter rows BEFORE DR_data
dd$Y <- DR_data(dd[, 1:3])
kDf1$splits[[x]]$data <<- dd
})
Because I'm not all that familiar with tidymodels, I first conducted the modeling using DirichReg. I then did it again with tidymodels and compared them. (The output is identical.)
DirichReg Models and summaries of the fits
set.seed(88)
# perform crossfold validation on Dirichlet Model
df2.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = DirichReg(Y ~ B1 + B2, daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
# summary of each fitted model
fit.a <- map(1:20,
.f = function(x){
summary(df2.fit[[x]]$fit)
})
tidymodels and summaries of the fits (the code looks the same, but there are a few differences--the output is the same, though)
# I'm not sure what 'alternative' is supposed to do here?
dreg_spec <- DREG(param="alternative") %>% # this is not model = alternative
set_engine("DR")
set.seed(88)
dfa.fit <- map(1:20,
.f = function(x){
Rpt = kDf1$splits[[x]]$id$id
Fld = kDf1$splits[[x]]$id$id2
daf = kDf1$splits[[x]]$data
fit = dreg_spec %>%
fit(Y ~ B1 + B2, data = daf)
list(Rept = Rpt, Fold = Fld, fit = fit)
})
afit.a <- map(1:20,
.f = function(x){
summary(dfa.fit[[x]]$fit$fit) # extra nest for parsnip
})
If you wanted to see the first model?
fit.a[[1]]
afit.a[[1]]
If you wanted the model with the lowest AIC?
# compare AIC, BIC, and likelihood?
# what do you perceive best fit with?
fmin = min(unlist(map(1:20, ~fit.a[[.x]]$aic))) # dir
# find min AIC model number
paste0((map(1:20, ~ifelse(fit.a[[.x]]$aic == fmin, .x, ""))), collapse = "")
fit.a[[19]]
afit.a[[19]]
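A slightly more direct way to locate the lowest-AIC fit (a sketch that relies on the same $aic element used above, with purrr loaded via tidymodels) is which.min():
# index of the fit with the lowest AIC, then inspect both versions
best <- which.min(map_dbl(fit.a, "aic"))
fit.a[[best]]
afit.a[[best]]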
Below is the code which I am executing with XGBoost:
data(Glass, package = "mlbench")
levels(Glass$Type) <- c(0:5) #Proper Sequence. Should start with 0
Glass$Type <- as.integer(as.character(Glass$Type))
set.seed(100)
options(scipen = 999)
library(caret)
R_index <- createDataPartition(Glass$Type, p=.7, list = FALSE)
gl_train <- Glass[R_index,]
gl_test <- Glass[-R_index,]
'%ni%' <- Negate('%in%')
library(xgboost)
library(Matrix)
#Creating the matrix for training the model
train_gl <- xgb.DMatrix(data.matrix(gl_train[ ,colnames(gl_train) %ni% 'Type']),
label = as.numeric(gl_train$Type))
test_gl <- xgb.DMatrix(data.matrix(gl_test[ ,colnames(gl_test) %ni% 'Type']))
watchlist <- list(train = gl_train, test = gl_test)
#Define the parameters and cross validate
param <- list("objective" = "multi:softmax",
"eval_metric" = "mlogloss",
"num_class" = length(unique(gl_train$Type)))
cv.nround <- 5
cv.nfold <- 3
cvMod <- xgb.cv(param = param, data = train_gl,
nfold = cv.nfold,
nrounds = cv.nround,
watchlist=watchlist)
#Build the Model
nrounds = 50
xgMod = xgboost(param = param, data = train_gl, nrounds = nrounds, watchlist = watchlist)
After executing xgMod I am getting the error mentioned below:
Error in check.custom.obj() :
Setting objectives in 'params' and 'obj' at the same time is not allowed
Let me know what's wrong in my code.
Any help is appreciated.
Regards,
Mohan
The problem is due to the watchlist parameter passed to xgboost.
watchlist is a parameter of xgb.train but not of xgboost, hence it is treated by xgboost as one of the "other parameters" (...).
The following code
xgMod <- xgboost(param = param, data = train_gl, nrounds = nrounds)
works correctly
[1] train-mlogloss:1.259886
[2] train-mlogloss:0.963367
[3] train-mlogloss:0.755535
[4] train-mlogloss:0.601647
[5] train-mlogloss:0.478923
...
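If you do want per-round evaluation on held-out data, a minimal sketch using xgb.train() instead (assuming the objects from the question; note that the watchlist entries must be labelled xgb.DMatrix objects, not raw data frames):
# build a labelled DMatrix for the test set so the eval metric can be computed on it
test_gl_labeled <- xgb.DMatrix(data.matrix(gl_test[ ,colnames(gl_test) %ni% 'Type']),
                               label = as.numeric(gl_test$Type))
xgMod2 <- xgb.train(params = param, data = train_gl, nrounds = nrounds,
                    watchlist = list(train = train_gl, test = test_gl_labeled))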
I am trying to tune an xgboost model with a multiclass dependent variable in R. I am using MLR to do this, however I run into an error where xgboost doesn't have predict within its namespace, which I assume MLR wants to use. I have had a look online and see that other people have encountered similar issues. However, I can't entirely understand the answers that have been provided (e.g. https://github.com/mlr-org/mlr/issues/935), and when I try to implement them the issue persists. My code is as follows:
# Tune parameters
#create tasks
train$result <- as.factor(train$result) # Needs to be a factor variable for makeClass to work
test$result <- as.factor(test$result)
traintask <- makeClassifTask(data = train,target = "result")
testtask <- makeClassifTask(data = test,target = "result")
lrn <- makeLearner("classif.xgboost",predict.type = "response")
# Set learner value and number of rounds etc.
lrn$par.vals <- list(
objective = "multi:softprob", # return class with maximum probability,
num_class = 3, # There are three outcome categories
eval_metric="merror",
nrounds=100L,
eta=0.1
)
# Set parameters to be tuned
params <- makeParamSet(
makeDiscreteParam("booster",values = c("gbtree","gblinear")),
makeIntegerParam("max_depth",lower = 3L,upper = 10L),
makeNumericParam("min_child_weight",lower = 1L,upper = 10L),
makeNumericParam("subsample",lower = 0.5,upper = 1),
makeNumericParam("colsample_bytree",lower = 0.5,upper = 1)
)
# Set resampling strategy
rdesc <- makeResampleDesc("CV",stratify = T,iters=5L)
# search strategy
ctrl <- makeTuneControlRandom(maxit = 10L)
#parallelStartSocket(cpus = detectCores()) # Enable parallel processing
mytune <- tuneParams(learner = lrn
,task = traintask
,resampling = rdesc
,measures = acc
,par.set = params
,control = ctrl
,show.info = T)
The specific error I get is:
Error: 'predict' is not an exported object from 'namespace:xgboost'
My package versions are:
packageVersion("xgboost")
[1] ‘0.6.4’
packageVersion("mlr")
[1] ‘2.8’
Would anyone know what I should do here?
Thanks in advance.
I am having trouble using the XGBoost in R.
I am reading a CSV file with my data:
get_data = function()
{
#Loading Data
path = "dados_eye.csv"
data = read.csv(path)
#Dividing into two groups
train_porcentage = 0.05
train_lines = nrow(data)*train_porcentage
train = data[1:train_lines,]
test = data[train_lines:nrow(data),]
rownames(train) = c(1:nrow(train))
rownames(test) = c(1:nrow(test))
return (list("test" = test, "train" = train))
}
This function is called by main.R:
lista_dados = get_data()
#machine = train_svm(lista_dados$train)
#machine = train_rf(lista_dados$train)
machine = train_xgt(lista_dados$train)
The problem is here, in train_xgt:
train_xgt = function(train_data)
{
data_train = data.frame(train_data[,1:14])
label_train = data.frame(factor(train_data[,15]))
print(is.data.frame(data_train))
print(is.data.frame(label_train))
dtrain = xgb.DMatrix(data_train, label=label_train)
machine = xgboost(dtrain, num_class = 4 ,max.depth = 2,
eta = 1, nround = 2,nthread = 2,
objective = "binary:logistic")
return (machine)
}
This is the Error:
becchi#ubuntu:~/Documents/EEG_DATA/Dados_Eye$ Rscript main.R
[1] TRUE
[1] TRUE
Error in xgb.DMatrix(data_train, label = label_train) :
  xgb.DMatrix: does not support to construct from list
Calls: train_xgt -> xgb.DMatrix
Execution halted
becchi#ubuntu:~/Documents/EEG_DATA/Dados_Eye$
As you can see, they are both data frames.
I don't know what I am doing wrong, please help!
Just convert the data frame to a matrix first using as.matrix() and then pass it to xgb.DMatrix().
Also check that all columns hold numeric data - I think this could be because you have some column that has data stored as factors/characters, which it won't be able to convert to a numeric matrix. If you have factor variables, you can use one-hot encoding to convert them into dummy variables (see the sketch after the code below).
Try:
dtrain = xgb.DMatrix(as.matrix(sapply(data_train, as.numeric)), label=label_train)
instead of just:
dtrain = xgb.DMatrix(data_train, label=label_train)
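And if some of the predictor columns really were factors, a minimal sketch of the one-hot-encoding idea (model.matrix() is one common way; the object names are just the ones from the question):
# expand any factor columns into 0/1 dummy columns, yielding a purely numeric matrix
X <- model.matrix(~ . - 1, data = data_train)
# xgboost expects 0-based integer labels, so shift the factor codes down by one
dtrain <- xgb.DMatrix(X, label = as.numeric(label_train[[1]]) - 1)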
I am using a glmulti wrapper for glmer (binomial) and the summary is:
This is glmulti 1.0.7, Apr. 2013.
Length Class Mode
0 NULL NULL
Following what has been done in this thread (though it is for lmer), "glmulti runs indefinitely when using genetic algorithm with lme4", I get the same result as above. Could it be that the versions have changed since then and the wrapping has to be done differently? The following is the dummy code (lifted from the link above):
x = as.factor(round(runif(30),1))# dummy grouping factor
yind = runif(30,0,10) # mock dependent variable
a = runif(30) # dummy covariate
b = runif(30) # another dummy covariate
c = runif(30) # an another one
d = runif(30)
tmpdata <- data.frame(x=x,yind=yind,a=a,b=b,c=c,d=d)
lmer.glmulti <- function (formula, data, random = "", ...) {
lmer(paste(deparse(formula), random), data = data, REML=F, ...)
}
summary(glmulti(formula = yind~a*b*c*d,
data = tmpdata,
random = '+(1|x)',
level = 2,
method = 'h',
crit = 'aicc',
marginality = TRUE,
fitfunc = lmer.glmulti))
lme4 version: 1.1.5
glmulti version: 1.0.7
"R version 3.0.2 (2013-09-25)"
SOLUTION
This works:
lmer.glmulti <- function (formula, data, random, ...) {
lmer(paste(deparse(formula), random), data = data)
}
glmulti(y = yind~a*b*c*d,
data = tmpdata,
random = '+(1|x)',
level = 2,
method = 'h',
crit = 'aicc',
marginality = TRUE,
fitfunc = lmer.glmulti)
packageVersion('lme4')
‘1.1.5’
packageVersion('glmulti')
‘1.0.7’
R.version: 3.1.0
FYI: From the package maintainer:
"fitfunc must be the name of a function so your other call including the function definition in the glmulti call cannot work."
"you named the first argument to glmulti 'formula', where it must be unnamed or 'y'... Sorry. But y is a formula (if passing a string it is the dependent variable only). "