xgboost using R: xgb.importance throws error

I am using the xgboost package from CRAN for the first time. I create a model as follows:
bst <- xgb.train(data = dtrain, booster = "gblinear",
                 objective = "reg:linear", max.depth = 5, nround = 2,
                 watchlist = watchlist)
importance_matrix <- xgb.importance(model = bst)
When I call xgb.importance I get an error:
Error in readLines(filename_dump) : 'con' is not a connection
Any ideas why?

xgb.importance works fine for booster = "gbtree".
I did not find any documentation, but it looks like xgb.importance is valid for tree boosters only.
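As a possible workaround (a sketch of my own, not from the original thread): for a linear booster you can inspect the fitted coefficients directly via the model dump instead of xgb.importance.
# dump the gblinear model as text; for a linear booster this lists the
# bias followed by one weight per feature
dump <- xgb.dump(bst)
print(dump)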

readRDS error when reading an R object of ~160 MB

I'm trying to save an R object: a ridge regression model built with the R package glmnet (via caret). I'm using saveRDS to save it, and it runs without error.
saveRDS(ridge, file = 'rnaClassifer_ridgeReg.RDdata')
However, I cannot load the object back into RStudio via readRDS; it keeps giving errors and crashing the R session.
readRDS('rnaClassifer_ridgeReg.RDdata')
Note that this is an R object of 161 MB after saving as rnaClassifer_ridgeReg.RDdata (which can be downloaded from here). My laptop has 8 cores and 32 GB of RAM, which I would think is enough?
I'm also attaching the dataset (here) used to build the regression model, along with the code. Feel free to run the commands below to generate the R object ridge, and see whether you can save it and load it back into R successfully.
library(caret)
library(glmnet)
data.lm.train <- read.table('data.txt.gz', header = TRUE, sep = '\t',
                            quote = '', check.names = FALSE)
lambda <- 10^seq(-3, 3, length = 100)
### ridge regression
set.seed(666)
ridge <- train(
  dnaScore ~ ., data = data.lm.train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
Any help would be highly appreciated!
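One thing worth trying (a sketch of my own, not an answer from the thread): a caret train object keeps a copy of the training data by default, which can make the saved file very large. Dropping it before saving, or saving only the final glmnet fit, may shrink the object enough to read back cleanly. The file name below is hypothetical.
# drop the embedded training data before saving (assumption: you only
# need the fitted model for prediction, not caret's stored copy of the data)
ridge$trainingData <- NULL
saveRDS(ridge, file = 'rnaClassifer_ridgeReg_slim.RDdata')
ridge_slim <- readRDS('rnaClassifer_ridgeReg_slim.RDdata')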

Partial dependence plot using the pdp package for classification with xgboost

I use the pdp package to compute partial dependence for a linear-regression xgboost model, and it works perfectly without any warning. But when I switch xgboost to a classification (logistic) label, I get a warning saying the partial dependence is being computed as if the model were a regression (shown below). Does the code have to be revised somehow so that pdp treats the xgboost object as a classifier and the partial dependence is correct? Or can I ignore the warning because the result is already correct? I know randomForest works straightforwardly without any warning messages.
# Load required packages
library(pdp)
library(xgboost)
# Simulate training data with ten million records, then keep a sample of 500
set.seed(101)
trn <- as.data.frame(mlbench::mlbench.friedman1(n = 1e+07, sd = 1))
trn <- trn[sample(nrow(trn), 500), ]
trn$y <- ifelse(trn$y > 16, 1, 0)
# Fit an XGBoost classification (logistic) model
set.seed(102)
bst <- xgboost(data = data.matrix(subset(trn, select = -y)),
               label = trn$y,
               objective = "reg:logistic",
               nrounds = 100,
               max_depth = 2,
               eta = 0.1)
# Partial dependence plot
pd <- partial(bst$handle,
              pred.var = c("x.1"),
              grid.resolution = 10,
              train = data.matrix(subset(trn, select = -y)),
              prob = TRUE,
              plot = FALSE,
              .progress = "text")
Warning message:
In superType.default(object) :
`type` could not be determined; assuming `type = "regression"`
In this case, you can safely ignore the warning; however, it did lead me to a small bug in the pdp package, for which I will push a fix shortly. Thanks for reporting!
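For reference, a sketch of one way to avoid the warning altogether (my own assumption, not part of the original answer): pass the full xgb.Booster object rather than bst$handle, and fit with the "binary:logistic" objective, so that partial() can read the objective from the model and treat it as classification.
# refit with an explicitly binary objective and pass the whole model object
bst2 <- xgboost(data = data.matrix(subset(trn, select = -y)),
                label = trn$y,
                objective = "binary:logistic",
                nrounds = 100, max_depth = 2, eta = 0.1)
pd2 <- partial(bst2, pred.var = "x.1", grid.resolution = 10,
               train = data.matrix(subset(trn, select = -y)),
               prob = TRUE, plot = FALSE)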

Why doesn't the early.stop.round argument in xgboost work?

I am trying to use the early.stop.round argument of the xgb.cv function in the xgboost library; however, I get an error. If I leave early.stop.round unspecified, the function runs without any problem. What did I do wrong?
Here is my example code:
library(xgboost)
train <- matrix(as.numeric(1:100), 20, 5)
Y <- rep(c(0, 1), 10)
dtrain <- xgb.DMatrix(train, label = Y)
# cross-validation with early.stop.round = 5 gives an error
CV <- xgb.cv(data = dtrain, nround = 200, nfold = 2, metrics = list("auc"),
             objective = "binary:logistic", early.stop.round = 5)
# cross-validation without early.stop.round works
CV <- xgb.cv(data = dtrain, nround = 200, nfold = 2, metrics = list("auc"),
             objective = "binary:logistic")
I am using xgboost_0.4-2.
It looks like something goes wrong when the metrics parameter and early stopping are used simultaneously. Remove metrics and use eval_metric = "auc" together with early stopping instead.
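Applying that suggestion to the failing call above (my own rewrite, following the answer):
# drop `metrics` and pass eval_metric = "auc" instead; early stopping then works
CV <- xgb.cv(data = dtrain, nround = 200, nfold = 2,
             eval_metric = "auc",
             objective = "binary:logistic",
             early.stop.round = 5)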

Fatal error with train() in caret on Windows 7, R 3.0.2, caret 6.0-21

I am trying to use train() in caret to fit a classification model, but I'm hitting some kind of unhandled exception and my R session crashes before outputting any error information in the R console.
Windows error:
R for Windows terminal front-end has stopped working
I am running Windows 7, R 3.0.2, caret 6.0-21, and have tried running this on both 32/64 versions of R, in R Studio and also directly in the R console, and am getting the same results each time.
Here is my call to train:
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
data <- data.frame(predictors, diagnosis)
tuneGrid <- expand.grid(interaction.depth = 1:2, n.trees = 100, shrinkage = 0.1)
trainControl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
gbmFit <- train(diagnosis ~ ., data = data, method = "gbm", trControl = trainControl, tuneGrid = tuneGrid)
There are no more errors using this parameter grid instead:
tuneGrid <- expand.grid(interaction.depth = 1, n.trees = 100:101, shrinkage = 0.1)
However, I am still getting all NaNs in the ValidDeviance column. Is this normal?
Note: My original problem is resolved; this is a continuation from the comments section. Formatting blocks of code in the comments is unreadable, so I'm posting it here. This is no longer a question about caret, but about gbm.
I am still having issues, however, with direct calls to gbm using a single predictor with cv.folds specified. Here is the code:
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
diagnosis <- as.numeric(diagnosis)
diagnosis[diagnosis == 1] <- 0
diagnosis[diagnosis == 2] <- 1
data <- data.frame(diagnosis, predictors[, 1])
gbmFit <- gbm(diagnosis ~ ., data = data, cv.folds = 5)
Again, this works without specifying cv.folds but with it, returns an error:
Error in checkForRemoteErrors(val) : 5 nodes produced errors; first error: incorrect number of dimensions
It is a bug that occurs when method = "gbm" is used with a single model (i.e., nrow(tuneGrid) == 1). I'm about to release a new version, so I will fix this in that version.
One side note... it looks like you want to do classification. In that case, y should be a factor (and you shouldn't use bare integers as the classes); otherwise it will be doing regression. These changes will work for now:
y <- factor(paste("Class", y, sep = ""))
and
tuneGrid <- expand.grid(interaction.depth = 1,
                        n.trees = 100:101,
                        shrinkage = 0.1)
Thanks,
Max
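Putting the two suggestions together with the original example (my own assembly, not verbatim from the thread):
# recode the 0/1 outcome as a factor with non-integer labels so that
# train() performs classification rather than regression
diagnosis <- factor(paste("Class", diagnosis, sep = ""))
data <- data.frame(diagnosis, predictors)
tuneGrid <- expand.grid(interaction.depth = 1,
                        n.trees = 100:101,
                        shrinkage = 0.1)
gbmFit <- train(diagnosis ~ ., data = data, method = "gbm",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = tuneGrid)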

Caret and GBM Errors

I am attempting to use the caret package in R for several nested cross-validation processes with user-defined performance metrics. I have had all kinds of problems, so I pulled back to see if there were issues with a more out-of-the-box use of caret, and it seems I have run into one.
If I run the following:
install.packages("caret")
install.packages("gbm")
library(caret)
library(gbm)
data(GermanCredit)
GermanCredit$Class<-ifelse(GermanCredit$Class=='Bad',1,0)
gbmGrid <- expand.grid(.interaction.depth = 1,
.n.trees = 150,
.shrinkage = 0.1)
gbmMOD <- train(Class~., data=GermanCredit
,method = "gbm",
tuneGrid= gbmGrid,
distribution="bernoulli",
bag.fraction = 0.5,
train.fraction = 0.5,
n.minobsinnode = 10,
cv.folds = 1,
keep.data=TRUE,
verbose=TRUE
)
I get the error (or similar):
Error in { :
task 1 failed - "arguments imply differing number of rows: 619, 381"
with warnings:
1: In eval(expr, envir, enclos) :
model fit failed for Resample01: interaction.depth=1, n.trees=150, shrinkage=0.1
But if I run just the gbm routine directly, everything finishes fine.
gbm1 <- gbm(Class ~ ., data = GermanCredit,
            distribution = "bernoulli",
            n.trees = 150,          # number of trees
            shrinkage = 0.10,
            interaction.depth = 1,
            bag.fraction = 0.5,
            train.fraction = 0.5,
            n.minobsinnode = 10,
            cv.folds = 1,
            keep.data = TRUE,
            verbose = TRUE)
There were two issues. First, passing cv.folds causes a problem. Second, you don't need to convert the outcome to a binary number; doing so makes train think it is a regression problem. The idea behind the train function is to smooth out the inconsistencies between the modeling functions, so we use factors for classification and numbers for regression.
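A sketch of the corrected call based on that answer (my own assembly): keep Class as a factor and drop cv.folds, since caret handles resampling itself. I also leave out train.fraction here, an assumption on my part, so caret alone controls the train/holdout split.
data(GermanCredit)   # reload so Class stays a factor with levels "Bad"/"Good"
gbmMOD <- train(Class ~ ., data = GermanCredit,
                method = "gbm",
                tuneGrid = gbmGrid,
                distribution = "bernoulli",
                bag.fraction = 0.5,
                n.minobsinnode = 10,
                verbose = TRUE)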
Just a note: although this issue was caused by the reason described in the answer, the error message (given below) may also occur with older versions of caret and gbm. I encountered this error, and after spending a lot of time trying to figure out the issue, it turned out that I had to upgrade to the most recent versions of caret (5.17-7) and gbm (2.1-0.1). These are the most recent versions on CRAN as of today.
Error in { :
task 1 failed - "arguments imply differing number of rows: ...
