Object p not found when running gbm() - r

I am aware of the question GBM: Object 'p' not found; however, it did not contain enough information for anyone to answer it. I don't believe this is a duplicate: I have followed what was indicated in that question and in its linked duplicate, Error in R gbm function when cv.folds > 0, which does not describe the same error.
I have made sure to follow the recommendation of leaving out any columns that were not used in the model.
This error appears when cv.folds is greater than 0:
object 'p' not found
From what I can see, setting cv.folds to 0 does not produce meaningful outputs. I have attempted different distributions, fractions, trees, etc. I'm confident I've parameterized something incorrectly, but I can't for the life of me see what it is.
Model and output:
model_output <- gbm(formula = ign ~ .,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 9),
                    data = model_sample,
                    train.fraction = 0.50,
                    n.cores = 1,
                    n.trees = 150,
                    cv.folds = 1,
                    keep.data = TRUE,
                    verbose = TRUE)
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.1000 nan
2 nan nan 0.1000 nan
3 nan nan 0.1000 nan
4 nan nan 0.1000 nan
5 nan nan 0.1000 nan
6 nan nan 0.1000 nan
7 nan nan 0.1000 nan
8 nan nan 0.1000 nan
9 nan nan 0.1000 nan
10 nan nan 0.1000 nan
20 nan nan 0.1000 nan
40 nan nan 0.1000 nan
60 nan nan 0.1000 nan
80 nan nan 0.1000 nan
100 nan nan 0.1000 nan
120 nan nan 0.1000 nan
140 nan nan 0.1000 nan
150 nan nan 0.1000 nan
The minimal data needed to reproduce the error used to be posted here; however, once the suggestion by @StupidWolf is applied, that data set is too small. The suggestion below gets past the initial error. Subsequent errors are occurring, and solutions will be posted here as they are found.

gbm is not meant to deal with the situation where someone sets cv.folds = 1. By definition, k-fold cross-validation means splitting the data into k parts, training on k - 1 parts and testing on the remaining part, so it is not clear what 1-fold cross-validation would even mean. If you look at the code for gbm, at line 437:
if (cv.folds > 1) {
    cv.results <- gbmCrossVal(cv.folds = cv.folds, nTrain = nTrain,
    ....
    p <- cv.results$predictions
}
This is where the cross-validation predictions are computed. When the results are collected into the gbm object, at line 471:
if (cv.folds > 0) {
    gbm.obj$cv.fitted <- p
}
So if cv.folds == 1, p is never calculated, but cv.folds is still > 0, hence the error.
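To make the control flow concrete, here is a minimal sketch (paraphrased, not the actual gbm source) of how the two checks interact:
# Paraphrased sketch of the two if-blocks inside gbm()
cv.folds <- 1
if (cv.folds > 1) {
    p <- "cross-validation predictions"   # p is only ever assigned in this branch
}
if (cv.folds > 0) {
    cv.fitted <- p   # cv.folds == 1 reaches this line without p ever existing
}
# Error: object 'p' not found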
Below is a reproducible example:
library(MASS)
library(gbm)
test <- Pima.tr
test$type <- as.numeric(test$type) - 1
model_output <- gbm(type ~ .,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 7),
                    data = test,
                    train.fraction = 0.5,
                    n.cores = 1,
                    n.trees = 30,
                    cv.folds = 1,
                    keep.data = TRUE,
                    verbose = TRUE)
gives me the error object 'p' not found
Set cv.folds = 2 and it runs smoothly:
model_output <- gbm(type ~ .,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 7),
                    data = test,
                    train.fraction = 0.5,
                    n.cores = 1,
                    n.trees = 30,
                    cv.folds = 2,
                    keep.data = TRUE,
                    verbose = TRUE)
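A side note that is not part of the original answer: if the goal is simply a single train/validation split rather than cross-validation, leaving cv.folds at its default of 0 and relying on train.fraction should also avoid the error. A sketch, reusing the Pima.tr example above:
# No cross-validation at all; the 50% holdout from train.fraction
# still supplies the ValidDeviance column in the trace.
model_output <- gbm(type ~ .,
                    distribution = "bernoulli",
                    var.monotone = rep(0, 7),
                    data = test,
                    train.fraction = 0.5,
                    n.cores = 1,
                    n.trees = 30,
                    cv.folds = 0,
                    keep.data = TRUE,
                    verbose = TRUE)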

Related

twang - Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays

I'm trying to build propensity scores with the twang package, but I keep getting this error:
Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays
I'm attaching the code:
ps.TPSV.gbm = ps(Cardioversione ~ Sesso + age,
                 data = prova)
Fitting boosted model
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.6590 nan 0.0100 nan
2 0.6581 nan 0.0100 nan
3 0.6572 nan 0.0100 nan
4 0.6564 nan 0.0100 nan
5 0.6556 nan 0.0100 nan
6 0.6548 nan 0.0100 nan
7 0.6540 nan 0.0100 nan
8 0.6533 nan 0.0100 nan
9 0.6526 nan 0.0100 nan
...
9900 0.4164 nan 0.0100 nan
9920 0.4161 nan 0.0100 nan
9940 0.4160 nan 0.0100 nan
9960 0.4158 nan 0.0100 nan
9980 0.4157 nan 0.0100 nan
10000 0.4155 nan 0.0100 nan
Diagnosis of unweighted analysis
Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays
I honestly don't understand what the problem is. The variables are one factor (Sesso) and one numeric (age), and there are no missing values. Could anyone help me?
Thank you in advance.
I've already tried changing the variables included in the propensity score model, but there is no way around the error. I checked whether the example code works with the lalonde dataset included in twang, and it runs fine.

Question: Problems improving a GBM (gradient boosting) model; nan's generated by the gbm function in R even though train.fraction is correctly set to < 1

I am trying to test different parameters in R's gbm function in order to make predictions with my data. I have a huge table of 79,866 rows and 1,586 columns, where the columns are counts of motifs in the DNA and the rows indicate different regions/positions in the DNA and the organism to which the counts belong. There are only 3 organisms, but the counts are separated by position (peakid).
data looks like this:
chrII:11889760_11890077 worm 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
Because I have memory problems that I don't yet know how to solve (due to the size of the table), I am using a subset of the data:
library(rsample)  # provides initial_split() and training()

motifs.table.sub <- motifs.table[1:1000, 1:1000]
set.seed(123)
motifs_split.sub <- initial_split(motifs.table.sub, prop = .7)
motifs_train.sub <- training(motifs_split.sub)
I create a grid of different parameter combinations to test:
hyper_grid <- expand.grid(
  shrinkage = c(.01, .1, .3),
  interaction.depth = c(1, 3, 5),
  n.minobsinnode = c(5, 10, 15),
  bag.fraction = c(.65, .8, 1),
  optimal_trees = 0,
  min_RMSE = 0
)
Then I randomize the training data:
random_index.sub <- sample(1:nrow(motifs_train.sub), nrow(motifs_train.sub))
random_motifs_train.sub <- motifs_train.sub[random_index.sub, ]
Then I test the different parameter combinations with 1000 trees:
for (i in 1:nrow(hyper_grid)) {
  set.seed(123)
  gbm.tune <- gbm(
    formula = organism ~ .,
    distribution = "gaussian", # default
    data = random_motifs_train.sub,
    n.trees = 1000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    bag.fraction = hyper_grid$bag.fraction[i],
    train.fraction = 0.70,
    n.cores = NULL,
    verbose = TRUE  # the original had "verbose = V", but V is never defined
  )
  print(head(gbm.tune$valid.error))
}
The problem is that the model never improves:
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0100 nan
2 nan nan 0.0100 nan
3 nan nan 0.0100 nan
4 nan nan 0.0100 nan
5 nan nan 0.0100 nan
6 nan nan 0.0100 nan
7 nan nan 0.0100 nan
8 nan nan 0.0100 nan
9 nan nan 0.0100 nan
10 nan nan 0.0100 nan
And values like valid.error are not calculated; they remain NA. I tried changing the size of the data subset and the same thing happens with every parameter combination I test. My data table is huge and has a lot of zeros; I thought of removing motifs with low counts, but I don't think that would help since only 127 motifs out of 1,586 have fewer than 10 counts.
Apparently TrainDeviance is nan if train.fraction is not < 1: Gradient Boosting using gbm in R with distribution = "bernoulli"
But that is not my case; I set train.fraction to 0.7.
Any ideas about what I am doing wrong?
Thanks!
PS: I am following this tutorial: http://uc-r.github.io/gbm_regression

How to stop printing for "ps" function in "twang" package?

The "ps" function (propensity score estimation) in "twang" package in R keeps printing its report. How can I turn that off?
I already tried to set the "print.level" argument to be 0. But it is not working for me.
library(twang)

D <- rbinom(100, size = 1, prob = 0.5)
X1 <- rnorm(100)
X2 <- rnorm(100)
ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
   estimand = "ATE", print.level = 0)
I hope there is no printing of the process, but it keeps giving me something like:
Fitting gbm model
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3040 nan 0.0100 nan
2 1.3012 nan 0.0100 nan
3 1.2985 nan 0.0100 nan
4 1.2959 nan 0.0100 nan
5 1.2932 nan 0.0100 nan
6 1.2907 nan 0.0100 nan
7 1.2880 nan 0.0100 nan
8 1.2855 nan 0.0100 nan
9 1.2830 nan 0.0100 nan
10 1.2804 nan 0.0100 nan
20 1.2562 nan 0.0100 nan
.....
which is annoying.
Presumably you want to capture the result in a variable; if you combine that with the verbose = FALSE parameter, it should do what you need:
res <- ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
          estimand = "ATE", print.level = 0, verbose = FALSE)
I haven't tested whether you still need print.level = 0.

How to suppress iteration output from Boosted tree model gbm in Caret from R studio

If I run this code to train a gbm model with knitr, I get several pages of Iter output like the excerpt copied below. Is there a way to suppress this output?
mod_gbm <- train(classe ~ ., data = TrainSet, method = "gbm")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1322
## 2 1.5210 nan 0.1000 0.0936
## 3 1.4608 nan 0.1000 0.0672
## 4 1.4165 nan 0.1000 0.0561
## 5 1.3793 nan 0.1000 0.0441
Thank you!
Try passing train the argument trace = FALSE.
This parameter is not defined explicitly in the train documentation because it is part of the ... optional parameters.
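If trace = FALSE does not take effect for method = "gbm", a commonly used alternative is to pass verbose = FALSE through the same ... mechanism, since verbose is the flag gbm's own fitting routine understands; treat this as an assumption on my part rather than part of the original answer. A sketch, using the TrainSet and classe objects from the question:
library(caret)

# Extra arguments to train() are forwarded to the underlying fitting call,
# so the suppression flag reaches gbm itself.
mod_gbm <- train(classe ~ ., data = TrainSet, method = "gbm",
                 verbose = FALSE)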

How can I offset exposures in a gbm model in R?

I am trying to fit a gradient boosting machine (GBM) to insurance claims. The observations have unequal exposure, so I am trying to use an offset equal to the log of the exposures. I tried two different ways:
1. Put an offset term in the formula. This resulted in nan for the train and validation deviance at every iteration.
2. Use the offset parameter in the gbm function. This parameter is listed under gbm.more. It results in an error message about an unused argument.
I can't share my company's data but I reproduced the problem using the Insurance data table in the MASS package. See the code and output below.
library(MASS)
library(gbm)
data(Insurance)

# Try using offset in the formula.
fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))
fitgbm1 = gbm(fm1, distribution = "poisson",
              data = Insurance,
              n.trees = 10,
              shrinkage = 0.1,
              verbose = TRUE)

# Try using offset in the gbm statement.
fm2 = formula(Claims ~ District + Group + Age)
offset2 = log(Insurance$Holders)
fitgbm2 = gbm(fm2, distribution = "poisson",
              data = Insurance,
              n.trees = 10,
              shrinkage = 0.1,
              offset = offset2,
              verbose = TRUE)
This then outputs:
> source('D:/Rprojects/auto_tutorial/rcode/example_gbm.R')
Iter TrainDeviance ValidDeviance StepSize Improve
1 -347.8959 nan 0.1000 0.0904
2 -348.2181 nan 0.1000 0.0814
3 -348.3845 nan 0.1000 0.0616
4 -348.5424 nan 0.1000 0.0333
5 -348.6732 nan 0.1000 0.0850
6 -348.7744 nan 0.1000 0.0610
7 -348.8795 nan 0.1000 0.0633
8 -348.9132 nan 0.1000 -0.0109
9 -348.9200 nan 0.1000 -0.0212
10 -349.0271 nan 0.1000 0.0267
Error in gbm(fm2, distribution = "poisson", data = Insurance, n.trees = 10, :
unused argument (offset = offset2)
My question is what am I doing wrong? Also, is there another way? I noticed a weights parameter in the gbm function. Should I use that?
Your first suggestion works if you specify a training fraction less than 1. The default is 1, which means there is no validation set.
library(MASS)
library(gbm)
data(Insurance)

# Try using offset in the formula.
fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))
fitgbm1 = gbm(fm1, distribution = "poisson",
              data = Insurance,
              n.trees = 10,
              shrinkage = 0.1,
              verbose = TRUE,
              train.fraction = .75)
results in
Iter TrainDeviance ValidDeviance StepSize Improve
1 -428.8293 -105.1735 0.1000 0.0888
2 -429.0869 -105.3063 0.1000 0.0708
3 -429.1805 -105.3941 0.1000 0.0486
4 -429.3414 -105.4816 0.1000 0.0933
5 -429.4934 -105.5432 0.1000 0.0566
6 -429.6714 -105.5188 0.1000 0.1212
7 -429.8470 -105.5200 0.1000 0.0833
8 -429.9655 -105.6073 0.1000 0.0482
9 -430.1367 -105.6003 0.1000 0.0473
10 -430.2462 -105.6100 0.1000 0.0487
