How can I offset exposures in a gbm model in R?

I am trying to fit a gradient boosting machine (GBM) to insurance claims. The observations have unequal exposure so I am trying to use an offset equal to the log of exposures. I tried two different ways:
1. Put an offset term in the formula. This resulted in nan for the train and validation deviance for every iteration.
2. Use the offset parameter in the gbm function. This parameter is listed under gbm.more. This results in an error message that there is an unused parameter.
I can't share my company's data but I reproduced the problem using the Insurance data table in the MASS package. See the code and output below.
library(MASS)
library(gbm)
data(Insurance)
# Try using offset in the formula.
fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))
fitgbm1 = gbm(fm1, distribution = "poisson",
data = Insurance,
n.trees = 10,
shrinkage = 0.1,
verbose = TRUE)
# Try using offset in the gbm statement.
fm2 = formula(Claims ~ District + Group + Age)
offset2 = log(Insurance$Holders)
fitgbm2 = gbm(fm2, distribution = "poisson",
data = Insurance,
n.trees = 10,
shrinkage = 0.1,
offset = offset2,
verbose = TRUE)
This then outputs:
> source('D:/Rprojects/auto_tutorial/rcode/example_gbm.R')
Iter TrainDeviance ValidDeviance StepSize Improve
1 -347.8959 nan 0.1000 0.0904
2 -348.2181 nan 0.1000 0.0814
3 -348.3845 nan 0.1000 0.0616
4 -348.5424 nan 0.1000 0.0333
5 -348.6732 nan 0.1000 0.0850
6 -348.7744 nan 0.1000 0.0610
7 -348.8795 nan 0.1000 0.0633
8 -348.9132 nan 0.1000 -0.0109
9 -348.9200 nan 0.1000 -0.0212
10 -349.0271 nan 0.1000 0.0267
Error in gbm(fm2, distribution = "poisson", data = Insurance, n.trees = 10, :
unused argument (offset = offset2)
My question is what am I doing wrong? Also, is there another way? I noticed a weights parameter in the gbm function. Should I use that?

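Your first approach already works: in your output the TrainDeviance column is computed and only ValidDeviance is nan. That happens because train.fraction defaults to 1, which means there is no validation set. Specify a training fraction less than 1 and the validation deviance is reported as well.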
library(MASS)
library(gbm)
data(Insurance)
# Try using offset in the formula.
fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))
fitgbm1 = gbm(fm1, distribution = "poisson",
data = Insurance,
n.trees = 10,
shrinkage = 0.1,
verbose = TRUE,
train.fraction = .75)
results in
Iter TrainDeviance ValidDeviance StepSize Improve
1 -428.8293 -105.1735 0.1000 0.0888
2 -429.0869 -105.3063 0.1000 0.0708
3 -429.1805 -105.3941 0.1000 0.0486
4 -429.3414 -105.4816 0.1000 0.0933
5 -429.4934 -105.5432 0.1000 0.0566
6 -429.6714 -105.5188 0.1000 0.1212
7 -429.8470 -105.5200 0.1000 0.0833
8 -429.9655 -105.6073 0.1000 0.0482
9 -430.1367 -105.6003 0.1000 0.0473
10 -430.2462 -105.6100 0.1000 0.0487
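On the second attempt: the "unused argument" error arises because gbm() itself has no offset argument. A minimal sketch of one alternative, assuming your installed gbm version exposes gbm.fit() with an offset parameter (check ?gbm.fit), passes the log exposure directly. The object name fitgbm3 and the seed are illustrative only.

```r
library(MASS)
library(gbm)
data(Insurance)

set.seed(1)  # illustrative; bagging makes the fit random

# gbm.fit() takes predictors and response separately instead of a formula.
x <- Insurance[, c("District", "Group", "Age")]
y <- Insurance$Claims

# Assumption: gbm.fit() accepts `offset` on the link (log) scale for Poisson.
fitgbm3 <- gbm.fit(x = x, y = y,
                   offset = log(Insurance$Holders),
                   distribution = "poisson",
                   n.trees = 10,
                   shrinkage = 0.1,
                   verbose = FALSE)
```

As for weights: that argument scales each observation's contribution to the loss, so it is not a drop-in replacement for an exposure offset.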

Related

Object p not found when running gbm()

I am aware of the question GBM: Object 'p' not found; however, it did not contain sufficient information to allow the community to answer. I don't believe this is a duplicate, as I've followed what was indicated in that question and in the linked duplicate, Error in R gbm function when cv.folds > 0, which does not describe the same error.
I have been sure to follow the recommendation of leaving out any columns that were not used in the model.
This error appears when the cv.folds is greater than 0:
object 'p' not found
From what I can see, setting cv.folds to 0 is not producing meaningful outputs. I have attempted different distributions, fractions, trees etc. I'm confident I've parameterized something incorrectly, but I can't for the life of me see what it is.
Model and output:
model_output <- gbm(formula = ign ~ . ,
distribution = "bernoulli",
var.monotone = rep(0,9),
data = model_sample,
train.fraction = 0.50,
n.cores = 1,
n.trees = 150,
cv.folds = 1,
keep.data = T,
verbose=T)
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.1000 nan
2 nan nan 0.1000 nan
3 nan nan 0.1000 nan
4 nan nan 0.1000 nan
5 nan nan 0.1000 nan
6 nan nan 0.1000 nan
7 nan nan 0.1000 nan
8 nan nan 0.1000 nan
9 nan nan 0.1000 nan
10 nan nan 0.1000 nan
20 nan nan 0.1000 nan
40 nan nan 0.1000 nan
60 nan nan 0.1000 nan
80 nan nan 0.1000 nan
100 nan nan 0.1000 nan
120 nan nan 0.1000 nan
140 nan nan 0.1000 nan
150 nan nan 0.1000 nan
The minimal data to reproduce the error used to be here; however, once the suggestion by @StupidWolf is applied, that data set is too small. The suggestion below will get past the initial error. Subsequent errors are occurring, and solutions will be posted here upon discovery.
gbm is not meant to handle cv.folds = 1. By definition, k-fold cross-validation splits the data into k parts, trains on k - 1 of them, and tests on the remaining part, so 1-fold cross-validation is not well defined. If you look at the code for gbm, at line 437:
if(cv.folds > 1) {
cv.results <- gbmCrossVal(cv.folds = cv.folds, nTrain = nTrain,
....
p <- cv.results$predictions
}
That block makes the cross-validation predictions. The results are then collected into the gbm object at line 471:
if (cv.folds > 0) {
gbm.obj$cv.fitted <- p
}
So if cv.folds == 1, the first condition is false and p is never calculated, but the second condition (cv.folds > 0) is still true, hence the error.
Below is a reproducible example:
library(MASS)
library(gbm)
test = Pima.tr
test$type = as.numeric(test$type)-1
model_output <- gbm(type~ . ,
distribution = "bernoulli",
var.monotone = rep(0,7),
data = test,
train.fraction = 0.5,
n.cores = 1,
n.trees = 30,
cv.folds = 1,
keep.data = TRUE,
verbose=TRUE)
This gives me the error object 'p' not found.
Set cv.folds = 2 and it runs smoothly:
model_output <- gbm(type~ . ,
distribution = "bernoulli",
var.monotone = rep(0,7),
data = test,
train.fraction = 0.5,
n.cores = 1,
n.trees = 30,
cv.folds = 2,
keep.data = TRUE,
verbose=TRUE)
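The mismatch between the two conditions can be reproduced without gbm. The following is a toy sketch in base R that only mirrors the control flow quoted above; collect_cv is a hypothetical name and the body is not gbm's actual code.

```r
# Toy stand-in for the two guarded blocks quoted from gbm.
collect_cv <- function(cv.folds) {
  if (cv.folds > 1) {
    p <- seq_len(cv.folds)   # placeholder for the CV predictions
  }
  out <- list()
  if (cv.folds > 0) {
    out$cv.fitted <- p       # fails when cv.folds == 1: p was never created
  }
  out
}

# cv.folds = 1 hits the second block without the first having run.
msg <- tryCatch(collect_cv(1), error = function(e) conditionMessage(e))
print(msg)

# cv.folds = 0 and cv.folds = 2 both pass.
ok0 <- collect_cv(0)
ok2 <- collect_cv(2)
```

In practice this means using cv.folds = 0 (no cross-validation) or cv.folds >= 2.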

How to stop printing for "ps" function in "twang" package?

The "ps" function (propensity score estimation) in "twang" package in R keeps printing its report. How can I turn that off?
I already tried setting the "print.level" argument to 0, but that does not work for me.
D = rbinom(100, size = 1, prob = 0.5)
X1 = rnorm(100)
X2 = rnorm(100)
ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
estimand = "ATE", print.level = 0)
I expected no progress output, but it keeps printing something like:
Fitting gbm model
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3040 nan 0.0100 nan
2 1.3012 nan 0.0100 nan
3 1.2985 nan 0.0100 nan
4 1.2959 nan 0.0100 nan
5 1.2932 nan 0.0100 nan
6 1.2907 nan 0.0100 nan
7 1.2880 nan 0.0100 nan
8 1.2855 nan 0.0100 nan
9 1.2830 nan 0.0100 nan
10 1.2804 nan 0.0100 nan
20 1.2562 nan 0.0100 nan
.....
which is annoying.
Presumably you want to capture the result in a variable; if you combine that with the verbose = FALSE parameter, it should do what you need:
res <- ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
estimand = "ATE", print.level = 0, verbose = FALSE)
I haven't tested whether you still need print.level = 0.

Interpretation of AUC NaN values in h2o cross-validation predictions summary

I have noticed that for some runs of:
train=as.h2o(u)
mod = h2o.glm(family= "binomial", x= c(1:15), y="dc",
training_frame=train, missing_values_handling = "Skip",
lambda = 0, compute_p_values = TRUE, nfolds = 10,
keep_cross_validation_predictions= TRUE)
there are NaNs in cross-validation metrics summary of AUC for some cv iterations of the model.
For example:
print(mod@model$cross_validation_metrics_summary["auc",])
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
auc 0.63244045 0.24962118 0.25 0.6666667 0.8095238 1.0 0.6666667 0.46666667 NaN NaN 1.0 0.2
NaNs seem to appear less frequently when I set a smaller number of folds, e.g. nfolds = 7.
How should these NaN values be interpreted, and when does h2o cross-validation output them?
I suppose it happens when AUC can't be assessed correctly in an iteration. My training set has 70 complete rows.
Can such AUC cross-validation results (containing NaNs) be considered reliable?
There are specific cases that cause a division by zero when calculating the ROC curve, which makes the AUC NaN. With so little data, it is likely that some of your folds contain no positive cases (dc = 1), and that is what causes the issue.
We can test this by keeping the fold assignment and then counting the values of dc in each fold:
...
train <- as.h2o(u)
mod <- h2o.glm(family = "binomial"
, x = c(1:15)
, y = "dc"
, training_frame = train
, missing_values_handling = "Skip"
, lambda = 0
, compute_p_values = TRUE
, nfolds = 10
, keep_cross_validation_fold_assignment = TRUE
, seed = 1234)
fold <- as.data.frame(h2o.cross_validation_fold_assignment(mod))
df <- cbind(u,fold)
table(df[c("dc","fold_assignment")])
   fold_assignment
dc  0 1 2 3 4 5 6 7 8 9
  0 4 6 6 2 9 6 6 4 4 6
  1 2 2 3 4 0 2 0 0 1 2
mod@model$cross_validation_metrics_summary["auc",]
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid
auc 0.70238096 0.19357596 0.875 0.6666667 0.5 0.375 NaN 0.5833333 NaN
cv_8_valid cv_9_valid cv_10_valid
auc NaN 1.0 0.9166667
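We see that the folds with NaN are the same folds that have only dc = 0. The fold assignment is numbered from 0, so folds 4, 6 and 7 in the table correspond to cv_5_valid, cv_7_valid and cv_8_valid in the summary.
Even leaving the NaNs aside, the wide range of AUC across your folds (from 0.2 to 1) tells us that this is not a robust model, and it is likely overfitted. Can you add more data?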
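To see why a fold without positives yields NaN, here is a minimal base R sketch of the rank-based (Mann-Whitney) AUC formula. It is an illustration only and not h2o's implementation; auc_rank is a hypothetical helper name.

```r
# Rank-based AUC: the probability that a random positive outranks
# a random negative. The denominator is n_pos * n_neg.
auc_rank <- function(labels, scores) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# A fold with both classes gives a finite value.
auc_mixed <- auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))

# A fold with only dc = 0 gives 0 / 0, which R reports as NaN.
auc_no_pos <- auc_rank(c(0, 0, 0), c(0.1, 0.2, 0.3))

print(c(mixed = auc_mixed, no_positives = auc_no_pos))
```

If your h2o version supports it, fold_assignment = "Stratified" in h2o.glm may be worth trying so that each fold keeps some positive cases, although with this few rows the estimates will remain noisy.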

R logistic regression and marginal effects - how to exclude NA values in categorical independent variable

I am a beginner with R. I am using glm to conduct logistic regression and then using the 'margins' package to calculate marginal effects but I don't seem to be able to exclude the missing values in my categorical independent variable.
I have tried to ask R to exclude NAs from the regression. The categorical variable is weight status at age 9 (wgt9), and it has three levels (1, 2, 3) and some NAs.
What am I doing wrong? Why do I get a wgt9NA result in my outputs and how can I correct it?
Thanks in advance for any help/advice.
Conduct logistic regression
summary(logit.phbehav <- glm(obese13 ~ gender + as.factor(wgt9) + aded08b,
data = gui, weights = bdwg01, family = binomial(link = "logit")))
Regression output
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.99 0.293 -13.6 2.86e- 42
2 gender 0.387 0.121 3.19 1.42e- 3
3 as.factor(wgt9)2 2.49 0.177 14.1 3.28e- 45
4 as.factor(wgt9)3 4.65 0.182 25.6 4.81e-144
5 as.factor(wgt9)NA 2.60 0.234 11.1 9.94e- 29
6 aded08b -0.0755 0.0224 -3.37 7.47e- 4
Calculate the marginal effects
effects_logit_phtotal = margins(logit.phtot)
print(effects_logit_phtotal)
summary(effects_logit_phtotal)
Marginal effects output
> summary(effects_logit_phtotal)
factor AME SE z p lower upper
aded08a -0.0012 0.0002 -4.8785 0.0000 -0.0017 -0.0007
gender 0.0115 0.0048 2.3899 0.0169 0.0021 0.0210
wgt92 0.0941 0.0086 10.9618 0.0000 0.0773 0.1109
wgt93 0.4708 0.0255 18.4569 0.0000 0.4208 0.5207
wgt9NA 0.1027 0.0179 5.7531 0.0000 0.0677 0.1377
First of all, welcome to Stack Overflow. Please check the answer here to see how to make a great R question. Not providing a sample of your data sometimes makes it impossible to answer the question. However, taking a guess, I think that your NA values are not stored as real missing values but as the string "NA". This behavior can be seen in the dummy data below.
First let's create the dummy data:
v1 <- c(2,3,3,3,2,2,2,2,NA,NA,NA)
v2 <- c(2,3,3,3,2,2,2,2,"NA","NA","NA")
v3 <- c(11,5,6,7,10,8,7,6,2,5,3)
obese <- c(0,1,1,0,0,1,1,1,0,0,0)
df <- data.frame(obese, v1, v2, v3)
Using the variable named v1 does not include NA as a category, and the rows with missing values are dropped:
glm(formula = obese ~ as.factor(v1) + v3, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
1 2 3 4 5 6 7 8
-2.110e-08 2.110e-08 1.168e-05 -1.105e-05 -2.110e-08 3.094e-06 2.110e-08 2.110e-08
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 401.48 898581.15 0 1
as.factor(v1)3 -96.51 326132.30 0 1
v3 -46.93 106842.02 0 1
Converting the variable that contains the string "NA" to a factor, on the other hand, gives an output similar to the one in the question:
glm(formula = obese ~ as.factor(v2) + v3, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.402e-05 -2.110e-08 -2.110e-08 2.110e-08 1.472e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 394.21 744490.08 0.001 1
as.factor(v2)3 -95.33 340427.26 0.000 1
as.factor(v2)NA -327.07 613934.84 -0.001 1
v3 -45.99 84477.60 -0.001 1
Try the following to replace NAs that are strings:
gui$wgt9[ gui$wgt9 == "NA" ] <- NA
Don't forget to accept any answer that solved your problem.
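A small base R sketch of the effect and of the fix, using made-up values rather than your data. It also covers the case where the column is already a factor, in which an unused "NA" level can linger until it is dropped.

```r
# Character column where missing values were stored as the string "NA".
wgt9 <- c("1", "2", "3", "NA", "2", NA)
before <- levels(as.factor(wgt9))   # "NA" shows up as a level

# which() ignores real NAs in the comparison, so only the strings match.
wgt9[which(wgt9 == "NA")] <- NA
after <- levels(as.factor(wgt9))    # the "NA" level is gone

# If the column is already a factor, the level stays until dropped.
f <- factor(c("1", "2", "NA"))
f[which(f == "NA")] <- NA
f_levels_kept <- levels(f)
f_levels_dropped <- levels(droplevels(f))

print(list(before = before, after = after,
           kept = f_levels_kept, dropped = f_levels_dropped))
```

If the data come from a text file, passing na.strings = "NA" (or whatever marker the file uses) to read.csv avoids the problem at import time.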

How to suppress iteration output from Boosted tree model gbm in Caret from R studio

If I run this code to train a gbm model with knitr, I receive several pages of Iter output like the lines copied below. Is there a way to suppress this output?
mod_gbm <- train(classe ~ ., data = TrainSet, method = "gbm")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1322
## 2 1.5210 nan 0.1000 0.0936
## 3 1.4608 nan 0.1000 0.0672
## 4 1.4165 nan 0.1000 0.0561
## 5 1.3793 nan 0.1000 0.0441
Thank you!
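Try passing train the argument verbose = FALSE, which is the switch gbm uses for its iteration log (trace = FALSE is the equivalent for some other methods, such as nnet).
This parameter is not defined explicitly in the train documentation because it is passed through the ... optional parameters to the underlying model function.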
