Some questions about rpart & gbm in R

I am trying to model claim frequency with the rpart and gbm packages, and I have a few questions about them.
In the rpart model, what is the purpose of the shrink parameter?
In the gbm model, am I using the weights correctly? I get output (no errors), but I just want to be sure I have understood it correctly.
In the gbm model, I know that the parameter n.minobsinnode lets me require at least 10 observations in each node. But is there a way to require that each node contains at least one claim? I don't want a model that predicts a claim frequency of 0 for some observations.
In a random forest, d variables are randomly picked from the n variables for each split. But in the gbm model, are all n variables considered for each split?
In tree-based models, is it possible to offset one variable (e.g. deductible)?
# Regression tree
Model_tree <- rpart(cbind(duration, nclaims) ~ Var_1 + … + Var_n,
                    data = data,
                    method = "poisson",
                    parms = list(shrink = 1),
                    control = rpart.control(minbucket = 10, cp = 0.00005, maxdepth = 5))
# Gradient Boosting Model
Model_gbm <- gbm(nclaims ~ Var_1 + … + Var_n,
                 data = data,
                 weights = duration,
                 distribution = "poisson",
                 cv.folds = 0,
                 shrinkage = 0.01,
                 interaction.depth = 5,
                 n.trees = 5000,
                 n.minobsinnode = 10,
                 bag.fraction = 1,
                 train.fraction = 1)
# Predict with a gbm
predict.gbm(object = Model_gbm, n.trees = 1000, newdata = testdata, type = "response")
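For reference, below is a small self-contained sketch (not part of the original question) of the same Poisson set-up on simulated data. The data and variable names (sim, Var_1, Var_2, duration, nclaims) are made up for illustration; the only substantive change from the code above is cv.folds > 0, so that gbm.perf() can choose the number of trees instead of the hard-coded n.trees = 1000 in predict.gbm.

library(gbm)

# Simulated exposure/claim-count data, for illustration only
set.seed(1)
n <- 2000
sim <- data.frame(
  Var_1    = rnorm(n),
  Var_2    = factor(sample(c("A", "B"), n, replace = TRUE)),
  duration = runif(n, 0.1, 1)
)
sim$nclaims <- rpois(n, lambda = sim$duration * exp(-2 + 0.3 * sim$Var_1))

gbm_cv <- gbm(nclaims ~ Var_1 + Var_2,
              data = sim,
              weights = duration,       # same weighting idea as in the question
              distribution = "poisson",
              cv.folds = 5,
              shrinkage = 0.01,
              interaction.depth = 5,
              n.trees = 2000,
              n.minobsinnode = 10)

best_iter <- gbm.perf(gbm_cv, method = "cv")   # CV-optimal number of trees
head(predict(gbm_cv, newdata = sim, n.trees = best_iter, type = "response"))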

Related

How to obtain list of AUC and SE after MICE (used for pool_auc in psfmi package)?

I have used the mice package to obtain 20 imputed datasets.
Then I used the with() function to fit one model per dataset, and the pool() function to obtain the pooled model. Here is my code:
IMP2 <- mice(Baseline_SEL,
             defaultMethod = c("norm", "logreg", "polyreg"),
             vis = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 13, 14, 6, 12),
             maxit = 20, seed = 189, m = 20,
             pred = pre)
List_fit1 <- with(data = IMP2,
                  expr = glm(formula = Echec ~ INCLUSIONSITE + FIB_SEVERE + Agec + SEX + DG_USE + LIV_CL + IN_COUPLE + geno3_vsOt,
                             family = binomial(link = "logit")))
summary(pool(List_fit1), conf.int = TRUE, exponentiate = TRUE)
# Comparison of nested models
List_fit1_minusdg_use <- with(data = IMP2,
                              expr = glm(formula = Echec ~ INCLUSIONSITE + FIB_SEVERE + Agec + SEX + LIV_CL + IN_COUPLE + geno3_vsOt,
                                         family = binomial(link = "logit")))
D3(List_fit1, List_fit1_minusdg_use)
So I used the likelihood-ratio test to compare the models.
But now I would like to pool the AUCs obtained for each dataset in order to assess the goodness of fit of the pooled model. I have found psfmi::pool_auc, but I cannot work out how to use it. In particular, I did not find how to obtain the list of AUCs and SEs after the multiple imputation.
Do you have an idea how to get these two lists?
See this vignette: https://mwheymans.github.io/psfmi/articles/psfmi_mice.html for how to get pooled logistic regression models after mice. Then use psfmi::pool_performance to get pooled AUC values (it uses pool_auc internally).
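For the two lists themselves, one possible sketch (using the objects from the question, and assuming psfmi::pool_auc() accepts the per-imputation AUCs and their standard errors together with the number of imputations; check ?pool_auc for the exact argument names) is to compute the AUC and its SE in each completed dataset with pROC and then pool them:

library(mice)
library(pROC)
library(psfmi)

m <- 20                                   # number of imputations used in IMP2
auc_vec <- se_vec <- numeric(m)
for (i in 1:m) {
  d_i   <- complete(IMP2, i)              # i-th completed dataset
  fit_i <- glm(Echec ~ INCLUSIONSITE + FIB_SEVERE + Agec + SEX +
                 DG_USE + LIV_CL + IN_COUPLE + geno3_vsOt,
               family = binomial(link = "logit"), data = d_i)
  roc_i <- roc(d_i$Echec, fitted(fit_i), quiet = TRUE)
  auc_vec[i] <- as.numeric(auc(roc_i))    # AUC in this imputed dataset
  se_vec[i]  <- sqrt(var(roc_i))          # DeLong SE of the AUC
}
pool_auc(est_auc = auc_vec, est_se = se_vec, nimp = m, log_auc = TRUE)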

Why does my code take so long to process?

I am trying to run code from this web site on my computer.
I use a data set from a Kaggle competition.
My training data has 1,022 rows and 81 variables.
I run this code:
hyper_grid <- expand.grid(
  shrinkage = c(.01, .1, .3),
  interaction.depth = c(1, 3, 5),
  n.minobsinnode = c(5, 10, 15),
  bag.fraction = c(.65, .8, 1),
  optimal_trees = 0, # a place to dump results
  min_RMSE = 0       # a place to dump results
)
random_index <- sample(1:nrow(ames_train), nrow(ames_train))
random_ames_train <- ames_train[random_index, ]
# grid search
for (i in 1:nrow(hyper_grid)) {
  # reproducibility
  set.seed(123)
  # train model
  gbm.tune <- gbm(
    formula = SalePrice ~ .,
    distribution = "gaussian",
    data = random_ames_train,
    n.trees = 5000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    bag.fraction = hyper_grid$bag.fraction[i],
    train.fraction = .75,
    n.cores = NULL, # will use all cores by default
    verbose = FALSE
  )
}
I have been waiting for more than an hour.
I think it is because my laptop is not powerful.
In the picture you can see my computer's specifications.
Please answer: can my computer perform this operation?
If yes, how long should I wait?
It's taking a long time because you're training 81 GBM models (one per row of hyper_grid), and GBMs are expensive to fit. To get a rough estimate of the total training time, you could train one model and then multiply that time by 81.
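A rough way to check this on your own machine (a sketch using the objects from the question; the hyper-parameters below are just one row of the grid):

# Time a single fit, then extrapolate to the whole grid (rough estimate only)
one_fit <- system.time(
  gbm(SalePrice ~ .,
      distribution = "gaussian",
      data = random_ames_train,
      n.trees = 5000,
      interaction.depth = 3,
      shrinkage = 0.1,
      n.minobsinnode = 10,
      bag.fraction = 0.8,
      train.fraction = .75,
      verbose = FALSE)
)["elapsed"]
one_fit * nrow(hyper_grid)   # approximate total seconds for all 81 fits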

Save Gradient Boosting Machine values obtained with Bootstrap

I am fitting a gradient boosting model to identify the importance of the variables in the model, and I am resampling to see how the importance of each variable behaves.
But I can't correctly save each variable's name together with its importance calculated in each bootstrap replicate.
I'm doing this with a function that is called by the boot() command from the boot package.
Below is a minimal reproducible example adapted to the AmesHousing data:
library(gbm)
library(boot)
library(AmesHousing)

df <- make_ames()

imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  gbm.fit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  return(summary(gbm.fit)[, 2])
}

results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
results_GBM$t0
I expect to save the bootstrap results together with their variable names, but I can only save the variable importances without the names.
With summary.gbm, the default is to order the variables by importance and to draw a plot. You need to set order = FALSE (and plotit = FALSE to skip the plot); the returned importances are then in the same order as the variables in the fit, so the names can be taken from var.names.
imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  # use gbmfit because gbm.fit is a function
  gbmfit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  o <- summary(gbmfit, plotit = FALSE, order = FALSE)[, 2]
  names(o) <- gbmfit$var.names
  return(o)
}
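With this version, the replicate matrix returned by boot() can also be labelled, since t0 now carries the variable names (a small follow-up using the objects from the question):

results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
colnames(results_GBM$t) <- names(results_GBM$t0)   # label the bootstrap replicates
head(results_GBM$t[, 1:5])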

R gbm: why is the length of fit$trees always 3 times n.trees for the iris data set?

Regardless of the method ('cv', 'OOB' or 'test') we choose, the number of trees in the fitted gbm object is always 3 * n.trees for the iris data set.
Is it because the iris data set has a target categorical variable with 3 levels?
If so, and the target variable has values A, B and C, is the 1st tree for A, the 2nd tree for B, the 3rd tree for C, and the 4th tree for A again?
Or, if n.trees is set to 100, are the first 100 trees for A, the next 100 for B, and the last 100 for C?
fit = gbm(data=iris, Species ~., shrinkage = 0.2, n.trees = 50, cv.folds = 2)
best.iter = gbm.perf(fit, method = 'cv')
fit = gbm(data=iris, Species ~., shrinkage = 0.2, n.trees = 40, train.fraction = 0.8)
best.iter = gbm.perf(fit, method = 'test')
fit = gbm(data=iris, Species ~., shrinkage = 0.2, n.trees = 50)
best.iter = gbm.perf(fit, method = 'OOB')
You are right that the number of trees is 3 times n.trees because of the factor on the left-hand side of the formula: with a 3-level factor response gbm uses the multinomial distribution and grows one tree per level at each boosting iteration, so you get 3 separate fits.
If you filter the data down to only two levels of the factor, you will end up with 2 * n.trees trees instead.
library(gbm)
iris.sub <- iris[iris$Species != "setosa", ]
iris.sub$Species <- factor(as.character(iris.sub$Species))
levels(iris.sub$Species)
fit <- gbm(Species ~ ., data = iris.sub, shrinkage = 0.2, n.trees = 50,
           cv.folds = 2, distribution = "multinomial")
length(fit$trees)
As for how the trees are organized within the gbm.object, I was unable to figure out which order they are stored in.
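A quick check of this bookkeeping (a small sketch, refitting the cross-validated model from the question with an explicit multinomial distribution):

fit <- gbm(Species ~ ., data = iris, distribution = "multinomial",
           shrinkage = 0.2, n.trees = 50, cv.folds = 2)
length(fit$trees)                   # 150 trees are stored
fit$n.trees                         # 50 boosting iterations
length(fit$trees) / fit$n.trees     # 3: one tree per class per iteration
nlevels(iris$Species)               # 3 classes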

Bayesian error-in-variables (total least squares) model in R using MCMCglmm

I am fitting some Bayesian linear mixed models using the MCMCglmm package in R. My data include predictors that are measured with error, so I'd like to build a model that takes this into account. My understanding is that a basic mixed-effects model in MCMCglmm minimizes error only for the response variable (as in OLS regression); in other words, vertical errors are minimized. I'd like to minimize errors orthogonal to the regression line/plane/hyperplane.
Is it possible to fit an error-in-variables (aka total least squares) model using MCMCglmm, or would I have to use JAGS/Stan to do this?
Is it possible to do this with multiple predictors in the same model (I have some models with 3 or 4 predictors, each measured with error)?
If it is possible, how would I specify the model?
I've included a data set below, with a random variable height that is measured with error to illustrate a basic set up with MCMCglmm.
library(nlme)
library(MCMCglmm)

data(Orthodont)
set.seed(1234)
Orthodont$height <- c(rnorm(54, 170, 10), rnorm(54, 150, 10))

prior1 <- list(
  B = list(mu = rep(0, 3), V = diag(1e+08, 3)),
  G = list(G1 = list(V = 1, nu = 1, alpha.mu = 0, alpha.V = 1000)),
  R = list(V = 1, nu = 0.002)
)

model1 <- MCMCglmm(
  fixed = distance ~ height + Sex,
  random = ~ Subject,
  rcov = ~ units,
  data = Orthodont,
  family = "gaussian",
  prior = prior1,
  nitt = 1.1e+4,
  thin = 10,
  burnin = 1e+3,
  verbose = FALSE
)
summary(model1)
