I'm trying to train a gbm using the caret package in R. I initially got the following error and thought it was due to lack of an input, so I created the gbmGrid but am still getting the same error message.
sub4Collect1 <- data.frame(testing$row_id)
>
> cl <- makeCluster(10, type = "SOCK")
> registerDoSNOW(cl)
> ptm <- proc.time()
>
> for(i in 2:7){
+ trainClass <- postPrior1[,i]
+ testClass <- postTest1[,i]
+ gbmGrid <- expand.grid(.interaction.depth = (1:5) * 2, .n.trees = (1:5)*50, .shrinkage = .1)
+ bootControl <- trainControl(number = 1)
+ set.seed(2)
+ gbmFit <- train(prePrior1[,-c(2,60,61,161)], trainClass, method = "gbm", tuneLength = 5,
+ trControl = bootControl
+ ##, scaled = FALSE
+ , tuneGrid = gbmGrid
+ )
+ pred1 <- predict(gbmFit$finalModel, newdata = preTest1[,-c(2,60,61,161)])
+ sub4Collect1 <- cbind(sub4Collect1, pred1)
+ print(i)
+ flush.console()
+ }
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.0000 -nan 0.1000 0.0000
2 0.0000 -nan 0.1000 0.0000
3 0.0000 -nan 0.1000 0.0000
4 0.0000 -nan 0.1000 0.0000
5 0.0000 -nan 0.1000 0.0000
6 0.0000 -nan 0.1000 0.0000
7 0.0000 -nan 0.1000 0.0000
8 0.0000 -nan 0.1000 0.0000
9 0.0000 -nan 0.1000 0.0000
10 0.0000 -nan 0.1000 0.0000
50 0.0000 -nan 0.1000 0.0000
Error in n.trees[n.trees > object$n.trees] <- object$n.trees :
argument "n.trees" is missing, with no default
> stopCluster(cl)
> timee4 <- proc.time() - ptm
> timee4
user system elapsed
3.563 0.306 14.472
Any suggestions?
Calling predict() on the gbmFit$finalModel object requires passing n.trees explicitly; take the value from the tuning results stored in gbmFit$bestTune:
pred1 <- predict(gbmFit$finalModel, newdata = preTest1[,-c(2,60,61,161)],
n.trees=gbmFit$bestTune$.n.trees)
If this is not working:
pred1 <- predict(gbmFit$finalModel, newdata = preTest1[,-c(2,60,61,161)],
n.trees=gbmFit$bestTune$.n.trees)
you can use this:
pred1 <- predict(gbmFit, newdata = preTest1[,-c(2,60,61,161)],
n.trees=gbmFit$finalModel$n.trees)
I don't think you need to pass both the tuneLength and tuneGrid parameters. Try just one or the other and see if the problem persists.
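To illustrate the n.trees point outside of caret, here is a minimal sketch that fits a plain gbm model on simulated data and passes n.trees explicitly to predict(). The data frame train_df and its columns are made up for the example, and this is not the asker's pipeline; it only shows that supplying n.trees from the fitted object avoids the "argument is missing" error regardless of gbm version.

```r
library(gbm)

# Simulated regression data (hypothetical, for illustration only)
set.seed(2)
n <- 200
train_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train_df$y <- 2 * train_df$x1 - train_df$x2 + rnorm(n, sd = 0.1)

fit <- gbm(y ~ x1 + x2, data = train_df,
           distribution = "gaussian",
           n.trees = 50, interaction.depth = 2,
           shrinkage = 0.1, verbose = FALSE)

# Pass n.trees explicitly; fit$n.trees is the number of trees that were fitted
pred <- predict(fit, newdata = train_df, n.trees = fit$n.trees)
head(pred)
```

With a caret train object, the analogous value would come from the bestTune element, as in the answers above.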
How to get the result of lrm() respectively?
I use lrm() to build a logistic model and get the following result:
library(rms) # provides lrm()
n <- 1000 # define sample size
y <- rep(0:1, 500)
age <- rnorm(n, 50, 10)
sex <- factor(sample(c('female','male'), n,TRUE))
f <- lrm(y ~ age + sex, x=TRUE, y=TRUE)
f
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1000 LR chi2 1.50 R2 0.002 C 0.520
0 500 d.f. 2 g 0.088 Dxy 0.040
1 500 Pr(> chi2) 0.4714 gr 1.092 gamma 0.040
max |deriv| 2e-13 gp 0.022 tau-a 0.020
Brier 0.250
Coef S.E. Wald Z Pr(>|Z|)
Intercept 0.2206 0.3370 0.65 0.5127
age -0.0030 0.0065 -0.46 0.6485
sex=male -0.1455 0.1266 -1.15 0.2504
How can I get each part of the result above as a separate data.frame? For example:
mydf$df1
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1000 LR chi2 1.50 R2 0.002 C 0.520
0 500 d.f. 2 g 0.088 Dxy 0.040
1 500 Pr(> chi2) 0.4714 gr 1.092 gamma 0.040
max |deriv| 2e-13 gp 0.022 tau-a 0.020
Brier 0.250
mydf$df2
Coef S.E. Wald Z Pr(>|Z|)
Intercept 0.2206 0.3370 0.65 0.5127
age -0.0030 0.0065 -0.46 0.6485
sex=male -0.1455 0.1266 -1.15 0.2504
Try:
# Capture the printed summary as a character vector, one element per line
res <- capture.output(print(f))
# Append each captured line to res.csv
lapply(res, function(x) write.table(data.frame(x), 'res.csv', append = TRUE, sep = ','))
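As an alternative to parsing printed text, the numbers can be pulled from the fitted object itself. The sketch below assumes the stats element of an lrm fit holds the summary statistics and that coef() and vcov() return the coefficients and their covariance matrix; the column names of the resulting data frames (statistic, value, Coef, SE, Wald.Z, P) are my own choice, not something lrm produces.

```r
library(rms)

set.seed(1)
n   <- 1000
y   <- rep(0:1, 500)
age <- rnorm(n, 50, 10)
sex <- factor(sample(c("female", "male"), n, TRUE))
f   <- lrm(y ~ age + sex, x = TRUE, y = TRUE)

# Standard errors and Wald statistics computed from the covariance matrix
se <- sqrt(diag(vcov(f)))
z  <- coef(f) / se

mydf <- list(
  # Model-level statistics (Obs, LR chi2, d.f., C, Dxy, ...)
  df1 = data.frame(statistic = names(f$stats), value = unname(f$stats)),
  # Coefficient table with a two-sided Wald p-value
  df2 = data.frame(Coef = coef(f), SE = se, Wald.Z = z,
                   P = 2 * pnorm(-abs(z)))
)

mydf$df1
mydf$df2
```

This keeps full numeric precision, whereas the capture.output() route only gives the rounded strings shown on the console.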
The "ps" function (propensity score estimation) in "twang" package in R keeps printing its report. How can I turn that off?
I already tried setting the print.level argument to 0, but that does not work for me.
library(twang) # provides ps()
D = rbinom(100, size = 1, prob = 0.5)
X1 = rnorm(100)
X2 = rnorm(100)
ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
estimand = "ATE", print.level = 0)
I expected no progress output, but it keeps printing something like:
Fitting gbm model
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3040 nan 0.0100 nan
2 1.3012 nan 0.0100 nan
3 1.2985 nan 0.0100 nan
4 1.2959 nan 0.0100 nan
5 1.2932 nan 0.0100 nan
6 1.2907 nan 0.0100 nan
7 1.2880 nan 0.0100 nan
8 1.2855 nan 0.0100 nan
9 1.2830 nan 0.0100 nan
10 1.2804 nan 0.0100 nan
20 1.2562 nan 0.0100 nan
.....
which is annoying.
Presumably you want to capture the result in a variable; if you combine that with the verbose = FALSE parameter, it should do what you need:
res <- ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
estimand = "ATE", print.level = 0, verbose = FALSE)
I haven't tested whether you still need print.level = 0.
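If a function offers no switch for its progress output at all, a generic fallback is to divert the printed text with capture.output(). The sketch below uses a hypothetical stand-in, noisy_fit(), rather than ps() itself, so it only demonstrates the base R mechanism; whether it is needed for ps() once verbose = FALSE is set is something I have not checked.

```r
# Hypothetical stand-in for a fitting function that prints progress
noisy_fit <- function(x) {
  cat("Fitting model\n")
  for (i in 1:3) cat("Iter", i, "\n")
  mean(x)
}

# capture.output() collects what would have been printed;
# the assignment to res still takes place in the calling environment
log_lines <- capture.output(res <- noisy_fit(c(1, 2, 3)))

res        # the result, obtained without console noise
log_lines  # the diverted progress text, kept in case it is useful
```

Note that this only diverts standard output. Text emitted through message() or warning() would need suppressMessages() or suppressWarnings() instead.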
I am trying to fit a gradient boosting machine (GBM) to insurance claims. The observations have unequal exposure so I am trying to use an offset equal to the log of exposures. I tried two different ways:
1. Put an offset term in the formula. This resulted in nan for the train and validation deviance for every iteration.
2. Use the offset parameter in the gbm function. This parameter is listed under gbm.more. This results in an error message that there is an unused parameter.
I can't share my company's data but I reproduced the problem using the Insurance data table in the MASS package. See the code and output below.
library(MASS)
library(gbm)
data(Insurance)
# Try using offset in the formula.
fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))
fitgbm1 = gbm(fm1, distribution = "poisson",
data = Insurance,
n.trees = 10,
shrinkage = 0.1,
verbose = TRUE)
# Try using offset in the gbm statement.
fm2 = formula(Claims ~ District + Group + Age)
offset2 = log(Insurance$Holders)
fitgbm2 = gbm(fm2, distribution = "poisson",
data = Insurance,
n.trees = 10,
shrinkage = 0.1,
offset = offset2,
verbose = TRUE)
This then outputs:
> source('D:/Rprojects/auto_tutorial/rcode/example_gbm.R')
Iter TrainDeviance ValidDeviance StepSize Improve
1 -347.8959 nan 0.1000 0.0904
2 -348.2181 nan 0.1000 0.0814
3 -348.3845 nan 0.1000 0.0616
4 -348.5424 nan 0.1000 0.0333
5 -348.6732 nan 0.1000 0.0850
6 -348.7744 nan 0.1000 0.0610
7 -348.8795 nan 0.1000 0.0633
8 -348.9132 nan 0.1000 -0.0109
9 -348.9200 nan 0.1000 -0.0212
10 -349.0271 nan 0.1000 0.0267
Error in gbm(fm2, distribution = "poisson", data = Insurance, n.trees = 10, :
unused argument (offset = offset2)
My question is what am I doing wrong? Also, is there another way? I noticed a weights parameter in the gbm function. Should I use that?
Your first approach works if you set train.fraction to a value less than 1. The default is 1, which means every row is used for training and there is no validation set, so ValidDeviance is reported as nan.
library(MASS)
library(gbm)
data(Insurance)
# Try using offset in the formula.
fm1 = formula(Claims ~ District + Group + Age + offset(log(Holders)))
fitgbm1 = gbm(fm1, distribution = "poisson",
data = Insurance,
n.trees = 10,
shrinkage = 0.1,
verbose = TRUE,
train.fraction = .75)
results in
Iter TrainDeviance ValidDeviance StepSize Improve
1 -428.8293 -105.1735 0.1000 0.0888
2 -429.0869 -105.3063 0.1000 0.0708
3 -429.1805 -105.3941 0.1000 0.0486
4 -429.3414 -105.4816 0.1000 0.0933
5 -429.4934 -105.5432 0.1000 0.0566
6 -429.6714 -105.5188 0.1000 0.1212
7 -429.8470 -105.5200 0.1000 0.0833
8 -429.9655 -105.6073 0.1000 0.0482
9 -430.1367 -105.6003 0.1000 0.0473
10 -430.2462 -105.6100 0.1000 0.0487
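The effect of train.fraction can also be checked programmatically rather than by reading the console log. This sketch assumes the fitted gbm object stores the per-iteration validation deviance in its valid.error element; the names fit_full and fit_split are just labels for this example.

```r
library(MASS)
library(gbm)
data(Insurance)

fm <- Claims ~ District + Group + Age + offset(log(Holders))

# All rows used for training: no validation rows are held out
set.seed(1)
fit_full <- gbm(fm, distribution = "poisson", data = Insurance,
                n.trees = 10, shrinkage = 0.1,
                train.fraction = 1, verbose = FALSE)

# First 75% of the rows used for training, the rest for validation
set.seed(1)
fit_split <- gbm(fm, distribution = "poisson", data = Insurance,
                 n.trees = 10, shrinkage = 0.1,
                 train.fraction = 0.75, verbose = FALSE)

# Should mirror the ValidDeviance column of the verbose output
print(fit_full$valid.error)
print(fit_split$valid.error)
```

Because gbm takes the first train.fraction share of the rows as the training set, it may be worth shuffling the rows beforehand if the data are ordered in a meaningful way.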
So essentially I have two matrices containing the excess returns of stocks (R) and the expected excess return (ER).
R<-matrix(runif(47*78),ncol = 78)
ER<-matrix(runif(47*78),ncol = 78)
I then combine these removing the first row of R and adding the first row of ER to form a new matrix R1.
I then do this for R2, i.e. removing the first two rows of R and rbinding the rest with the first two rows of ER.
I do this until I have n-1 new matrices, R1 to R46.
I then find the Var-Cov matrix of each of the return matrices using cov(), i.e. VarCov1 to VarCov46.
n<-47
switch_matrices <- function(mat1, mat2, nrows){
rbind(mat1[(1+nrows):nrow(mat1),],mat2[1:nrows,])
}
# 1:n-1 parses as (1:n)-1, i.e. 0:46; seq_len(n-1) gives 1:46 as intended
l<-lapply(seq_len(n-1), function(nrows) switch_matrices(R,ER, nrows))
list2env(setNames(l,paste0("R",seq_along(l))), envir = parent.frame())
b<-lapply(l, cov)
list2env(setNames(b,paste0("VarCov",seq_along(b))), envir = parent.frame())
I am now trying to find the asset allocation using quadprog. So for example:
D_mat <- 2*VarCov1
d_vec <- rep(0,78)
A_mat <- cbind(rep(1,78),diag(78))
b_vec <- c(1,d_vec)
library(quadprog)
output <- solve.QP(Dmat = D_mat, dvec = d_vec,Amat = A_mat, bvec = b_vec,meq =1)
# The asset allocation
(round(output$solution, 4))
For some reason when running solve.QP with any Var-Cov matrix found I get this error:
Error in solve.QP(Dmat = D_mat, dvec = d_vec, Amat = A_mat, bvec = b_vec, :
matrix D in quadratic function is not positive definite!
I'm wondering what I am doing wrong or even why this is not working.
The input matrix isn't positive definite, which is a necessary condition for the optimization algorithm.
Why your matrix isn't positive definite will have to do with your specific data (the real data, not the randomly generated example) and will be both a statistical and subject matter specific question.
However, from a programming perspective there is a workaround. We can use nearPD from the Matrix package to find the nearest positive definite matrix as a viable alternative:
# Data generated by code in the question using set.seed(123)
library(quadprog)
library(Matrix)
pd_D_mat <- nearPD(D_mat)
output <- solve.QP(Dmat = as.matrix(pd_D_mat$mat),
dvec = d_vec,
Amat = A_mat,
bvec = b_vec,
meq = 1)
# The asset allocation
(round(output$solution, 4))
[1] 0.0052 0.0000 0.0173 0.0739 0.0000 0.0248 0.0082 0.0180 0.0000 0.0217 0.0177 0.0000 0.0000 0.0053 0.0000 0.0173 0.0216 0.0000
[19] 0.0000 0.0049 0.0042 0.0546 0.0049 0.0088 0.0250 0.0272 0.0325 0.0298 0.0000 0.0160 0.0000 0.0064 0.0276 0.0145 0.0178 0.0000
[37] 0.0258 0.0000 0.0413 0.0000 0.0071 0.0000 0.0268 0.0095 0.0326 0.0112 0.0381 0.0172 0.0000 0.0179 0.0000 0.0292 0.0125 0.0000
[55] 0.0000 0.0000 0.0232 0.0058 0.0000 0.0000 0.0000 0.0143 0.0274 0.0160 0.0000 0.0287 0.0000 0.0000 0.0203 0.0226 0.0311 0.0345
[73] 0.0012 0.0004 0.0000 0.0000 0.0000 0.0000
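For the randomly generated example, at least, the reason can be checked directly: with 78 columns and only 47 rows, the sample covariance matrix has rank at most 46, so it can be positive semidefinite but never positive definite. The sketch below uses only base R and the same dimensions as the question; the variable names V, rank_V and min_eig are mine.

```r
set.seed(123)
R <- matrix(runif(47 * 78), ncol = 78)

# 78 x 78 sample covariance matrix estimated from 47 observations
V <- cov(R)

# Numerical rank: centring removes one degree of freedom, so at most 47 - 1
rank_V <- qr(V)$rank

# Smallest eigenvalue: zero up to rounding error for a singular matrix
min_eig <- min(eigen(V, symmetric = TRUE, only.values = TRUE)$values)

rank_V
min_eig
```

If the real data have the same shape, more observations, fewer assets, or a shrinkage-type covariance estimate would address the cause, whereas nearPD() only adjusts the matrix after the fact.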
I am using the gbm package in R with the 'bernoulli' distribution to build a classifier. I get unusual 'nan' results and am unable to predict any classes, but I do not encounter the same problem when I use 'adaboost'. Below is sample code; I reproduced the problem with the iris dataset.
## using the iris data for gbm
library(caret)
library(gbm)
data(iris)
Data <- iris[1:100,-5]
Label <- as.factor(c(rep(0,50), rep(1,50)))
# Split the data into training and testing
inTraining <- createDataPartition(Label, p=0.7, list=FALSE)
training <- Data[inTraining, ]
trainLab <- droplevels(Label[inTraining])
testing <- Data[-inTraining, ]
testLab <- droplevels(Label[-inTraining])
# Model
model_gbm <- gbm.fit(x=training, y= trainLab,
distribution = "bernoulli",
n.trees = 20, interaction.depth = 1,
n.minobsinnode = 10, shrinkage = 0.001,
bag.fraction = 0.5, keep.data = TRUE, verbose = TRUE)
## output on the console
Iter TrainDeviance ValidDeviance StepSize Improve
1 -nan -nan 0.0010 -nan
2 nan -nan 0.0010 nan
3 -nan -nan 0.0010 -nan
4 nan -nan 0.0010 nan
5 -nan -nan 0.0010 -nan
6 nan -nan 0.0010 nan
7 -nan -nan 0.0010 -nan
8 nan -nan 0.0010 nan
9 -nan -nan 0.0010 -nan
10 nan -nan 0.0010 nan
20 nan -nan 0.0010 nan
Please let me know if there is a workaround to get this working. I am using this to experiment with additive logistic regression, so please also suggest any alternatives in R.
Thanks.
Is there a reason you are using gbm.fit() instead of gbm()?
Based on the package documentation, the y variable in gbm.fit() needs to be a vector.
I tried making sure the vector was forced using
trainLab <- as.vector(droplevels(Label[inTraining])) #vector of chars
This gave the following output on the console. Unfortunately, I'm not sure why the validation deviance is still -nan.
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3843 -nan 0.0010 0.0010
2 1.3823 -nan 0.0010 0.0010
3 1.3803 -nan 0.0010 0.0010
4 1.3783 -nan 0.0010 0.0010
5 1.3763 -nan 0.0010 0.0010
6 1.3744 -nan 0.0010 0.0010
7 1.3724 -nan 0.0010 0.0010
8 1.3704 -nan 0.0010 0.0010
9 1.3684 -nan 0.0010 0.0010
10 1.3665 -nan 0.0010 0.0010
20 1.3471 -nan 0.0010 0.0010
train.fraction should be less than 1 to get ValidDeviance, because that holds out part of the data as a validation set.
Thanks!
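Putting the two answers together, a minimal sketch might look like the following. It assumes a numeric 0/1 response rather than a factor, uses the formula interface gbm() instead of gbm.fit(), and shuffles the rows first because gbm takes the leading train.fraction share of the data as the training set and the first 100 iris rows are ordered by species. The column name label is made up for this example.

```r
library(gbm)
data(iris)

set.seed(42)
dat <- iris[1:100, -5]
# Numeric 0/1 response, as expected by the bernoulli distribution
dat$label <- rep(c(0, 1), each = 50)
# Shuffle so both classes appear in the training and validation parts
dat <- dat[sample(nrow(dat)), ]

model <- gbm(label ~ ., data = dat,
             distribution = "bernoulli",
             n.trees = 20, interaction.depth = 1,
             n.minobsinnode = 10, shrinkage = 0.001,
             bag.fraction = 0.5, train.fraction = 0.7,
             verbose = FALSE)

# Predicted probabilities of class 1
prob <- predict(model, newdata = dat, n.trees = 20, type = "response")
summary(prob)
```

With shrinkage this small and only 20 trees, the probabilities stay close to the base rate; more trees or a larger shrinkage would be needed for a usable classifier.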