I am using the gbm package in R with the 'bernoulli' distribution option to build a classifier, but I get unusual 'nan' results and am unable to predict any classifications. I do not encounter the same errors when I use 'adaboost'. Below is sample code; I replicated the same errors with the iris dataset.
## using the iris data for gbm
library(caret)
library(gbm)
data(iris)
Data <- iris[1:100,-5]
Label <- as.factor(c(rep(0,50), rep(1,50)))
# Split the data into training and testing
inTraining <- createDataPartition(Label, p=0.7, list=FALSE)
training <- Data[inTraining, ]
trainLab <- droplevels(Label[inTraining])
testing <- Data[-inTraining, ]
testLab <- droplevels(Label[-inTraining])
# Model
model_gbm <- gbm.fit(x=training, y= trainLab,
distribution = "bernoulli",
n.trees = 20, interaction.depth = 1,
n.minobsinnode = 10, shrinkage = 0.001,
bag.fraction = 0.5, keep.data = TRUE, verbose = TRUE)
## output on the console
Iter TrainDeviance ValidDeviance StepSize Improve
1 -nan -nan 0.0010 -nan
2 nan -nan 0.0010 nan
3 -nan -nan 0.0010 -nan
4 nan -nan 0.0010 nan
5 -nan -nan 0.0010 -nan
6 nan -nan 0.0010 nan
7 -nan -nan 0.0010 -nan
8 nan -nan 0.0010 nan
9 -nan -nan 0.0010 -nan
10 nan -nan 0.0010 nan
20 nan -nan 0.0010 nan
Please let me know if there is a workaround to get this working. The reason I am using this is to experiment with Additive Logistic Regression; please suggest any other alternatives in R for doing so.
Thanks.
Is there a reason you are using gbm.fit() instead of gbm()?
Based on the package documentation, the y variable in gbm.fit() needs to be a vector.
I tried forcing the response to a vector using
trainLab <- as.vector(droplevels(Label[inTraining])) # vector of characters
which gave the following console output. Unfortunately, I'm not sure why the valid deviance is still -nan.
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3843 -nan 0.0010 0.0010
2 1.3823 -nan 0.0010 0.0010
3 1.3803 -nan 0.0010 0.0010
4 1.3783 -nan 0.0010 0.0010
5 1.3763 -nan 0.0010 0.0010
6 1.3744 -nan 0.0010 0.0010
7 1.3724 -nan 0.0010 0.0010
8 1.3704 -nan 0.0010 0.0010
9 1.3684 -nan 0.0010 0.0010
10 1.3665 -nan 0.0010 0.0010
20 1.3471 -nan 0.0010 0.0010
train.fraction should be < 1 to get a ValidDeviance, because that is what creates a validation dataset.
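Putting both suggestions together, a minimal sketch (using gbm() as suggested above; gbm's "bernoulli" distribution expects a numeric 0/1 response, and train.fraction = 0.8 is just an illustrative value):
# recode the factor labels to numeric 0/1 and use the formula interface
trainDF <- cbind(training, Label = as.numeric(as.character(trainLab)))
model_gbm <- gbm(Label ~ ., data = trainDF,
                 distribution = "bernoulli",
                 n.trees = 20, interaction.depth = 1,
                 n.minobsinnode = 10, shrinkage = 0.001,
                 bag.fraction = 0.5, train.fraction = 0.8,
                 keep.data = TRUE, verbose = TRUE)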
Thanks!
Related
I'm trying to build propensity scores with the twang package, but I keep getting this error:
Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays
I'm attaching the code and the console output:
ps.TPSV.gbm = ps(Cardioversione ~ Sesso + age,
                 data = prova)
Fitting boosted model
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.6590 nan 0.0100 nan
2 0.6581 nan 0.0100 nan
3 0.6572 nan 0.0100 nan
4 0.6564 nan 0.0100 nan
5 0.6556 nan 0.0100 nan
6 0.6548 nan 0.0100 nan
7 0.6540 nan 0.0100 nan
8 0.6533 nan 0.0100 nan
9 0.6526 nan 0.0100 nan
...
9900 0.4164 nan 0.0100 nan
9920 0.4161 nan 0.0100 nan
9940 0.4160 nan 0.0100 nan
9960 0.4158 nan 0.0100 nan
9980 0.4157 nan 0.0100 nan
10000 0.4155 nan 0.0100 nan
Diagnosis of unweighted analysis
Error in Di - crossprod(WX[index, ], X[index, ]) : non-conformable arrays
I honestly don't understand what the problem is. The variables are one factor (Sesso) and one numeric (age), and there are no missing values. Could anyone help me?
Thank you in advance.
I've already tried changing the variables included in the propensity score model, but it makes no difference. The example code with the lalonde dataset included in twang runs fine.
The "ps" function (propensity score estimation) in "twang" package in R keeps printing its report. How can I turn that off?
I already tried to set the "print.level" argument to be 0. But it is not working for me.
D = rbinom(100, size = 1, prob = 0.5)
X1 = rnorm(100)
X2 = rnorm(100)
ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
estimand = "ATE", print.level = 0)
I was hoping to suppress the progress output, but it keeps printing something like:
Fitting gbm model
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3040 nan 0.0100 nan
2 1.3012 nan 0.0100 nan
3 1.2985 nan 0.0100 nan
4 1.2959 nan 0.0100 nan
5 1.2932 nan 0.0100 nan
6 1.2907 nan 0.0100 nan
7 1.2880 nan 0.0100 nan
8 1.2855 nan 0.0100 nan
9 1.2830 nan 0.0100 nan
10 1.2804 nan 0.0100 nan
20 1.2562 nan 0.0100 nan
.....
which is annoying.
Presumably you want to capture the result in a variable; if you combine that with the verbose = FALSE parameter, it should do what you need:
res <- ps(D ~ ., data = data.frame(D, X1, X2), stop.method = 'es.mean',
estimand = "ATE", print.level = 0, verbose = FALSE)
I haven't tested whether you still need print.level = 0.
If I run this code to train a gbm model with knitr, I get several pages of Iter output like the excerpt below. Is there a way to suppress this output?
mod_gbm <- train(classe ~ ., data = TrainSet, method = "gbm")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1322
## 2 1.5210 nan 0.1000 0.0936
## 3 1.4608 nan 0.1000 0.0672
## 4 1.4165 nan 0.1000 0.0561
## 5 1.3793 nan 0.1000 0.0441
Thank you!
Try passing train() the argument trace = FALSE.
This parameter is not defined explicitly in the train documentation because it is part of the ... optional parameters.
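For method = "gbm" specifically, the underlying fitting routine is gbm(), whose own verbosity switch is verbose, so passing verbose = FALSE through train()'s ... is also worth trying (a sketch, assuming TrainSet and classe from the question):
mod_gbm <- train(classe ~ ., data = TrainSet, method = "gbm", verbose = FALSE)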
I used the gbm function to implement gradient boosting, and I want to perform classification.
After that, I used the varImp() function to print the variable importance of the gradient boosting model.
But only 4 variables have non-zero importance, while there are 371 variables in my data. Is that right?
This is my code and result.
>asd <- read.csv("bigdatafile.csv", header = TRUE)
>asd1 <- gbm(TARGET ~ ., n.trees = 50, distribution = "adaboost", verbose = TRUE, interaction.depth = 1, data = asd)
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.5840 nan 0.0010 0.0011
2 0.5829 nan 0.0010 0.0011
3 0.5817 nan 0.0010 0.0011
4 0.5806 nan 0.0010 0.0011
5 0.5795 nan 0.0010 0.0011
6 0.5783 nan 0.0010 0.0011
7 0.5772 nan 0.0010 0.0011
8 0.5761 nan 0.0010 0.0011
9 0.5750 nan 0.0010 0.0011
10 0.5738 nan 0.0010 0.0011
20 0.5629 nan 0.0010 0.0011
40 0.5421 nan 0.0010 0.0010
50 0.5321 nan 0.0010 0.0010
>varImp(asd1,numTrees = 50)
Overall
CA0000801 0.00000
AS0000138 0.00000
AS0000140 0.00000
A1 0.00000
PROFILE_CODE 0.00000
A2 0.00000
CB_thinfile2 0.00000
SP_thinfile2 0.00000
thinfile1 0.00000
EW0001901 0.00000
EW0020901 0.00000
EH0001801 0.00000
BS_Seg1_Score 0.00000
BS_Seg2_Score 0.00000
LA0000106 0.00000
EW0001903 0.00000
EW0002801 0.00000
EW0002902 0.00000
EW0002903 0.00000
EW0002904 0.00000
EW0002906 0.00000
LA0300104_SP 56.19052
ASMGRD2 2486.12715
MIX_GRD 2211.03780
P71010401_1 0.00000
PS0000265 0.00000
P11021100 0.00000
PE0000123 0.00000
There are 371 variables, so I did not list the rest above; they all have zero importance.
TARGET is the target variable, and I grew 50 trees. The TARGET variable has two levels, so I used adaboost.
Is there a mistake in my code? There are only a few non-zero variables.
Thank you for your reply.
You cannot use importance() or varImp() here; those are only for random forests.
However, you can use summary.gbm from the gbm package.
Ex:
summary.gbm(boost_model)
The output lists the relative influence of each variable.
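For instance, a minimal sketch using the fitted object from the question (asd1), with plotit = FALSE to skip the bar plot and just return the table:
infl <- summary(asd1, n.trees = 50, plotit = FALSE)  # data frame with columns var and rel.inf
head(infl)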
In your code, n.trees is very low and shrinkage is very high.
Just adjust these two factors.
n.trees is the number of trees. Increasing it reduces the error on the training set, but setting it too high may lead to over-fitting.
interaction.depth (maximum nodes per tree) is the number of splits performed on a tree (starting from a single node).
shrinkage is essentially a learning rate. Shrinkage is commonly used in ridge regression, where it shrinks regression coefficients toward zero and thus reduces the impact of potentially unstable coefficients.
I recommend using 0.1 for all data sets with more than 10,000 records.
Also, use a small shrinkage when growing many trees.
If you set n.trees to 1,000 and shrinkage to 0.1, you will get different values, for example:
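A re-fit along those lines might look like this (a sketch using the question's data frame asd; the values are the ones suggested above, not tuned):
asd1 <- gbm(TARGET ~ ., data = asd, distribution = "adaboost",
            n.trees = 1000, shrinkage = 0.1, interaction.depth = 1,
            verbose = TRUE)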
And if you want to know the relative influence of each variable in the gbm, use summary.gbm() rather than varImp(). Of course varImp() is a good function, but I recommend summary.gbm().
Good luck.
I'm trying to train a gbm using the caret package in R. I initially got the following error and thought it was due to a missing input, so I created gbmGrid, but I am still getting the same error message.
sub4Collect1 <- data.frame(testing$row_id)
>
> cl <- makeCluster(10, type = "SOCK")
> registerDoSNOW(cl)
> ptm <- proc.time()
>
> for(i in 2:7){
+ trainClass <- postPrior1[,i]
+ testClass <- postTest1[,i]
+ gbmGrid <- expand.grid(.interaction.depth = (1:5) * 2, .n.trees = (1:5)*50, .shrinkage = .1)
+ bootControl <- trainControl(number = 1)
+ set.seed(2)
+ gbmFit <- train(prePrior1[,-c(2,60,61,161)], trainClass, method = "gbm", tuneLength = 5,
+ trControl = bootControl
+ ##, scaled = FALSE
+ , tuneGrid = gbmGrid
+ )
+ pred1 <- predict(gbmFit$finalModel, newdata = preTest1[,-c(2,60,61,161)])
+ sub4Collect1 <- cbind(sub4Collect1, pred1)
+ print(i)
+ flush.console()
+ }
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.0000 -nan 0.1000 0.0000
2 0.0000 -nan 0.1000 0.0000
3 0.0000 -nan 0.1000 0.0000
4 0.0000 -nan 0.1000 0.0000
5 0.0000 -nan 0.1000 0.0000
6 0.0000 -nan 0.1000 0.0000
7 0.0000 -nan 0.1000 0.0000
8 0.0000 -nan 0.1000 0.0000
9 0.0000 -nan 0.1000 0.0000
10 0.0000 -nan 0.1000 0.0000
50 0.0000 -nan 0.1000 0.0000
Error in n.trees[n.trees > object$n.trees] <- object$n.trees :
argument "n.trees" is missing, with no default
> stopCluster(cl)
> timee4 <- proc.time() - ptm
> timee4
user system elapsed
3.563 0.306 14.472
Any suggestions?
The proper call to predict() requires feeding in the n.trees parameter manually, taken from the tuning results of the fitted model, as such:
pred1 <- predict(gbmFit$finalModel, newdata = preTest1[,-c(2,60,61,161)],
                 n.trees = gbmFit$bestTune$.n.trees)
If this is not working:
pred1 <- predict(gbmFit$finalModel, newdata = preTest1[,-c(2,60,61,161)],
                 n.trees = gbmFit$bestTune$.n.trees)
you can predict on the train object itself, in which case caret supplies n.trees from the final model:
pred1 <- predict(gbmFit, newdata = preTest1[,-c(2,60,61,161)])
I don't think you need to pass both the tuneLength and tuneGrid parameters. Try just one or the other and see whether the problem persists.
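For example, keeping only the explicit grid (a sketch based on the loop in the question, with prePrior1, trainClass, bootControl, and gbmGrid as defined there):
gbmFit <- train(prePrior1[,-c(2,60,61,161)], trainClass, method = "gbm",
                trControl = bootControl, tuneGrid = gbmGrid)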