I used the gbm function to implement gradient boosting, and I want to perform classification.
After that, I used the varImp() function to print variable importance for the gradient boosting model.
But only 4 variables have non-zero importance, out of the 371 variables in my data set. Is that right?
This is my code and result.
>asd<-read.csv("bigdatafile.csv",header=TRUE)
>asd1<-gbm(TARGET~.,n.trees=50,distribution="adaboost", verbose=TRUE,interaction.depth = 1,data=asd)
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.5840 nan 0.0010 0.0011
2 0.5829 nan 0.0010 0.0011
3 0.5817 nan 0.0010 0.0011
4 0.5806 nan 0.0010 0.0011
5 0.5795 nan 0.0010 0.0011
6 0.5783 nan 0.0010 0.0011
7 0.5772 nan 0.0010 0.0011
8 0.5761 nan 0.0010 0.0011
9 0.5750 nan 0.0010 0.0011
10 0.5738 nan 0.0010 0.0011
20 0.5629 nan 0.0010 0.0011
40 0.5421 nan 0.0010 0.0010
50 0.5321 nan 0.0010 0.0010
>varImp(asd1,numTrees = 50)
Overall
CA0000801 0.00000
AS0000138 0.00000
AS0000140 0.00000
A1 0.00000
PROFILE_CODE 0.00000
A2 0.00000
CB_thinfile2 0.00000
SP_thinfile2 0.00000
thinfile1 0.00000
EW0001901 0.00000
EW0020901 0.00000
EH0001801 0.00000
BS_Seg1_Score 0.00000
BS_Seg2_Score 0.00000
LA0000106 0.00000
EW0001903 0.00000
EW0002801 0.00000
EW0002902 0.00000
EW0002903 0.00000
EW0002904 0.00000
EW0002906 0.00000
LA0300104_SP 56.19052
ASMGRD2 2486.12715
MIX_GRD 2211.03780
P71010401_1 0.00000
PS0000265 0.00000
P11021100 0.00000
PE0000123 0.00000
There are 371 variables, so I didn't paste the rest of the result above; all the remaining variables have zero importance.
TARGET is the target variable and has two levels, so I used adaboost, and I grew 50 trees.
Is there a mistake in my code? There are very few variables with non-zero importance.
Thank you for your reply.
You cannot use importance() or varImp() for this; those are meant for random forests.
However, you can use summary.gbm from the gbm package.
Ex:
summary.gbm(boost_model)
The output is a table of each variable's relative influence (with an accompanying bar chart).
In your code, n.trees is very low and shrinkage (the step size, 0.001 in your output) is very small; just adjust those two settings. With only 50 depth-1 trees, at most 50 variables can ever be selected, and with such a tiny learning rate the same few strong predictors get chosen repeatedly, which is why almost everything shows zero importance.
n.trees is the number of trees (boosting iterations). Increasing it reduces the error on the training set, but setting it too high may lead to over-fitting.
interaction.depth (maximum nodes per tree) is the number of splits performed on each tree, starting from a single node.
shrinkage is the learning rate. The term comes from ridge regression, where shrinkage pulls regression coefficients toward zero and thus reduces the impact of potentially unstable coefficients.
A shrinkage of 0.1 is a reasonable choice for data sets with more than 10,000 records; also, use a smaller shrinkage when growing many trees.
If you use n.trees = 1000 and shrinkage = 0.1, you will get different (and more useful) importance values.
And if you want the relative influence of each variable in the gbm, use summary.gbm() rather than varImp(). varImp() is a fine function, but I recommend summary.gbm() here.
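For example, a minimal sketch along those lines (1,000 trees and shrinkage 0.1 are just the values suggested above, not tuned for your data):
library(gbm)
# Refit with more trees and a larger learning rate (illustrative values):
# with 371 predictors, 50 stumps at shrinkage 0.001 cannot spread importance
# across more than a handful of variables.
asd1 <- gbm(TARGET ~ ., data = asd,
            distribution = "adaboost",
            n.trees = 1000, shrinkage = 0.1,
            interaction.depth = 1, verbose = FALSE)
# Relative influence of each variable (also drawn as a bar chart)
summary(asd1)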
Good luck.
I'm using mle2 to estimate parameters for a non-linear model, and I want estimates of error (standard errors) around the parameter estimates. I'd also like to use the model to predict with newdata, and I'm having problems (errors) with a couple of steps in this process.
Here's the data:
table<- "ac.temp performance
1 2.17 47.357923
4 2.17 234.255317
7 2.17 138.002633
10 2.17 227.545902
13 2.17 28.118072
16 9.95 175.638448
19 9.95 167.392218
22 9.95 118.162747
25 9.95 102.770622
28 9.95 191.874867
31 16.07 206.897159
34 16.07 74.741619
37 16.07 127.219884
40 16.07 208.231559
42 16.07 89.544612
44 20.14 314.946107
47 20.14 290.994063
50 20.14 243.322497
53 20.14 192.335133
56 20.14 133.841776
58 23.83 139.746673
61 23.83 224.135993
64 23.83 126.726493
67 23.83 246.443386
70 23.83 163.019896
83 28.04 4.614154
84 28.04 2.851866
85 28.04 2.935584
86 28.04 153.868415
87 28.04 103.884295
88 30.60 0.000000
89 29.60 0.000000
90 30.30 0.000000
91 29.90 0.000000
92 30.80 0.000000
93 28.90 0.000000
94 30.00 0.000000
95 30.20 0.000000
96 30.40 0.000000
97 30.70 0.000000
98 27.90 0.000000
99 28.60 0.000000
100 28.60 0.000000
101 30.40 0.000000
102 30.60 0.000000
103 29.70 0.000000
104 30.50 0.000000
105 29.30 0.000000
106 28.90 0.000000
107 30.40 0.000000
108 30.20 0.000000
109 30.10 0.000000
110 29.50 0.000000
111 31.00 0.000000
112 30.60 0.000000
113 30.90 0.000000
114 31.10 0.000000"
perfdat<- read.table(text=table, header=T)
First I have to set a couple of fixed parameters for my non-linear model of an animal's performance with respect to temperature:
pi = mean(subset(perfdat, ac.temp<5)$performance)
ti = min(perfdat$ac.temp)
define the x variable (temperature)
t = perfdat$ac.temp
create a function for non-linear model
tpc = function(tm, topt, pmax) {
perfmu = pi+(pmax-pi)*(1+((topt-t)/(topt-tm)))*(((t-ti)/(topt-ti))^((tm-ti)/(topt-tm)))
perfmu[perfmu<0] = 0.00001
return(perfmu)
}
create the negative log likelihood function
LL1 = function(tm, topt, pmax, performance=perfdat$performance) {
perfmu = tpc(tm=tm, topt=topt, pmax=pmax)
loglike = -sum(dnorm(x=performance, mean=perfmu, log=TRUE))
return(loglike)
}
model performance using mle2 - maximum likelihood estimator
m1<- mle2(minuslogl = LL1, start=list(tm=15, topt=20, pmax=150), control=list(maxit=5000))
summary(m1)
This gives me parameter estimates but not estimates of error (std. error), with the warning message: In sqrt(diag(object@vcov)) : NaNs produced. However, the parameter estimates are good and give me predictions that make sense.
Plot of non-linear curve using parameter estimates
I have tried many different optimizers and methods and get the same error about not being able to calculate the std. error, usually with warnings about not being able to invert the Hessian, or I get really wonky parameter estimates that don't make sense.
If I use:
confint(m1)
I get 95% intervals for each of my parameters, but I can't incorporate those into a prediction method that I could use for making a graph like the one below, which I made using an nls model and predict():
non-linear model with error graphed
If I recreate my mle2() model by embedding the model formula into the mle2 function:
tpcfun<- function(t, tm.f, topt.f, pmax.f) {
perfmu = pi+(pmax.f-pi)*(1+((topt.f-t)/(topt.f-tm)))*(((t-ti)/(topt.f-ti))^((tm.f-ti)/(topt.f-tm.f)))
perfmu[perfmu<0] = 0.00001
return(perfmu)
}
m2<- mle2(performance ~ dnorm(mean=-sum(log(tpcfun(t=ac.temp, tm.f, topt.f, pmax.f))), sd=1), method='L-BFGS-B', lower=c(tm.f=1, topt.f=1, pmax.f=1), control=list(maxit=5000, parscale=c(tm.f=1e1, topt.f=1e1, pmax.f=1e2)), start=list(tm.f=15, topt.f=20, pmax.f=150), data=perfdat)
summary(m2)
I get nonsensical estimates for my parameters, and I still don't get estimates of error.
My question is: can anyone see anything wrong with either of my models (the model functions and likelihood functions), or anything else that I am doing wrong? I have a feeling that I may be writing the likelihood function incorrectly; I've tried all sorts of distributions and formulations, but I may be totally messing it up.
Or is there a way that I can get estimates of error around my parameters so that I can visualize the error around my model prediction in graphs?
Thanks,
Rylee
PS. I decided to make a graph with just the point estimates and the trend line without error around it, but I wanted to put 95% CI bars on each of the point estimates; confint() is giving me ridiculously small CIs which don't even show up on the graph because they are smaller than the point character I'm using, ha.
I think the problem is in the "maxit" argument. Try to use good starting values and avoid a very high number of iterations. The second problem is the "L-BFGS-B" algorithm rather than the default. When you use the confint function, it is normal to obtain the intervals only when the mle2 optimization has converged. Check whether the profiles can be plotted (the plotprofmle function); that is safer.
A "NaNs produced" warning is normal if your data contain zero values when the log is applied. I suggest using this:
loglike = -sum(dnorm(x=performance, mean=perfmu, log=TRUE), na.rm=TRUE)
Check whether the result is plausible.
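If the optimization does converge with an invertible Hessian, one hedged way to get uncertainty onto the predicted curve is to simulate parameter sets from the estimated covariance matrix (a sketch; mvrnorm, the 1,000 draws and the 95% band are illustrative and not from the original answer):
library(MASS)   # for mvrnorm()
# Draw parameter sets from the estimated sampling distribution of the MLEs;
# vcov(m1) only works when the Hessian was invertible, hence the hedge above.
set.seed(1)
sims <- mvrnorm(1000, mu = coef(m1), Sigma = vcov(m1))
# Evaluate the performance curve for each draw on a temperature grid
# (pi and ti are the fixed values defined earlier in the post).
tseq <- seq(min(perfdat$ac.temp), max(perfdat$ac.temp), length.out = 200)
curve_at <- function(p) {
  mu <- pi + (p["pmax"] - pi) * (1 + ((p["topt"] - tseq)/(p["topt"] - p["tm"]))) *
        (((tseq - ti)/(p["topt"] - ti))^((p["tm"] - ti)/(p["topt"] - p["tm"])))
  pmax(mu, 0.00001)   # floor near zero, as in tpc() above; NaN draws pass through as NA
}
preds <- apply(sims, 1, curve_at)                                  # 200 x 1000 matrix
band  <- apply(preds, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
plot(performance ~ ac.temp, data = perfdat)
lines(tseq, band[1, ], lty = 2)                                    # lower 95% band
lines(tseq, band[2, ], lty = 2)                                    # upper 95% band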
If I run this code to train a gbm model with knitr, I receive several pages of Iter output like the excerpt copied below. Is there a way to suppress this output?
mod_gbm <- train(classe ~ ., data = TrainSet, method = "gbm")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1322
## 2 1.5210 nan 0.1000 0.0936
## 3 1.4608 nan 0.1000 0.0672
## 4 1.4165 nan 0.1000 0.0561
## 5 1.3793 nan 0.1000 0.0441
Thank you!
Try passing train the argument trace = FALSE.
This parameter is not defined explicitly in the train documentation because it is part of the ... optional parameters.
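A minimal sketch of that idea (note that verbose = FALSE is gbm's own documented switch for the per-iteration log, so it is worth trying if trace alone does not silence the output; extra arguments are forwarded through train()'s ... to the underlying fit):
library(caret)
# verbose = FALSE is gbm's flag for the Iter/TrainDeviance log; train()
# forwards it to the underlying gbm fit via its "..." arguments.
mod_gbm <- train(classe ~ ., data = TrainSet, method = "gbm", verbose = FALSE)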
Good morning,
I am having trouble understanding some of the output from my Kaplan-Meier analyses.
I have managed to produce the following plots and output using ggsurvplot and survfit.
I first made a plot of survival time for 55 nests over time, and then did the same with the top predictors for nest failure, one of them being microtopography, as seen in this example.
Call: npsurv(formula = (S) ~ 1, data = nestdata, conf.type = "log-log")
26 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
55 45 0 13 29 2 NA
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
Microtopography=0 14 13 0 1 NA NA NA
Microtopography=1 26 21 0 7 NA 29 NA
Microtopography=2 12 8 0 5 3 2 NA
So, I have two primary questions.
1. The survival curves are for a ground-nesting bird with an egg incubation time of 21-23 days. Incubation time is the number of days the hen sits on the eggs before they hatch. Knowing that, how is it possible that the median survival time in plot #1 is 29 days? It seems to fit with the literature I have read on this same species; however, I assume it has something to do with the left censoring in my models, but I am honestly at a loss. If anyone has any insight, or even any literature that could help me understand this concept, I would really appreciate it.
2. I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point, the median survival times returned are NA. I understand I can choose another interval, such as 0.75, but in this example that still wouldn't help me, because microtopography 0 never drops below 0.9 or so. How would one go about reporting these data? Would the workaround be to choose a set of survival times, using:
summary(s,times=c(7,14,21,29))
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
Microtopography=0
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 3 0 0 1.000 0.0000 1.000 1.000
14 7 0 0 1.000 0.0000 1.000 1.000
21 13 0 0 1.000 0.0000 1.000 1.000
29 8 1 5 0.909 0.0867 0.508 0.987
Microtopography=1
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 9 0 0 1.000 0.0000 1.000 1.000
14 17 1 0 0.933 0.0644 0.613 0.990
21 21 3 0 0.798 0.0909 0.545 0.919
29 15 3 7 0.655 0.1060 0.409 0.819
Microtopography=2
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 1 2 0 0.333 0.272 0.00896 0.774
14 7 1 0 0.267 0.226 0.00968 0.686
21 8 1 0 0.233 0.200 0.00990 0.632
29 3 1 5 0.156 0.148 0.00636 0.504
Late to the party...
The median survival time of 29 days is the median incubation time that birds of this species are expected to be in the egg before they hatch, based on your data. Your 21-23 days (based on what?) is probably based on many experiments/studies of eggs that have hatched, ignoring those that haven't hatched yet (those that failed?).
From your overall survival curve, it is clear that some eggs have not yet hatched, even after more than 35 days. These are taken into account when calculating the expected survival times. If you think that these eggs will fail, then omit them; otherwise, the software cannot possibly know that they will eventually fail. But how can anyone know for sure whether an egg is going to fail, even after 30 days? Is there a known maximum hatching time, i.e. the record-breaker of all hatched eggs?
These are not really R questions, so this might be more appropriate for the statistics site, but the following might help.
how is it possible that the median survival time in plot #1 is 29 days?
The median survival is where the survival curve passes the 50% mark. Eyeballing it, 29 days looks right.
I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point.
Given your data, you cannot compare the medians. You can compare the 75% or 90% survival times, if you must. You can compare the point survival at, say, 30 days. You can compare the truncated (restricted) mean survival over the first 30 days.
In order to compare the medians, you would have to make an assumption. A reasonable assumption would be an exponential decay after some tenure point that includes at least one failure.
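For the quantile, point-survival and restricted-mean comparisons above, a hedged sketch with the survival package (the survfit call simply mirrors the npsurv call in the question; S, Microtopography and nestdata are the objects from the post):
library(survival)
# Refit mirroring the call in the question (npsurv wraps survfit)
fit <- survfit(S ~ Microtopography, data = nestdata, conf.type = "log-log")
# Time at which each curve drops to 75% survival (the 0.25 quantile)
quantile(fit, probs = 0.25)
# Point survival (with confidence intervals) at 30 days, per microtopography level
summary(fit, times = 30)
# Restricted mean survival time over the first 30 days
print(fit, rmean = 30)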
I am using the gbm package in R with the 'bernoulli' distribution option to build a classifier, and I get unusual results of 'nan' and am unable to predict any classification results. But I do not encounter the same errors when I use 'adaboost'. Below is sample code; I replicated the same errors with the iris dataset.
## using the iris data for gbm
library(caret)
library(gbm)
data(iris)
Data <- iris[1:100,-5]
Label <- as.factor(c(rep(0,50), rep(1,50)))
# Split the data into training and testing
inTraining <- createDataPartition(Label, p=0.7, list=FALSE)
training <- Data[inTraining, ]
trainLab <- droplevels(Label[inTraining])
testing <- Data[-inTraining, ]
testLab <- droplevels(Label[-inTraining])
# Model
model_gbm <- gbm.fit(x=training, y= trainLab,
distribution = "bernoulli",
n.trees = 20, interaction.depth = 1,
n.minobsinnode = 10, shrinkage = 0.001,
bag.fraction = 0.5, keep.data = TRUE, verbose = TRUE)
## output on the console
Iter TrainDeviance ValidDeviance StepSize Improve
1 -nan -nan 0.0010 -nan
2 nan -nan 0.0010 nan
3 -nan -nan 0.0010 -nan
4 nan -nan 0.0010 nan
5 -nan -nan 0.0010 -nan
6 nan -nan 0.0010 nan
7 -nan -nan 0.0010 -nan
8 nan -nan 0.0010 nan
9 -nan -nan 0.0010 -nan
10 nan -nan 0.0010 nan
20 nan -nan 0.0010 nan
Please let me know if there is a workaround to get this working. The reason I am using this is to experiment with additive logistic regression; please suggest any other alternatives in R for doing this.
Thanks.
Is there a reason you are using gbm.fit() instead of gbm()?
Based on the package documentation, the y variable in gbm.fit() needs to be a vector.
I tried making sure the response was a vector by forcing it with
trainLab <- as.vector(droplevels(Label[inTraining])) #vector of chars
That gave the following output on the console. Unfortunately, I'm not sure why the valid deviance is still -nan.
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3843 -nan 0.0010 0.0010
2 1.3823 -nan 0.0010 0.0010
3 1.3803 -nan 0.0010 0.0010
4 1.3783 -nan 0.0010 0.0010
5 1.3763 -nan 0.0010 0.0010
6 1.3744 -nan 0.0010 0.0010
7 1.3724 -nan 0.0010 0.0010
8 1.3704 -nan 0.0010 0.0010
9 1.3684 -nan 0.0010 0.0010
10 1.3665 -nan 0.0010 0.0010
20 1.3471 -nan 0.0010 0.0010
train.fraction should be < 1 to get a ValidDeviance, because that is what creates a validation dataset.
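Putting those two points together, a hedged sketch (the 0/1 conversion and train.fraction = 0.8 are illustrative choices, not part of the answer above; gbm's bernoulli loss expects a numeric 0/1 response):
# bernoulli expects a numeric 0/1 response (a factor can produce the nan
# deviances shown above), and train.fraction < 1 holds out the tail of the
# data so ValidDeviance can be computed.
trainLab_num <- as.numeric(as.character(trainLab))   # factor "0"/"1" -> 0/1
model_gbm <- gbm.fit(x = training, y = trainLab_num,
                     distribution = "bernoulli",
                     n.trees = 20, interaction.depth = 1,
                     n.minobsinnode = 10, shrinkage = 0.001,
                     bag.fraction = 0.5, train.fraction = 0.8,
                     keep.data = TRUE, verbose = TRUE)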
Thanks!
> lenss
xd yd zd
1 0.0000 0.0000 2.44479
2 0.0937 0.0000 2.73183
3 0.3750 0.0000 2.97785
4 0.8437 0.0000 3.18626
5 1.5000 0.0000 3.36123
6 2.3437 0.0000 3.50624
7 3.3750 0.0000 3.62511
8 4.5937 0.0000 3.72124
9 5.9999 0.0000 3.79778
10 7.5936 0.0000 3.85744
11 9.3749 0.0000 3.90241
12 11.3436 0.0000 3.93590
13 13.4998 0.0000 3.96011
14 15.8435 0.0000 3.97648
15 18.3748 0.0000 3.98236
16 21.0935 0.0000 3.99406
17 23.9997 0.0000 3.99732
18 27.0934 0.0000 3.99911
19 30.3746 0.0000 4.00004
20 33.8433 0.0000 4.00005
21 37.4995 0.0000 4.00006
22 0.0663 0.0663 3.99973
23 0.2652 0.2652 3.99988
24 0.5966 0.5966 3.99931
25 1.0606 1.0606 3.99740
26 1.6573 1.6573 3.99375
27 2.3865 2.3865 3.98732
28 3.2482 3.2482 3.97640
29 4.2426 4.2426 3.95999
30 5.3695 5.3695 3.93598
31 6.6290 6.6290 3.90258
32 8.0211 8.0211 3.85171
33 9.5458 9.5458 3.79754
34 11.2031 11.2031 3.72156
35 12.9929 12.9929 3.62538
36 14.9153 14.9153 3.50636
37 16.9703 16.9703 3.36129
38 19.1579 19.1579 3.18622
39 21.4781 21.4781 2.97802
40 23.9308 23.9308 2.73206
41 26.5162 26.5162 2.44464
> rd=sqrt(xd^2+yd^2)
> fit=nls(zd~(rd^2/R)/(1+sqrt(1-(1+k)*rd^2/R^2))+d,start=list(R=75,k=-1,d=1))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In addition: Warning message:
In sqrt(1 - (1 + k) * rd^2/R^2) : NaNs produced
The function for that model was given above. The question states that there are a few inaccurate measurements in the data and I need to find them. I was going to fit the model first and then examine the residuals for every measurement.
The argument of sqrt must be non-negative, but there is no assurance that it is in the setup shown in the question. Furthermore, even if that is fixed, it seems unlikely that the model can be fit in the way shown in the question, since the data consist of two distinct curves (see the graphic below) which will likely have to be fit separately.
Using the drc package we can get a reasonable fit using its LL2.5 model like this:
library(drc)
plot(zd ~ rd)
g <- rep(1:2, c(21, 20))
fit1 <- drm(zd ~ rd, fct = LL2.5(), subset = g == 1)
fit2 <- drm(zd ~ rd, fct = LL2.5(), subset = g == 2)
lines(fitted(fit1) ~ rd[g == 1])
lines(fitted(fit2) ~ rd[g == 2])
This involves 10 parameters (5 for each curve). You might try the other models available in drc for the fct argument to see whether you can find something more parsimonious.
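If it helps, candidate drc models can be compared quickly on one branch with mselect (an illustrative sketch; the listed fct alternatives are arbitrary choices, not a recommendation):
library(drc)
# Compare a few candidate dose-response models for the first branch by information criteria
mselect(fit1, fctList = list(LL.4(), LL.5(), W1.4()))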