GBM Rule Generation - Coding Advice - r

I use the R package GBM as probably my first choice for predictive modeling. There are so many great things about this algorithm but the one "bad" is that I cant easily use model code to score new data outside of R. I want to write code that can be used in SAS or other system (I will start with SAS (no access to IML)).
Lets say I have the following data set (from GBM manual) and model code:
library(gbm)
set.seed(1234)
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
#X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
X3[sample(1:N,size=30)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6, # formula
data=data, # dataset
var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
distribution="gaussian",
n.trees=2, # number of trees
shrinkage=0.005, # shrinkage or learning rate,
# 0.001 to 0.1 usually work
interaction.depth=5, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 1, # subsampling fraction, 0.5 is probably best
train.fraction = 1, # fraction of data for training,
# first train.fraction*N used for training
n.minobsinnode = 10, # minimum total weight needed in each node
cv.folds = 5, # do 5-fold cross-validation
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=TRUE) # print out progress
Now I can see the individual trees using pretty.gbm.tree as in
pretty.gbm.tree(gbm1,i.tree = 1)[1:7]
which yields
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight
0 2 1.5000000000 1 8 15 983.34315 1000
1 1 1.0309565491 2 6 7 190.62220 501
2 2 0.5000000000 3 4 5 75.85130 277
3 -1 -0.0102671518 -1 -1 -1 0.00000 139
4 -1 -0.0050342273 -1 -1 -1 0.00000 138
5 -1 -0.0076601353 -1 -1 -1 0.00000 277
6 -1 -0.0014569934 -1 -1 -1 0.00000 224
7 -1 -0.0048866747 -1 -1 -1 0.00000 501
8 1 0.6015416372 9 10 14 160.97007 469
9 -1 0.0007403551 -1 -1 -1 0.00000 142
10 2 2.5000000000 11 12 13 85.54573 327
11 -1 0.0046278704 -1 -1 -1 0.00000 168
12 -1 0.0097445692 -1 -1 -1 0.00000 159
13 -1 0.0071158065 -1 -1 -1 0.00000 327
14 -1 0.0051854993 -1 -1 -1 0.00000 469
15 -1 0.0005408284 -1 -1 -1 0.00000 30
The manual page 18 shows the following:
Based on the manual, the first split occurs on the 3rd variable (zero based in this output) which is gbm1$var.names[3] "X3". The variable is ordered factor.
types<-lapply (lapply(data[,gbm1$var.names],class), function(i) ifelse (strsplit(i[1]," ")[1]=="ordered","ordered",i))
types[3]
So, the split is at 1.5 meaning the value 'd and c' levels[[3]][1:2.5] (also zero based) splits to left node and the others levels[[3]][3:4] go to the right.
Next, the rule continues with a split at gbm1$var.names[2] as denoted by SplitVar=1 in the row indexed 1.
Has anyone written anything to move through this data structure (for each tree), constructing rules such as:
"If X3 in ('d','c') and X2<1.0309565491 and X3 in ('d') then scoreTreeOne= -0.0102671518"
which is how I think the first rule from this tree reads.
Or have any advice how to best do this?

The mlmeta package has a function gbm2sas that exports a GBM model from R to SAS.

Here is a very generic answer of how this might be done.
Add some R code to write the output to a file. https://stat.ethz.ch/R-manual/R-devel/library/base/html/sink.html
Then through SAS, access the ability to execute R with: http://support.sas.com/documentation/cdl/en/hostunx/61879/HTML/default/viewer.htm#a000303551.htm
(You'll need to know where your R executable is to point the R code you have written above at the executable)
From there you should be able to manipulate the output within SAS to do any scoring you may need.
If it is simply a one time scoring and not a process, omit the SAS execution of R and simply develop SAS code to parse through the R output file.

Related

sandwich + mlogit: `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

I am trying to compute robust/cluster standard errors after using mlogit() to fit a Multinomial Logit (MNL) in a Discrete Choice problem. Unfortunately, I suspect I am having problems with it because I am using data in long format (this is a must in my case), and getting the error #Error in ef/X : non-conformable arrays after sandwich::vcovHC( , "HC0").
The Data
For illustration, please gently consider the following data. It represents data from 5 individuals (id_ind ) that choose among 3 alternatives (altern). Each of the five individuals chose three times; hence we have 15 choice situations (id_choice). Each alternative is represented by two generic attributes (x1 and x2), and the choices are registered in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
Consequently, we can fit an MNL using mlogit() and extract their robust variance-covariance as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2|0 ,
method ="nr",
data = df,
idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays
As we can see there is an error produced by sandwich::vcovHC, which says that ef/X is non-conformable. Where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the mirror on GitHub I spot the problem which comes from the fact that, given that the data is in long format, ef has dimensions 15 x 2 and X has 45 x 2.
My workaround
Given that the show must continue, I am computing the robust and cluster standard errors manually using some functions that I borrow from sandwich and I adjusted to accommodate the Stata's output.
> Robust Standard Errors
These lines are inspired on the sandwich::meat() function.
psi<- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n-1))* crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
# x1 x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .23050262
y:x2 .09840356 .12765662
> Clustered Standard Errors
Here, given that each individual answers 3 questions is highly likely that there is some degree of correlation among individuals; hence cluster corrections should be preferred in such situations. Below I compute the cluster correction in this case and I show the equivalence with the Stata output of clogit , cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice,)]
psi_2 <- rowsum(psi, group = id_ind_collapsed )
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster-1))* crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
# x1 x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .17667075
y:x2 .1007703 .11800038
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make it work with models in long format like the one described here? For example, providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noted that there is a dedicated bread function in sandwich for mlogit, but I could not spot something like meat for mlogit, but anyways I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can just be applied in models with a single linear predictor where the score function aka estimating function is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich") in the package. The ?vcovHC also pointed to this but did not explain it very well. I have improved this in the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html now:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
Stata, however, divides by 1/(n-1) in logit models, see:
Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for using specifically one or the other adjustment. And already in moderately large samples, this makes no difference anyway.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models you just have to consider that the cluster variable just needs to be provided once (as opposed to stacked several times in long format).
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
## x1 x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
## x1 x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

extract h2o random forest in format like rpart frame

The following code:
library(randomForest)
z.auto <- randomForest(Mileage ~ Weight,
data=car.test.frame,
ntree=1,
nodesize = 15)
tree <- getTree(z.auto,k=1,labelVar = T)
tree
Gives this as text output:
left daughter right daughter split var split point status prediction
1 2 3 Weight 2567.5 -3 24.45000
2 0 0 <NA> 0.0 -1 30.66667
3 4 5 Weight 3087.5 -3 22.37778
4 6 7 Weight 2747.5 -3 24.00000
5 8 9 Weight 3637.5 -3 19.94444
6 0 0 <NA> 0.0 -1 25.20000
7 10 11 Weight 2770.0 -3 23.29412
8 0 0 <NA> 0.0 -1 21.18182
9 0 0 <NA> 0.0 -1 18.00000
10 0 0 <NA> 0.0 -1 22.50000
11 0 0 <NA> 0.0 -1 23.72727
From this data I can see the logic of an individual tree.
How do I get the much longer table, based on this, that describes all the trees in a random forest, from h2o?
I like 'h2o' because it cleanly uses all the cores, and goes at a pretty good clip on my system. It is a nice tool. It is, however, a library separate from 'r' so I am unsure how to access various parts of my data.
How do I get something like the above printed output, in the form of a csv file, from an h2o random forest?
H2O doesn't currently have a function to display a table like that, but you can export the random forest model to POJO (a Java file) using the
h2o.download_pojo() function and then inspect the tree (individual rules) manually.
H2O also accepts feature requests.

Understanding tree structure in R gbm package

I am having some difficulty understanding how the trees are structured in R's gbm gradient boosted machine package. Specifically, looking at the output of the pretty.gbm.tree Which features do the indices in SplitVar point to?
I trained a GBM on a dataset, here is the top ~quarter of one of my trees -- the result of a call to pretty.gbm.tree:
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 9 6.250000e+01 1 2 21 0.6634681 5981 0.005000061
1 -1 1.895699e-12 -1 -1 -1 0.0000000 3013 0.018956988
2 31 4.462500e+02 3 4 20 1.0083722 2968 -0.009168477
3 -1 1.388483e-22 -1 -1 -1 0.0000000 1430 0.013884830
4 38 5.500000e+00 5 18 19 1.5748155 1538 -0.030602956
5 24 7.530000e+03 6 13 17 2.8329899 361 -0.078738904
6 41 2.750000e+01 7 11 12 2.2499063 334 -0.064752766
7 28 -3.155000e+02 8 9 10 1.5516610 57 -0.243675567
8 -1 -3.379312e-11 -1 -1 -1 0.0000000 45 -0.337931219
9 -1 1.922333e-10 -1 -1 -1 0.0000000 12 0.109783128
```
It looks to me here that the indices are 0 based, from looking at how LeftNode, RightNode, and MissingNode point to different rows. When testing this out by using data samples and following it down the tree to their prediction, I get the correct answer when I consider SplitVar to be using 1 based indexing.
However, 1 of the many trees I build has a zero in the SplitVar column! Here is this tree:
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 4 1.462500e+02 1 2 21 0.41887 5981 0.0021651262
1 -1 4.117688e-22 -1 -1 -1 0.00000 512 0.0411768781
2 4 1.472500e+02 3 4 20 1.05222 5469 -0.0014870985
3 -1 -2.062798e-11 -1 -1 -1 0.00000 23 -0.2062797579
4 0 4.750000e+00 5 6 19 0.65424 5446 -0.0006222011
5 -1 3.564879e-23 -1 -1 -1 0.00000 4897 0.0035648788
6 28 -3.195000e+02 7 11 18 1.39452 549 -0.0379703437
What is the correct way to view the indexing used by gbm's trees?
The first column that is printed when you use the pretty.gbm.tree is the row.names that is assigned in the script pretty.gbm.tree.R. In the script, the row.names is assigned as row.names(temp) <- 0:(nrow(temp)-1) where temp is the tree information stored in data.frame form. The right way to interpret the row.names is to read it as the node_id with the root node being assigned a 0 value.
In your example:
Id SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 9 6.250000e+01 1 2 21 0.6634681 5981 0.005000061
means that the root node (indicated by the row number 0) is split by the 9-th split variable (the numbering of the split variable here starts from 0, so the split variable is the 10th column in the training set x). SplitCodePred of 6.25 denotes that all points less than 6.25 went to the LeftNode 1 and all points greater than 6.25 went to RightNode 2. All points that had a missing value in this column were assigned to the MissingNode 21. The ErrorReduction was 0.6634 due to this split and there were 5981 (Weight) in the root node. Prediction of 0.005 denotes the value assigned to all values at this node before the point was split. In the case of terminal nodes (or leaves) denoted by -1 in SplitVar, LeftNode, RightNode, and MissingNode, the Prediction denotes the value predicted for all the points belonging to this leaf node adjusted (times) times the shrinkage.
To understand the tree structure, its important to note that the splitting of the tree happens in a depth first fashion. So when the root node (with node id 0) is split into its left node and right node, the left side is processed until no further splits are possible before returning and labeling the right node. In both the trees in your example, the RightNode gets a value of 2. This is because in both cases, the LeftNode turns out to be a leaf node.

Novice needs to loop lm in R

I'm a PhD student of genetics and I am trying do association analysis of some genetic data using linear regression. In the table below I'm regressing each 'trait' against each 'SNP' There is also a interaction term include as 'var'
I've only used R for 2 weeks and I don't have any programming background so please explain any help provided as I want to understand.
This is a sample of my data:
Sample ID var trait 1 trait 2 trait 3 SNP1 SNP2 SNP3
77856517 2 188 3 2 1 0 0
375689755 8 17 -1 -1 1 -1 -1
392513415 8 28 14 4 1 1 1
393612038 8 85 14 6 1 1 0
401623551 8 152 11 -1 1 0 0
348466144 7 -74 11 6 1 0 0
77852806 4 81 16 6 1 1 0
440614343 8 -93 8 0 0 1 0
77853193 5 3 6 5 1 1 1
and this is the code I've been using for a single regression:
result1 <-lm(trait1~SNP1+var+SNP1*var, na.action=na.exclude)
I want to run a loop where every trait is tested against each SNP.
I've been trying to modify codes I've found online but I always run into some error that I don't understand how to solve.
Thank you for any and all help.
Personally I don't find the problem so easy. Specially for an R novice.
Here a solution based on creating dynamically the regression formula.
The idea is to use paste function to create different formula terms, y~ x + var + x * var then coercing the result string tp a formula using as.formula. Here y and x are the formula dynamic terms: y in c(trait1,trai2,..) and x in c(SNP1,SNP2,...). Of course here I use lapply to loop.
lapply(1:3,function(i){
y <- paste0('trait',i)
x <- paste0('SNP',i)
factor1 <- x
factor2 <- 'var'
factor3 <- paste(x,'var',sep='*')
listfactor <- c(factor1,factor2,factor3)
form <- as.formula(paste(y, "~",paste(listfactor,collapse="+")))
lm(formula = form, data = dat)
})
I hope someone come with easier solution, ore more R-ish one:)
EDIT
Thanks to #DWin comment , we can simplify the formula to just y~x*var since it means y is modeled by x,var and x*var
So the code above will be simplified to :
lapply(1:3,function(i){
y <- paste0('trait',i)
x <- paste0('SNP',i)
LHS <- paste(x,'var',sep='*')
form <- as.formula(paste(y, "~",LHS)
lm(formula = form, data = dat)
})

Circular Significance Testing

If I have wind direction readings from a collection of wind vanes, is there something like a t.test (or other significance test) that I can perform on the circular data? I am assuming a normal distribution (which the data below is from). I found the CircStats package, but figured I would check here for some additional guidance.
Some sample data:
df1 <- data.frame(unit=letters, wind.direction=c(99,88,93,99,86,90,101,109,109,91,86,94,106,92,99,103,110,98,107,109,93,102,92,99,109,85))
That one works fine using just a standard t.test since it doesn't wrap around zero. But,
df2 <- data.frame(unit=letters, wind.direction=c(1,350,355,1,348,352,3,11,11,353,348,356,8,3,1,5,12,0,9,11,355,4,354,1,11,347))
doesn't since its circular mean is ~0 but linear mean is ~139...
You can use aov.circular, in the circular package.
# Sample data (with two groups, to compare the means)
library(circular)
x <- as.circular(
c(1,350,355,1,348,352,3,11,11,353,348,356,
8,3,1,5,12,0,9,11,355,4,354,1,11,347),
unit="degrees"
)
g <- sample(LETTERS[1:2], 26, replace=TRUE)
# Test
aov.circular(x, g)
This is what I meant to say:
> df2$wd.scaled = apply(as.matrix(df2[,2]),1,function(x) ifelse(x>180,x-360,x))
> df2
unit wind.direction wd2 wd.scaled
1 a 1 1 1
2 b 350 -10 -10
3 c 355 -5 -5
4 d 1 1 1
5 e 348 -12 -12
6 f 352 -8 -8
> mean(df2$wd.scaled)
[1] 0.3846154
This would work if you don't have many observations near 180.

Resources