Understanding tree structure in R's gbm package

I am having some difficulty understanding how the trees are structured in R's gbm gradient boosted machine package. Specifically, looking at the output of pretty.gbm.tree: which features do the indices in SplitVar point to?
I trained a GBM on a dataset, here is the top ~quarter of one of my trees -- the result of a call to pretty.gbm.tree:
```
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
0        9  6.250000e+01        1         2          21      0.6634681   5981  0.005000061
1       -1  1.895699e-12       -1        -1          -1      0.0000000   3013  0.018956988
2       31  4.462500e+02        3         4          20      1.0083722   2968 -0.009168477
3       -1  1.388483e-22       -1        -1          -1      0.0000000   1430  0.013884830
4       38  5.500000e+00        5        18          19      1.5748155   1538 -0.030602956
5       24  7.530000e+03        6        13          17      2.8329899    361 -0.078738904
6       41  2.750000e+01        7        11          12      2.2499063    334 -0.064752766
7       28 -3.155000e+02        8         9          10      1.5516610     57 -0.243675567
8       -1 -3.379312e-11       -1        -1          -1      0.0000000     45 -0.337931219
9       -1  1.922333e-10       -1        -1          -1      0.0000000     12  0.109783128
```
It looks to me that the node indices are 0-based, judging from how LeftNode, RightNode, and MissingNode point to different rows. But when I test this out by taking data samples and following them down the tree to their predictions, I get the correct answer when I treat SplitVar as using 1-based indexing.
However, one of the many trees I built has a zero in the SplitVar column! Here is that tree:
```
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0        4  1.462500e+02        1         2          21        0.41887   5981  0.0021651262
1       -1  4.117688e-22       -1        -1          -1        0.00000    512  0.0411768781
2        4  1.472500e+02        3         4          20        1.05222   5469 -0.0014870985
3       -1 -2.062798e-11       -1        -1          -1        0.00000     23 -0.2062797579
4        0  4.750000e+00        5         6          19        0.65424   5446 -0.0006222011
5       -1  3.564879e-23       -1        -1          -1        0.00000   4897  0.0035648788
6       28 -3.195000e+02        7        11          18        1.39452    549 -0.0379703437
```
What is the correct way to view the indexing used by gbm's trees?

The first column printed by pretty.gbm.tree is the row.names assigned in the script pretty.gbm.tree.R. In that script, the row names are assigned as row.names(temp) <- 0:(nrow(temp)-1), where temp is the tree information stored as a data.frame. The right way to interpret the row names is as node ids, with the root node assigned the value 0.
In your example:
```
Id SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight  Prediction
0         9  6.250000e+01        1         2          21      0.6634681   5981 0.005000061
```
means that the root node (indicated by row number 0) is split by the 9th split variable (the numbering of split variables starts from 0, so the split variable is the 10th column in the training set x). The SplitCodePred of 62.5 denotes that all points less than 62.5 went to LeftNode 1 and all points greater than 62.5 went to RightNode 2. All points that had a missing value in this column were assigned to MissingNode 21. The ErrorReduction due to this split was 0.6634, and there were 5981 observations (Weight) in the root node. The Prediction of 0.005 denotes the value assigned to all points at this node before it was split. In the case of terminal nodes (or leaves), denoted by -1 in SplitVar, LeftNode, RightNode, and MissingNode, the Prediction denotes the value predicted for all the points belonging to that leaf node, multiplied by the shrinkage.
To understand the tree structure, it's important to note that the splitting of the tree happens in a depth-first fashion. So when the root node (with node id 0) is split into its left and right nodes, the left side is processed until no further splits are possible, before returning and labeling the right node. In both of the trees in your example, the RightNode gets a value of 2. This is because in both cases the LeftNode turns out to be a leaf node.
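To make the indexing concrete, here is a minimal sketch (my own, not part of gbm) of how a single observation could be followed down the data frame returned by pretty.gbm.tree(). It assumes every split variable is numeric; ordered and categorical splits encode SplitCodePred differently (the latter index into the model's c.splits), so they would need extra handling:

```
# tree: data frame from pretty.gbm.tree(); x: numeric predictor vector
score_one_tree <- function(tree, x) {
  node <- 1                                 # row 1 holds node id 0
  while (tree$SplitVar[node] != -1) {       # SplitVar == -1 marks a leaf
    val <- x[tree$SplitVar[node] + 1]       # SplitVar is 0-based into x
    if (is.na(val)) {
      node <- tree$MissingNode[node] + 1    # node ids are 0-based row names
    } else if (val < tree$SplitCodePred[node]) {
      node <- tree$LeftNode[node] + 1
    } else {
      node <- tree$RightNode[node] + 1
    }
  }
  tree$Prediction[node]                     # leaf value, shrinkage included
}
```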

Related

Hierarchical clustering with specific number of data in each cluster

I'm clustering a set of words using "Hierarchical Clustering". I want each cluster to contain a certain number of words, for example 2 words, or 3 words.
I'm trying to modify existing code for this clustering.
I just set the value of max(d) to Inf as well:
```
Lm[min(d),] <- sl
Lm[,min(d)] <- sl
if (length(cluster) > 2) {  # if it's already clustered with more than 2 points,
                            # then don't cluster them again by setting values to Inf
  Lm[min(d), min(d)] <- Inf
  Lm[max(d), max(d)] <- Inf
  Lm[max(d),] <- Inf
  Lm[,max(d)] <- Inf
  Lm[min(d),] <- Inf
  Lm[,min(d)] <- Inf
}
```
However, it doesn't give me the expected results. I was wondering whether this is the correct approach? How can I do this type of constrained clustering in R?
An example of the results I got:

```
row    V1   V2
166  -194  -38
167   166   -1
...
240   239  239
241   240  240
242   241  241
243   242  242
244   243  243
```
This will be tough to optimize, and it can produce arbitrarily bad results, because your size constraint goes against the principles of clustering.
Consider the one-dimensional data set -100, -1, 1, 100, and assume you want to limit the cluster size to 2 elements. Hierarchical clustering will first merge -1 and +1 because they are closest. Now they have reached the maximum size, so the only remaining option is to cluster -100 and +100: the worst possible result, since that cluster is as wide as the entire data set.
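A quick way to see this (my sketch, not code from the original answer) is to check the merge order hclust actually produces on that data set:

```
x <- c(-100, -1, 1, 100)
hc <- hclust(dist(x), method = "single")
hc$merge[1, ]   # first merge is observations 2 and 3, i.e. -1 and +1
```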
Just to give you an example of what I meant with partitional clustering:
```
library(cluster)
data("ruspini")

desired_cluster_size <- 3L
corresponding_num_clusters <- round(nrow(ruspini) / desired_cluster_size)
km <- kmeans(ruspini, corresponding_num_clusters)
table(km$cluster)
```

```
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
 3  3  2  4  2  2  2  1  3  3  2  3  2  3  3  2  6  3  2  1  3  6  2  8  4
```
This definitely can't guarantee how many observations you'll have in each group, and it's not deterministic, but it at least gives you an approximation.
In the tabulated results you can see that most of the 25 clusters ended up with 2 or 3 elements.

extract h2o random forest in format like rpart frame

The following code:
```
library(randomForest)
library(rpart)  # provides the car.test.frame data set

z.auto <- randomForest(Mileage ~ Weight,
                       data = car.test.frame,
                       ntree = 1,
                       nodesize = 15)
tree <- getTree(z.auto, k = 1, labelVar = TRUE)
tree
```
Gives this as text output:
```
   left daughter right daughter split var split point status prediction
1              2              3    Weight      2567.5     -3   24.45000
2              0              0      <NA>         0.0     -1   30.66667
3              4              5    Weight      3087.5     -3   22.37778
4              6              7    Weight      2747.5     -3   24.00000
5              8              9    Weight      3637.5     -3   19.94444
6              0              0      <NA>         0.0     -1   25.20000
7             10             11    Weight      2770.0     -3   23.29412
8              0              0      <NA>         0.0     -1   21.18182
9              0              0      <NA>         0.0     -1   18.00000
10             0              0      <NA>         0.0     -1   22.50000
11             0              0      <NA>         0.0     -1   23.72727
```
From this data I can see the logic of an individual tree.
How do I get the much longer table, based on this, that describes all the trees in a random forest, from h2o?
I like h2o because it cleanly uses all the cores and goes at a pretty good clip on my system. It is a nice tool. It is, however, a library separate from R, so I am unsure how to access various parts of my data.
How do I get something like the above printed output, in the form of a csv file, from an h2o random forest?
H2O doesn't currently have a function to display a table like that, but you can export the random forest model to a POJO (a Java file) using the h2o.download_pojo() function and then inspect the trees (individual rules) manually.
H2O also accepts feature requests.
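For reference, a minimal sketch of that export step (the model, training frame, and output path here are illustrative, not from the question):

```
library(h2o)
h2o.init()

# a small hypothetical random forest; any H2O tree model exports the same way
df <- as.h2o(rpart::car.test.frame)
rf <- h2o.randomForest(x = "Weight", y = "Mileage",
                       training_frame = df, ntrees = 1)

# writes a .java file containing the full split logic of every tree
h2o.download_pojo(rf, path = tempdir())
```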

Creating a Dichotomous Variable in R

I have imported a csv file which contains 2044 observations of 3 variables, CASEID, DEGREE, and HRS1.
The first 6 observations appeared as:
```
head(degree.wrk)
  CASEID DEGREE HRS1
1  53044      3   55
2  53045      3   45
3  53046      0   -1
4  53047      0   -1
5  53048      0   -1
6  53049      0   -1
```
I want to create a dichotomous variable based on DEGREE which indicates whether a person has earned at least a bachelor's degree. According to the codebook, DEGREE values greater than or equal to 3 indicate a minimum of a bachelor's degree earned. If the minimum has been met, I want it to return "Yes"; if not, "No". I used the ifelse() function and it appears to have worked, but I wonder whether replacing the numerical value of DEGREE with the Yes or No category label is the correct action when seeking to create a dichotomous variable, or whether I have simply replaced or recoded an existing variable.
The results of the ifelse() function are as follows:
```
degree.wrk$DEGREE <- ifelse(degree.wrk$DEGREE >= 3,
                            c("Yes"),
                            c("No"))
head(degree.wrk)
  CASEID DEGREE HRS1
1  53044    Yes   55
2  53045    Yes   45
3  53046     No   -1
4  53047     No   -1
5  53048     No   -1
6  53049     No   -1
```
Any advice as to whether or not I adequately created a dichotomous variable using this method?
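One common alternative worth considering (a minimal sketch of mine, not from the thread): keep DEGREE intact and store the dichotomy in a new column, so the original coding remains available, and make it an explicit factor:

```
# add a separate dichotomous factor instead of overwriting DEGREE
degree.wrk$bachelors <- factor(degree.wrk$DEGREE >= 3,
                               levels = c(FALSE, TRUE),
                               labels = c("No", "Yes"))
```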

Phylogenetic tree

I am working to build a phylogenetic tree based on pairwise data of genes. Below is a subset of my data (test.txt). The tree does not have to be constructed on the basis of any DNA sequences; the gene names can be treated simply as words.
```
ID gene1  gene2
1  ADRA1D ADK
2  ADRA1B ADK
3  ADRA1A ADK
4  ADRB1  ASIC1
5  ADRB1  ADK
6  ADRB2  ASIC1
7  ADRB2  ADK
8  AGTR1  ACHE
9  AGTR1  ADK
10 ALOX5  ADRB1
11 ALOX5  ADRB2
12 ALPPL2 ADRB1
13 ALPPL2 ADRB2
14 AMY2A  AGTR1
15 AR     ADORA1
16 AR     ADRA1D
17 AR     ADRA1B
18 AR     ADRA1A
19 AR     ADRA2A
20 AR     ADRA2B
```
Below is my code in R
```
library(ape)
tab <- read.csv("test.txt", sep = "\t", header = TRUE)
d <- dist(tab, method = "euclidean")
fit <- hclust(d, method = "ward")
plot(as.phylo(fit))
```
My resulting figure is attached (not reproduced here).
I have a question about how they are clustered. The pairs

```
17 AR ADRA1B
18 AR ADRA1A
```

and

```
2 ADRA1B ADK
3 ADRA1A ADK
```

should be clustered closely because they have one common gene, so 17 and 2 should be together, and 18 and 3.
Should I use another method, if I am wrong in using Euclidean distance here?
Should I convert my data to a matrix of rows and columns, where gene1 is the x-axis and gene2 is the y-axis, with each cell filled with 1 or 0? (Basically, 1 if they are paired and 0 if not.)
Updated code:

```
table <- table(tab$gene1, tab$gene2)
d <- dist(table, method = "euclidean")
fit <- hclust(d, method = "ward")
plot(as.phylo(fit))
```
However, with this I only get the genes from the gene1 column and not the gene2 column. The figure below is exactly what I want, but it should include the genes from the gene2 column as well.
There is some room for interpretation in the example of the question. My answer is only valid if there are really only two genes present in each individual and each row describes an individual. If, however, each row means that gene1 occurs with gene2 with certainty, no useful clustering can be performed, in my opinion. In that case I would expect an additional column stating the probability of their common occurrence, and something like a principal component analysis (PCA) may be preferred; but I am far from being an expert on (hierarchical) clustering.
Before you can use the dist function, you have to bring your data into an appropriate format:
```
# convert test data into suitable format
gene.names <- sort(unique(c(tab[,"gene1"], tab[,"gene2"])))
gene.matrix <- cbind(tab[,"ID"],
                     matrix(0L, nrow = nrow(tab), ncol = length(gene.names)))
colnames(gene.matrix) <- c("ID", gene.names)
lapply(seq_len(nrow(tab)),
       function(x) gene.matrix[x, match(tab[x, c("gene1","gene2")],
                                        colnames(gene.matrix))] <<- 1)
```
The obtained gene.matrix has the shape:
```
     ID ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
[1,]  1    0   1      0      0      0      1      0
[2,]  2    0   1      0      0      1      0      0
[3,]  3    0   1      0      1      0      0      0
[4,]  4    0   0      0      0      0      0      0
...
```
So each row represents an observation (=individual) where the first column identifies the individual and each of the subsequent columns contains 1 if the gene is present and 0 if it is missing. On this matrix the dist function can be reasonably applied (ID column removed):
```
d <- dist(gene.matrix[,-1], method = "euclidean")
fit <- hclust(d, method = "ward")
plot(as.phylo(fit))
```
Maybe it is a good idea to read up on the differences between distance measures such as euclidean and manhattan. For instance, the Euclidean distance between the individuals with ID=1 and ID=2 is

```
euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)
```

whereas the Manhattan distance is

```
manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...
```
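Incidentally, for 0/1 data like gene.matrix the two measures are directly related: the Manhattan distance counts the genes in which two individuals differ, and the Euclidean distance is its square root. A quick check (my addition, not part of the original answer):

```
d_man <- dist(gene.matrix[,-1], method = "manhattan")
d_euc <- dist(gene.matrix[,-1], method = "euclidean")
all.equal(as.vector(d_euc), sqrt(as.vector(d_man)))  # TRUE for binary data
```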

GBM Rule Generation - Coding Advice

I use the R package gbm as probably my first choice for predictive modeling. There are so many great things about this algorithm, but the one "bad" is that I can't easily use the model code to score new data outside of R. I want to write code that can be used in SAS or another system (I will start with SAS, with no access to IML).
Let's say I have the following data set (from the gbm manual) and model code:
```
library(gbm)
set.seed(1234)

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4], N, replace = TRUE), levels = letters[4:1])
X4 <- factor(sample(letters[1:6], N, replace = TRUE))
X5 <- factor(sample(letters[1:3], N, replace = TRUE))
X6 <- 3*runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]

SNR <- 10  # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N, 0, sigma)

# introduce some missing values
#X1[sample(1:N,size=500)] <- NA
X4[sample(1:N, size = 300)] <- NA
X3[sample(1:N, size = 30)] <- NA

data <- data.frame(Y=Y, X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, X6=X6)

# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
            data=data,                   # dataset
            var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                         # +1: monotone increase, 0: no constraint
            distribution="gaussian",
            n.trees=2,                   # number of trees
            shrinkage=0.005,             # shrinkage or learning rate,
                                         # 0.001 to 0.1 usually work
            interaction.depth=5,         # 1: additive model, 2: two-way interactions, etc.
            bag.fraction = 1,            # subsampling fraction, 0.5 is probably best
            train.fraction = 1,          # fraction of data for training,
                                         # first train.fraction*N used for training
            n.minobsinnode = 10,         # minimum total weight needed in each node
            cv.folds = 5,                # do 5-fold cross-validation
            keep.data=TRUE,              # keep a copy of the dataset with the object
            verbose=TRUE)                # print out progress
```
Now I can see the individual trees using pretty.gbm.tree, as in

```
pretty.gbm.tree(gbm1, i.tree = 1)[1:7]
```

which yields
```
   SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight
0         2  1.5000000000        1         8          15      983.34315   1000
1         1  1.0309565491        2         6           7      190.62220    501
2         2  0.5000000000        3         4           5       75.85130    277
3        -1 -0.0102671518       -1        -1          -1        0.00000    139
4        -1 -0.0050342273       -1        -1          -1        0.00000    138
5        -1 -0.0076601353       -1        -1          -1        0.00000    277
6        -1 -0.0014569934       -1        -1          -1        0.00000    224
7        -1 -0.0048866747       -1        -1          -1        0.00000    501
8         1  0.6015416372        9        10          14      160.97007    469
9        -1  0.0007403551       -1        -1          -1        0.00000    142
10        2  2.5000000000       11        12          13       85.54573    327
11       -1  0.0046278704       -1        -1          -1        0.00000    168
12       -1  0.0097445692       -1        -1          -1        0.00000    159
13       -1  0.0071158065       -1        -1          -1        0.00000    327
14       -1  0.0051854993       -1        -1          -1        0.00000    469
15       -1  0.0005408284       -1        -1          -1        0.00000     30
```
Page 18 of the manual shows how to interpret this output (the excerpt is not reproduced here).
Based on the manual, the first split occurs on the 3rd variable (zero-based in this output), which is gbm1$var.names[3], "X3". That variable is an ordered factor:

```
types <- lapply(lapply(data[, gbm1$var.names], class),
                function(i) ifelse(strsplit(i[1], " ")[1] == "ordered", "ordered", i))
types[3]
```

So the split is at 1.5, meaning the levels 'd' and 'c' (levels[[3]][1:2.5], also zero-based) split to the left node and the other levels (levels[[3]][3:4]) go to the right.
Next, the rule continues with a split at gbm1$var.names[2], as denoted by SplitVar = 1 in the row indexed 1.
Has anyone written anything to move through this data structure (for each tree), constructing rules such as:

```
"If X3 in ('d','c') and X2 < 1.0309565491 and X3 in ('d') then scoreTreeOne = -0.0102671518"
```

which is how I think the first rule from this tree reads.
Or have any advice how to best do this?
The mlmeta package has a function gbm2sas that exports a GBM model from R to SAS.
Here is a very generic outline of how this might be done.
Add some R code to write the output to a file: https://stat.ethz.ch/R-manual/R-devel/library/base/html/sink.html
Then, from SAS, use its ability to execute R: http://support.sas.com/documentation/cdl/en/hostunx/61879/HTML/default/viewer.htm#a000303551.htm
(You'll need to know where your R executable is, so you can point SAS at it.)
From there you should be able to manipulate the output within SAS to do any scoring you may need.
If it is simply a one-time scoring and not a recurring process, omit the SAS execution of R and simply develop SAS code to parse the R output file.
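For the rule-generation step itself, here is a rough sketch of mine (untested, numeric splits only; ordered/categorical splits index into gbm1$c.splits and would need extra handling) that walks a pretty.gbm.tree() data frame and prints one rule per leaf:

```
print_rules <- function(tree, var.names, node = 1, path = character()) {
  if (tree$SplitVar[node] == -1) {          # leaf: emit the accumulated rule
    cat("If", paste(path, collapse = " and "),
        "then score =", tree$Prediction[node], "\n")
    return(invisible(NULL))
  }
  v <- var.names[tree$SplitVar[node] + 1]   # SplitVar is 0-based
  s <- tree$SplitCodePred[node]
  print_rules(tree, var.names, tree$LeftNode[node] + 1,
              c(path, paste(v, "<", s)))
  print_rules(tree, var.names, tree$RightNode[node] + 1,
              c(path, paste(v, ">=", s)))
  print_rules(tree, var.names, tree$MissingNode[node] + 1,
              c(path, paste(v, "is missing")))
}

# e.g. print_rules(pretty.gbm.tree(gbm1, i.tree = 1), gbm1$var.names)
```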
