Extract variable labels from rpart decision tree

I've used rpart to build a decision tree on a dataset that has categorical variables with hundreds of levels. The tree splits these variables on select values of the variable, and I would like to examine the labels on which each split is made.
If I just print the decision tree result, the display listing the splits gets truncated in the console, and either way it is not in an easily interpretable format (the levels are separated by commas). Is there a way to access this as an R object? I'm open to using another package to build the tree.

One issue here is that some of the functions in the rpart package are not exported. It appears you're looking to capture the output of the function rpart:::print.rpart. So, beginning with a reproducible example:
library(rpart)

set.seed(1)
df1 <- data.frame(y  = rbinom(n=100, size=1, prob=0.5),
                  x1 = rbinom(n=100, size=1, prob=0.25),
                  x2 = rbinom(n=100, size=1, prob=0.75))
(r1 <- rpart(y ~ ., data=df1))
giving
n= 100

node), split, n, deviance, yval
      * denotes terminal node

1) root 100 24.960000 0.4800000
  2) x1< 0.5 78 19.179490 0.4358974
    4) x2>=0.5 66 15.954550 0.4090909 *
    5) x2< 0.5 12  2.916667 0.5833333 *
  3) x1>=0.5 22  5.090909 0.6363636
    6) x2< 0.5  7  1.714286 0.4285714 *
    7) x2>=0.5 15  2.933333 0.7333333 *
Now, looking at rpart:::print.rpart, we see a call to rpart:::labels.rpart, which gives us the splits (the names of the 'rows' in the output above). The values of n, deviance, yval and more are stored in r1$frame, as can be seen by inspecting the output of unclass(r1).
Thus we could extract the above with
(df2 <- data.frame(split=rpart:::labels.rpart(r1), n=r1$frame$n))
giving
    split   n
1    root 100
2 x1< 0.5  78
3 x2>=0.5  66
4 x2< 0.5  12
5 x1>=0.5  22
6 x2< 0.5   7
7 x2>=0.5  15
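The other columns of r1$frame can be pulled the same way (a small extension of the above, not in the original answer; dev and yval are documented columns of the frame component):

df3 <- data.frame(split    = rpart:::labels.rpart(r1),
                  n        = r1$frame$n,
                  deviance = r1$frame$dev,
                  yval     = r1$frame$yval)

For factors with hundreds of levels, it may also help to pass minlength = 0 to labels (i.e. labels(r1, minlength = 0)), which should return the full level names rather than abbreviated ones.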

Related

Interpreting decision tree regression output in R

I created a decision tree in R using the "tree" package; however, when I look at the details of the model, I struggle to interpret the results.
The output of the model looks like this:
> model
node), split, n, deviance, yval
      * denotes terminal node

1) root 23 16270.0 32.350
  2) Y1 < 31 8  4345.0 59.880 *
  3) Y1 > 31 15  2625.0 17.670
    6) Y2 < 11.5 8 1310.0 26.000 *
    7) Y2 > 11.5 7  124.9  8.143 *
I don't understand the numbers shown in each line after the features. What are 16270.0 and 32.350? Or what are 2625.0 and 17.670? And why do some lines end with asterisks?
Any help is appreciated.
Thank you
The rules that you got are a text representation of the fitted tree. Each row in the output has five columns; let's look at the one that you asked about:
Y1 > 31 15 2625.0 17.670
Y1 > 31 is the splitting rule applied at the parent node.
15 is the number of points that fall in this node of the tree.
2625.0 is the deviance at this node (used to decide how the split was made).
17.670 is what you would predict for points at this node if you split no further.
The asterisks indicate leaf nodes, ones that are not split any further. So in the node described above (Y1 > 31), you could stop and predict 17.670 for all 15 points, but the full tree splits this into two nodes: one with 8 points where Y2 < 11.5 and another with 7 points where Y2 > 11.5. If you make this further split, you would predict 26.0 for the 8 points with Y2 < 11.5 (and Y1 > 31) and 8.143 for the 7 points with Y2 > 11.5 (and Y1 > 31).
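To make the deviance and yval columns concrete, here is a small sketch (df and y are hypothetical names standing in for your data): for a regression tree, the deviance at a node is the residual sum of squares around the node mean, and yval is that mean.

node_points <- df$y[df$Y1 > 31]           # hypothetical: responses in this node
sum((node_points - mean(node_points))^2)  # deviance, e.g. 2625.0
mean(node_points)                         # yval, e.g. 17.670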

Adjusted survival curve based on weighted Cox regression

I'm trying to make an adjusted survival curve based on a weighted Cox regression performed on a case-cohort data set in R, but unfortunately I can't make it work. I was therefore hoping that some of you may be able to figure out why it isn't working.
To illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data,subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
The data set is looking like this:
   seqno instit histol stage study rel edrel      age in.subcohort subcohort
4      4      2      1     4     3   0  6200 2.333333         TRUE      TRUE
7      7      1      1     4     3   1   324 3.750000        FALSE     FALSE
11    11      1      2     2     3   0  5570 2.000000         TRUE      TRUE
14    14      1      1     2     3   0  5942 1.583333         TRUE      TRUE
17    17      1      1     2     3   1   960 7.166667        FALSE     FALSE
22    22      1      1     2     3   1    93 2.666667        FALSE     FALSE
Then I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted Cox regression performed on a case-cohort data set, so I would generally also be interested in hearing whether there are alternatives to adjusted survival curves.
It is throwing errors for two reasons:
The ggadjustedcurves function is not being given a coxph object, which its help page indicates is the intended first argument.
The specification of the variable argument is incorrect. The correct way to specify a column is a length-1 character vector that matches one of the names in the formula; you gave it a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
                  data = ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not answer everything you want, but it does answer the "why the error" part of your question. You might want to review the methods used by Terry Therneau, Cynthia S. Crowson, and Elizabeth J. Atkinson in their vignette on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf
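If you do need the case-cohort weighting reflected in the curves, one possible workaround (my sketch, not part of the original answer, and the Barlow-style weights below are an assumption) is to fit coxph with inverse-sampling-fraction weights, since ggadjustedcurves accepts a weighted coxph fit:

# Sketch only: cases weighted 1, subcohort controls weighted by the
# inverse of the subcohort sampling fraction (assumed Barlow-style weights).
samp.frac <- sum(nwtco$in.subcohort) / 4028
ccoh.data$w <- ifelse(ccoh.data$rel == 1, 1, 1 / samp.frac)
fit.w <- coxph(Surv(edrel, rel) ~ stage + histol + age,
               data = ccoh.data, weights = w)
ggadjustedcurves(fit.w, variable = "stage", data = ccoh.data)

Whether the downstream expected-survival computation fully honors these weights is worth verifying against the vignette above.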

Output and plot p-values for an rpart decision tree

I want to plot a decision tree in R with rpart and fancyRpartPlot. The code is working, but I want to show the p-value of each split. When I print the tree (last line of the code), I get stars behind some of the nodes, which usually indicate statistical significance; I guess this is the case here too. However, I want to access the calculated p-values and include them in the plot. I would be very grateful if anyone has an idea on how to do this. Thanks!
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
seatbelts <- Seatbelts
seatbelts <- as.data.frame(seatbelts)
unique(seatbelts$law)
seatbelts_tree <- rpart(law ~ ., data=seatbelts)
plot(seatbelts_tree, uniform = TRUE, margin = 0.5)
text(seatbelts_tree)
prp(seatbelts_tree)
fancyRpartPlot(seatbelts_tree, type=2)
seatbelts_tree
The output of the above code already contains the answer: the * indicates a terminal node, which can be hard to spot in text output depending on the format.
n= 192

node), split, n, deviance, yval
      * denotes terminal node

 1) root 192 20.244790 0.11979170
   2) drivers>=1303 178  8.544944 0.05056180
     4) front>=663 158  1.974684 0.01265823
       8) kms< 18147.5 144  0.000000 0.00000000 *
       9) kms>=18147.5  14  1.714286 0.14285710 *
     5) front< 663  20  4.550000 0.35000000
      10) PetrolPrice< 0.1134217 11 0.000000 0.00000000 *
      11) PetrolPrice>=0.1134217  9 1.555556 0.77777780 *
   3) drivers< 1303  14  0.000000 1.00000000 *
If you want p-values, you should look into conditional inference trees (ctree in the party/partykit packages). Here is a related question with a short explanation and further reading material:
https://stats.stackexchange.com/questions/255150/how-to-interpret-this-decision-tree/255156
The stars in the print method of rpart highlight terminal nodes, not p-values. A decision tree is a descriptive method; it is not designed to test hypotheses.
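As a sketch of the conditional-inference route (my example, not taken from either answer): ctree from the partykit package selects splits with permutation tests, and both its print and plot methods report the p-value of each split.

library(partykit)

seatbelts <- as.data.frame(Seatbelts)
seatbelts_ctree <- ctree(law ~ ., data = seatbelts)
seatbelts_ctree        # printed splits include each test's p-value
plot(seatbelts_ctree)  # inner nodes are labelled with p-values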

Poisson GLM with categorical data

I'm trying to fit a Poisson generalized linear model using counts of categorical data labeled as s and v. Since the data was collected in sessions of different durations (see session_dur_s), I want to include this information as a predictor by passing an offset to the glm model.
Here is my table:
label session counts session_dur_s
    s       1    587          6843
    s       2    203          2095
    s       3    187          1834
    s       4    122          1340
    s       5     40          1108
    s       6     64           476
    s       7     60           593
    v       1    147          6721
    v       2     57          2095
    v       3     58          1834
    v       4     22           986
    v       5      8          1108
    v       6     12           476
    v       7     11           593
My data:
label <- c("s","s","s","s","s","s","s","v","v","v","v","v","v","v")
session <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,7)
counts <- c(587,203,187,122,40,64,60,147,54,58,22,8,12,11)
session_dur_s <-c(6843,2095,1834,1340,1108,476,593,6721,2095,1834,986,1108,476,593)
sv_dur <- data.frame(label,session,counts,session_dur_s)
That's my code:
library(effects)  # provides allEffects()

sv_dur_mod <- glm(counts ~ label * session, data = sv_dur,
                  family = "poisson", offset = session_dur_s)
summary(sv_dur_mod)
plot(allEffects(sv_dur_mod), type = "response")
I can't execute the glm function because I receive the beautiful error:
Error: no valid set of coefficients has been found: please supply starting values
I'm not sure how to go about it. I would be really happy if someone could point out what I can do to work it out.
If there is a better model that I can use to predict the counts over time for both the s and v labels, I'm more than open to it.
Many thanks for comments and suggestions!
P.S. I'm running this in an R Markdown script using the packages tidyverse, effects and dplyr.
A Poisson GLM uses a log link by default. That is, it can be executed as:
sv_dur_mod <- glm(counts ~ label * session,
                  data = sv_dur,
                  family = poisson("log"))
Accordingly, a log offset is generally appropriate:
sv_dur_mod <- glm(counts ~ label * session,
                  data = sv_dur,
                  offset = log(session_dur_s),
                  family = poisson("log"))
Which executes as expected. See the answer here for more information on using a log offset: https://stats.stackexchange.com/a/237980/70372
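As to why the original call fails (my explanation, not from the linked answer): with a log link the offset is added on the log scale, so an unlogged duration like 6843 implies a fitted mean of exp(eta + 6843), which overflows, and the fitting loop cannot find valid starting coefficients.

# The offset enters the linear predictor, which the log link exponentiates:
exp(6843)       # Inf  -- raw seconds as an offset overflow the mean
exp(log(6843))  # 6843 -- a logged offset keeps the mean on the count scale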

Grouping in R changes mean substantially

I have a file containing the predictions of two models (A and B) on a binary classification problem. Now I'd like to understand how well they predict the observations that they are most confident about. To do that, I want to group their predictions into 10 groups based on how confident they are. Each of these groups should have an identical number of observations. However, when I do that, the accuracy of the models changes substantially! How can that be?
I've also tested with n_groups=100, but it only makes a minor difference. The CSV file is here and the code is below:
# Grouping observations
conf <- read.table(file="conf.csv", sep=',', header=T)
n_groups <- 10
conf$model_a_conf <- pmax(conf$model_a_pred_0, conf$model_a_pred_1)
conf$model_b_conf <- pmax(conf$model_b_pred_0, conf$model_b_pred_1)
conf$conf_group_model_a <- cut(conf$model_a_conf, n_groups, labels=FALSE, ordered_result=TRUE)
conf$conf_group_model_b <- cut(conf$model_b_conf, n_groups, labels=FALSE, ordered_result=TRUE)
# Test of original mean.
mean(conf$model_a_acc) # 0.78
mean(conf$model_b_acc) # 0.777
# Test for mean in aggregated data. They should be similar.
(acc_model_a <- mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean))) # 0.8491
(acc_model_b <- mean(tapply(conf$model_b_acc, conf$conf_group_model_b, FUN=mean))) # 0.7526
table(conf$conf_group_model_a)

   1    2    3    4    5    6    7    8    9   10
2515 2628 2471 2128 1792 1321  980  627  398  140
The groups you are using are unbalanced. Taking the mean of each of those groups with tapply is fine, but simply taking the mean of those group means afterwards is not the way to go: it gives every group equal weight regardless of its size. You need to weight the group means by their sizes.
Something quick and dirty like this weights each group's mean by its share of the observations and reproduces the overall mean:
sum(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean) *
    table(conf$conf_group_model_a) / nrow(conf))
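Separately, if the goal was groups with an identical number of observations, note that cut with a single number produces equal-width, not equal-count, intervals. A quantile-based split is one alternative (my sketch, not part of the original answer; dplyr::ntile is another option):

# Equal-count bins via quantile breaks; ties in the confidence values
# can still leave the counts slightly unequal.
brks <- quantile(conf$model_a_conf, probs = seq(0, 1, length.out = n_groups + 1))
conf$conf_group_model_a <- cut(conf$model_a_conf, breaks = brks,
                               labels = FALSE, include.lowest = TRUE)
mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN = mean))

With balanced groups, the unweighted mean of the group means comes back in line with the overall mean.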
