Output and plot p-values for rpart decision tree - r

I want to plot a decision tree in R with rpart and fancyRpartPlot. The code is working, but I want to show the p-value of each split. When I execute the tree (last line of the code), I get the stars behind the nodes which usually indicate statistical significance - I guess this is the case here too. However, I want to access the calculated p-values and include them in the plot. I would be very grateful, if anyone has an idea on how to do this. Thanks!
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
seatbelts <- Seatbelts
seatbelts <- as.data.frame(seatbelts)
unique(seatbelts$law)
seatbelts_tree <- rpart(law ~ ., data=seatbelts)
plot(seatbelts_tree, uniform = TRUE, margin = 0.5)
text(seatbelts_tree)
prp(seatbelts_tree)
fancyRpartPlot(seatbelts_tree, type=2)
seatbelts_tree

The output of the above code contains the answer: the * indicates a terminal node, which can be hard to spot in the text output depending on the format.
n= 192
node), split, n, deviance, yval
* denotes terminal node
1) root 192 20.244790 0.11979170
2) drivers>=1303 178 8.544944 0.05056180
4) front>=663 158 1.974684 0.01265823
8) kms< 18147.5 144 0.000000 0.00000000 *
9) kms>=18147.5 14 1.714286 0.14285710 *
5) front< 663 20 4.550000 0.35000000
10) PetrolPrice< 0.1134217 11 0.000000 0.00000000 *
11) PetrolPrice>=0.1134217 9 1.555556 0.77777780 *
3) drivers< 1303 14 0.000000 1.00000000 *
If you want p-values, you should look into conditional inference trees, i.e. the ctree() function from the party (or partykit) package. Here is a related question with a short explanation and further reading material.
https://stats.stackexchange.com/questions/255150/how-to-interpret-this-decision-tree/255156

The stars in the print method of rpart highlight terminal nodes, not p values. A decision tree is a descriptive method. It is not designed to test hypotheses.
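As a rough illustration of the conditional-inference-tree route, here is a minimal sketch using ctree() from partykit on the same Seatbelts data. It assumes partykit is installed (the code checks for it), and the object names seatbelts_ctree, inner and pvals are made up for this example. The p-values come from the permutation tests that ctree() uses to choose splits, so they belong to a different tree than the rpart one above.

```r
# Sketch: a conditional inference tree stores a test p-value for each split.
seatbelts <- as.data.frame(Seatbelts)

if (requireNamespace("partykit", quietly = TRUE)) {
  library(partykit)
  seatbelts_ctree <- ctree(law ~ ., data = seatbelts)

  # ids of the inner (non-terminal) nodes
  all_ids <- nodeids(seatbelts_ctree)
  inner <- setdiff(all_ids, nodeids(seatbelts_ctree, terminal = TRUE))

  # p-value of the selected split variable in each inner node
  pvals <- nodeapply(seatbelts_ctree, ids = inner,
                     FUN = function(node) info_node(node)$p.value)
  print(pvals)

  # the default plot prints the p-value inside each inner node
  plot(seatbelts_ctree)
}
```

If the fitted tree has no splits, inner is empty and no p-values are printed.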

Related

Extract variable labels from rpart decision tree

I've used rpart to build a decision tree on a dataset that has categorical variables with hundreds of levels. The tree splits these variables based on select values of the variable. I would like to examine the labels on which the split is made.
If I just run the decision tree result, the display listing the splits in the console gets truncated and either way, it is not in an easily-interpretable format (separated by commas). Is there a way to access this as an R object? I'm open to using another package to build the tree.
One issue here is that some of the functions in the rpart package are not exported. It appears you're looking to capture the output of the function rpart:::print.rpart. So, beginning with a reproducible example:
set.seed(1)
df1 <- data.frame(y=rbinom(n=100, size=1, prob=0.5),
x1=rbinom(n=100, size=1, prob=0.25),
x2=rbinom(n=100, size=1, prob=0.75))
(r1 <- rpart(y ~ ., data=df1))
giving
n= 100
node), split, n, deviance, yval
* denotes terminal node
1) root 100 24.960000 0.4800000
2) x1< 0.5 78 19.179490 0.4358974
4) x2>=0.5 66 15.954550 0.4090909 *
5) x2< 0.5 12 2.916667 0.5833333 *
3) x1>=0.5 22 5.090909 0.6363636
6) x2< 0.5 7 1.714286 0.4285714 *
7) x2>=0.5 15 2.933333 0.7333333 *
Now, looking at rpart:::print.rpart, we see a call to rpart:::labels.rpart, giving us the splits (or names of the 'rows' in the output above). The value of n, deviance, yval and more are stored in r1$frame, which can be seen by inspecting the output from unclass(r1).
Thus we could extract the above with
(df2 <- data.frame(split=rpart:::labels.rpart(r1), n=r1$frame$n))
giving
split n
1 root 100
2 x1< 0.5 78
3 x2>=0.5 66
4 x2< 0.5 12
5 x1>=0.5 22
6 x2< 0.5 7
7 x2>=0.5 15
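For factor predictors the printed split labels abbreviate the levels to single letters by default. The sketch below, on made-up data (the data frame d and its columns are hypothetical), uses the labels() generic, which dispatches to the unexported labels.rpart method, with minlength = 0 to request unabbreviated level names.

```r
library(rpart)

# Hypothetical data: one factor predictor with four levels
set.seed(1)
d <- data.frame(g = factor(sample(c("north", "south", "east", "west"),
                                  200, replace = TRUE)))
d$y <- ifelse(d$g %in% c("north", "east"), 1, 0) + rnorm(200, sd = 0.1)

fit <- rpart(y ~ g, data = d)

# One row per node: node number, full split label, node size
splits <- data.frame(node  = rownames(fit$frame),
                     split = labels(fit, minlength = 0),
                     n     = fit$frame$n)
splits
```

Because the result is an ordinary data frame, the split column can be searched or split on commas like any other character data.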

Regression tree with simulated data - rpart package

I have simulated some data to create a regression tree with 3 terminal nodes:
set.seed(1988)
n=1000
X1<-rnorm(n,mean=0,sd=2)
X2<-rnorm(n,mean=0,sd=2)
e<-rnorm(n)
Y=5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
So I first want to split on X1<1, and for X1<1 I want to split on X2<0.2. The values of Y in the leaves are the coefficients of the indicators.
If I run the procedure implemented in the rpart package, everything is fine in the case above.
mytree<-rpart(Y~.,data=mydat)
mytree
Output:
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 1627.0670 4.043696
2) X1>=0.9490461 326 373.8485 3.124825 *
3) X1< 0.9490461 674 844.8367 4.488135
6) X2>=0.2488142 327 312.7506 3.970742 *
7) X2< 0.2488142 347 362.0582 4.975708 *
It also works if I make all the coefficients negative.
But when I generate some negative and some positive values in the final terms (that is, in the "interaction" of the tree, where the split happens at the second level), rpart changes the order of the splits and the values in the leaves are not correct:
Y=-5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
mytree<-rpart(Y~.,data=mydat)
mytree
Output:
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 17811.4000 0.6136962
2) X2< 0.1974489 515 8116.5350 -2.3192910
4) X1< 1.002815 343 359.7394 -5.0305350 *
5) X1>=1.002815 172 207.4313 3.0874360 *
3) X2>=0.1974489 485 560.3419 3.7281050 *
Does anyone have an idea about this problem?
Thanks
You need to tune the complexity parameter cp. See the code below.
# Data Generating Process
set.seed(1988)
n=1000
X1<-rnorm(n,mean=0,sd=2)
X2<-rnorm(n,mean=0,sd=2)
e<-rnorm(n)
Y=-5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
library(rpart)
mytree<-rpart(Y~.,data=mydat, cp=0.0001)
# Plot the cross-validation error vs the complexity parameter
plotcp(mytree)
# Find the optimal value of the complexity parameter cp
optcp <- mytree$cptable[which.min(mytree$cptable[,4]),1]
# Prune the tree using the optimal complexity parameter
mytree <- prune(mytree,optcp)
The pruned tree correctly represents the underlying data-generating process:
library(rattle)
fancyRpartPlot(mytree)
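To see the effect of cp without plotting, the following sketch refits the same simulated data with the default cp = 0.01 and with cp = 0.0001 and counts the leaves of each tree. The helper n_leaves is made up for this example. The default setting presumably drops the X1 split in the X2 >= 0.2 branch because that split improves the fit by less than 1% of the root deviance.

```r
library(rpart)

# Same data-generating process as above
set.seed(1988)
n  <- 1000
X1 <- rnorm(n, mean = 0, sd = 2)
X2 <- rnorm(n, mean = 0, sd = 2)
e  <- rnorm(n)
Y  <- -5 * (X1 < 1) * (X2 < 0.2) + 4 * (X1 < 1) * (X2 >= 0.2) + 3 * (X1 >= 1) + e
mydat <- data.frame(Y, X1, X2)

default_tree <- rpart(Y ~ ., data = mydat)               # cp = 0.01
deep_tree    <- rpart(Y ~ ., data = mydat, cp = 0.0001)  # grown much deeper

# Hypothetical helper: number of terminal nodes in an rpart fit
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

c(default = n_leaves(default_tree), deep = n_leaves(deep_tree))

# The cp table lists cross-validated error (xerror) per candidate tree size
printcp(deep_tree)
```

The deep tree is deliberately overgrown, which is why the answer prunes it back at the cp value with the smallest xerror.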

rpart package median or geometric mean instead of mean

Is it possible to change the average estimator in a region by something different from the mean, like median or geometric mean using the rpart library in R? (or another library)
I believe my tree partitioning is highly affected by extreme values and I would like to build trees showing other estimators.
Thanks!
One of the usual tricks for right-skewed responses would be to take logs. In many applications this makes the response distribution more symmetric and then you don't need to switch from the usual mean predictions.
Another solution for changing the learning of the tree would be to use some more robust scores, e.g., ranks etc. The ctree() function from the partykit package offers a nonparametric inference framework for this.
Finally, the partykit package also allows you to compute predictions other than the means from all the terminal nodes. You can easily transform rpart trees to party trees via as.party(). A very simple example would be to learn an rpart tree for the cars data
library("rpart")
data("cars", package = "datasets")
rp <- rpart(dist ~ speed, data = cars)
And then transform it to party:
library("partykit")
pr <- as.party(rp)
The tree structure remains unchanged but you get enhanced plotting and predictions. The default plot methods yield:
Furthermore, the default predictions on both objects are the same.
nd <- data.frame(speed = c(10, 15, 20))
predict(rp, nd)
## 1 2 3
## 18.20000 39.75000 65.26316
predict(pr, nd)
## 1 2 3
## 18.20000 39.75000 65.26316
However, the latter allows you to specify a FUNction that should be used in each of the nodes. This must be of the form function(y, w) where y is the response and w are the case weights. As we haven't used any weights here, we can simply ignore that argument and do:
predict(pr, nd, FUN = function(y, w) mean(y))
## 1 2 3
## 18.20000 39.75000 65.26316
predict(pr, nd, FUN = function(y, w) median(y))
## 1 2 3
## 18 35 64
predict(pr, nd, FUN = function(y, w) quantile(y, 0.9))
## 1 2 3
## 28.0 57.0 92.2
And so on... See the package vignettes for more details.
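If only per-leaf summaries of the training data are needed, a similar result can be sketched with rpart alone. This is not the partykit approach above: it relies on the where component of the fitted object, which records the terminal node (as a row of rp$frame) for each training observation. The names by_leaf and leaf_summary are made up for this example, and the geometric mean assumes a strictly positive response.

```r
library(rpart)

rp <- rpart(dist ~ speed, data = cars)

# Group the training responses by the leaf they fall into
by_leaf <- split(cars$dist, rp$where)

leaf_summary <- data.frame(
  frame_row = as.integer(names(by_leaf)),           # row of rp$frame
  mean      = sapply(by_leaf, mean),                # what rpart predicts
  median    = sapply(by_leaf, median),
  geo_mean  = sapply(by_leaf, function(y) exp(mean(log(y))))
)
leaf_summary
```

Note that this only changes the summary reported in each leaf. The splits themselves are still chosen by the usual sum-of-squares criterion.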

not creating tree by rpart in R

I'm new to R and rpart package. I want to create a tree using the following sample data.
My data set is similar to this
mydata =
"","A","B","C","status"
"1",TRUE,TRUE,TRUE,"okay"
"2",TRUE,TRUE,FALSE,"okay"
"3",TRUE,FALSE,TRUE,"okay"
"4",TRUE,FALSE,FALSE,"notokay"
"5",FALSE,TRUE,TRUE,"notokay"
"6",FALSE,TRUE,FALSE,"notokay"
"7",FALSE,FALSE,TRUE,"okay"
"8",FALSE,FALSE,FALSE,"okay"
fit <- rpart(status ~ A + B + C, data = mydata, method = "class")
or
I tried different formulas and different methods, but only the root node is ever produced, so no plot is possible.
It shows:
fit
n= 8
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8 3 okay (0.3750000 0.6250000) *
How can I create the tree? I need to show the percentage of "okay" and "notokay" in each node, and I need to specify one of A, B or C for splitting and show the statistics.
With the default settings of rpart() no splits are considered at all. The minsplit parameter is 20 by default (see ?rpart.control) which is "the minimum number of observations that must exist in a node in order for a split to be attempted." So for your 8 observations no splitting is even considered.
If you are determined to consider splitting, then you could decrease the minbucket and/or minsplit parameters. For example
fit <- rpart(status ~ A + B + C, data = mydata,
control = rpart.control(minsplit = 3))
produces the following tree:
The display is created by
plot(partykit::as.party(fit), tp_args = list(beside = TRUE))
and the print output from rpart is:
n= 8
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8 3 okay (0.3750000 0.6250000)
2) A=FALSE 4 2 notokay (0.5000000 0.5000000)
4) B=TRUE 2 0 notokay (1.0000000 0.0000000) *
5) B=FALSE 2 0 okay (0.0000000 1.0000000) *
3) A=TRUE 4 1 okay (0.2500000 0.7500000) *
Whether or not this is particularly useful is a different question though...
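For the percentages asked about in the question, here is a minimal self-contained sketch. It rebuilds the eight rows shown in the question as a data frame (with A, B and C kept as logical columns, which is an assumption about how the data were read in) and tabulates the class shares in each terminal node via fit$where.

```r
library(rpart)

# The eight rows from the question
mydata <- data.frame(
  A = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  B = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  C = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  status = factor(c("okay", "okay", "okay", "notokay",
                    "notokay", "notokay", "okay", "okay"))
)

fit <- rpart(status ~ A + B + C, data = mydata, method = "class",
             control = rpart.control(minsplit = 3))

# Counts and percentages of each class per terminal node
leaf_counts <- table(leaf = fit$where, status = mydata$status)
leaf_pct <- 100 * prop.table(leaf_counts, margin = 1)
round(leaf_pct, 1)
```

The row labels of the table are row positions in fit$frame, so they can be matched back to the printed tree.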

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate samples for a given geometric distribution with PMF
Using the inverse transform method, I came up with the following expression for generating the values:
Where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
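The formulas in the question did not survive here. Going by the sim.geometric code quoted further down this thread, the generator is presumably X = ceiling(log(U) / log(1 - p)), which gives a geometric variable on 1, 2, 3, ... The sketch below restates it (the function name sim_geometric and the argument p are chosen for this example) and compares the empirical frequencies with dgeom(), shifted by one because R's geometric functions start at 0.

```r
# Inverse-transform sketch for a geometric variable on 1, 2, 3, ...
# Since P(X > x) = (1 - p)^x, X is the smallest integer x with (1 - p)^x <= U.
sim_geometric <- function(nvals, p = 0.3) {
  u <- runif(nvals)
  ceiling(log(u) / log(1 - p))
}

set.seed(123)
x <- sim_geometric(1e5)

# Empirical vs theoretical probabilities for x = 1..8
emp  <- as.vector(table(factor(x, levels = 1:8))) / length(x)
theo <- dgeom(0:7, prob = 0.3)   # dgeom counts failures, so shift by 1
round(cbind(x = 1:8, empirical = emp, theoretical = theo), 4)
```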
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it goes from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
The case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly. You have to do it the first way: estimate the parameter from the data (by maximum likelihood or minimum chi-square), then test as above, with one fewer degree of freedom because a parameter was estimated.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows much the same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
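A minimal sketch of the estimated-parameter case, under stated assumptions: the sample y is simulated here, it is already on the 0, 1, 2, ... scale, and p is estimated by maximum likelihood from the ungrouped data as 1 / (1 + mean(y)). Strictly, the chi-square reference distribution with the reduced degrees of freedom applies when the estimate is based on the grouped counts, so treat the p-value as approximate. The variable names are made up for this example.

```r
set.seed(42)
y <- rgeom(1000, 0.3)            # stand-in sample, values start at 0

# Maximum-likelihood estimate of p for the geometric on 0, 1, 2, ...
p_hat <- 1 / (1 + mean(y))

# Cells 0..14 plus one pooled "15+" cell
k   <- 14
obs <- tabulate(pmin(y, k + 1) + 1, nbins = k + 2)

# Expected counts under the fitted geometric distribution
probs <- c(dgeom(0:k, p_hat), pgeom(k, p_hat, lower.tail = FALSE))
expec <- length(y) * probs

chisqstat <- sum((obs - expec)^2 / expec)
df   <- length(obs) - 1 - 1      # cells - 1, minus 1 estimated parameter
pval <- pchisq(chisqstat, df, lower.tail = FALSE)
c(statistic = chisqstat, df = df, p.value = pval)
```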
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
library(vcd)
# A geometric distribution is a negative binomial with size = 1
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of question with upvoted answers so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which i want to test its distribution, specifically if it indeed follows a geometric distribution. I want to generate a QQ PLot but have no idea how to.
--------reposted answer----------
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the functions which essentially compares their inverse ECDF's at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
main = expression("Q-Q plot for" ~~ {G}[n == 500]), ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
probs = c(0.25, 0.75), col = "red")
