I created a decision tree in R using the "tree" package; however, when I look at the details of the model, I struggle to interpret the results.
The output of the model looks like this:
> model
node), split, n, deviance, yval
* denotes terminal node
1) root 23 16270.0 32.350
2) Y1 < 31 8 4345.0 59.880 *
3) Y1 > 31 15 2625.0 17.670
6) Y2 < 11.5 8 1310.0 26.000 *
7) Y2 > 11.5 7 124.9 8.143 *
I don't understand the numbers shown in each line after the features. What are 16270.0 and 32.350? Or what are 2625.0 and 17.670? And why do some of the lines have asterisks?
Any help is appreciated.
Thank you
The rules that you got are just the printed form of the fitted tree.
Each row in the output has five columns. Let's look at the one you asked about:
Y1 > 31 15 2625.0 17.670
Y1 > 31 is the splitting rule being applied to the parent node
15 is the number of points that would be at this node of the tree
2625.0 is the deviance at this node (used to decide how the split was made)
17.670 is what you would predict for points at this node if you split no further.
The asterisks indicate leaf nodes - ones that are not split any further.
So in the node described above, Y1 > 31, you could stop at that node and
predict 17.670 for all 15 points, but the full tree would split this into
two nodes: one with 8 points for Y2 < 11.5 and another with 7 points for
Y2 > 11.5. If you make this further split, you would predict 26.0 for the 8 points
with Y2 < 11.5 (and Y1 > 31) and predict 8.143 for the 7 points with Y2 > 11.5
(and Y1 > 31).
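To make those numbers concrete, here is a minimal sketch on made-up data (the original data isn't shown, so the values below are assumptions): yval is just the mean of the response for the observations that fall into a node, and deviance is the sum of squared deviations from that mean.
library(tree)
# Made-up data with the same predictor names as in the question
set.seed(42)
d <- data.frame(Y1 = runif(23, 10, 60), Y2 = runif(23, 5, 20))
d$Y <- 80 - d$Y1 + rnorm(23, sd = 5)
model <- tree(Y ~ Y1 + Y2, data = d)
model   # same node), split, n, deviance, yval layout as in the question
# For any node, e.g. the observations with Y1 > 31:
idx <- d$Y1 > 31
mean(d$Y[idx])                        # yval: the node's predicted value
sum((d$Y[idx] - mean(d$Y[idx]))^2)    # deviance: residual sum of squares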
Related
I am working with R.
Suppose you have the following data:
#generate data
set.seed(123)
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,10)
c1 = rnorm(1000,5,1)
train_data = data.frame(a1,b1,c1)
#view data
head(train_data)
a1 b1 c1
1 94.39524 90.04201 4.488396
2 97.69823 89.60045 5.236938
3 115.58708 99.82020 4.458411
4 100.70508 98.67825 6.219228
5 101.29288 74.50657 5.174136
6 117.15065 110.40573 4.384732
We can visualize the data as follows:
#visualize data
par(mfrow=c(2,2))
plot(train_data$a1, train_data$b1, col = train_data$c1, main = "plot of a1 vs b1, points colored by c1")
hist(train_data$a1)
hist(train_data$b1)
hist(train_data$c1)
Here is the problem:
From the data, take only the variables "a1" and "b1": using only 2 "logical conditions", split this data into 3 regions (e.g. Region 1 WHERE 0 < a1 < 20 AND 0 < b1 < 25)
In each region, you want the "average value of c1" within that region to be as small as possible - but each region must have at least some minimum number of data points, e.g. 100 data points (to prevent trivial solutions)
Goal: Is it possible to determine the "boundaries" of these 3 regions that minimize:
the mean value of "c1" for region 1
the mean value of "c1" for region 2
the mean value of "c1" for region 3
the average "mean value of c1 for all 3 regions" (i.e. c_avg = (region1_c1_avg + region2_c1_avg + region3_c1_avg) / 3)
In the end, for a given combination, you would find the following, e.g. (made up numbers):
Region 1 : WHERE 0 < a1 < 20 AND 0 < b1 < 25 ; region1_c1_avg = 4
Region 2 : WHERE 20 < a1 < 50 AND 25 < b1 < 60 ; region2_c1_avg = 2.9
Region 3 : WHERE a1 > 50 AND b1 > 60 ; region3_c1_avg = 1.9
c_avg = (4 + 2.9 + 1.9) / 3 = 2.93
And hope that (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg) are minimized
My Question:
Does this kind of problem have an "exact solution"? The only thing I can think of is performing a "random search" that considers many different definitions of (Region 1, Region 2 and Region 3) and compares the corresponding values of (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg), until a minimum value is found. Is this an application of linear programming or multi-objective optimization (e.g. genetic algorithm)? Has anyone worked on something like this before?
I have done a lot of research and haven't found a similar problem. I decided to formulate this problem as a "multi-objective constrained optimization problem" and figured out how to implement algorithms like "random search" and "genetic algorithm".
Thanks
Note 1: In the context of multi-objective optimization, for a given set of definitions of (Region1, Region2 and Region3): to collectively compare whether a set of values for (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg) are satisfactory, the concept of "Pareto Optimality" (https://en.wikipedia.org/wiki/Multi-objective_optimization#Visualization_of_the_Pareto_front) is often used to make comparisons between different sets of {(Region1, Region2 and Region3) and (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg)}
Note 2: Ultimately, these 3 regions can be defined by any set of 4 numbers. If each of these 4 numbers can be between 0 and 100 in 0.1 increments (e.g. 12, 12.1, 12.2, 12.3, etc.), this means there exist roughly 1000^4 = 1e12 possible solutions (about a trillion) to compare. There are simply far too many solutions to individually verify and compare. I am thinking that a mathematically based search/optimization approach could be used to strategically search for an optimal solution.
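Not an exact solution, but a minimal random-search sketch under one simplified reading of the setup (a single cut on a1 and a single cut on b1 define three tree-style regions, each required to contain at least 100 points; only the data-generating code comes from the question, everything else is an illustrative assumption):
#generate data as in the question
set.seed(123)
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,10)
c1 = rnorm(1000,5,1)
train_data = data.frame(a1,b1,c1)
#random search over two cut points defining three regions
best <- NULL
for (i in 1:10000) {
  a_cut <- runif(1, min(a1), max(a1))
  b_cut <- runif(1, min(b1), max(b1))
  r1 <- a1 < a_cut
  r2 <- a1 >= a_cut & b1 < b_cut
  r3 <- a1 >= a_cut & b1 >= b_cut
  if (min(sum(r1), sum(r2), sum(r3)) < 100) next   # minimum-size constraint
  region_means <- c(mean(c1[r1]), mean(c1[r2]), mean(c1[r3]))
  c_avg <- mean(region_means)                      # the c_avg objective
  if (is.null(best) || c_avg < best$c_avg) {
    best <- list(c_avg = c_avg, a_cut = a_cut, b_cut = b_cut,
                 region_means = region_means)
  }
}
best
A genetic algorithm would search the same space more strategically, but the objective and the minimum-size constraint would be handled the same way as above.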
I want to plot a decision tree in R with rpart and fancyRpartPlot. The code is working, but I want to show the p-value of each split. When I print the tree (last line of the code), I get stars behind some nodes, which usually indicate statistical significance - I guess this is the case here too. However, I want to access the calculated p-values and include them in the plot. I would be very grateful if anyone has an idea of how to do this. Thanks!
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
seatbelts <- Seatbelts
seatbelts <- as.data.frame(seatbelts)
unique(seatbelts$law)
seatbelts_tree <- rpart(law ~ ., data=seatbelts)
plot(seatbelts_tree, uniform = TRUE, margin = 0.5)
text(seatbelts_tree)
prp(seatbelts_tree)
fancyRpartPlot(seatbelts_tree, type=2)
seatbelts_tree
The output of the above code already contains the answer: the * indicates a terminal node, which can be harder to spot in the text output depending on the format.
n= 192
node), split, n, deviance, yval
* denotes terminal node
1) root 192 20.244790 0.11979170
2) drivers>=1303 178 8.544944 0.05056180
4) front>=663 158 1.974684 0.01265823
8) kms< 18147.5 144 0.000000 0.00000000 *
9) kms>=18147.5 14 1.714286 0.14285710 *
5) front< 663 20 4.550000 0.35000000
10) PetrolPrice< 0.1134217 11 0.000000 0.00000000 *
11) PetrolPrice>=0.1134217 9 1.555556 0.77777780 *
3) drivers< 1303 14 0.000000 1.00000000 *
If you want p-values you should look into conditional inference trees, e.g. ctree() from the partykit package. Here is a related question with a short explanation and further reading material.
https://stats.stackexchange.com/questions/255150/how-to-interpret-this-decision-tree/255156
The stars in the print method of rpart mark terminal nodes, not p-values. A decision tree is a descriptive method; it is not designed to test hypotheses.
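For completeness, a small sketch of what the conditional-inference route could look like on the same Seatbelts data (note this fits a different kind of tree than rpart, so treat it as an illustration rather than a way to get p-values for the existing rpart model):
library(partykit)
seatbelts <- as.data.frame(Seatbelts)
# ctree() selects each split with a permutation test and keeps the
# test's p-value at the corresponding inner node
ct <- ctree(law ~ ., data = seatbelts)
plot(ct)   # inner nodes are labelled with the split variable and its p-value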
I've used rpart to build a decision tree on a dataset that has categorical variables with hundreds of levels. The tree splits these variables based on select values of the variable. I would like to examine the labels on which the split is made.
If I just print the decision tree result, the display listing the splits in the console gets truncated, and either way it is not in an easily interpretable format (the levels are separated by commas). Is there a way to access this as an R object? I'm open to using another package to build the tree.
One issue here is that some of the functions in the rpart package are not exported. It appears you're looking to capture the output of the function rpart:::print.rpart. So, beginning with a reproducible example:
library(rpart)
set.seed(1)
df1 <- data.frame(y=rbinom(n=100, size=1, prob=0.5),
x1=rbinom(n=100, size=1, prob=0.25),
x2=rbinom(n=100, size=1, prob=0.75))
(r1 <- rpart(y ~ ., data=df1))
giving
n= 100
node), split, n, deviance, yval
* denotes terminal node
1) root 100 24.960000 0.4800000
2) x1< 0.5 78 19.179490 0.4358974
4) x2>=0.5 66 15.954550 0.4090909 *
5) x2< 0.5 12 2.916667 0.5833333 *
3) x1>=0.5 22 5.090909 0.6363636
6) x2< 0.5 7 1.714286 0.4285714 *
7) x2>=0.5 15 2.933333 0.7333333 *
Now, looking at rpart:::print.rpart, we see a call to rpart:::labels.rpart, giving us the splits (or names of the 'rows' in the output above). The values of n, deviance, yval and more are stored in r1$frame, which can be seen by inspecting the output from unclass(r1).
Thus we could extract the above with
(df2 <- data.frame(split=rpart:::labels.rpart(r1), n=r1$frame$n))
giving
split n
1 root 100
2 x1< 0.5 78
3 x2>=0.5 66
4 x2< 0.5 12
5 x1>=0.5 22
6 x2< 0.5 7
7 x2>=0.5 15
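Two related extraction routes that may also help (these go beyond the original answer, so treat them as suggestions): labels() dispatches to the same rpart method without needing the ::: access, and rpart.plot::rpart.rules() prints one untruncated rule per leaf, which is convenient when factor splits list many levels.
# Same split labels via the S3 generic, no ::: required
data.frame(split = labels(r1), n = r1$frame$n, yval = r1$frame$yval)
# One rule per terminal node, with the split conditions written out in full
library(rpart.plot)
rpart.rules(r1)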
I'm very new to machine learning so I apologize if the answer to this is very obvious.
I'm using a decision tree, built with the rpart package, to attempt to predict when a structure fire may result in a fatality, using a variety of variables related to the fire such as the cause, the extent of damage, etc.
The chance of a fatality resulting from structure fire is about 1 in 100.
In short, I have about 154,000 observations in my training set. I have noticed that when I use the full training set, the complexity parameter cp has to be reduced all the way down to .0003.
> rpart(Fatality~.,data=train_val,method="class", control=rpart.control(minsplit=50,minbucket = 1, cp=0.00035))
n= 154181
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 154181 1881 0 (0.987800053 0.012199947)
2) losscat=Minor_Loss,Med_Loss 105538 567 0 (0.994627528 0.005372472) *
3) losscat=Major_Loss,Total_Loss 48643 1314 0 (0.972986863 0.027013137)
6) HUM_FAC_1=3,6,N, 46102 1070 0 (0.976790595 0.023209405) *
7) HUM_FAC_1=1,2,4,5,7 2541 244 0 (0.903974813 0.096025187)
14) AREA_ORIG=21,24,26,47,72,74,75,76,Other 1846 126 0 (0.931744312 0.068255688)
28) CAUSE_CODE=1,2,5,6,7,8,9,10,12,14,15 1105 45 0 (0.959276018 0.040723982) *
29) CAUSE_CODE=3,4,11,13,16 741 81 0 (0.890688259 0.109311741)
58) FIRST_IGN=10,12,15,17,18,Other,UU 690 68 0 (0.901449275 0.098550725) *
59) FIRST_IGN=00,21,76,81 51 13 0 (0.745098039 0.254901961)
118) INC_TYPE=111,121 48 10 0 (0.791666667 0.208333333) *
119) INC_TYPE=112,120 3 0 1 (0.000000000 1.000000000) *
15) AREA_ORIG=14,UU 695 118 0 (0.830215827 0.169784173)
30) CAUSE_CODE=1,2,4,7,8,10,11,12,13,14,15,16 607 86 0 (0.858319605 0.141680395) *
31) CAUSE_CODE=3,5,6,9 88 32 0 (0.636363636 0.363636364)
62) HUM_FAC_1=1,2 77 24 0 (0.688311688 0.311688312) *
63) HUM_FAC_1=4,5,7 11 3 1 (0.272727273 0.727272727) *
However, when I just grab the first 10,000 observations (there is no meaningful order), I can run with a cp of .01:
> rpart(Fatality~., data = test, method = "class",
+ control=rpart.control(minsplit=10,minbucket = 1, cp=0.01))
n= 10000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 10000 112 0 (0.988800000 0.011200000)
2) losscat=Minor_Loss,Med_Loss 6889 26 0 (0.996225867 0.003774133) *
3) losscat=Major_Loss,Total_Loss 3111 86 0 (0.972356156 0.027643844)
6) HUM_FAC_1=3,7,N 2860 66 0 (0.976923077 0.023076923) *
7) HUM_FAC_1=1,2,4,5,6 251 20 0 (0.920318725 0.079681275)
14) CAUSE_CODE=1,3,4,6,7,8,9,10,11,14,15 146 3 0 (0.979452055 0.020547945) *
15) CAUSE_CODE=5,13,16 105 17 0 (0.838095238 0.161904762)
30) weekday=Friday,Monday,Saturday,Tuesday,Wednesday 73 6 0 (0.917808219 0.082191781) *
31) weekday=Sunday,Thursday 32 11 0 (0.656250000 0.343750000)
62) AREA_ORIG=21,26,47,Other 17 2 0 (0.882352941 0.117647059) *
63) AREA_ORIG=14,24,UU 15 6 1 (0.400000000 0.600000000)
126) month=2,6,7,9 7 1 0 (0.857142857 0.142857143) *
127) month=1,4,10,12 8 0 1 (0.000000000 1.000000000) *
Why is it that a greater number of observations results in me having to reduce the complexity parameter? Intuitively I would think it should be the opposite.
Is having to reduce cp to .003 "bad"?
Generally, is there any other advice for improving the effectiveness of a decision tree, especially when predicting something that has such low probability in the first place?
cp, from what I read, is a parameter used to decide when to stop adding leaves to the tree: for a node to be considered for another split, the improvement in the relative error from allowing the new split must be more than that cp threshold. Thus, the lower the number, the more leaves the tree can add. More observations mean there is an opportunity to lower the threshold; I'm not sure I understand that you "have to" reduce cp, but I could be wrong. If this is a very rare event and your data doesn't lend itself to showing significant improvement in the early stages of the model, it may require that you "increase the sensitivity" by lowering the cp... but you probably know your data better than I do.
If you're modeling a rare event, no. If it's not a rare event, then the lower your cp, the more likely you are to overfit to the bias of your sample. I don't think minbucket = 1 ever leads to an interpretable model either, for similar reasons.
Decision Trees, to me, don't make very much sense beyond 3-4 levels unless you really believe that these hard cuts truly create criteria that justify a final "bucket"/node or a prediction (e.g. if I wanted to bucket you into something financial like a loan or insurance product that fits your risk profile, and my actuaries made hard cuts to split the prospects). After you've split your data 3-4 times, producing a minimum of 8-16 nodes at the bottom of your tree, you've essentially built a model that could be thought of as 3rd or 4th order interactions of independent categorical variables. If you put 20 statisticians (not econo-missed's) in a room and ask them about the number of times they've seen significant 3rd or 4th order interactions in a model, they'd probably scratch their heads. Have you tried any other methods? Or started with dimension reduction? More importantly, what inferences are you trying to make about the data?
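On the practical side of choosing cp, a common workflow is to grow the tree with a deliberately small cp, inspect the cross-validated error in the cp table, and prune back to the value that minimises it. A minimal sketch on stand-in data (the fire data isn't available here, so the data-generating code is purely illustrative):
library(rpart)
# Stand-in data: a rare binary outcome, roughly 1-in-100 positives
set.seed(1)
n <- 20000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                 x3 = factor(sample(letters[1:5], n, replace = TRUE)))
df$y <- rbinom(n, 1, plogis(-5 + 1.5 * df$x1))
# Grow deliberately deep, then look at the cross-validated error (xerror)
fit <- rpart(y ~ ., data = df, method = "class",
             control = rpart.control(cp = 1e-4, minsplit = 50))
printcp(fit)
plotcp(fit)
# Prune back to the cp with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
For a 1-in-100 outcome it is also worth looking at a loss matrix (the parms argument of rpart) or case weights, so that missing a fatality is penalised more heavily than a false alarm.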
I'm new to R and the rpart package. I want to create a tree using the following sample data.
My data set is similar to this:
mydata =
"","A","B","C","status"
"1",TRUE,TRUE,TRUE,"okay"
"2",TRUE,TRUE,FALSE,"okay"
"3",TRUE,FALSE,TRUE,"okay"
"4",TRUE,FALSE,FALSE,"notokay"
"5",FALSE,TRUE,TRUE,"notokay"
"6",FALSE,TRUE,FALSE,"notokay"
"7",FALSE,FALSE,TRUE,"okay"
"8",FALSE,FALSE,FALSE,"okay"
fit <- rpart(status ~ A + B + C, data = mydata, method = "class")
I tried different formulas and different methods, but always only the root node is produced, and no plot is possible. It shows:
fit
n= 8
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8 3 okay (0.3750000 0.6250000) *
How do I create the tree? I need to show the percentage of "okay" and "notokay" on each node, and I need to specify one of A, B, or C for splitting and show the statistics.
With the default settings of rpart() no splits are considered at all. The minsplit parameter is 20 by default (see ?rpart.control) which is "the minimum number of observations that must exist in a node in order for a split to be attempted." So for your 8 observations no splitting is even considered.
If you are determined to consider splitting, then you could decrease the minbucket and/or minsplit parameters. For example
fit <- rpart(status ~ A + B + C, data = mydata,
control = rpart.control(minsplit = 3))
produces a tree that splits on A and then B. A plot of it can be created by
plot(partykit::as.party(fit), tp_args = list(beside = TRUE))
and the print output from rpart is:
n= 8
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8 3 okay (0.3750000 0.6250000)
2) A=FALSE 4 2 notokay (0.5000000 0.5000000)
4) B=TRUE 2 0 notokay (1.0000000 0.0000000) *
5) B=FALSE 2 0 okay (0.0000000 1.0000000) *
3) A=TRUE 4 1 okay (0.2500000 0.7500000) *
Whether or not this is particularly useful is a different question though...
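If you also want the percentages of "okay" and "notokay" shown on each node, as asked, one option (an addition beyond the answer above) is rpart.plot, whose extra codes control what is printed in the node boxes:
library(rpart.plot)
# extra = 104: predicted class, class probabilities, and the percentage
# of observations falling into each node
rpart.plot(fit, type = 2, extra = 104)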