How to get classification values in RWeka? - r

Can anybody explain how I get the results for each leaf in a decision tree built by J48 from the RWeka package?
So for example we have this iris dataset in R:
library(RWeka)
m1 <- J48(Species ~ ., data = iris)
m1
For prediction I want to use the class proportions in a leaf. I tried the partykit package, but it still looks too complicated just to get the proportion in each leaf.
library(partykit)
pres <- as.party(m1)
partykit:::.list.rules.party(pres)
At least I get the number of leaves in the list, but can't find the probability.
pres
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Width <= 0.6: setosa (n = 50, err = 0.0%)
| [3] Petal.Width > 0.6
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.9: versicolor (n = 48, err = 2.1%)
| | | [6] Petal.Length > 4.9
| | | | [7] Petal.Width <= 1.5: virginica (n = 3, err = 0.0%)
| | | | [8] Petal.Width > 1.5: versicolor (n = 3, err = 33.3%)
| | [9] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 4
Number of terminal nodes: 5
So as a prediction I want, for example, for a new data point where Petal.Width > 0.6, Petal.Width <= 1.7 and Petal.Length <= 4.9, the result versicolor 97.9% and 2.1% other. How can I get these predictions?

What you describe is a region of the feature space, not a single point. If you fully specify a point, you can simply pass it to the predict function. For example, I will generate a point that meets your conditions but is unlike other iris points, then classify it.
## Generate wild new point
NewPoint = iris[1,]
NewPoint[1,3:4] = c(2.0,1.7)
NewPoint
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 2 1.7 setosa
## Look at where the new point is
plot(iris[,3:4], pch=20, col=rainbow(3, alpha=0.3)[iris$Species])
points(NewPoint[,3:4], pch=16, col="orange")
## Get the probability from the model
predict(m1, newdata = NewPoint, type = "probability")
setosa versicolor virginica
1 0 0.9791667 0.02083333
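If the goal is the class proportions of every leaf rather than the prediction for one point, one possible approach is to ask the converted party object which terminal node each training row falls into and then tabulate the classes per node. This is only a sketch: it assumes the m1 model from the question and that RWeka (with Java) and partykit are installed. The names pres, leaf and leaf_props are made up for illustration.

```r
library(RWeka)
library(partykit)

m1   <- J48(Species ~ ., data = iris)
pres <- as.party(m1)

# Terminal node id for every training observation
leaf <- predict(pres, newdata = iris, type = "node")

# Row-wise class proportions: one row per leaf, one column per species
leaf_props <- prop.table(table(leaf = leaf, Species = iris$Species), margin = 1)
round(leaf_props, 3)
```

The row for node 5 should agree with the probabilities returned by predict(m1, newdata = NewPoint, type = "probability") above, since NewPoint falls into that leaf.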

Related

print decision tree in text nicely / with custom control [r]

I'd like to print a decision tree in text nicely. For example, I can print the tree object itself:
library(rpart)
f = as.formula('Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species')
fit = rpart(f, data = iris, control = rpart.control(xval = 3))
fit
yields
n= 150
node), split, n, deviance, yval
* denotes terminal node
1) root 150 102.1683000 5.843333
2) Petal.Length< 4.25 73 13.1391800 5.179452
4) Petal.Length< 3.4 53 6.1083020 5.005660
8) Sepal.Width< 3.25 20 1.0855000 4.735000 *
9) Sepal.Width>=3.25 33 2.6696970 5.169697 *
... # omitted
partykit prints it neater:
library(partykit)
as.party(fit)
yields
Model formula:
Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
Fitted party:
[1] root
| [2] Petal.Length < 4.25
| | [3] Petal.Length < 3.4
| | | [4] Sepal.Width < 3.25: 4.735 (n = 20, err = 1.1)
| | | [5] Sepal.Width >= 3.25: 5.170 (n = 33, err = 2.7)
| | [6] Petal.Length >= 3.4: 5.640 (n = 20, err = 1.2)
...# omitted
Number of inner nodes: 6
Number of terminal nodes: 7
Is there a way to have more control? E.g., I don't want to print n and err, or I want the standard deviation printed instead of err.
Not a very elegant answer, but if you just want to get rid of n= and err= you can capture the output and edit it.
CO = capture.output(print(as.party(fit)))
CO2 = sub("\\(.*\\)", "", CO)
cat(paste(CO2, collapse="\n"))
Model formula:
Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
Fitted party:
[1] root
| [2] Petal.Length < 4.25
| | [3] Petal.Length < 3.4
| | | [4] Sepal.Width < 3.25: 4.735
| | | [5] Sepal.Width >= 3.25: 5.170
| | [6] Petal.Length >= 3.4: 5.640
| [7] Petal.Length >= 4.25
I am not sure what standard deviation you want to insert, but I expect you could edit that in the same way.
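The capture-and-edit idea can be tried out on a plain character vector before applying it to real output. The sketch below uses three hand-typed lines in the style of the printed tree (the vector co and the slightly stricter pattern are illustrative assumptions, not part of the answer above). The pattern is anchored at the end of the line and also removes the space in front of the parenthesis.

```r
# Three lines in the style of the printed party object
co <- c(
  "[1] root",
  "|   |   |   [4] Sepal.Width < 3.25: 4.735 (n = 20, err = 1.1)",
  "|   |   [6] Petal.Length >= 3.4: 5.640 (n = 20, err = 1.2)"
)

# Drop the trailing "(n = ..., err = ...)" together with the space before it
cleaned <- sub("\\s*\\(n = .*\\)$", "", co)
cat(cleaned, sep = "\n")
```

Lines without a trailing parenthesis, such as the root line, pass through unchanged.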
The print() method for party objects is quite flexible and can be controlled through various panel functions and customizations. See ?print.party for an overview. The documentation is somewhat short and technical, though.
In your case, the easiest solution is to set up a function of the response y, the case weights w (defaulting to all 1 in your case), and the desired number of digits:
myfun <- function(y, w, digits = 2) {
  n <- sum(w)
  m <- weighted.mean(y, w)
  s <- sqrt(weighted.mean((y - m)^2, w) * n/(n - 1))
  sprintf("%s (serr = %s)",
    round(m, digits = digits),
    round(s, digits = digits))
}
And then you can pass that to your print() call:
p <- as.party(fit)
print(p, FUN = myfun)
## Model formula:
## Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
##
## Fitted party:
## [1] root
## | [2] Petal.Length < 4.25
## | | [3] Petal.Length < 3.4
## | | | [4] Sepal.Width < 3.25: 4.735 (serr = 0.239)
## | | | [5] Sepal.Width >= 3.25: 5.17 (serr = 0.289)
## | | [6] Petal.Length >= 3.4: 5.64 (serr = 0.25)
## | [7] Petal.Length >= 4.25
## | | [8] Petal.Length < 6.05
## | | | [9] Petal.Length < 5.15
## | | | | [10] Sepal.Width < 3.05: 6.055 (serr = 0.404)
## | | | | [11] Sepal.Width >= 3.05: 6.53 (serr = 0.38)
## | | | [12] Petal.Length >= 5.15: 6.604 (serr = 0.302)
## | | [13] Petal.Length >= 6.05: 7.578 (serr = 0.228)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
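The quantity labelled serr in myfun is a weighted sample standard deviation. As a quick sanity check, with all weights equal to 1 it should reduce to the ordinary sd(). The snippet below is a small sketch of that check, using the first 20 Sepal.Length values as arbitrary example data.

```r
# Arbitrary example response and unit case weights
y <- iris$Sepal.Length[1:20]
w <- rep(1, length(y))

# Same computation as inside myfun
n <- sum(w)
m <- weighted.mean(y, w)
s <- sqrt(weighted.mean((y - m)^2, w) * n / (n - 1))

# With unit weights this matches the usual sample standard deviation
all.equal(s, sd(y))
```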

partykit object varid mismatch

I am using partykit and noticed a possible varid mismatch (unless I misunderstood something). Example code is below.
The root node returned by nodeapply shows variable 5 as the split variable.
The first element of the explicitly generated node list also has split$varid 5. In the iris data frame, however, the 5th column is Species, and Petal.Width is the 4th column, so I would expect 4 to be the varid of the root node, given the split shown by the j48_party object.
It seems that each varid is the position of the feature actually used, plus 1. Is this intentional?
> library(partykit)
> library(RWeka)
> data("iris")
> j48 <- J48(Species~., data=iris)
> j48_party <- as.party(j48)
> j48_party
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Width <= 0.6: setosa (n = 50, err = 0.0%)
| [3] Petal.Width > 0.6
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.9: versicolor (n = 48, err = 2.1%)
| | | [6] Petal.Length > 4.9
| | | | [7] Petal.Width <= 1.5: virginica (n = 3, err = 0.0%)
| | | | [8] Petal.Width > 1.5: versicolor (n = 3, err = 33.3%)
| | [9] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 4
Number of terminal nodes: 5
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> nodeapply(j48_party)
$`1`
[1] root
| [2] V5 <= 0.6 *
| [3] V5 > 0.6
| | [4] V5 <= 1.7
| | | [5] V4 <= 4.9 *
| | | [6] V4 > 4.9
| | | | [7] V5 <= 1.5 *
| | | | [8] V5 > 1.5 *
| | [9] V5 > 1.7 *
> nodes <- as.list(j48_party$node)
> nodes[[1]]$split$varid
[1] 5
The difference is due to the following: J48() like most other modeling functions (such as lm(), glm(), etc.) does not simply directly use the data supplied but first builds up a model.frame. This already carries out variable transformations (e.g., taking logs, creating factors or Surv() objects), collecting variables that might not be in data but in the calling environment, and leaving out variables that are not in the model formula etc. See ?model.frame for further information and links.
Therefore, the object created by J48() has a model.frame that is not exactly the iris data: the response variable has been moved to the first column.
head(model.frame(j48))
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.1 3.5 1.4 0.2
## 2 setosa 4.9 3.0 1.4 0.2
## 3 setosa 4.7 3.2 1.3 0.2
## 4 setosa 4.6 3.1 1.5 0.2
## 5 setosa 5.0 3.6 1.4 0.2
## 6 setosa 5.4 3.9 1.7 0.4
And the information from this is also carried over to the party object.
j48_party$data
## [1] Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <0 rows> (or 0-length row.names)
[Note: In the case of J48() this only stores the meta-information but drops the actual data because it is not needed here. But this is different for ctree() for example.]
To see that this model.frame() can be different from the original data consider the following situation: we create a new noise variable that is not part of iris but just in the calling environment, take logs, and omit several variables:
set.seed(1)
noise <- rnorm(150)
j48 <- J48(Species ~ log(Petal.Width) + noise, data = iris)
j48_party <- as.party(j48)
head(model.frame(j48))
## Species log(Petal.Width) noise
## 1 setosa -1.6094379 -0.6264538
## 2 setosa -1.6094379 0.1836433
## 3 setosa -1.6094379 -0.8356286
## 4 setosa -1.6094379 1.5952808
## 5 setosa -1.6094379 0.3295078
## 6 setosa -0.9162907 -0.8204684
j48_party$data
## [1] Species log(Petal.Width) noise
## <0 rows> (or 0-length row.names)
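The column shift can be reproduced in base R without fitting any tree. The sketch below builds the model frame for the original formula directly and compares the position of Petal.Width in both objects. The variable names mf, pos_in_data and pos_in_frame are illustrative only.

```r
# Model frame for the formula used in the question: response comes first
mf <- model.frame(Species ~ ., data = iris)

names(iris)
names(mf)

# Position of the root split variable in the raw data vs. the model frame
pos_in_data  <- match("Petal.Width", names(iris))
pos_in_frame <- match("Petal.Width", names(mf))
c(data = pos_in_data, model_frame = pos_in_frame)
```

A varid should therefore be looked up in names(j48_party$data), not in names(iris).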

From Decision Tree to Dataframe or List

I have the following J48 decision tree (res): (This is an example)
> res
J48 pruned tree
------------------
Petal.Width <= 0.6: setosa (50.0)
Petal.Width > 0.6
| Petal.Width <= 1.7
| | Petal.Length <= 4.9: versicolor (48.0/1.0)
| | Petal.Length > 4.9
| | | Petal.Width <= 1.5: virginica (3.0)
| | | Petal.Width > 1.5: versicolor (3.0/1.0)
| Petal.Width > 1.7: virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
It is created as follows:
library(RWeka)
data(iris)
res = J48(Species ~., data = iris)
I would like to transform it into a data frame or list in the following format:
source_node1 target_node1
source_node2 target_node2
Here is the required result:
First format (with numbers):
Petal.Width_0.6 Petal.Width_1.7
Petal.Width_1.7 Petal.Length_4.9
Petal.Length_4.9 Petal.Width_1.5
Second format (same with no numbers):
Petal.Width Petal.Width
Petal.Width Petal.Length
Petal.Length Petal.Width
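One possible way to get such an edge list is to parse the printed tree directly in base R. The following is a rough sketch under several assumptions: the tree text is typed in as a character vector (tree_txt is a made-up name), all splits are numeric, the nesting depth equals the number of "|" characters on a line, and a variable plus threshold identifies a split uniquely.

```r
# Printed J48 tree from the question, one string per line
tree_txt <- c(
  "Petal.Width <= 0.6: setosa (50.0)",
  "Petal.Width > 0.6",
  "| Petal.Width <= 1.7",
  "| | Petal.Length <= 4.9: versicolor (48.0/1.0)",
  "| | Petal.Length > 4.9",
  "| | | Petal.Width <= 1.5: virginica (3.0)",
  "| | | Petal.Width > 1.5: versicolor (3.0/1.0)",
  "| Petal.Width > 1.7: virginica (46.0/1.0)"
)

# Depth of a line = number of "|" characters in front of the condition
depth <- lengths(regmatches(tree_txt, gregexpr("|", tree_txt, fixed = TRUE)))

# Split each condition into variable, operator and threshold
body  <- sub("^[| ]*", "", tree_txt)
parts <- regmatches(body, regexec("^(\\S+) (<=|>) ([0-9.]+)", body))
label <- vapply(parts, function(p) paste(p[2], p[4], sep = "_"), character(1))

# Walk the lines, remembering the most recent split seen at each depth
current <- character(0)
edges <- list()
for (i in seq_along(tree_txt)) {
  d <- depth[i]
  if (d > 0) {
    edges[[length(edges) + 1]] <- c(source = current[d], target = label[i])
  }
  current[d + 1] <- label[i]
}

# Both branches of a split yield the same edge, so keep unique rows only
edges <- unique(as.data.frame(do.call(rbind, edges), stringsAsFactors = FALSE))
rownames(edges) <- NULL
edges
```

For the second format, the thresholds can be stripped afterwards, for example with sub("_[0-9.]+$", "", edges$source) and the same call on edges$target.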

Using data.tree for txt format trees and getting kids for each level

I have the following tree in txt file (you can copy paste and save it into txt file):
R> res
J48 pruned tree
------------------
Petal.Width <= 0.6: setosa (50.0)
Petal.Width > 0.6
| Petal.Width <= 1.7
| | Petal.Length <= 4.9: versicolor (48.0/1.0)
| | Petal.Length > 4.9
| | | Petal.Width <= 1.5: virginica (3.0)
| | | Petal.Width > 1.5: versicolor (3.0/1.0)
| Petal.Width > 1.7: virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
This is my input file. For each node (parent) of the tree I would like to get the list of its children (this is only an example). Can I convert this txt format into a tree using data.tree? And how can I get the children of each parent at every level?

R function to get the rules applied by rpart

iris <- read.csv("iris.csv") #iris data available in R
library(rpart)
iris.rpart <- rpart(Species~Sepal.length+Sepal.width+Petal.width+Petal.length,
                    data=iris)
plotcp(iris.rpart)
printcp(iris.rpart)
iris.rpart1 <- prune(iris.rpart, cp=0.047)
plot(iris.rpart1,uniform=TRUE)
text(iris.rpart1, use.n=TRUE, cex=0.6)
I have fitted an rpart tree to the iris data. Is there a function in R that returns the rules rpart applies in this tree, so that we know how new points added to the data set would be classified?
The rpart.plot package has a function rpart.rules for generating a set of rules for a tree. For example
library(rpart.plot)
iris.rpart <- rpart(Species~., data=iris)
rpart.rules(iris.rpart)
gives
Species seto vers virg
setosa [1.00 .00 .00] when Petal.Length < 2.5
versicolor [ .00 .91 .09] when Petal.Length >= 2.5 & Petal.Width < 1.8
virginica [ .00 .02 .98] when Petal.Length >= 2.5 & Petal.Width >= 1.8
And
options(width=1000)
rpart.predict(iris.rpart, newdata=iris[50:52,], rules=TRUE)
gives you the rule used to make each prediction:
setosa versicolor virginica
50 1 0.00000 0.000000 because Petal.Length < 2.5
51 0 0.90741 0.092593 because Petal.Length >= 2.5 & Petal.Width < 1.8
52 0 0.90741 0.092593 because Petal.Length >= 2.5 & Petal.Width < 1.8
For more examples see Chapter 4 of the rpart.plot vignette.
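If installing rpart.plot is not an option, a different route is path.rpart() from rpart itself, which returns the chain of split conditions leading to given nodes. The sketch below is not the rpart.rules approach described above. It assumes the built-in iris data and a default tree, and the names fit, leaves and rules are illustrative.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris)

# Node numbers of the terminal nodes are the row names of fit$frame
leaves <- as.numeric(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# One character vector of split conditions per leaf, starting at "root"
rules <- path.rpart(fit, nodes = leaves, print.it = FALSE)
rules
```

Each list element is named by its node number, so the conditions can be matched against the rows of fit$frame for class counts.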
