I am using partykit and noticed a possible varid mismatch (unless I misunderstood something). Below is the example code.
The root node as returned by nodeapply shows variable 5 as the split variable.
Also, the first element of the explicitly generated list has split$varid 5. If we look at the iris data frame, the 5th column is Species and Petal.Width is the 4th column, which should be the varid for the root node as shown by the j48_party object.
It seems like each varid is the index of the feature actually used plus 1. Is this intentional?
> library(partykit)
> library(RWeka)
> data("iris")
> j48 <- J48(Species~., data=iris)
> j48_party <- as.party(j48)
> j48_party
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Width <= 0.6: setosa (n = 50, err = 0.0%)
| [3] Petal.Width > 0.6
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.9: versicolor (n = 48, err = 2.1%)
| | | [6] Petal.Length > 4.9
| | | | [7] Petal.Width <= 1.5: virginica (n = 3, err = 0.0%)
| | | | [8] Petal.Width > 1.5: versicolor (n = 3, err = 33.3%)
| | [9] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 4
Number of terminal nodes: 5
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> nodeapply(j48_party)
$`1`
[1] root
| [2] V5 <= 0.6 *
| [3] V5 > 0.6
| | [4] V5 <= 1.7
| | | [5] V4 <= 4.9 *
| | | [6] V4 > 4.9
| | | | [7] V5 <= 1.5 *
| | | | [8] V5 > 1.5 *
| | [9] V5 > 1.7 *
> nodes <- as.list(j48_party$node)
> nodes[[1]]$split$varid
[1] 5
The difference is due to the following: J48(), like most other modeling functions (such as lm(), glm(), etc.), does not simply use the supplied data directly but first builds up a model.frame. This already carries out variable transformations (e.g., taking logs, creating factors or Surv() objects), collects variables that might not be in data but in the calling environment, and leaves out variables that are not in the model formula, etc. See ?model.frame for further information and links.
Therefore, the object created by J48() has a model.frame that is not exactly the iris data: the response variable has been moved to the first column:
head(model.frame(j48))
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.1 3.5 1.4 0.2
## 2 setosa 4.9 3.0 1.4 0.2
## 3 setosa 4.7 3.2 1.3 0.2
## 4 setosa 4.6 3.1 1.5 0.2
## 5 setosa 5.0 3.6 1.4 0.2
## 6 setosa 5.4 3.9 1.7 0.4
And the information from this is also carried over to the party object.
j48_party$data
## [1] Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <0 rows> (or 0-length row.names)
[Note: In the case of J48() this only stores the meta-information but drops the actual data because it is not needed here. But this is different for ctree() for example.]
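The index shift can be reproduced with base R alone. The following is a minimal sketch, assuming that a plain model.frame() call with the same formula mirrors the frame that J48() builds internally (as described above); the names mf and varid are only illustrative.

```r
data("iris")

# Build the model frame for the same formula; the response comes first.
mf <- model.frame(Species ~ ., data = iris)
names(mf)
## [1] "Species"      "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"

# varid as reported for the root node of the party object in the question.
varid <- 5

# Interpreted against the model frame, varid 5 is Petal.Width ...
names(mf)[varid]
## [1] "Petal.Width"

# ... whereas in the original data frame Petal.Width sits in column 4.
match("Petal.Width", names(iris))
## [1] 4
```

In other words, a varid should be looked up in the data stored with the fitted object (here names(j48_party$data)), not in the original data frame.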
To see that this model.frame() can be different from the original data consider the following situation: we create a new noise variable that is not part of iris but just in the calling environment, take logs, and omit several variables:
set.seed(1)
noise <- rnorm(150)
j48 <- J48(Species ~ log(Petal.Width) + noise, data = iris)
j48_party <- as.party(j48)
head(model.frame(j48))
## Species log(Petal.Width) noise
## 1 setosa -1.6094379 -0.6264538
## 2 setosa -1.6094379 0.1836433
## 3 setosa -1.6094379 -0.8356286
## 4 setosa -1.6094379 1.5952808
## 5 setosa -1.6094379 0.3295078
## 6 setosa -0.9162907 -0.8204684
j48_party$data
## [1] Species log(Petal.Width) noise
## <0 rows> (or 0-length row.names)
Can anybody explain how I get the results for each leaf in a decision tree made by J48 from the RWeka package?
So for example we have this iris dataset in R:
library(RWeka)
m1 <- J48(Species ~ ., data = iris)
m1
For prediction I want to use the class proportions in a leaf. I tried the partykit package, but it still looks too complicated just to get the proportions in each leaf.
library(partykit)
pres <- as.party(m1)
partykit:::.list.rules.party(pres)
At least I get the number of leaves in the list, but can't find the probability.
pres
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Width <= 0.6: setosa (n = 50, err = 0.0%)
| [3] Petal.Width > 0.6
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.9: versicolor (n = 48, err = 2.1%)
| | | [6] Petal.Length > 4.9
| | | | [7] Petal.Width <= 1.5: virginica (n = 3, err = 0.0%)
| | | | [8] Petal.Width > 1.5: versicolor (n = 3, err = 33.3%)
| | [9] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 4
Number of terminal nodes: 5
So as a prediction I want, for example, for a new data point where Petal.Width > 0.6, Petal.Width <= 1.7, and Petal.Length <= 4.9, the result versicolor 97.9% and 2.1% other. How can I get these predictions?
Your point is not a point: the conditions describe a region of the feature space, not a fully specified observation. If you fully specify a point, you can simply plug it into the predict function. For example, I will generate a point that meets the specifications but is unlike other iris points, then classify it.
## Generate wild new point
NewPoint = iris[1,]
NewPoint[1,3:4] = c(2.0,1.7)
NewPoint
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 2 1.7 setosa
## Look at where the new point is
plot(iris[,3:4], pch=20, col=rainbow(3, alpha=0.3)[iris$Species])
points(NewPoint[,3:4], pch=16, col="orange")
## Get the probability from the model
predict(m1, newdata = NewPoint, type = "probability")
setosa versicolor virginica
1 0 0.9791667 0.02083333
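The probabilities returned by predict() here match the class proportions of the training observations in the matching leaf. As a rough cross-check that needs neither RWeka nor partykit, the sketch below applies the split rules of node 5 to iris by hand. The rules are copied from the printed tree, so they are an assumption tied to this particular fit, and in_leaf and leaf_prop are illustrative names.

```r
data("iris")

# Split rules leading to node 5, copied by hand from the printed tree.
in_leaf <- with(iris,
                Petal.Width > 0.6 & Petal.Width <= 1.7 & Petal.Length <= 4.9)

# Number of training observations that end up in this leaf.
sum(in_leaf)

# Class proportions among those observations.
leaf_prop <- prop.table(table(iris$Species[in_leaf]))
round(leaf_prop, 4)
```

This only works when the rules are known in advance; for arbitrary new data, predict() on the fitted model remains the simpler route.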
New here and not very experienced, and I'm trying to get a project in R shinyapp to work.
I have a list of data frames which have a column labeled 'Gender' containing all/M/F. I want to filter all data frames based on the input, so that if the input is male, only rows containing M or all are kept.
list_tables <- list(adverb,adjective,simplenoun,verber,thingnoun,
personnoun,name_firstpart,name_secondpart)
input$gender <- "male"
if(input$gender == "male"){
for (i in list_tables){
list_tables$i <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
}
Problem is, if I check the list afterwards, nothing has changed. If I do the same, but instead of using a for loop to cycle through the dataframes, I perform the same actions on only one dataframe, it does work. Theoretically, I could make a line of code for each dataframe separately, but it doesn't seem very neat and I have the feeling that the for loop should work but I'm just missing something. Would love to hear tips if anyone has them!
i is not a named-entry within list_tables, so list_tables$i doesn't work. Inside that loop, i is the data.frame you're trying to modify, but you don't update it.
Try either:
for (ind in seq_along(list_tables)) {
i <- list_tables[[ind]] # feels a little sloppy, but it's compact ...
list_tables[[ind]] <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
or even better
list_tables <- lapply(list_tables, function(i) i[which((i$Gender=="M")|(i$Gender=="all")),])
You could use lapply with subset:
example:
list_tables <- replicate(2,iris[c(1,51,101),],F)
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
solution:
lapply(list_tables,subset,Species %in% c("setosa","virginica"))
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
In your case that would be:
lapply(list_tables,subset,Gender %in% c("M","all"))
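To tie the answers together, here is a small self-contained sketch with toy data. The two data frames are stand-ins for the real word tables (which are assumed to have a Gender column), and filter_by_gender() is a hypothetical helper, not part of any package. It maps the input value to the Gender codes to keep and returns a new, filtered list.

```r
# Toy stand-ins for the word tables from the question.
list_tables <- list(
  adverb = data.frame(word = c("quickly", "boldly", "softly"),
                      Gender = c("all", "M", "F"),
                      stringsAsFactors = FALSE),
  adjective = data.frame(word = c("brave", "kind"),
                         Gender = c("F", "all"),
                         stringsAsFactors = FALSE)
)

# Hypothetical helper: map the input value to the Gender codes to keep,
# then filter every data frame in the list.
filter_by_gender <- function(tables, gender) {
  keep <- switch(gender,
                 male = c("M", "all"),
                 female = c("F", "all"),
                 c("M", "F", "all"))  # any other input keeps everything
  lapply(tables, function(tab) tab[tab$Gender %in% keep, , drop = FALSE])
}

filtered <- filter_by_gender(list_tables, "male")
filtered$adverb$word
## [1] "quickly" "boldly"
filtered$adjective$word
## [1] "kind"
```

Because lapply() returns a new list, the result has to be assigned (here to filtered); the loop variable in the original for loop was only a copy of each data frame, which is why the list never changed.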
Hi, I'm currently trying to extract some of the inner node information stored in the constparty object in R using ctree in partykit, but I'm finding navigating the objects a bit difficult. I'm able to display the information on a plot, but I'm not sure how to extract it. I think it requires nodeapply or another function in partykit?
library(partykit)
irisct <- ctree(Species ~ .,data = iris)
plot(irisct, inner_panel = node_barplot(irisct))
Plot with inner node details
All the information is accessible by the functions to plot, but I'm after a text output similar to:
Example output
The main trick (as previously pointed out by @G5W) is to take the [id] subset of the party object and then extract the data (by either $data or using the data_party() function), which contains the response. I would recommend building a table with absolute frequencies first and then computing the relative and marginal frequencies from that. Using the irisct object, the plain table can be obtained by
tab <- sapply(1:length(irisct), function(id) {
y <- data_party(irisct[id])
y <- y[["(response)"]]
table(y)
})
tab
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## setosa 50 50 0 0 0 0 0
## versicolor 50 0 50 49 45 4 1
## virginica 50 0 50 5 1 4 45
Then we can add a little bit of formatting to a nice table object:
colnames(tab) <- 1:length(irisct)
tab <- as.table(tab)
names(dimnames(tab)) <- c("Species", "Node")
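And then use prop.table() and margin.table() to compute the frequencies we are interested in. The as.data.frame() method transforms the table layout into a "long" data.frame. Note that margin 1, as used here, normalizes each species row across all (nested) nodes, whereas prop.table(tab, 2) would give the class proportions within each node: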
as.data.frame(prop.table(tab, 1))
## Species Node Freq
## 1 setosa 1 0.500000000
## 2 versicolor 1 0.251256281
## 3 virginica 1 0.322580645
## 4 setosa 2 0.500000000
## 5 versicolor 2 0.000000000
## 6 virginica 2 0.000000000
## 7 setosa 3 0.000000000
## 8 versicolor 3 0.251256281
## 9 virginica 3 0.322580645
## 10 setosa 4 0.000000000
## 11 versicolor 4 0.246231156
## 12 virginica 4 0.032258065
## 13 setosa 5 0.000000000
## 14 versicolor 5 0.226130653
## 15 virginica 5 0.006451613
## 16 setosa 6 0.000000000
## 17 versicolor 6 0.020100503
## 18 virginica 6 0.025806452
## 19 setosa 7 0.000000000
## 20 versicolor 7 0.005025126
## 21 virginica 7 0.290322581
as.data.frame(margin.table(tab, 2))
## Node Freq
## 1 1 150
## 2 2 50
## 3 3 100
## 4 4 54
## 5 5 46
## 6 6 8
## 7 7 46
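Because the nodes are nested, row-wise proportions spread each species over overlapping nodes. If the goal is the class distribution within each node, column-wise proportions are probably what is wanted. The sketch below is self-contained and assumes the absolute frequencies shown above, typed in by hand so that it runs without partykit; node_prop is an illustrative name.

```r
# Absolute frequencies per node, typed in from the table shown above
# (one column per node, rows are setosa, versicolor, virginica).
tab <- as.table(matrix(
  c(50, 50, 50,   # node 1
    50,  0,  0,   # node 2
     0, 50, 50,   # node 3
     0, 49,  5,   # node 4
     0, 45,  1,   # node 5
     0,  4,  4,   # node 6
     0,  1, 45),  # node 7
  nrow = 3,
  dimnames = list(Species = c("setosa", "versicolor", "virginica"),
                  Node = as.character(1:7))
))

# The column sums reproduce the node sizes given by margin.table(tab, 2).
margin.table(tab, 2)

# Class proportions within each node: every column sums to 1.
node_prop <- prop.table(tab, 2)
round(node_prop, 3)

# Long format, one row per Species/Node combination.
head(as.data.frame(node_prop))
```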
And the split information can be obtained with the (still unexported) .list.rules.party() function. You just need to ask for all node IDs (the default is to use just the terminal node IDs):
partykit:::.list.rules.party(irisct, i = nodeids(irisct))
## 1
## ""
## 2
## "Petal.Length <= 1.9"
## 3
## "Petal.Length > 1.9"
## 4
## "Petal.Length > 1.9 & Petal.Width <= 1.7"
## 5
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length <= 4.8"
## 6
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length > 4.8"
## 7
## "Petal.Length > 1.9 & Petal.Width > 1.7"
Most of the information that you want is accessible without much work.
I will show how to get the information, but leave you to format the
information into a pretty table.
Notice that your tree structure irisct is just a list of each of the nodes.
length(irisct)
[1] 7
Each node has a field data that contains the points that have made it down
this far in the tree, so you can get the number of observations at the node
by counting the rows.
dim(irisct[4]$data)
[1] 54 5
nrow(irisct[4]$data)
[1] 54
Or doing them all at once to get your table 2
NObs = sapply(1:7, function(n) { nrow(irisct[n]$data) })
NObs
[1] 150 50 100 54 46 8 46
The first column of the data at a node is the class (Species),
so you can get the count of each class and the probability of each class
at a node
table(irisct[4]$data[1])
setosa versicolor virginica
0 49 5
table(irisct[4]$data[1]) / NObs[4]
setosa versicolor virginica
0.00000000 0.90740741 0.09259259
The split information in your table 3 is a bit more awkward. Still,
you can get a text version of what you need just by printing out the
top level node
irisct[1]
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Length <= 1.9: setosa (n = 50, err = 0.0%)
| [3] Petal.Length > 1.9
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.8: versicolor (n = 46, err = 2.2%)
| | | [6] Petal.Length > 4.8: versicolor (n = 8, err = 50.0%)
| | [7] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 3
Number of terminal nodes: 4
To save the output for parsing and display
TreeSplits = capture.output(print(irisct[1]))
I have the following J48 decision tree (res): (This is an example)
> res
J48 pruned tree
------------------
Petal.Width <= 0.6: setosa (50.0)
Petal.Width > 0.6
| Petal.Width <= 1.7
| | Petal.Length <= 4.9: versicolor (48.0/1.0)
| | Petal.Length > 4.9
| | | Petal.Width <= 1.5: virginica (3.0)
| | | Petal.Width > 1.5: versicolor (3.0/1.0)
| Petal.Width > 1.7: virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
It is created as follows:
library(RWeka)
data(iris)
res = J48(Species ~., data = iris)
I would like to transform it into dataframe or list in the following format:
source_node1 target_node1
source_node2 target_node2
Here is the required result:
First format (with numbers):
Petal.Width_0.6 Petal.Width_1.7
Petal.Width_1.7 Petal.Length_4.9
Petal.Length_4.9 Petal.Width_1.5
Second format (same with no numbers):
Petal.Width Petal.Width
Petal.Width Petal.Length
Petal.Length Petal.Width
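One possible way to obtain the source/target pairs requested above is to parse the printed tree as text. The sketch below is only an illustration under several assumptions: the tree lines are pasted in as a character vector (tree_txt) rather than taken from the res object, the nesting depth is read off the number of "|" characters, and every rule has the form "variable <= value" or "variable > value". All names are hypothetical.

```r
# The printed J48 tree, pasted in by hand (header and footer lines omitted).
tree_txt <- c(
  "Petal.Width <= 0.6: setosa (50.0)",
  "Petal.Width > 0.6",
  "|   Petal.Width <= 1.7",
  "|   |   Petal.Length <= 4.9: versicolor (48.0/1.0)",
  "|   |   Petal.Length > 4.9",
  "|   |   |   Petal.Width <= 1.5: virginica (3.0)",
  "|   |   |   Petal.Width > 1.5: versicolor (3.0/1.0)",
  "|   Petal.Width > 1.7: virginica (46.0/1.0)"
)

# Nesting depth of each line = number of "|" characters.
depth <- lengths(regmatches(tree_txt, gregexpr("|", tree_txt, fixed = TRUE)))

# Strip the leading bars and spaces, then pull out variable and threshold.
rule <- sub("^[| ]+", "", tree_txt)
pat <- "^([^ ]+) (<=|>) ([0-9.]+).*$"
label <- paste(sub(pat, "\\1", rule), sub(pat, "\\3", rule), sep = "_")

# The parent of a line is the closest preceding line that is one level up.
parent_of <- function(i) {
  if (depth[i] == 0) return(NA_character_)
  candidates <- which(depth[seq_len(i - 1)] == depth[i] - 1)
  label[max(candidates)]
}
src <- vapply(seq_along(tree_txt), parent_of, character(1))

# Keep lines that have a parent and drop the duplicated "<=" / ">" pairs.
edges <- data.frame(source = src, target = label, stringsAsFactors = FALSE)
edges <- unique(edges[!is.na(edges$source), ])
rownames(edges) <- NULL
edges
##             source           target
## 1  Petal.Width_0.6  Petal.Width_1.7
## 2  Petal.Width_1.7 Petal.Length_4.9
## 3 Petal.Length_4.9  Petal.Width_1.5
```

The second format (variable names only) would follow by removing the suffix, for example with sub("_.*$", "", edges$source) and the same call on edges$target.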
I have the following tree in txt file (you can copy paste and save it into txt file):
R> res
J48 pruned tree
------------------
Petal.Width <= 0.6: setosa (50.0)
Petal.Width > 0.6
| Petal.Width <= 1.7
| | Petal.Length <= 4.9: versicolor (48.0/1.0)
| | Petal.Length > 4.9
| | | Petal.Width <= 1.5: virginica (3.0)
| | | Petal.Width > 1.5: versicolor (3.0/1.0)
| Petal.Width > 1.7: virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
This is my input file. I would like to get the list of each parent node and its children in the tree (it is only an example). I would like to know if I can convert this txt format into a tree by using data.tree. And how can I get the children of each parent at each level?