R Extracting inner node information and splits from ctree (partykit) - r

Hi I'm currently trying to extract some of the inner node information stored in the constant partying object in R using ctree in partykit but I'm finding navigating the objects a bit difficult, I'm able to display the information on a plot but I'm not sure how to extract the information - I think it requires nodeapply or another function in the partykit?
library(partykit)
irisct <- ctree(Species ~ .,data = iris)
plot(irisct, inner_panel = node_barplot(irisct))
Plot with inner node details
All the information is accessible by the functions to plot, but I'm after a text output similar to:
Example output

The main trick (as previously pointed out by #G5W) is to take the [id] subset of the party object and then extract the data (by either $data or using the data_party() function) which contains the response. I would recommend to build a table with absolute frequencies first and then compute the relative and marginal frequencies from that. Using the irisct object the plain table can be obtained by
tab <- sapply(1:length(irisct), function(id) {
y <- data_party(irisct[id])
y <- y[["(response)"]]
table(y)
})
tab
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## setosa 50 50 0 0 0 0 0
## versicolor 50 0 50 49 45 4 1
## virginica 50 0 50 5 1 4 45
Then we can add a little bit of formatting to a nice table object:
colnames(tab) <- 1:length(irisct)
tab <- as.table(tab)
names(dimnames(tab)) <- c("Species", "Node")
And then use prop.table() and margin.table() to compute the frequencies we are interested in. The as.data.frame() method transform from the table layout to a "long" data.frame:
as.data.frame(prop.table(tab, 1))
## Species Node Freq
## 1 setosa 1 0.500000000
## 2 versicolor 1 0.251256281
## 3 virginica 1 0.322580645
## 4 setosa 2 0.500000000
## 5 versicolor 2 0.000000000
## 6 virginica 2 0.000000000
## 7 setosa 3 0.000000000
## 8 versicolor 3 0.251256281
## 9 virginica 3 0.322580645
## 10 setosa 4 0.000000000
## 11 versicolor 4 0.246231156
## 12 virginica 4 0.032258065
## 13 setosa 5 0.000000000
## 14 versicolor 5 0.226130653
## 15 virginica 5 0.006451613
## 16 setosa 6 0.000000000
## 17 versicolor 6 0.020100503
## 18 virginica 6 0.025806452
## 19 setosa 7 0.000000000
## 20 versicolor 7 0.005025126
## 21 virginica 7 0.290322581
as.data.frame(margin.table(tab, 2))
## Node Freq
## 1 1 150
## 2 2 50
## 3 3 100
## 4 4 54
## 5 5 46
## 6 6 8
## 7 7 46
And the split information can be obtained with the (still unexported) .list.rules.party() function. You just need to ask for all node IDs (the default is to use just the terminal node IDs):
partykit:::.list.rules.party(irisct, i = nodeids(irisct))
## 1
## ""
## 2
## "Petal.Length <= 1.9"
## 3
## "Petal.Length > 1.9"
## 4
## "Petal.Length > 1.9 & Petal.Width <= 1.7"
## 5
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length <= 4.8"
## 6
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length > 4.8"
## 7
## "Petal.Length > 1.9 & Petal.Width > 1.7"

Most of the information that you want is accessible without much work.
I will show how to get the information, but leave you to format the
information into a pretty table.
Notice that your tree structure irisct is just a list of each of the nodes.
length(irisct)
[1] 7
Each node has a field data that contains the points that have made it down
this far in the tree, so you can get the number of observations at the node
by counting the rows.
dim(irisct[4]$data)
[1] 54 5
nrow(irisct[4]$data)
[1] 54
Or doing them all at once to get your table 2
NObs = sapply(1:7, function(n) { nrow(irisct[n]$data) })
NObs
[1] 150 50 100 54 46 8 46
The first column of the data at a node is the class (Species),
so you can get the count of each class and the probability of each class
at a node
table(irisct[4]$data[1])
setosa versicolor virginica
0 49 5
table(irisct[4]$data[1]) / NObs[4]
setosa versicolor virginica
0.00000000 0.90740741 0.09259259
The split information in your table 3 is a bit more awkward. Still,
you can get a text version of what you need just by printing out the
top level node
irisct[1]
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Length <= 1.9: setosa (n = 50, err = 0.0%)
| [3] Petal.Length > 1.9
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.8: versicolor (n = 46, err = 2.2%)
| | | [6] Petal.Length > 4.8: versicolor (n = 8, err = 50.0%)
| | [7] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 3
Number of terminal nodes: 4
To save the output for parsing and display
TreeSplits = capture.output(print(irisct[1]))

Related

Appending List Elements in write.table

I have a list of what are essentially tables of different variables, with a reproducible dummy example below (it's a little ugly, but it gets the idea across).
results <- list()
for(ii in names(iris)[1:4]) {
mytab <- table(iris[,i] > mean(iris[,i]), iris$Species)
myp <- chisq.test(mytab)$p.value
results[[ii]] <- as.data.frame(cbind(mytab, P.value=myp))
results[[ii]] <- tibble::rownames_to_column(results[[ii]], ii)
}
In a previous version R (at least 4.0), I used to be able to do something like:
lapply(results, function(x) write.table(x, "myfile.txt", append=T, sep="\t", quote=F, row.names=F))
which would generate a file called myfile.txt and fill it with all of my tables, much like the list of printed tables from results. I've had this code (which was functioning as expected) since at least the end of 2021. However, I now get the error:
Error in write.table(x, "myfile.txt", append = T, sep = "\t", quote = T, :
(converted from warning) appending column names to file
And to some extent I get it -- the column names I'm using aren't identical to what I'm appending, but I don't really care for my purposes. I just want my printed list of tables. Is there a way to force appending irrespective of mismatched column names? I've tried using col.names=NA but then receive the error that using col.names=NA with row.names=F "makes no sense". Do I need to resign myself to using functions like sink for this? I'd really like everything to remain tab-separated if possible.
It appears to be baked-in, depending solely on the col.names and append arguments and no easy way to squelch it there.
In general it's just a warning, but since it was elevated to Error status, that suggests you've set options(warn = 2) or higher. It's not a factor for these resolutions (which result in no warning being emitted and therefore no escalation to an error).
Suppress it and all other warnings (for good or bad):
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F)
# Error in write.table(data.frame(a = 1, b = 2), "quux.csv", append = T, :
# (converted from warning) appending column names to file
suppressWarnings(write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F))
### nothing emitted, file appended
Suppress just that warning, allowing others (since suppressing all can hide other issues):
withCallingHandlers(
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F),
warning = function(w) {
if (grepl("appending column names to file", conditionMessage(w))) {
invokeRestart("muffleWarning")
}
})
### nothing emitted, file appended
withCallingHandlers(
write.table(data.frame(a=1,b=2), "quux.csv", append=T, sep="\t", quote=F, row.names=F),
warning = function(w) {
if (grepl("something else", conditionMessage(w))) {
invokeRestart("muffleWarning")
}
})
# Error in write.table(data.frame(a = 1, b = 2), "quux.csv", append = T, :
# (converted from warning) appending column names to file
Another potential solution is to use write.list() from the erer package:
library(erer)
#> Loading required package: lmtest
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
results <- list()
for(ii in names(iris)[1:4]) {
mytab <- table(iris[,ii] > mean(iris[,ii]), iris$Species)
myp <- chisq.test(mytab)$p.value
results[[ii]] <- as.data.frame(cbind(mytab, P.value=myp))
results[[ii]] <- tibble::rownames_to_column(results[[ii]], ii)
}
write.list(z = results, file = "myfile.txt", row.names = FALSE, quote = FALSE)
read.csv("~/Desktop/myfile.txt")
#> Result Sepal.Length setosa versicolor virginica P.value
#> 1 Sepal.Length FALSE 50 24 6 8.373761e-18
#> 2 Sepal.Length TRUE 0 26 44 8.373761e-18
#> 3
#> 4 Result Sepal.Width setosa versicolor virginica P.value
#> 5 Sepal.Width FALSE 8 42 33 1.24116e-11
#> 6 Sepal.Width TRUE 42 8 17 1.24116e-11
#> 7
#> 8 Result Petal.Length setosa versicolor virginica P.value
#> 9 Petal.Length FALSE 50 7 0 9.471374e-28
#> 10 Petal.Length TRUE 0 43 50 9.471374e-28
#> 11
#> 12 Result Petal.Width setosa versicolor virginica P.value
#> 13 Petal.Width FALSE 50 10 0 4.636126e-26
#> 14 Petal.Width TRUE 0 40 50 4.636126e-26
#> 15
# You can also specify the table names, e.g.
write.list(z = results, file = "myfile2.txt", row.names = FALSE, quote = FALSE, t.name = 1:4)
read.csv("~/Desktop/myfile2.txt")
#> Result Sepal.Length setosa versicolor virginica P.value
#> 1 1 FALSE 50 24 6 8.373761e-18
#> 2 1 TRUE 0 26 44 8.373761e-18
#> 3
#> 4 Result Sepal.Width setosa versicolor virginica P.value
#> 5 2 FALSE 8 42 33 1.24116e-11
#> 6 2 TRUE 42 8 17 1.24116e-11
#> 7
#> 8 Result Petal.Length setosa versicolor virginica P.value
#> 9 3 FALSE 50 7 0 9.471374e-28
#> 10 3 TRUE 0 43 50 9.471374e-28
#> 11
#> 12 Result Petal.Width setosa versicolor virginica P.value
#> 13 4 FALSE 50 10 0 4.636126e-26
#> 14 4 TRUE 0 40 50 4.636126e-26
#> 15
Created on 2022-07-19 by the reprex package (v2.0.1)

partykit object varid mismatch

I am using partykit and noticed a possible varid mismatch (unless I misunderstood something). Below is the example code.
The root node as returned by nodeapply shows variable 5 as the split variable.
Also the first element of the explicitly generated list has split$varid 5. If we look at the iris data frame then the 5th column is Species, and Petal.Width is 4th column which should be the varid for the root node as shown by the j48_party object.
It seems like the varid are actual feature used +1, is this intentional?
> library(partykit)
> library(RWeka)
> data("iris")
> j48 <- J48(Species~., data=iris)
> j48_party <- as.party(j48)
> j48_party
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Width <= 0.6: setosa (n = 50, err = 0.0%)
| [3] Petal.Width > 0.6
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.9: versicolor (n = 48, err = 2.1%)
| | | [6] Petal.Length > 4.9
| | | | [7] Petal.Width <= 1.5: virginica (n = 3, err = 0.0%)
| | | | [8] Petal.Width > 1.5: versicolor (n = 3, err = 33.3%)
| | [9] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 4
Number of terminal nodes: 5
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> nodeapply(j48_party)
$`1`
[1] root
| [2] V5 <= 0.6 *
| [3] V5 > 0.6
| | [4] V5 <= 1.7
| | | [5] V4 <= 4.9 *
| | | [6] V4 > 4.9
| | | | [7] V5 <= 1.5 *
| | | | [8] V5 > 1.5 *
| | [9] V5 > 1.7 *
> nodes <- as.list(j48_party$node)
> nodes[[1]]$split$varid
[1] 5
The difference is due to the following: J48() like most other modeling functions (such as lm(), glm(), etc.) does not simply directly use the data supplied but first builds up a model.frame. This already carries out variable transformations (e.g., taking logs, creating factors or Surv() objects), collecting variables that might not be in data but in the calling environment, and leaving out variables that are not in the model formula etc. See ?model.frame for further information and links.
Therefore, the object created by J48() has a model.frame that is not exactly the iris data but the response variable was moved to the first column:
head(model.frame(j48))
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.1 3.5 1.4 0.2
## 2 setosa 4.9 3.0 1.4 0.2
## 3 setosa 4.7 3.2 1.3 0.2
## 4 setosa 4.6 3.1 1.5 0.2
## 5 setosa 5.0 3.6 1.4 0.2
## 6 setosa 5.4 3.9 1.7 0.4
And the information from this is also carried over to the party object.
j48_party$data
## [1] Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <0 rows> (or 0-length row.names)
[Note: In the case of J48() this only stores the meta-information but drops the actual data because it is not needed here. But this is different for ctree() for example.]
To see that this model.frame() can be different from the original data consider the following situation: we create a new noise variable that is not part of iris but just in the calling environment, take logs, and omit several variables:
set.seed(1)
noise <- rnorm(150)
j48 <- J48(Species ~ log(Petal.Width) + noise, data = iris)
j48_party <- as.party(j48)
head(model.frame(j48))
## Species log(Petal.Width) noise
## 1 setosa -1.6094379 -0.6264538
## 2 setosa -1.6094379 0.1836433
## 3 setosa -1.6094379 -0.8356286
## 4 setosa -1.6094379 1.5952808
## 5 setosa -1.6094379 0.3295078
## 6 setosa -0.9162907 -0.8204684
j48_party$data
## [1] Species log(Petal.Width) noise
## <0 rows> (or 0-length row.names)

R - pareto like summary for histograms

I would like to generate summary of a histogram in a table format. With plot=FALSE, i am able to get histogram object.
> hist(y,plot=FALSE)
$breaks
[1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
$counts
[1] 48 1339 20454 893070 1045286 24284 518 171 148
[10] 94 42 42 37 25 18 21 14 5
$density
[1] 0.00012086929 0.00337174962 0.05150542703 2.24884871999 2.63214538964
[6] 0.06114978928 0.00130438111 0.00043059685 0.00037268032 0.00023670236
[11] 0.00010576063 0.00010576063 0.00009317008 0.00006295276 0.00004532598
[16] 0.00005288032 0.00003525354 0.00001259055
$mids
[1] 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7
$xname
[1] "y"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Is there a way to summarize this object like pareto chart summary. (Below summary is for different data, including this as an example)
Pareto chart analysis for counts
Frequency Cum.Freq. Percentage Cum.Percent.
c 2294652 2294652 33.689225770 33.68923
f 1605467 3900119 23.570868362 57.26009
g 896893 4797012 13.167848880 70.42794
i 464220 5261232 6.815505091 77.24345
b 365399 5626631 5.364651985 82.60810
j 332239 5958870 4.877809219 87.48591
h 215313 6174183 3.161145249 90.64705
l 129871 6304054 1.906717637 92.55377
e 107001 6411055 1.570948818 94.12472
k 104954 6516009 1.540895526 95.66562
d 103648 6619657 1.521721321 97.18734
m 56172 6675829 0.824696377 98.01203
o 51093 6726922 0.750128391 98.76216
n 49320 6776242 0.724097865 99.48626
p 32321 6808563 0.474524881 99.96079
q 1334 6809897 0.019585291 99.98037
r 620 6810517 0.009102609 99.98947
s 247 6810764 0.003626362 99.99310
u 182 6810946 0.002672056 99.99577
t 162 6811108 0.002378424 99.99815
z 126 6811234 0.001849885 100.00000
You can write a wrapper function that will convert the relevant parts of the hist output into a data.frame:
myfun <- function(x) {
h <- hist(x, plot = FALSE)
data.frame(Frequency = h$counts,
Cum.Freq = cumsum(h$counts),
Percentage = h$density/sum(h$density),
Cum.Percent = cumsum(h$density)/sum(h$density))
}
Here's an example on the built-in iris dataset:
myfun(iris$Sepal.Width)
# Frequency Cum.Freq Percentage Cum.Percent
# 1 4 4 0.026666667 0.02666667
# 2 7 11 0.046666667 0.07333333
# 3 13 24 0.086666667 0.16000000
# 4 23 47 0.153333333 0.31333333
# 5 36 83 0.240000000 0.55333333
# 6 24 107 0.160000000 0.71333333
# 7 18 125 0.120000000 0.83333333
# 8 10 135 0.066666667 0.90000000
# 9 9 144 0.060000000 0.96000000
# 10 3 147 0.020000000 0.98000000
# 11 2 149 0.013333333 0.99333333
# 12 1 150 0.006666667 1.00000000

Store attribute error value in R for confusion matrix?

I am very new to R and I wanted to know how can I store the classification error value which results from confusion matrix:
Example:
confusion(predict(irisfit, iris), iris$Species)
## Setosa Versicolor Virginica
## Setosa 50 0 0
## Versicolor 0 48 1
## Virginica 0 2 49
## attr(, "error"):
## [1] 0.02
I want to fetch the classification error value 0.02 and store it somewhere. How can I do that!?
Assuming that your code works. You should be able to do the following
myconf<-confusion(predict(irisfit, iris), iris$Species)
myerr<-attr(myconf, "error")
which will put the value 0.02 in the variable myerr.

Why is using update on a lm inside a grouped data.table losing its model data?

Ok, this is a weird one. I suspect this is a bug inside data.table, but it would be useful if anyone can explain why this is happening - what is update doing exactly?
I'm using the list(list()) trick inside data.table to store fitted models. When you create a sequence of lm objects each for different groupings, and then update those models, the model data for all models becomes that of the last grouping. This seems like a reference is hanging around somewhere where a copy should have been made, but I can't find where and I can't reproduce this outside of lm and update.
Concrete example:
Starting with the iris data, first make the three species different sample sizes, then fit an lm model to each species, the update those models:
set.seed(3)
DT = data.table(iris)
DT = DT[rnorm(150) < 0.9]
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
by = Species]
fit2 = fit[, list(list(update(V1[[1]], ~.-Sepal.Length))), by = Species]
The original data table has different numbers of each species
DT[,.N, by = Species]
# Species N
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
And the first fit confirms thsi:
fit[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
But the updated second fit is showing 42 for all models
fit2[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 42
# 2: versicolor 42
# 3: virginica 42
We can also look at the model attribute which contains the data used for fitting, and see that all the model are indeed using the final groups data. The question is how has this happened?
head(fit$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 5.1 3.5 1.4
# 2 4.9 3.0 1.4
# 3 4.7 3.2 1.3
# 4 4.6 3.1 1.5
# 5 5.0 3.6 1.4
# 6 5.4 3.9 1.7
head(fit$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
This is not an answer, but is too long for a comment
The .Environment for the terms component is identical for each resulting model
e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE
It appears that data.table is using the same bit of memory (my non-technical term) for
each evaluation of j by group (which is efficient). However when update is called, it is using this to refit the model. This will contain the values from the last group.
So, if you fudge this, it will work
fit = DT[, { xx <-list2env(copy(.SD))
mymodel <-lm(Sepal.Length ~ Sepal.Width + Petal.Length)
attr(mymodel$terms, '.Environment') <- xx
list(list(mymodel))}, by= 'Species']
lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[,lapply(V1,nobs)]
V1 V2 V3
1: 41 39 42
# using your exact diagnostic coding.
lfit2[,nobs(V1[[1]]),by = Species]
Species V1
1: setosa 41
2: versicolor 39
3: virginica 42
not a long term solution, but at least a workaround.

Resources