R extract terminal node info from partykit decision tree

I have created a constparty decision tree (customized split rules) and print out the tree result. The result looks like this:
Fitted party:
[1] root
| [2] value.a < 1651: 0.067 (n = 1419, err = 88.6)
| [3] value.a >= 1651: 0.571 (n = 7, err = 1.7)
I am trying to extract terminal node info
(the yval: 0.067 and 0.571;
the n on each node: 1419 and 7;
and err: 88.6 and 1.7) and put them into a list while having the corresponding node id (node ID 2 and 3) so that I can utilize those info later.
I have been looking into the partykit functions for a while and could not find one that extracts the information I just listed.
Could someone help me please? Thank you!

As usual there are several approaches to obtain the information you are looking for. The technical way for extracting the info stored in a particular node is to use nodeapply(object, ids, info_node) where info_node returns a list of information stored in the respective node.
However, in the terminal nodes of constparty objects there is nothing stored. Instead, the whole distribution of the response by fitted node is stored and can be extracted by fitted(object). This returns a data frame with the observed (response), the (fitted) node, and the observation (weights) (if any). You can then use tapply() or aggregate() or something similar to compute node-wise means etc.
Alternatively, you can convert the constparty object to a simpleparty object which stores the printed information in the nodes and extract it.
A worked example for both strategies is a simple regression tree for the cars data:
library("partykit")
data("cars", package = "datasets")
ct <- ctree(dist ~ speed, data = cars)
Then you can easily compute node-wise means by
with(fitted(ct), tapply(`(response)`, `(fitted)`, mean))
## 3 4 5
## 18.20000 39.75000 65.26316
Of course, you can replace mean by any other summary statistic you are interested in.
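To collect the three quantities asked for in the question (prediction, n and err) together with the node IDs, the same fitted() data frame can be summarised in one go. This is only a sketch: it assumes an unweighted regression tree and takes err to be the within-node sum of squared deviations from the node mean. The names fit, resp, node and node_info are made up for illustration.

```r
library("partykit")
data("cars", package = "datasets")
ct <- ctree(dist ~ speed, data = cars)

# fitted() returns one row per observation with its terminal node id
fit  <- fitted(ct)
resp <- fit[["(response)"]]
node <- fit[["(fitted)"]]

# Node-wise summaries; tapply() orders its results by node id
means <- tapply(resp, node, mean)
node_info <- data.frame(
  node = as.integer(names(means)),
  yval = as.vector(means),
  n    = as.vector(tapply(resp, node, length)),
  # assumption: err is the sum of squared deviations from the node mean
  err  = as.vector(tapply(resp, node, function(y) sum((y - mean(y))^2)))
)
node_info
```

A single row can then be picked by node ID, for example node_info[node_info$node == 3, ].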
The information stored in the terminal nodes of the simpleparty can be obtained with nodeapply():
nodeapply(as.simpleparty(ct), ids = nodeids(ct, terminal = TRUE), info_node)
## $`3`
## $`3`$prediction
## [1] 18.2
##
## $`3`$n
## n
## 15
##
## $`3`$error
## [1] 1176.4
##
## $`3`$distribution
## NULL
##
## $`3`$p.value
## NULL
##
##
## $`4`
## $`4`$prediction
## [1] 39.75
## ...


Optimization function gives incorrect results for 2 similar data sets

I have 2 datasets that are not very different from each other. Each dataset has 27 rows of actual and forecast values. When tested against Solver in Excel for minimization of the absolute error, abs(actual - par * forecast), they both give nearly equal values for the parameter 'par'. However, when each of these data sets is passed to the same optimization function that I have written, it only works for one of them. For the other data set, the objective always evaluates to zero (0), with 'par' assigned the upper bound value.
This is definitely incorrect. What I am not able to understand is why R is doing so.
Here are the 2 data sets :-
test
dateperiod,usage,fittedlevelusage
2019-04-13,16187.24,17257.02
2019-04-14,16410.18,17347.49
2019-04-15,18453.52,17246.88
2019-04-16,18113.1,17929.24
2019-04-17,17712.54,17476.67
2019-04-18,15098.13,17266.89
2019-04-19,13026.76,15298.11
2019-04-20,13689.49,13728.9
2019-04-21,11907.81,14122.88
2019-04-22,13078.29,13291.25
2019-04-23,15823.23,14465.34
2019-04-24,14602.43,15690.12
2019-04-25,12628.7,13806.44
2019-04-26,15064.37,12247.59
2019-04-27,17163.32,16335.43
2019-04-28,17277.18,16967.72
2019-04-29,20093.13,17418.99
2019-04-30,18820.68,18978.9
2019-05-01,18799.63,17610.66
2019-05-02,17783.24,17000.12
2019-05-03,17965.56,17818.84
2019-05-04,16891.25,18002.03
2019-05-05,18665.49,18298.02
2019-05-06,21043.86,19157.41
2019-05-07,22188.93,21092.36
2019-05-08,22358.08,21232.56
2019-05-09,22797.46,22229.69
Optimization result from R
$minimum
[1] 1.018188
$objective
[1] 28031.49
test1
dateperiod,Usage,fittedlevelusage
2019-04-13,16187.24,17248.29
2019-04-14,16410.18,17337.86
2019-04-15,18453.52,17196.25
2019-04-16,18113.10,17896.74
2019-04-17,17712.54,17464.45
2019-04-18,15098.13,17285.82
2019-04-19,13026.76,15277.10
2019-04-20,13689.49,13733.90
2019-04-21,11907.81,14152.27
2019-04-22,13078.29,13337.53
2019-04-23,15823.23,14512.41
2019-04-24,14602.43,15688.68
2019-04-25,12628.70,13808.58
2019-04-26,15064.37,12244.91
2019-04-27,17163.32,16304.28
2019-04-28,17277.18,16956.91
2019-04-29,20093.13,17441.80
2019-04-30,18820.68,18928.29
2019-05-01,18794.10,17573.40
2019-05-02,17779.00,16969.20
2019-05-03,17960.16,17764.47
2019-05-04,16884.77,17952.23
2019-05-05,18658.16,18313.66
2019-05-06,21036.49,19149.12
2019-05-07,22182.11,21103.37
2019-05-08,22335.57,21196.23
2019-05-09,22797.46,22180.51
Optimization result from R
$minimum
[1] 1.499934
$objective
[1] 0
The optimization function used is shown below :-
optfn <- function(x)
{act <- x$usage
fcst <- x$fittedlevelusage
fn <- function(par)
{sum(abs(act - (fcst * par)))
}
adjfac <- optimize(fn, c(0.5, 1.5))
return(adjfac)
}
adjfacresults <- optfn(test)
adjfacresults <- optfn(test1)
Optimization result from R
adjfacresults <- optfn(test)
$minimum
[1] 1.018188
$objective
[1] 28031.49
Optimization result from R
adjfacresults <- optfn(test1)
$minimum
[1] 1.499934
$objective
[1] 0
Can anyone help identify why R is not handling the 2 data sets the same way and returning the correct result in both cases?
The corresponding results using Excel Solver for the 2 datasets are as follows :-
For 'test' data set
par value = 1.018236659
objective function value (min) : 28031
For 'test1' data set
par value = 1.01881062927878
objective function value (min) : 28010
Best regards
Deepak
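That's because the second column of test1 is named Usage, not usage, and column names in R are case-sensitive. Therefore, act <- x$usage is NULL, and the function fn returns sum(abs(NULL - something)) = sum(numeric(0)) = 0 for every value of par, so optimize() simply ends up near the upper bound. You have to rename this column to usage.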
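The behaviour is easy to reproduce on a small example. The sketch below uses made-up toy data rather than the 27-row sets from the question, and optfn_checked is a hypothetical variant of optfn that stops early when an expected column is missing instead of silently optimising over NULL.

```r
# Toy data (made up for illustration); note the capital "U" in the first column.
test1 <- data.frame(Usage = c(10, 20, 30), fittedlevelusage = c(9, 21, 33))

# `$` returns NULL for a name that does not exist, without any warning.
is.null(test1$usage)

# Arithmetic on NULL yields a zero-length vector, whose sum is 0.
sum(abs(NULL - test1$fittedlevelusage * 1.5))

# Variant of optfn() that checks the column names before optimising.
optfn_checked <- function(x) {
  needed <- c("usage", "fittedlevelusage")
  missing_cols <- setdiff(needed, names(x))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  act  <- x[["usage"]]
  fcst <- x[["fittedlevelusage"]]
  fn <- function(par) sum(abs(act - fcst * par))
  optimize(fn, c(0.5, 1.5))
}

# Rename the column, then optimise as before.
names(test1)[names(test1) == "Usage"] <- "usage"
res <- optfn_checked(test1)
res
```

With the check in place, passing a data frame that still has the Usage column raises an error naming the missing column, which is easier to diagnose than an objective of 0.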

What is the difference between data and data.frame in R?

I know data.frame is a 2-D matrix with columns with different types. I think data is another type of data structure in R, which can take multiple data.frames.
In RStudio, now I have two data: dcd and pdb:
I was trying to understand the properties of them:
> dcd
Total Frames#: 101
Total XYZs#: 19851, (Atoms#: 6617)
[1] 65.59 84.65 90.92 <...> 59.76 55.48 83.68 [2004951]
+ attr: Matrix DIM = 101 x 19851
> class(dcd)
[1] "xyz" "matrix"
> dcd$xyz
Error in dcd$xyz : $ operator is invalid for atomic vectors
> pdb
Call: read.pdb(file = pdbfile)
Total Models#: 1
Total Atoms#: 6598, XYZs#: 19794 Chains#: 2 (values: L H)
Protein Atoms#: 6598 (residues/Calpha atoms#: 442)
Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0)
Non-protein/nucleic Atoms#: 0 (residues: 0)
Non-protein/nucleic resid values: [ none ]
Protein sequence:
DIQMTQSPSSLSASVGDRVTITCKASQNVRTVVAWYQQKPGKAPKTLIYLASNRHTGVPS
RFSGSGSGTDFTLTISSLQPEDFATYFCLQHWSYPLTFGQGTKVEIKRTVAAPSVFIFPP
SDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLT
LSKADYEKHKVYACEVTHQGLSSPVTKSFNRGECEVQLVESGGGL...<cut>...TSAA
+ attr: atom, xyz, calpha, call
> class(pdb)
[1] "pdb" "sse"
> pdb$xyz
Total Frames#: 1
Total XYZs#: 19794, (Atoms#: 6598)
[1] 24.33 14.711 -3.854 <...> -34.374 -6.315 14.986 [19794]
+ attr: Matrix DIM = 1 x 19794
My questions are:
Is dcd similar to a matrix with 101 rows and 19851 columns?
class(dcd) outputs "xyz" and "matrix", does it mean the dcd belongs to both "xyz" and "matrix" types in the same time?
How can I create a data like pdb which includes multiple data.frame?
e.g. if I have
students <- data.frame(c("Cedric","Fred","George"),c(3,2,2))
names(students) <- c("name", "year")
teachers <- data.frame(c("John","Alice","Mike"),c(6,9,5))
names(teachers) <- c("name", "year")
how can I combine students and teachers into a data called people, so that I can use people$students or people$teachers?
If you're asking how to create a dataframe named people, so you can access the names of the people using people$students or people$teachers, then the code to achieve that is:
people <- data.frame(students = students$name, teachers = teachers$name)
people$students
people would be a data frame with two columns: students (Cedric, Fred, George) and teachers (John, Alice, Mike).
If you want a list, you can create a list object like the following:
people2 <- as.list(c("students" = students, "teachers" = teachers))
people2$students.name
# returns [1] Cedric Fred George
And people2 would be a list with four elements: students.name, students.year, teachers.name and teachers.year.
Each element can be accessed with $ (dollar sign) followed by its name. If you wanted teachers.name, then print(people2$teachers.name) will do that for you.
As for your other questions:
Is dcd similar to a matrix with 101 rows and 19851 columns?
You can verify the dimension of a matrix-like object using dim(), ncol() or nrow(). In your case yes it has 101 rows and 19851 columns.
class(dcd) outputs "xyz" and "matrix", does it mean the dcd belongs to both "xyz" and "matrix" types in the same time?
Simplistically, you can think of it inheriting a matrix class as well as xyz. You may want to read about classes and inheritance in R.
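A small sketch of what a two-element class vector means in practice. The class name "xyz" is reused here purely for illustration, and the object m is made up; this is not how the package that produced dcd builds its objects.

```r
# A plain matrix, given an extra S3 class in front of "matrix"
m <- matrix(1:6, nrow = 2)
class(m) <- c("xyz", "matrix")

# The object counts as both: method dispatch tries "xyz" first, then "matrix"
inherits(m, "xyz")
inherits(m, "matrix")

# It still has matrix dimensions, but `$` does not work on it
dim(m)
is.list(m)
```

Because m is an atomic vector with a dim attribute rather than a list, m$something fails in the same way dcd$xyz does in the question.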
How can I create a data like pdb which includes multiple data.frame?
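Look at my code above. Note that people2 <- as.list(c("students" = students, "teachers" = teachers)) flattens the two data frames into one list of their columns. To keep each data frame intact, so that people$students and people$teachers work, use list(students = students, teachers = teachers) instead.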
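For comparison, a minimal sketch of both constructions using the students and teachers data frames from the question. The object names people and flat are illustrative.

```r
students <- data.frame(name = c("Cedric", "Fred", "George"), year = c(3, 2, 2))
teachers <- data.frame(name = c("John", "Alice", "Mike"), year = c(6, 9, 5))

# list() keeps each data frame as one element
people <- list(students = students, teachers = teachers)
people$students
people$teachers$name

# c() concatenates the columns of both data frames into one flat list
flat <- c(students = students, teachers = teachers)
names(flat)
```

The first form is the one that matches people$students and people$teachers as asked; the second gives four separate column vectors.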

ctree() - How to get the list of splitting conditions for each terminal node when the response variable is Categorical variable [duplicate]

I am trying to extract the tree information from the output of ctree. I tried the Class "BinaryTree" info but with no success. Any input is appreciated.
Thank You
The ctree objects are S4 objects, at least at the top level, and the tree information is in the "tree" slot. The "tree" slot can be accessed with the @ operator. If you take the first example in the help(ctree) page you can get a graphical display with:
plot(airct)
You can then look at the branches of the tree by traversing it with list operations. The "leaves" of the tree are the descendant nodes with "terminal"==TRUE:
> airct@tree$right$terminal
[1] FALSE
> airct@tree$left$terminal
[1] FALSE
> airct@tree$right$right$terminal
[1] TRUE
> airct@tree$right$left$terminal
[1] TRUE
> airct@tree$left$left$terminal
[1] TRUE
> airct@tree$left$right$terminal
[1] FALSE
Information at nodes above the leaves can also be recovered:
> airct@tree$left$right
4) Temp <= 77; criterion = 0.997, statistic = 11.599
5)* weights = 48
4) Temp > 77
6)* weights = 21
This is the same information that the nodes function will recover if you know the number of the node:
> nodes(airct,4)
[[1]]
4) Temp <= 77; criterion = 0.997, statistic = 11.599
5)* weights = 48
4) Temp > 77
6)* weights = 21
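Walking the branches by hand gets tedious for larger trees, so the traversal can be written recursively. The sketch below runs on a hand-built nested list (mock_tree) that only mimics the left/right/terminal layout shown above; the nodeID field and the helper terminal_ids are assumptions for illustration, so check the element names of your own fitted object before relying on them.

```r
# Hand-built stand-in for the nested list in the "tree" slot (not a real ctree fit)
mock_tree <- list(
  nodeID = 1L, terminal = FALSE,
  left = list(nodeID = 2L, terminal = TRUE, left = NULL, right = NULL),
  right = list(
    nodeID = 3L, terminal = FALSE,
    left  = list(nodeID = 4L, terminal = TRUE, left = NULL, right = NULL),
    right = list(nodeID = 5L, terminal = TRUE, left = NULL, right = NULL)
  )
)

# Collect the ids of all terminal nodes by recursing into left and right
terminal_ids <- function(node) {
  if (isTRUE(node$terminal)) {
    return(node$nodeID)
  }
  c(terminal_ids(node$left), terminal_ids(node$right))
}

terminal_ids(mock_tree)
```

If the element names match, the same function could be applied to the real tree slot, and the resulting ids passed to nodes() to print each leaf.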
The mlmeta R package converts fitted ctree models to SAS code. It can be easily adapted to other languages and is generally instructive on the internals of the object.
Let's say your ctree model is named ct. Then
print(ct)
worked for me to see the tree structure.

Error when using cv.tree

Hi I tried using the function cv.tree from the package tree. I have a binary categorical response (called Label) and 30 predictors. I fit a tree object using all predictors.
I got the following error message that I don't understand:
Error in as.data.frame.default(data, optional = TRUE) :
cannot coerce class "function" to a data.frame
The data is the file 'training' taken from this site.
This is what I did:
x <- read.csv("training.csv")
attach(x)
library(tree)
Tree <- tree(Label~., x, subset=sample(1:nrow(x), nrow(x)/2))
CV <- cv.tree(Tree,FUN=prune.misclass)
The error occurs once cv.tree calls model.frame. The 'call' element of the tree object must contain a reference to a data frame whose name is also not the name of a loaded function.
Thus, not only will subsetting in the call to tree generate the error when cv.tree later uses the 'call' element of the tree object, but using a data frame with a name like "df" would give an error as well, because model.frame will take this to be the name of an existing function (df() is the density of the F distribution in the stats package).
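The name clash can be seen without the tree package at all. The sketch below only illustrates the mechanism and does not call cv.tree: when no data frame called df has been created, the name df resolves to the stats function, and coercing a function to a data frame fails with the same kind of message. The names msg and training_data are made up for illustration.

```r
# No data frame named `df` exists here, so the name resolves to stats::df()
is.function(df)

# Coercing a function to a data frame fails; capture the message instead of stopping
msg <- tryCatch(
  as.data.frame(df),
  error = function(e) conditionMessage(e)
)
msg

# A distinct name that does not shadow a function avoids the ambiguity
training_data <- data.frame(Label = c("s", "b"), value = c(1.2, 3.4))
is.data.frame(training_data)
```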
I think the problem is in the dependent variable list. The following works, but I think you need to read the problem description more carefully. First, setup the formula without weight.
x <- read.csv("training.csv")
vars<-setdiff(names(x),c("EventId","Label","Weight"))
fmla <- paste("Label", "~", vars[1], "+",
paste(vars[-c(1)], collapse=" + "))
Here's what you've been running
Tree <- tree(fmla, x, subset=sample(1:nrow(x), nrow(x)/2))
plot(Tree)
Subsetting the data frame beforehand, instead of through the subset argument of tree(), works:
urows = sample(1:nrow(x), nrow(x)/2)
x_sub <- x[urows,]
Tree <- tree(fmla, x_sub)
plot(Tree)
CV <- cv.tree(Tree,FUN=prune.misclass)
CV
$size
[1] 6 5 4 3 1
$dev
[1] 25859 25859 27510 30075 42725
$k
[1] -Inf 0.0 1929.0 2791.0 6188.5
$method
[1] "misclass"
attr(,"class")
[1] "prune" "tree.sequence"
You may want to consider package rpart also
library(rpart)
tr <- rpart(fmla, data=x_sub, method="class")
printcp(tr)
Classification tree:
rpart(formula = fmla, data = x_sub, method = "class")
Variables actually used in tree construction:
[1] DER_mass_MMC DER_mass_transverse_met_lep
[3] DER_mass_vis
Root node error: 42616/125000 = 0.34093
n= 125000
CP nsplit rel error xerror xstd
1 0.153733 0 1.00000 1.00000 0.0039326
2 0.059274 2 0.69253 0.69479 0.0035273
3 0.020016 3 0.63326 0.63582 0.0034184
4 0.010000 5 0.59323 0.59651 0.0033393
If you include weight, then that is the only split.
vars<-setdiff(names(x),c("EventId","Label"))

Subset by function's variable using $variable

I am having trouble subsetting a list using a variable of my function.
rankhospital <- function(state,outcome,num = "best") {
#code here
e3<-dataframe(...,state.name,...)
if (num=="worst"){ return(worst(state,outcome))
}else if((num%in%b=="TRUE" & outcome=="heart attack")=="TRUE"){
sep<-split(e3,e3$state.name)
hosp.estado<-sep$state
hospital<-hosp.estado[num,1]
return(as.character(hospital))
I split my data frame by state (which is a variable of my function)
But hosp.estado<-sep$state doesn't work. I have also tried as.data.frame.
The function (rankhospital("NY"....) returns me a character(0).
When I feed the sep$state with sep$"NY" directly in code it works perfectly so I guess the problem is I can't use a function's variable to do this. Am I right? What could I use instead?
Thank you!!
If state is a variable in your function, you can refer to the element or column whose name is stored in state using sep[state] or sep[[state]]. For a data frame such as df below, the first produces a data frame with one column named after the value of state. The second produces an unnamed vector.
df=data.frame(NY=rnorm(10),CA=rnorm(10), IL=rnorm(10))
state="NY"
df[state]
# NY
# 1 -0.79533912
# 2 -0.05487747
# 3 0.25014132
# 4 0.61824329
# 5 -0.17262350
# 6 -2.22390027
# 7 -1.26361438
# 8 0.35872890
# 9 -0.01104548
# 10 -0.94064916
df[[state]]
# [1] -0.79533912 -0.05487747 0.25014132 0.61824329 -0.17262350 -2.22390027 -1.26361438 0.35872890 -0.01104548 -0.94064916
class(df[state])
# [1] "data.frame"
class(df[[state]])
# [1] "numeric"
It seems like you are trying to get the top hospital in a state. You don't want to split here (see the result of sep to see what I mean). Instead, use:
as.character(e3[e3$state.name==state, 1][num])
This hopefully does what you want.
You need sep[[state]] instead of sep$state to get the data frame out of your sep list, which matches the state parameter of your function. Like this:
e3 <- read.csv("https://raw.github.com/Hindol/data-analysis-coursera/master/HW3/hospital-data.csv")
state <- "WY"
num <- 1:5
sep<-split(e3,e3$State)
hosp.estado<-sep[[state]]
hospital<-hosp.estado[num,1]
as.character(hospital)
# [1] "530002" "530006" "530008" "530010" "530011"
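The difference between the two forms can be checked on a tiny made-up data frame (the hospital names and states below are invented for illustration). The $ operator takes the name after it literally, so sep$state looks for an element called "state", whereas [[ ]] evaluates the variable first.

```r
# Made-up data standing in for the hospital table
e3 <- data.frame(
  hospital   = c("A", "B", "C", "D"),
  state.name = c("NY", "NY", "CA", "CA")
)
sep <- split(e3, e3$state.name)
state <- "NY"

# `$` does not evaluate `state`; there is no element named "state", so this is NULL
is.null(sep$state)

# Indexing NULL and converting gives character(0), as seen in the question
as.character(sep$state[1, 1])

# `[[` uses the value stored in `state` and returns the NY data frame
hosp.estado <- sep[[state]]
as.character(hosp.estado[1, 1])
```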
