Plot ctree using rpart.plot functionality - r

Been trying to use the rpart.plot package to plot a ctree from the partykit library. The reason for this being that the default plot method is terrible when the tree is deep. In my case, my max_depth = 5.
I really enjoy rpart.plot's output as it allows for deep trees to visually display better. How the output looks for a simple example:
rpart
library(partykit)
library(rpart)
library(rpart.plot)
df_test <- cu.summary[complete.cases(cu.summary),]
multi.class.model <- rpart(Reliability~., data = df_test)
rpart.plot(multi.class.model)
I would like to get this output from the partykit model using ctree
ctree
multi.class.model <- ctree(Reliability~., data = df_test)
rpart.plot(multi.class.model)
>Error: the object passed to prp is not an rpart object
Is there some way one could coerce the ctree object to rpart so this would run?

To the best of my knowledge all the other packages for visualizing rpart trees are really rpart-specific and not based on the agnostic party class for representing trees/recursive partitions. Also, we haven't tried to implement an as.rpart() method for party objects because the rpart class is really not well-suited for this.
But you can try to tweak the partykit visualizations which are customizable through panel functions for almost all aspects of the tree. One thing that might be helpful is to compute a simpleparty object which has all sorts of simple summary information in the $info of each node. This can then be used in the node_terminal() panel function for printing information in the tree display. Consider the following simple example for predicting one of three school types in the German Socio-Economic Panel. To achieve the desired depth I switch significance testing essentiall off:
library("partykit")
data("GSOEP9402", package = "AER")
ct <- ctree(school ~ ., data = GSOEP9402, maxdepth = 5, alpha = 0.5)
The default plot(ct) on a sufficiently big device gives you:
When turning the tree into a simpleparty you get a textual summary by default:
st <- as.simpleparty(ct)
plot(st)
This has still overlapping labels so we could set up a small convenience function that extracts the interesting bits from the $info of each node and puts them into a longer character vector with less wide entries:
myfun <- function(i) c(
as.character(i$prediction),
paste("n =", i$n),
format(round(i$distribution/i$n, digits = 3), nsmall = 3)
)
plot(st, tp_args = list(FUN = myfun), ep_args = list(justmin = 20))
In addition to the arguments of the terminal panel function (tp_args) I have tweaked the arguments of the edge panel function (ep_args) to avoid some of the overplotting in the edges.
Of course, you could also change the entire panel function and roll your own...

Related

randomForest in R is including the class label as a feature prevents classifier from predicting on new dataset

So I have two datasets, og.data and newdata.df. I have matched their features and I want to use a feature from og.data to train a model so I can identify cases of this class in newdata.df. I am using the randomForest package in R documentation for it is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
split <- sample.split(og.data$class_label, SplitRatio = 0.7)
training_set = subset(og.data$class_label, split == TRUE)
test_set = subset(og.data$class_label, split == FALSE)
rf.classifier.object = randomForest(x = training_set[-1],
y = training_set$Engramcell,
ntree = 500)
I then use the test set to calculate the AUC, visualize ROC, precision, recall etc etc.
I do that using prediction probability generated like so...
predictions.df <- as.data.frame(predict(rf.classifier.object,
test_set,
type = "prob")
)
All is good I proceed to try to use the classifier I've trained on new data and now I am encountering a problem because the new data does not contain the feature class label. Whihc is annoying as the entire purpose of training the classifier to to label this newdata.
predictions.df <- as.data.frame(predict(rf.classifier.object,
newdata.df,
type = "prob")
)
Please note the error has different variable names simply because I changed the code to make it more general for readability.
Error in predict.randomForest(rf.classifier.object, newdata.df, :
variables in the training data missing in newdata
As per this stack post predict.randomForest(), called here as predict(), uses rownames of feature importance to make its precitions. And when I checked with a search of the feature names I find that it is infact the class label preventing me from making the test as I show bellow.
# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata) )]
# [1] "class_label"
It is not clear to me what I should change in my script so that the classifier can be used on other data than the testing set. I have followed the instructions exactly this seems like a bad design choice to have made the function this way. The class label should not be used for calculating feature importance at all and should not even be considered a feature imo.

Converting a R2jags object into a Stanreg (rstanarm) object

I made a model using R2jags. I like the jags syntax but I find the output produced by R2jags not easy to use. I recently read about the rstanarm package. It has many useful functions and is well supported by the tidybayes and bayesplot packages for easy model diagnostics and visualisation. However, I'm not a fan of the syntax used to write a model in rstanarm. Ideally, I would like to get the best of the two worlds, that is writing the model in R2jags and convert the output into a Stanreg object to use rstanarm functions.
Is that possible? If so, how?
I think then question isn't necessarily whether or not it's possible - I suspect it probably is. The question really is how much time you're prepared to spend doing it. All you'd have to do is try to replicate in structure the object that gets created by rstanarm, to the extent that it's possible with the R2jags output. That would make it so that some post-processing tasks would probably work.
If I might be so bold, I suspect a better use of your time would be to turn the R2jags object into something that could be used with the post-processing functions you want to use. For example, it only takes a small modification to the JAGS output to make all of the mcmc_*() plotting functions from bayesplot work. Here's an example. Below is the example model from the jags() function help.
# An example model file is given in:
model.file <- system.file(package="R2jags", "model", "schools.txt")
# data
J <- 8.0
y <- c(28.4,7.9,-2.8,6.8,-0.6,0.6,18.0,12.2)
sd <- c(14.9,10.2,16.3,11.0,9.4,11.4,10.4,17.6)
jags.data <- list("y","sd","J")
jags.params <- c("mu","sigma","theta")
jags.inits <- function(){
list("mu"=rnorm(1),"sigma"=runif(1),"theta"=rnorm(J))
}
jagsfit <- jags(data=jags.data, inits=jags.inits, jags.params,
n.iter=5000, model.file=model.file, n.chains = 2)
Now, what the mcmc_*() plotting functions from bayesplot expect is a list of matrices of MCMC draws where the column names give the name of the parameter. By default, jags() puts all of them into a single matrix. In the above case, there are 5000 iterations in total, with 2500 as burnin (leaving 2500 sampled) and the n.thin is set to 2 in this case (jags() has an algorithm for identifying the thinning parameter), but in any case, the jagsfit$BUGSoutput$n.keep element identifies how many iterations are kept. In this case, it's 1250. So you could use that to make a list of two matrices from the output.
jflist <- list(jagsfit$BUGSoutput$sims.matrix[1:jagsfit$BUGSoutput$n.keep, ],
jagsfit$BUGSoutput$sims.matrix[(jagsfit$BUGSoutput$n.keep+1):(2*jagsfit$BUGSoutput$n.keep), ])
Now, you'd just have to call some of the plotting functions:
mcmc_trace(jflist, regex_pars="theta")
or
mcmc_areas(jflist, regex_pars="theta")
So, instead of trying to replicate all of the output that rstanarm produces, it might be a better use of your time to try to bend the jags output into a format that would be amenable to the post-processing functions you want to use.
EDIT - added possibility for pp_check() from bayesplot.
The posterior draws of y in this case are in the theta parameters. So, we make an object that has elements y and yrep and make it of class foo
x <- list(y = y, yrep = jagsfit$BUGSoutput$sims.list$theta)
class(x) <- "foo"
We can then write a pp_check method for objects of class foo. This come straight out of the help file for bayesplot::pp_check().
pp_check.foo <- function(object, ..., type = c("multiple", "overlaid")) {
y <- object[["y"]]
yrep <- object[["yrep"]]
switch(match.arg(type),
multiple = ppc_hist(y, yrep[1:min(8, nrow(yrep)),, drop = FALSE]),
overlaid = ppc_dens_overlay(y, yrep[1:min(8, nrow(yrep)),, drop = FALSE]))
}
Then, just call the function:
pp_check(x, type="overlaid")

Specifying custom weights for the nonparametric estimate of spatially-varying relative risk in spatstat

Is there a way to specify weights in relrisk.ppp function in spatstat (version 1.63-3)?
The relrisk.ppp function calls the density.ppp function, which does allow users to specify their own weights.
For example, let us build upon the provided spatstat.data::urkiola data where, instead of individual trees, the locations are tree stands and we have a second numeric mark for the frequency of trees at each point-location:
urkiola_new <- spatstat.data::urkiola
urkiola_new$marks <- data.frame("type" = urkiola_new$marks, "freq" = rpois(urkiola_new$n, 3))
f1 <- spatstat::relrisk(urkiola_new, weights = urkiola_new$marks$freq)
When using the urkiola_new in a call of relrisk, urkiola_new is caught by stopifnot(is.multitype(X)) in relrisk.ppp. I next tried specifying the weights separately as a vector while using the original urkiola data,
f2 <- spatstat::relrisk(urkiola, weights = urkiola_new$marks$freq)
but was caught by a warning from the pixellate.ppp function within the internal density.ppp function:
Error in pixellate.ppp(x, ..., padzero = TRUE) : length(weights) == npoints(x) || length(weights) == 1 is not TRUE
The same error occurs when I convert the weights into a list
urkiola_weights <- split(urkiola_new$marks$freq, urkiola_new$marks$type)
f3 <- spatstat::relrisk(urkiola, weights = urkiola_weights)
I suspect there is a way to specify the weights cleverly, but it yet escapes me. Any suggestions or guidance would be helpful, thank you!
The function relrisk.ppp is not currently designed to handle weights. The help entry for relrisk.ppp does not mention weights.
The example above does not work because relrisk.ppp applies density.ppp separately to the sub-patterns of points of each type, and the extra argument weights is the wrong length for these sub-patterns.
I will take this question as a feature request, to add this capability to relrisk.ppp. It should be done soon.
Update: this is now implemented in the development version, spatstat 1.64-0.018 available at the spatstat github repository

R - partykit - custom model tree nodes and splits

In package partykit there is a way to build custom trees by specifying predictor and split. For example:
data("WeatherPlay", package = "partykit")
#create a split
sp_o <- partysplit(3L, breaks = 75)
#create a node
n1 <- partynode(id = 1L, split = sp_o, kids = lapply(2L:3L, partynode))
#and make a "tree" out of it
t2 <- party(
n1,
data = WeatherPlay,
fitted = data.frame(
"(fitted)" = fitted_node(n1, data = WeatherPlay),
"(response)" = WeatherPlay$play,
check.names = FALSE
),
terms = terms(play ~ ., data = WeatherPlay),
)
t2 <- as.constparty(t2)
t2
plot(t2)
Is this possible for model trees (returned by mob())? Can i build the tree node by node, and then fit specified function to terminal nodes?
In principle, it is possible to build a "modelparty" object (as returned by mob()) by hand. However, there is not a simple coercion function like as.constparty() in the example you cite. The reason is that for building the model trees, and for printing and especially predicting with them, much more detailed information about the model is needed.
I would recommend to build the tree structure ("partynode") first, then fill the overall $info slot (with call, formula, Formula, terms, fit, control, dots, and nreg). And then it should be possible to call refit.modelparty() to refit the models in all the terminal nodes. And this can then be used to fill the $node$info (with coefficients, objfun, nobs, ...).
Disclaimer: All of this is completely untested. But instead of mimicking "modelparty" you could, of course, also create your own way of storing the models in the "party" object and just use more basic building blocks provided by partykit.

Multichannel sequence analysis through WeightedCluster package

I would like to apply the functions available in the WeightedCluster package to analyze multichannel sequences I obtained through TraMineR. I am trying so, but due to the fact that multichannel sequences are lists composed by each channel separatedly, I get errors in functions like seqtreedisplay() and all those which require a sequence object.
This is an example:
fullsequences <- list(
work_sequence2 = work_sequence[which(rownames(work_sequence) %in% commonid),],
educ_sequence2 = educ_sequence[which(rownames(educ_sequence) %in% commonid),],
part_sequence2 = part_sequence[which(rownames(part_sequence) %in% commonid),],
kid_sequence2 = kid_sequence[which(rownames(kid_sequence) %in% commonid),]
) # a total of 926 with complete sequences on all channels
multidist <- seqdistmc(
channels = fullsequences,
method = "OM",
norm = FALSE,
sm = list("TRATE","TRATE","TRATE","TRATE"),
with.missing=FALSE,
full.matrix=TRUE,
link="sum")
clusterward <- hclust(as.dist(multidist), method = "ward")
seqtreedisplay(as.seqtree(clusterward, ncluster = 5,
seqdata = fullsequences , diss = multidist))
Error in seqlegend(seqdata, fontsize = legend.fontsize, title = "Legend", :
data is not a sequence object, use seqdef function to create one
Is there a method to use the functionalities of WeightedCluster package upon a multichannel-type object (list of sequences). I am specially interested in using the Partition Around Medioids algorithm with initial ward clusters (function wcKMedioids()). If it is not possible, which is the best alternative to cluster multichannels in R?
Thanks a lot in advance!
The as.seqtree function (from WeightedCluster) requires an object of class stslist (as produced by the TraMineR seqdef function) as seqdata argument. In your case, fullsequences is a list of such objects (the list of parallel sequences), which is NOT itself of class stslist. This causes the error.
Even if you would be able to define a tree of parallel sequences, the problem would be that the seqtreedisplay does not know how to plot parallel sequences. This means that you would have to define a plot function for a list of state sequences and, using the more general disstreedisplay function instead of seqtreedisplay, pass the plot function as imagefunc argument.
To summarize, there are two problems. First you need some as.disstree equivalent of as.seqtree that would work for hierarchical clustering of non-stslist objects. Second, you need a plot function for parallel sequences. The first problem is purely technical and should be easily solved. The second is more conceptual.

Resources