Not creating a tree by rpart in R

I'm new to R and the rpart package, and I want to create a tree using the following sample data. My data set is similar to this:
library(rpart)

mydata <- read.csv(text = '
"","A","B","C","status"
"1",TRUE,TRUE,TRUE,"okay"
"2",TRUE,TRUE,FALSE,"okay"
"3",TRUE,FALSE,TRUE,"okay"
"4",TRUE,FALSE,FALSE,"notokay"
"5",FALSE,TRUE,TRUE,"notokay"
"6",FALSE,TRUE,FALSE,"notokay"
"7",FALSE,FALSE,TRUE,"okay"
"8",FALSE,FALSE,FALSE,"okay"
', row.names = 1, stringsAsFactors = TRUE)

fit <- rpart(status ~ A + B + C, data = mydata, method = "class")
I have tried different formulas and different methods, but only the root node is ever produced, so no plot is possible.
Printing the fit shows only the root node:

n= 8

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 8 3 okay (0.3750000 0.6250000) *
How can I create the tree? I need to show the percentage of "okay" and "notokay" in each node, and I need to specify one of A, B, or C for splitting and show the statistics.

With the default settings of rpart() no splits are considered at all. The minsplit parameter is 20 by default (see ?rpart.control) which is "the minimum number of observations that must exist in a node in order for a split to be attempted." So for your 8 observations no splitting is even considered.
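You can verify this default directly in R (my addition, just to illustrate the point):

rpart.control()$minsplit
## [1] 20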
If you are determined to consider splitting, then you could decrease the minbucket and/or minsplit parameters. For example
fit <- rpart(status ~ A + B + C, data = mydata,
             control = rpart.control(minsplit = 3))
produces a tree with three terminal nodes. The display can be created by

plot(partykit::as.party(fit), tp_args = list(beside = TRUE))

and the print output from rpart is:
n= 8

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 8 3 okay (0.3750000 0.6250000)
  2) A=FALSE 4 2 notokay (0.5000000 0.5000000)
    4) B=TRUE 2 0 notokay (1.0000000 0.0000000) *
    5) B=FALSE 2 0 okay (0.0000000 1.0000000) *
  3) A=TRUE 4 1 okay (0.2500000 0.7500000) *
Whether or not this is particularly useful is a different question though...
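If you also want the class percentages printed directly in each node of an rpart-style plot (rather than the partykit display above), one option (my suggestion, assuming the rpart.plot package is installed) is:

library(rpart.plot)
rpart.plot(fit, extra = 104)  # per-class probabilities plus the percentage of observations in each node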

Related

Logistic Regression in R: glm() vs rxGlm()

I fit a lot of GLMs in R. Usually I use revoScaleR::rxGlm() for this because I work with large data sets and use quite complex model formulae - and glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes

df_reprex$p <- df_reprex$y / df_reprex$x    # success rate

# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number

glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)

exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1]))  # overall fitted average 0.333 - correct

glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")

exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))  # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. Reading the docs for rxLogit(): https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
                 data = df_reprex,
                 pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(By using the revoScaleR package I occasionally paint myself into a corner like this, because not many others seem to use it.)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment as to what the results were? I can then edit/delete this post accordingly.
Two things to try:
1. Expand the data and get rid of the weights statement.
2. Use cbind(y, x - y) ~ 1 in either rxLogit or rxGlm, without weights and without expanding the data.
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to a single 1 or 0 response, and then this expanded data is used in a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
Does rxLogit allow a cbind representation, like glm() does (I put an example as glm_1b)? That would allow the data to stay the same size... From the rxLogit page, I'm guessing not for rxLogit, but rxGlm might allow it, given the following note in the formula page:
A formula typically consists of a response, which in most RevoScaleR functions can be a single variable or multiple variables combined using cbind, the "~" operator, and one or more predictors, typically separated by the "+" operator. The rxSummary function typically requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2),    # number of trials
                        y = c(0, 1, 0, 1),    # number of successes
                        trial = c("first", "second", "third", "fourth"))  # trial label

df_reprex$p <- df_reprex$y / df_reprex$x      # success rate

# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number

glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)

exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1]))  # overall fitted average 0.333 - correct

df_reprex_expanded <- data.frame(y = c(0, 1, 0, 0, 1, 0),
                                 trial = c("first", "second", "third", "third",
                                           "fourth", "fourth"))

## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
              family = binomial,
              data = df_reprex_expanded)

exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1]))  # overall fitted average 0.333 - correct

## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y, x - y) ~ 1,
              family = binomial,
              data = df_reprex)

exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1]))  # overall fitted average 0.333 - correct

glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")

exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))  # overall fitted average 0.167 - incorrect

glm_2a <- rxLogit(y ~ 1,
                  data = df_reprex_expanded)

exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1]))  # overall fitted average ???

# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y, x - y) ~ 1,
                  data = df_reprex)

exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1]))  # overall fitted average ???

# cbind() + rxGlm + family = binomial FTW(?)
glm_2c <- rxGlm(cbind(y, x - y) ~ 1,
                family = binomial,
                data = df_reprex)

exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1]))  # overall fitted average ???
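As an aside (my addition, not part of the original answer): rather than writing df_reprex_expanded by hand, the expansion to one row per Bernoulli outcome can be scripted, which matters if the real data set is large. A minimal sketch, assuming columns x (trials), y (successes) and trial as above:

expand_binomial <- function(d) {
  # one row per success (1) and per failure (0), repeating each trial label x times
  data.frame(trial = rep(d$trial, times = d$x),
             y = unlist(mapply(function(s, n) rep(c(1, 0), times = c(s, n - s)),
                               d$y, d$x, SIMPLIFY = FALSE)))
}

df_reprex_expanded2 <- expand_binomial(df_reprex)  # should match df_reprex_expanded above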

Model-based partitioning with "two-layer interaction" (segmented models)

I am trying to build a model-based tree with a type of "two-layer interaction" where the models in the nodes of the tree are segmented again.
I am using the mob() function for this, but I could not manage to make the fit argument work with the lmtree() function.
In the following example a is a function of b, and the relationship between a and b depends on d and on b | d.
library("partykit")
set.seed(321)
b <- runif(200)
d <- sample(1:2, 200, replace = TRUE)
a <- jitter(ifelse(d == 1, 2 * b - 1, 4 * b - 1.2), amount = .1)
a[b < .5 & d == 1] <- jitter(rep(0, length(a[b < .5 & d == 1])))
a[b < .3 & d == 2] <- jitter(rep(0, length(a[b < .3 & d == 2])))
fit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...,
                estfun = FALSE, object = FALSE)
{
  x <- x[, 2]
  l <- lmtree(y ~ x | b)
  return(l)
}

m <- mob(a ~ b | d, fit = fit) # not working
Of course, with this simple example I could use lmtree(a ~ b | d + b) to find every interaction, but is there a way to use lmtree() as the fit function of mob()?
No but yes ;-)
No, lmtree() cannot be used easily as a fitter for a mob().
The dimension of the inner tree (lmtree()) is not fixed, i.e., you may get a tree without any partition or with many subgroups, and this would be confusing for the outer tree (mob()).
Even if one worked around the dimension issue or fixed it by always forcing one break, one would need more work to set up the right coefficient vector, matrix of estimating functions, etc. This is also not straightforward because the convergence rate (and hence the inference) is different if breakpoints are given (e.g., for a binary factor) or have to be estimated (such as for your numeric variables b).
The way you set up your fit() function, the inner lmtree() does not know where to find b. All it has is a numeric vector y and a numeric matrix x but not the original data.
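For reference (my paraphrase of the documented mob() fitter interface, not code from this answer), a custom fitter is expected to return something along these lines, which is also why a fixed-length coefficient vector is needed:

myfit <- function(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ...,
                  estfun = FALSE, object = FALSE) {
  m <- lm.fit(x, y)                                    # any model with a fixed-length coefficient vector
  list(coefficients = m$coefficients,                  # named parameter estimates
       objfun = sum(m$residuals^2),                    # value of the minimized objective function
       estfun = if (estfun) x * m$residuals else NULL, # n x k matrix of estimating-function contributions
       object = if (object) m else NULL)               # optionally the fitted model object itself
}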
But yes, I think that all of these issues can be addressed by changing the view from fitting a "two-layer" tree to fitting a "segmented" model inside a tree. My impression is that you want to fit a model y ~ x (or a ~ b in your example) where a piecewise linear function is used with an additional breakpoint in x. If the piecewise linear function is supposed to be continuous in x, then the segmented package can be easily used. If not, then strucchange could be leveraged. Assuming you want the former (as this is how you simulated your data), I include a worked segmented example below (and also slightly modified your question to reflect this).
Changing the names and code a little bit, your data d has a segmented piecewise linear relationship of y ~ x with coefficients depending on a group variable g.
set.seed(321)
d <- data.frame(
  x = runif(200),
  g = factor(sample(1:2, 200, replace = TRUE))
)
d$y <- jitter(ifelse(d$g == "1",
                     pmax(0, 2 * d$x - 1),
                     pmax(0, 4 * d$x - 1.2)),
              amount = 0.1)
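A quick way to eyeball the two segmented relationships in this simulated data (just an illustration I added, not part of the original answer):

plot(d$x, d$y, col = as.integer(d$g), pch = 19, xlab = "x", ylab = "y")
legend("topleft", legend = paste("g =", levels(d$g)), col = 1:2, pch = 19)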
Within every node of a tree I can then fit a model segmented(lm(y ~ x)) which comes with suitable extractors for coef(), logLik(), estfun() etc. Thus, the mobster function is simply:
segfit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...)
{
  x <- as.numeric(x[, 2])
  segmented::segmented(lm(y ~ x))
}
(Note: I haven't tried whether segmented() would also support lm() objects with weights and offset.)
With this we can obtain the full tree which simply splits in g in this basic example:
library("partykit")
segtree <- mob(y ~ x | g, data = d, fit = segfit)
plot(segtree, terminal_panel = node_bivplot, tnex = 2)
A hands-on introduction to segmented is available in: Muggeo VMR (2008). "segmented: An R Package to Fit Regression Models with Broken-Line Relationships." R News, 8(1), 20-25. https://CRAN.R-project.org/doc/Rnews/
For the underlying methodological background see: Muggeo VMR (2003). "Estimating Regression Models with Unknown Break-Points." Statistics in Medicine, 22(19), 3055-3071. doi:10.1002/sim.1545

Setting up a statnet model in R

I would like to simulate exponential family random graphs, and I just started learning to use the statnet and ergm R packages. From the tutorial I found online, I am able to learn an ERGM model from an example dataset:
# install.packages('statnet')
# install.packages('ergm')
# install.packages('coda')
library(statnet)
set.seed(123)
data(package='ergm') # tells us the datasets in our packages
data(florentine) # loads flomarriage and flobusiness data
# Triad model
flomodel <- ergm(flomarriage ~ edges + triangle)
summary(flomodel)
Currently, I would like to use the simulate command to simulate networks with a pre-specified number of nodes from a pre-specified formula (that is not learned from any particular dataset), for example, P(y) = 1/Z exp(a * num_edges + b * num_triangles), where a and b are user-specified coefficients.
How should I go about writing such a model in statnet?
You can simulate from a given formula with simulate (or simulate.formula):
simulate(flomarriage ~ edges + triangles, coef = c(3,1))
To fix a simulation to have the same number of edges as the given graph (flomarriage in this case)
simulate(flomarriage ~ edges + triangles, coef = c(3,1), constraints = ~edges)
Not every constraint you might want to apply is available, since each requires a specific MCMC sampler; for a list of what is available see ?ergm.constraints.
To fix the simulation to have an arbitrary number of nodes and edges (not based on observed data), a workaround is to create such a network first. For example, to simulate over networks with 17 nodes and 16 edges:
test.mat = matrix(0, 17, 17)
test.mat[1,] = 1 #adds 16 edges
test.net = as.network(test.mat, directed = F)
test.sim = simulate(test.net ~ triangles, coef = 1, constraints = ~edges)
summary.statistics(test.sim ~ edges() + triangles())
P.S. I don't recommend using the triangles term in ERGM models. The geometrically weighted terms (gwesp, gwdsp) are better substitutes and are more stable.
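For instance, a gwesp-based version of the simulation above might look like the following (my sketch; the coefficients and the decay value 0.25 are arbitrary choices, not values from the original answer):

sim.gw <- simulate(flomarriage ~ edges + gwesp(0.25, fixed = TRUE),
                   coef = c(-1, 0.5))
summary(sim.gw ~ edges + triangle)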

Regression tree with simulated data - rpart package

I have simulated some data to create a regression tree with 3 terminal nodes:
set.seed(1988)
n=1000
X1<-rnorm(n,mean=0,sd=2)
X2<-rnorm(n,mean=0,sd=2)
e<-rnorm(n)
Y=5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
So, I want to split first on X1 < 1, and for X1 < 1 I want to split on X2 < 0.2. The values of Y in the leaves are the coefficients of the indicators.
If I run the procedure implemented in the rpart package, everything is OK in the case above.
mytree<-rpart(Y~.,data=mydat)
mytree
Output:
node), split, n, deviance, yval
      * denotes terminal node

1) root 1000 1627.0670 4.043696
  2) X1>=0.9490461 326 373.8485 3.124825 *
  3) X1< 0.9490461 674 844.8367 4.488135
    6) X2>=0.2488142 327 312.7506 3.970742 *
    7) X2< 0.2488142 347 362.0582 4.975708 *
It also works if all the coefficients are negative.
But when I generate a mix of negative and positive values in the final terms (that is, in the "interaction" of the tree, where the split happens at the second level), rpart changes the order of the splits and the values in the leaves are not correct:
Y=-5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
mytree<-rpart(Y~.,data=mydat)
mytree
Output:
node), split, n, deviance, yval
      * denotes terminal node

1) root 1000 17811.4000 0.6136962
  2) X2< 0.1974489 515 8116.5350 -2.3192910
    4) X1< 1.002815 343 359.7394 -5.0305350 *
    5) X1>=1.002815 172 207.4313 3.0874360 *
  3) X2>=0.1974489 485 560.3419 3.7281050 *
Does anyone have an idea about this problem?
Thanks
You need to tune the complexity parameter cp. See the code below.
# Data Generating Process
set.seed(1988)
n=1000
X1<-rnorm(n,mean=0,sd=2)
X2<-rnorm(n,mean=0,sd=2)
e<-rnorm(n)
Y=-5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
library(rpart)
mytree<-rpart(Y~.,data=mydat, cp=0.0001)
# Plot the cross-validation error vs the complexity parameter
plotcp(mytree)
# Find the optimal value of the complexity parameter cp
# (column 4 of cptable is the cross-validated error "xerror", column 1 is "CP")
optcp <- mytree$cptable[which.min(mytree$cptable[,4]),1]

# Prune the tree using the optimal complexity parameter
mytree <- prune(mytree,optcp)
The pruned tree correctly represents the underlying data-generating process:
library(rattle)
fancyRpartPlot(mytree)
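As an optional check (my addition, not from the original answer), you can also look at the pruned tree in text form; its leaves should have mean responses close to the true values -5, 4 and 3:

printcp(mytree)  # cross-validation table after pruning
print(mytree)    # splits and leaf means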

How to delete certain nodes from a regression tree built by `ctree()` from `party` package

I've built a regression tree using ctree() from package party.
The results of my model have many nodes which contain equal probabilities of the dependent variable (e.g., class A = 0.33, class B = 0.33, class C = 0.33). I want to take these nodes out of the model. The tree package has the snip.tree() command, where we can specify the node numbers to be deleted from the model. This command does not recognize regression trees built with ctree(). Please let me know if there is a way to delete certain nodes from a regression tree built using ctree().
I have used the model:
rv.mod1 <- ctree(ldclas ~ L2 + L3 + L4 + L5 + L6 + ele + ndvi + nd_var + nd_ps, data = rv, controls = ctree_control(minsplit = 0, minbucket = 0))
pr.rv.mod1 <- snip.tree(rv.mod1, nodes = nn2.rv.mod1$nodes)
nn2.rv.mod1$nodes is a vector with the nodes to be deleted from the rv.mod1 model. But I get an error:
Error in snip.tree(rv.mod1, nodes = nn2.rv.mod1$nodes) :
not legitimate tree
I don't think there is a direct way to do this, but I will propose a "hack" using the weights argument in ctree.
Let's start with a reproducible example
library(party)
irisct <- ctree(Species ~ .,data = iris)
plot(irisct)
Now, suppose you want to get rid of node number 5. You can do the following
NewWeights <- rep(1, dim(iris)[1])  # a weights vector that will be passed to the `weights` argument of `ctree`
Node <- 5                           # selecting node #5
n <- nodes(irisct, Node)[[1]]       # retrieving that node (including its observation weights)
NewWeights[which(as.logical(n$weights))] <- 0  # setting these weights to zero, so `ctree` will disregard those observations
irisct2 <- ctree(Species ~ ., data = iris, weights = NewWeights)  # creating the new tree with the new weights
plot(irisct2)
Note how nodes 2, 6 and 7 (now named 2, 4 and 5 because there are fewer splits) retained exactly the same distributions and splitting conditions.
I didn't test it for all nodes, but it seems to work fairly well.
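If this needs to be done repeatedly, the same idea can be wrapped in a small helper. A sketch (my addition, using the same old party interface as above; drop_node is a made-up name):

drop_node <- function(tree, formula, data, node_id) {
  w <- rep(1, nrow(data))
  n <- nodes(tree, node_id)[[1]]   # observations belonging to the chosen node
  w[as.logical(n$weights)] <- 0    # give them zero weight
  ctree(formula, data = data, weights = w)
}

irisct3 <- drop_node(irisct, Species ~ ., iris, 5)  # same result as irisct2 above
plot(irisct3)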
