Tree sizes given by CP table in rpart - r

In the R package rpart, what determines the size of trees presented within the CP table for a decision tree? In the below example, the CP table defaults to presenting only trees with 1, 2, and 5 nodes (as nsplit = 0, 1 and 4 respectively).
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
> printcp(fit)
Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
method = "class")
Variables actually used in tree construction:
[1] Age Start
Root node error: 17/81 = 0.20988
n= 81
CP nsplit rel error xerror xstd
1 0.176471 0 1.00000 1.00000 0.21559
2 0.019608 1 0.82353 0.94118 0.21078
3 0.010000 4 0.76471 0.94118 0.21078
Is there an inherent rule rpart() used to determine what size of trees to present? And is it possible to force printcp() to return cross-validation statistics for all possible sizes of tree, i.e. for the above example, also include rows for trees with 3 and 4 nodes (nsplit = 2, 3)?

The rpart() function is controlled using the rpart.control() function. It has parameters such as minsplit, which tells the function to attempt a split only when a node contains at least that many observations, and cp, which tells the function to split only if the overall lack of fit is decreased by a factor of cp.
If you look at summary(fit) on your above example it shows the statistics for all values of nsplit. To get these values to print when using printcp(fit) you need to choose appropriate values of cp and minsplit when calling the original rpart function.

The rpart documentation on CRAN mentions adding the option cp = 0 to the rpart call: http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
It also mentions other options that can be passed to rpart, e.g. to control the number of splits.
dfit <- rpart(y ~ x, method = 'class',
control = rpart.control(xval = 10, minbucket = 2, cp = 0))
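As a minimal sketch of the suggestion above (assuming the kyphosis data that ships with rpart; minsplit = 2 is an illustrative choice, not something the answers prescribe), setting cp = 0 lets rpart keep every split it can make, and the CP table can then be read from the cptable component of the fit. The table may still skip some tree sizes, because it lists only the subtrees in the cost-complexity pruning sequence.

```r
library(rpart)

# Grow the tree without a complexity penalty (cp = 0).
# minsplit = 2 is an illustrative assumption to allow very small nodes.
fit_full <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))

# The CP table is stored as a matrix on the fitted object.
cp_tab <- fit_full$cptable
print(cp_tab[, c("CP", "nsplit", "rel error")])

# Default fit for comparison (cp = 0.01, minsplit = 20).
fit_default <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis, method = "class")
cat("rows with defaults:", nrow(fit_default$cptable),
    " rows with cp = 0:", nrow(cp_tab), "\n")
```

printcp(fit_full) prints the same table together with the root node error.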

Related

RPART model ignoring variable while fitting the model

When I try to fit a classification tree model using Survived ~ Sex + Pclass, it does not consider Pclass and only splits on Sex (when Survived, Sex, and Pclass are converted to factors as shown in the code), no matter what control parameters are specified.
Code:
library(titanic)
library(rpart)
library(rpart.plot)
train = titanic_train
titanic_train$Survived = factor(titanic_train$Survived)
titanic_train$Sex = factor(titanic_train$Sex)
titanic_train$Pclass = factor(titanic_train$Pclass)
ctrl=rpart.control(minsplit = 6, cp=0.001)
fit = rpart(Survived ~ Pclass + Sex , data = titanic_train,control=ctrl)
rpart.plot(fit)
It really really doesn't want to split any further. Even setting cp = 0 doesn't do the trick (with minsplit = 1). But cp = -1 does, making the tree branch down to a leaf for each class. (Whether that's desirable or not is another story...)
This is indeed an interesting observation since
we know that Pclass is a highly informative variable,
most other classification tree software will split further on Pclass (e.g. tree::tree, partykit::ctree, sklearn.tree.DecisionTreeClassifier, ...),
the regression tree version of the exact same code (i.e. NOT converting Survived to a factor but keeping it numeric.) results in 4 leaves, even though the Gini impurity is identical to the variance loss function for 0/1 data.
Also difficult to explain why for cp = 0 and minsplit = 1 the resulting tree would not be the deepest possible.
The rpart author allowed me to use his answer, which I paste below:
train <- titanic_train
names(train) <- tolower(names(train)) # I'm lazy
train$pclass <- factor(train$pclass)
fit1 <- rpart(survived ~ pclass + sex, data=train)
fit2 <- rpart(survived ~ pclass + sex, data=train, method="class")
fit1
n= 891
node), split, n, deviance, yval
* denotes terminal node
1) root 891 210.727300 0.3838384
2) sex=male 577 88.409010 0.1889081
4) pclass=2,3 455 54.997800 0.1406593 *
5) pclass=1 122 28.401640 0.3688525 *
3) sex=female 314 60.105100 0.7420382
6) pclass=3 144 36.000000 0.5000000 *
7) pclass=1,2 170 8.523529 0.9470588 *
fit2
n= 891
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 891 342 0 (0.6161616 0.3838384)
2) sex=male 577 109 0 (0.8110919 0.1889081) *
3) sex=female 314 81 1 (0.2579618 0.7420382) *
The issue: when you choose "classification" as the method, either explicitly like I did above or implicitly by setting the outcome to a factor, you have declared that the loss function is a simple "correct/incorrect" for alive/dead. For males, the survival rate is .189, which is < .5, so they class as 0. The next split below gives rates of .14 and .37, both of which are < .5, both are then treated as 0. The second split did not improve the model, according to the criteria that you chose. With or without it all males are a "0", so no need for the second split.
Ditto for the females: the overall rate and the rates in the two subclasses are all >= .5, so the second split does not improve prediction, according to the criteria that you selected.
When I leave the response as continuous, then the final criteria is MSE, and the further splits are counted as an improvement.
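The effect described above can be reproduced without the Titanic data. The following is a minimal sketch on made-up data (the data frame toy and its columns g1, g2 and y are hypothetical): within each level of g1, both g2 cells lie on the same side of 0.5, so a further split changes the predicted probabilities but not the predicted class. Under these assumptions the classification fit should stop after the first split, while the regression fit keeps splitting.

```r
library(rpart)

# Hypothetical data: four cells with different success rates.
# Within each level of g1, both g2 cells fall on the same side of 0.5.
cells <- data.frame(g1 = c("a", "a", "b", "b"),
                    g2 = c("x", "y", "x", "y"),
                    ones = c(20, 40, 60, 90))   # successes out of 100 per cell
toy <- do.call(rbind, lapply(seq_len(nrow(cells)), function(i) {
  data.frame(g1 = cells$g1[i], g2 = cells$g2[i],
             y = rep(c(1, 0), c(cells$ones[i], 100 - cells$ones[i])))
}))
toy$g1 <- factor(toy$g1)
toy$g2 <- factor(toy$g2)

# Same predictors, numeric response (MSE) versus factor response (0/1 loss).
fit_anova <- rpart(y ~ g1 + g2, data = toy, method = "anova")
fit_class <- rpart(factor(y) ~ g1 + g2, data = toy, method = "class")

# Count terminal nodes of a fitted rpart tree.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
cat("leaves (anova):", n_leaves(fit_anova), "\n")
cat("leaves (class):", n_leaves(fit_class), "\n")
```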

How to turn off k fold cross validation in rpart() in r

I have the Bitcoin time series, I use 11 technical indicators as features and I want to fit a regression tree to the data. As far as I know, there are 2 functions in r which can create regression trees, i.e. rpart() and tree(), but both functions do not seem appropriate. rpart() uses k-fold cross validation to validate the optimal cost complexity parameter cp and in tree(), it is not possible to specify the value of cp.
I am aware that cv.tree() looks for the optimal value of cp via cross validation, but again, cv.tree() uses k-fold cross validation. Since I have a time series, and therefore temporal dependencies, I do not want to use k-fold cross validation: it randomly divides the data into k folds, fits the model on k-1 folds and calculates the MSE on the left-out k-th fold, which obviously ruins the sequence of my time series.
I found an argument of the rpart() function, i.e. xval, which is supposed to let me specify the number of cross validations, but when I look at the output of the rpart() function call when xval=0, it doesn't seem like cross validation is turned off. Below you can see my function call and the output:
tree.model= rpart(Close_5~ M+ DSMA+ DWMA+ DEMA+ CCI+ RSI+ DKD+ R+ FI+ DVI+
OBV, data= train.subset, method= "anova", control=
rpart.control(cp=0.01,xval= 0, minbucket = 5))
> summary(tree.model)
Call:
rpart(formula = Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI +
DKD + R + FI + DVI + OBV, data = train.subset, method = "anova",
control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))
n= 590
CP nsplit rel error
1 0.35433076 0 1.0000000
2 0.10981049 1 0.6456692
3 0.06070669 2 0.5358587
4 0.04154720 3 0.4751521
5 0.02415633 5 0.3920576
6 0.02265346 6 0.3679013
7 0.02139752 8 0.3225944
8 0.02096500 9 0.3011969
9 0.02086543 10 0.2802319
10 0.01675277 11 0.2593665
11 0.01551861 13 0.2258609
12 0.01388126 14 0.2103423
13 0.01161287 15 0.1964610
14 0.01127722 16 0.1848482
15 0.01000000 18 0.1622937
It seems like rpart() cross validated 15 different values of cp. If these values were tested with k-fold cross validation, then again, the sequence of my time series will be ruined and I can basically not use these results. Does anyone know how I can turn off cross validation in rpart() effectively, or how to vary the value of cp in tree()?
UPDATE: I followed the suggestion of one of our colleagues and set xval=1, but that didn't seem to solve the problem. You can see the full function output when xval=1 here. Btw, parameters[j] is the j-th element of a parameter vector. When I called this function, parameters[j]= 0.0009765625
Many thanks in advance
To demonstrate that rpart() creates tree nodes by iterating over declining values of cp rather than by resampling, we'll use the Ozone data from the mlbench package to compare the results of rpart() and caret::train() as discussed in the comments to the OP. We'll set up the Ozone data as illustrated in the CRAN documentation for Support Vector Machines, which support nonlinear regression and are comparable to rpart().
library(rpart)
library(caret)
data(Ozone, package = "mlbench")
# split into test and training
index <- 1:nrow(Ozone)
set.seed(01381708)
testIndex <- sample(index, trunc(length(index) / 3))
testset <- na.omit(Ozone[testIndex,-3])
trainset <- na.omit(Ozone[-testIndex,-3])
# rpart version
set.seed(95014) #reset seed to ensure sample is same as caret version
rpart.model <- rpart(V4 ~ .,data = trainset,xval=0)
# summary(rpart.model)
# calculate RMSE
rpart.pred <- predict(rpart.model, testset[,-3])
crossprod(rpart.pred - testset[,3]) / length(testIndex)
...and the output for the RMSE calculation:
> crossprod(rpart.pred - testset[,3]) / length(testIndex)
[,1]
[1,] 18.25507
Next, we'll run the same analysis with caret::train() as proposed in the comments to the OP.
# caret version
set.seed(95014)
rpart.model <- caret::train(x = trainset[,-3],
y = trainset[,3],method = "rpart", trControl = trainControl(method = "none"),
metric = "RMSE", tuneGrid = data.frame(cp=0.01),
preProcess = c("center", "scale"), xval = 0, minbucket = 5)
# summary(rpart.model)
# demonstrate caret version did not do resampling
rpart.model
# calculate RMSE, which matches RMSE from rpart()
rpart.pred <- predict(rpart.model, testset[,-3])
crossprod(rpart.pred - testset[,3]) / length(testIndex)
When we print the model output from caret::train() it clearly notes that there was no resampling.
> rpart.model
CART
135 samples
11 predictor
Pre-processing: centered (9), scaled (9), ignore (2)
Resampling: None
The RMSE for the caret::train() version matches the RMSE from rpart().
> # calculate RMSE, which matches RMSE from rpart()
> rpart.pred <- predict(rpart.model, testset[,-3])
> crossprod(rpart.pred - testset[,3]) / length(testIndex)
[,1]
[1,] 18.25507
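A note on the metric: crossprod(pred - obs) / n is a mean squared error, and an RMSE additionally takes the square root (the square root of the reported 18.25507 is about 4.27). The code above also divides by length(testIndex), which counts the test rows before na.omit() removed incomplete cases. Both models are scored the same way, so the comparison between them is unaffected. A minimal sketch with made-up vectors (pred and obs are hypothetical, not the Ozone results):

```r
# Hypothetical predictions and observations, for illustration only.
pred <- c(10, 12, 9, 15)
obs  <- c(11, 10, 9, 18)

mse  <- mean((pred - obs)^2)   # mean squared error
rmse <- sqrt(mse)              # root mean squared error

# crossprod() returns the same sum of squares as a 1 x 1 matrix.
mse_crossprod <- as.numeric(crossprod(pred - obs)) / length(obs)

cat("MSE:", mse, " RMSE:", rmse, "\n")
```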
Conclusions
First, as configured above, neither caret::train() nor rpart() is resampling. If one prints the model output, however, one will see that multiple values of cp are used to generate the final tree of 47 nodes via both techniques.
Output from caret summary(rpart.model)
CP nsplit rel error
1 0.58951537 0 1.0000000
2 0.08544094 1 0.4104846
3 0.05237152 2 0.3250437
4 0.04686890 3 0.2726722
5 0.03603843 4 0.2258033
6 0.02651451 5 0.1897648
7 0.02194866 6 0.1632503
8 0.01000000 7 0.1413017
Output from rpart summary(rpart.model)
CP nsplit rel error
1 0.58951537 0 1.0000000
2 0.08544094 1 0.4104846
3 0.05237152 2 0.3250437
4 0.04686890 3 0.2726722
5 0.03603843 4 0.2258033
6 0.02651451 5 0.1897648
7 0.02194866 6 0.1632503
8 0.01000000 7 0.1413017
Second, both models account for time values via the inclusion of month and day variables as independent variables. In the Ozone data set, V1 is the month variable, and V2 is the day variable. All data was collected during 1976, so there is no year variable included in the data set, and in the original analysis in the svm vignette, day of week was dropped prior to analysis.
Third, to account for other time-based effects using algorithms like rpart() or svm() when date attributes are not used as features in the model, one must include lag effects as features in the model because these algorithms do not directly account for a time component. One example of how to do this with an ensemble of regression trees using a range of lagged values is Ensemble Regression Trees for Time Series Predictions.
In your model, setting xval=0 does turn off cross validation.
Your output contains only the columns CP, nsplit and rel error; with cross validation you would also get xerror and xstd.
CP is the complexity parameter (cp=0.01 by default); the table lists its values in decreasing order, down to the cp you specified.
rel error is the error on the training data set divided by the error (expected loss) at the root node.
nsplit is the number of splits in the tree that corresponds to each value of cp.
See: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
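To check this on a small example, here is a minimal sketch using the kyphosis data that ships with rpart (an illustrative data set, not the Bitcoin series from the question). It fits the same tree with and without cross validation and compares the columns of the CP table.

```r
library(rpart)

# Same model, with and without cross validation.
fit_cv   <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  control = rpart.control(xval = 10))
fit_nocv <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  control = rpart.control(xval = 0))

print(colnames(fit_cv$cptable))    # includes "xerror" and "xstd"
print(colnames(fit_nocv$cptable))  # only "CP", "nsplit", "rel error"

# The fitted trees are the same; only the cross-validated columns differ.
same_tree <- identical(as.character(fit_cv$frame$var),
                       as.character(fit_nocv$frame$var))
print(same_tree)
```

The rows of the CP table come from the pruning sequence of the single tree grown on all the data, so they appear whether or not cross validation is run.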

rstanarm for Bayesian hierarchical modeling of binomial experiments

Suppose there are three binomial experiments conducted chronologically. For each experiment, I know the number of trials as well as the number of successes. To use the first two older experiments as prior for the third experiment, I want to "fit a Bayesian hierarchical model on the two older experiments and use the posterior from that as prior for the third experiment".
Given my available data (below), my question is: is my rstanarm code below capturing what I described above?
Study1_trial = 70
Study1_succs = 27
#==================
Study2_trial = 84
Study2_succs = 31
#==================
Study3_trial = 100
Study3_succs = 55
What I have tried in package rstanarm:
library("rstanarm")
data <- data.frame(n = c(70, 84, 100), y = c(27, 31, 55));
mod <- stan_glm(cbind(y, n - y) ~ 1, prior = NULL, data = data, family = binomial(link = 'logit'))
## can I use a beta(1.2, 1.2) as prior for the first experiment?
TL;DR: If you were directly predicting the probability of success, the model would be a Bernoulli likelihood with parameter theta (the probability of success) that could take on values between zero and one. You could use a Beta prior for theta in this case. But with a logistic regression model, you're actually modeling the log odds of success, which can take on any value from -Inf to Inf, so a prior with a normal distribution (or some other prior that can take on any real value within some range determined by the available prior information) would be more appropriate.
For a model where the only parameter is the intercept, the prior is the probability distribution for the log odds of success. Mathematically, the model is:
log(p/(1-p)) =  a
Where p is the probability of success and a, the parameter you're estimating, is the intercept, which can be any real number. If the odds of success are 1:1 (that is, p = 0.5) then a = 0. If the odds are greater than 1:1 then a is positive. If the odds are less than 1:1 then a is negative.
Since we want a prior for a, we need a probability distribution that can take on any real value. If we didn't know anything about the odds of success, we might use a very weakly informative prior like a normal distribution with, say, mean=0 and sd=10 (this is the rstanarm default), meaning that one standard deviation would encompass odds of success ranging from about 22000:1 to 1:22000! So this prior is essentially flat.
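The mapping between probabilities and log odds can be checked directly in base R. This is a small sketch using qlogis() and plogis(); the probabilities in p are arbitrary illustrative values.

```r
# qlogis() maps a probability to log odds; plogis() maps back.
p <- c(0.25, 0.5, 0.75)
log_odds <- qlogis(p)          # same as log(p / (1 - p))
print(log_odds)                # negative, zero, positive
print(plogis(log_odds))        # recovers p

# One standard deviation of a normal(0, 10) prior on the log odds scale
# corresponds to odds of exp(10) : 1, i.e. roughly 22000 : 1.
print(exp(10))
```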
If we take your first two studies to construct the prior, we can use the probability density based on those studies and then transform it to the log odds scale:
# Possible outcomes (that is, the possible number of successes)
s = 0:(70+84)
# Probability density over all possible outcomes
dens = dbinom(s, 70+84, (27+31)/(70+84))
Assuming we'll use a normal distribution for the prior, we want the most likely probability of success (which will be the mean for the prior) and the standard deviation of the mean.
# Prior parameters
pp = s[which.max(dens)]/(70+84) # most likely probability
psd = sum(dens * (s/max(s) - pp)^2)^0.5 # standard deviation
# Convert prior to log odds scale
pp_logodds = log(pp/(1-pp))
psd_logodds = log(pp/(1-pp)) - log((pp-psd)/(1 - (pp-psd)))
c(pp_logodds, psd_logodds)
[1] -0.5039052 0.1702006
You could generate essentially the same prior by running stan_glm on the first two studies with the default (flat) prior:
prior = stan_glm(cbind(y, n-y) ~ 1,
data = data[1:2,],
family = binomial(link = 'logit'))
c(coef(prior), se(prior))
[1] -0.5090579 0.1664091
Now let's fit the model using data from Study 3 using the default prior and the prior we just generated. I've switched to a standard data frame, since stan_glm seems to fail when the data frame has only one row (as in data = data[3, ]).
# Default weakly informative prior
mod1 <- stan_glm(y ~ 1,
data = data.frame(y=rep(0:1, c(45,55))),
family = binomial(link = 'logit'))
# Prior based on studies 1 & 2
mod2 <- stan_glm(y ~ 1,
data = data.frame(y=rep(0:1, c(45,55))),
prior_intercept = normal(location=pp_logodds, scale=psd_logodds),
family = binomial(link = 'logit'))
For comparison, let's also generate a model with all three studies and the default flat prior. We would expect this model to give virtually the same results as mod2:
mod3 <- stan_glm(cbind(y, n - y) ~ 1,
data = data,
family = binomial(link = 'logit'))
Now let's compare the three models:
library(tidyverse)
list(`Study 3, Flat Prior`=mod1,
`Study 3, Prior from Studies 1 & 2`=mod2,
`All Studies, Flat Prior`=mod3) %>%
map_df(~data.frame(log_odds=coef(.x),
p_success=predict(.x, type="response")[1]),
.id="Model")
Model log_odds p_success
1 Study 3, Flat Prior 0.2008133 0.5500353
2 Study 3, Prior from Studies 1 & 2 -0.2115362 0.4473123
3 All Studies, Flat Prior -0.2206890 0.4450506
For Study 3 with the flat prior (row 1), the predicted probability of success is 0.55, as expected, since that's what the data says and the prior provides no additional information.
For Study 3 with a prior based on studies 1 and 2, the probability of success is 0.45. The lower probability of success is due to the lower probability of success in Studies 1 and 2 adding additional information. In fact, the probability of success from mod2 is very close to what you'd calculate directly from the data: with(data, sum(y)/sum(n)). mod3 puts all the information into the likelihood instead of splitting it between the prior and the likelihood, but is otherwise essentially the same as mod2.
Answer to (now deleted) comment: If all you know is the number of trials and successes and you think that a binomial probability is a reasonable model for how the data were generated, then it doesn't matter how you split up the data into "prior" and "likelihood" or whether you shuffle the order of the data. The resulting model fit will be the same.
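The point about splitting the data between prior and likelihood can be illustrated without MCMC. The sketch below swaps in a conjugate Beta prior on the success probability (not the normal prior on the log odds used with stan_glm above), and the flat Beta(1, 1) starting prior and the helper update_beta() are assumptions made for illustration.

```r
# Conjugate Beta-binomial update: Beta(a, b) prior + (y successes, n trials)
# gives a Beta(a + y, b + n - y) posterior.
update_beta <- function(prior, y, n) {
  c(a = prior[["a"]] + y, b = prior[["b"]] + n - y)
}

n <- c(70, 84, 100)
y <- c(27, 31, 55)
flat <- c(a = 1, b = 1)  # assumed flat starting prior

# Route 1: studies 1 and 2 form the prior, study 3 is the likelihood.
prior_12 <- update_beta(flat, sum(y[1:2]), sum(n[1:2]))
post_seq <- update_beta(prior_12, y[3], n[3])

# Route 2: all three studies go into the likelihood at once.
post_all <- update_beta(flat, sum(y), sum(n))

print(post_seq)
print(post_all)
print(post_all[["a"]] / sum(post_all))  # posterior mean of p
```

Both routes end at the same Beta posterior, which is the conjugate analogue of mod2 and mod3 agreeing.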

Using a survival tree from the 'rpart' package in R to predict new observations

I'm attempting to use the "rpart" package in R to build a survival tree, and I'm hoping to use this tree to then make predictions for other observations.
I know there have been a lot of SO questions involving rpart and prediction; however, I have not been able to find any that address a problem that (I think) is specific to using rpart with a "Surv" object.
My particular problem involves interpreting the results of the "predict" function. An example is helpful:
library(rpart)
library(OIsurv)
# Make Data:
set.seed(4)
dat = data.frame(X1 = sample(x = c(1,2,3,4,5), size = 1000, replace=T))
dat$t = rexp(1000, rate=dat$X1)
dat$t = dat$t / max(dat$t)
dat$e = rbinom(n = 1000, size = 1, prob = 1-dat$t )
# Survival Fit:
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
plot(sfit)
# Tree Fit:
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)
# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )
So far so good. My understanding of what's going on here is that rpart is attempting to fit exponential survival curves to subsets of my data. Based on this understanding, I believe that when I call predict(tfit), I get, for each observation, a number corresponding to the parameter for the exponential curve for that observation. So, for example, if predict(tfit)[1] is .46, then this means for the first observation in my original dataset, the curve is given by the equation S(t) = exp(−λt), where λ=.46.
This seems like exactly what I'd want. For each observation (or any new observation), I can get the predicted probability that this observation will be alive/dead for a given time point. (EDIT: I'm realizing this is probably a misconception— these curves don't give the probability of alive/dead, but the probability of surviving an interval. This doesn't change the problem described below, though.)
However, when I try and use the exponential formula...
# Predict:
# an attempt to use the rates extracted from the tree to
# capture the survival curve formula in each tree node.
rates = unique(predict(tfit))
for (rate in rates) {
grid= seq(0,1,length.out = 100)
lines(x= grid, y= exp(-rate*(grid)), col=2)
}
What I've done here is split the dataset in the same way the survival tree did, then used survfit to plot a non-parametric curve for each of these partitions. That's the black lines. I've also drawn lines corresponding to the result of plugging in (what I thought was) the 'rate' parameter into (what I thought was) the survival exponential formula.
I understand that the non-parametric and the parametric fit shouldn't necessarily be identical, but this seems more than that: it seems like I need to scale my X variable or something.
Basically, I don't seem to understand the formula that rpart/survival is using under the hood. Can anyone help me get from (1) rpart model to (2) a survival equation for any arbitrary observation?
The survival data are rescaled internally (using an exponential model) so that the predicted rate in the root node is always fixed to 1.000. The predictions reported by the predict() method are then always relative to the survival in the root node, i.e., higher or lower by a certain factor. See Section 8.4 in vignette("longintro", package = "rpart") for more details. In any case, the Kaplan-Meier curves you reported correspond exactly to what is also reported in the rpart vignette.
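The relative scaling can be inspected on the fitted object. This is a minimal sketch that assumes the lung data from the survival package (an illustrative choice; the question's simulated dat would work the same way) and reads the node rates from the frame component of the fit.

```r
library(rpart)
library(survival)

# Survival tree on the lung data shipped with the survival package
# (an illustrative choice; any right-censored data set would do).
sfit <- rpart(Surv(time, status) ~ age + sex, data = lung)

# yval in the frame holds the estimated rate per node, relative to the root.
root_rate  <- sfit$frame$yval[1]
leaf_rates <- unique(predict(sfit))
print(root_rate)    # the root is the reference, so this should be 1
print(leaf_rates)   # above/below 1 means higher/lower hazard than the root
```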
If you want to obtain directly the plots of the Kaplan-Meier curves in the tree and get predicted median survival times, you can coerce the rpart tree to a constparty tree as provided by the partykit package:
library("partykit")
(tfit2 <- as.party(tfit))
## Model formula:
## Surv(t, event = e) ~ X1
##
## Fitted party:
## [1] root
## | [2] X1 < 2.5
## | | [3] X1 < 1.5: 0.192 (n = 213)
## | | [4] X1 >= 1.5: 0.082 (n = 213)
## | [5] X1 >= 2.5: 0.037 (n = 574)
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
##
plot(tfit2)
The print output shows the median survival time and the visualization the corresponding Kaplan-Meier curve. Both can also be obtained with the predict() method setting the type argument to "response" and "prob" respectively.
predict(tfit2, type = "response")[1]
## 5
## 0.03671885
predict(tfit2, type = "prob")[[1]]
## Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 574.0000 574.0000 574.0000 542.0000 0.0367 0.0323 0.0408
As an alternative to the rpart survival trees you might also consider the non-parametric survival trees based on conditional inference in ctree() (using logrank scores) or fully parametric survival trees using the general mob() infrastructure from the partykit package.
@Achim Zeileis's answer is very helpful, but it seems that @jwdink's exact question was not answered. I understood it as "If the RPart tree splits by best exponential survival fit, what are the lambdas for these fits in absolute terms, so that we can use these exponential survival functions to make predictions?". The RPart summary does show the estimated rate, but only in relative terms, assuming that the entire population has a rate of 1. To overcome this, one can fit an exponential survreg, take the reference lambda from there and then multiply the RPart predicted rates by that number (see code below).
That said, this is not how survival rates in RPart are predicted out of a tree. I did not find a survival prediction function directly in RPart; however, as Achim pointed out above, partykit uses Kaplan-Meier estimates, i.e. non-parametric survival curves computed from the observations ending up in the respective final leaf. I think it is the same in survival random forests, where K-M curves are used in the final leaves.
The simulated data in this question use an exponential distribution, so K-M and exponential survival curves will be similar by design. For a different simulated or real-life distribution, however, the exponential rates estimated by the RPart tree and the K-M curves in the final leaves (of the same tree) will give different survival rates.
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)
# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
table(dat$node)
s0 = survreg(Surv(t,e)~ 1, data = dat, dist = "exponential") #-0.6175
e0 = exp(-summary(s0)$coefficients[1]); e0 #1.854
rates = unique(predict(tfit))
#1) plot K-M curves by node (black):
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )
#2) plot exponential survival with rates = e0 * RPart rates (red):
for (rate in rates) {
grid= seq(0,1,length.out = 100)
lines(x= grid, y= exp(-e0*rate*(grid)), col=2)
}
#3) plot partykit survival curves based on RPart tree (green)
library(partykit)
tfit2 <- as.party(tfit)
grid = seq(0, 1, length.out = 100) # same time grid as in step 2
for (node in names(table(dat$node))){
# Kaplan-Meier curve of the terminal node these observations fall into
predict_curve = predict(tfit2, newdata = dat[dat$node == node, ], type = "prob")
surv_estimated = approxfun(predict_curve[[1]]$time, predict_curve[[1]]$surv)
lines(x= grid, y= surv_estimated(grid), col = 3) # 3 = green
}

How to compute error rate from a decision tree?

Does anyone know how to calculate the error rate for a decision tree with R?
I am using the rpart() function.
Assuming you mean computing error rate on the sample used to fit the model, you can use printcp(). For example, using the on-line example,
> library(rpart)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
> printcp(fit)
Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Variables actually used in tree construction:
[1] Age Start
Root node error: 17/81 = 0.20988
n= 81
CP nsplit rel error xerror xstd
1 0.176471 0 1.00000 1.00000 0.21559
2 0.019608 1 0.82353 0.82353 0.20018
3 0.010000 4 0.76471 0.82353 0.20018
The Root node error is used to compute two measures of predictive performance, when considering values displayed in the rel error and xerror column, and depending on the complexity parameter (first column):
0.76471 x 0.20988 = 0.1604973 (16.0%) is the resubstitution error rate (i.e., error rate computed on the training sample) -- this is roughly
class.pred <- table(predict(fit, type="class"), kyphosis$Kyphosis)
1-sum(diag(class.pred))/sum(class.pred)
0.82353 x 0.20988 = 0.1728425 (17.3%) is the cross-validated error rate (using 10-fold CV, see xval in rpart.control(); but see also xpred.rpart() and plotcp(), which rely on this kind of measure). This measure is a more objective indicator of predictive accuracy.
Note that it is more or less in agreement with classification accuracy from tree:
> library(tree)
> summary(tree(Kyphosis ~ Age + Number + Start, data=kyphosis))
Classification tree:
tree(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Number of terminal nodes: 10
Residual mean deviance: 0.5809 = 41.24 / 71
Misclassification error rate: 0.1235 = 10 / 81
where Misclassification error rate is computed from the training sample.
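The resubstitution error rate discussed above can also be computed programmatically instead of being read off the printed table. This is a minimal sketch that takes the root node error and the rel error of the final tree from the fitted object and compares the product with the error rate obtained directly from the training predictions.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Root node error: misclassified observations at the root / sample size.
root_err <- fit$frame$dev[1] / fit$frame$n[1]

# rel error of the final tree is in the last row of the CP table.
rel_err <- fit$cptable[nrow(fit$cptable), "rel error"]
resub_from_table <- unname(rel_err * root_err)

# The same quantity computed directly from the training predictions.
pred <- predict(fit, type = "class")
resub_direct <- mean(pred != kyphosis$Kyphosis)

cat("from CP table:", resub_from_table, "\n")
cat("direct       :", resub_direct, "\n")
```

The cross-validated counterpart uses the xerror column in the same way, but its value changes with the random fold assignment unless a seed is set.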
