RPART model ignoring variable while fitting the model - r

When I try to fit a classification tree with Survived ~ Sex + Pclass, rpart ignores Pclass and splits only on Sex (with Survived, Sex, and Pclass converted to factors, as shown in the code), no matter what control parameters I specify.
Code:
library(titanic)
library(rpart)
library(rpart.plot)
train = titanic_train
titanic_train$Survived = factor(titanic_train$Survived)
titanic_train$Sex = factor(titanic_train$Sex)
titanic_train$Pclass = factor(titanic_train$Pclass)
ctrl=rpart.control(minsplit = 6, cp=0.001)
fit = rpart(Survived ~ Pclass + Sex , data = titanic_train,control=ctrl)
rpart.plot(fit)

It really really doesn't want to split any further. Even setting cp = 0 doesn't do the trick (with minsplit = 1). But cp = -1 does, making the tree branch down to a leaf for each class. (Whether that's desirable or not is another story...)

This is indeed an interesting observation since
we know that Pclass is a highly informative variable,
most other classification tree software will split further on Pclass (e.g. tree::tree, partykit::ctree, sklearn.tree.DecisionTreeClassifier, ...),
the regression tree version of the exact same code (i.e. NOT converting Survived to a factor but keeping it numeric.) results in 4 leaves, even though the Gini impurity is identical to the variance loss function for 0/1 data.
Also difficult to explain why for cp = 0 and minsplit = 1 the resulting tree would not be the deepest possible.

The rpart author allowed me to use his answer, which I paste below:
train <- titanic_train
names(train) <- tolower(names(train)) # I'm lazy
train$pclass <- factor(train$pclass)
fit1 <- rpart(survived ~ pclass + sex, data=train)
fit2 <- rpart(survived ~ pclass + sex, data=train, method="class")
fit1
n= 891
node), split, n, deviance, yval
* denotes terminal node
1) root 891 210.727300 0.3838384
2) sex=male 577 88.409010 0.1889081
4) pclass=2,3 455 54.997800 0.1406593 *
5) pclass=1 122 28.401640 0.3688525 *
3) sex=female 314 60.105100 0.7420382
6) pclass=3 144 36.000000 0.5000000 *
7) pclass=1,2 170 8.523529 0.9470588 *
fit2
n= 891
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 891 342 0 (0.6161616 0.3838384)
2) sex=male 577 109 0 (0.8110919 0.1889081) *
3) sex=female 314 81 1 (0.2579618 0.7420382) *
The issue: when you choose "classification" as the method, either explicitly like I did above or implicitly by setting the outcome to a factor, you have declared that the loss function is a simple "correct/incorrect" for alive/dead. For males, the survival rate is .189, which is < .5, so they class as 0. The next split below gives rates of .14 and .37, both of which are < .5, both are then treated as 0. The second split did not improve the model, according to the criteria that you chose. With or without it all males are a "0", so no need for the second split.
Ditto for the females: the overall rate and the rates in the two subclasses are all >= .5, so the second split does not improve prediction, according to the criteria that you selected.
When I leave the response as continuous, then the final criterion is MSE, and the further splits are counted as an improvement.
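The arithmetic behind this explanation can be checked from the printed node summaries alone. The sketch below refits nothing; it only reuses the counts and survival rates shown for the male node in fit1 and fit2, and the variable names are illustrative.

```r
# Male node and its two candidate children, copied from the printed trees above
n_parent <- 577
p_parent <- 0.1889081
n_child  <- c(455, 122)               # pclass = 2,3 and pclass = 1
p_child  <- c(0.1406593, 0.3688525)   # survival rate in each child

# Survivors per node (rounded, because the printed rates are rounded)
s_parent <- round(n_parent * p_parent)
s_child  <- round(n_child * p_child)

# Misclassification loss: every node predicts its majority class
loss_parent <- min(s_parent, n_parent - s_parent)
loss_child  <- sum(pmin(s_child, n_child - s_child))

# Squared-error loss (deviance) for a 0/1 response: n * p * (1 - p)
sse_parent <- n_parent * p_parent * (1 - p_parent)
sse_child  <- sum(n_child * p_child * (1 - p_child))

print(c(loss_parent = loss_parent, loss_child = loss_child))
print(c(sse_parent = sse_parent, sse_child = sse_child))
```

The misclassification loss is 109 both before and after the split, while the squared-error loss drops from about 88.4 to about 83.4, which is why only the regression tree keeps the split on pclass.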

Related

Two-level modelling with lme in R

I am interested in estimating a mixed-effects model with two random components (apologies for the imprecise notation; I am fairly new to these kinds of models). I also want the standard errors of the variances of the random components. That is why I am more or less bound to using lme from the nlme package: I found this description of how to calculate those standard errors and, also interesting, the standard error of a function of these variances (link).
I believe I know how to use lmer from the lme4 package. I am ultimately interested in model2. For model1, both commands yield the same estimates, but model2 with lme yields different results than model2 with lmer. Could you help me work out how to set up the random components for lme? This would be much appreciated. Thanks. Please find my MWE below.
Best
Daniel
#### load all packages #####
loadpackage <- function(x){
for( i in x ){
# require returns TRUE invisibly if it was able to load package
if( ! require( i , character.only = TRUE ) ){
# If package was not able to be loaded then re-install
install.packages( i , dependencies = TRUE )
}
# Load package (after installing)
library( i , character.only = TRUE )
}
}
# Then try/install packages...
loadpackage( c("nlme", "msm", "lmeInfo", "lme4"))
alcohol1 <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")
attach(alcohol1)
id <- as.factor(id)
age <- as.factor(age)
model1.lmer <-lmer(alcuse ~ 1 + peer + (1|id))
summary(model1.lmer)
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age))
summary(model2.lmer)
model1.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id, method ="REML")
summary(model1.lme)
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id + 1|age, method ="REML")
Edit (15/09/2021):
Estimating the model as follows and then returning the estimates via nlme::VarCorr gives me different results. While the estimates seem to be in the right ballpark, it is as if they were switched across components.
model2a.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2a.lme)
nlme::VarCorr(model2a.lme)
Variance StdDev
id = pdLogChol(1)
(Intercept) 0.38390274 0.6195989
age = pdLogChol(1)
(Intercept) 0.47892113 0.6920413
Residual 0.08282585 0.2877948
EDIT (16/09/2021):
Since Bob pushed me to think more about my model, I want to give some additional information. Please note that the data I use in the MWE do not match my true data. I just used them for illustrative purposes, since I cannot upload my true data. I have a household panel with income, demographic information, and parent indicators.
I am interested in intergenerational mobility. Sibling correlations of permanent income are one industry standard. At the very least, contemporaneous observations are very bad proxies of permanent income: due to transitory shocks, i.e., classical measurement error, estimates based on them are most certainly attenuated. For this reason, we exploit the longitudinal dimension of our data.
For sibling correlations, this amounts to hypothesising that the income process is as follows:
$$Y_{ijt} = \beta X_{ijt} + \epsilon_{ijt}.$$
Here Y is the income of individual i from family j in year t. X comprises age and survey-year indicators to account for life-cycle effects and macroeconomic conditions in the survey years. Epsilon is a compound term comprising a random individual component and a random family component as well as a transitory component (measurement error and short-lived shocks). It looks as follows:
$$\epsilon_{ijt} = \alpha_i + \gamma_j + \eta_{ijt}.$$
The variance of income is then:
$$\sigma^2_\epsilon = \sigma^2_\alpha + \sigma^2_\gamma + \sigma^2_\eta.$$
The quantity we are interested in is
$$\rho = \frac{\sigma^2_\gamma}{\sigma^2_\alpha + \sigma^2_\gamma},$$
which reflects the share of the variation in permanent income that is due to family background (and other characteristics) shared among siblings.
By the way: the struggle is simply because I want standard errors for all estimates and for \rho.
This is an example of crossed vs nested random effects. (Note that the example you refer to is fitting a different kind of model, a random-slopes model rather than a model with two different grouping variables ...)
If you try with(alcohol1, table(age,id)) you can see that every id is associated with every possible age (14, 15, 16). Or subset(alcohol1, id==1) for example:
id age coa male age_14 alcuse peer cpeer ccoa
1 1 14 1 0 0 1.732051 1.264911 0.2469111 0.549
2 1 15 1 0 1 2.000000 1.264911 0.2469111 0.549
3 1 16 1 0 2 2.000000 1.264911 0.2469111 0.549
There are three possible models you could fit for a model with random effects of age(indexed by i) and id (indexed by j)
crossed ((1|age) + (1|id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_j +epsr_{ij}; alcohol use varies among individuals and, independently, across ages (this model won't work very well because there are only three distinct ages in the data set, more levels are usually needed)
id nested within age ((1|age/id) = (1|age) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_{ij} + epsr_{ij}; alcohol use varies across ages, and varies across individuals within ages (see note above about number of levels).
age nested within id ((1|id/age) = (1|id) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_j + eps2_{ij} + epsr_{ij}; alcohol use varies across individuals, and varies across ages within individuals
Here eps1_i, eps2_{ij}, and epsr_{ij} are normal deviates; epsr is the residual error term.
The latter two models actually don't make sense in this case; because there is only one observation per age/id combination, the nested variance (eps2) is completely confounded with the residual variance (epsr). lme doesn't notice this; if you try to fit one of the nested models in lmer it will give an error that
number of levels of each grouping factor must be < number of observations (problems: id:age)
(Although if you try to compute confidence intervals based on model1.lme you'll get an error "cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance", which is a hint that something is wrong.)
You could restate this problem as saying that the residual variation, and the variation among ages within individuals, are jointly unidentifiable (can't be separated from each other, statistically).
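The confounding can be illustrated without fitting any model. The sketch below builds a small hypothetical data frame (toy, with made-up ids) that has the same layout as alcohol1, one row per id at each of the three ages, and counts the levels of the nested grouping factor.

```r
# Hypothetical toy design mirroring alcohol1: each id is observed once at each age
toy <- expand.grid(id = factor(1:5), age = factor(c(14, 15, 16)))

# Every id:age cell holds exactly one observation
cell_counts <- table(toy$id, toy$age)
print(all(cell_counts == 1))

# The nested grouping factor therefore has as many levels as there are rows,
# so its variance cannot be separated from the residual variance
n_cells <- nlevels(interaction(toy$id, toy$age, drop = TRUE))
print(c(n_cells = n_cells, n_obs = nrow(toy)))
```

With one observation per level, a random intercept for id:age and the residual term play the same role, which is the situation the lmer error message complains about.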
The updated answer here shows how to get the standard errors of the variance components from an lmer model, so you shouldn't be stuck with lme (but you should think carefully about which model you're really trying to fit ...)
The GLMM FAQ might also be useful.
More generally, the standard error of
rho = (V_gamma)/(V_alpha + V_gamma)
will be hard to compute accurately, because this is a nonlinear function of the model parameters. You can apply the delta method, but the most reliable approach would be to use parametric bootstrapping: if you have a fitted model m, then something like this should work:
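var_ratio <- function(m) {
# named vector of variances, one per grouping factor
# ("family" and "id" stand for the grouping-factor names used in m)
v <- sapply(VarCorr(m), as.numeric)
return(unname(v["family"]/(v["family"] + v["id"])))
}
confint(m, method="boot", FUN =var_ratio)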
You should specify random effects in lme by using / not +
By lmer
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age), data = alcohol1)
summary(model2.lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: alcuse ~ 1 + peer + (1 | id) + (1 | age)
Data: alcohol1
REML criterion at convergence: 651.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.0228 -0.5310 -0.1329 0.5854 3.1545
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.08078 0.2842
age (Intercept) 0.30313 0.5506
Residual 0.56175 0.7495
Number of obs: 246, groups: id, 82; age, 82
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.3039 0.1438 2.113
peer 0.6074 0.1151 5.276
Correlation of Fixed Effects:
(Intr)
peer -0.814
By lme
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2.lme)
Linear mixed-effects model fit by REML
Data: alcohol1
AIC BIC logLik
661.3109 678.7967 -325.6554
Random effects:
Formula: ~1 | id
(Intercept)
StdDev: 0.4381206
Formula: ~1 | age %in% id
(Intercept) Residual
StdDev: 0.4381203 0.7494988
Fixed effects: alcuse ~ 1 + peer
Value Std.Error DF t-value p-value
(Intercept) 0.3038946 0.1438333 164 2.112825 0.0361
peer 0.6073948 0.1151228 80 5.276060 0.0000
Correlation:
(Intr)
peer -0.814
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.0227793 -0.5309669 -0.1329302 0.5853768 3.1544873
Number of Observations: 246
Number of Groups:
id age %in% id
82 82
Okay, finally. Just to sketch my confidential data: I have a panel of individuals. The data include siblings, identified via mnr. income is earnings, wavey is the survey year, age holds age factors, female is a factor for gender, and pid is the factor identifying the individual.
m1 <- lmer(income ~ age + wavey + female + (1|pid) + (1 | mnr),
data = panel)
vv <- vcov(m1, full = TRUE)
covvar <- vv[58:60, 58:60]
covvar
3 x 3 Matrix of class "dgeMatrix"
cov_pid.(Intercept) cov_mnr.(Intercept) residual
[1,] 2.6528679 -1.4624588 -0.4077576
[2,] -1.4624588 3.1015001 -0.0597926
[3,] -0.4077576 -0.0597926 1.1634680
mean <- as.data.frame(VarCorr(m1))$vcov
mean
[1] 17.92341 16.86084 56.77185
deltamethod(~ x2/(x1+x2), mean, covvar, ses =TRUE)
[1] 0.04242089
The last scalar is the standard error (ses = TRUE) of the ratio that I interpret as the share of permanent income attributable to the siblings' shared background.
Thanks to @Ben Bolker, who pointed me in this direction.
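For readers who want to see what the deltamethod call does, here is a minimal base-R sketch of the same first-order approximation. It only reuses the numbers printed above; the names est, V, grad and se_rho are illustrative, and the assignment of x1 to pid and x2 to mnr follows the column order of covvar.

```r
# Variance-component estimates and their covariance block, copied from the output above
est <- c(pid = 17.92341, mnr = 16.86084, resid = 56.77185)
V <- matrix(c( 2.6528679, -1.4624588, -0.4077576,
              -1.4624588,  3.1015001, -0.0597926,
              -0.4077576, -0.0597926,  1.1634680),
            nrow = 3, byrow = TRUE)

# rho = mnr / (pid + mnr), the family share of permanent-income variance
s   <- est[["pid"]] + est[["mnr"]]
rho <- est[["mnr"]] / s

# Gradient of rho with respect to (pid, mnr, resid)
grad <- c(-est[["mnr"]] / s^2, est[["pid"]] / s^2, 0)

# Delta-method standard error: sqrt(grad' V grad)
se_rho <- sqrt(drop(t(grad) %*% V %*% grad))
print(c(rho = rho, se = se_rho))
```

The standard error agrees with the 0.0424 reported by deltamethod, and the point estimate of rho comes out at roughly 0.48.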

How can I make logistic model with this data?

http://www.statsci.org/data/oz/snails.txt
You can get data from here.
My data come from a 4*3*3*2 completely randomized design experiment. I want to model the probability of survival in terms of the stimulus variables.
I tried ANOVA, but I'm not sure whether it's right or not.
Because I want to model the "probability", should I use a logistic model?
(I also tried a logistic model, but the data give the number of deaths per group rather than 0 (survived) / 1 (died) for each snail. Can I still use logistic regression?)
I want to use the "probability" as the Y variable.
So I used the logit, but it's not working.
The program says that y is Inf.
How can I use the logit as the Y variable in aov?
glm_a <- glm(Deaths ~ Exposure + Rel.Hum + Temp + Species, data = data,
family = binomial)
prob <- Deaths / 20
logitt <- log(prob / (1 - prob))
logmodel <- lm(logitt ~ data$Species + data$Exposure + data$Rel.Hum + data$Temp)
summary(logmodel)
A <- factor(data$Species, levels = c("A", "B"), labels = c(-1, 1))
glm_a <- glm(Y ~ data$Species * data$Exposure * data$Rel.Hum * data$Temp,
data=data, family = binomial)
summary(glm_a)
help("glm") should direct you to help("family"), which reveals the following
For the binomial and quasibinomial families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
So for the question "How can I make logistic model with this data?", we can go with route #3 quite easily:
data <- read.table("http://www.statsci.org/data/oz/snails.txt", header = TRUE)
glm_a <- glm(cbind(Deaths, N - Deaths) ~ Species * Exposure * Rel.Hum * Temp,
data = data, family = binomial)
summary(glm_a)
# [output omitted]
As for the question "I tried ANOVA, but I'm not sure whether it's right or not. Because I want to model the "probability", should I use logistic model?", it's better to ask on Cross Validated
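To make the help("family") options concrete, here is a small sketch on made-up grouped data (the toy data frame and its columns dose, deaths and n are hypothetical, not the snails data). It shows that the two-column form and the proportion-plus-weights form give the same fit, and why taking the logit by hand produces Inf when a group has no deaths.

```r
# Hypothetical grouped data: deaths out of n = 20 at each dose
toy <- data.frame(dose = c(1, 2, 3, 4), deaths = c(0, 5, 11, 16), n = 20)

# Route 3: two-column matrix of successes and failures
fit_counts <- glm(cbind(deaths, n - deaths) ~ dose, family = binomial, data = toy)

# Route 2: proportion as the response, group size as the weights
fit_prop <- glm(deaths / n ~ dose, family = binomial, weights = n, data = toy)

print(all.equal(coef(fit_counts), coef(fit_prop)))

# A hand-computed logit is infinite whenever the observed proportion is 0 or 1
manual_logit <- qlogis(toy$deaths / toy$n)
print(manual_logit)
```

glm never evaluates the logit of an observed proportion directly, so groups with 0 or 20 deaths cause no trouble in either form.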

fitting non linear function to data : singular gradient issue

I am trying to fit data to a nonlinear model, but I get a "singular gradient" message when I build the model.
Here is the data:
> astrodata
temperature intensity
1 277.15 121
2 282.15 131
3 287.15 153
4 292.15 202
5 297.15 311
The function:
y= a * exp(-b * temperature) + c
What I did so far:
> temperature <- astrodata$temperature
temperature
[1] 277.15 282.15 287.15 292.15 297.15
> intensity <- astrodata$intensity
> c.0 <- min(temperature)*0.5
> c.0 <- min(intensity)*0.5
> model.0 <- lm(log(intensity - c.0) ~ temperature, data=astrodata)
> start <- list(a=exp(coef(model.0)[1]), b=coef(model.0)[2], c=c.0)
>
> model <- nls(intensity ~ a * exp(-b * temperature) + c, data = astrodata, start = start)
Error in nls(intensity ~ a * exp(b * temperature) + c, data = astrodata, :
singular gradient
Does anybody have an idea how to solve this?
The model is linear in a and c and only nonlinear in b. That suggests we try the "plinear" algorithm. It has the advantage that only the non-linear parameters require starting values.
Note that the formula specification for that algorithm is different and has a RHS which is a matrix with one column per linear parameter.
model <- nls(intensity ~ cbind(exp(-b * temperature), 1), data = astrodata,
start = start["b"], algorithm = "plinear")
giving:
> model
Nonlinear regression model
model: intensity ~ cbind(exp(-b * temperature), 1)
data: astrodata
b .lin1 .lin2
-1.598e-01 4.728e-19 1.129e+02
residual sum-of-squares: 0.003853
Number of iterations to convergence: 5
Achieved convergence tolerance: 2.594e-07
Also:
plot(intensity ~ temperature, astrodata)
lines(fitted(model) ~ temperature, astrodata)
Note: Based on the comment below you don't really need an nls model and it may be good enough to just use geom_line
library(ggplot2)
p <- ggplot(astrodata, aes(temperature, intensity)) + geom_point()
p + geom_line()
or splines:
p + geom_line(data = data.frame(spline(temperature, intensity)), aes(x, y))
Your data isn't varied enough.
nls uses least squares to work. This is a measurement of the distance between the model and the data points. If there is no distance, nls doesn't work. Your model fits the data exactly, this is called "zero-residual" data. Hence
singular gradient matrix at initial parameter estimates.
It's an overly complicated error message that simply means "There is no error to measure."
You only have 5 (x,y) combos, so this error is almost guaranteed using non-linear analysis with so little data. Use different data or more data.
One possibility is to double each data point, adding very tiny variations to the doubled data like so:
temperature intensity
1 277.15 121
2 282.15 131
3 287.15 153
4 292.15 202
5 297.15 311
11 277.15000001 121.000001
12 282.15000001 131.000001
13 287.15000001 153.000001
14 292.15000001 202.000001
15 297.15000001 311.000001
In the original data set, each point effectively has the same weight of 1.0, and in the "doubled" data set again each point effectively has the same weight of 2.0 so you get the same fitted parameter values but no error.

Tree sizes given by CP table in rpart

In the R package rpart, what determines the size of trees presented within the CP table for a decision tree? In the below example, the CP table defaults to presenting only trees with 1, 2, and 5 nodes (as nsplit = 0, 1 and 4 respectively).
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
> printcp(fit)
Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
method = "class")
Variables actually used in tree construction:
[1] Age Start
Root node error: 17/81 = 0.20988
n= 81
CP nsplit rel error xerror xstd
1 0.176471 0 1.00000 1.00000 0.21559
2 0.019608 1 0.82353 0.94118 0.21078
3 0.010000 4 0.76471 0.94118 0.21078
Is there an inherent rule rpart() used to determine what size of trees to present? And is it possible to force printcp() to return cross-validation statistics for all possible sizes of tree, i.e. for the above example, also include rows for trees with 3 and 4 nodes (nsplit = 2, 3)?
The rpart() function is controlled through rpart.control(). Its parameters include minsplit, the minimum number of observations a node must contain before a split is attempted, and cp, which tells the function to attempt a split only if it decreases the overall lack of fit by a factor of cp.
If you look at summary(fit) on your above example it shows the statistics for all values of nsplit. To get these values to print when using printcp(fit) you need to choose appropriate values of cp and minsplit when calling the original rpart function.
The rpart vignette on CRAN mentions adding the option cp = 0 to the rpart function: http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
It also mentions other options that can be passed to rpart, e.g. to control the number of splits.
dfit <- rpart(y ~ x, method='class',
control = rpart.control(xval = 10, minbucket = 2, cp = 0))
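As a runnable variant of the call above, the sketch below refits the kyphosis example from the question with cp = 0 and prints both CP tables side by side. The object names fit_default and fit_cp0 are illustrative, and the seed only fixes the random cross-validation folds.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random, so xerror varies between runs

# Default control (cp = 0.01), as in the question
fit_default <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                     method = "class")

# Same model, but no split is ruled out on complexity grounds
fit_cp0 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class", control = rpart.control(cp = 0))

print(fit_default$cptable)
print(fit_cp0$cptable)
```

Note that the table lists the nested sequence of subtrees produced by cost-complexity pruning, and one pruning step can remove several splits at once, so some values of nsplit may still be absent even with cp = 0.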

How to compute error rate from a decision tree?

Does anyone know how to calculate the error rate for a decision tree with R?
I am using the rpart() function.
Assuming you mean computing error rate on the sample used to fit the model, you can use printcp(). For example, using the on-line example,
> library(rpart)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
> printcp(fit)
Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Variables actually used in tree construction:
[1] Age Start
Root node error: 17/81 = 0.20988
n= 81
CP nsplit rel error xerror xstd
1 0.176471 0 1.00000 1.00000 0.21559
2 0.019608 1 0.82353 0.82353 0.20018
3 0.010000 4 0.76471 0.82353 0.20018
The Root node error is used to compute two measures of predictive performance from the values in the rel error and xerror columns, for a given complexity parameter (first column):
0.76471 x 0.20988 = 0.1604973 (16.0%) is the resubstitution error rate (i.e., the error rate computed on the training sample); this is roughly
class.pred <- table(predict(fit, type="class"), kyphosis$Kyphosis)
1-sum(diag(class.pred))/sum(class.pred)
0.82353 x 0.20988 = 0.1728425 (17.3%) is the cross-validated error rate (using 10-fold CV, see xval in rpart.control(); but see also xpred.rpart() and plotcp(), which rely on this kind of measure). This measure is a more objective indicator of predictive accuracy.
Note that it is more or less in agreement with classification accuracy from tree:
> library(tree)
> summary(tree(Kyphosis ~ Age + Number + Start, data=kyphosis))
Classification tree:
tree(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Number of terminal nodes: 10
Residual mean deviance: 0.5809 = 41.24 / 71
Misclassification error rate: 0.1235 = 10 / 81
where Misclassification error rate is computed from the training sample.
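The rpart resubstitution figure above can also be recovered programmatically. This is a minimal sketch that multiplies the last rel error entry of the CP table by the root node error and compares the result with the error rate computed directly from the training predictions; the variable names are illustrative.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Root node error: misclassified cases at the root divided by n
root_err <- fit$frame$dev[1] / fit$frame$n[1]

# "rel error" of the final tree (last row of the CP table)
rel_err <- fit$cptable[nrow(fit$cptable), "rel error"]
resub_from_table <- unname(rel_err * root_err)

# The same quantity computed directly from the training predictions
pred <- predict(fit, type = "class")
resub_direct <- mean(pred != kyphosis$Kyphosis)

print(c(from_table = resub_from_table, direct = resub_direct))
```

The cross-validated rate is obtained the same way from the xerror column, but it changes from run to run unless the random seed is fixed.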
