Computing deviance for conditional inference trees - r

I am trying to implement the use of conditional inference trees (by package partykit) as induction trees, which purpose is merely describing and not predicting individual cases. According to Ritschard here, here and there, for example, a measure of deviance can be estimated by comparing by means of cross-tabs the real and estimated distributions of the response variable in relationship to the possible predictors-based profiles, the so called ^T T and tables.
I would like to use deviance and other derivated statistics as a GOF measure of objects obtained by ctree() function. I am introducing myself to this topic, and I would very much appreciate some input, such as a piece of R code or some orientation about the structure of ctree objects that could be involved in the coding.
I have thought myself that I could, from scratch, obtain both target and predicted tables and compute later the deviance formula. I confess being not confident at all about how to proceed though.
Thanks a lot beforehand!

Some background information first: We have discussed adding deviance() or logLik() methods for ctree objects. So far we haven't done so because conditional inference trees are not associated with a particular loss function or even likelihood. Instead, only the associations between response and partitioning variables are assessed by means of conditional inference tests using certain influence and regressor transformations. However, for the default regression and classification case, measures of deviance or log-likelihood can be a useful addition in practice. So maybe we will add these methods in future versions.
If you want to consider trees associated with a formal deviance/likelihood, you may consider using the general mob() framework or the lmtree() and glmtree() convenience functions. If only partitioning variables are specified (and no further regressors to be used in every node), these often lead to very similar trees compared to ctree(). But then you can also use AIC() etc.
But to come back to your original question: You can compute deviance/log-likelihood or other loss functions fairly easily if you look at the model response and the fitted response. Alterantively, you can extract a factor variable that indicates the terminal nodes and refit a linear or multinomial model. This will have the same fitted values but also supply deviance() and logLik(). Below, I illustrate this with the airct and irisct trees that you obtain when running example("ctree", package = "partykit").
Regression: The Gaussian deviance is simply the residual sum of squares:
sum((airq$Ozone - predict(airct, newdata = airq, type = "response"))^2)
## [1] 46825.35
The same can be obtained by re-fitting as a linear regression model:
airq$node <- factor(predict(airct, newdata = airq, type = "node"))
airlm <- lm(Ozone ~ node, data = airq)
deviance(airlm)
## [1] 46825.35
logLik(airlm)
## 'log Lik.' -512.6311 (df=6)
Classification: The log-likelihood is simply the sum of the predicted log-probabilities at the observed classes. And the deviance is -2 times the log-likelihood:
irisprob <- predict(irisct, type = "prob")
sum(log(irisprob[cbind(1:nrow(iris), iris$Species)]))
## [1] -15.18056
-2 * sum(log(irisprob[cbind(1:nrow(iris), iris$Species)]))
## [1] 30.36112
Again, this can also be obtained by re-fitting as a multinomial model:
library("nnet")
iris$node <- factor(predict(irisct, newdata = iris, type = "node"))
irismultinom <- multinom(Species ~ node, data = iris, trace = FALSE)
deviance(irismultinom)
## [1] 30.36321
logLik(irismultinom)
## 'log Lik.' -15.1816 (df=8)
See also the discussion in https://stats.stackexchange.com/questions/6581/what-is-deviance-specifically-in-cart-rpart for the connections between regression and classification trees and generalized linear models.

Related

Optimizing a GAM for Smoothness

I am currently trying to generate a general additive model in R using a response variable and three predictor variables. One of the predictors is linear, and the dataset consists of 298 observations.
I have run the following code to generate a basic GAM:
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
This produces a model with 18 degrees of freedom and seems to substantially overfit the data. I'm wondering how I might generate a GAM that maximizes smoothness and predictive error. I realize that each of these features is going to come at the expense of the other, but is there good a way to find the optimal model that doesn't overfit?
Additionally, I need to perform leave one out cross validation (LOOCV), and I am not sure how to make sure that gam() does this in the MGCV package. Any help on either of these problems uld be greatly appreciated. Thank you.
I've run this to generate a GAM, but it overfits the data.
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
I have also generated 1,000,000 GAMs with varying combinations of smoothing parameters and ranged the maximum degrees of freedom allowed from 10 (as shown in the code below) to 19. The variable "combinations2" is a list of all 1,000,000 combinations of smoothers I selected. This code is designed to try and balance degrees of freedom and AIC score. It does function, but I'm not sure that I'm actually going to be able to find the optimal model from this. I also cannot tell how to make sure that it uses LOOCV.
BestGAM <- gam(response~ linearpredictor+ predictor2+ predictor3, data = data[2:5])
for(i in 1:100000){
PotentialGAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5], sp=c(combinations2[i,]$Var1,combinations2[i,]$Var2))
if (AIC(PotentialGAM,BestGAM)$df[1] <= 10 & AIC(PotentialGAM,BestGAM)$AIC[1] < AIC(PotentialGAM,BestGAM)$AIC[2]){
BestGAM <<- PotentialGAM
listNumber <- i
}
}
You are fitting your GAM using generalised cross validation (GCV) smoothness selection. GCV is a way to get around the invariance problem of ordinary cross validation (OCV; what you also call LOOCV) when estimating GAMs. Note that GCV is the same as OCV on a rotated version of the fitting problem (rotating y - Xβ by Q, any orthogonal matrix), and while when fitting with GCV {mgcv} doesn't actually need to do the rotation and the expected GCV score isn't affected by the rotation, GCV is just OCV (wood 2017, p. 260)
It has been shown that GCV can undersmooth (resulting in more wiggly models) as the objective function (GCV profile) can become flat around the optimum. Instead it is preferred to estimate GAMs (with penalized smooths) using REML or ML smoothness selection; add method = "REML" (or "ML") to your gam() call.
If the REML or ML fit is as wiggly as the GCV one with your data, then I'd be likely to presume gam() is not overfitting, but that there is something about your response data that hasn't been explained here (are the data ordered in time, for example?)
As to your question
how I might generate a GAM that maximizes smoothness and [minimize?] predictive error,
you are already doing that using GCV smoothness selection and for a particular definition of "smoothness" (in this case it is squared second derivatives of the estimated smooths, integrated over the range of the covariates, and summed over smooths).
If you want GCV but smoother models, you can increase the gamma argument above 1; gamma 1.4 is often used for example, which means that each EDF costs 40% more in the GCV criterion.
FWIW, you can get the LOOCV (OCV) score for your model without actually fitting 288 GAMs through the use of the influence matrix A. Here's a reproducible example using my {gratia} package:
library("gratia")
library("mgcv")
df <- data_sim("eg1", seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
A <- influence(m)
r <- residuals(m, type = "response")
ocv_score <- mean(r^2 / (1 - A))

Variable selection methods

I have been doing variable selection for a modeling problem.
I have used trial and error for the selection (adding / removing a variable) with a decrease in error. However, I have the challenge as the number of variables grows into the hundreds that manual variable selection can not be performed as the model takes 1/2 hour to compute, rendering the task impossible.
Would you happen to know of any other packages than the regsubsets from the leaps package (which when tested with the same trial and error variables produced a higher error, it did not include some variables which were lineraly dependant - excluding some valuable variables).
You need a better (i.e. not flawed) approach to model selection. There are plenty of options, but one that should be easy to adapt to your situation would be using some form of regularization, such as the Lasso or the elastic net. These apply shrinkage to the sizes of the coefficients; if a coefficient is shrunk from its least squares solution to zero, that variable is removed from the model. The resulting model coefficients are slightly biased but they have lower variance than the selected OLS terms.
Take a look at the lars, glmnet, and penalized packages
Try using the stepAIC function of the MASS package.
Here is a really minimal example:
library(MASS)
data(swiss)
str(swiss)
lm <- lm(Fertility ~ ., data = swiss)
lm$coefficients
## (Intercept) Agriculture Examination Education Catholic
## 66.9151817 -0.1721140 -0.2580082 -0.8709401 0.1041153
## Infant.Mortality
## 1.0770481
st1 <- stepAIC(lm, direction = "both")
st2 <- stepAIC(lm, direction = "forward")
st3 <- stepAIC(lm, direction = "backward")
summary(st1)
summary(st2)
summary(st3)
You should try the 3 directions and ckeck which model works better with your test data.
Read ?stepAIC and take a look at the examples.
EDIT
It's true stepwise regression isn't the greatest method. As it's mentioned in GavinSimpson answer, lasso regression is a better/much more efficient method. It's much faster than stepwise regression and will work with large datasets.
Check out the glmnet package vignette:
http://www.stanford.edu/~hastie/glmnet/glmnet_alpha.html

Which function/package for robust linear regression works with glmulti (i.e., behaves like glm)?

Background: Multi-model inference with glmulti
glmulti is a R function/package for automated model selection for general linear models that constructs all possible general linear models given a dependent variable and a set of predictors, fits them via the classic glm function and allows then for multi-model inference (e.g., using model weights derived from AICc, BIC). glmulti works in theory also with any other function that returns coefficients, the log-likelihood of the model and the number of free parameters (and maybe other information?) in the same format that glm does.
My goal: Multi-model inference with robust errors
I would like to use glmulti with robust modeling of the errors of a quantitative dependent variable to guard against the effect out outliers.
For example, I could assume that the errors in the linear model are distributed as a t distribution instead of as a normal distribution. With its kurtosis parameter the t distribution can have heavy tails and is thus more robust to outliers (as compared to the normal distribution).
However, I'm not committed to using the t distribution approach. I'm happy with any approach that gives back a log-likelihood and thus works with the multimodel approach in glmulti. But that means, that unfortunately I cannot use the well-known robust linear models in R (e.g., lmRob from robust or lmrob from robustbase) because they do not operate under the log-likelihood framework and thus cannot work with glmulti.
The problem: I can't find a robust regression function that works with glmulti
The only robust linear regression function for R I found that operates under the log-likelihood framework is heavyLm (from the heavy package); it models the errors with a t distribution. Unfortunately, heavyLm does not work with glmulti (at least not out of the box) because it has no S3 method for loglik (and possibly other things).
To illustrate:
library(glmulti)
library(heavy)
Using the dataset stackloss
head(stackloss)
Regular Gaussian linear model:
summary(glm(stack.loss ~ ., data = stackloss))
Multi-model inference with glmulti using glm's default Gaussian link function
stackloss.glmulti <- glmulti(stack.loss ~ ., data = stackloss, level=1, crit=bic)
print(stackloss.glmulti)
plot(stackloss.glmulti)
Linear model with t distributed error (default is df=4)
summary(heavyLm(stack.loss ~ ., data = stackloss))
Multi-model inference with glmulti calling heavyLm as the fitting function
stackloss.heavyLm.glmulti <- glmulti(stack.loss ~ .,
data = stackloss, level=1, crit=bic, fitfunction=heavyLm)
gives the following error:
Initialization...
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "heavyLm".
If I define the following function,
logLik.heavyLm <- function(x){x$logLik}
glmulti can get the log-likelihood, but then the next error occurs:
Initialization...
Error in .jcall(molly, "V", "supplyErrorDF",
as.integer(attr(logLik(fitfunc(as.formula(paste(y, :
method supplyErrorDF with signature ([I)V not found
The question: Which function/package for robust linear regression works with glmulti (i.e., behaves like glm)?
There is probably a way to define further functions to get heavyLm working with glmulti, but before embarking on this journey I wanted to ask whether anybody
knows of a robust linear regression function that (a) operates under the log-likelihood framework and (b) behaves like glm (and will thus work with glmulti out-of-the-box).
got heavyLm already working with glmulti.
Any help is very much appreciated!
Here is an answer using heavyLm. Even though this is a relatively old question, the same problem that you mentioned still occurs when using heavyLm (i.e., the error message Error in .jcall(molly, "V", "supplyErrorDF"…).
The problem is that glmulti requires the degrees of freedom of the model, to be passed as an attribute of you need to provide as an attribute of the value returned by function logLik.heavyLm; see the documentation for the function logLik for details. Moreover, it turns out that you also need to provide a function to return the number of data points that were used for fitting the model, since the information criteria (AIC, BIC, …) depend on this value too. This is done by function nobs.heavyLm in the code below.
Here is the code:
nobs.heavyLm <- function(mdl) mdl$dims[1] # the sample size (number of data points)
logLik.heavyLm <- function(mdl) {
res <- mdl$logLik
attr(res, "nobs") <- nobs.heavyLm(mdl) # this is not really needed for 'glmulti', but is included to adhere to the format of 'logLik'
attr(res, "df") <- length(mdl$coefficients) + 1 + 1 # I am also considering the scale parameter that is estimated; see mdl$family
class(res) <- "logLik"
res
}
which, when put together with the code that you provided, produces the following result:
Initialization...
TASK: Exhaustive screening of candidate set.
Fitting...
Completed.
> print(stackloss.glmulti)
glmulti.analysis
Method: h / Fitting: glm / IC used: bic
Level: 1 / Marginality: FALSE
From 8 models:
Best IC: 117.892471265874
Best model:
[1] "stack.loss ~ 1 + Air.Flow + Water.Temp"
Evidence weight: 0.709174196998897
Worst IC: 162.083142797858
2 models within 2 IC units.
1 models to reach 95% of evidence weight.
producing therefore 2 models within the 2 BIC units threshold.
An important remark though: I am not sure that the expression above for the degrees of freedom is strictly correct. For a standard linear model, the degrees of freedom would be equal to p + 1, where p is the number of parameters in the model, and the extra parameter (the + 1) is the "error" variance (which is used to calculate the likelihood). In function logLik.heavyLm above, it is not clear to me whether one should also count the "scale parameter" that is estimated by heavyLm as an extra degree of freedom, and hence the p + 1 + 1, which would be the case if the likelihood is also a function of this parameter. Unfortunately, I cannot confirm this, since I don’t have access to the reference that heavyLm cites (the paper by Dempster et al., 1980). Because of this, I am counting the scale parameter, thereby providing a (slightly more) conservative estimate of model complexity, penalizing "complex" models. This difference should be negligible, except in the small sample case.

Logistic Regression Using R

I am running logistic regressions using R right now, but I cannot seem to get many useful model fit statistics. I am looking for metrics similar to SAS:
http://www.ats.ucla.edu/stat/sas/output/sas_logit_output.htm
Does anyone know how (or what packages) I can use to extract these stats?
Thanks
Here's a Poisson regression example:
## from ?glm:
d.AD <- data.frame(counts=c(18,17,15,20,10,20,25,13,12),
outcome=gl(3,1,9),
treatment=gl(3,3))
glm.D93 <- glm(counts ~ outcome + treatment,data = d.AD, family=poisson())
Now define a function to fit an intercept-only model with the same response, family, etc., compute summary statistics, and combine them into a table (matrix). The formula .~1 in the update command below means "refit the model with the same response variable [denoted by the dot on the LHS of the tilde] but with only an intercept term [denoted by the 1 on the RHS of the tilde]"
glmsumfun <- function(model) {
glm0 <- update(model,.~1) ## refit with intercept only
## apply built-in logLik (log-likelihood), AIC,
## BIC (Bayesian/Schwarz Information Criterion) functions
## to models with and without intercept ('model' and 'glm0');
## combine the results in a two-column matrix with appropriate
## row and column names
matrix(c(logLik(glm.D93),BIC(glm.D93),AIC(glm.D93),
logLik(glm0),BIC(glm0),AIC(glm0)),ncol=2,
dimnames=list(c("logLik","SC","AIC"),c("full","intercept_only")))
}
Now apply the function:
glmsumfun(glm.D93)
The results:
full intercept_only
logLik -23.38066 -26.10681
SC 57.74744 54.41085
AIC 56.76132 54.21362
EDIT:
anova(glm.D93,test="Chisq") gives a sequential analysis of deviance table containing df, deviance (=-2 log likelihood), residual df, residual deviance, and the likelihood ratio test (chi-squared test) p-value.
drop1(glm.D93) gives a table with the AIC values (df, deviances, etc.) for each single-term deletion; drop1(glm.D93,test="Chisq") additionally gives the LRT test p value.
Certainly glm with a family="binomial" argument is the function most commonly used for logistic regression. The default handling of contrasts of factors is different. R uses treatment contrasts and SAS (I think) uses sum contrasts. You can look these technical issues up on R-help. They have been discussed many, many times over the last ten+ years.
I see Greg Snow mentioned lrm in 'rms'. It has the advantage of being supported by several other functions in the 'rms' suite of methods. I would use it , too, but learning the rms package may take some additional time. I didn't see an option that would create SAS-like output.
If you want to compare the packages on similar problems that UCLA StatComputing pages have another resource: http://www.ats.ucla.edu/stat/r/dae/default.htm , where a large number of methods are exemplified in SPSS, SAS, Stata and R.
Using the lrm function in the rms package may give you the output that you are looking for.

Is there an equivalence of "anova" (for lm) to an rpart object?

When using R's rpart function, I can easily fit a model with it. for example:
# Classification Tree with rpart
library(rpart)
# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
method="class", data=kyphosis)
printcp(fit) # display the results
plotcp(fit)
summary(fit) # detailed summary of splits
# plot tree
plot(fit, uniform=TRUE,
main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
My question is -
How can I measure the "importance" of each of my three explanatory variables (Age, Number, Start) to the model?
If this was a regression model, I could have looked at p-values from the "anova" F-test (between lm models with and without the variable). But what is the equivalence of using "anova" on lm to an rpart object?
(I hope I managed to make my question clear)
Thanks.
Of course anova would be impossible, as anova involves calculating the total variation in the response variable and partitioning it into informative components (SSA, SSE). I can't see how one could calculate sum of squares for a categorical variable like Kyphosis.
I think that what you actually talking about is Attribute Selection (or evaluation). I would use the information gain measure for example. I think that this is what is used to select the test attribute at each node in the tree and the attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions.
I am not aware whether there is a method of ranking attributes according to their information gain in R, but I know that there is in WEKA and is named InfoGainAttributeEval It evaluates the worth of an attribute by measuring the information gain with respect to the class. And if you use Ranker as the Search Method, the attributes are ranked by their individual evaluations.
EDIT
I finally found a way to do this in R using Library CORElearn
estInfGain <- attrEval(Kyphosis ~ ., kyphosis, estimator="InfGain")
print(estInfGain)

Resources