Conditional inference trees in the party package in R: how to predict the model and variable importance based on OOB data?

I use cforest from the party package in R to fit forests of conditional inference trees. As with Random Forest, I would like to retrieve the variance explained and the variable importance based on the OOB data (I have read that Random Forest returns variance explained and variable importance based on OOB data). To do so with cforest I used the following code:
model <- party::cforest(y ~ x1 + x2 + x3 + x4, data = trainings_set,
                        control = party::cforest_unbiased(ntree = 1000, minsplit = 25, minbucket = 8, mtry = 4))
model.pred <- predict(model, type = "response", OOB = TRUE)
R2 <- 1 - sum((trainings_set$y - model.pred)^2) / sum((trainings_set$y - mean(trainings_set$y))^2)
varimp_model <- party::varimp(model, conditional = TRUE, threshold = 0.2, OOB = TRUE)
My question is whether setting OOB=TRUE causes the predictions and the variable importance to be computed from the out-of-bag portion of trainings_set.
I posted this question before under a different title; I am posting it again, slightly redrafted, in the hope that someone can provide an answer.

In predict() for a cforest model, OOB is a logical controlling whether out-of-bag predictions are returned, and according to the party documentation it is only used when newdata = NULL. In that case each training observation is predicted only by the trees for which it was out of bag, so your call predict(model, type = "response", OOB = TRUE) does give out-of-bag predictions for trainings_set, and varimp(..., OOB = TRUE) likewise computes the permutation importance from the out-of-bag samples.
If you instead pass a newdata data frame (typically a test set), the OOB argument has no effect and you simply get ordinary predictions for that data.
I hope this clarifies your doubt.
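A minimal sketch of the two prediction modes (test_set here is a hypothetical held-out data frame with the same columns as trainings_set; the other names are taken from the question):
library(party)
model <- cforest(y ~ x1 + x2 + x3 + x4, data = trainings_set,
                 control = cforest_unbiased(ntree = 1000, mtry = 4))
# Out-of-bag predictions for the training data: newdata is left NULL, so
# each row is predicted only by trees that did not use it for fitting
oob_pred <- predict(model, type = "response", OOB = TRUE)
# OOB-based R^2, as computed in the question
R2_oob <- 1 - sum((trainings_set$y - oob_pred)^2) /
              sum((trainings_set$y - mean(trainings_set$y))^2)
# Predictions for genuinely new data: OOB is not relevant here
test_pred <- predict(model, newdata = test_set, type = "response")
# OOB-based conditional permutation importance
vi <- varimp(model, conditional = TRUE, OOB = TRUE)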

Related

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical outcomes, I don't know how to do the analyses. WeMix::mix() has a weights parameter, but I'm not sure whether it treats them as sampling weights. In any case, that function doesn't support the multinomial family.
So, to sum up: can you enlighten me on how to do a pre-post analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R and how to use/write them would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
  id = rep(1:5, 2),
  time = c(rep("Pre", 5), rep("Post", 5)),
  outcome1 = sample(c("Yes", "No"), 10, replace = TRUE),
  outcome2 = sample(c("Low", "Medium", "High"), 10, replace = TRUE),
  outcome3 = rnorm(10),
  group = rep(sample(c("Man", "Woman"), 5, replace = TRUE), 2),
  weight = rep(c(1, 0.5, 1.5, 0.75, 1.25), 2)
)
data_wide <- dcast(data_long, id ~ time,
                   value.var = c('outcome1', 'outcome2', 'outcome3', 'group', 'weight')
                   )[, `:=` (weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. It turns out that glmer runs into a lot of problems (convergence warnings, high eigenvalues...), so I took another look at @ThomasLumley's answer in this post and elsewhere (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So, my question now is whether I can use participant id as the clusters in svydesign:
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.875e+01 1.000e+00 18.746 0.0339 *
groupWoman -1.903e+01 1.536e+00 -12.394 0.0513 .
timePre 5.443e-09 5.443e-09 1.000 0.5000
groupWoman:timePre 2.877e-01 1.143e+00 0.252 0.8431
and still interpret groupWoman:timePre as the difference in the average rate of change/improvement in the outcome over time between the sex groups, as if I were using a mixed model with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs) I would recommend just using svyglm or svy_vglm.
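A rough sketch of what that could look like for the multinomial outcome in the example data (hedged: the argument names follow my reading of the svyVGAM documentation, so please check them against ?svy_vglm before relying on this):
library(survey)
library(svyVGAM)
library(VGAM)
# Design with participant id as the only cluster, as in the question
w_long <- svydesign(ids = ~id, data = data_long, weights = ~weight)
# Survey-weighted multinomial model for the three-level outcome
fit_mn <- svy_vglm(factor(outcome2) ~ group * time, design = w_long,
                   family = multinomial())
summary(fit_mn)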
Another option, if you have non-survey software for the random-effects versions of the models, is to use that. If you scale the weights to sum to the sample size and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
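A minimal sketch of the weight-scaling idea with the example data above (purely illustrative: it assumes lme4 is installed, uses the continuous outcome3, and treats id as the only source of clustering):
library(lme4)
# Scale the sampling weights so they sum to the sample size
data_long$scaled_wt <- data_long$weight * nrow(data_long) / sum(data_long$weight)
# Random-intercept model for the continuous outcome, using the scaled weights
fit_lmm <- lmer(outcome3 ~ group * time + (1 | id),
                data = data_long, weights = scaled_wt)
summary(fit_lmm)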

How to identify the non-zero coefficients in the final caret elastic net model

I have used caret to build an elastic net model using 10-fold CV and I want to see which coefficients are used in the final model (i.e. the ones that aren't shrunk to zero). I have used the following code to view the coefficients; however, this appears to create a data frame of every permutation of coefficient values used, rather than the ones used in the final model:
tr_control = train_control(method="cv", number=10)
formula = response ~ .
model1 = caret::train(formula,
                      data = training,
                      method = "glmnet",
                      trControl = tr_control,
                      metric = "Accuracy",
                      family = "binomial")
Then to extract the coefficients from the final model and using the best lambda value, I have used the following:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$.lambda)))
However, this just returns a data frame of all the coefficients, and I can see different instances where coefficients have been reduced to zero, but I'm not sure which set the final model uses. Using slightly different code, I get slightly different results; in this instance none of the coefficients are reduced to zero, which suggests to me that the final model isn't reducing any coefficients to zero:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$lambda))) # I have removed the full stop preceding lambda
Basically, I want to know which features are in the final model to assess how the model has performed as a feature reduction process (alongside standard model evaluation metrics such as accuracy, sensitivity etc).
Since you do not provide any example data, I post an example based on the built-in iris dataset, slightly modified to better fit your need (a binomial outcome).
First, modify the dataset:
library(caret)
set.seed(5)#just for reproducibility
iris
irisn <- iris[iris$Species!="virginica",]
irisn$Species <- factor(irisn$Species,levels = c("versicolor","setosa"))
str(irisn)
summary(irisn)
Fit the model (the caret function for setting the control parameters for train is trainControl, not train_control):
tr_control = trainControl(method="cv", number=10)
model1 <- caret::train(Species ~ .,
                       data = irisn,
                       method = "glmnet",
                       trControl = tr_control,
                       family = "binomial")
You can extract the coefficients of the final model as you already did:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$lambda)))
Here too the model did not reduce any coefficients to 0, but what if we add a random variable that explains nothing about the outcome?
irisn$new1 <- runif(nrow(irisn))
model2 <- caret::train(Species ~ .,
                       data = irisn,
                       method = "glmnet",
                       trControl = tr_control,
                       family = "binomial")
var <- data.frame(as.matrix(coef(model2$finalModel, model2$bestTune$lambda)))
Here, as you can see, the coefficient of the new variable has been shrunk to 0. You can extract the names of the variables retained by the model with:
rownames(var)[var$X1!=0]
Finally, the accuracy metrics from the test set can be obtained with
confusionMatrix(predict(model1,test),test$outcome)
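In the snippet above, test is assumed to be a held-out data set. For completeness, a minimal sketch of how such a split could be created for the irisn example (the object names idx, train, test and model1s are purely illustrative and not part of the original answer):
# Hypothetical 80/20 split of irisn, then refit and evaluate
idx     <- createDataPartition(irisn$Species, p = 0.8, list = FALSE)
train   <- irisn[idx, ]
test    <- irisn[-idx, ]
model1s <- caret::train(Species ~ ., data = train, method = "glmnet",
                        trControl = tr_control, family = "binomial")
confusionMatrix(predict(model1s, test), test$Species)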

Variable importance in Caret

I am using the caret package in R to train a logistic regression model for a binary classification problem. I have been able to get the results, accuracy, etc., but I also want the importance of the variables (in decreasing order of importance). I used the varImp() function. But according to the documentation, the importance depends on the class:
"For most classification models, each predictor will have a separate variable importance for each class (the exceptions are classification trees, bagged trees and boosted trees)."
How can I get the variable importance for each class?
Thank you
For the first part, have you tried:
round(importance(myModel$finalModel), 2)
For putting that in decreasing order:
imp <- round(importance(myModel$finalModel), 2)   # importance() comes from randomForest, so this assumes finalModel is a random forest
dfimp <- data.frame(feature = rownames(imp), MeanDecreaseGini = as.numeric(imp))
dfimp[order(dfimp$MeanDecreaseGini, decreasing = TRUE), ]
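If the model is a caret-trained logistic regression rather than a random forest, a hedged sketch using caret's own varImp() on the train object (myModel is assumed to be the object returned by caret::train):
library(caret)
vi <- varImp(myModel, scale = FALSE)   # varImp applied to the train object
vi_df <- vi$importance                 # data frame of importances; one column per class for many classifiers
vi_df[order(vi_df[[1]], decreasing = TRUE), , drop = FALSE]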

How can I pass a weight decay argument to mlogit()?

How can I specify weight decay in a model fit by the mlogit?
The multinom() function in nnet allows you to specify weight decay for the model being fit, and mlogit uses this function behind the scenes to fit its models, so I imagine it should be possible to pass the decay argument through to multinom, but I have not so far found a way to do this.
So far I have simply attempted to pass a decay value in the call to mlogit, like this:
library(mlogit)
set.seed(1)
data("Fishing", package = "mlogit")
Fishing$wts <- runif(nrow(Fishing)) #for some weights
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
fit1 <- mlogit(mode ~ 0 | income, data = Fish, weights = wts, decay = .01)
fit2 <- mlogit(mode ~ 0 | income, data = Fish, weights = wts)
But the output is exactly the same:
identical(logLik(fit1), logLik(fit2))
[1] TRUE
mlogit() and nnet::multinom() both fit multinomial logistic models (predicting probability of class membership for multiple classes) but they use different algorithms to fit the model. nnet::multinom() uses a neural network to fit the model and mlogit() uses maximum likelihood.
Weight decay is a parameter for neural networks and is not applicable to maximum likelihood.
The effect of weight decay is to keep the weights in the neural network from getting too large, by penalizing larger weights during the weight-update step of the fitting algorithm. This helps to prevent over-fitting and hopefully produces a more general model.
Consider using the pmlr function in the pmlr package. This function implements a "Penalized maximum likelihood estimation for multinomial logistic regression" when called with the default function parameter penalized = TRUE.
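If it is specifically weight decay you are after, a possible alternative (a sketch, not part of the original answer) is to fit the individual-specific-covariate model directly with nnet::multinom(), which does accept decay; this corresponds to the mode ~ 0 | income specification above rather than a full conditional logit:
library(nnet)
# decay is passed through multinom() to nnet(); wts are the case weights from the question
fit_decay <- multinom(mode ~ income, data = Fishing, weights = wts, decay = 0.01)
fit_plain <- multinom(mode ~ income, data = Fishing, weights = wts)
# Unlike the mlogit calls above, the penalty now changes the fit
logLik(fit_decay)
logLik(fit_plain)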

The Effect of Specifying Training Data as New Data when Making Random Forest Predictions in R

While using the predict function in R to get the predictions from a Random Forest model, I misspecified the training data as newdata as follows:
RF1pred <- predict(RF1, newdata=TrainS1, type = "class")
Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there.
If someone could elaborate, I will be grateful.
Thank you!
EDIT: Important to note: I get sensible accuracy and AUC if I run the prediction without specifying a dataset at all, like so:
RF1pred <- predict(RF1, type = "class")
If a new dataset is not explicitly specified, isn't the training data used for prediction? Hence, shouldn't I get the same results from both lines of code?
EDIT2: Here is a sample code with random data that illustrates the point. When predicting without specifying newdata, the AUC is 0.4893. When newdata=train is explicitly specified, the AUC is 0.7125.
# Generate sample data
set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=T), x2=rpois(100,10), y=sample(0:1, 100, replace=T))
# Build random forest
library(randomForest)
model <- randomForest(x1 ~ x2, data=train)
pred1 <- predict(model)
pred2 <- predict(model, newdata = train)
# Calculate AUC
library(ROCR)
ROCRpred1 <- prediction(pred1, train$x1)
AUC <- as.numeric(performance(ROCRpred1, "auc")@y.values)
AUC # 0.4893
ROCRpred2 <- prediction(pred2, train$x1)
AUC <- as.numeric(performance(ROCRpred2, "auc")@y.values)
AUC # 0.7125
If you look at the documentation for predict.randomForest you will see that if you do not supply a new data set you will get the out-of-bag (OOB) performance of the model. Since the OOB performance is theoretically related to the performance of your model on a different data set, the results will be much more realistic (although still not a substitute for a real, independently collected, validation set).
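As a quick check on the example above (a sketch; model$predicted is the slot in a randomForest regression object that stores the out-of-bag predictions):
# With no newdata, predict() returns the stored out-of-bag predictions ...
all.equal(as.numeric(predict(model)), as.numeric(model$predicted))
# ... while newdata = train lets every tree score rows it was trained on,
# which is what inflates the apparent AUC
head(cbind(oob = predict(model), refit = predict(model, newdata = train)))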
