I'm trying to use predictions from a random survival forest computed using Ranger to calculate a c-index at specific time points. I know this can be done easily for a coxph model with the following code:
cox_model = coxph(Surv(time, status == 1) ~ ., data = train)
c_index_test <- pec::cindex(cox_model, formula = cox_model$formula, data = test, eval.times = c(30, 90, 730))
#want to evaluate at 1 month, 3 months, and 2 years
However, although I can calculate a c-index at these time points easily with a random forest generated using rfsrc(), I haven't been able to do it using ranger.
In addition to the pec cindex() function (which doesn't work with objects of class "ranger"), I've also tried the concordance.index function (part of the survcomp package) and different ways of using predict.ranger to generate survival probability predictions, but nothing has worked.
If anyone can provide code showing how to calculate the c-index of a ranger RSF (at specific time points and on an external validation set), I would appreciate it immensely! I've been able to do it with randomForestSRC, but it takes so long that my R session often times out, and I haven't been able to get ANY results from runs with more than 10 trees.

The ranger package computes Harrell's c-index, which is similar to the concordance statistic. If you have a fitted model rf, the attribute prediction.error (computed on the out-of-bag samples) is equivalent to 1 - Harrell's c-index. Have a look at the following link for more details.
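For time-specific values on an external validation set, one possible approach is sketched below. It is an assumption-laden illustration, not the pec method: it takes the predicted survival probability at each evaluation time from predict.ranger, turns it into a risk score, and passes it to survival::concordance with follow-up truncated at that time (ymax). The veteran data, the train/test split, and the helper name cindex_at are placeholders chosen for the example. Note that pec::cindex uses an inverse-probability-of-censoring-weighted estimator, so the numbers will not match it exactly.

```r
library(survival)
library(ranger)

set.seed(1)
veteran <- survival::veteran                 # example data, stands in for your own
train_rows <- sample(nrow(veteran), 100)
train <- veteran[train_rows, ]
test  <- veteran[-train_rows, ]

rf <- ranger(Surv(time, status) ~ ., data = train, num.trees = 200)
pred <- predict(rf, data = test)             # $survival: one column per unique death time

# Hypothetical helper: truncated Harrell-type C-index at evaluation time t
cindex_at <- function(t, pred, newdata) {
  # last training death time that is <= t (fall back to the first column)
  col <- max(1, findInterval(t, pred$unique.death.times))
  d <- data.frame(time = newdata$time,
                  status = newdata$status,
                  risk = 1 - pred$survival[, col])  # higher value = higher risk
  # reverse = TRUE: a higher risk score should go with a shorter survival time
  concordance(Surv(time, status) ~ risk, data = d,
              ymax = t, reverse = TRUE)$concordance
}

c_idx <- sapply(c(30, 90, 730), cindex_at, pred = pred, newdata = test)
print(c_idx)
```

The same helper can be applied to any test set that has the time and status columns used to fit the forest.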


Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assigned missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets were generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the original and imputed data distributions is skipped for the sake of the example. The "test" variable is the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimates from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function can generate predicted values with a variable fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where this variable is set to 3 for all observations. Normally, I would just pass my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model from mice, the object is of class mipo, not glm.
#-------------------marginal standardization--------#
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(mice)
library(marginaleffects)
# Fit the model on one completed dataset and predict with pregnant fixed at 3
fit_reg <- function(dat) {
    mod <- glm(test~pregnant+glucose+diastolic+
                   triceps+age+bmi,
               data = dat, family = binomial)
    out <- predictions(mod, newdata = datagrid(pregnant=3))
    return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
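Option (b) from the question, rebuilding predictions from the pooled coefficients by hand, might look roughly like the sketch below. It uses simulated data with only two predictors rather than pima, and every name in it (dat, beta_pooled, p_std) is made up for the illustration. It pools only the point estimates (the Rubin's rules point estimate is the average of the per-imputation coefficients) and gives no standard error for the standardized probability, so treat it as a starting point rather than a complete analysis.

```r
library(mice)

set.seed(42)
n <- 300
# Simulated stand-in for the real data: binary outcome, two predictors
dat <- data.frame(pregnant = rpois(n, 3), glucose = rnorm(n, 120, 30))
dat$test <- rbinom(n, 1, plogis(-5 + 0.1 * dat$pregnant + 0.03 * dat$glucose))
dat$glucose[sample(n, 60)] <- NA          # introduce missing values

imp <- mice(dat, m = 5, printFlag = FALSE, seed = 1024)
completed <- complete(imp, "all")         # list of m completed datasets

# Fit the same logistic model on every completed dataset
fits <- lapply(completed, function(d) {
  glm(test ~ pregnant + glucose, data = d, family = binomial)
})

# Pooled point estimate: mean of the coefficient vectors across imputations
beta_pooled <- Reduce(`+`, lapply(fits, coef)) / length(fits)

# Counterfactual prediction: set pregnant = 3 for everyone, apply the pooled
# coefficients, and average the predicted probabilities within each dataset
p_by_imp <- sapply(completed, function(d) {
  d$pregnant <- 3
  X <- model.matrix(~ pregnant + glucose, data = d)
  mean(plogis(drop(X %*% beta_pooled)))
})

p_std <- mean(p_by_imp)                   # averaged over the m imputations
print(p_std)
```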

Negative binomial model with multiply imputed, weighted dataset in R

I am running an analysis of hospital length of stay based on a number of parameters in R, with one primary exposure. Many of the covariates are missing, most commonly lab values, because these aren't checked for all patients. I've worked out what seems to be a good multiple imputation schema using MICE. Because of imbalance between exposed and unexposed groups, I'm also weighting using propensity scores.
I've managed to run a successful weighted Poisson model with MICE and WeightThem. However, when I checked the models for overdispersion, it does appear that the variance is greater than the mean, implying I should be using a quasipoisson or negative binomial model. However, I can't find documentation on negative binomial models with WeightThem or WeightIt in R.
Does anyone have any experience with this? To run a negative binomial model, I can just use the following code:
results <- with(models, MASS::glm.nb(LOS ~ exposure + covariate1 + covariate2))
in which "models" is the multiply-imputed WeightIt object.
However, according to the WeightIt documentation, when using any glm model you need to run it as a svyglm to get proper standard errors:
results <- with(models, svyglm(LOS ~ exposure + covariate1 + covariate2,
family = poisson()))
There is a function in the sjstats package called svyglm.nb, but this requires creating a design matrix or the model won't run. I have no idea how/whether this is necessary - is the first version (just glm.nb) sufficient? Am I entirely thinking about this wrong?
Thanks so much, advice is much appreciated.
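For reference, the overdispersion check mentioned in the question can be sketched in base R as the Pearson chi-square statistic divided by the residual degrees of freedom. The data below are simulated, and the sketch deliberately ignores both the propensity weights and the multiple imputation, so it only illustrates the diagnostic itself, not the full weighted analysis. Values well above 1 suggest that a plain Poisson model understates the standard errors.

```r
set.seed(1)
n <- 500
exposure <- rbinom(n, 1, 0.5)
# Simulated overdispersed counts standing in for length of stay
los <- rnbinom(n, mu = exp(1 + 0.3 * exposure), size = 1.5)

fit_pois <- glm(los ~ exposure, family = poisson)

# Pearson dispersion estimate: about 1 under a correctly specified Poisson model
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
print(dispersion)

# A quasi-Poisson fit keeps the same coefficients and scales the standard
# errors by the square root of this dispersion estimate
fit_qp <- glm(los ~ exposure, family = quasipoisson)
print(summary(fit_qp)$dispersion)
```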

Predicted probabilities in R ranger package

I am trying to build a random forest classification model in R, adapting code by Ned Horning. I first used the randomForest package but then found ranger, which promises faster calculations.
At first, I used the code below to get predicted probabilities for each class after fitting the model with randomForest as:
predProbs <-, imageBlock, type='prob'))
The type of probability here is as follows:
We have 500 trees in the model and 250 of them say the observation is class 1, hence the probability is 250/500 = 50%.
In ranger, I realized that there is no type = 'prob' option.
I searched and tried some adjustments but couldn't make any progress. I need an object containing the probabilities described above, produced with the ranger package.
Could anyone give some advice about the issue?
You need to train a "probabilistic classifier"-type ranger object:
iris.ranger = ranger(Species ~ ., data = iris, probability = TRUE)
This object computes a matrix (n_samples, n_classes) when used in the predict.ranger function:
probabilities = predict(iris.ranger, data = iris)$predictions
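A slightly fuller sketch of the same idea follows. The object names are illustrative only. Two points are worth keeping in mind: the predictions stored in the fitted object are out-of-bag estimates, while predict() on the training rows is not, and a probability forest averages per-tree class proportions rather than counting votes, so the values need not equal the votes/500 fractions described in the question.

```r
library(ranger)

set.seed(1)
rf <- ranger(Species ~ ., data = iris, num.trees = 500, probability = TRUE)

# Out-of-bag class probabilities for the training rows (n_samples x n_classes)
oob_probs <- rf$predictions

# Class probabilities for new rows (three training rows reused here for brevity)
new_probs <- predict(rf, data = iris[c(1, 51, 101), ])$predictions
print(round(new_probs, 3))

# Most probable class per row
pred_class <- colnames(new_probs)[max.col(new_probs)]
print(pred_class)
```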

How to extract the value of the loss function of Cox models from glmnet in R?

I fit a Cox model to some data with the glmnet R package. My small R example is:
library(glmnet); library(survival)  # glmnet() and Surv()
library(fastcox); data(FHT)         # FHT provides x, y and status
fit = glmnet(FHT$x, Surv(FHT$y, FHT$status), family = "cox", alpha = 1)
From the help document, we know glmnet fits penalized models like
-loglik/nobs + λ*penalty
i.e., objective function = loss function + penalty function.
I want to extract -loglik/nobs (the loss function value, i.e. the negative partial log-likelihood of the fitted model, or its two-term Taylor series expansion) from the fit object.
Any ideas? Thanks.
BTW, we also tried
fit0 = glmnet(x,Surv(y,status),family="cox",alpha=1,lambda=0)
according to -loglik/nobs + λ*penalty, but it shows errors.
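One way to obtain the loss term without relying on the internals of the fit object is to recompute it from a coefficient vector. The sketch below writes out the negative partial log-likelihood with Breslow handling of ties, divided by the number of observations, in base R. The function name cox_loss is made up here, and glmnet's own internal scaling and tie handling may differ, so this is a check to compare against rather than the value glmnet stores.

```r
# Negative Breslow partial log-likelihood divided by the number of observations.
# beta: coefficient vector; x: numeric predictor matrix;
# time: observed times; status: 1 = event, 0 = censored
cox_loss <- function(beta, x, time, status) {
  eta <- drop(x %*% beta)                    # linear predictor
  event_rows <- which(status == 1)
  loglik <- sum(vapply(event_rows, function(i) {
    at_risk <- time >= time[i]               # risk set at the i-th event time
    eta[i] - log(sum(exp(eta[at_risk])))
  }, numeric(1)))
  -loglik / length(time)
}

# Toy check: three events, no ties, beta = 0.
# The risk sets have sizes 3, 2 and 1, so the loss is log(3 * 2 * 1) / 3.
x_toy <- matrix(c(0.5, -1, 2), ncol = 1)
loss_at_zero <- cox_loss(0, x_toy, time = c(1, 2, 3), status = c(1, 1, 1))
print(loss_at_zero)
```

With a glmnet fit, the coefficients for one penalty value can be taken from coef(fit, s = ...) and converted with as.numeric() before being passed in as beta.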

Variable selection methods

I have been doing variable selection for a modeling problem.
So far I have selected variables by trial and error, adding or removing one variable at a time and keeping the change if the error decreases. However, now that the number of variables has grown into the hundreds, manual selection is no longer feasible because the model takes half an hour to compute.
Do you know of any packages other than regsubsets from the leaps package? When I tested it on the same variables I had found by trial and error, it produced a higher error, and it excluded some valuable variables because they were linearly dependent.
You need a better (i.e. not flawed) approach to model selection. There are plenty of options, but one that should be easy to adapt to your situation would be using some form of regularization, such as the Lasso or the elastic net. These apply shrinkage to the sizes of the coefficients; if a coefficient is shrunk from its least squares solution to zero, that variable is removed from the model. The resulting model coefficients are slightly biased but they have lower variance than the selected OLS terms.
Take a look at the lars, glmnet, and penalized packages.
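As a rough sketch of what Lasso-based selection could look like with glmnet, the example below uses the built-in swiss data as a stand-in for your own predictors. The choice of lambda.1se, the seed, and the object names are illustrative assumptions, not recommendations for your problem. Variables whose coefficients are shrunk exactly to zero are the ones dropped from the model.

```r
library(glmnet)

set.seed(1)
x <- as.matrix(swiss[, -1])          # predictors (all columns except Fertility)
y <- swiss$Fertility

# Cross-validated Lasso (alpha = 1); the elastic net would use 0 < alpha < 1
cvfit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the largest lambda within one standard error of the minimum
b <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
print(selected)
```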
Try using the stepAIC function of the MASS package.
Here is a really minimal example:
library(MASS)  # provides stepAIC()
fit <- lm(Fertility ~ ., data = swiss)
coef(fit)
## (Intercept) Agriculture Examination Education Catholic
## 66.9151817 -0.1721140 -0.2580082 -0.8709401 0.1041153
## Infant.Mortality
## 1.0770481
st1 <- stepAIC(fit, direction = "both")
st2 <- stepAIC(fit, direction = "forward")
st3 <- stepAIC(fit, direction = "backward")
You should try all three directions and check which model works best on your test data.
Read ?stepAIC and take a look at the examples.
It's true that stepwise regression isn't the greatest method. As mentioned in GavinSimpson's answer, lasso regression is a better and much more efficient method. It's much faster than stepwise regression and will work with large datasets.
Check out the glmnet package vignette:
