mice pool() function and coxph(): "In mice.df(...) : Large sample assumed" warning

I used the coxph() function from the survival package on a multiply imputed dataset and encountered a warning when trying to pool the results: "In mice.df(m, lambda, dfcom, method) : Large sample assumed."
A reproducible example is below (with publicly available data, without worrying too much about the appropriateness of using both mice() and coxph() with these data):
library(mice)
library(survival)
# load publicly available data
data(pbc)
# select variables for the reproducible example
pbc.select <- pbc[pbc$status %in% c(0, 1), c("id", "time", "status", "trt")]
imp <- mice(pbc.select)                            # impute trt
fit <- with(imp, coxph(Surv(time, status) ~ trt))  # fit coxph in each imputation
pool(fit)                                          # pool the models; raises the warning
This warning seems to result from the pool() function trying to obtain dfcom via
dfcom <- df.residual(object)
where df.residual() has no method for the object in question, which is of class coxph:
class(fit)                  # "mira" "matrix"
class(fit$analyses[[1]])    # "coxph"
My questions are:
Am I using the correct syntax for this purpose?
If so, is there a way to supply pool() with the appropriate information? (A sketch follows below.)
How does this assumption influence the results?
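For the second question, here is a minimal sketch of one possible workaround, assuming a recent version of mice whose pool() accepts a dfcom argument (check ?pool for your installed version). The complete-data degrees of freedom are computed by hand, and the particular formula is my assumption, not an official recommendation:
# hypothetical workaround: supply the complete-data df ourselves, since
# df.residual() is not defined for coxph objects; events minus number of
# estimated coefficients is one plausible (assumed) choice for a Cox model
n.events <- sum(pbc.select$status == 1, na.rm = TRUE)
dfcom <- n.events - length(coef(fit$analyses[[1]]))
pool(fit, dfcom = dfcom)  # should avoid the "Large sample assumed" warning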

Related

Error in brglm model with Backward elimination with Interaction: error in do.call("glm.control", control) : second argument must be a list

After fitting a model with glm I got this as a result:
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
After some research on Google, I tried the brglm package. When I try to apply backward elimination to the model, I get the following error:
Error in do.call("glm.control", control) : second argument must be a list.
I searched on Google but I didn't find anything.
Here is my code with brglm:
library(mlbench)
# require(Amelia)
library(caTools)
library(mlr)
library(ciTools)
library(brglm)
data("BreastCancer")
data_bc <- BreastCancer
data_bc
head(data_bc)
dim(data_bc)
# delete the id column
data_bc <- data_bc[, -1]
data_bc
dim(data_bc)
str(data_bc)
# convert all factor columns to numeric, except Class
for (i in 1:9) {
  data_bc[, i] <- as.numeric(as.character(data_bc[, i]))
}
str(data_bc)
# convert Class (benign/malignant) to binary 0/1
data_bc$Class <- ifelse(data_bc$Class == "malignant", 1, 0)
# now convert Class back to a factor
data_bc$Class <- factor(data_bc$Class, levels = c(0, 1))
str(data_bc)
model <- brglm(formula = Class ~ .^2, data = data_bc, family = "binomial",
               na.action = na.exclude)
summary(model)
# backward elimination:
final <- step(model, direction = "backward")
You can work around this by using the brglm2 package, which supersedes the brglm package anyway:
library(brglm2)
model <- glm(formula = Class ~ .^2, data = na.omit(data_bc), family = "binomial",
             na.action = na.fail, method = "brglmFit")
final <- step(model, direction = "backward")
length(coef(model))  ## 46
length(coef(final))  ## 42
setdiff(names(coef(model)), names(coef(final)))
## [1] "Cl.thickness:Epith.c.size" "Cell.size:Marg.adhesion"
## [3] "Cell.shape:Bl.cromatin"    "Bl.cromatin:Mitoses"
Some general concerns about your approach:
stepwise reduction is one of the worst forms of model reduction (cf. lasso, ridge, elastic net ...); see the sketch after this list
in the presence of missing data, model comparison (e.g. by AIC) is questionable, as different models will be fitted to different subsets of the data. Given that you are only going to lose a small fraction of your data by using na.omit() (compare nrow(data_bc) with sum(complete.cases(data_bc))), I would strongly recommend dropping observations with NA values from the data set before starting
it's also not clear to me that comparing penalized models via AIC is statistically appropriate (see here)
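To illustrate the first point, here is a minimal lasso sketch with glmnet (my addition, not part of the original answer); the design matrix mirrors the Class ~ .^2 call above, and the lambda grid and cross-validation settings are just glmnet defaults, not tuned choices:
library(glmnet)
bc <- na.omit(data_bc)
# design matrix with all two-way interactions, as in the glm call above
X <- model.matrix(Class ~ .^2, data = bc)[, -1]  # drop the intercept column
y <- bc$Class
# cross-validated lasso (alpha = 1); coefficients shrink to exactly zero,
# giving variable selection without stepwise refitting
cv <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv, s = "lambda.1se")  # sparse coefficient vector at the 1-SE lambda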

Why are the predict values of gbm (R package) negative?

I analyzed my data with the gbm R package. My data come from a cohort study, so I fit the gbm model with the coxph distribution.
After constructing the model, I would like to see how well it predicts. However, as in the code below, the predicted values are negative, and I have trouble understanding this.
Please let me know how to interpret these values.
Here's my code.
install.packages("survival")
install.packages("randomForestSRC")
install.packages("gbm")
library(survival)
library(randomForestSRC)
library(gbm)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
exposure <- names(data[, names(data.model) !=c("days", "status")])
formula <- as.formula(paste("Surv(days, status)~", paste(exposure, collapse="+")))
set.seed(123)
ex <- gbm(Surv(days, status)~.,
data=data,
distribution="coxph",
cv.folds=5,
shrinkage=.01,
n.trees=1000)
set.seed(123)
pred <- predict(ex, n.trees=1000, type="response")
Read the ?predict.gbm help page, particularly the type parameter. By default, predictions are on the link scale.
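To make that concrete, a short sketch (my addition, reusing the pred object from the question): for distribution = "coxph", predictions are log relative hazards, so negative values are expected, and exponentiating yields positive hazard ratios:
# log relative hazards: negative simply means a lower predicted hazard
head(pred)
# exponentiate for hazard ratios (always positive)
head(exp(pred))
# per ?predict.gbm, type = "response" only rescales bernoulli and poisson
# predictions; for coxph, "response" and "link" return the same values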

stepAIC handling of multinom models

I am seeing some weird behavior with the stepAIC function in the MASS package when dealing with multinomial logistic models. Here is some sample code:
library(nnet)
library(MASS)
example("birthwt")
race.model <- multinom(race ~ smoke, bwt)
race.model2 <- stepAIC(race.model, k = 2)
In this case race.model and race.model2 have identical terms; stepAIC did not prune anything. However, I need to query certain attributes of the models, and I get an error with race.model2:
formula(race.model)[2]
returns race() but
formula(race.model2)[2]
gives the error:
Error in terms.formula(newformula, specials = names(attr(termobj, "specials"))) :
invalid model formula in ExtractVars
This behavior only seems to occur when stepAIC does not remove terms from the model. In the following code, terms are removed by stepAIC, and both models can be properly queried:
race.big <- multinom(race ~ ., bwt)
race.big2 <- stepAIC(race.big, k = 2)
formula(race.big)[2]
formula(race.big2)[2]
Any ideas about what is going wrong here?
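No confirmed fix, but a diagnostic sketch (my addition) that may help narrow it down: compare what each object stores as its call and terms, since formula() reconstructs the formula from these and terms() keeps its own copy:
race.model$call$formula     # the formula as originally supplied
race.model2$call$formula    # whatever stepAIC wrote back into the call
# terms() stores a separate copy of the formula that may still be intact
formula(terms(race.model2))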

Pooling sandwich variance estimator over multiply imputed datasets

I am running a Poisson regression on multiply imputed data to predict a common binary outcome. After running mice, I have obtained a stacked data frame comprising the raw data and five imputed datasets. Here is a toy example:
library(mice)
df <- mice::nhanes
imp <- mice(df)                     # impute data
com <- complete(imp, "long", TRUE)  # stacked data frame (raw + imputations)
I now want to:
Run the regression on each imputed dataset
Calculate robust standard errors using a sandwich variance estimator
Combine / pool the results of both analyses
I can run the regression on the mids object using the with and pool commands:
fit.pois.mids <- with(imp, glm(hyp ~ age + bmi + chl, family = poisson))
summary(pool(fit.pois.mids))
I can also run the regression on each of the imputed datasets before combining them (MIcombine is from the mitools package):
library(mitools)
imp.df <- split(com, com$.imp)  # list of data frames, one per imputation
names(imp.df) <- c("raw", "imp1", "imp2", "imp3", "imp4", "imp5")
fit.pois <- lapply(imp.df, function(x) {
  glm(hyp ~ age + bmi + chl, data = x, family = poisson)
})
summary(MIcombine(fit.pois))
Similarly, I can calculate robust standard errors for each imputed dataset (coeftest is from the lmtest package, sandwich from the sandwich package):
library(lmtest)
library(sandwich)
sand <- lapply(fit.pois, function(x) {
  coeftest(x, vcov = sandwich)
})
Unfortunately, MIcombine does not seem to return p-values. This post suggests using Zelig, but in that case I may as well just use mice. Further, it does not appear to be possible to combine the standard-error estimates directly:
summary(MIcombine(sand))
Error in UseMethod("vcov") :
no applicable method for 'vcov' applied to an object of class "coeftest"
For the sake of simplicity, mice seems the better option for pooling the regression results; however, I am wondering how I would go about pooling the robust standard errors as well. What are some ways this could be addressed?
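One way to address it (a sketch of my own, not an established recipe): pool by hand with Rubin's rules, taking the point estimates from each fit and the robust variances from sandwich(). The z-based p-values assume large samples (a Barnard-Rubin degrees-of-freedom correction would be more careful), and the raw data at position 1 of the list must be dropped first:
fits <- fit.pois[-1]  # keep only the imputed datasets, drop "raw"
m    <- length(fits)
est  <- sapply(fits, coef)                           # p x m matrix of estimates
rvar <- sapply(fits, function(x) diag(sandwich(x)))  # robust variances per fit
qbar <- rowMeans(est)                                # pooled estimates
ubar <- rowMeans(rvar)                               # within-imputation variance
b    <- apply(est, 1, var)                           # between-imputation variance
tot  <- ubar + (1 + 1/m) * b                         # total variance (Rubin's rules)
z    <- qbar / sqrt(tot)
data.frame(estimate = qbar, robust.se = sqrt(tot),
           z = z, p.value = 2 * pnorm(-abs(z)))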

R- Random forest predict fails with NAs in predictors

The documentation (if I'm reading it correctly) says that the randomForest predict function produces NA predictions if it encounters NA predictors for certain observations:
NOTE: If the object inherits from randomForest.formula, then any data
with NA are silently omitted from the prediction. The returned value
will contain NA correspondingly in the aggregated and individual tree
predictions (if requested), but not in the proximity or node matrices
However, if I try to use the predict function on a dataset with some NAs in the predictors (NAs in 7 observations out of 2688), I encounter the following error, and prediction fails:
Error in predict.randomForest(model, new.ds) : missing values in newdata
There is a slightly messy work-around that I would like to avoid if possible.
Am I doing or reading something wrong? Does it have something to do with the "inherits from randomForest.formula" clause?
Using some examples from the documentation:
library(randomForest)
set.seed(1)
x <- data.frame(x1 = gl(32, 5), x2 = runif(160), y = rnorm(160))
rf1 <- randomForest(x[-3], x[[3]], ntree = 10)
inherits(rf1, "randomForest.formula")
## [1] FALSE
iris.rf <- randomForest(Species ~ ., data = iris, importance = TRUE,
                        proximity = TRUE)
inherits(iris.rf, "randomForest.formula")
## [1] TRUE
So you probably called randomForest without using the formula interface to fit your model.
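Two sketches of ways around it, reusing the rf1 and x objects from above (the names rf2 and new.ds are mine): either refit through the formula interface so predict() applies the documented NA handling, or fill in the NAs with randomForest's na.roughfix() before predicting:
# option 1: refit with the formula interface; predict() then silently
# omits rows with NA and returns NA for them, per the NOTE quoted above
rf2 <- randomForest(y ~ x1 + x2, data = x, ntree = 10)
new.ds <- x[1:10, ]
new.ds$x2[3] <- NA
predict(rf2, new.ds)  # prediction for row 3 comes back as NA

# option 2: keep the matrix-interface model, but impute the NAs first
# (na.roughfix replaces NAs with column medians/modes)
predict(rf1, na.roughfix(new.ds[-3]))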
