Pooling glmers of imputed datasets - r

The problem:
I have a dataset with some missing predictor values, and I'd like to pool glmer models that have been fitted to each imputed dataset. I'm using the mice package to create the imputations (I've also tried amelia and mi, with no success). I'd primarily like to extract the fixed effects.
Using the pool() function within the mice package returns the error:
Error in qhat[i, ] : incorrect number of dimensions
I've tried to use and adapt a previous rewrite of the pool() function here:
https://github.com/stefvanbuuren/mice/pull/5
There's probably an obvious solution I'm overlooking!
Here's an example:
# 1. create data (that can be replicated and converge later)
data = data.frame(x1 = c(rep("1", 0.1*1000), rep("0", 0.5*1000),
                         rep("1", 0.3*1000), rep("0", 0.1*1000)),
                  x2 = c(rep("fact1", 0.55*1000), rep("fact2", 0.1*1000),
                         rep(NA, 0.05*1000), rep("fact3", 0.3*1000)),
                  centre = c(rep("city1", 0.1*1000), rep("city2", 0.2*1000),
                             rep("city3", 0.15*1000), rep("city1", 0.25*1000),
                             rep("city2", 0.3*1000)))
# 2. set factors
data[] <- lapply(data, as.factor)  # lapply keeps the data.frame; sapply would collapse it to a matrix
# 3. mice imputation
library(mice)
imp.data = mice(data, m=5, maxit=20, seed=1234, printFlag=FALSE)
# 4. apply the glmer function
library(lme4)
mice.fit = with(imp.data, glmer(x1~x2+(1|centre), family='binomial'))
# 5. pool imputations together
pooled.mi = pool(mice.fit)
The other approach I've tried at step 4 is below, in the hope that it would create an object amenable to pool().
mice.fit = lapply(imp.data$imp, function(d) {
  glmer(x1 ~ x2 + (1 | centre), data = d, family = "binomial")
})
I've got a workaround that uses a meta-analysis model to pool each of the fixed effects across the glmer models. That works, but it would be much better to have Rubin's rules pooling working.
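For reference, Rubin's rules applied to the fixed effects by hand look roughly like the sketch below (this assumes the mice.fit object from step 4 and is only a sketch of the pooling rules, not the meta-analysis workaround itself):
# Pooled estimate = mean of per-imputation estimates;
# total variance = mean within-imputation variance + (1 + 1/m) * between-imputation variance.
est  <- sapply(mice.fit$analyses, fixef)                      # p x m matrix of fixed effects
U    <- sapply(mice.fit$analyses, function(f) diag(vcov(f)))  # within-imputation variances
m    <- length(mice.fit$analyses)
qbar <- rowMeans(est)                                         # pooled estimates
ubar <- rowMeans(U)                                           # average within-imputation variance
B    <- apply(est, 1, var)                                    # between-imputation variance
total <- ubar + (1 + 1/m) * B                                 # Rubin's total variance
cbind(estimate = qbar, se = sqrt(total))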

This Just Works for me after making my own fork of mice, pulling the extended version you referenced above into it, and cleaning it up a little bit: try
devtools::install_github("bbolker/mice")
and see how your process goes after that. (If it works, someone should submit a reminder/new pull request ...)

Is there a difference between an object of class "glmerMod" and one of class "lmerMod"? I am unfamiliar with the lme4 package. But if there is no difference, you can change the class of the mice.fit analyses to "lmerMod" and then it should run fine.
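For what it's worth, a quick check of that premise (a small aside, assuming the imp.data object from the question): glmer() and lmer() fits are distinct S4 classes, but both extend lme4's "merMod" class.
library(lme4)
library(mice)
fit1 <- glmer(x1 ~ x2 + (1 | centre), data = complete(imp.data, 1), family = binomial)
class(fit1)         # "glmerMod"
is(fit1, "merMod")  # TRUE; shares the merMod representation with lmerMod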

Related

Why am I getting different statistical outputs than my partner using the same code?

I'm trying to fit a classification tree to the OJ dataset (from the ISLR2 package), following the ISLR2 textbook. The response variable is "Purchase", which takes one of two values: "MM" or "CH".
library(ISLR2)
library(tree)
# (a) Create a training set containing a random sample of 800 obs
# and a test set containing the remaining observations.
set.seed(1)
train <- sample(1:nrow(OJ), 800)
# (b) Fit a regression tree to the training set. Plot the tree, and
# calculate the test MSE.
OJ.test <- OJ[-train, ]
tree.oj <- tree(Purchase~., OJ, subset = train) ##Produces error "NAs introduced by coercion"
plot(tree.oj) ##Produces error "Cannot plot singlenode tree"
My question is: does something seem wrong with my code, or could there be an issue with RStudio on my computer? I have the same code as my class partner and she can run it fine. We also had identical code for our last assignment, but when I ran it I produced different statistics in repeated runs. This has happened a few other times, which leads me to believe there's an issue with my computer and R rather than the code. Any suggestions on where to start to resolve this?
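(An aside, not from the original post: one common reason for "same seed, different results" across machines is a different R version, since the default sample() algorithm changed in R 3.6.0. Comparing the following on both computers is a cheap first check.)
# If the R versions or the sample.kind RNG settings differ, set.seed(1) followed
# by sample() will draw different training indices on the two machines.
R.version.string
RNGkind()   # third element: "Rejection" (R >= 3.6.0) vs "Rounding" (older versions)
set.seed(1)
head(sample(1:nrow(ISLR2::OJ), 800))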

Plot an envelope for an mppm object in spatstat

My question is closely related to this previous one: Simulation-based hypothesis testing on spatial point pattern hyperframes using "envelope" function in spatstat
I have obtained an mppm object by fitting a model on several independent datasets using the mppm function from the R package spatstat. How can I study its envelope to compare it to my observations?
I fitted my model as such:
data <- listof(NMJ1, NMJ2, NMJ3)
data <- hyperframe(X = 1:3, Points = data)
model <- mppm(Points ~ marks * sqrt(x^2 + y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp and are independent realizations of the same experiment.
However, the envelope function does not accept inputs of type mppm:
> envelope(model, Kcross.inhom, nsim=10)
Error in UseMethod("envelope") :
no applicable method for 'envelope' applied to an object of class "c('mppm', 'list')"
The answer provided to the previously mentioned question indicates how to plot global envelopes for each pattern and how to use the product rule for multiple testing. However, my fitted model implies that my 3 ppp objects are statistically equivalent and are independent realizations of the same experiment (i.e. there are no differing covariates between them). I would thus like to obtain one single plot comparing my fitted model to my 3 datasets. The following code:
gamma <- 1 - 0.95^(1/3)
nsims <- round(1/gamma - 1)
sims <- simulate(model, nsim = 2*nsims)
SIMS <- list()
for (i in 1:nrow(sims)) SIMS[[i]] <- as.solist(sims[i, , drop = TRUE])
Hplus <- cbind(data, hyperframe(Sims = SIMS))
EE1 <- with(Hplus, envelope(Points, Kcross.inhom, nsim = nsims, simulate = Sims))
pool(EE1[1], EE1[2], EE1[3])
leads to the following error:
Error in pool.envelope(`1` = list(r = c(0, 0.78125, 1.5625, 2.34375, 3.125, :
Arguments 2 and 3 do not belong to the class “envelope”
Wrong type of subset index. Use
pool(EE1[[1]], EE1[[2]], EE1[[3]])
or just
pool(EE1)
These would have given an error message that the envelope commands should have been called with savefuns=TRUE. So you just need to change that step as well.
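A sketch of the corrected step, based on the advice above:
# Re-run the envelopes with the simulated functions saved, then pool them.
EE1 <- with(Hplus, envelope(Points, Kcross.inhom, nsim = nsims,
                            simulate = Sims, savefuns = TRUE))
pool(EE1)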
However, statistically this procedure makes little sense. You have already fitted a model, which allows for rigorous statistical inference using anova.mppm and other tools. Instead of this, you are generating simulated data from the fitted model and performing a Monte Carlo test, with all the fraught issues of multiple testing and low power. There are additional problems with this approach - for example, even if the model is "the same" for each row of the hyperframe, the patterns are not statistically equivalent unless the windows of the point patterns are identical, and so on.
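For completeness, the model-based route mentioned above might look roughly like this (the null model formula is only an assumed example):
# Compare the fitted model against a simpler null model using anova.mppm.
model0 <- mppm(Points ~ marks, data)
anova(model0, model, test = "Chi")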

How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier score for my machine learning predictive models. The outcome "y" is binary (1 or 0). The predictors are a mix of continuous and categorical variables.
I have created four models with different predictors; I will call them "model_1" to "model_4" here (apart from the predictors, the other parameters are the same). Example code for one of the models is:
library(randomForestSRC)
Model_1 = rfsrc(y ~ ., data = TrainTest, ntree = 1000,
                mtry = 30, nodesize = 1, nsplit = 1,
                na.action = "na.impute", nimpute = 3, seed = 10,
                importance = TRUE)
When I run "Model_1" in R, I get the fitted-model output (not reproduced here).
My question is: how can I get the predicted probability for those 412 people? And how do I find the observed outcome for each person? Do I need to calculate it by hand? I found the function BrierScore() in the "DescTools" package.
But when I tried BrierScore(Model_1), it gave no results.
Code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
When I tried these, I got many error messages (not reproduced here).
One assumption I am making about your approach is that you want to compute the Brier score on the same data you trained your model on, which is usually not the correct approach (look up train-test splits if you need more background there). In general, therefore, you should reflect on whether your approach is correct.
The BrierScore() function in DescTools only has a dedicated method for glm models; otherwise it expects a vector of predicted probabilities and a vector of true values as input (see ?BrierScore).
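As a minimal illustration of what BrierScore() computes (hypothetical numbers):
# The Brier score is the mean squared difference between the predicted
# probability of the event and the observed 0/1 outcome.
p <- c(0.9, 0.2, 0.6)   # hypothetical predicted probabilities
y <- c(1, 0, 1)         # observed outcomes
mean((p - y)^2)         # 0.07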
What you would need to do though is to predict on your data using:
prediction = predict(model_1, TrainTest, na.action="na.impute")
and then compute the Brier score using
BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 1L])
(Note that we transform TrainTest$y into a numeric vector of 0s and 1s in order to compute the Brier score.)
Note: the randomForestSRC package also prints a normalized Brier score when you call print(prediction).
In general, using one of the available machine learning frameworks in R (mlr3, tidymodels, caret) might simplify this workflow and prevent a lot of errors in this direction, especially if you are less experienced in ML.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, automatically also taking care of train-test splits.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

Predict is not usable for mira/mipo?

I'm a student with no prior knowledge of anything coding related, but I'm taking a module that requires RStudio and now I'm struggling.
I have an assignment that requires us to explore methods of dealing with missing data in a training data set and a test data set (multiple rows and multiple variables), then create a linear model with lm() using the training set, and finally use predict() on that lm with newdata = the test data to observe the results. I am tasked with learning how to use MICE for this assignment, but I am at a dead end.
In my attempt, I tried to fill in the missing data of the training data set via MICE as follows:
library(mice)
train = read.csv("Train_Data.csv", na.strings = c("", "NA"))
missingtraindata = mice(train, m=5, maxit = 5, method = 'pmm')
model = with(missingtraindata, lm(LOS~.- PatientID, data = train))
miceresults = pool(model)
summary(miceresults)
Then I tried to use predict() but it doesn't work because it says mira/mipo doesn't work with predict(). I don't know what that means at all.
Honestly, I have no idea what any of this code does; I just tried to apply whatever information was available in my notes on MICE. I don't know if that's how you correctly use MICE to fill in missing data, but I literally spent the entire day researching and trying and it's of no help. Please help!
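(Not from the original post, but one common workaround: predict() has no method for mira/mipo objects, so you can predict from each of the m completed-data fits stored in the mira object and average the predictions. The sketch below assumes the test data is in a data frame called test.)
pred_each <- sapply(model$analyses, predict, newdata = test)  # one column per imputation
pred_avg  <- rowMeans(pred_each)                              # averaged predictions
head(pred_avg)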

How do I get a predictions list from running svm in e1071 package

Q1:
I have been trying to get the AUC value for a classification problem and have been trying to use e1071 and ROCR packages in R for this. ROCR has a nice example "ROCR.simple" which has prediction values and label values.
library(ROCR)
data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
auc<-performance(pred,"auc")
This gives the AUC value, no problem.
MY PROBLEM is: How do I get the type of data given by ROCR.simple$predictions in the above example?
I run my analysis like
library(e1071)
data(iris)
y <- iris$Species
x<-iris[,1:2]
model<-svm(x,y)
pred<-predict(model,x)
Up to here I'm OK.
Then how do I get the kind of predictions that ROCR.simple$predictions gives?
Q2:
There is a nice example involving ROCR.xval. This is a problem with 10 cross-validation folds.
They run
pred<-prediction(ROCR.xval$predictions,ROCR.xval$labels)
auc<-performance(pred,"auc")
This gives results for all 10 cross-validation folds.
My problem is:
How do I use
model<-svm(x,y,cross=10) # where x and y are as given in Q1
and get all 10 sets of predictions and labels into a list like the one in ROCR.xval?
Q1. You could use
pred<-prediction(as.numeric(pred), as.numeric(iris$Species))
auc<-performance(pred,"auc")
BUT: this fails with "Number of classes is not equal to 2. ROCR currently supports only evaluation of binary classification tasks" (that is the error I got).
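A sketch of a setup that ROCR does accept (restricting iris to two species and using svm() probability estimates as the scores are my additions, not part of the original answer):
library(e1071)
library(ROCR)
iris2 <- droplevels(iris[iris$Species != "virginica", ])
fit   <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris2, probability = TRUE)
pr    <- predict(fit, iris2, probability = TRUE)
score <- attr(pr, "probabilities")[, "versicolor"]   # predicted probability of one class
pred  <- prediction(score, as.integer(iris2$Species == "versicolor"))
performance(pred, "auc")@y.values[[1]]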
Q2. I don't think the second one can be done the way you want. I can only think of performing the cross-validation manually, i.e. getting the resample indices (from the peperr package):
library(peperr)
cv.ind <- resample.indices(nrow(iris), sample.n = 10, method = c("cv"))
x <- lapply(cv.ind$sample.index, function(i) iris[i, 1:2])
y <- lapply(cv.ind$sample.index, function(i) iris[i, 5])
Then generate models and predictions for each CV sample:
model1<-svm(x[[1]],y[[1]])
pred1<-predict(model1,x[[1]])
etc.
Then you could manually construct a list like ROCR.xval.
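A sketch of that last step (with the caveat from Q1 that prediction() still needs a binary problem before ROCR will evaluate it):
# Collect the per-fold predictions and labels into a list shaped like ROCR.xval.
xval <- list(predictions = vector("list", length(cv.ind$sample.index)),
             labels      = vector("list", length(cv.ind$sample.index)))
for (i in seq_along(cv.ind$sample.index)) {
  m <- svm(x[[i]], y[[i]])
  xval$predictions[[i]] <- as.numeric(predict(m, x[[i]]))
  xval$labels[[i]]      <- as.numeric(y[[i]])
}
str(xval, max.level = 2)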
