How to combine multiply imputed data with mice? - r

I split a dataset into men and women, and then separately imputed it using the mice package.
#Generate predictormatrix
pred_gender_0<-quickpred(data_gender_0, include=c("age","weight_trunc"),exclude=c("ID","X","gender"),mincor = 0.1)
pred_gender_1<-quickpred(data_gender_1, include=c("age","weight_trunc"),exclude=c("ID","X","gender"),mincor = 0.1)
#impute the data with mice
imp_pred_gen0 <- mice(data_gender_0,
pred=pred_gender_0,
m=10,
maxit=5,
diagnostics=TRUE,
MaxNWts=3000) # I had to set this to 3000 because of a problematic unordered categorical variable
imp_pred_gen1 <- mice(data_gender_1,
pred=pred_gender_1,
m=10,
maxit=5,
diagnostics=TRUE,
MaxNWts=3000)
Now, I have two objects with 10 imputed datasets. One for men, one for women.
My question is: how do I combine them?
Normally, I would just use:
comp_imp<-complete(imp,"long")
Should I:
use rbind.mids() to combine data of men and women and then convert it to long format?
or first convert to long format and then use rbind.mids() or rbind()?
Thanks for any hints! =)
---------------------------------------------------------------------------
UPDATE - REPRODUCIBLE EXAMPLE
library("dplyr")
library("mice")
# We use nhanes-dataset from the mice-package as example
# first: combine age-category 2 and 3 to get two groups (as example)
nhanes$age[nhanes$age == 3] <- "2"
nhanes$age<-as.numeric(nhanes$age)
nhanes$hyp<-as.factor(nhanes$hyp)
#split data into two groups
nhanes_age_1<-nhanes %>% filter(age==1)
nhanes_age_2<-nhanes %>% filter(age==2)
#generate predictormatrix
pred1<-quickpred(nhanes_age_1, mincor=0.1, inc=c('age','bmi'), exc='chl')
pred2<-quickpred(nhanes_age_2, mincor=0.1, inc=c('age','bmi'), exc='chl')
# separately impute data
set.seed(121012)
imp_gen1 <- mice(nhanes_age_1,
pred=pred1,
m=10,
maxit=5,
diagnostics=TRUE,
MaxNWts=3000)
imp_gen2 <- mice(nhanes_age_2,
pred=pred2,
m=10,
maxit=5,
diagnostics=TRUE,
MaxNWts=3000)
#------ ALTERNATIVE 1:
#combine imputed data
combined_imp<-rbind.mids(imp_gen1,imp_gen2)
complete_imp<-complete(combined_imp,"long")
#output
> combined_imp<-rbind.mids(imp_gen1,imp_gen2)
Warning messages:
1: In rbind.mids(imp_gen1, imp_gen2) :
Predictormatrix is not equal in x and y; y$predictorMatrix is ignored
.
2: In x$visitSequence == y$visitSequence :
longer object length is not a multiple of shorter object length
3: In rbind.mids(imp_gen1, imp_gen2) :
Visitsequence is not equal in x and y; y$visitSequence is ignored
.
> complete_imp<-complete(combined_imp,"long")
Error in inherits(x, "mids") : object 'combined_imp' not found
#------ ALTERNATIVE 2:
complete_imp1<-complete(imp_gen1,"long")
complete_imp2<-complete(imp_gen2,"long")
combined_imp<-rbind.mids(complete_imp1,complete_imp2)
#Output
> complete_imp1<-complete(imp_gen1,"long")
> complete_imp2<-complete(imp_gen2,"long")
> combined_imp<-rbind.mids(complete_imp1,complete_imp2)
Error in if (ncol(y) != ncol(x$data)) stop("The two datasets do not have the same number of columns\n") :
argument is of length zero

You can just use the following to create a new mids object which contains 10 imputed datasets of the men and women.
comp_imp <- rbind(imp_pred_gen0, imp_pred_gen1)
Because both arguments are mids objects, this calls rbind.mids, not the regular rbind function in R. The new object returned can be analysed in the usual way, e.g. using with.mids to fit your desired model to each of the imputed datasets.
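As a minimal sketch of that workflow, assuming mice 3.0 or later (where rbind() on mids objects is handled by the package once it is loaded): the toy data frame dat, its columns g, x and y, and the linear model are invented for illustration and are not the asker's data.

```r
library(mice)

# Toy data (hypothetical): g is the grouping variable, y has missing values
set.seed(1)
dat <- data.frame(g = rep(c(0, 1), each = 50),
                  x = rnorm(100),
                  y = rnorm(100))
dat$y[sample(100, 20)] <- NA

# Impute each group separately; both calls must use the same m
imp0 <- mice(dat[dat$g == 0, ], m = 5, printFlag = FALSE, seed = 1)
imp1 <- mice(dat[dat$g == 1, ], m = 5, printFlag = FALSE, seed = 2)

# Stack the two mids objects row-wise into a single mids object
imp_all <- rbind(imp0, imp1)

# Long format: one block of rows per imputation, indexed by .imp
long <- complete(imp_all, "long")

# Fit the same model on every completed dataset and pool the estimates
fit <- with(imp_all, lm(y ~ x + g))
summary(pool(fit))
```

Since g is constant within each subset, mice should drop it as a predictor during imputation and log that; the column itself stays in the data, so it is still available for the analysis model.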

I honestly have no knowledge of the package mice and just a faint idea about the concept of imputation.
I don't know what kind of analysis you would like to perform, but you say that normally you would do: comp_imp<-complete(imp,"long"), so I'll try to answer accordingly.
For me the first approach returns a data.frame, but without any missings. That is weird, since in complete(imp_gen1,"long") there is missing data in hyp. I don't know what rbind.mids is doing there.
I would therefore go with your second approach.
The result from complete(., "long") is a normal data.frame, hence there is no need to bind it with rbind.mids.
I would change your second approach to:
library(dplyr)
complete_imp1 <- complete(imp_gen1, "long")
complete_imp2 <- complete(imp_gen2, "long")
combined_imp <- bind_rows(complete_imp1, complete_imp2)
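For illustration, the stacked long-format data frame can then be split on the .imp column to get one completed dataset per imputation. This is only a sketch on invented toy data (dat, g, x and y are hypothetical names), and it uses base rbind() where bind_rows() would work equally well. With this approach the result is a plain data.frame, not a mids object, so any pooling across imputations has to be done by hand.

```r
library(mice)

# Toy data (hypothetical): g is the grouping variable, y has missing values
set.seed(1)
dat <- data.frame(g = rep(c(0, 1), each = 50),
                  x = rnorm(100),
                  y = rnorm(100))
dat$y[sample(100, 20)] <- NA

imp0 <- mice(dat[dat$g == 0, ], m = 5, printFlag = FALSE, seed = 1)
imp1 <- mice(dat[dat$g == 1, ], m = 5, printFlag = FALSE, seed = 2)

# Each long data frame has nrow(group) * m rows plus the .imp and .id columns
long_all <- rbind(complete(imp0, "long"), complete(imp1, "long"))

# One completed dataset (both groups together) per imputation number
per_imp <- split(long_all, long_all$.imp)
sapply(per_imp, nrow)

# Example analysis: the mean of y within each completed dataset
sapply(per_imp, function(d) mean(d$y))
```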

complete_imp1 <- complete(imp_gen1, "long") already returns all 10 (the m parameter) imputed data frames stacked on top of each other, so its number of rows equals the number of rows in the original data multiplied by m.

Related

Combine imputed data by group in r using mice

my question is a follow-up to this question on imputation by group using "mice":
multiple imputation and multigroup SEM in R
The code in the answer works fine as far as the imputation part goes. But afterwards I am left with a list of actually complete data but more than one set. The sample looks as follows:
'Set up data frame'
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
'Introduce NAs'
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
df
'Impute values by group:'
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
df.clean
As you can see, df.clean is a list of 3. One element per group. But each element containing a complete data set I am looking for.
The original answer suggests to rbind() the obtained data in df.clean which leaves me with a new data set with 45 (3x the original size) observations.
Here is the original code for the last step:
imputed.both <- do.call(args = df.clean, what = rbind)
Which data is the "right" one? And why the last step?
Thanks a bunch!
There's a bug in the code; I have an edited version below that works:
#Set up data frame
set.seed(12345)
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
#Introduce NAs
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
# check NAs
colSums(is.na(df))
#Impute values by group:
# the bug was here: pass x (the group subset) to mice(), not df
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
imputed.both <- do.call(args = df.clean, what = rbind)
dim(imputed.both)
# returns 15,4
In the code in the question, you have
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
dim(do.call(rbind,df.clean))
#this returns 45,4
The function is defined with the argument "x", but its body calls df from the global environment. Hence every one of the three calls imputes the complete df, and you end up with 3 x 15 = 45 rows.
So to answer your question, if you do this step:
split(df,df$ID)
You split your data frame into a list of data.frames with only A,B or Cs. Then if you lapply through this list, you get
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
names(df.clean)
lapply(df.clean,dim)
each item of the list df.clean contains a subset of the original df, with ID being A, B or C. Now you combine this list together into a data.frame using:
imputed.both <- do.call(rbind,df.clean)
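The difference between the two versions can be reproduced without mice. The sketch below uses a made-up helper, impute_mean(), as a hypothetical stand-in for mice::complete(mice(...)); the toy values are invented too. Only the scoping behaviour is the point here, not the imputation method.

```r
# Toy data: three groups of five rows, one missing value per group
df <- data.frame(ID = rep(c("A", "B", "C"), each = 5),
                 x1 = c(1, NA, 3, 4, 5,
                        10, 20, NA, 40, 50,
                        100, 200, 300, NA, 500))

# Hypothetical stand-in for mice::complete(mice(d)):
# fill NAs with the mean of the column in the data it receives
impute_mean <- function(d) {
  d$x1[is.na(d$x1)] <- mean(d$x1, na.rm = TRUE)
  d
}

# Bug: the body ignores x and uses the global df,
# so every group returns all 15 rows
wrong <- lapply(split(df, df$ID), function(x) impute_mean(df))

# Fix: use the argument x, so each call only sees its own group
right <- lapply(split(df, df$ID), function(x) impute_mean(x))

dim(do.call(rbind, wrong))  # 45 rows, 2 columns
dim(do.call(rbind, right))  # 15 rows, 2 columns
```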

Running for loop across multiple groups

I am running the following imputation task in R as a for loop:
myData <- essuk[c(2,3,4,5,6,12)]
myDataImp <- matrix(0,dim(myData)[1],dim(myData)[2])
lower <- c(0)
upper <- c(Inf)
for (k in c(1:5))
{
gmm.fit1 <- gmm.tmvnorm(matrix(myData[,k],length(myData[,k]),1), lower=lower, upper=upper)
useMu <- matrix(gmm.fit1$coefficients[1],1,1)
useSigma <- matrix(gmm.fit1$coefficients[2],1,1)
replaceThese <- myData[,k]<=0
myDataImp[,k] <- myData[,k]
myDataImp[replaceThese,k] <- rtmvnorm(n=sum(replaceThese), c(useMu), c(useSigma), c(-Inf), c(0))
}
The steps are pretty straightforward
Define the data set and an empty imputation data set.
For column 1-5, fit a model.
Extract model estimates to be used for imputation.
Run a model using model estimates and replace values <= 0 with the new values in the imputation data set.
However, I want to do this separately for multiple groups, rather than for the full sample. Column 12 in the data set contains information on group membership (integers ranging from 1-72).
I have tried several options, including splitting the data frame with data_list <- split(myData, myData$V12) and use the lapply() function. However, this does not work due to how model estimates are formatted:
Error in as.data.frame.default(data) :
cannot coerce class ""gmm"" to a data.frame
I have also thought about the possibility of doing a nested for loop, although I am not sure how that could be accomplished. Any suggestions are much appreciated.
What about using subset()?
myData$V12 <- as.factor(myData$V12)
# a list keeps one result per group instead of flattening everything into one vector
listofresults <- list()
for (i in levels(myData$V12)) {
data <- subset(myData, V12 == i)
# your analysis here, run on data instead of myData: result saved in myDataImp
listofresults[[i]] <- myDataImp
}
Not the most elegant, but it should work.
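An alternative to the explicit loop is split(), lapply() and unsplit(), which puts the rows back in their original order. In the sketch below, impute_group() is a hypothetical stand-in for the gmm.tmvnorm()/rtmvnorm() step (it simply replaces values <= 0 with the group mean of the positive values), and the toy myData is invented; the real per-column model fit would go inside that function.

```r
# Toy data (hypothetical): two measurement columns and a group column V12
myData <- data.frame(V1  = c(2, -1, 4, 0, 6, 8),
                     V2  = c(-3, 5, 7, 9, 0, 11),
                     V12 = c(1, 1, 1, 2, 2, 2))

# Hypothetical stand-in for the model-based imputation:
# replace values <= 0 with the mean of the positive values in the same group
impute_group <- function(d, cols) {
  for (k in cols) {
    replaceThese <- d[[k]] <= 0
    d[[k]][replaceThese] <- mean(d[[k]][!replaceThese])
  }
  d
}

# Run the imputation once per group, then restore the original row order
groups    <- split(myData, myData$V12)
imputed   <- lapply(groups, impute_group, cols = c("V1", "V2"))
myDataImp <- unsplit(imputed, myData$V12)
myDataImp
```

Because the function receives one group's data frame at a time, the model object never has to be coerced to a data.frame, which is what the error in the question complains about.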

'R', 'mice', missing variable imputation - how to only do one column in sparse matrix

I have a matrix that is half-sparse. Half of all cells are blank (na) so when I try to run the 'mice' it tries to work on all of them. I'm only interested in a subset.
Question: In the following code, how do I make "mice" only operate on the first two columns? Is there a clean way to do this using row-lag or row-lead, so that the content of the previous row can help patch holes in the current row?
set.seed(1)
#domain
x <- seq(from=0,to=10,length.out=1000)
#ranges
y <- sin(x) +sin(x/2) + rnorm(n = length(x))
y2 <- sin(x) +sin(x/2) + rnorm(n = length(x))
#kill 50% of cells
idx_na1 <- sample(x=1:length(x),size = length(x)/2)
y[idx_na1] <- NA
#kill more cells
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
y2[idx_na2] <- NA
#assemble base data
my_data <- data.frame(x,y,y2)
#make the rest of the data
for (i in 3:50){
my_data[,i] <- rnorm(n = length(x))
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
my_data[idx_na2,i] <- NA
}
#imputation
est <- mice(my_data)
data2 <- complete(est)
str(data2[,1:3])
Places that I have looked for answers:
help document (link)
google of course...
https://stats.stackexchange.com/questions/99334/fast-missing-data-imputation-in-r-for-big-data-that-is-more-sophisticated-than-s
I think what you are looking for can be done by modifying the parameter "where" of the mice function. The parameter "where" is equal to a matrix (or dataframe) with the same size as the dataset on which you are carrying out the imputation. By default, the "where" parameter is equal to is.na(data): a matrix with cells equal to "TRUE" when the value is missing in your dataset and equal to "FALSE" otherwise. This means that by default, every missing value in your dataset will be imputed. Now if you want to change this and only impute the values in a specific column (in my example column 2) of your dataset you can do this:
# Define arbitrary matrix with TRUE values when data is missing and FALSE otherwise
A <- is.na(data)
# Replace all the other columns which are not the one you want to impute (let say column 2)
A[,-2] <- FALSE
# Run the mice function
imputed_data <- mice(data, where = A)
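Here is a small self-contained sketch of the same idea on the nhanes data that ships with mice. One assumption is added that is not part of the answer above: the predictor matrix is restricted so that bmi is predicted from age only (age is complete), to keep the NAs that remain in hyp and chl out of the imputation model.

```r
library(mice)

# nhanes ships with mice; bmi, hyp and chl contain NAs, age is complete
A <- is.na(nhanes)
A[, colnames(A) != "bmi"] <- FALSE   # only cells in bmi are marked for imputation

# Assumption for this sketch: predict bmi from age only
pred <- make.predictorMatrix(nhanes)
pred[, ] <- 0
pred["bmi", "age"] <- 1

imp <- mice(nhanes, where = A, predictorMatrix = pred,
            m = 5, printFlag = FALSE, seed = 123)
completed <- complete(imp)

# bmi should be filled in, hyp and chl should keep their NAs
colSums(is.na(completed))
```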
Instead of the where argument a faster way might be to use the method argument. You can set this argument to "" for the columns/variables you want to skip. Downside is that automatic determination of the method will not work. So:
imp <- mice(data,
method = ifelse(colnames(data) == "your_var", "logreg", ""))
But you can get the default method from the documentation:
defaultMethod
... By default, the method uses pmm, predictive mean matching (numeric data); logreg, logistic regression imputation (binary data, factor with 2 levels); polyreg, polytomous regression imputation for unordered categorical data (factor > 2 levels); polr, proportional odds model for (ordered, > 2 levels).
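One possible way to keep the automatic method choice while still skipping columns is to start from make.method(), which (in mice 3.0 or later, as far as I know) returns the default method for each column, and then blank out the entries you want to skip. The sketch below uses the nhanes2 data from mice. Restricting the predictors of bmi to the complete column age is an extra assumption of this example, so the unimputed NAs in hyp and chl stay out of the model.

```r
library(mice)

# Default method per column; complete columns get ""
meth <- make.method(nhanes2)
meth

# Keep the default only for bmi, skip everything else
meth[names(meth) != "bmi"] <- ""

# Assumption for this sketch: predict bmi from age only (age is complete)
pred <- make.predictorMatrix(nhanes2)
pred[, ] <- 0
pred["bmi", "age"] <- 1

imp <- mice(nhanes2, method = meth, predictorMatrix = pred,
            m = 5, printFlag = FALSE, seed = 123)
colSums(is.na(complete(imp)))
```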
Your question isn't entirely clear to me. Are you saying you wish to only operate on two columns? In that case mice(my_data[,1:2]) will work. Or you want to use all the data but only fill in missing values for some columns? To do this, I'd just create an indicator matrix along the following lines:
isNA <- data.frame(apply(my_data, 2, is.na))
est <- mice(my_data)
mapply(function(x, isna) {
x[isna] <- NA # isna is the logical column of isNA that belongs to x
return(x)
}, <each MI mice return object column-wise>, isNA)
For your final question, "can I use mice for rolling data imputation?" I believe the answer is no. But you should double check the documentation.

Covariance matrices by group, lots of NA

This is a follow up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group specific covariance matrices for them (based on variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis - removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
>MMatrix = MMatrix2[1:2187,4:10]
This worked fine for calculating an overall covariance matrix with:
>cov(MMatrix, use="pairwise.complete.obs",method="pearson")
So to get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ indicator) with:
>CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
>cov.list <- lapply(unique(CovDataM$group),function(x)cov(CovDataM[CovDataM$group==x,-1]))
I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, and it only returned NULLs. I read somewhere that "pairwise.complete.obs" could only be used if method = "pearson" but the addition of that at the end it didn't make a difference either. I need to get covariance matrices of these variables by group, and with all the available data included, if possible, and I am way stuck.
Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol=6,
dimnames=list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))
colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.
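The same result can also be obtained with split() and lapply(), which returns a plain named list that is easy to index by group. This is a sketch on invented data (three of the six variable names, random values, four made-up groups), not the asker's data set.

```r
# Fake data with some NAs, in the spirit of the example above
set.seed(42)
m <- matrix(runif(600), ncol = 3,
            dimnames = list(NULL, c("HML", "RML", "FML")))
m[sample(600, 50)] <- NA

# Group membership: four groups of 50 rows each
grp <- gl(4, 50, labels = paste("group", 1:4))

# One covariance matrix per group, using all pairwise-complete observations
covmats <- lapply(split(as.data.frame(m), grp),
                  cov, use = "pairwise.complete.obs")

names(covmats)
covmats[["group 1"]]
```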

Finding the mean of a variable within an imputed data set for population quintiles

I have a data set "base_data" which has missing values. I have therefore used the package 'Amelia' to impute the missing values into an object "a.output".
I have been able to find the mean for some variables within the imputed results using the following code:
q.out<-NULL
se.out<-NULL
for(i in 1:m) {
dclus <- svydesign(id=~site, data=a.output$base_data[[i]])
q.out <- rbind(q.out, coef(svymean(~hh_expenditure, dclus)))
se.out <- rbind(se.out, SE(svymean(~hh_expenditure, dclus)))}
I have combined the results using:
svymean.combine <- mi.meld(q = q.out, se = se.out)
Which gives me the mean and standard error for household expenditure (hh_expenditure) across the population.
However I have a variable which splits the population into wealth quintiles (wealth_quin).
As such, I am now wanting to find the average, and standard error, of the household expenditure per wealth_quin (a variable which is either 1,2,3,4,or 5).
I initially tried subsetting the imputed data, but this came up with many errors.
Is there a way to do this without having to split up the data into the 5 wealth quintiles before imputing the data?
Cheers,
Timothy
EDIT: HERE IS A WORKABLE EXAMPLE
require(Amelia)
require(survey)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
So now we have a data frame to work with, with missing values for hh_inc which I want to impute.
First step: set the number of imputations.
m<-5
now run the imputation:
a.output <- amelia(x = impute, m=m, autopri=0.5,cs="X",
idvars=c("site","var2"),
logs=c("hh_inc","var3"))
a.output now holds the data from the 5 imputations.
What I now want to do is find the average (and standard error) hh_inc for site 1 and site 2 separately using the imputed values from amelia.
How is that possible to do? I know it is possible to do if I just ignore the NA's. But this might introduce bias, hence why I imputed the values in the first place.
Cheers,
Timothy
EDIT:
I have placed a bounty on this. If no one knows the exact way to do it, then the results from the individual imputed data sets can be combined using Rubin's formula (http://sites.stat.psu.edu/~jls/mifaq.html#minf)
As such, I will award the bounty to someone who can transform the 5 separate imputed datasets from the Amelia object into 5 separate, complete, data frames.
require(Amelia)
require(survey)
require(data.table)
require(plotrix)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
summary(impute)
m <- 5
a.output <- amelia(x = impute, m=m, autopri=0.5,cs="X",
idvars=c("site","var2"),
logs=c("hh_inc","var3"))
stats.out <- NULL
for(i in 1:m){
df2 <- data.table(a.output$imputations[[i]])
df3 <- data.frame(dataset=i,df2[,list(std.error(hh_inc),mean(hh_inc)), by="site"])
stats.out <- rbind(stats.out, df3)
}
colnames(stats.out) <- c("dataset","site","stdError","mean")
stats.out
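The per-dataset rows in stats.out can then be combined for each site with Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance is the average squared standard error plus (1 + 1/m) times the variance of the estimates across imputations. A minimal sketch follows; rubin_pool() is a hypothetical helper name, and the numbers in the toy stats.out are made up.

```r
# Combine m point estimates and their standard errors with Rubin's rules
rubin_pool <- function(q, se) {
  m     <- length(q)
  q_bar <- mean(q)                  # pooled point estimate
  W     <- mean(se^2)               # within-imputation variance
  B     <- var(q)                   # between-imputation variance
  T_var <- W + (1 + 1 / m) * B      # total variance
  c(estimate = q_bar, se = sqrt(T_var))
}

# Toy stand-in for stats.out: 3 imputed datasets x 2 sites (made-up numbers)
stats.out <- data.frame(dataset  = rep(1:3, each = 2),
                        site     = rep(1:2, times = 3),
                        stdError = c(1, 2, 1, 2, 1, 2),
                        mean     = c(10, 50, 12, 52, 11, 51))

# One pooled estimate and standard error per site
pooled <- do.call(rbind, lapply(split(stats.out, stats.out$site),
                                function(d) rubin_pool(d$mean, d$stdError)))
pooled
```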
I'm not sure I understand your question or the structure of your data (specifically the importance of whether the data is imputed or not) but here's how I've done some summary stats by group.
require(data.table)
require(plotrix)
# create some data
df1 <- data.frame(id=seq(1,50,1), wealth = runif(50)*1000)
df1$cutter <- cut(df1$wealth, 5, labels=FALSE)
head(df1)
# put the data into a data.table to speed things up
df2 <- as.data.table(df1)
head(df2)
grp1StdErr <- df2[,std.error(wealth), by="cutter"]
grp1Mean <- df2[,mean(wealth), by="cutter"]
Hope this helps.
Or, in one grouping step :
df2[,list(std.error(wealth),mean(wealth)), by=cut(wealth,5,labels=FALSE)]
