Perform operation on each imputed dataset in R's MICE - r

How can I perform an operation (like subsetting or adding a calculated column) on each imputed dataset in an object of class mids from R's package mice? I would like the result to still be a mids object.
Edit: Example
library(mice)
data(nhanes)
# create imputed datasets
imput = mice(nhanes)
The imputed datasets are stored as a list of lists
imput$imp
where there are rows only for the observations with imputation for the given variable.
The original (incomplete) dataset is stored here:
imput$data
For example, how would I create a new variable calculated as chl/2 in each of the imputed datasets, yielding a new mids object?

This can be done easily as follows -
Use complete() to convert a mids object to a long-format data.frame:
long1 <- complete(midsobj1, action='long', include=TRUE)
Perform whatever manipulations needed:
long1$new.var <- long1$chl/2
long2 <- subset(long1, age >= 5)
use as.mids() to convert back manipulated data to mids object:
midsobj2 <- as.mids(long2)
Now you can use midsobj2 as required. Note that the include=TRUE (used to include the original data with missing values) is needed for as.mids() to compress the long-formatted data properly. Note that prior to mice v2.25 there was a bug in the as.mids() function (see this post https://stats.stackexchange.com/a/158327/69413)
EDIT: According to this answer https://stackoverflow.com/a/34859264/4269699 (from what is essentially a duplicate question) you can also edit the mids object directly by accessing $data and $imp. So for example
midsobj2<-midsobj1
midsobj2$data$new.var <- midsobj2$data$chl/2
midsobj2$imp$new.var <- midsobj2$imp$chl/2
You will run into trouble though if you want to subset $imp or if you want to use $call, so I wouldn't recommend this solution in general.

Another option is to calculate the variables before the imputation and place restrictions on them.
library(mice)
# Create the additional variable - this will have missing
nhanes$extra <- nhanes$chl / 2
# Change the method of imputation for extra, so that it always equals chl/2
# Change the predictor matrix so only chl predicts extra
ini <- mice(nhanes, max = 0, print = FALSE)
meth <- ini$meth
meth["extra"] <- "~I(chl / 2)"
pred <- ini$pred # extra isn't used to predict
pred["extra", "chl"] <- 1
# Imputations
imput <- mice(nhanes, seed = 1, pred = pred, meth = meth, print = FALSE)
There are examples in mice: Multivariate Imputation by Chained Equations in R.

There is an overload of with that can help you here
with(imput, chl/2)
the documentation is given at ?with.mids

There's a function for this in the basecamb package:
library(basecamb)
apply_function_to_imputed_data(mids_object, function)

Related

Issue with h2o Package in R using subsetted dataframes leading to near perfect prediction accuracy

I have been stumped on this problem for a very long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information of the parent but I also feel it's causing issues when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See below for sample code. I included comments to clarify what I'm doing but it's fairly short code:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices -- A boolean
test = (-train) # Create testing indices
h2o.init(nthreads=2) # Initiate h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
Has anyone encountered this issue? Have an idea of how to alleviate this problem?
It will be more efficient to use the H2O functions to load the data and split it.
data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data)) #Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8) #Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is, without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not being split, and it is actually being trained on the test data.

Library "TableOne" multiple comparisons. Calculate line by line p-values

I received a comment from a reviewer who wanted to have all the p-values for each line of specific variables levels in a demographic characteristic table (Table 1). Even though the request appears quite strange (and inexact) to me, I would like to agree with his suggestion.
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
print(tableOne, nonnormal = c("bili","chol","copper","alk.phos","trig"), exact = c("status","stage"), smd = TRUE)
the output:
I need to have the p-values for each level of the variables status, edema and stage, with Bonferroni correction. I went through the documentation without success.
In addition, is it correct to use chi-squared to compare sample sizes across rows?
UPDATE:
I'm not sure if my approach is correct, however I would like to share it with you. I generated for the variable status a dummy variable for each strata, than I calculated the chisq .
library(tableone)
## Load data
library(survival); data(pbc)
d <- pbc[,c("status", "trt")]
# Convert dummy variables
d$status.0 <- ifelse(d$status==0, 1,0)
d$status.1 <- ifelse(d$status==1, 1,0)
d$status.2 <- ifelse(d$status==2, 1,0)
t <- rbind(
chisq.test(d$status.0, d$trt),
# p-value = 0.7202
chisq.test(d$status.1, d$trt),
# p-value = 1
chisq.test(d$status.2, d$trt)
#p-value = 0.7818
)
t
BONFERRONI ADJ FOR MULTIPLE COMPARISONS:
p <- t[,"p.value"]
p.adjust(p, method = "bonferroni")
This question was posted some time ago, so I supose you already answered the reviewer.
I don't really understand why computing adjusted p values for just three varibles. In fact, adjusting p values depends on the number of comparisons made. If you use p.adjust() with a vector of 3 p values, results will not really be "adjusted" by the amount of comparison made (you really did more than a dozen and a half!)
I show how to extract all p-values so you can compute the adjusted ones.
To extract pValues from package tableOne there is a way calling object attributes (explained first), and two quick and dirty ways (at the bottom part).
To extract them, first I copy your code to create your tableOne:
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
You can see what your "tableOne" object has via attributes()
attributes(tableOne)
You can see a tableOne usually has a table for continuous and categorical variables. You can use attributes() in them too
attributes(tableOne$CatTable)
# you can notice $pValues
Now you know "where" the pValues are, you can extract them with attr()
attr(tableOne$CatTable, "pValues")
Something similar with numerical variables:
attributes(tableOne$ContTable)
# $pValues are there
attr(tableOne$ContTable, "pValues")
You have pValues for Normal and NonNormal variables.
As you set them before, you can extract both
mypCont <- attr(tableOne$ContTable, "pValues") # put them in an object
nonnormal = c("bili","chol","copper","alk.phos","trig") # copied from your code
mypCont[rownames(mypCont) %in% c(nonnormal), "pNonNormal"] # extract NonNormal
"%!in%" <- Negate("%in%")
mypCont[rownames(mypCont) %!in% c(nonnormal), "pNonNormal"] # extract Normal
All that said, and your pValues extracted, I think there are two much more convenient quick and dirty ways to accomplish the same:
Quick and dirty way A: using dput() with your printed tableOne. Then search in the console where the pValues are and copy-paste them to the script, to store them in an object
Quick and dirty way B: If you look in tableOne vignette there is an "Exporting" section, you can use print(tableOne, quote = TRUE) and then just copy and paste to a spreadsheet (like LibreOffice, Excel...).
Then I would select the column with pValue, transpose it, and get it back to R, to compute adjusted p values with p.adjust() and copy them back to the spreadsheet for journal submission

compare R packages missForest and Hmisc performance

I am trying to compare the performance of 2 R packages, missForest and Hmisc performace in dealing with missing value, when there are more than 50% missing values.
I got testing data in this way:
data("iris")
library(missForest)
iris.mis <- prodNA(iris, noNA = 0.6)
summary(iris.mis)
mis1 <- iris.mis
mis2 <- iris.mis
In missForest, it has mixError() method which allows you to compare the imputation accuracy with the original data.
# using missForest
missForest_imputed <- missForest(mis1, ntree = 100)
missForest_error <- mixError(missForest_imputed$ximp, mis1, iris)
dim(missForest_imputed$ximp)
missForest_error
Hmisc does not have mixError() method, I am using its powerful aregImpute() to do the imputation, like this:
# using Hmisc
library(Hmisc)
hmisc_imputed <- aregImpute(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species,
data = mis2, n.impute = 1)
I was hoping to convert the imputed results into a format like missForest_imputed$ximp, so that I can use mixError() method. The problem is, in aregImpute(), no matter I tried n.impute = 1 or n.impute = 5, I cannot have 150 values for each feature like the original data iris... And the number of values in each feature is also different....
So, is there any way to compare the performance of missForest and Hmisc in dealing with missing values?
Part 1
Hmisc::aregImpute returns the imputed values. For your object named hmisc_imputed, they can be found in hmisc_imputed$imputed. However, the imputed object is a list for each dimension.
If you wish to recreate the equivalent of missForest_imputed$ximp, you have to do it yourself. To do so, we can use the fact that:
all.equal(as.integer(attr(xx$Sepal.Length, "dimnames")[[1]]), which(is.na(iris.mis$Sepal.Length))) ## returns true
Which I do here:
check_missing <- function(x, hmisc) {
return(all.equal(which(is.na(x)), as.integer(attr(hmisc, "dimnames")[[1]])))
}
get_level_text <- function(val, lvls) {
return(lvls[val])
}
convert <- function(miss_dat, hmisc) {
m_p <- ncol(miss_dat)
h_p <- length(hmisc)
if (m_p != h_p) stop("miss_dat and hmisc must have the same number of variables")
# assume matches for all if 1 matches
if (!check_missing(miss_dat[[1]], hmisc[[1]]))
stop("missing data an imputed data do not match")
for (i in 1:m_p) {
i_factor <- is.factor(miss_dat[[i]])
if (!i_factor) {miss_dat[[i]][which(is.na(miss_dat[[i]]))] <- hmisc[[i]]}
else {
levels_i <- levels(miss_dat[[i]])
miss_dat[[i]] <- as.character(miss_dat[[i]])
miss_dat[[i]][which(is.na(miss_dat[[i]]))] <- sapply(hmisc[[i]], get_level_text, lvls= levels_i)
miss_dat[[i]] <- factor(miss_dat[[i]])
}
}
return(miss_dat)
}
iris.mis2 <- convert(iris.mis, hmisc_imputed$imputed)
Part 2
mixError uses RMSE to calculate error-rates, ?mixError:
Value imputation error. In case of continuous variables only this is the normalized root mean squared error (NRMSE, see 'help(missForest)' for further details). In case of categorical variables onlty this is the proportion of falsely classified entries (PFC). In case of mixed-type variables both error measures are supplied.
To do this on your object from "Part 1" [iris.mis2], you just need to use the nrmse function, which is provided in library(missForest).

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises of 56. The first 28 (training set) have been used for model calibration, now all I want to do is to predict the response values for the remaining 28 observations in the tesset. For some reason, however, R keeps on the returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10, validation =
"LOO", jackknife = TRUE)
It was then identified that a model comprising of 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates are the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the below code, in which newdata = is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically is. Is there someone out there who can, and is willing to help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
height = rpois(56,10),
fbm = rpois(56,10),
nitrogen = rpois(56,10),
carbon = rpois(56,10),
chl = rpois(56,10),
ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation =
"LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside of a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.

MICE imputation even when same data in column

Is it possible to get an imputation using the package MICE even when all the values in the column are the same? Then it would impute just with that number.
Example:
test<-data.frame(var1=c(2.3,2.3,2.3,2.3,2.3,NA),var2=c(5.3,5.6,5.9,6.4,4.5,NA))
miceImp<-mice(test)
testImp<-complete(miceImp)
only imputate on var2. I would like it to replace the NA in var1 too with 2.3.
You can use passive imputation for this. For a full explanation, see section 3.4 on page 25 of this article. As applied to constant variables, the objective here would be to set the imputation method for any constant variable x to the constant value of x. If the constant value of x is y, then the imputation method for x should be "~I(y)".
test = data.frame(
var1=c(2.3,2.3,2.3,2.3,2.3,NA,2.3),
var2=c(5.3,5.6,5.9,6.4,4.5,5.1,NA),
var3=c(NA,1:6))
cVars = which(sapply(test,sd,na.rm=T)==0) #determine which vars are constant (props to SimonG)
allMeans = colMeans(test,na.rm=T) #get the column means
miceImp.ini = mice(test,maxit=0,print=F) #initial mids object with no imputations
meth = miceImp.ini$method #extract the imputation method vector
meth[cVars] = paste0("~I(",allMeans[cVars],")") #set the imputation method to be a constant (the current column mean)
miceImp = mice(test,method=meth) #run the imputation with the user defined imputation methods
testImp = complete(miceImp) #extract an imputedly complete dataset
View(testImp) #take a look at it
All that being said, constant values tend not to be of great use in statistics, so it might be more efficient to drop any constant variables before imputation (since imputation is such a costly process).

Resources