Mice: partial imputation using where argument failing - r

I have run into a problem using the mice function for multiple imputation. I want to impute only part of the missing data, which, judging from the help page, seems possible and straightforward, but I can't get it to work.
Here is the example:
I have some missing data on x and y:
library(mice)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10),c("x","y")] <- NA
I want only to impute missing data on y:
where <- data.frame(ID = rep(FALSE,100),x = rep(FALSE,100),y = is.na(plouf$y))
I do the imputation
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
I look at the imputed values:
test <- complete(plouf.imp)
Here I still have NAs in y:
> sum(is.na(test$y))
[1] 10
If I use where to request imputation of all missing values, it works:
where <- data.frame(ID = rep(FALSE,100),x = is.na(plouf$x),y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
test <- complete(plouf.imp)
> sum(is.na(test$y))
[1] 0
But this also imputes x, which I don't want in this specific case (for speed reasons in a statistical simulation study).
Does anyone have any idea?

This happens because of the following line:
plouf[sample(100,10),c("x","y")] <- NA
Consider your first case, in which you want to impute y only. Check its PredictorMatrix:
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = where)
plouf.imp
#PredictorMatrix:
# ID x y
#ID 0 0 0
#x 0 0 0
#y 1 1 0
It says that the missing values of y will be predicted from ID and x, since their entries are 1 in row y.
Now check your sample data, where you set NA in both the x and y columns. Notice that wherever y is NA, x is NA as well.
So when mice consults the PredictorMatrix to impute y, it encounters NA in x and skips those rows, because all predictors (i.e. ID and x) must be non-missing in order to predict the missing values of y.
Try this:
library(mice)
#sample data
set.seed(123)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10), "x"] <- NA
set.seed(999)
plouf[sample(100,10), "y"] <- NA
#missing value imputation
whr <- data.frame(ID = rep(FALSE,100), x = rep(FALSE,100), y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
test <- complete(plouf.imp)
sum(is.na(test$y))
#[1] 1
Here only one value of y is left unimputed. In that row (row 39) both x and y are NA, as in your first case.
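A quick way to check whether this explanation applies to a given dataset is to count the rows where the variable to impute and one of its predictors are missing together. The sketch below uses base R only and rebuilds the question's data with an arbitrary seed. The names target and blocked are illustrative and not part of mice.

```r
# Rebuild the question's data: the same 10 rows are blanked in both x and y.
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
plouf <- data.frame(ID = rep(LETTERS[1:10], each = 10),
                    x = sample(10, 100, replace = TRUE),
                    y = sample(10, 100, replace = TRUE))
plouf[sample(100, 10), c("x", "y")] <- NA

# Rows we would like to impute in y
target <- is.na(plouf$y)
# Of those, rows whose predictor x is missing as well
blocked <- target & is.na(plouf$x)

sum(target)   # number of missing y values
sum(blocked)  # how many of them also lack x
```

If blocked covers all of target, as it does here by construction, imputing y from x while leaving x untouched has nothing to work with for those rows.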

R: How to access a 'complicated list'

I am working on an assignment that asks me to generate a list of datasets using the code below.
## Use the make_data function to generate 25 different datasets, with mu_1 being a vector
library(caret)  # provides createDataPartition
library(dplyr)  # provides %>% and slice
x <- seq(0, 3, len=25)
make_data <- function(a){
n = 1000
p = 0.5
mu_0 = 0
mu_1=a
sigma_0 = 1
sigma_1 = 1
y <- rbinom(n, 1, p)
f_0 <- rnorm(n, mu_0, sigma_0)
f_1 <- rnorm(n, mu_1, sigma_1)
x <- ifelse(y == 1, f_1, f_0)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}
dat <- sapply(x,make_data)
The code looks good to go, and 'dat' appears to be a 25 column, 2 row table, each with its own data frame.
Now, each data frame within a cell has 2 columns.
And this is where I get stuck.
While I can get to the data frame in row 1, column 1, just fine (i.e. just use dat[1,1]), I can't reach the column of 'x' values within dat[1,1]. I've experimented with
dat[1,1]$x
dat[1,1][1]
But they only throw weird responses: error/null.
Any idea how I can pull the column? Thanks.
dat[1, 1] is a list.
class(dat[1, 1])
#[1] "list"
So to reach x you can do
dat[1, 1]$train$x
Or
dat[1, 1][[1]]$x
As a side note, instead of having this 2 x 25 matrix as output in dat, I would actually prefer a nested list.
dat <- lapply(x,make_data)
#Access `x` column of first list from `train` dataset.
dat[[1]]$train$x
However, this is quite subjective, and you can choose whichever format you like best.
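The difference between the two containers can be reproduced without caret. The sketch below uses a simplified stand-in, make_pair, which is hypothetical and only mimics the shape of make_data's return value (a list with train and test data frames).

```r
# Hypothetical stand-in for make_data: returns a list of two data frames.
make_pair <- function(a) {
  list(train = data.frame(x = a + 1:3),
       test  = data.frame(x = a - 1:3))
}

# sapply simplifies the results into a list-matrix: rows are train/test,
# columns are the inputs.
m <- sapply(c(0, 10), make_pair)
dim(m)            # 2 rows, 2 columns
class(m[1, 1])    # single brackets return a list wrapping the data frame
m[[1, 1]]$x       # double brackets return the data frame itself

# lapply keeps a plain nested list, which is often easier to navigate.
l <- lapply(c(0, 10), make_pair)
l[[1]]$train$x
```

Double brackets with two indices are a convenient way to pull a single cell out of a list-matrix without the extra list layer.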

R: upSample in Caret is removing target variable completely

I am trying to upsample an imbalanced dataset in R using the upSample function in Caret. However upon applying the function it completely removes the target variable C_flag from the dataset. Here is my code:
set.seed(100)
'%ni%' <- Negate('%in%')
up_train <- upSample(x = train[, colnames(train) %ni% "C_flag"], #all predictor variables
y = train$C_flag) #target variable
Here is the amount of each category of C_flag in the train set.
0 = 100193, 1=29651.
I test to see if C_flag is there with this result:
print(up_train$C_flag)
NULL
Does anyone know why this function is removing this variable instead of upsampling?
The first thing that comes to mind is whether train$C_flag is a factor or not. Anyway, I tried this sample dataset:
library(tidyverse)
library(caret)
train <- data.frame(x1 = c(2,3,4,2,3,3,3,8),
x2 = c(1,2,1,2,4,1,1,4),
C_flag = c("A","B","B","A","A","A","A","A"))
train$C_flag <- as.factor(train$C_flag)
'%ni%' <- Negate('%in%')
up_train <- upSample(x = train[,colnames(train) %ni% "C_flag"],
y = train$C_flag)
up_train$C_flag
And it returned NULL. Why? Because the target column was renamed "Class". So if you want the target to keep the name C_flag, pass that name through the yname argument:
up_train <- upSample(x = train[,colnames(train) %ni% "C_flag"],
y = train$C_flag,
yname = "C_flag")
print(up_train$C_flag)
[1] A A A A A A B B B B B B
Levels: A B
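To see why the column seems to disappear, it can help to look at a base R sketch of up-sampling. This is not caret's implementation. up_sample_sketch is a hypothetical helper that assumes y is a factor with at least one row per level, and it mirrors only the idea of a yname argument that defaults to "Class".

```r
# Hypothetical base R sketch of up-sampling (not caret's code).
up_sample_sketch <- function(x, y, yname = "Class") {
  stopifnot(is.factor(y))
  target <- max(table(y))                # size of the largest class
  idx <- unlist(lapply(levels(y), function(lv) {
    rows <- which(y == lv)
    extra <- target - length(rows)       # rows to draw with replacement
    c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
  }))
  out <- x[idx, , drop = FALSE]
  out[[yname]] <- y[idx]                 # target is re-attached under yname
  rownames(out) <- NULL
  out
}

set.seed(100)
train <- data.frame(x1 = c(2,3,4,2,3,3,3,8),
                    x2 = c(1,2,1,2,4,1,1,4),
                    C_flag = factor(c("A","B","B","A","A","A","A","A")))

up_default <- up_sample_sketch(train[, c("x1", "x2")], train$C_flag)
names(up_default)          # target column is called "Class"

up_named <- up_sample_sketch(train[, c("x1", "x2")], train$C_flag,
                             yname = "C_flag")
table(up_named$C_flag)     # both classes now have the same count
```

Because the predictors and the target are passed separately, the function has no way of knowing the target's original column name unless it is supplied.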

Sensitivity Analysis for Missing Data in R with MICE

I am working on a meta-analysis and a sensitivity analysis for missing data. I want to replace censored data with either 0 or 1 according to a predefined probability.
I have a dataset with column x (time points) and column y (events: 1 = event, 0 = censored). For the analysis I replaced some of the 0s with NAs. z is the indicator for the treatment arm. I want to replace the NAs with either 1 or 0 according to a predefined probability.
This is my code (just an example):
library(mice)
x <- c(1:10)
y <- c(1,1,1,NA,NA,NA,1,1,0,NA)
z <- rep(2,10)
data <- data.frame(x,y,z)
str(data)
md.pattern(data)
mice.impute.myfunct <- function(y, ry, x, ...)
{event <- sample(c(0:1), size = 1, replace=T, prob=c(0.5,0.5)); return(event)}
data.imp <- mice(data, me = c("","myfunct",""), m = 1)
data.comp <- complete(data.imp)
I would expect the NAs in y to be replaced with 0 (20% of cases) and 1 (80% of cases). But the NAs are replaced either only with 0 or only with 1.
I have to admit that I am quite a beginner with R and have not had to write my own functions before.
Thank you very much for your help!
Here is a possible solution that simply replaces the missing values with 0 or 1, with the probability varying between 0.1 and 0.9:
for( i in seq(0.1,0.9,0.1)){
data[[paste0("y_imp",i)]] <- data$y
N <- sum(is.na( data$y))
data[[paste0("y_imp",i)]][is.na(data[[paste0("y_imp",i)]])] <- sample(c(0,1), size = N, replace=T, prob=c(i,1-i))
}
data[[paste0("y_imp",i)]] <- data$y creates the column in which each missing value is replaced by 0 with probability i.
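If you prefer to stay within mice, the likely reason the original function fills every NA with the same value is that it draws a single number (size = 1), which is then recycled. The sketch below draws one value per entry to impute. It assumes that mice calls custom methods with the arguments y, ry, x and wy, as its built-in methods do. mice.impute.bernoulli and p_event are hypothetical names, and 0.8 is taken from the 80% mentioned in the question.

```r
# Hypothetical custom imputation method: one Bernoulli draw per missing entry.
# Assumption: mice passes y (the variable), ry (TRUE where observed),
# x (predictors) and wy (TRUE where an imputation is wanted).
mice.impute.bernoulli <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  p_event <- 0.8                       # assumed probability of imputing a 1
  rbinom(sum(wy), size = 1, prob = p_event)
}

# Calling the function directly, outside mice, to check its output length.
set.seed(42)  # arbitrary seed
y <- c(1, 1, 1, NA, NA, NA, 1, 1, 0, NA)
ry <- !is.na(y)
imp <- mice.impute.bernoulli(y, ry, x = NULL)
length(imp)   # one value per NA in y

# With mice loaded, it would be selected by name (untested here):
# data.imp <- mice(data, method = c("", "bernoulli", ""), m = 1)
```

The key point is the length of the returned vector: it has to match the number of values to impute rather than being a single draw.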

Optimizing a raster::calc function - function 1 vs 2 - R

I am working on calculating a new raster (output ras) based on 2 rasters (input ras) and a 'stratum' raster. The Stratum raster values (1 to 4) refer to the rows in the bias and weight dataframes. Strata value '4' was used to fill any 'NA' in the Strata raster, otherwise the function would crash. The following input is required.
# load library
library(raster)
# reproducing the bias and weight data.frames
bias <- data.frame(
ras_1 = c(56,-7,-30,0),
ras_2 = c(29,18,-52,0),
ras_3 = c(44,4,-15,0)
)
rownames(bias) <- c("Strat 1","Strat 2","Strat 3","Strat 4")
weight <- data.frame(
ras_1 = c(0.56,0.66,0.23,0.33),
ras_2 = c(0.03,0.18,0.5,0.33),
ras_3 = c(0.41,0.16,0.22,0.34)
)
rownames(weight) <- c("Strat 1","Strat 2","Strat 3","Strat 4")
The following function (fusion) adds a 'bias' value to the input rasters. After the bias has been added, the two corrected input raster cell values are multiplied by a weight value that depends on the stratum they belong to.
The two weighted values are then summed and returned using 'calc'.
## Create raster data for input
# create 2 rasters
r1 <- raster(ncol=10,nrow=10)
r2 <- raster(ncol=10,nrow=10)
r1[] <- sample(seq(from = 1, to = 500, by = 1), size = 100, replace = TRUE)
r2[] <- sample(seq(from = 1, to = 500, by = 1), size = 100, replace = TRUE)
r2[1:2] <- NA # include NA in input maps for example purpose
# Create strata raster (4 strata's)
r3 <- raster(ncol=10,nrow=10)
r3[] <- sample(seq(from = 1, to = 4, by = 1), size = 100, replace = TRUE)
Strata.n <- 4 # number of strata values in this example
fusion <- function(x) {
result <- matrix(NA, dim(x)[1], 1)
for (n in 1:Strata.n) {
ok <- !is.na(x[,3]) & x[,3] == n
a <- x[ok,1] + bias[n,1] # add bias to first input raster value
b <- x[ok,2] + bias[n,2] # add bias to second input raster value
result[ok] <- a * weight[n,1] + b * weight[n,2] # Multiply values by weight
}
return(result)
}
s <- stack(r1,r2,r3)
Fused.map <- calc(s, fun = fusion, progress = 'text')
The problem with the above function is that:
It is only suited for 2 rasters
If one raster has NA, then the result will be NA for that cell
is.na(Fused.map@data@values) # check for NA in the fused map
What I would like to have is:
A function that takes any number of input rasters
It can work with NA values (ignores NA values in the rasters)
Re-adjusts the 'weight' if a raster has a NA value, so that the remaining weight values add up to 1
EDIT
The following function does what I need, but is significantly slower than the function above on large rasters. Fusion does it in 10 seconds, fusion2 function below needs 8 hours on large rasters...
fusion2 <- function(x) {
m <- matrix(x, nrow= 1, ncol=3) # Create matrix per stack of cells
n <- m[,3] # get the stratum
g <- m[1:(Strata.n-1)] + as.matrix(bias[n,]) # add bias to raster values
g[g < 0] <- 0 # set values below 0 to 0
w <- weight[n,1:(Strata.n-1)] # get correct strata weight values
w[is.na(g)]<- NA # set weight to NA if (g) raster values are NA
p <- sum(w, na.rm = T) # calculate sum of weight values
pp <- w/p # divide weight values by sum to get the proportion to == 1
pp <- as.numeric(pp)
result <- as.integer(round(sum(pp*g, na.rm = T))) # return raster value
return(result)
}
Fused.map <- calc(s, fun = fusion2, progress = 'text')
Is there any way to optimize fusion2 so that it runs about as fast as fusion?
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Thank you for your time!
There seems to be a lot of unnecessary format conversions going on, and using the simplest data structures available is the fastest. calc parameter is a numeric vector, so you can use numeric vectors everywhere. Also, rounding and casting into an integer is redundant.
fusion3 <- function(x) {
n <- x[3] # get the stratum
g <- x[1:(Strata.n-1)] + as.numeric(bias[n,]) # add bias to raster values
g[g < 0] <- 0 # set values below 0 to 0
w <- as.numeric(weight[n,1:(Strata.n-1)]) # get correct strata weight values
w[is.na(g)]<- NA # set weight to NA if (g) raster values are NA
p <- sum(w, na.rm = T) # calculate sum of weight values
pp <- w/p # divide weight values by sum to get the proportion to == 1
result <- as.integer(sum(pp*g, na.rm = T)+0.5) # return raster value
return(result)
}
On a 100x100 raster, your original functions take:
system.time(Fused.map <- calc(s, fun = fusion, progress = 'text'))
user system elapsed
0.015 0.000 0.015
system.time(Fused.map <- calc(s, fun = fusion2, progress = 'text'))
user system elapsed
8.270 0.078 8.312
The modified function is already about 4 times faster:
system.time(Fused.map3 <- calc(s, fun = fusion3, progress = 'text'))
user system elapsed
1.970 0.026 1.987
Next, precompute matrices from the data frames so you don't need to do that for each pixel:
bias_matrix = as.matrix(bias)
weight_matrix = as.matrix(weight)
fusion3 <- function(x) {
n <- x[3] # get the stratum
g <- x[1:(Strata.n-1)] + bias_matrix[n,] # add bias to raster values
g[g < 0] <- 0 # set values below 0 to 0
w <- weight_matrix[n,1:(Strata.n-1)] # get correct strata weight values
w[is.na(g)]<- NA # set weight to NA if (g) raster values are NA
p <- sum(w, na.rm = T) # calculate sum of weight values
pp <- w/p # divide weight values by sum to get the proportion to == 1
result <- as.integer(sum(pp*g, na.rm = T)+0.5) # return raster value
return(result)
}
We get:
system.time(Fused.map3 <- calc(s, fun = fusion3, progress = 'text'))
user system elapsed
0.312 0.008 0.318
And finally, also precompute 1:(Strata.n-1):
bias_matrix = as.matrix(bias)
weight_matrix = as.matrix(weight)
Strata.minus1 = 1:(Strata.n-1)
fusion3 <- function(x) {
n <- x[3] # get the stratum
g <- x[Strata.minus1] + bias_matrix[n,] # add bias to raster values
g[g < 0] <- 0 # set values below 0 to 0
w <- weight_matrix[n,Strata.minus1] # get correct strata weight values
w[is.na(g)]<- NA # set weight to NA if (g) raster values are NA
p <- sum(w, na.rm = T) # calculate sum of weight values
pp <- w/p # divide weight values by sum to get the proportion to == 1
result <- as.integer(sum(pp*g, na.rm = T)+0.5) # return raster value
return(result)
}
We get:
system.time(Fused.map3 <- calc(s, fun = fusion3, progress = 'text'))
user system elapsed
0.252 0.011 0.262
That's not quite 0.015 yet, but you also have to take into account that your original function does not output integers, does not set values below 0 to 0, does not make the proportions sum to 1, and, as you mentioned, does not deal with NAs.
Mind you, this function still works with only two rasters, because you hardcode the stratum as layer 3. You should instead use raster::overlay with two parameters, the stratum raster and the layers themselves (or use calc with the stratum raster as layer 1, but that's not what calc is designed for).
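Another direction, closer to how the original fusion works, is to process many cells at once with matrix operations instead of one cell per call. The sketch below is base R only and does not use the raster package. fuse_cells is a hypothetical helper that assumes vals is a cells-by-layers matrix, stratum holds no NA, and bias and weight are strata-by-layers matrices. It works for any number of layers and renormalises the weights over the non-missing layers.

```r
# Hypothetical vectorised fusion over a block of cells (base R only).
fuse_cells <- function(vals, stratum, bias, weight) {
  g <- vals + bias[stratum, , drop = FALSE]   # add per-stratum bias
  g <- pmax(g, 0)                             # clamp negatives, NA stays NA
  w <- weight[stratum, , drop = FALSE]        # per-stratum weights
  w[is.na(g)] <- 0                            # drop weight of missing layers
  g[is.na(g)] <- 0
  # Weighted mean over the available layers; a cell with no data gives NaN.
  round(rowSums(w * g) / rowSums(w))
}

# Toy input: 3 cells, 2 layers, 2 strata (made-up numbers).
vals    <- rbind(c(10, 20), c(NA, 30), c(5, NA))
stratum <- c(1, 2, 1)
bias    <- rbind(c(0, 0), c(1, -40))
weight  <- rbind(c(0.5, 0.5), c(0.25, 0.75))

res <- fuse_cells(vals, stratum, bias, weight)
res
```

Whether this can be plugged into calc or overlay directly depends on how those functions hand blocks of cells to the callback, so treat it as a starting point rather than a drop-in replacement.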

ddply skipping empty dataframe

I am trying to get the empirical distribution of a sample over the levels of a factor.
For some reason, running the following:
a <- daply(caseDataset, x, nrow) / nrow(caseDataset)
gives me NA for the levels of the factor x that have no rows in the dataset.
So I have to override the result with
a[is.na(a)] <- 0
How can I force daply to behave uniformly (and pass the empty data frame down to nrow)?
Sample for caseDataset:
dataset <- data.frame(
a1 = c(TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE),
a2 = c(TRUE,TRUE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,TRUE),
a3 = c(1,6,5,4,7,3,8,7,5),
target = c('+','+','-','+','-','-','-','+','-'))
caseDataset <- subset(dataset, target=='-')
daply(caseDataset, "target", nrow)
Does the .drop_i switch do what you are after?
> daply(caseDataset, "target", nrow, .drop_i=FALSE)
- +
5 0
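As an alternative that avoids plyr altogether, base R's table counts empty factor levels as 0, provided the grouping column is a factor. The sketch below makes target a factor explicitly, since data.frame no longer does so by default in recent R versions.

```r
# Same sample data, with target stored explicitly as a factor.
dataset <- data.frame(
  a3 = c(1, 6, 5, 4, 7, 3, 8, 7, 5),
  target = factor(c('+','+','-','+','-','-','-','+','-'),
                  levels = c("-", "+")))
caseDataset <- subset(dataset, target == "-")

# subset keeps the unused level "+", so table reports it with a count of 0.
counts <- table(caseDataset$target)
props  <- counts / nrow(caseDataset)   # empirical distribution
props
```

This gives the proportions directly, with 0 rather than NA for levels that do not occur.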
