r mice - "sample" imputation method not working correctly - r

I am using mice to impute missing data in a large dataset (24k obs, 98 vars). I am using the "sample" imputation method to impute some variables (and other methods for the others - many categorical). When I check my imputed data, those variables that I've applied "sample" to are not always imputed and I have missingness in them. I know for sure that I'm applying "sample" to them (I double checked the methods), and I made sure to remove all predictors of them in the prediction matrix. From my understanding, where they are in the visit sequence shouldn't matter (but I make sure they're immediately after variables with no missingness).
I can't give you a reprex because when I try to recreate the problem, it doesn't happen and everything is imputed just fine. I tried simulating my own data and I tried subsetting the dataset to a group of the variables that I want to use the sample method on. That's part of why I'm so stumped - I coded everything the same and it worked with the subset. I didn't think that the sample method would be at all dependent on the presence of any other vars.
EDIT:
This is the code I'm using
#produce prediction matrix
pred1 <- quickpred_ext(data1, mincor = 0.08, include = "age")
pred2 <- pred1
# for vars to not be imputed, set all predictors to 0
data_no_impute <- data1 %>%
select(contains(c("exp_", "outcome_"))) %>%
select(sort(names(.))) %>%
names
data_level3 <- data1 %>%
select(contains(c("f4", "f5", "f6")),
k22) %>%
select(sort(names(.))) %>%
names
pred2[data_no_impute,] <- 0
pred2[data_level3,] <- 0
#produce initial methods and visit sequence
initial <- mice(data1, max = 0, print = F, vis = "monotone",
defaultMethod = c("pmm", "logreg", "polyreg", "polr"))
#edit methods to be blank for vars I don't want to impute, "sample" for level 3
meth1 <- initial$meth
meth2 <- meth1
meth2[data_level3] <- "sample"
meth2[data_no_impute] <- ""
visits1 <- initial$visitSequence
visits2 <- visits1
visits2 <- append(visits2, data_level3,22)
#run mice test
mice_test <- mice(data1, m = 2, print = F,
predictorMatrix = pred2,
method = meth2,
vis = visits2,
nnet.MaxNWts = 3000)
#pull second completed dataset
imput1 <- mice::complete(mice_test, 2, include = F)
#look at missingness patterns
missingness_pattern2 <- md.pattern(imput1, plot = F)

Related

how to use R package `caret` to run `pls::plsr( )` with multiple responses

the caret::train() does not seem to accept y if y is a matrix of multiple columns.
Thanks for any help!
That's correct. Perhaps you want the tidymodels package? Kuhn has said there would be support for multivariate response in it. Here's evidence in favor of my suggestion: https://www.tidymodels.org/learn/models/pls/
Do a search of that document for plsr:
library(tidymodels)
library(pls)
get_var_explained <- function(recipe, ...) {
# Extract the predictors and outcomes into their own matrices
y_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_outcomes())
x_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_predictors())
# The pls package prefers the data in a data frame where the outcome
# and predictors are in _matrices_. To make sure this is formatted
# properly, use the `I()` function to inhibit `data.frame()` from making
# all the individual columns. `pls_format` should have two columns.
pls_format <- data.frame(
endpoints = I(y_mat),
measurements = I(x_mat)
)
# Fit the model
mod <- plsr(endpoints ~ measurements, data = pls_format)
# Get the proportion of the predictor variance that is explained
# by the model for different number of components.
xve <- explvar(mod)/100
# To do the same for the outcome, it is more complex. This code
# was extracted from pls:::summary.mvr.
explained <-
drop(pls::R2(mod, estimate = "train", intercept = FALSE)$val) %>%
# transpose so that components are in rows
t() %>%
as_tibble() %>%
# Add the predictor proportions
mutate(predictors = cumsum(xve) %>% as.vector(),
components = seq_along(xve)) %>%
# Put into a tidy format that is tall
pivot_longer(
cols = c(-components),
names_to = "source",
values_to = "proportion"
)
}
#We compute this data frame for each resample and save the results in the different columns.
folds <-
folds %>%
mutate(var = map(recipes, get_var_explained),
var = unname(var))
#To extract and aggregate these data, simple row binding can be used to stack the data vertically. Most of the action happens in the first 15 components so let’s filter the data and compute the average proportion.
variance_data <-
bind_rows(folds[["var"]]) %>%
filter(components <= 15) %>%
group_by(components, source) %>%
summarize(proportion = mean(proportion))
This might not be a reproducible code block. May need additional data or packages.

"Error in model.frame.default(data = train, formula = cost ~ .) : variable lengths differ", but all variables are length 76?

I'm modeling burrito prices in San Diego to determine whether some burritos are over/under priced (according to the model). I'm attempting to use regsubsets() to determine the best linear model, using the BIC, on a data frame of 76 observations of 14 variables. However, I keep getting an error saying that variable lengths differ, and thus a linear model doesn't work.
I've tried rounding all the observations in the data frame to one decimal place, I've used the length() function on each variable in the data frame to make sure they're all the same length, and before I made the model I used na.omit() on the data frame to make sure no NAs were present. By the way, the original dataset can be found here: https://www.kaggle.com/srcole/burritos-in-san-diego. I cleaned it up a bit in Excel first, removing all the categorical variables that appeared after the "overall" column.
burritos <- read.csv("/Users/Jack/Desktop/R/STOR 565 R Projects/Burritos.csv")
burritos <- burritos[ ,-c(1,2,5)]
burritos <- na.exclude(burritos)
burritos <- round(burritos, 1)
library(leaps)
library(MASS)
yelp <- burritos$Yelp
google <- burritos$Google
cost <- burritos$Cost
hunger <- burritos$Hunger
tortilla <- burritos$Tortilla
temp <- burritos$Temp
meat <- burritos$Meat
filling <- burritos$Meat.filling
uniformity <- burritos$Uniformity
salsa <- burritos$Salsa
synergy <- burritos$Synergy
wrap <- burritos$Wrap
overall <- burritos$overall
variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]
test <- burritos[-variable, ]
null <- lm(cost ~ 1, data = train)
full <- regsubsets(cost ~ ., data = train) #This is where error occurs

Multiple Imputed datasets - pooling results

I have a dataset containing missing values. I have imputed this dataset, as follows:
library(mice)
id <- c(1,2,3,4,5,6,7,8,9,10)
group <- c(0,1,1,0,1,1,0,1,0,1)
measure_1 <- c(60,80,90,54,60,61,77,67,88,90)
measure_2 <- c(55,NA,88,55,70,62,78,66,65,92)
measure_3 <- c(58,88,85,56,68,62,89,62,70,99)
measure_4 <- c(64,80,78,92,NA,NA,87,65,67,96)
measure_5 <- c(64,85,80,65,74,69,90,65,70,99)
measure_6 <- c(70,NA,80,55,73,64,91,65,91,89)
dat <- data.frame(id, group, measure_1, measure_2, measure_3, measure_4, measure_5, measure_6)
dat$group <- as.factor(dat$group)
imp_anova <- mice(dat, maxit = 0)
meth <- imp_anova$method
pred <- imp_anova$predictorMatrix
imp_anova <- mice(dat, method = meth, predictorMatrix = pred, seed = 2018,
maxit = 10, m = 5)
This creates five imputed datasets. Then, I created the complete datasets (example dataset 1):
impute_1 <- mice::complete(imp_anova, 1) # complete set 1
And then I performed the desired analysis:
library(reshape)
library(reshape2)
datLong <- melt(impute_1, id = c("id", "group"), measure.vars = c("measure_1", "measure_2", "measure_3", "measure_4", "measure_5", "measure_6"))
colnames(datLong) <- c("ID", "Gender", "Time", "Value")
table(datLong$Time) # To check if correct
datLong$ID <- as.factor(datLong$ID)
library(ez)
model_mixed_1 <- ezANOVA(data = datLong,
dv = Value,
wid = ID,
within = Time,
between = Gender,
detailed = TRUE,
type = 3,
return_aov = TRUE)
I did this for all the five datasets, resulting in five models:
model_mixed_1
model_mixed_2
model_mixed_3
model_mixed_4
model_mixed_5
Now I want to combine the results of this models, to generate one results.
I have asked a similar question before, but there I focused on the models. Here I just want to ask how I can simply combine five models. Hope someone can help me!
You understood the basic multiple imputation process right. The process is like:
First your create your m imputed datasets. (mice() - function)
Then you do your analysis on each of these datasets. (with() - function)
In the end you combine these results together. (pool() - function)
This is a quite often misunderstand process (often people assume you have to combine your m imputed datasets together to one dataset - which is wrong)
Here is a picture of this process:
Now you have to follow these steps within the mice framework - you did this only till step 1.
Here an excerpt from the mice help:
The pool() function combines the estimates from m repeated complete data analyses. The typical sequence of steps to do a multiple imputation analysis is:
Impute the missing data by the mice function, resulting in a multiple imputed data set (class mids);
Fit the model of interest (scientific model) on each imputed data set by the with() function, resulting an object of class mira;
Pool the estimates from each model into a single set of estimates and standard errors, resulting is an object of class mipo;
Optionally, compare pooled estimates from different scientific models by the pool.compare() function.
Code wise this can look for example like this:
imp <- mice(nhanes, maxit = 2, m = 5)
fit <- with(data=imp,exp=lm(bmi~age+hyp+chl))
summary(pool(fit))

randomForest Categorical Predictor Limits

I understand and appreciate that R's randomForest function can only handle categorical predictors with less than 54 categories. However, when I trim my categorical predictor down to less than 54 categories, I still get the error. The only questions I've seen around categorical predictor limits on stackoverflow is how to get around this category limit, but I'm trying to trim my number of categories to follow the function's limitations and I am still get the error.
The following script creates a data frame so we can predict 'profession'. Understandably, I get the "Can not handle categorical predictors with more than 53 categories" error when trying to run randomForest() on 'df' due to the 'college_id' variable.
But when I trim my data set to only include the top 40 college IDs, I get the same error. Am I missing some basic data frame concept that retains all of the categories even though only 40 are now populated in the 'df2' data frame? What is a workaround option that I can use?
library(dplyr)
library(randomForest)
# create data frame
df <- data.frame(profession = sample(c("accountant", "lawyer", "dentist"), 10000, replace = TRUE),
zip = sample(c("32801", "32807", "32827", "32828"), 10000, replace = TRUE),
salary = sample(c(50000:150000), 10000, replace = TRUE),
college_id = as.factor(c(sample(c(1001:1040), 9200, replace = TRUE),
sample(c(1050:9999), 800, replace = TRUE))))
# results in error, as expected
rfm <- randomForest(profession ~ ., data = df)
# arrange college_ids by count and retain the top 40 in the 'df' data frame
sdf <- df %>%
dplyr::group_by(college_id) %>%
dplyr::summarise(n = n()) %>%
dplyr::arrange(desc(n))
sdf <- sdf[1:40, ]
df2 <- dplyr::inner_join(df, sdf, by = "college_id")
df2$n <- NULL
# confirm that df2 only contains 40 categories of 'college_id'
nrow(df2[which(!duplicated(df2$college_id)), ])
# THIS IS WHAT I WANT TO RUN, BUT STILL RESULTS IN ERROR
rfm2 <- randomForest(profession ~ ., data = df2)
I think you still had all the factor levels in your variable. Try adding this line before you fit the forest again:
df2$college_id <- factor(df2$college_id)

Multiple Imputation of longitudinal data in MICE and statistical analyses of object type mids

I have a problem with performing statistical analyses of longitudinal data after
the imputation of missing values using mice. After the imputation of missings in the wide
data-format I convert the extracted data to the longformat. Because of the longitudinal
data participants have duplicate rows (3 timepoints) and this causes problems when converting the long-formatted data set into a type mids object.
Does anyone know how to create a mids object or something else appropriate after the imputation? I want to use lmer,lme for pooled fixed effects afterwards.
I tried a lot of different things, but still cant figure it out.
Thanks in advance and see the code below:
# minimal reproducible example
## Make up some data
set.seed(2)
# ID Variable, Group, 3 Timepoints outcome measure (X1-X3)
Data <- data.frame(
ID = sort(sample(1:100)),
GROUP = sample(c(0, 1), 100, replace = TRUE),
matrix(sample(c(1:5,NA), 300, replace=T), ncol=3)
)
# install.packages("mice")
library(mice)
# Impute the data in wide format
m.out <- mice(Data, maxit = 5, m = 2, seed = 9, pred=quickpred(Data, mincor = 0.0, exclude = c("ID","GROUP"))) # ignore group here for easiness
# mids object?
is.mids(m.out) # TRUE
# Extract imputed data
imp_data <- complete(m.out, action = "long", include = TRUE)[, -2]
# Converting data into long format
# install.packages("reshape")
library(reshape)
imp_long <- melt(imp_data, id=c(".imp","ID","GROUP"))
# sort data
imp_long <- imp_long[order(imp_long$.imp, imp_long$ID, imp_long$GROUP),]
row.names(imp_long)<-NULL
# save as.mids
as.mids(imp_long,.imp=1, .id=2) # doesnt work
as.mids(imp_long) # doesnt work
Best,
Julian
I hope I can answer your question with this small example. I don't really see why conversion back to the mids class is necessary. Usually when I use mice I convert the imputed data to a list of completed datasets, then analyse that list using apply.
library(mice)
library(reshape)
library(lme4)
Data <- data.frame(
ID = sort(sample(1:100)),
GROUP = sample(c(0, 1), 100, replace = TRUE),
matrix(sample(c(1:5,NA), 300, replace=T), ncol=3)
)
# impute
m.out <- mice(Data, pred=quickpred(Data, mincor=0, exclude=c("ID","GROUP")))
# complete
imp.data <- as.list(1:5)
for(i in 1:5){
imp.data[[i]] <- complete(m.out, action=i)
}
# reshape
imp.data <- lapply(imp.data, melt, id=c("ID","GROUP"))
# analyse
imp.fit <- lapply(imp.data, FUN=function(x){
lmer(value ~ as.numeric(variable)+(1|ID), data=x)
})
imp.res <- sapply(imp.fit, fixef)
Keep in mind, however, that single-level imputation is not a good idea when you're interested in relationships of variables that vary at different levels.
For these tasks you should use procedures that maintain the two-level variation and do not suppress it as mice does in this configuration.
There are workarounds for mice, but for example Mplus and the pan package in R are specifically designed for two-level MI.
No sure how relevant my answer is since you have asked a question long time ago, but in any case... In this slide deck toward the end, on the slide titled "Method POST" the author uses function long2mids():
imp1 <- mice(boys)
long <- complete(imp1, "long", inc = TRUE)
long$whr <- with(long, wgt / (hgt / 100))
imp2 <- long2mids(long)
However, long2mids() has been deprecated in favor of as.mids() since version 2.22.
as.mids() from the miceadds package will work here

Resources