R Imputation With MICE - r

set.seed(1)
library(data.table)
data=data.table(STUDENT = 1:1000,
OUTCOME = sample(20:90, 1000, replace = TRUE),
X1 = runif(1000),
X2 = runif(1000),
X3 = runif(1000))
data[, X1 := fifelse(X1 > .9, NA_real_, X1)]
data[, X2 := fifelse(X2 > .78 & X2 < .9, NA_real_, X2)]
data[, X3 := fifelse(X3 < .1, NA_real_, X3)]
Say you have data as shown and you wish to impute values for X1, X2 and X3 while leaving STUDENT and OUTCOME out of the imputation process.
I can do
library(mice)
dataIMPUTE=mice(data[, c("X1", "X2", "X3")], m = 1)
but how do I combine the imputed values from dataIMPUTE with STUDENT and OUTCOME? I am afraid that I will merge them incorrectly, which is why I am asking for advice.

One possibility is to use the complete data set in the imputation, but change the predictorMatrix so that STUDENT and OUTCOME are not used in the imputation model.
First, run mice with maxit = 0 to extract the predictorMatrix (without calculating the imputation). Then set to 0 the columns of the variables that should not be used as predictors. All your variables are still contained in the dataIMPUTE object, so nothing has to be merged afterwards:
set.seed(1)
library(data.table)
data=data.table(STUDENT = 1:1000,
OUTCOME = sample(20:90, 1000, replace = TRUE),
X1 = runif(1000),
X2 = runif(1000),
X3 = runif(1000))
index_1 <- sample(1:1000, 100)
index_2 <- sample(1:1000, 100)
index_3 <- sample(1:1000, 100)
data[index_1, X1 := NA_real_]
data[index_2, X2 := NA_real_]
data[index_3, X3 := NA_real_]
library(mice)
init <- mice(data, maxit = 0, print = FALSE)
# extract the predictor matrix
pred_mat <- init$predictorMatrix
# remove STUDENT and OUTCOME as predictors
pred_mat[, c("STUDENT", "OUTCOME")] <- 0
# do the imputation
dataIMPUTE = mice(data, pred = pred_mat, m = 1)
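To address the merging worry directly, here is a minimal sketch of pulling the completed data out again. It assumes the predictor-matrix approach above; the names dat, imp and completed and the smaller sample size are only illustrative. Because the full data set goes into mice(), complete() returns every column in the original row order, so no merge is needed.

```r
library(mice)

set.seed(1)
n <- 200  # smaller than the original 1000, only to keep the sketch quick
dat <- data.frame(STUDENT = 1:n,
                  OUTCOME = sample(20:90, n, replace = TRUE),
                  X1 = runif(n), X2 = runif(n), X3 = runif(n))
dat$X1[sample(n, 20)] <- NA
dat$X2[sample(n, 20)] <- NA
dat$X3[sample(n, 20)] <- NA

# dry run to get the default predictor matrix, then switch off
# STUDENT and OUTCOME as predictors (columns = predictors)
init <- mice(dat, maxit = 0, printFlag = FALSE)
pred_mat <- init$predictorMatrix
pred_mat[, c("STUDENT", "OUTCOME")] <- 0

imp <- mice(dat, predictorMatrix = pred_mat, m = 1, printFlag = FALSE, seed = 1)

# first (and here only) completed data set, all five columns included
completed <- mice::complete(imp, 1)
head(completed)
```

STUDENT and OUTCOME have no missing values, so they pass through unchanged and stay aligned with the imputed X1, X2 and X3.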

Related

subset column names from df to plot specific coefficients from regression R

I am trying to plot only a select few coefficients from a regression, but my regression has 100s of variables, so I'm trying to think of a way to extract the coefficients using better coding.
I have this regression below:
n <- 1e3
d <- data.frame(
# Covariates
x1 = rnorm(n),
x2 = rnorm(n),
x3 = rnorm(n),
x4 = rnorm(n),
# Individuals and firms
id = factor(sample(20, n, replace=TRUE)),
firm = factor(sample(13, n, replace=TRUE)),
# Noise
u = rnorm(n)
)
id.eff <- rnorm(nlevels(d$id))
firm.eff <- rnorm(nlevels(d$firm))
d$y <- d$x1 + 0.5*d$x2 + id.eff[d$id] + firm.eff[d$firm] + d$u
est <- lfe::felm(y ~ x1 + x2 + x3 + x4 | id + firm, data = d)
But I only want to plot x1 & x2, so I extract the column names I want:
d_names = names(d)
N <- -4
d_cleaned_names <- head(d_names, -N)
d_cleaned_names_filter <- d_cleaned_names[1:2]
I then add quotation marks and paste it into one continuous character:
d_cleaned_names_filter_quote <- shQuote(d_cleaned_names_filter)
d_cleaned_filter_quote_names = paste(d_cleaned_names_filter_quote, collapse = ',' )
and when I plot this:
jtools::plot_coefs(est,plot.distributions = TRUE, inner_ci_level = .9, coefs = c(d_cleaned_names_filter_quote))
I get the error:
Error in if (rescale.distributions == FALSE && max(heights)/min(heights) > :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In max(heights) : no non-missing arguments to max; returning -Inf
2: In min(heights) : no non-missing arguments to min; returning Inf
but this works perfectly writing 'x1' & 'x2' out manually. However, this is not that feasible for a large dataset:
jtools::plot_coefs(est,plot.distributions = TRUE, inner_ci_level = .9, coefs = c('x1', 'x2'))
Any advice would be very welcome.
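A likely cause, offered as a hedged sketch rather than a confirmed diagnosis: shQuote() adds literal quote characters to each string, so the quoted names no longer equal the coefficient names, and (assuming plot_coefs() matches coefs against those names) nothing is selected. The vectors coef_names and wanted below are illustrative stand-ins.

```r
# coefficient names as they would appear in the model
coef_names <- c("x1", "x2", "x3", "x4")

# a plain character vector is already what c('x1', 'x2') produces
wanted <- coef_names[1:2]

# shQuote() wraps each element in literal quote characters
quoted <- shQuote(wanted)

wanted %in% coef_names   # both match
quoted %in% coef_names   # neither matches
nchar(quoted) - nchar(wanted)  # two extra characters per element
```

If that is the issue, passing the unquoted vector (coefs = wanted) should behave like typing the names by hand.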

Converting a Nested For Loop into `sapply()` in R

I have been trying to create a series of coplots using a nested for loop, but the loop takes too long to run (the original data set is very big). I have looked at similar questions and they suggest using the sapply function, but I am still unclear about how to convert between the two. I understand I need to create a plotting function to use (see below), but what I don't understand is how the i's and j's of the nested for loop translate into sapply arguments.
Below are some sample data, the nested for loop that I have been using, and the plotting function I created. Could someone walk me through how to convert my nested for loop into sapply arguments? I have been doing all of this in R. Many thanks.
y = rnorm(n = 200, mean = 10, sd = 2)
x1 = rnorm(n = 200, mean = 5, sd = 2)
x2 = rnorm(n = 200, mean = 2.5, sd = 2)
x3 = rep(letters[1:4], each = 50)
x4 = rep(LETTERS[1:8], each = 25)
dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
for(i in dat[, 2:3]){
for(j in dat[, 4:5]){
coplot(y ~ i | j, rows = 1, data = dat)
}
}
coplot_fun = function(data, x, y, z, na.rm = TRUE){
# x, y and z are column names given as strings
coplot(as.formula(paste(y, "~", x, "|", z)), data = data, rows = 1)
}
I think you might be able to use mapply here rather than sapply. mapply is similar to sapply but iterates over two (or more) inputs in parallel instead of one.
y = rnorm(n = 200, mean = 10, sd = 2)
x1 = rnorm(n = 200, mean = 5, sd = 2)
x2 = rnorm(n = 200, mean = 2.5, sd = 2)
x3 = rep(letters[1:4], each = 50)
x4 = rep(LETTERS[1:8], each = 25)
dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
for(i in dat[, 2:3]){
for(j in dat[, 4:5]){
coplot(y ~ i | j, rows = 1, data = dat)
}
}
mapply(function(x,j){coplot(dat[["y"]]~x|j,rows =1)}, dat[,2:3],dat[,4:5])
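One caveat, sketched below under the assumption that all four x/conditioning combinations are wanted (as in the nested loop): mapply pairs its inputs element by element, so two x columns and two conditioning columns give two calls, not four. Crossing the inputs first, for example with expand.grid, restores all combinations. The sketch only builds the formula strings so the difference is easy to see; x_vars, z_vars and make_formula are illustrative names.

```r
x_vars <- c("x1", "x2")
z_vars <- c("x3", "x4")

make_formula <- function(x, z) paste("y ~", x, "|", z)

# element-wise pairing: (x1, x3) and (x2, x4) only
paired <- unname(mapply(make_formula, x_vars, z_vars))

# all combinations, matching what the nested for loop visits
grid <- expand.grid(x = x_vars, z = z_vars, stringsAsFactors = FALSE)
crossed <- unname(mapply(make_formula, grid$x, grid$z))

paired
crossed
```

Each string in crossed can then be turned into a plot with coplot(as.formula(f), data = dat, rows = 1).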
We can use a combination of functions expand.grid, formula and apply to accept character column names into coplot.
# combinations of column names for plotting
vars <- expand.grid(y = "y", x = c("x1", "x2"), z = c("x3", "x4"))
# cycle through column name variations, construct formula for each combination
apply(vars, MARGIN = 1,
FUN = function(x) coplot(
formula = formula(paste(x[1], "~", x[2], "|", x[3])),
data = dat, rows = 1
)
)
Here's a tidyverse version of @nya's solution with expand.grid() and apply(). Each row in ds_plot_parameters represents a single plot. The equation variable is the string eventually passed to coplot().
Each equation is passed to purrr::walk(), which then calls coplot() to produce one graph each. as.formula() converts the string to a formula.
ds_plot_parameters <-
tidyr::expand_grid(
v = c("x1", "x2"),
w = c("x3", "x4")
) |>
dplyr::mutate(
equation = paste0("y ~ ", v, " | ", w),
)
ds_plot_parameters$equation |>
purrr::walk(
\(e) coplot(as.formula(e), rows = 1, data = dat)
)
Gravy:
If you want to pass more inputs to the graph, expand ds_plot_parameters to include other things like graph and axis titles.
ds_plot_parameters <-
tidyr::expand_grid(
v = c("x1", "x2"),
w = c("x3", "x4")
) |>
dplyr::mutate(
equation = paste0("y ~ ", v, " | ", w),
label_y = "Outcome (mL)",
label_x = paste(v, "(log 10)")
)
ds_plot_parameters |>
dplyr::select(
# Make sure this order exactly matches the function signature
equation,
label_x,
label_y,
) |>
purrr::pwalk(
.f = \(equation, label_x, label_y) {
coplot(
formula = as.formula(equation),
xlab = label_x,
ylab = label_y,
rows = 1,
data = dat
)
}
)
ds_plot_parameters
# # A tibble: 4 x 5
# v w equation label_y label_x
# <chr> <chr> <chr> <chr> <chr>
# 1 x1 x3 y ~ x1 | x3 Outcome (mL) x1 (log 10)
# 2 x1 x4 y ~ x1 | x4 Outcome (mL) x1 (log 10)
# 3 x2 x3 y ~ x2 | x3 Outcome (mL) x2 (log 10)
# 4 x2 x4 y ~ x2 | x4 Outcome (mL) x2 (log 10)

different variable imputed values using same predictor variables mice R

I would expect the imputed values of x to be the same if the same predictor variables are used, regardless of whether other variables are imputed, but that is not the case, as reproduced here:
library(data.table)
library(robustlmm)
library(mice)
library(miceadds)
library(magrittr)
library(dplyr)
library(tidyr)
set.seed(1)
# Data ------------------------------------
dt1 <- data.table(id = rep(1:10, each=3),
group = rep(1:2, each=15),
time = rep(1:3, 10),
sex = rep(sample(c("F","M"),10,replace=T), each=3),
x = rnorm(30),
y = rnorm(30),
z = rnorm(30))
setDT(dt1)[id %in% sample(1:10,4) & time == 2, `:=` (x = NA, y = NA)][
id %in% sample(1:10,4) & time == 3, `:=` (x = NA, y = NA)]
dt2 <- dt1 %>% group_by(id) %>% fill(y) %>% ungroup %>% as.data.table
# MI 1 ------------------------------------
pm1 <- make.predictorMatrix(dt1)
pm1['x',c('y','z')] <- 0
pm1[c('x','y'), 'id'] <- -2
imp1 <- mice(dt1, pred = pm1, meth = "2l.pmm", seed = 1, m = 2, print = F, maxit = 20)
# boundary (singular) fit: see ?isSingular - I don't know how to interpret this (it doesn't occur with my real data)
View(complete(imp1, 'long'))
# MI 2 ------------------------------------
pm2 <- make.predictorMatrix(dt2)
pm2['x',c('y','z')] <- 0
pm2['x', 'id'] <- -2
imp2 <- mice(dt2, pred = pm2, meth = "2l.pmm", seed = 1, m = 2, print = F, maxit = 20, remove.constant = F)
# imp2$loggedEvents reports sex as constant (I don't know why), so I include remove.constant = F to keep that variable (this doesn't occur with my real data)
View(complete(imp2, 'long'))
In imp1:
group, time and sex are used to predict x
group, time, sex, x and z are used to predict y
In imp2:
group, time and sex are used to predict x
y is complete so no imputation is performed for this variable
Given this, why are the imputed values of x different?
Is this the expected behavior?
Thank you!

Using a loop to create table with results of ICC in r

I created a loop to calculate the icc between two raters.
For each rater (R1, R2) I have a data frame of the 75 variables in columns and 125 observations.
library(irr)
for (i in 1:75) {
icc <- icc(cbind.data.frame(R1[,i],R2[,i]), model="twoway", type="agreement",
unit="single")
print(icc)
}
icc is returned as a list of results for each variable.
I tried to integrate into the loop a function that generates a data frame from the elements of icc that interest me (the value and the lower and upper bounds of the 95% confidence interval), but it returns separate tables in different ways:
With this first attempt it returns 75 data frames of only one row each, even though I used an rbind command
for (i in 1:75) {
icc <- icc(cbind.data.frame(R1[,i],R2[,i]), model="twoway", type="agreement",
unit="single")
print(rbind.data.frame(cbind.data.frame(icc$value,icc$lbound,icc$ubound)))
}
In the second case it returns 75 different data frames, each filled with the icc elements of one variable.
for (i in 1:75) {
icc <- icc(cbind.data.frame(R1[,i],R2[,i]), model="twoway", type="agreement",
unit="single")
name_lines_are_variables <- names(L1)
name_columns <- c("ICC","Low CI 95%","Up CI 95%")
tab <- matrix(c(icc$value,icc$conf.level),nrow=38,ncol=2)
dimnames(tab) <- list(name_lines_are_variables,name_columns)
print(tab)
}
I appreciate your help
If I've understood your post correctly, then the problem with your code is that the results from the icc() function are not being accumulated.
You can solve this problem by declaring an empty data.frame before the for loop, and then using rbind() to append the latest results to the existing results in this data.frame.
Please refer to the code below for an implementation (refer to the comments for clarifications):
rm(list = ls())
#Packages
library(irr)
#Dummy data
R1 <- data.frame(matrix(sample(1:100, 75*125, replace = TRUE), nrow = 75, ncol = 125))
R2 <- data.frame(matrix(sample(1:100, 75*125, replace = TRUE), nrow = 75, ncol = 125))
#Data frame that will accumulate the ICC results
#Initialized with zero rows (but has named columns)
my_icc <- data.frame(R1_col = character(), R2_col = character(),
icc_val = double(), icc_lb = double(),
icc_ub = double(), icc_conflvl = double(),
icc_pval = double(),
stringsAsFactors = FALSE)
#For loop
#Iterates through each COLUMN in R1 and R2
#And calculates ICC values with these as inputs
#Each R1[, i]-R2[, j] combination's results are stored
#as a row each in the my_icc data frame initialized above
for (i in 1:ncol(R1)){
for (j in 1:ncol(R2)){
#tmpdat is just a temporary variable to hold the current calculation's data
tmpdat <- irr::icc(cbind.data.frame(R1[, i], R2[, j]), model = "twoway", type = "agreement", unit = "single")
#Results from the current calculation being appended to the my_icc data frame
my_icc <- rbind(my_icc,
data.frame(R1_col = colnames(R1)[i], R2_col = colnames(R2)[j],
icc_val = tmpdat$value, icc_lb = tmpdat$lbound,
icc_ub = tmpdat$ubound, icc_conflvl = tmpdat$conf.level,
icc_pval = tmpdat$p.value,
stringsAsFactors = FALSE))
}
}
head(my_icc)
# R1_col R2_col icc_val icc_lb icc_ub icc_conflvl icc_pval
# 1 X1 X1 0.14109954 -0.09028373 0.3570681 0.95 0.1147396
# 2 X1 X2 0.07171398 -0.15100798 0.2893685 0.95 0.2646890
# 3 X1 X3 -0.02357068 -0.25117399 0.2052619 0.95 0.5791774
# 4 X1 X4 0.07881817 -0.15179084 0.3004977 0.95 0.2511141
# 5 X1 X5 -0.12332146 -0.34387645 0.1083129 0.95 0.8521741
# 6 X1 X6 -0.17319598 -0.38833452 0.0578834 0.95 0.9297514
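As an alternative to growing the data frame with rbind() inside the loop, the per-variable rows can be built with lapply() and bound once at the end. The sketch below is only an illustration of that accumulation pattern: it uses base R's cor.test() as a stand-in statistic so that it runs without extra packages, and agreement_row is a hypothetical helper name. Swapping in irr::icc() and its value, lbound and ubound elements should follow the same shape.

```r
set.seed(42)
# dummy data: 125 observations of 5 variables per rater
R1 <- data.frame(matrix(rnorm(125 * 5), nrow = 125, ncol = 5))
R2 <- data.frame(matrix(rnorm(125 * 5), nrow = 125, ncol = 5))

# one-row data frame for variable i (stand-in statistic: Pearson correlation)
agreement_row <- function(i) {
  res <- cor.test(R1[[i]], R2[[i]])
  data.frame(variable = colnames(R1)[i],
             estimate = unname(res$estimate),
             lower = res$conf.int[1],
             upper = res$conf.int[2],
             stringsAsFactors = FALSE)
}

# build all rows first, then bind them in a single step
results <- do.call(rbind, lapply(seq_len(ncol(R1)), agreement_row))
results
```

Binding once avoids copying the accumulated data frame on every iteration, which matters more as the number of variables grows.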
Thank you very much for your help, @Dunois. I only had to use the same index in the for() loop, because I need to compare the same variable's column for each rater, so the final code is:
library(irr)
R1 <- data.frame(matrix(sample(1:100, 75*125, replace = TRUE), nrow = 75, ncol = 125))
R2 <- data.frame(matrix(sample(1:100, 75*125, replace = TRUE), nrow = 75, ncol = 125))
my_icc <- data.frame(R1_col = character(), R2_col = character(),
icc_val = double(), icc_lb = double(),
icc_ub = double(), icc_conflvl = double(),
icc_pval = double(),
stringsAsFactors = FALSE)
for (i in 1:ncol(R1)){
tmpdat <- irr::icc(cbind.data.frame(R1[, i], R2[, i]), model = "twoway", type = "agreement", unit = "single")
my_icc <- rbind(my_icc,
data.frame(R1_col = colnames(R1)[i], R2_col = colnames(R2)[i],
icc_val = tmpdat$value, icc_lb = tmpdat$lbound,
icc_ub = tmpdat$ubound, icc_conflvl = tmpdat$conf.level,
icc_pval = tmpdat$p.value,
stringsAsFactors = FALSE))
}
head(my_icc)
#R1_col R2_col icc_val icc_lb icc_ub icc_conflvl icc_pval
#1 X1 X1 0.116928667 -0.1147526 0.33551788 0.95 0.1601141
#2 X2 X2 0.006627921 -0.2200660 0.23238172 0.95 0.4773967
#3 X3 X3 -0.184898902 -0.3980084 0.04542289 0.95 0.9427605
#4 X4 X4 0.066504226 -0.1646006 0.28963006 0.95 0.2862440
#5 X5 X5 -0.035662755 -0.2603757 0.19227801 0.95 0.6196883
#6 X6 X6 -0.055329309 -0.2808315 0.17466685 0.95 0.6805675

Loop through various data subsets in lm() in R

I would like to loop over various regressions referencing different data subsets, however I'm unable to appropriately call different subsets. For example:
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
x.list <- list(dat$x1,dat$x2,dat$x3)
dat1 <- dat[-9,]
fit <- list()
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat))}
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat1))}
Is there a way to call in "dat1" such that it subsets the other variables accordingly? Thanks for any recs.
I'm not sure it makes sense to copy your covariates into a new list like that. Here's a way to loop over columns and to dynamically build formulas
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
#x.list not used
fit <- list()
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat))}
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat1))}
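Note that the two loops above write into the same fit list, so the second run replaces the results of the first. A minimal sketch that keeps both sets of models, assuming one list per data set is acceptable (datasets, predictors and fits are illustrative names):

```r
set.seed(1)
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

# named list of the data sets to fit on
datasets <- list(full = dat, drop9 = dat[-9, ])
predictors <- c("x1", "x2", "x3")

# outer lapply over data sets, inner lapply over predictors
fits <- lapply(datasets, function(d) {
  models <- lapply(predictors, function(p) {
    lm(reformulate(p, response = "y"), data = d)
  })
  setNames(models, predictors)
})

nobs(fits$full$x1)   # 10
nobs(fits$drop9$x1)  # 9
```

Because each formula refers to columns by name, the data argument decides which rows are used, which is what makes the subset take effect.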
How about this?
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
mods <- lapply(list(y ~ x1, y ~ x2, y ~ x3), lm, data = dat1)
If you have lots of predictors, create the formulas something like this:
lapply(paste('y ~ ', 'x', 1:10, sep = ''), as.formula)
If your data were in long format, it would be similarly simple to use lapply on a split data.frame.
dat <- data.frame(y = rnorm(30), x = rnorm(30), f = rep(1:3, each = 10))
lapply(split(dat, dat$f), function(x) lm(y ~ x, data = x))
Sorry for being late, but have you tried applying a data.table solution similar to the one in:
R data.table loop subset by factor and do lm()
I have just applied the linked solution to a modified version of your data, which should illustrate how I understood your question:
library(data.table)
set.seed(1)
df <- data.frame(x1 = letters[1:3],
x2 = sample(c("a","b","c"), 30, replace = TRUE),
x3 = sample(c(20:50), 30, replace = TRUE),
y = sample(c(20:50), 30, replace = TRUE))
dt <- data.table(df,key="x1")
fits <- lapply(unique(dt$x1),
function(z)lm(y~x2+x3, data=dt[J(z),], y=T))
fit <- dt[, lm(y ~ x2 + x3)]
# Using x1 as a "by" variable you get a model per level of x1
coef_tbl <- dt[, as.list(coef(lm(y ~ x2 + x3))), by=x1]
# coefficients
sapply(fits,coef)
anova_tbl = dt[, as.list(anova(lm(y ~ x2 + x3))), by=x1]
row_names = dt[, row.names(anova(lm(y ~ x2 + x3))), by=x1]
anova_tbl[, variable := row_names$V1]
It extends your solution.
