Stratify then impute in R - using mi()

I want to "stratify-then-impute" using the packages available in R.
That is, I am hoping to:
1) stratify my dataset using a binary variable called "arm". This variable has no missing data.
2) run an imputation model for the two subsets
3) combine the two imputed data sets
4) run a pooled analysis.
My dataset looks like:
dataSim <- structure(list(pid = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), arm = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), X1 = c(0.1, NA, 0.51,
0.56, -0.82, NA, NA, NA, -0.32, 0.4, 0.58, NA, 0.22, -0.23, 1.49,
-1.88, -1.77, -0.94, NA, -1.34), X2 = c(NA, -0.13, NA, 1.2, NA,
NA, NA, 0.02, -0.04, NA, NA, 0.25, -0.81, -1.67, 1.01, 1.69,
-0.06, 0.07, NA, -0.11)), .Names = c("pid", "arm", "X1", "X2"
), row.names = c(NA, 20L), class = "data.frame")
To impute the data, I'm currently using the mi() function as follows:
library(mi)
data.1 <- dataSim[dataSim[,"arm"]==1,]
data.0 <- dataSim[dataSim[,"arm"]==0,]
data.miss.1 <- missing_data.frame(data.1)
data.miss.0 <- missing_data.frame(data.0)
imputations.1 <- mi(data.1, n.iter=5, n.chains=5, max.minutes=20, parallel=FALSE)
imputations.0 <- mi(data.0, n.iter=5, n.chains=5, max.minutes=20, parallel=FALSE)
complete(imputations.1) # viewing the imputed datasets
complete(imputations.0)
Then I don't know how to combine the 2 imputations in order to do a pooled analysis. I have unsuccessfully tried:
imputations <- rbind(imputations.0, imputations.1) # This doesn't work
# analysis.X1 <- pool(X1 ~ arm, data = imputations ) # This is what I want to run
I assume this method is a simplified version of including an interaction term when imputing, but I don't know how this is possible either.
Thanks
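One possible way to pool after imputing each arm separately is to stack the k-th completed data set from arm 0 on top of the k-th completed data set from arm 1, fit the analysis model on each stacked set, and combine the estimates with Rubin's rules by hand. The sketch below is base R only. `impute_once()` is a hypothetical stand-in (a crude random draw from the observed values) used purely so the example runs without mi(); in practice each completed data set would come from the imputation package.

```r
set.seed(1)

# arm and X1 copied from dataSim in the question
toy <- data.frame(
  arm = rep(c(0, 1), each = 10),
  X1  = c(0.1, NA, 0.51, 0.56, -0.82, NA, NA, NA, -0.32, 0.4,
          0.58, NA, 0.22, -0.23, 1.49, -1.88, -1.77, -0.94, NA, -1.34)
)

# Hypothetical stand-in imputer: fill each NA with a randomly drawn observed
# value from the same subset. This is NOT what mi() does.
impute_once <- function(d) {
  miss <- is.na(d$X1)
  d$X1[miss] <- sample(d$X1[!miss], sum(miss), replace = TRUE)
  d
}

m <- 5  # number of completed data sets

# Impute each arm separately, then stack the two arms for every k
completed <- lapply(seq_len(m), function(k) {
  rbind(impute_once(toy[toy$arm == 0, ]),
        impute_once(toy[toy$arm == 1, ]))
})

# Fit the analysis model on every stacked data set
fits <- lapply(completed, function(d) lm(X1 ~ arm, data = d))
est  <- sapply(fits, function(f) coef(f)["arm"])
se2  <- sapply(fits, function(f) vcov(f)["arm", "arm"])

# Rubin's rules: pooled estimate, within, between and total variance
pooled_est <- mean(est)
within_var <- mean(se2)
betw_var   <- var(est)
total_var  <- within_var + (1 + 1 / m) * betw_var
pooled_se  <- sqrt(total_var)
```

The stacking step is the part that `rbind(imputations.0, imputations.1)` was trying to do; it has to happen on completed data frames rather than on the imputation objects themselves.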

Related

Function within loop with if statement in R

I am not familiar with if statements, loops, or functions in R. I have a dataset where I want to adjust a variable (N) for the clustering of the study, using the formula N/(1 + (M - 1) * ICC), where N is the number of subjects, M is the size of the cluster, and ICC is the intra-class correlation coefficient. I have all these variables in separate columns, with each row identifying a different study/sample size. Not all the studies have a clustering issue, so I need to apply this function only to the subset of those with an ICC. I thought about something like this, but I know it is missing something, and I also don't know if a loop with an if statement is the most efficient way to go.
for (i in df$N) { # for every sample size in df$N
if (df$ICC[i] != .) { # if ICC is not missing
df$N/(1 + (df$M - 1) * df$ICC) # adjust the sample size: divide N by 1 + (cluster size - 1) * ICC of the study
} else {
df$N/1 #otherwise (ICC is missing) do nothing: ie., divide the N by 1.
}
}
Do you know how I could do this with something like this? Other solutions are also welcome! Thanks for any help or suggestion about this!
Here's an example of the dataset:
dput(head(df, 10))
structure(list(ID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5), ArmsID = c(0,
1, 0, 1, 0, 1, 0, 1, 0, 1), N = c(26, 34, 28, 27, 50, 52, 60,
65, 150, 152), Mean = c(10.1599998474121, 5.59999990463257, 8,
8.52999973297119, 17, 15.1700000762939, 48.0999984741211, 49,
57, 55.1315803527832), SD = c(6.30000019073486, 4.30000019073486,
5.6, 6.61999988555908, 6, 7.75, 10.1599998474121, 12, 11, 10.5495901107788
), SE = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), ICC = c(0.03,
0.02, NA, NA, 0.01, 0.003, NA, NA, NA, NA), M = c(5, 5, NA, NA,
17, 16, NA, NA, NA, NA)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The . was meant to indicate missing data (NA). I want to apply the function that adjusts N only to the rows that have an ICC.
idx <- which(!is.na(df$ICC))
df$N[idx] <- df$N[idx]/(1 + (df$M[idx] - 1) * df$ICC[idx])
This code works correctly, thanks!
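An equivalent vectorized form uses ifelse(), which can be handy when the adjusted value should go into a new column instead of overwriting N. A minimal sketch on the first six rows of the example data; the column name `N_adj` is an assumption made for illustration.

```r
# First six rows of N, ICC and M from the example data
df <- data.frame(
  N   = c(26, 34, 28, 27, 50, 52),
  ICC = c(0.03, 0.02, NA, NA, 0.01, 0.003),
  M   = c(5, 5, NA, NA, 17, 16)
)

# Adjust N where an ICC is available, otherwise keep N unchanged
has_icc  <- !is.na(df$ICC)
df$N_adj <- ifelse(has_icc, df$N / (1 + (df$M - 1) * df$ICC), df$N)
```

Rows without an ICC keep their original N, so no NA values are introduced by the adjustment.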

Randomly sampling iteratively from a data frame

I have the following dataframe that simulates a panel data set (i.e., multiple waves per unit).
dat <- structure(list(x = c(-0.32, -0.26, 0.05, -0.37, -0.37, -0.08,
-0.01, 0.05, 0.19, -0.48, 0.37, 0.05, -0.58, -0.18, -0.04, -0.28,
-0.44, -0.48, 1.05, 0.62, 0.85, 0.42, 0.7, 0.64, -0.19, -0.11,
-0.65, -0.01, 0.39, -0.02, -0.23, -0.6, -0.1, 0.39, 0.33, 0.39,
-0.09, -0.16, 0.26, -0.62, -0.44, -0.6, -0.17, -0.27, -0.12,
-0.53, -0.38, -0.33, -0.17, -0.11, -0.25, -0.92, -0.6, -0.81,
0.75, 0.52, 0.57, 1.32, 1.21, 1.21), y = c(-0.42, -2.01, -1.19,
0.7, 1.28, 1.37, 0.52, 2.04, 2.34, -1.45, 2.84, 0.1, -3.12, 0.22,
-0.06, -1.65, -0.9, -1.5, -0.98, -0.69, 0.15, 1.7, 1.47, 0.15,
0.26, 0.84, 0.35, 0.86, -1.23, -0.74, -1.79, -0.56, -2.15, 2.11,
2.34, 0.57, 0.38, 0.57, 0.97, 0.32, -1.71, -0.8, 1.45, -0.12,
1.93, 2.76, 0.08, -2.8, -0.06, 1.09, -0.4, 0.41, 0.02, -1.61,
1.75, 1.6, -0.19, 0.13, -0.89, -1.1), unit = c(1, 1, 1, 2, 2,
2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9,
9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14,
15, 15, 15, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 20,
20, 20), wave = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -60L), groups = structure(list(unit = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20), .rows = structure(list(1:3, 4:6, 7:9, 10:12, 13:15, 16:18,
19:21, 22:24, 25:27, 28:30, 31:33, 34:36, 37:39, 40:42, 43:45,
46:48, 49:51, 52:54, 55:57, 58:60), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), .drop = TRUE))
I now want to simulate attrition into this data set: Some units drop out with a certain probability in wave 2; some of those who remain drop out in wave 3; and so forth until wave n. The probability remains the same at each step. Note that this approach should be flexible regarding the number of waves.
Here's what I came up with. Though it works, it feels slow to me. However, due to the varying number of waves, I am not sure how to avoid the loop.
# number of units and number of observations per unit:
n = 20
n_perunit = 3
# define attrition probability:
attrition = 2/3
# Start with a vector of all units
remaining <- 1:n
# loop through waves beginning with 2
system.time(for (i in 2:n_perunit) {
n_remaining <- round(length(remaining)*attrition)
remaining <- sample(remaining, n_remaining)
dat <- dat %>%
mutate(drop = ifelse(
wave >= i & !(unit %in% remaining), TRUE, FALSE)) %>%
filter(drop == FALSE) %>%
mutate(drop = NULL)
})
Efficiency:
user system elapsed
0.016 0.000 0.016
Any ideas how to improve this?
EDIT:
Based on @jpsmith's answer (which, as far as I can see, does not work for groups in which no one drops out, because min(which(dropout == "Yes")) will return Inf for those), I came up with the following:
set.seed(1234)
system.time(if (!is.null(attrition)) {
# assign a 1 or 0 indicating dropout
dat <- dat %>%
mutate(dropout = ifelse(
wave > 1, sample(
0:1, n(), prob = c(attrition, 1-attrition), replace = TRUE), 0))
# first get the first (minimum) dropout in each unit...
dat <- dat %>%
group_by(unit) %>%
mutate(min = ifelse(
length(which(dropout == 1) > 0), min(which(dropout == 1)), n_perunit)) %>%
# ... then slice out rows up to that row
slice(1:min) %>%
# as this also includes the first dropout rows, drop that one
filter(dropout == 0)
})
Efficiency:
user system elapsed
0.01 0.00 0.01
However, slice produces some annoying warnings. Any idea why?
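A likely cause of the warnings: inside slice(), `min` refers to the column created by mutate(), which holds the same value on every row of the group, so `1:min` hands a vector of length greater than one to the `:` operator. The base R sketch below reproduces that behaviour in isolation; `first_drop` is a hypothetical stand-in for the per-group `min` column.

```r
# Stand-in for the per-group `min` column: one value repeated on every row
first_drop <- c(2, 2, 2)

# `:` uses only the first element of its right-hand side and warns
msg <- tryCatch(1:first_drop, warning = function(w) conditionMessage(w))

# Reducing the column to a single value first avoids the warning
rows <- seq_len(first_drop[1])
```

Under that reading, something like `slice(seq_len(min[1]))` in the pipeline should give the same rows without the warnings.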
Perhaps I'm wrong, but in effect the attrition is iid after the first wave: each subsequent wave has a probability of dropout - so if you made it to wave 3 then that probability is not conditioned on anything (akin to the probability of flipping a third heads if the first two were heads). If I am reading this correctly, you could assign the dropout simultaneously across waves > 1 and then drop all observations after the first "dropout". This would vectorize everything and be faster.
Code
set.seed(123)
attrition <- 2/3
# Assign "dropout" position
dat$dropout <- ifelse(dat$wave > 1, sample(c("Yes","No"), prob = c(attrition, 1-attrition)), "No")
# Drop all observations after first dropout recorded
dat %>% group_by(unit) %>% slice(seq_len(min(which(dropout == "Yes") - 1)))
Output:
# Groups: unit [20]
# x y unit wave dropout
# <dbl> <dbl> <dbl> <dbl> <chr>
# 1 -0.32 -0.42 1 1 No
# 2 -0.26 -2.01 1 2 No
# 3 -0.37 0.7 2 1 No
# 4 -0.01 0.52 3 1 No
# 5 0.05 2.04 3 2 No
# 6 -0.48 -1.45 4 1 No
# 7 -0.58 -3.12 5 1 No
# 8 -0.18 0.22 5 2 No
# 9 -0.28 -1.65 6 1 No
# 10 1.05 -0.98 7 1 No
# # … with 20 more rows
Since you didn't set a seed or provide a desired output dataset, I can't compare, but I'm happy to test this if you provide one.
user system elapsed
0.008 0.001 0.009
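The same idea can be sketched in base R with one independent draw per row and a guard for units that never drop out, which is the case raised in the question's edit. The data frame below is a toy panel with the same unit/wave layout as `dat`, and `p_drop` is a hypothetical per-wave dropout probability chosen only for illustration.

```r
set.seed(123)

# Toy panel: 20 units observed over 3 waves
panel <- data.frame(unit = rep(1:20, each = 3), wave = rep(1:3, times = 20))

p_drop <- 1 / 3  # hypothetical per-wave dropout probability

# One draw per row; wave 1 is always observed
panel$dropout <- panel$wave > 1 & runif(nrow(panel)) < p_drop

# First wave with a dropout in each unit; Inf if the unit never drops out
first_drop <- ave(ifelse(panel$dropout, panel$wave, Inf), panel$unit, FUN = min)

# Keep only the waves before a unit's first dropout
kept <- panel[panel$wave < first_drop, ]
```

Because units without any dropout get `Inf` as their first dropout wave, the comparison `wave < first_drop` keeps all of their rows instead of failing.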
Since the number of units remaining after each wave is deterministic, we can do the sampling all in one go.
library(dplyr)
set.seed(5)
n <- 20
n_perunit <- 3
# define attrition probability:
attrition <- 2/3
# Start with a vector of all units
remaining <- 1:n
# loop through waves beginning with 2
fOriginal <- function(df, remaining) {
for (i in 2:n_perunit) {
n_remaining <- round(length(remaining)*attrition)
remaining <- sample(remaining, n_remaining)
df <- df %>%
mutate(drop = ifelse(
wave >= i & !(unit %in% remaining), TRUE, FALSE)) %>%
filter(drop == FALSE) %>%
mutate(drop = NULL)
}
df
}
fNew <- function(df) {
nleft <- numeric(n_perunit + 1)
nleft[1] <- n
for (i in 2:n_perunit) nleft[i] <- round(nleft[i - 1]*attrition)
df[df$wave <= sample(rep.int(1:n_perunit, -diff(nleft)))[df$unit],]
}
dfOrig <- fOriginal(dat, remaining)
dfNew <- fNew(dat)
# the resulting data.frames are not identical due to different random sampling
# methods, but they both have the same number of rows and same wave counts
identical(tabulate(dfOrig$wave), tabulate(dfNew$wave))
#> [1] TRUE
microbenchmark::microbenchmark(fOriginal = fOriginal(dat, remaining),
fNew = fNew(dat))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> fOriginal 12.0433 13.24815 14.52889 14.02410 15.0525 23.5338 100
#> fNew 1.2956 1.41915 1.73176 1.56935 1.7398 5.0738 100
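To make the counting step inside fNew() concrete, the lines below repeat only the computation of `nleft` with the values used above and derive how many units have each wave as their last one. This is the answer's own logic pulled out for inspection; `last_wave_counts` is a name introduced here for illustration.

```r
n <- 20
n_perunit <- 3
attrition <- 2 / 3

# Units still present at waves 1..n_perunit, followed by 0 after the last wave
nleft <- numeric(n_perunit + 1)
nleft[1] <- n
for (i in 2:n_perunit) nleft[i] <- round(nleft[i - 1] * attrition)

# Number of units whose last observed wave is 1, 2, ..., n_perunit
last_wave_counts <- -diff(nleft)
```

With these inputs, 20 units at wave 1 become 13 at wave 2 and 9 at wave 3, so 7, 4 and 9 units have waves 1, 2 and 3 as their last wave. fNew() then shuffles those last-wave labels across the units.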

UPGMA with hclust plotting branch lengths as raw distances

I'm working on a presentation about using UPGMA with the hclust() function in our research lab. According to the literature, the branch length calculated by UPGMA for any pair of elements should be 1/2 the pairwise distance between those two elements.
I'm noticing that the example dendrogram I'm building for the presentation isn't calculating branch lengths that I expected. I'm not finding anything in ?hclust that would make me think that I'm missing a function argument that is causing the UPGMA algorithm to use the raw distances as the branch lengths. I understand that in certain situations, due to the limitations of computation accuracy, having a dendrogram which is exactly ultrametric may not always be possible (from here and here, and I'm sure elsewhere as well). That still doesn't explain why I see the raw pairwise distances being plotted as the branch length between two elements.
Using the data below, here's the code I used to plot an example dendrogram...
demoDend <- hclust(d = as.dist(demoTable), method = "average") # make an hclust object; as.dist() turns the lower-triangular matrix into a dist object
# use the ggdendro package to extract segments and labels for ggplot plotting
dendData <- ggdendro::dendro_data(demoDend)
dendSegs <- dendData$segments
dendLabs <- dendData$labels
library(ggplot2)
ggplot()+
geom_segment(data = dendSegs, aes(x = x, y = y, xend = xend, yend = yend))+
geom_text(data = dendLabs, aes(x = x, y = y-0.05, label = label, angle = 90))+
geom_hline(aes(yintercept = 0.333), linetype = 2, color = "blue")+
geom_hline(aes(yintercept = 0.2), linetype = 2, color = "red")+
theme_bw()
The two elements that stand out are 13195 and 13199 which have a distance of 0.2, and whose branch length is being plotted as 0.2 (red line in ggplot).
Even after examining the hclust object, some of the heights for the branches are the raw distances in the input matrix, and not 1/2 the distance. Do I need to manually half the heights in the object before plotting? Maybe I don't understand UPGMA as well as I thought? Any help or insight into the implementation of UPGMA with hclust() would be greatly appreciated.
Here's the sample distance data that I'm working with, from dput()
demoTable <- structure(c(0, 0.333333333333333, 0.333333333333333, 0, 0, 0.333333333333333,
0.333333333333333, 1, 1, 1, 1, 1, 1, NA, 0, 0, 0.333333333333333,
0.333333333333333, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, 0, 0.333333333333333,
0.333333333333333, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA, 0, 0,
0.333333333333333, 0.333333333333333, 1, 1, 1, 1, 1, 1, NA, NA,
NA, NA, 0, 0.333333333333333, 0.333333333333333, 1, 1, 1, 1,
1, 1, NA, NA, NA, NA, NA, 0, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA,
NA, NA, NA, 0, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA,
0, 0.6, 0, 1, 0.6, 0.333333333333333, NA, NA, NA, NA, NA, NA,
NA, NA, 0, 0.6, 1, 0.5, 0.2, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 1, 0.6, 0.333333333333333, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 0, 0.5, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0, 0.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0
), .Dim = c(13L, 13L), .Dimnames = list(c("13187", "13188", "13189",
"13190", "13191", "13192", "13193", "13194", "13195", "13196",
"13197", "13198", "13199"), NULL))
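A possible explanation for the heights: hclust() stores in `height` the dissimilarity at which two clusters are merged, and with method = "average" that is the mean pairwise distance between their members. A UPGMA tree in the phylogenetic sense places each node at half that value, so halving the heights before plotting would give the expected branch lengths. The sketch below checks this on a made-up three-point distance matrix (the labels and values are assumptions, not taken from demoTable).

```r
# Made-up symmetric distance matrix for three items
m <- matrix(c(0,   0.2, 0.6,
              0.2, 0,   0.5,
              0.6, 0.5, 0),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

hc <- hclust(as.dist(m), method = "average")

# Merge heights are the raw (average) distances: a-b at 0.2,
# then {a, b} with c at (0.6 + 0.5) / 2 = 0.55
hc$height

# Copy with halved heights, matching the UPGMA branch-length convention
hc_half <- hc
hc_half$height <- hc$height / 2
```

Plotting `hc_half` instead of `hc` puts the a-b node at 0.1, half the pairwise distance of 0.2.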

Comparing test data and prediction outcome

I am trying logistic regression on a dataset. I have successfully divided my dataset into train and test sets. The regression model also works fine; however, when I apply it to my test set I only get an outcome for 393 observations, although my test dataset has 480 rows. How can I compare the two and find the mismatch, or find out what went wrong?
My data has no NAs.
I am trying to create a confusion matrix.
This is my code:
n=nrow(wine_log)
shuffled=wine_log[sample(n),]
train_indices=1:round(0.7*n)
test_indices=(round(0.7*n)+1):n
#Making a new dataset
train=shuffled[train_indices,]
test=shuffled[test_indices,]
wmodel = glm(final_take~., family = binomial, data=train)
summary(wmodel)
result1 = predict(wmodel, newdata = test, type = 'response')
result1 = ifelse(result > 0.5, 1, 0) # Can someone also explain how removing this will affect the outcome?
result1
> table(result1)
result1
0 1
255 138
> table(test$final_take)
Bad Good
418 62
structure(list(fixed_acid = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4,
7.9, 7.3, 7.8, 7.5), vol_acid = c(0.7, 0.88, 0.76, 0.28, 0.7,
0.66, 0.6, 0.65, 0.58, 0.5), c_acid = c(0, 0, 0.04, 0.56, 0,
0, 0.06, 0, 0.02, 0.36), res_sugar = c(1.9, 2.6, 2.3, 1.9, 1.9,
1.8, 1.6, 1.2, 2, 6.1), chlorides = c(0.076, 0.098, 0.092, 0.075,
0.076, 0.075, 0.069, 0.065, 0.073, 0.071), free_siox = c(11,
25, 15, 17, 11, 13, 15, 15, 9, 17), total_diox = c(34, 67, 54,
60, 34, 40, 59, 21, 18, 102), density = c(0.9978, 0.9968, 0.997,
0.998, 0.9978, 0.9978, 0.9964, 0.9946, 0.9968, 0.9978), pH = c(3.51,
3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35), sulphates = c(0.56,
0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8), alcohol = c(9.4,
9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5), final_take = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("Bad", "Good"
), class = "factor")), row.names = c(NA, -10L), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"),
Your line of code here:
result1 = ifelse(result > 0.5, 1, 0)
Should be referencing result1 in the ifelse statement. I'm guessing that result is another object you have in your environment that isn't 480 rows.
So you should use this instead.
result1 = ifelse(result1 > 0.5, 1, 0)
You also asked what this line of code is doing. It's basically a threshold for your predictions from the glm model. If the prediction from the model is greater than 0.50, then you are translating the prediction to a "1". If it's less than or equal to 0.50 then you are translating that prediction to a "0". It's a way to convert a probability to a TRUE/FALSE or 1/0.
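Once the thresholded predictions have the same length as the test set, the confusion matrix is a two-way table of predicted against actual classes. The sketch below uses simulated data (the predictor `x` and the sample sizes are assumptions for illustration) and maps the 0/1 threshold back onto the "Bad"/"Good" labels so both margins of the table use the same levels.

```r
set.seed(42)

# Simulated stand-in for the wine data: one predictor, two-level outcome
n <- 200
x <- rnorm(n)
final_take <- factor(ifelse(runif(n) < plogis(x), "Good", "Bad"),
                     levels = c("Bad", "Good"))
d <- data.frame(x = x, final_take = final_take)

train <- d[1:140, ]
test  <- d[141:200, ]

# glm() models the probability of the second factor level ("Good")
fit  <- glm(final_take ~ x, family = binomial, data = train)
prob <- predict(fit, newdata = test, type = "response")

# Threshold the probabilities at 0.5 and label them like the outcome
pred <- factor(ifelse(prob > 0.5, "Good", "Bad"), levels = c("Bad", "Good"))

# One prediction per test row, otherwise the table would be misaligned
stopifnot(length(pred) == nrow(test))

conf_mat <- table(predicted = pred, actual = test$final_take)
```

Checking `length(pred) == nrow(test)` before tabulating is a cheap way to catch the kind of mismatch described in the question.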

Coarsened Exact Matching with cem package- Error in .subset2(x, i, exact = exact)

I'm trying to perform coarsened exact matching on the following data.
> dput(head(cem_data))
structure(list(sex = c(1, 1, 1, 2, 2, 2), age = c(40, 59, 53,
60, 49, 60), edlev = c(3, 3, 3, 2, 3, 3), sw = c(44, 17, 10,
41, 26, 23), sw2 = c(15, 1, 5, 34, 5, 6), som = c(2.14, 0.14,
1.86, 3, 1.71, 2.14), som_2 = c(0.71, 0.14, 2, 2.57, 1.71, 2.14
), ap = c(3.5, 1.5, 1.33, 3.33, 2.67, 2.17), ap_2 = c(3, 0.17,
2.33, 3, 0.83, 1.67), dep = c(2.83, 0.17, 0.33, 2.83, 2.17, 2.33
), dep_2 = c(1.17, 0, 0.33, 2.33, 0.83, 1), int = c(2.86, 1.43,
1, 2, 2.29, 2.14), int_2 = c(2.29, 0.57, 0.14, 2.57, 1.71, 1.43
), pho = c(3.2, 0, 0, 3.4, 0.8, 0.4), pho_2 = c(1.6, 0, 0, 3.2,
0, 0.4), psy_b = c(2.67, 0.11, 0.83, 3.06, 1.61, 1.72), psy_b_2 = c(1.11,
0.06, 0.89, 2.67, 0.94, 1.28), s_wirk = c(4, 2.2, 1.6, 3.2, 1.4,
2.2), s_wirk_2 = c(2.8, 0.8, 1.8, 2.6, 1.6, 1.4), soz_b = c(2.75,
1.5, 1, 2.25, 1.25, 1.5), soz_b_2 = c(2.75, 1, 1, 2.25, 1.5,
1.25), soz_u = c(0.75, 0.75, 1.75, 3.25, 1, 3.25), soz_u_2 = c(1,
0.25, 1.75, 2.5, 2.5, 2), wohl = c(3.6, 1.4, 1.8, 3.4, 3, 3),
wohl_2 = c(2, 0.6, 1.4, 2.8, 2.2, 1.2), au_bei_aufn = c(1,
1, 1, 1, 1, 1), age_reha = c(40.9890410958904, 59.3945205479452,
53.372602739726, 60.2, 49.3342465753425, 60.7534246575342
), group_format = c(0, 0, 0, 0, 0, 0)), row.names = c(6L,
7L, 10L, 15L, 20L, 29L), class = "data.frame")
With the following code:
require(cem)
voll_data <- voll_data %>%
select(-c("auf_nr", "icd_10_1", "icd_10_2", "icd_10_3", "icd_10_4","icd_10_5", "bdi_date", "aufnahme", "entlassung")) %>%
mutate_if(is.factor,as.numeric) %>%
mutate_if(is.character, as.numeric)
cem_data <- data.frame(na.omit(voll_data))
#cem_data_s <- scale(cem_data[,5:26])
#cem_data <- cbind.data.frame(cem_data[, 1:4], cem_data_s, cem_data[, 27:36])
variables <- c("age", "sex", "edlev", "sw","au_bei_aufn")
ungleich2 <- imbalance(cem_data$group_format, data=cem_data)
However, following error is being shown, when calculating the "matt".
Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index
7. (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, i, exact = exact))(x, ..., exact = exact)
6. [[.data.frame(data, treatment)
5. data[[treatment]]
4. is.factor(x)
3. as.factor(data[[treatment]])
2. cem.main(treatment = treatment, data = data, cutpoints = cutpoints, drop = drop, k2k = k2k, method = method, mpower = mpower, verbose = verbose, baseline.group = baseline.group, keep.all = keep.all)
1. cem(treatment = cem_data$group_format, data = cem_data, drop = "sw2", cutpoints = list(age = agecut), grouping = list(edlev_gr))
# automated coarsening
matt <- cem(cem_data$group_format, data = cem_data, drop= "sw2")
print(matt)
Does anyone have an idea what I am doing wrong?
Thanks a lot!!
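The traceback stops at `data[[treatment]]`, which suggests that cem() looks the treatment up by column name, so passing the column's values (`cem_data$group_format`) makes `[[` try to index with 0s. The base R sketch below reproduces that failure on a toy data frame (`cem_toy` and its values are made up for illustration) and shows the lookup by name. The cem() call appears only as a comment, following the usage shown in the package documentation and assuming cem is installed.

```r
# Toy data frame with a 0/1 treatment column
cem_toy <- data.frame(age = c(40, 59, 53), group_format = c(0, 0, 1))

# Passing the column's values: [[ ]] receives c(0, 0, 1) as an index and fails
by_values <- tryCatch(cem_toy[[cem_toy$group_format]],
                      error = function(e) conditionMessage(e))

# Passing the column's name: [[ ]] returns the treatment vector
by_name <- cem_toy[["group_format"]]

# Corresponding call with the treatment given as a column name:
# matt <- cem(treatment = "group_format", data = cem_data, drop = "sw2")
```

imbalance() is different in this respect: its first argument is the group vector itself, so `cem_data$group_format` is the expected form there.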
