MICE not imputing all variables with missing values - r

I'm struggling to get mice to impute all the variables with missing values in my dataset. It works perfectly for 4 of the variables but not for 3 others: GCSPupils, Hypoxia and Hypotension. I also get 3 logged events, which I suspect correspond to those 3 variables, but I can't figure out the issue. Those variables do vary (they are not constants), so mice should work. I want to do single imputation of 7 variables (the other variables have complete data).
# We run the mice code with 0 iterations
imp <- mice(TXAIMPACT_final, maxit = 0)
# Extract predictor Matrix and methods of imputation
predM <- imp$predictorMatrix
meth <- imp$method
#Setting values of variables I'd like to leave out to 0 in the predictor matrix
predM[,c("subjectId")] <- 0
# Specify a separate imputation model for variables of interest
# Dichotomous variable
log <- c("Hypotension", "Hypoxia")
# Unordered categorical variable
poly2 <- c("GCSPupils", "GCSMotor")
# Turn their methods matrix into the specified imputation models
meth[log] <- "logreg"
meth[poly2] <- "polyreg"
Here, I check to make sure "meth" is correct, and it is:
meth
  subjectId         Age         GCS    GCSMotor   GCSPupils     Glucose  Hemoglobin
         ""          ""          ""   "polyreg"   "polyreg"       "pmm"       "pmm"
Hypotension     Hypoxia  MarshallCT         SAH         EDH         GOS        GFAP
   "logreg"    "logreg"       "pmm"          ""          ""          ""          ""
The methods are all as I specified. I do notice something odd about the predictor matrix: the 3 variables that are not imputing have only 0s in both their rows and their columns:
predM
            subjectId Age GCS GCSMotor GCSPupils Glucose Hemoglobin Hypotension Hypoxia MarshallCT SAH EDH GOS GFAP
subjectId           0   1   1        1         0       1          1           0       0          1   1   1   1    1
Age                 0   0   1        1         0       1          1           0       0          1   1   1   1    1
GCS                 0   1   0        1         0       1          1           0       0          1   1   1   1    1
GCSMotor            0   1   1        0         0       1          1           0       0          1   1   1   1    1
GCSPupils           0   0   0        0         0       0          0           0       0          0   0   0   0    0
Glucose             0   1   1        1         0       0          1           0       0          1   1   1   1    1
Hemoglobin          0   1   1        1         0       1          0           0       0          1   1   1   1    1
Hypotension         0   0   0        0         0       0          0           0       0          0   0   0   0    0
Hypoxia             0   0   0        0         0       0          0           0       0          0   0   0   0    0
MarshallCT          0   1   1        1         0       1          1           0       0          0   1   1   1    1
SAH                 0   1   1        1         0       1          1           0       0          1   0   1   1    1
EDH                 0   1   1        1         0       1          1           0       0          1   1   0   1    1
GOS                 0   1   1        1         0       1          1           0       0          1   1   1   0    1
GFAP                0   1   1        1         0       1          1           0       0          1   1   1   1    0
I think this is the problem, but I'm not sure how to solve it. Finally, here is my single imputation:
imp2 <- complete(mice(TXAIMPACT_final, maxit = 1,
                      predictorMatrix = predM,
                      method = meth, print = TRUE))
iter imp variable
1 1 GCSMotor Glucose Hemoglobin MarshallCT
1 2 GCSMotor Glucose Hemoglobin MarshallCT
1 3 GCSMotor Glucose Hemoglobin MarshallCT
1 4 GCSMotor Glucose Hemoglobin MarshallCT
1 5 GCSMotor Glucose Hemoglobin MarshallCT
Warning: Number of logged events: 3
Thanks in advance!

Figured it out--posting here in case someone else has this issue. My variables that were not imputing were stored as character classes, which blocked imputation. As soon as I switched them to numeric, my issues disappeared.
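
For anyone landing here with the same symptom, below is a minimal sketch of how one might find and convert character columns before calling mice(). The data frame `dat` and its values are hypothetical stand-ins for TXAIMPACT_final. The sketch converts to factor, which is the type logreg and polyreg are designed for; the answer above reports that converting to numeric also worked in that case. After a real run, the logged events themselves can be inspected in the loggedEvents component of the object returned by mice().

```r
# Hypothetical stand-in for TXAIMPACT_final: two categorical variables
# were read in as character instead of factor.
dat <- data.frame(
  Age         = c(34, 51, 27, 63, 45),
  Hypotension = c("0", "1", NA, "0", "1"),
  GCSPupils   = c("both", "one", NA, "none", "both"),
  stringsAsFactors = FALSE
)

# Find the columns stored as character
is_chr <- vapply(dat, is.character, logical(1))
print(names(dat)[is_chr])

# Convert them to factors; NA values stay NA, so they can still be imputed
dat[is_chr] <- lapply(dat[is_chr], factor)
print(sapply(dat, class))

# The converted data could then be passed to mice as in the question, e.g.
# imp <- mice::mice(dat, maxit = 0)
```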

Related

dummyVars() in r and weird column names in R

For a dataset like the one below, I need N-level dummy variables, so I use dummyVars() from the caret package.
As you can see, the column names ignore the sep = "-" argument, and there are dots in the column names where the < and > signs should be.
df <- data.frame(fruit=as.factor(c("apple", "orange","orange", "carrot", "apple")),
st=as.factor(c("CA", "MN","MN", "NY", "NJ")),
wt=as.factor(c("<2","2-4",">4","2-4","<2")),
buy=c(1,1,0,1,0))
fruit st wt buy
1 apple CA <2 1
2 orange MN 2-4 1
3 orange MN >4 0
4 carrot NY 2-4 1
5 apple NJ <2 0
library(caret)
dmy <- dummyVars(buy~ ., data = df, sep="-")
df2 <- data.frame(predict(dmy, newdata = df))
df2
fruit.apple fruit.carrot fruit.orange st.CA st.MN st.NJ st.NY wt..2 wt..4 wt.2.4
1 1 0 0 1 0 0 0 1 0 0
2 0 0 1 0 1 0 0 0 0 1
3 0 0 1 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 1
5 1 0 0 0 0 1 0 1 0 0
I am confused about why dummyVars() is not carrying the actual levels into the column names and why it is ignoring the separator argument.
I would appreciate any hint on what I am doing wrong!
EDIT: for future readers :) ! Following AKRUN's note, adding the check.names = FALSE argument to data.frame() solved the problem.
df2 <- data.frame(predict(dmy, newdata = df), check.names = FALSE)
fruit-apple fruit-carrot fruit-orange st-CA st-MN st-NJ st-NY wt-<2 wt->4 wt-2-4
1 1 0 0 1 0 0 0 1 0 0
2 0 0 1 0 1 0 0 0 0 1
3 0 0 1 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 1
5 1 0 0 0 0 1 0 1 0 0
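
The dots come from data.frame() rather than from dummyVars(): with the default check.names = TRUE, column names are passed through make.names(), which replaces characters that are not allowed in syntactic R names with a dot. A small base-R sketch, reusing a few of the column names from the output above (the one-row matrix `m` is only an illustrative placeholder):

```r
# Column names as they come out of predict(dmy, ...) with sep = "-"
raw_names <- c("fruit-apple", "wt-<2", "wt->4", "wt-2-4")

# data.frame(check.names = TRUE) sanitises names with make.names(),
# turning each invalid character ("-", "<", ">") into "."
fixed <- make.names(raw_names, unique = TRUE)
print(fixed)

# With check.names = FALSE the original names are kept verbatim
m <- matrix(0, nrow = 1, ncol = 4, dimnames = list(NULL, raw_names))
kept <- names(data.frame(m, check.names = FALSE))
print(kept)
```

Columns with non-syntactic names then need backticks when referenced, e.g. df2$`wt-<2`.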

Can the acf function properly compute autocorrelation with a binary vector?

Hi, suppose I have a vector consisting of 0s and 1s; will the default acf function calculate the autocorrelation correctly?
set.seed(12)
bin <- sample(c(0, 1), replace = TRUE, size = 5000)
acf(bin)
Yes. Your example shows no autocorrelation because the draws are independent of each other, so there is nothing for acf to detect. But we can create a binary sample whose probability of a 1 oscillates over time, like this:
times <- seq(0, 20 * pi, pi / 6)
probs <- sin(times) * 0.5 + 0.5
Our probability of getting a 1 at each time step looks like this:
plot(times, probs, type = "l")
And we can generate a sample like this:
set.seed(1)
samp <- rbinom(length(times), 1, probs)
samp
#> [1] 0 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0
#> [38] 1 1 1 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1
#> [75] 1 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1
#> [112] 1 1 0 0 0 0 0 0 0 1
And we can demonstrate that acf correctly identifies the autocorrelation:
acf(samp)
Created on 2022-02-12 by the reprex package (v2.0.1)
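
To look at the numbers rather than the plot, one option is to call acf() with plot = FALSE and compare the result against the textbook sample autocorrelation computed by hand. This is only a sketch: `manual_acf` is a helper written for this illustration, not part of any package. The probability above has a period of 12 time steps (2 * pi divided by pi / 6), so lag 12 is included.

```r
# Recreate the sample from the answer above
times <- seq(0, 20 * pi, pi / 6)
probs <- sin(times) * 0.5 + 0.5
set.seed(1)
samp <- rbinom(length(times), 1, probs)

# acf() values for lags 0..12; element k + 1 holds lag k
res <- acf(samp, lag.max = 12, plot = FALSE)
acf_vals <- as.numeric(res$acf)

# Sample autocorrelation at lag k: lagged cross-products of the
# mean-centred series divided by the total sum of squares
manual_acf <- function(x, k) {
  n <- length(x)
  m <- mean(x)
  sum((x[1:(n - k)] - m) * (x[(k + 1):n] - m)) / sum((x - m)^2)
}
manual <- sapply(0:12, function(k) manual_acf(samp, k))

print(round(cbind(lag = 0:12, acf = acf_vals, manual = manual), 3))
```

The two columns agree, which shows that acf() treats the 0/1 vector like any other numeric series.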

How do you sum different columns of binary variables based on a desired set of variables/column?

I used the code below for a total of 25 variables and it worked. Each new variable is either 1 or 0:
jb$finances <- ifelse(grepl("Finances", jb$content.roll),1,0)
I want to add up the number of 1s in each row across a selected set of the columns I just made (using the code above) and store the result in another column called "sum.content". I used the code below:
jb <- jb %>%
mutate(sum.content=sum(jb$Finances,jb$Exercise,jb$Volunteer,jb$Relationships,jb$Laugh,jb$Gratitude,jb$Regrets,jb$Meditate,jb$Clutter))
I didn't get an error using the code above, but I did not get the outcome I wanted.
The result was 14 for every row. I was expecting something no greater than 9, since I only selected 9 variables. I don't want to delete the other variables like V1 and V2; I just want to sum some of the variables.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is what I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to count the number of 1s in each row (within the columns I select). How would I go about doing this in code for a chosen set of columns?
Here is an answer that uses dplyr to sum across rows of variables starting with the letter V. We'll simulate some data, convert to binary, and then sum the rows.
library(dplyr)
# simulate a 10 x 10 matrix of normal draws
data <- matrix(rnorm(100, 100, 30), nrow = 10)
# recode to binary
data <- apply(data, 2, function(x) ifelse(x > 100, 1, 0))
# change some of the column names to illustrate impact of
# select() within mutate()
colnames(data) <- c(paste0("V", 1:5), paste0("X", 1:5))
as.data.frame(data) %>%
  mutate(total = select(., starts_with("V")) %>% rowSums)
...and the output, where total equals the sum of V1 to V5 and ignores X1 to X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
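
The reason every row showed 14 in the question is that sum() collapses all of its arguments into one grand total, which mutate() then recycles down the column. A base-R sketch of the difference, using a small made-up data frame (the values in `jb` here are hypothetical, not the asker's data):

```r
# Hypothetical data with a few of the 0/1 indicator columns
jb <- data.frame(
  V1            = c(1, 2, 2),
  Finances      = c(1, 0, 0),
  Exercise      = c(1, 1, 0),
  Volunteer     = c(1, 0, 0),
  Relationships = c(1, 0, 1),
  Laugh         = c(0, 1, 1)
)

# Columns to add up; V1 is deliberately left out
content_cols <- c("Finances", "Exercise", "Volunteer", "Relationships", "Laugh")

# sum() returns a single number for the whole selection ...
grand_total <- sum(jb$Finances, jb$Exercise, jb$Volunteer,
                   jb$Relationships, jb$Laugh)
print(grand_total)

# ... whereas rowSums() returns one total per row
jb$sum.content <- rowSums(jb[, content_cols])
print(jb)

# A dplyr (>= 1.0) equivalent should be:
# jb %>% mutate(sum.content = rowSums(across(all_of(content_cols))))
```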

Logistic regression outcome variable predictions in r

I'm using logistic regression to predict a binary outcome variable (Group, 0/1).
I've noticed something: I have two variables representing the same outcome. One is coded simply as "0" or "1".
> df$Group
  [1] 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 [59] 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0
[117] 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1
[175] 1 0 1
Levels: 0 1
> is.factor(df$Group)
[1] TRUE
The same goes for the other one, which represents the same thing but has name labels:
> df$Group2
 [1] CON CI  CON CI  CI  CON CI  CI  CON CI  CI  CI  CON CI
[15] CI  etc.
Levels: CI CON
> is.factor(df$Group2)
[1] TRUE
> contrasts(df$Group2)
CI  0
CON 1
Here, 0 in the first variable = CON and 1 = CI. I created that first numerical variable because I wanted CI to be my "1" group and CON the 0 reference group, but whenever I applied as.factor to the original variable, CI became level 1 and CON level 2.
I thought they were the same thing, but when I plotted the odds ratios using the sjPlot package and checked to be sure, I noticed that the ORs were quite different, even though the coefficients in summary() of the glm model looked the same (apart from the sign of the estimates, which makes sense as the two groups are coded differently). Specifically, the plotted ORs are clearly bigger with the numerical variable and smaller with the "name" variable.
Am I missing something in my understanding of R (I'm self-taught) or of how logistic regression is computed? Which of the variables should I use in logistic regression? And how can I change the fact that for the "name" variable R uses "CI" as the 0 reference group instead of CON? Thank you.
By default, R orders the levels of a factor alphabetically. You can set your own order with
df$Group2 <- factor(df$Group2, levels=c('CON','CI'))
Then CON is used as the reference level in the logistic regression, and you should get the same results as with the 0/1 coding.
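
To see why the odds ratios differed, here is a small simulated sketch (the data and the predictor `x` are made up for illustration). With a factor response, glm() treats the first level as "failure" and the other level as "success", so alphabetical levels make CON the modelled event. That flips the sign of every coefficient, which turns each odds ratio into its reciprocal.

```r
set.seed(42)
x <- rnorm(200)                                   # hypothetical predictor
group01 <- rbinom(200, 1, plogis(0.5 + 0.8 * x))  # 1 = CI, 0 = CON

# Same outcome with name labels; default levels are alphabetical: CI, CON
group_alpha <- factor(ifelse(group01 == 1, "CI", "CON"))
# Same outcome again with CON explicitly set as the first (reference) level
group_ref <- factor(group_alpha, levels = c("CON", "CI"))

fit01     <- glm(group01 ~ x, family = binomial)
fit_alpha <- glm(group_alpha ~ x, family = binomial)
fit_ref   <- glm(group_ref ~ x, family = binomial)

# Odds ratios: the alphabetical coding gives the reciprocals of the
# 0/1 coding, while the releveled factor reproduces the 0/1 results
print(round(cbind(
  or_01     = exp(coef(fit01)),
  or_alpha  = exp(coef(fit_alpha)),
  or_releveled = exp(coef(fit_ref))
), 3))
```

relevel(group_alpha, ref = "CON") is another way to get the same level order.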

"One-sided" Predictor Variable in Logistic Regression

Background Information:
I have 6 subjects; we'll call them A,B,C,D,E, and F.
Suppose they are asked to shoot basketballs from a free-throw line into a basket. A success is 1 and a failure is 0.
They performed as follows:
A - 0 0 1 0 0 1 1 1 0 0 1
B - 0 0 0 0 0 0 0 0 0 0 0
C - 1 0 1 1 0 0 0 0 1 0 0
D - 1 1 1 1 1 1 1 1 1 1 1
E - 1 1 0 0 0 0 0 1 1 0 0
F - 0 1 0 0 1 0 0 1 1 0 0
Question:
Now suppose I wanted to test that all of these subjects had the same probability of making a basket.
I would set up the logistic regression as follows: Success (whether the shot scored) is the outcome, and Subject is the predictor variable.
Success ~ Subject.
Now this is where I get tangled up: I have "one-sided" predictor levels. By that I mean there is one subject who scored all of their baskets and one who scored none of theirs. How do we handle this type of logistic regression in R? Or can you suggest another method?
Thanks a ton!
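
For reference, a minimal sketch of how the setup described above could be entered in R, using the results listed in the question. The object names (`shots`, `dat`, `full`, `null`) are chosen for this illustration. The comparison against an intercept-only model via a likelihood ratio test is one possible way to test equal success probabilities, offered here as an assumption rather than as the definitive analysis.

```r
# Shot results copied from the question (1 = success, 0 = failure)
shots <- list(
  A = c(0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1),
  B = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  C = c(1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0),
  D = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  E = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
  F = c(0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0)
)

# Long format: one row per shot
dat <- data.frame(
  Subject = factor(rep(names(shots), lengths(shots))),
  Success = unlist(shots, use.names = FALSE)
)
print(table(dat$Subject, dat$Success))

# Subjects B and D are the "one-sided" cases: glm() still fits, but warns
# that fitted probabilities of 0 or 1 occurred, and the estimates and
# standard errors for those two subjects are not reliable
full <- suppressWarnings(glm(Success ~ Subject, family = binomial, data = dat))
null <- glm(Success ~ 1, family = binomial, data = dat)

# Likelihood ratio test of "all subjects share one success probability"
print(anova(null, full, test = "Chisq"))
```

The warnings are suppressed here only to keep the output short; in practice they are the signal that the data are separated for some subjects.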
