Multiple imputation of categorical variables in large data set - r

I have a data set with 4000+ observations of 130 variables, and about half of those variables have missingness. I'm trying to use this code, which creates five imputed data sets:
mice(data_frame, method = c(rep("pmm", 130)), m = 5, maxit = 5)
However, this code only imputes on my numeric variables and does nothing to my categorical variables; it just ignores the categorical variables.
I thought maybe the problem was that I was using Predictive Mean Matching for everything and that it was meant for numeric data only, but I tested using this on a smaller data set and it seems to be able to impute categorical data just as well as numeric data when using PMM. So I'm stumped about why the code is just ignoring my categorical variables. I tried "logreg" and "polyreg" on a few of those variables, but they still just get ignored.
Any ideas?

I figured out my problem: my variables were character class rather than factor class, and the MICE algorithm ignores character classes. Once I converted all of the variables into factors, the code above worked fine. I used sapply on the variables that needed to become factors to make things easier on myself.
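A minimal sketch of that conversion, assuming a small made-up data_frame (the column names here are illustrative, not the asker's). It uses lapply rather than sapply, because sapply can simplify its result to a matrix and drop the factor class:

```r
# Hypothetical example data with two character columns
data_frame <- data.frame(
  age = c(23, NA, 41, 35),
  sex = c("f", "m", NA, "f"),
  region = c("north", NA, "south", "north"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
is_chr <- vapply(data_frame, is.character, logical(1))
data_frame[is_chr] <- lapply(data_frame[is_chr], factor)

str(data_frame)
```

After this step the categorical columns are factors with their NA values intact, which is the form the answer above says mice needs.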

Related

How do I structure my data to match the structure of the diabetes dataset in R?

I'd like to structure my own data similar to the diabetes dataset in the monomvn package of R to try out some different regression model examples (specifically, LASSO regression).
The diabetes data lists three variables (x, x2, and y), but both x and x2 contain several sub-levels of variables (such as age, bmi, etc.). These x and x2 variables are called in different regression model examples that ultimately reference these sub-levels of variables.
Unfortunately, any time that I try to structure my data to match the diabetes data set, it either is coerced to a list within the variable (which prevents the examples from working as they should), or does not get classified into the single x variable (instead reading each variable individually as x.age, x.bmi, etc.). My goal is to be able to use df$x to reference multiple variables; in other words, df$x should reference df$x.var1, df$x.var2, df$x.var3. I have up to 240 variables that I'd like to code within a single x variable.
The closest I could get is:
df$x.var1 <- data.frame(as.numeric(master_data$var1))
df$x.var2 <- data.frame(as.numeric(master_data$var2))
df$x.var3 <- data.frame(as.numeric(master_data$var3))
No errors in creating the data frame; however, I still need to reference the whole variable name (x.var1) instead of "x" that refers to all of the sub-variables in order for any of the regression examples to work; with that approach, I can't list ~240 variable names as x variables in the lasso regression models.
In Matlab, I would structure this as a structure of sub-variables (var1, var2, var3, etc.) within a structure named "x"; however, I'm doing this in R and am currently unable to see how I could complete that type of task.
The diabetes data set I'm referencing is found here:
library(monomvn)
data(diabetes)
If it's helpful, the diabetes data set classifies the "x" and "x2" variables "AsIs" (although all sub-variables appear to be numeric) while "y" is numeric.
FYI, I do have some NA values in my own data set, but I haven't received any errors that makes me think that has something to do with this issue; however, the diabetes data set does not have NA values, so I'm not ruling out the possibility.
If anyone could provide some guidance about how to put numeric data into a format that matches the diabetes data set, that would be incredibly helpful. Thanks in advance.
The man page describes the data as a data.frame containing the following columns: x a matrix with 10 columns, y a numeric vector, and x2 a matrix with 64 columns.
The problem is you're trying to assign the data as columns of the data.frame. You need to first assemble the two matrices and then assign them to the data.frame.
Something like this:
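# data.frame() splits a bare matrix into one column per matrix column;
# wrapping each matrix in I() keeps it as a single "AsIs" matrix column.
mydata.x <- matrix(runif(500, -2, 2), ncol = 10)
mydata.y <- runif(50, 50, 500)
mydata.x2 <- matrix(runif(3200, -2, 2), ncol = 64)
mydiabetes <- data.frame(x = I(mydata.x), y = mydata.y, x2 = I(mydata.x2))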
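For data that already sits in an ordinary data frame, as in the question, the same idea might look like the sketch below. Here master_data, its column names, and the outcome column are made-up stand-ins, not the asker's real data:

```r
# Hypothetical stand-in for the asker's master_data
set.seed(42)
master_data <- data.frame(
  var1 = rnorm(20), var2 = rnorm(20), var3 = rnorm(20), outcome = rnorm(20)
)

x_cols <- c("var1", "var2", "var3")  # could be ~240 names in practice

# Start with the response, then attach the predictors as ONE matrix column
df <- data.frame(y = master_data$outcome)
df$x <- as.matrix(master_data[, x_cols])

dim(df$x)       # rows x number of predictors
colnames(df$x)  # the sub-variables are still addressable, e.g. df$x[, "var1"]
```

With this layout df$x refers to all predictors at once, which is how the diabetes examples use x.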

How to impute only one or some columns with mice R

I am experimenting with the mice package in R and am curious about how I can leave columns out of the imputation.
If I want to run a mean imputation on just one column, the mice.impute.mean(y, ry, x = NULL, ...) function seems to be what I would use. I'm struggling to understand what I need to include as the third argument to get this to work.
If I have a data set that includes categorical data such as name, ID, birth date, etc., which should not affect the calculation of other columns and should not be filled in when missing, how do I tell mice to exclude these columns from its calculation?
I've been using the nhanes data set that ships with mice for my exploration.
Thanks
I don't know your data, so I can't create an example for you, but these parameters of the mice() function are exactly what you are looking for:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
With this parameter you can define which columns are used as predictors when imputing a specific column.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
Here you can define which cells, and therefore which columns, should receive imputations.
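As a rough illustration on nhanes, the sketch below leaves one column unimputed and also keeps it out of the predictor set. Treating hyp as the column to exclude is an arbitrary choice for the example, and it uses the empty method string rather than where:

```r
library(mice)

# nhanes ships with mice: columns age, bmi, hyp, chl
meth <- make.method(nhanes)
meth["hyp"] <- ""            # empty method: hyp is not imputed

pred <- make.predictorMatrix(nhanes)
pred[, "hyp"] <- 0           # hyp is not used to predict other columns

imp <- mice(nhanes, method = meth, predictorMatrix = pred,
            m = 5, printFlag = FALSE, seed = 123)

completed <- complete(imp, 1)
colSums(is.na(completed))    # hyp keeps its NAs, bmi and chl are filled in
```

Setting an entry of meth to "mean" instead would select mice.impute.mean for that column, so the function does not have to be called directly.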

How to return NAs for predictions where some predictors have missing values

This is very similar to the following question: R SVM return NA for predictions with missing data
However, the response suggested there does not work (at least for me). Therefore I would like to be more general and try a different approach (or adjust the one proposed there). I can predict using my svm model on the complete.cases() of my data frame. However, it is very important for me to have NA values for all rows with missing data.
My theoretical approach is the following: predict on the complete.cases() of my data frame. Find the indices of the complete cases. Somehow cbind the column of predictions back to my data.frame(), while adding NA values for all rows whose indices are not among the complete cases. In essence, I should create a column in a data frame by combining two vectors: one of predictions, the other of NA values (based on known indices). However, I have not managed to write the few lines of code for doing that.
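The approach described above can be sketched in a few lines. This is only an illustration: it swaps in lm and the built-in airquality data as stand-ins for the asker's svm model and data frame, and the same indexing should carry over to any model whose predict() drops incomplete rows:

```r
# Stand-in model: lm on the built-in airquality data (rows with NA are dropped when fitting)
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Rows where every predictor is observed
predictors <- c("Solar.R", "Wind", "Temp")
ok <- complete.cases(airquality[, predictors])

# Start with all NA, then fill in predictions only at the complete rows
pred <- rep(NA_real_, nrow(airquality))
pred[ok] <- predict(fit, newdata = airquality[ok, ])

result <- cbind(airquality, pred = pred)
head(result)
```

Because pred is created with one entry per row, it lines up with the original data frame, and rows with a missing predictor keep NA.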

R mice function does not apply customized predictor matrix

I want to impute a part of my data set with mice. My data set has very many variables, which is why I don't want to impute all the variables but only those which I will use in my model.
(I know that as much information as possible should be used for the imputation, but I am already using 41 variables, which according to literature should be more than enough.)
My problem: I don't want every variable to be imputed at all times, because I have several measurement points. So of course my variables at t4 have many missing values, but I don't want to impute them when people just haven't filled out the questionnaire at that point.
So I specified a predictor matrix, in which all of the variables at t0 (e.g. A103.0) are imputed, but not at t4 (A103.4).
However, when running mice, it just uses "pmm" for all of the variables, and every variable is imputed.
Any suggestions on what went wrong are highly appreciated; I have spent quite some time trying to find out what happened.
This is what I've done:
I create an object with all the columns I want to impute
impute <- c("A103", "A104", "A107", #SVF
"A302.0", "A303.0", "A304.0", "A305.0", "A306.0",
"A502_01.0", "A502_02.0", "A502_03.0", "A502_04.0",
"A504.0","A506.0", "A508.0", "W003.0", "W005.0",
"A509_02.0", "A509_03.0", "A509_06.0", "A509_10.0",
"A302.4", "A303.4", "A304.4", "A305.4", "A306.4",
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4",
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4")
I create a subset of the columns (and all rows of course) which I want to impute
imp <- mice(ds_wide[ ,impute], maxit=0)
imp$predictorMatrix
pred <- imp$predictorMatrix
pred [c("A302.4", "A303.4", "A304.4", "A305.4", "A306.4", #ABB.4
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4", #PSWQ.4
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4"), ] <- 0
View(pred) #looks exactly how I want it to look like
imp <- mice(ds_wide[ ,impute], m=5, predictorMatrix = pred)
miceimp <- complete (imp)
anyNA(miceimp)
View(miceimp)
When I check miceimp (my result), there are no missing values whatsoever, so all the variables at t4 are imputed even though I specified otherwise. What did I do wrong?
Actually, what would be best for me is to impute the t4 variables only for people who have at least some t4 data. So people who filled out the t4 questionnaire should be imputed, and people who did not take part at that measurement point should not.
If anyone has any ideas how to make that possible, that would be great!
Many thanks!
I am not completely sure I understood what you are trying to achieve.
As I understand it, you do not want to impute all of your variables (but you do want to include all of them as input to the algorithm).
You were trying to define the parameter predictorMatrix:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
To me it sounds like this parameter defines which variables are used as input.
In comparison, the where parameter sounds to me like the correct one for specifying which variables should be imputed.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
So my conclusion would be to try out the where parameter instead of predictorMatrix.
In mice, in addition to setting predictorMatrix to zero for the variables that should not be imputed, you must specify an empty string ("") in method for those variables.
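For the follow-up wish in the question (impute t4 only for people who took part at t4), a where matrix might be built as in the sketch below. The data set and the names a0, b0, a4, b4 are invented for illustration. As a simplification, the t4 columns are also removed from the predictor set so that their remaining NAs cannot leak into other imputations:

```r
library(mice)

# Hypothetical two-wave data: a0/b0 at t0, a4/b4 at t4
set.seed(1)
n <- 60
dat <- data.frame(a0 = rnorm(n), b0 = rnorm(n))
dat$a4 <- dat$a0 + rnorm(n)
dat$b4 <- dat$b0 + rnorm(n)
dat$a0[sample(n, 8)] <- NA
dat[41:60, c("a4", "b4")] <- NA   # these people skipped t4 entirely
dat$a4[c(3, 7, 12)] <- NA         # item nonresponse among t4 participants
dat$b4[c(5, 20)] <- NA

t4 <- c("a4", "b4")

# Impute every missing cell except t4 cells of people with no t4 data at all
where <- is.na(dat)
skipped <- rowSums(!is.na(dat[, t4])) == 0
where[skipped, t4] <- FALSE

# Simplification: do not use the (still incomplete) t4 columns as predictors
pred <- make.predictorMatrix(dat)
pred[, t4] <- 0

imp <- mice(dat, where = where, predictorMatrix = pred,
            m = 2, maxit = 2, printFlag = FALSE, seed = 1)
done <- complete(imp, 1)
colSums(is.na(done))
```

In the completed data the people who skipped t4 keep NA in a4 and b4, while everyone else has all cells filled in.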

MICE Function Missing Dates

I am working with a price file that has a number of missing weekend values. I am using the MICE function to impute weekend prices. The mice function doesn't allow non-numeric values and errors out if the date is included. This is the reason I use [,2:33], but I need a date so I can join it back to another file. I have tried converting the date to a number, but reversing that conversion at the end of the process yields NAs. Looking for suggestions to keep the dates in the dataframe.
Snippet Example
The link above has a snippet of the data set.
Code for mice function
Imputed <- mice(Features[,2:33], m=5, maxit = 5, method = 'pmm', seed = 500)
# unpacking a large mids
df <- complete(Imputed, action = 1L, include = FALSE)
The easiest solution here would be to remove the date column before imputation and add the dates back to the data.frame afterwards.
Since mice does not change the order of the rows, this can be done easily.
As an alternative, mice can also be set to impute only certain columns or to use only certain columns as predictors. I think that if you exclude the date this way, it might no longer throw an error. The parameter is:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
But the first solution, removing the column and adding it back afterwards, is probably easier to perform.
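A minimal sketch of the first solution, assuming a made-up Features data frame with the date in the first column and a few numeric price columns (the real file has 33 columns and is not shown in the question):

```r
library(mice)

# Hypothetical stand-in for Features: a date column plus numeric prices
set.seed(500)
Features <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 40),
  p1 = rnorm(40, 100, 5),
  p2 = rnorm(40, 50, 2),
  p3 = rnorm(40, 20, 1)
)
Features$p1[c(4, 5, 11, 12)] <- NA
Features$p2[c(18, 19)] <- NA

# Impute everything except the date column
Imputed <- mice(Features[, -1], m = 5, maxit = 5, method = "pmm",
                seed = 500, printFlag = FALSE)

# Row order is preserved, so the dates can simply be bound back on
df <- cbind(Features["date"], complete(Imputed, action = 1L))
head(df)
```

The date column never enters mice, so it keeps its Date class and can be used for the later join. If a date has already been turned into a number with as.numeric(), as.Date(x, origin = "1970-01-01") converts it back.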
