I am experimenting with the mice package in R and am curious about how I can leave columns out of the imputation.
If I want to run a mean imputation on just one column, the mice.impute.mean(y, ry, x = NULL, ...) function seems to be what I would use. I'm struggling to understand what I need to include as the third argument to get this to work.
If I have a data set that includes categorical data such as name, ID, birth date, etc., which should not affect the calculation of other columns and should not be filled in when missing, how do I tell mice to exclude these columns from its calculation?
I've been using the mice dataset nhanes for my exploration.
Thanks
I don't know your data, so I can't create an example for you, but these two parameters of the mice() function are exactly what you are looking for:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
With this parameter you can define which columns are used to impute a specific column.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
Here you can define for which cells (and therefore for which columns) imputations are created.
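A minimal sketch on nhanes of how a column could be left out entirely, assuming the goal is that hyp is neither imputed nor used as a predictor (the choice of hyp, and of mean imputation for bmi, is only for illustration). It takes the default settings from a dry run and then edits predictorMatrix and method:

```r
library(mice)

# Dry run (maxit = 0) only to obtain the default settings.
ini  <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix
meth <- ini$method

# Illustrative assumption: hyp plays the role of an ID-like column.
pred[, "hyp"] <- 0     # hyp is not used to predict any other column
meth["hyp"]   <- ""    # hyp itself is not imputed
meth["bmi"]   <- "mean" # mean imputation for bmi only

imp  <- mice(nhanes, m = 1, method = meth, predictorMatrix = pred,
             printFlag = FALSE, seed = 1)
done <- complete(imp)

# hyp keeps its missing values; bmi and chl are filled in.
colSums(is.na(done))
```

Setting the method via mice() this way means mice.impute.mean() does not have to be called by hand, so the x argument never needs to be supplied directly.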
I want to impute a part of my data set with mice. My data set has very many variables, which is why I don't want to impute all the variables but only those which I will use in my model.
(I know that as much information as possible should be used for the imputation, but I am already using 41 variables, which according to literature should be more than enough.)
My problem: I don't want every variable to be imputed at all times,
because I have several measurement points. So of course, my variables
at t4 have many missing, but I don't want to impute them when people
just haven't filled out the questionnaire at that point.
So I specified a predictor matrix in which all of the variables at t0 (e.g. A103.0) are imputed, but not those at t4 (A103.4).
However, when running mice, it just uses "pmm" for all of the variables, and every variable is imputed.
Any suggestions on what went wrong are highly appreciated; I have spent quite some time trying to find out what happened.
This is what I've done:
I create an object with all the columns I want to impute
impute <- c("A103", "A104", "A107", #SVF
"A302.0", "A303.0", "A304.0", "A305.0", "A306.0",
"A502_01.0", "A502_02.0", "A502_03.0", "A502_04.0",
"A504.0","A506.0", "A508.0", "W003.0", "W005.0",
"A509_02.0", "A509_03.0", "A509_06.0", "A509_10.0",
"A302.4", "A303.4", "A304.4", "A305.4", "A306.4",
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4",
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4")
I create a subset of the columns (and all rows of course) which I want to impute
imp <- mice(ds_wide[ ,impute], maxit=0)
imp$predictorMatrix
pred <- imp$predictorMatrix
pred [c("A302.4", "A303.4", "A304.4", "A305.4", "A306.4", #ABB.4
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4", #PSWQ.4
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4"), ] <- 0
View(pred) #looks exactly how I want it to look like
imp <- mice(ds_wide[ ,impute], m=5, predictorMatrix = pred)
miceimp <- complete(imp)
anyNA(miceimp)
View(miceimp)
When I check miceimp (my result), there are no missing values whatsoever, so all the variables at t4 are imputed even though I specified otherwise. What did I do wrong?
Actually, what would be best for me is to impute the t4 variables only for people who have at least some t4 data. So people who filled out t4 should be imputed, and people who did not take part at that measurement point should not.
If anyone has any ideas how to make that possible, that would be great!
Many thanks!
I am not completely sure I understood what you are trying to achieve.
As I understand it, you do not want to impute all your variables (but you do want to include all your variables as input to the algorithm).
You were trying to do this with the predictorMatrix parameter:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
To me it sounds like this parameter is used to define which variables are used as input.
In comparison, the where parameter sounds to me like the correct parameter to specify which variables should be imputed.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
So my conclusion would be to try out the where parameter instead of predictorMatrix.
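A rough sketch of the where approach on made-up data, assuming "did not take part at t4" means that every t4 variable is missing for that person. The data frame d and the names x0, y0, x4, y4 are hypothetical stand-ins for the real columns. Because skipped cells stay NA, the sketch also stops the t4 columns from predicting the t0 columns; that is a simplifying assumption, not a recommendation.

```r
library(mice)

set.seed(1)
n <- 60
d <- data.frame(x0 = rnorm(n), y0 = rnorm(n))
d$x4 <- d$x0 + rnorm(n)
d$y4 <- d$y0 + rnorm(n)

d$x0[1:5]   <- NA                   # item nonresponse at t0
d$x4[6:10]  <- NA                   # item nonresponse at t4
d$y4[11:15] <- NA
d[41:60, c("x4", "y4")] <- NA       # whole wave missing at t4

t4      <- c("x4", "y4")
skipped <- rowSums(!is.na(d[, t4])) == 0   # nobody-home rows at t4

# Impute every missing cell except the t4 cells of people who skipped t4.
wh <- is.na(d)
wh[skipped, t4] <- FALSE

# Assumption: t0 columns are predicted from t0 columns only, so the
# NA values left in the t4 columns never enter a t0 model.
pred <- mice(d, maxit = 0, printFlag = FALSE)$predictorMatrix
pred[c("x0", "y0"), t4] <- 0

imp  <- mice(d, m = 1, where = wh, predictorMatrix = pred,
             printFlag = FALSE, seed = 1)
done <- complete(imp)
```

After this, the t4 columns of the people in skipped are still NA, while all other missing cells have been filled in.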
In mice, in addition to setting the predictorMatrix rows to zero for the variables that should not be imputed, you must also set the corresponding entries of method to "" for those variables.
I'm trying to find a correlation matrix from a large dataset containing many NAs in R.
(Basically, I'm trying to do so since I need to visualize correlation matrix in heatmap.)
Since the dataset has 465 variables and each contains many NAs, I think list-wise deletion on the whole dataset (e.g. with complete.cases()) would throw away too much data.
So I'm trying to find the correlation of each pair of variables, deleting only the rows with NAs for that pair (which might give somewhat misleading results, but anyway).
Can anyone give me some hints?
What about cor(., use = "pairwise.complete.obs")?
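A small sketch of what that option does, using a made-up data frame (df and its columns are illustrative only). Each correlation is computed from the rows where both variables of the pair are observed, and crossprod(!is.na(df)) shows how many rows that is per pair:

```r
set.seed(42)
df <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
df$a[1:3] <- NA
df$b[4:6] <- NA

# Correlation matrix with pairwise deletion of NAs.
r_pair <- cor(df, use = "pairwise.complete.obs")

# Number of complete observations behind each pairwise correlation.
n_pair <- crossprod(!is.na(df))

r_pair
n_pair
```

Since every cell can rest on a different subset of rows, checking n_pair before plotting a heatmap helps spot pairs that are based on very few observations.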
I wanted to generate correlation matrices made up of the correlations between pairs of rows. I used the corrgram function to generate them. In my first attempt, the function generated a correlation matrix whose diagonal was filled with ranks.
corrgram(t(datasetA),order="GW")
a sample of the output
However, when I use it for my second dataset, the diagonal of the correlation matrix is somehow filled with varxxx strings instead of ranks.
corrgram(t(datasetB),order="GW")
The datasets contain nearly the same type of values (ints) and they are both data frames. How can I solve this?
Edit:
Here is the list of commands that generates a correlation matrix containing varxxx strings in the diagonal:
erase <- matrix(c(1,5,2,6,8,4,1,5,6),nrow=3)
corrgram(t(erase),order="HC")
output:
Because it is a huge dataset and contains sensitive data, I cannot share the dataset and show the series of operations by which I ended up with the first output above.
Renaming the columns with numbers fixed the issue:
names(dataSetB)<-c(1:totalNumberOfColumn)
I've got a huge data set with six columns (call them A, B, C, D, E, F), about 450,000 rows. I simply tried to find the correlation between columns A and B:
cor(A, B)
and I got
[1] NA
as a result. What can I do to fix this problem?
Try cor(A,B, use = "pairwise.complete.obs"). That will ignore the NAs in your observations.
To be statistically rigorous, you should also look at the number of missing entries in your data and check whether the missing-at-random assumption holds.
Edit 1: Take a look at ?cor to see other options for the use parameter.
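A tiny illustration with two made-up vectors (A and B here are not the asker's data): a single NA in either vector is enough for the default cor() to return NA, and counting the missing entries shows how many pairs are actually used.

```r
A <- c(1, 2, NA, 4, 5, 6)
B <- c(2, 1, 4, NA, 6, 5)

cor(A, B)                        # NA, because the default is use = "everything"

# How much is missing, and how many complete pairs remain?
n_missing  <- c(A = sum(is.na(A)), B = sum(is.na(B)))
n_complete <- sum(complete.cases(A, B))

# For two vectors, "complete.obs" and "pairwise.complete.obs" agree.
r <- cor(A, B, use = "pairwise.complete.obs")
```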
You might consider using the rcorr function in the Hmisc package.
It is very fast, and only includes pairwise complete observations. The returned object contains a matrix of correlation scores, a matrix with the number of observations used for each correlation value, and a matrix of p-values for each correlation.
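A short sketch of how rcorr could be called, assuming the Hmisc package is installed and using a made-up data frame (df is illustrative only). rcorr expects a numeric matrix, hence the as.matrix() call:

```r
library(Hmisc)

set.seed(7)
df <- data.frame(A = rnorm(50), B = rnorm(50), C = rnorm(50))
df$A[1:5]  <- NA
df$B[6:10] <- NA

res <- rcorr(as.matrix(df), type = "pearson")

res$r   # correlation scores
res$n   # number of observations used for each pair
res$P   # p-value for each correlation
```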
Some example code is available here: