How to impute only one or some columns with mice in R

I am experimenting with the mice package in R and am curious about how I can leave columns out of the imputation.
If I want to run a mean imputation on just one column, the mice.impute.mean(y, ry, x = NULL, ...) function seems to be what I would use. I'm struggling to understand what I need to include as the third argument to get this to work.
If I have a data set that includes categorical data such as name, ID, birth date, etc., which should not affect the calculation of other columns and should not be filled in when missing, how do I tell mice to exclude these columns from its calculation?
I've been using the mice dataset nhanes for my exploration.
Thanks

I don't know your data, so I can't create an example for you, but these are exactly the parameters of the mice() function you are looking for:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
With this parameter you can define which columns are used as predictors when imputing a specific column.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
Here you can define which cells, and therefore which columns, should receive imputations.
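As a rough sketch of how this could look on the nhanes data from the question: the code below imputes only bmi, by its mean, and leaves hyp and chl untouched. It also uses the method argument of mice(), which is not quoted above; an empty string there tells mice to skip a column. Restricting the predictors of bmi to the complete column age is my own choice to keep unimputed NAs out of the model, not something the documentation requires.

```r
library(mice)

# Start from mice's defaults without running any iterations.
ini  <- mice(nhanes, maxit = 0, printFlag = FALSE)
meth <- ini$method
pred <- ini$predictorMatrix

# Impute only bmi, by its mean; "" tells mice to leave a column alone.
meth[] <- ""
meth["bmi"] <- "mean"

# Use only the complete column age as predictor for bmi. Mean imputation
# ignores predictors, but this keeps columns with unimputed NAs out of the model.
pred[, ] <- 0
pred["bmi", "age"] <- 1

imp <- mice(nhanes, method = meth, predictorMatrix = pred,
            m = 1, maxit = 1, printFlag = FALSE)
completed <- complete(imp)

# Count the remaining NAs per column.
colSums(is.na(completed))
```

Calling mice.impute.mean() directly is not needed for this; mice() dispatches to it when the method entry is "mean".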

Related

R mice function does not apply customized predictor matrix

I want to impute a part of my data set with mice. My data set has very many variables, which is why I don't want to impute all the variables but only those which I will use in my model.
(I know that as much information as possible should be used for the imputation, but I am already using 41 variables, which according to literature should be more than enough.)
My problem: I don't want every variable to be imputed at all times,
because I have several measurement points. So of course, my variables
at t4 have many missing, but I don't want to impute them when people
just haven't filled out the questionnaire at that point.
So I specified a predictor matrix, in which all of the variables at t0 (e.g. A103.0) are imputed, but not at t4 (A103.4).
However, when running mice, it just uses "pmm" for all of the variables, and every variable is imputed.
Any suggestions on what went wrong are highly appreciated. I have spent quite some time trying to find out what happened.
This is what I've done:
I create an object with all the columns I want to impute
impute <- c("A103", "A104", "A107", #SVF
"A302.0", "A303.0", "A304.0", "A305.0", "A306.0",
"A502_01.0", "A502_02.0", "A502_03.0", "A502_04.0",
"A504.0","A506.0", "A508.0", "W003.0", "W005.0",
"A509_02.0", "A509_03.0", "A509_06.0", "A509_10.0",
"A302.4", "A303.4", "A304.4", "A305.4", "A306.4",
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4",
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4")
I create a subset of the columns (and all rows of course) which I want to impute
imp <- mice(ds_wide[ ,impute], maxit=0)
imp$predictorMatrix
pred <- imp$predictorMatrix
pred [c("A302.4", "A303.4", "A304.4", "A305.4", "A306.4", #ABB.4
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4", #PSWQ.4
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4"), ] <- 0
View(pred) #looks exactly how I want it to look like
imp <- mice(ds_wide[ ,impute], m=5, predictorMatrix = pred)
miceimp <- complete (imp)
anyNA(miceimp)
View(miceimp)
When I check miceimp (my result), there are no missing values whatsoever, so all the variables at t4 are imputed even though I specified otherwise. What did I do wrong?
Actually, what would be really best for me, would be if I could somehow impute those variables at t4 which do not only have missings. So those people, who filled out t4, should be imputed, and those, who are not at that measurement point, should not.
If anyone has any ideas how to make that possible, that would be great!
Many thanks!
I am not completely sure I understood 100% what you are trying to achieve.
As I understand it, you do not want to impute all your variables (but you want to include all your variables as input to the algorithm).
You were trying to define the parameter predictorMatrix:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
To me it sounds like this parameter is used to define which variables are used as input.
In comparison, the where parameter sounds to me like the correct parameter to specify which variables should be imputed.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
So my conclusion would be to try the where parameter instead of predictorMatrix.
In mice, in addition to setting the predictorMatrix rows to zero for the variables that should not be imputed, you must also set their entries in the method argument to the empty string ("").
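For the follow-up wish in the question (impute t4 only for people who actually took part at t4), a where matrix might do the job. The sketch below runs on made-up toy data; the column names a0, b0, a4, b4 and the rule "all t4 items missing means the wave was skipped" are assumptions for illustration, not taken from the original data set.

```r
library(mice)

set.seed(1)
n <- 40
toy <- data.frame(a0 = rnorm(n), b0 = rnorm(n),   # hypothetical t0 items
                  a4 = rnorm(n), b4 = rnorm(n))   # hypothetical t4 items
toy$a0[c(2, 5)] <- NA             # item nonresponse at t0
toy$a4[c(7, 9)] <- NA             # item nonresponse at t4
toy[31:40, c("a4", "b4")] <- NA   # these people skipped t4 entirely

t4_cols <- c("a4", "b4")
skipped <- rowSums(!is.na(toy[, t4_cols])) == 0

# Impute every missing cell except the t4 cells of people who skipped t4.
where <- is.na(toy)
where[skipped, t4_cols] <- FALSE

# Keep the t4 columns, which stay partly missing, out of the t0 models.
pred <- make.predictorMatrix(toy)
pred[c("a0", "b0"), t4_cols] <- 0

imp  <- mice(toy, where = where, predictorMatrix = pred,
             m = 1, maxit = 2, seed = 1, printFlag = FALSE)
done <- complete(imp)

# Remaining NAs per column, split by whether the person skipped t4.
colSums(is.na(done[!skipped, ]))
colSums(is.na(done[skipped, ]))
```

With real data the number of imputations m and the number of iterations would of course be larger; they are kept small here only to make the sketch quick to run.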

MICE Function Missing Dates

I am working with a price file that has a number of missing weekend values. I am using the MICE function to impute weekend prices. The mice function doesn't allow non-numeric values and errors out if the date is included. This is the reason I use [,2:33], but I need a date so I can join it back to another file. I have tried converting the date to a number, but reversing that conversion at the end of the process yields NAs. Looking for suggestions to keep the dates in the dataframe.
Snippet Example
The link above has a snippet of the data set.
Code for mice function
Imputed <- mice(Features[,2:33], m=5, maxit = 5, method = 'pmm', seed = 500)
# unpacking a large mids
df <- complete(Imputed, action = 1L, include = FALSE)
The easiest solution here would be to remove the date column before imputation and add the dates back to the data.frame afterwards.
Since mice does not change the order of the rows (or of the columns), this can be done easily.
As an alternative solution, mice can also be set to perform imputation only on certain columns, or to use only certain columns for imputation. I think if you exclude the date here, it might also no longer throw an error. The parameter is:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
But the first solution, just removing the column and adding it back afterwards, is probably easier to perform.
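A minimal sketch of the remove-and-reattach approach, assuming the date sits in the first column as in the question. The Features data frame below is synthetic (two made-up price columns instead of 32), so only the pattern carries over.

```r
library(mice)

set.seed(500)
# Synthetic stand-in for the price file: a Date column plus numeric prices.
Features <- data.frame(
  Date   = seq(as.Date("2021-01-01"), by = "day", length.out = 30),
  price1 = rnorm(30, mean = 100, sd = 5),
  price2 = rnorm(30, mean = 50,  sd = 2)
)
Features$price1[c(2, 3, 9, 10)] <- NA
Features$price2[c(16, 17, 23)]  <- NA

# Impute everything except the date column.
Imputed <- mice(Features[, -1], m = 5, maxit = 5, method = "pmm",
                seed = 500, printFlag = FALSE)

# complete() returns the rows in their original order, so the dates
# can simply be bound back on as the first column.
df <- cbind(Features["Date"], complete(Imputed, action = 1L))
str(df)
```

Because the Date column never goes through mice, it keeps its Date class and can be used directly as a join key.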

Normalizing data frame while holding a categorical column out

Very new to R.
I am trying to normalize multiple variables in a matrix, except the last column, which holds a categorical factor variable (in this case good/notgood).
Is there any way to normalize the data without affecting the categorical column? I have tried to normalize while keeping the categorical column out, but can't seem to add it back again.
minimum <- apply(mywines[,-12],2,min)
maximum <- apply(mywines[,-12],2,max)
mywinesNorm <- scale(mywines[,-12],center=minimum,scale=(maximum-minimum))
I still need the 12th column to build supervised models.
The short version is that you can simply reattach the column using cbind. However, it is just a little more complicated than that. scale returns a matrix not a data frame. In order to mix numbers and factors, you need a data.frame, not a matrix. So before the cbind, you will want to convert the scaled matrix back to a data.frame.
mywinesNorm = cbind(as.data.frame(mywinesNorm), mywines[ ,12])
A different approach would be to just change the data in place, scaling every column except the 12th:
mywines[ ,-12] = scale(mywines[ ,-12], center=minimum, scale=(maximum-minimum))
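Here is a small self-contained sketch of the in-place min-max approach. The wines data frame and its column names are invented for the example, and the numeric columns are picked by type rather than by position, which is my own variation.

```r
# Invented example data: two numeric columns and one factor column.
wines <- data.frame(
  acidity = c(7.4, 7.8, 11.2, 6.9),
  alcohol = c(9.4, 9.8, 10.5, 12.1),
  quality = factor(c("good", "notgood", "good", "notgood"))
)

# Select the numeric columns by type, so the factor is never touched.
num  <- sapply(wines, is.numeric)
mins <- sapply(wines[num], min)
maxs <- sapply(wines[num], max)

# scale() returns a matrix; assigning it into the selected columns keeps
# the result a data.frame and leaves the factor column in place.
wines_norm <- wines
wines_norm[num] <- scale(wines[num], center = mins, scale = maxs - mins)

str(wines_norm)
```

Each numeric column now runs from 0 to 1, and quality is still available as a factor for supervised models.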

Remove correlated attributes in R

I am trying to remove the correlated attributes which are out of boundaries (-1,1). I am using the following code for the correlation:
cor(df[sapply(df, is.numeric)])
After that I get the correlation values. How can I remove the values greater than 1 and smaller than -1?
Thank you
The cor() function in R takes a numeric vector, matrix or data frame and returns the pairwise correlation matrix of the variables. Every value in a correlation matrix lies in the range -1 to +1, so there is nothing outside those bounds to remove. Problems can still arise when the correlation matrix is not positive semi-definite, and the most frequent cause of such an invalid matrix is missing values. Note that cor() has no na.rm argument; missing values are handled through the use argument. With the default, use = "everything", any pair of variables involving a missing value yields NA. The use = "all.obs" option assumes there are no missing observations, and the presence of any missing value causes an error. If use = "complete.obs" is specified, casewise deletion of missing observations happens (only complete rows are used), which results in a valid correlation matrix.
If use = "pairwise.complete.obs" is specified, only the complete pairs of observations are used for each pair of variables. This may result in an invalid correlation matrix.
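To make the use options concrete, here is a short sketch on a made-up data frame with a few missing values (the column names are arbitrary):

```r
# Made-up data with missing values and one non-numeric column.
df <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 1, 4, 3, 5),
  c = c(5, NA, 3, 2, 1),
  g = letters[1:5]
)
num <- df[sapply(df, is.numeric)]

# Default (use = "everything"): pairs touching a missing value give NA.
r_default <- cor(num)

# Casewise deletion: only rows complete in every column are used.
r_complete <- cor(num, use = "complete.obs")

# Pairwise deletion: each pair uses all rows complete for that pair.
r_pairwise <- cor(num, use = "pairwise.complete.obs")

print(r_default)
print(r_complete)
print(r_pairwise)
```

In all three results, every entry that is not NA lies between -1 and 1.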

How to adapt wilcox.test to my data in R?

I am new to R and trying to use wilcox.test on my data : I have a dataframe 36021X246 with rownames as probeIDs and the last row is a label which indicates which group the samples belong to - "control" for the first 140 and "treated" for the last 106.
I would greatly appreciate knowing how to define the two groups when I perform the test. I am unable to find much information on the "formula" argument online except that:
"formula
a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups."
If someone could explain what lhs~rhs means and how to define this formula I would really appreciate it.
Thanks!
R typically assumes that each row is a case and the columns are associated variables. If the cases from both your samples occur in the same data frame, one column would be an indicator variable for sample membership. Let's call it IndSample. The Wilcoxon test is univariate, so you would have another column containing the response values you are testing on. Let's call it Y. You then write
wilcox.test(Y ~ IndSample, data=MyData, .....)
and the rest of your parameters for the test: is it two-sided? Do you want an exact statistic? (Probably not, in your case.)
It looks to me as if your data is on its side. That's problematic with a data frame, since you can't just pull out a row from a data frame, the way you would with a matrix.
You need to grab the last row and turn it into a factor - something like
IndSample <- factor(unlist(MyData[lastrow,]))
Then pull out the row that contains your response:
Y <- as.numeric(as.character(unlist(MyData[ResponseRow,])))
Then do the Wilcoxon test.
However, I am not sure that I have properly understood your situation. That seems to be a very large data matrix for a modest wilcoxon test.
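If the goal is one test per probe, something along these lines might work. This is only a sketch on random numbers: the matrix expr, the probe and sample names, and the group sizes (7 and 5 instead of 140 and 106) are all made up, and the labels are kept in a separate factor rather than as a last row.

```r
set.seed(42)

# Made-up expression matrix: probes in rows, samples in columns.
expr <- matrix(rnorm(3 * 12), nrow = 3,
               dimnames = list(paste0("probe", 1:3), paste0("s", 1:12)))

# Group labels in sample order, kept apart from the numeric data.
group <- factor(rep(c("control", "treated"), times = c(7, 5)))

# lhs ~ rhs: the numeric values of one probe (lhs) split by group (rhs).
pvals <- apply(expr, 1, function(y) wilcox.test(y ~ group)$p.value)
pvals
```

With tens of thousands of probes, the resulting p-values would normally be adjusted for multiple testing, for example with p.adjust().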
