I want to impute a part of my data set with mice. My data set has very many variables, which is why I don't want to impute all the variables but only those which I will use in my model.
(I know that as much information as possible should be used for the imputation, but I am already using 41 variables, which according to literature should be more than enough.)
My problem: I don't want every variable to be imputed at every time point, because I have several measurement points. My variables at t4 naturally have many missing values, but I don't want to impute them when people simply did not fill out the questionnaire at that point.
So I specified a predictor matrix, in which all of the variables at t0 (e.g. A103.0) are imputed, but not at t4 (A103.4).
However, when running mice, it just uses "pmm" for all of the variables, and every variable is imputed.
Any suggestions on what went wrong are highly appreciated; I have spent quite some time trying to find out what happened.
This is what I've done:
I create an object with all the columns I want to impute
impute <- c("A103", "A104", "A107", #SVF
"A302.0", "A303.0", "A304.0", "A305.0", "A306.0",
"A502_01.0", "A502_02.0", "A502_03.0", "A502_04.0",
"A504.0","A506.0", "A508.0", "W003.0", "W005.0",
"A509_02.0", "A509_03.0", "A509_06.0", "A509_10.0",
"A302.4", "A303.4", "A304.4", "A305.4", "A306.4",
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4",
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4")
I create a subset of the columns (and all rows of course) which I want to impute
imp <- mice(ds_wide[ ,impute], maxit=0)
imp$predictorMatrix
pred <- imp$predictorMatrix
pred [c("A302.4", "A303.4", "A304.4", "A305.4", "A306.4", #ABB.4
"A502_01.4", "A502_02.4", "A502_03.4", "A502_04.4", #PSWQ.4
"A504.4", "A506.4", "A508.4","W003.4", "W005.4", "SD02_01",
"SD03",
"A509_02.4", "A509_03.4", "A509_06.4", "A509_10.4"), ] <- 0
View(pred) #looks exactly how I want it to look like
imp <- mice(ds_wide[ ,impute], m=5, predictorMatrix = pred)
miceimp <- complete (imp)
anyNA(miceimp)
View(miceimp)
When I check miceimp (my result), there are no missing values whatsoever, so all the variables at t4 are imputed even though I specified otherwise. What did I do wrong?
Actually, what would be best for me is to impute the t4 variables only for people who have at least some observed values at t4. So those people who filled out t4 should be imputed, and those who did not take part at that measurement point should not.
If anyone has any ideas how to make that possible, that would be great!
Many thanks!
I am not completely sure I understood 100% what you are trying to achieve.
I understood that you do not want to impute all your variables (but you want to include all your variables as input to the algorithm).
You were trying to do this with the predictorMatrix parameter:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
To me it sounds like this parameter defines which variables are used as input (predictors), not which variables get imputed.
In comparison, the where parameter sounds to me like the correct one for specifying which values should be imputed.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
So my conclusion would be to try out the where parameter instead of predictorMatrix.
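A minimal sketch of that idea, on made-up data rather than the asker's ds_wide: the column names (x.0, y.0, x.4, y.4) and the dropout pattern are hypothetical. It also zeroes the t4 columns in the predictor matrix for the t0 targets, on the assumption that unimputed NAs in a predictor would otherwise carry over into the imputations of other variables.

```r
library(mice)

set.seed(1)
n <- 60
# Hypothetical wide data: two items measured at t0 and again at t4
d <- data.frame(x.0 = rnorm(n), y.0 = rnorm(n))
d$x.4 <- d$x.0 + rnorm(n)
d$y.4 <- d$y.0 + rnorm(n)

t4 <- c("x.4", "y.4")
dropout <- 41:60              # these people skipped t4 entirely
d[dropout, t4] <- NA
d$x.0[c(2, 9)] <- NA          # ordinary item nonresponse
d$x.4[c(3, 7)] <- NA
d$y.4[c(5, 12)] <- NA

# Impute every missing cell except the t4 cells of people who skipped t4
where <- is.na(d)
skipped <- rowSums(!is.na(d[, t4])) == 0
where[skipped, t4] <- FALSE

# Assumption: keep the partly unimputed t4 variables out of the t0 models
pred <- make.predictorMatrix(d)
pred[c("x.0", "y.0"), t4] <- 0

imp <- mice(d, m = 2, maxit = 3, where = where,
            predictorMatrix = pred, printFlag = FALSE, seed = 123)
completed <- complete(imp)

# t4 stays missing for the dropouts; everything else is filled in
colSums(is.na(completed[dropout, ]))
colSums(is.na(completed[-dropout, ]))
```

With real data, the skipped vector would be built from whichever t4 columns are in the imputation set.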
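In mice, in addition to setting the predictorMatrix entries to zero for the variables that should not be imputed, you must also set the corresponding entries of the method argument to the empty string ("") for those variables. Otherwise mice falls back to its default method (pmm for numeric data) and imputes them anyway.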
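As a rough illustration, here is how that could look on the nhanes data that ships with mice (not the asker's data). Treating chl as the variable to leave alone is an arbitrary choice for the example; zeroing its column as well, so that it is not used as a predictor, is my assumption about how to keep its NAs from leaking into the other imputations.

```r
library(mice)

# Dry run to get the default method vector and predictor matrix
ini <- mice(nhanes, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix

skip <- "chl"        # variable that should stay unimputed (arbitrary choice)
meth[skip] <- ""     # empty method: do not impute this variable
pred[, skip] <- 0    # assumption: also do not use it as a predictor

imp <- mice(nhanes, m = 2, maxit = 3, method = meth,
            predictorMatrix = pred, printFlag = FALSE, seed = 1)
completed <- complete(imp)

colSums(is.na(completed))   # chl keeps its NAs, the other columns do not
```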
I have a data set with 4000+ observations of 130 variables, and about half of those variables have missingness. I'm trying to use this code, which creates five imputed data sets:
mice(data_frame, method = c(rep("pmm", 130)), m = 5, maxit = 5)
However, this code only imputes on my numeric variables and does nothing to my categorical variables; it just ignores the categorical variables.
I thought maybe the problem was that I was using Predictive Mean Matching for everything and that it was meant for numeric data only, but I tested using this on a smaller data set and it seems to be able to impute categorical data just as well as numeric data when using PMM. So I'm stumped about why the code is just ignoring my categorical variables. I tried "logreg" and "polyreg" on a few of those variables, but they still just get ignored.
Any ideas?
I figured out my problem: my variables were character class rather than factor class, and the MICE algorithm ignores character classes. Once I converted all of the variables into factors, the code above worked fine. I used sapply on the variables that needed to become factors to make things easier on myself.
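For anyone with the same issue, a small sketch of that conversion in base R. The data frame and its column names are made up for illustration; lapply is used here instead of sapply so that each column stays a factor.

```r
# Hypothetical data with character columns that contain missing values
df <- data.frame(
  id    = 1:4,
  group = c("a", NA, "b", "a"),
  sex   = c("f", "m", NA, "f"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
```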
I am experimenting with the mice package in R and am curious about how I can leave columns out of the imputation.
If I want to run a mean imputation on just one column, the mice.impute.mean(y, ry, x = NULL, ...) function seems to be what I would use. I'm struggling to understand what I need to include as the third argument to get this to work.
If I have a data set that includes categorical data such as name, ID, birth date, etc., which should not affect the calculation of other columns and should not be filled in when missing, how do I tell mice to exclude these columns from its calculation?
I've been using the mice dataset nhanes for my exploration.
Thanks
I don't know your data, so I can't create an example for you, but you are looking for exactly these parameters of the mice() function:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
With this parameter you can define which columns are used as predictors when imputing a specific column.
where
A data frame or matrix with logicals of the same dimensions as data indicating where in the data the imputations should be created. The default, where = is.na(data), specifies that the missing data should be imputed. The where argument may be used to overimpute observed data, or to skip imputations for selected missing values.
Here you can define for which cells, and therefore for which columns, imputations are created.
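On the mean-imputation part of the question: as far as I can tell from the signature quoted above, x defaults to NULL and is not needed for a plain mean, so the function can be called with just y and ry. A small sketch on nhanes, where picking bmi is an arbitrary choice:

```r
library(mice)

y  <- nhanes$bmi
ry <- !is.na(y)   # TRUE where bmi is observed

# One imputed value per missing entry, each equal to the observed mean
y[!ry] <- mice.impute.mean(y, ry)

bmi_mean_imputed <- y
```

Note that a single mean fill like this is not multiple imputation; it only shows what the helper returns.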
I am working with a price file that has a number of missing weekend values. I am using the MICE function to impute weekend prices. The mice function doesn't allow non-numeric values and errors out if the date is included. This is the reason I use [,2:33], but I need a date so I can join it back to another file. I have tried converting the date to a number, but reversing that conversion at the end of the process yields NAs. Looking for suggestions to keep the dates in the dataframe.
Snippet Example
The link above has a snippet of the data set.
Code for mice function
Imputed <- mice(Features[,2:33], m=5, maxit = 5, method = 'pmm', seed = 500)
unpacking a large mids
df <- complete(Imputed, action = 1L, include = FALSE)
The easiest solution here would be to remove the date column before imputation and add the dates back to the data.frame afterwards.
Since mice does not change the ordering of rows or columns, this can easily be done.
As an alternative, mice can also be set to impute only certain columns or to use only certain columns as predictors. I think if you exclude the date that way, it might no longer throw an error. The parameter is:
predictorMatrix
A numeric matrix of length(blocks) rows and ncol(data) columns, containing 0/1 data specifying the set of predictors to be used for each target column. Each row corresponds to a variable block, i.e., a set of variables to be imputed. A value of 1 means that the column variable is used as a predictor for the target block (in the rows). By default, the predictorMatrix is a square matrix of ncol(data) rows and columns with all 1's, except for the diagonal. Note: For two-level imputation models (which have "2l" in their names) other codes (e.g, 2 or -2) are also allowed.
But the first solution, removing the column and adding it back afterwards, is probably easier to carry out.
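A sketch of the first solution. The Features data frame below is a stand-in built from nhanes plus a made-up Date column, since the real price file is not available; the smaller m and maxit are only there to keep the example quick.

```r
library(mice)

# Hypothetical stand-in for the price file: a date column plus numeric columns
Features <- data.frame(Date = as.Date("2020-01-01") + 0:24, nhanes)

# Set the dates aside, impute the remaining columns, then put the dates back
dates   <- Features$Date
Imputed <- mice(Features[, -1], m = 2, maxit = 2, method = "pmm",
                seed = 500, printFlag = FALSE)
df <- cbind(Date = dates, complete(Imputed, action = 1L))

head(df)
```

Because the row order is unchanged, df can then be joined to the other file on Date as usual.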
So this is a silly question and honestly I do not understand why I can't seem to figure it out.
I'm using the package Amelia in R to do a multiple imputation in my dataset. I figured out how to include nominal variables but I do not see how to include information about positive numeric variables. For instance, variables like age or symptoms_days should be positive and some outputs present negative values for these variables.
Anyone knows how to pass this information to Amelia?
Here is my code:
amelia <- amelia(data1, m=70, noms=c("Vac", "Radio", "Sit", "Sex"))
Sorry if the answer was right in front of my eyes but I missed it. I have read the vignette and looked for an answer on the Internet but wasn't able to figure it out.
Thank you!
It seems that you need to use the bounds argument.
From the documentation:
bounds a three column matrix to hold logical bounds on the imputations. Each row of the matrix should be of the form c(column.number, lower.bound,upper.bound) See Details below.
and the details below reads:
In addition to priors, Amelia allows for logical bounds on variables. The bounds argument should be a matrix with 3 columns, with each row referring to a logical bound on a variable. The first column should be the column number of the variable to be bounded, the second column should be the lower bounds for that variable, and the third column should be the upper bound for that variable. As Amelia enacts these bounds by resampling, particularly poor bounds will end up resampling forever. Amelia will stop resampling after max.resample attempts and simply set the imputation to the relevant bound.
So, suppose Vac is the 3rd column and needs to be positive, and Radio is the 4th column and needs to be bounded between -10 and 10. You would then need to write something like:
amelia <- amelia(data1, m=70, noms=c("Vac", "Radio", "Sit", "Sex"),
bounds = rbind(c(3, 0, Inf), c(4, -10, 10)))
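For the variables named in the question (age and symptoms_days), the column numbers can be looked up by name rather than hard-coded. This is only a sketch: data1 below is a tiny made-up stand-in, and the amelia call is left as a comment because I have not run it against the real data.

```r
# Hypothetical stand-in for data1; only the column names matter here
data1 <- data.frame(
  Sex           = c(0, 1, 1),
  age           = c(34, NA, 51),
  Vac           = c(1, 0, NA),
  symptoms_days = c(3, 7, NA)
)

# One row per bounded variable: column number, lower bound, upper bound
positive <- c("age", "symptoms_days")
bds <- cbind(match(positive, names(data1)), 0, Inf)
bds

# Then, with the Amelia package loaded, something like:
# fit <- amelia(data1, m = 5, noms = c("Vac", "Sex"), bounds = bds)
```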
I am new to R and trying to use wilcox.test on my data : I have a dataframe 36021X246 with rownames as probeIDs and the last row is a label which indicates which group the samples belong to - "control" for the first 140 and "treated" for the last 106.
I would greatly appreciate knowing how to define the two groups when I perform the test....I am unable to find much information on the "formula" argument online except that -
"formula
a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups."
If someone could explain what lhs~rhs means and how to define this formula I would really appreciate it.
Thanks!
R typically assumes that each row is a case and the columns are associated variables. If the cases from both your samples occur in the same data frame, one column would be an indicator variable for sample membership. Let's call it IndSample. The Wilcoxon is a univariate test, so you would have another column containing the response values you are testing on. Let's call it Y. In the formula, the left-hand side (lhs) is the response and the right-hand side (rhs) is the grouping factor, so you then write
wilcox.test(Y ~ IndSample, data=MyData, .....)
and the rest of your parameters for the test: is it two-sided? Do you want an exact statistic? (Probably not, in your case.)
It looks to me as if your data is on its side. That's problematic with a data frame, since you can't just pull out a row from a data frame, the way you would with a matrix.
You need to grab the last row and turn it into a factor, something like
IndSample <- factor(unlist(MyData[lastrow, ]))
Then pull out the row that contains your response:
Y <- as.numeric(unlist(MyData[ResponseRow, ]))
Then run the Wilcoxon test.
However, I am not sure that I have properly understood your situation. That seems to be a very large data matrix for a modest wilcoxon test.
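To make the formula interface concrete, here is a small sketch on simulated data. The matrix expr, the probe names and the 6/4 group split are all made up; the real data would have 140 control and 106 treated samples. It assumes the probes are rows and the samples are columns, as described in the question.

```r
set.seed(42)

# Hypothetical stand-in: 3 probes (rows) x 10 samples (columns)
expr <- matrix(rnorm(30), nrow = 3,
               dimnames = list(paste0("probe", 1:3), paste0("s", 1:10)))
group <- factor(rep(c("control", "treated"), times = c(6, 4)))

# One probe at a time: samples become rows, the group factor is the rhs
one_probe <- data.frame(Y = expr["probe1", ], IndSample = group)
res <- wilcox.test(Y ~ IndSample, data = one_probe)
res$p.value

# All probes: run the same test on every row of the matrix
pvals <- apply(expr, 1, function(y) wilcox.test(y ~ group)$p.value)
pvals
```

With tens of thousands of probes, the resulting p-values would normally need a multiple-testing correction, for example with p.adjust.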