How to perform multiple imputation in mice with restrictions - r

I am using the mice package to impute data, and I have read about post-processing to restrict imputed values. One of the variables I am imputing is a categorical variable with 10 levels (a, b, c, d, e, f, g, h, i, j). The missing values can take any level except a and d. I need to make it so people with category a or d have values of NA after the imputation, because at the moment values are imputed from all the available levels, and that is wrong.
I have also tried creating an additional binary variable (coded 0 and 1) to make this work, but the values were still imputed incorrectly.
Any ideas about post processing this in mice in R?

Related

VIM package detects values of zero in dataset as missing data

I'm working with a dataset that is comparing the abundance of certain species against environmental variables in various sampling sites.
For some of the sites, environmental variables could not be measured in the field. As a result, these values are written as "NA" in my dataset.
However, for the variables relating to species abundance, there are some values which are zero, simply because at that particular site, one or more species were simply not observed.
I'm using the mice package to deal with these NA values using imputation methods. However, I also want to use the VIM package with the functions "md.pattern" and "aggr" to assess the proportion of missing values. The issue is that when using these functions, R is not only considering the NA values as missing data, but also the zero values as missing data. How can I make it so R only detects NA as missing data and not the values which are zero?
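A quick way to narrow this down is to count missing values and zeros separately in base R. The sketch below uses made-up data; the column names `abundance` and `temperature` are hypothetical. `is.na()` never flags a zero, so if zeros are reported as missing, the likely cause is an earlier step that turned them into NA (for example a recode, or reading the file with "0" listed as a missing-value string). Note also that `md.pattern` comes from mice, while `aggr` comes from VIM.

```r
# Hypothetical site data: zeros are real observations, NA marks unmeasured values
sites <- data.frame(
  abundance   = c(0, 3, 0, 7, 2),
  temperature = c(12.1, NA, 14.3, NA, 13.0)
)

# is.na() flags only NA, never 0
na_counts <- colSums(is.na(sites))

# Count zeros separately, skipping NA cells in the comparison
zero_counts <- colSums(sites == 0, na.rm = TRUE)

print(na_counts)
print(zero_counts)
```

If `na_counts` on the real data already includes the species columns, the zeros were converted to NA before the missingness functions were called.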

Impute different types of variables with MICE

I am trying to perform imputation on a dataset which has 69 columns and over 50000 rows. My dataset has different types of variables:
1) columns that only contain binary variables (0,1)
2) categorical columns
3) columns that take continuous numerical data
Now, I want to perform imputation and I know that my columns have a high level of multicollinearity.
Do I have to split my dataset into 3 different subsets (one for each of the column types 1), 2), 3) above) or should I perform imputation on the whole dataset?
The problem is that the mice package has different methods for each of these types. And if I run it three separate times, do I have to take the whole dataset into consideration or only that specific part?
You can input your whole dataset at once to mice.
(you can actually specify which method to use for each variable separately)
I am citing from the mice reference:
Parameter 'method'
Can be either a single string, or a vector of strings with length length(blocks), specifying the imputation method to be used for each column in data. If specified as a single string, the same method will be used for all blocks. The default imputation method (when no argument is specified) depends on the measurement level of the target column, as regulated by the defaultMethod argument. Columns that need not be imputed have the empty method "". See details.
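As an illustration of the per-column `method` vector, here is a minimal sketch on simulated data. The data frame and its column names (`flag`, `group`, `score`) are made up for the example, and the choice of `logreg`, `polyreg` and `pmm` is an assumption about what suits binary, unordered categorical and continuous columns; it is not the only reasonable choice.

```r
library(mice)

# Simulated data with one column of each type (hypothetical names)
set.seed(1)
n <- 200
df <- data.frame(
  flag  = factor(rbinom(n, 1, 0.4)),                           # binary
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)), # categorical
  score = rnorm(n)                                             # continuous
)
df$flag[sample(n, 20)]  <- NA
df$group[sample(n, 20)] <- NA
df$score[sample(n, 20)] <- NA

# Start from the default method vector, then set one method per column
meth <- make.method(df)
meth["flag"]  <- "logreg"   # logistic regression for the 2-level factor
meth["group"] <- "polyreg"  # polytomous regression for the unordered factor
meth["score"] <- "pmm"      # predictive mean matching for the numeric column

# Impute the whole data set in one call
imp <- mice(df, m = 3, method = meth, printFlag = FALSE, seed = 123)
completed <- complete(imp, 1)  # first completed data set
```

Binary and categorical columns need to be stored as factors for `logreg` and `polyreg` to apply; a 0/1 column left as numeric would be treated as continuous by default.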

How to perform a two-way repeated measures ANOVA with missing values

For my data set, I need to perform some sort of two factor repeated measures ANOVA. I have one between-subject factor called "Treatment" and one within-subject factor called "Frequency" with 8 levels. My problem is that most of my subjects don't have responses, called "Threshold", for all 8 of the levels of frequency (missing values). In addition, my two treatments are also unbalanced (about 23 for the first treatment type and about 21 for the other).
What R code do you suggest I try? And what would that code look like? I've been looking at the aov and Anova (car package) functions. I also need to figure out the formula for my model. I was thinking something like
aov(Threshold ~ (Treatment*Frequency) + Error(Subject/(Treatment*Frequency)))
but I keep getting error messages like "In aov (......) Error() model is singular."
My question here is if you only include within-subject factors in the error term, so Error(Subject/Frequency) or just Error(Subject), or if I had it right in including everything? Also, should I include rows of responses for every frequency per bird, even if I don't have that specific value? Should I put NA's in those missing value cells, or delete the entire row of data for that level?
Any help would be greatly appreciated! I'm new to more advanced statistics and modeling, so keep that in mind! And if I need to clarify or add anything, just let me know. Thanks!
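One commonly suggested alternative when subjects are missing some within-subject levels is a linear mixed model rather than a classical repeated measures ANOVA. The sketch below swaps in `lme` from the nlme package on simulated long-format data; the data, the treatment labels `T1`/`T2`, and the random-intercept-only structure are assumptions made for illustration, not a recommendation for the real data set.

```r
library(nlme)

# Simulated long-format data: one row per subject and frequency (hypothetical)
set.seed(42)
n_subj <- 44
dat <- expand.grid(Subject = factor(seq_len(n_subj)), Frequency = factor(1:8))
dat$Treatment <- factor(ifelse(as.integer(dat$Subject) <= 23, "T1", "T2"))
subject_effect <- rnorm(n_subj, sd = 2)
dat$Threshold <- 30 + subject_effect[as.integer(dat$Subject)] +
  as.integer(dat$Frequency) + rnorm(nrow(dat))

# Knock out some responses to mimic missing thresholds
dat$Threshold[sample(nrow(dat), 60)] <- NA

# Random intercept per subject; rows with NA are dropped, subjects are kept
fit <- lme(Threshold ~ Treatment * Frequency,
           random = ~ 1 | Subject,
           data = dat, na.action = na.omit)

tab <- anova(fit)  # tests for Treatment, Frequency and their interaction
print(tab)
```

In this layout it does not matter whether the missing cells are present as NA rows or deleted beforehand, because `na.action = na.omit` drops only the incomplete rows and keeps the remaining observations of each subject.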

Imputation using related data (R)

I have a panel dataset with population data. I am working mostly with two vectors: population and households. The household vector (there are 3 countries) has a substantial number of missing values; the population vector is complete. I use a model with population as the independent variable to get the missing values of households. What function should I use to extract these values? I do not need to make any forecasts, just to impute the missing data.
Thank you.
EDIT:
This is a printscreen of my dataset:
https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
As you can see, many values of datatype = "original" data are missing and I need to impute them somehow. I have created several panel data models (pooled, within, between) and, without further considerations, tried to extract the missing data with each of them; however, I do not know how to do this.
EDIT 2: What I need is not how to determine which model to use but how to get the missing values from the model (so making the dataset more balanced).
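The general pattern of fitting a model on the observed rows and filling the gaps with its predictions can be sketched with `predict()`. This example uses a plain `lm` with a country effect as a stand-in for the panel models mentioned above, on simulated data; the column names and the `households_filled` column are hypothetical.

```r
# Simulated panel: 3 countries, 10 years each (hypothetical values)
set.seed(3)
panel <- data.frame(
  country    = rep(c("A", "B", "C"), each = 10),
  year       = rep(2001:2010, times = 3),
  population = round(runif(30, 1e6, 5e6))
)
panel$households <- round(panel$population / 2.5 + rnorm(30, sd = 2e4))
panel$households[sample(30, 8)] <- NA

# lm() drops rows with a missing response by default, so it fits on observed rows
fit <- lm(households ~ population + country, data = panel)

# Predict only for the rows where households is missing
missing_rows <- is.na(panel$households)
panel$households_filled <- panel$households
panel$households_filled[missing_rows] <-
  predict(fit, newdata = panel[missing_rows, ])
```

Keeping the filled values in a separate column leaves the original data untouched. Values filled this way are point predictions, so analyses that treat them as observed will understate uncertainty.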

R: Time Series Regression with NA and multiple dependent variables

I would like to run a time series regression with a list of dependent variables as the column. I would like to regress each column on a set of independent variables. I know you can just use
lm(dataframe~independent variables)
because if the dependent variable is a matrix, then they will just go through each column.
However, my dependent variables are information about stocks through time, and sometimes information is not available for every single stock at every time point, so I have some NA values. The problem I am having is that if I use lm, I have to omit the NA values, i.e. the lm function removes the whole row when running the regression. This is fine if I only want to run a regression on one dependent variable, but I have a list (1000+) of dependent variables on which I would like to run my regression. Because my dataset spans 15+ years, there are missing values at every single time point, so when I run my lm regression I get an error, because the lm function has removed every single row. The only way I can think of to solve this problem is to run a for loop with a separate regression for each stock, which I think will take a very long time to compute. For example, the following is an example of my data:
           135081(P)  135084(P)    135090(P)
1994-12-30 NA         NA           NA
1995-01-02 NA         NA           NA
1995-01-03 06864935   NA           NA
1995-01-04 NA         NA           -0.05474644
1995-01-05 NA         NA           0.20894900
1995-01-06 NA         -0.45672832  -0.02378632
so if I run a time series regression on this, I would get an error because the lm function would skip every single row.
So my question is, would there be another way to run a time series regression across a data frame with different DEPENDENT variables where the regression "skips" the NA for just the one particular dependent variable instead of skipping it for every other dependent variable as well?
I don't think using na.omit is correct because it removes the time series properties of my dataset and using na.action=NULL doesn't work because I have NA in my dataset.
Thank you a lot for your help.
You might want to employ a multiple imputation method using something like the Amelia package (Amelia II) on CRAN, in order to properly account for the increased uncertainty in your estimates due to missingness, and also to help minimize the biases that result from case-wise deletion. See for example:
Honaker, J. and King, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2):561–581.
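If imputation is not wanted, the column-by-column regression described in the question can be written without an explicit for loop. The sketch below is one way to do it under stated assumptions: the data are simulated, the regressor names `mkt` and `size` and the stock names are hypothetical, and each stock is fitted on its own non-missing rows only.

```r
# Simulated regressors and three dependent series (hypothetical names)
set.seed(7)
n <- 60
x <- data.frame(mkt = rnorm(n), size = rnorm(n))
Y <- sapply(1:3, function(j) 0.5 * x$mkt - 0.2 * x$size + rnorm(n))
colnames(Y) <- c("stock_a", "stock_b", "stock_c")

# Each stock is missing at different dates
Y[1:10, 1]  <- NA
Y[20:35, 2] <- NA
Y[50:60, 3] <- NA

# One regression per column; NA rows are dropped for that column only.
# na.exclude keeps residuals() and fitted() aligned with the original dates.
fits <- lapply(colnames(Y), function(nm) {
  lm(y ~ mkt + size, data = data.frame(y = Y[, nm], x), na.action = na.exclude)
})
names(fits) <- colnames(Y)

coefs  <- t(sapply(fits, coef))  # one row of coefficients per stock
n_used <- sapply(fits, nobs)     # rows actually used per stock
```

With `na.exclude`, `residuals(fits$stock_a)` has one entry per original date with NA where the stock was missing, which keeps the time index intact.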
