Behaviour of MICE package in R

I am using the MICE package to impute missing values - the input data are up to six parallel hourly temperature measurements for Scottish weather stations across a calendar year. None of the vectors has more than 5% NAs, as I have filtered out the ones with more. Most of the sets work fine with MICE, but with a few I get an error message:
This data set, which generated the error message, has five columns:
 iter imp variable
  1   1  986
Error in terms.formula(tmp, simplify = TRUE) :
  invalid term in model formula
986 is the station number, which is the column name of the first column here. The third to fifth columns don't have any NAs, and the first and second have fewer than 1% - but their NAs are concentrated in runs of 20 or so near the beginning of the data set. I am wondering whether MICE has a problem with too high a concentration of NAs in particular regions, but I can't find any reference to this in the literature. Has anyone else come across this problem, and if so, what did they do about it? Thanks, Nick Wray
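Two quick checks may help narrow this down (a sketch only - the data frame name temps is an assumption, not from the post). md.pattern() shows where the NAs cluster, and since mice builds model formulas from the column names, a purely numeric name such as "986" is not a syntactically valid R name and could itself be what trips terms.formula():
library(mice)
# 'temps' is a hypothetical name for the five-column station data
md.pattern(temps)                        # inspect how the NAs cluster
# Make the column names syntactically valid ("986" becomes "X986")
names(temps) <- make.names(names(temps))
imp <- mice(temps, m = 5, printFlag = FALSE)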

Related

How do I keep my missing values to stay the same after I do mice imputation and save my results?

As a new R user I'm having trouble understanding why the NA values in my dataframe keep changing. I'm running my code on Kaggle; maybe that's where my problem is arising from?
[Screenshot: the original dataframe, titled "abc"]
There are multiple columns that have NA values, so I decided to try multiple imputation to handle them.
So I created a new dataframe with just the columns that had NA values and began imputation.
[Screenshot: the new dataframe, titled "abc1"]
library(dplyr)
library(mice)

# Keep only the columns that contain NA values
abc1 <- select(abc, c(9, 10, 15, 16, 17, 18, 19, 25, 26))
# mice imputation: m = 5 imputed datasets, predictive mean matching
input_data <- abc1
my_imp <- mice(input_data, m = 5, method = "pmm", maxit = 20)
summary(input_data$m_0_9)
my_imp$imp$m_0_9   # the five candidate imputations for column m_0_9
When the imputation runs it creates 5 columns of candidate values to fill in the NA values of column m_0_9, and I choose which column to use.
[Screenshot: imputation of column 'm_0_9']
Then I run this code:
final_clean_abc1 <- complete(my_imp, 5)
This assigns the values from the fifth imputation to the NA values in my "abc1" dataframe and saves the result as "final_clean_abc1".
Lastly I replace the columns from the original "abc" dataframe that had missing values with the new columns in "final_clean_abc1."
I know this probably isn't the cleanest:
# Copy the imputed columns back over the originals in "abc"
abc$m_0_9 <- final_clean_abc1$m_0_9
abc$m_10_12 <- final_clean_abc1$m_10_12
abc$f_0_9 <- final_clean_abc1$f_0_9
abc$f_10_12 <- final_clean_abc1$f_10_12
abc$f_13_14 <- final_clean_abc1$f_13_14
abc$f_15 <- final_clean_abc1$f_15
abc$f_16 <- final_clean_abc1$f_16
abc$asian_pacific_islander <- final_clean_abc1$asian_pacific_islander
abc$american_indian <- final_clean_abc1$american_indian
Now that I have a dataframe "abc" with no missing values, this is where my problem arises. I should be seeing '162' in row 10 of the m_0_9 column, but when I save my code and view it on Kaggle I get the value '7' for that specific row and column, as shown in the photo below.
"abc" dataframe with no NA values
Hopefully this makes sense - I tried to be as specific as I could be.
There are multiple stochastic processes going on in mice: it imputes several candidate values for each missing value, and the results are then pooled. You should not expect the same result each time you run mice.
From the MICE documentation:

In the first step, the dataset with missing values (i.e. the incomplete dataset) is copied several times. Then in the next step, the missing values are replaced with imputed values in each copy of the dataset. In each copy, slightly different values are imputed due to random variation. This results in multiple imputed datasets. In the third step, the imputed datasets are each analyzed and the study results are then pooled into the final study result. In this Chapter, the first phase in multiple imputation, the imputation step, is the main topic. In the next Chapter, the analysis and pooling phases are discussed.
https://bookdown.org/mwheymans/bookmi/multiple-imputation.html
We have a wonderful series of vignettes that detail the use of mice. Part of this series covers the stochastic nature of the algorithm and how to fix it. Setting mice(yourdata, seed = 123) would generate the same set of imputations every time.
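A minimal reproducibility check (yourdata stands in for any incomplete data frame):
library(mice)
# With a fixed seed, two separate runs yield identical imputations
imp1 <- mice(yourdata, m = 5, seed = 123, printFlag = FALSE)
imp2 <- mice(yourdata, m = 5, seed = 123, printFlag = FALSE)
identical(complete(imp1, 1), complete(imp2, 1))   # TRUE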

VIM package detects values of zero in dataset as missing data

I'm working with a dataset that is comparing the abundance of certain species against environmental variables in various sampling sites.
For some of the sites, environmental variables could not be measured in the field. As a result, these values are written as "NA" in my dataset.
However, for the variables relating to species abundance, there are some values which are zero, simply because at that particular site one or more species were not observed.
I'm using the mice package to deal with these NA values using imputation methods. However, I also want to use the VIM package, with the functions md.pattern and aggr, to assess the proportion of missing values. The issue is that, when using these functions, R is treating not only the NA values as missing data but also the zero values. How can I make R detect only NA as missing data, and not the values which are zero?
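In principle both functions flag only true NA cells, so one thing worth verifying is whether the zeros were somehow read in as NA during import (for example via na.strings). A quick check, with env as a hypothetical name for the dataset:
library(mice)
library(VIM)
# NAs and zeros are distinct values and are counted separately
sum(is.na(env))               # genuine missing cells
sum(env == 0, na.rm = TRUE)   # zero abundances
md.pattern(env)               # mice's missing-data pattern (NA only)
aggr(env)                     # VIM's missingness plot (NA only)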

R: imputation of values in a data frame column by distribution of that variable

I have searched Stack Overflow and Google regarding this but have not yet found a fitting answer.
I have a data frame column with ages of individuals.
Out of around 10,000 observations, 150 are NAs.
I do not want to impute those with the mean age of the whole column, but rather assign random ages based on the distribution of the ages in my data set, i.e. in this column.
How do I do that? I tried fiddling around with the MICE package but didn't make much progress.
Do you have a solution for me?
Thank you,
corkinabottle
You could simply sample 150 values from your observed (non-NA) ages:
samplevals <- sample(obs, 150)   # 'obs' holds the observed, non-NA ages
You could also stratify your observations across quantiles to increase the chances of sampling your tail values, by sampling within each quantile range, as sketched below.
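A sketch of that stratified idea (the vector name ages is an assumption, not from the question):
# Hypothetical numeric vector 'ages' containing the 150 NAs
obs  <- ages[!is.na(ages)]
n_na <- sum(is.na(ages))
# Bin the observed ages into quartiles, then draw within each bin so
# the tails of the distribution stay represented
bins  <- cut(obs, quantile(obs, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
draws <- unlist(lapply(split(obs, bins),
                       function(g) sample(g, ceiling(n_na / nlevels(bins)),
                                          replace = TRUE)))
ages[is.na(ages)] <- sample(draws, n_na)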

Estimating new columns with n-1 cases in R

I am trying to build code for a fund analysis, which starts from the returns of the fund at different frequencies. I have been able to split the data by frequency, that is, daily, weekly, monthly, quarterly and yearly, but the next step is not quite working for me. I have done it many times in Excel and SPSS, but since R is a new language for me, it is proving to be challenging. A sample of my data is given herewith:
Date        Dat1   Dat2
30/06/2009  54.26  1307.16
31/07/2009  65.28  1425.40
31/08/2009  70.71  1498.97
30/09/2009  76.18  1552.84
30/10/2009  71.92  1532.74
30/11/2009  77.14  1559.57
What I wish to do is to have two more columns, with n-1 elements in them, starting at the second index. So the entries in the new columns for the 30/06/2009 row would be '-' and '-', and for the 31/07/2009 row they would be (65.28-54.26)/54.26 and (1425.40-1307.16)/1307.16, and so forth until the very last case. But when I run the simple code
Daily$Dat1.Return <- diff(log(Daily$Dat1))
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, Dat1, value = c(-0.0616981144702153, :
replacement has 2161 rows, data has 2162
How can I get the columns I want?
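The error arises because diff() returns one fewer value than its input; padding with a leading NA makes the lengths match. A sketch, assuming Daily holds the daily data (the .Simple column names are illustrative, not from the question):
# diff() yields n-1 values, so pad with NA to match the n rows
Daily$Dat1.Return <- c(NA, diff(log(Daily$Dat1)))   # log returns
# For the simple returns described above, (x_t - x_{t-1}) / x_{t-1}:
Daily$Dat1.Simple <- c(NA, diff(Daily$Dat1) / head(Daily$Dat1, -1))
Daily$Dat2.Simple <- c(NA, diff(Daily$Dat2) / head(Daily$Dat2, -1))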

Imputation using related data (R)

I have a panel dataset with population data. I am working mostly with two vectors - population and households. The household vector (there are 3 countries) has a substantial amount of missing values; the population vector is complete. I use a model with population as the independent variable to get the missing values of households. What function should I use to extract these values? I do not need to make any forecasts, just to impute the missing data.
Thank you.
EDIT:
This is a printscreen of my dataset:
https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
As you can see, many values of the datatype = "original" data are missing and I need to impute them somehow. I have created several panel data models (pooled, within, between) and, without further consideration, tried to extract the missing data with each of them; however, I do not know how to do this.
EDIT 2: What I need is not how to determine which model to use, but how to obtain the missing values from the model (making the dataset more balanced).
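For an ordinary linear model this is what predict() does; a minimal sketch (panel, population and households are assumed names, not from the post):
# Fit on the complete rows, then predict households where missing
fit  <- lm(households ~ population, data = panel)
miss <- is.na(panel$households)
panel$households[miss] <- predict(fit, newdata = panel[miss, ])
For plm panel models, predict() support is more limited, so the lm version above is the simplest starting point.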
