Running ezDesign() on my dataset shows that there are conditions with missing observations. In the plot below, conditions that have two observations are in blue, those with only one are in red, and the grey cell has none at all.
I want to impute the missing observations so that all cells have two observations each.
I tried to use the MICE library, but what it does is impute NAs in the dataset; it does not impute missing observations. In other words, I am looking for a function that automatically fills the missing conditions with imputed values.
Does anybody know how to do that?
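To make it concrete, something like the following is the kind of setup I have in mind (a sketch only, with placeholder names: a data frame dat with factors subject and condition and a response rt). The idea would be to expand the data to the full design so every missing observation becomes an explicit NA row, and then let mice impute those NAs:

library(dplyr)
library(tidyr)
library(mice)

# 'dat', 'subject', 'condition' and 'rt' are placeholder names.
# Number the replicates within each cell, then expand the design so every
# subject x condition x replicate combination exists, with NA in 'rt' where
# an observation is missing:
full <- dat %>%
  group_by(subject, condition) %>%
  mutate(rep = row_number()) %>%
  ungroup() %>%
  tidyr::complete(subject, condition, rep = 1:2)

# The missing observations are now ordinary NAs, which mice can impute:
imp <- mice(full, m = 5, seed = 1)
dat_imputed <- mice::complete(imp, 1)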
I am using the MICE package to impute missing values - the input data are up to six parallel hourly temperature measurements for Scottish weather stations across a calendar year. None of the vectors has more than 5% NAs, as I have filtered out those with more. Most of the sets work fine with MICE, but with a few I get an error message:
This data set, which generated the error message, has five columns.
iter imp variable
1 1 986Error in terms.formula(tmp, simplify = TRUE) :
invalid term in model formula
986 is the station number, which here is the column name of the first column. The third to fifth columns don't have any NAs, and the first and second have fewer than 1%, but their NAs are concentrated in runs of 20 or so near the beginning of the data set. I am wondering whether MICE has a problem with too high a concentration of NAs in particular regions, but I can't find any reference to this in the literature. Has anyone else come across this problem, and if so, what did they do about it? Thanks, Nick Wray
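One thing I am also wondering about, as a guess rather than a diagnosis: mice builds model formulas from the column names, and column names that are not syntactically valid R names (such as ones starting with a digit, like 986) are a known trigger for terms.formula() failing with "invalid term in model formula". If that is the cause, the workaround would be something like the sketch below, where temps is a placeholder for my data frame of hourly temperatures:

library(mice)

# 'temps' is a placeholder for the data frame whose columns are named after
# station numbers ("986", "912", ...).
names(temps) <- make.names(names(temps))   # "986" becomes "X986", etc.

imp <- mice(temps, m = 5, seed = 1)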
I'm working with a dataset that is comparing the abundance of certain species against environmental variables in various sampling sites.
For some of the sites, environmental variables could not be measured in the field. As a result, these values are written as "NA" in my dataset.
However, for the variables relating to species abundance, there are some values which are zero simply because, at that particular site, one or more species were not observed.
I'm using the mice package to deal with these NA values using imputation methods. However, I also want to use md.pattern() (from mice) and aggr() (from the VIM package) to assess the proportion of missing values. The issue is that, when using these functions, R is treating not only the NA values but also the zero values as missing data. How can I make it so that R only detects NA as missing data and not the values which are zero?
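For reference, this is a toy version of the check I am trying to do, with made-up numbers (as far as I can tell, both functions rely on is.na() to decide what is missing, so genuine zeros should not be flagged unless they were recoded to NA somewhere upstream, for example via na.strings at import):

library(mice)   # md.pattern()
library(VIM)    # aggr()

# Made-up data: true zeros in the abundance column, NAs only where a
# variable could not be measured
dat <- data.frame(
  abundance = c(0, 3, 0, 7, 2),
  temp      = c(12.1, NA, 14.3, NA, 12.8)
)

md.pattern(dat)            # only the NA cells are reported as missing
aggr(dat, numbers = TRUE)  # the missingness plot is likewise based on is.na()

colSums(is.na(dat))              # actual missing values
colSums(dat == 0, na.rm = TRUE)  # zeros, counted separately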
I have searched Stack Overflow and Google regarding this but have not yet found a fitting answer.
I have a data frame column with ages of individuals.
Out of around 10000 observations, 150 are NAs.
I do not want to impute those with the mean age of the whole column, but rather assign random ages based on the distribution of the ages in my data set, i.e. in this column.
How do I do that? I tried fiddling around with the MICE package but didn't make much progress.
Do you have a solution for me?
Thank you,
corkinabottle
You could simply sample 150 values from your observations:
obs <- na.omit(ages)            # observed (non-NA) ages; 'ages' is a placeholder for your column
samplevals <- sample(obs, 150)  # one random age per missing value
You could also stratify your observations across quantiles to increase the chances of sampling your tail values by sampling within each quantile range.
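A rough sketch of that stratified variant (again assuming your ages sit in a vector called ages, with the 150 NAs still in it; adjust the name to your column):

obs    <- ages[!is.na(ages)]
n_miss <- sum(is.na(ages))

# split the observed ages into quartile bins (unique() guards against tied quantiles)
bins <- cut(obs,
            breaks = unique(quantile(obs, probs = seq(0, 1, 0.25))),
            include.lowest = TRUE)

# draw from each bin in proportion to its size, then fill the NAs
draws <- unlist(lapply(split(obs, bins), function(x)
  sample(x, size = ceiling(length(x) / length(obs) * n_miss), replace = TRUE)))
ages[is.na(ages)] <- sample(draws, n_miss)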
I am using the mice package to impute data, and I have read about post-processing to restrict imputed values. One of the variables that I am imputing is a categorical variable with 10 levels (a, b, c, d, e, f, g, h, i, j). The missing values can take any level except a and d. I need to make it so that people with category a or d have NA after the imputation, because when I impute now, values are drawn from all the available levels, and that is wrong.
I have also tried creating another binary variable coded 0 and 1 to make it work, but it still imputed the wrong way.
Any ideas about post processing this in mice in R?
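From the documentation, the post argument looks like the place to do this, but I am unsure about the exact expression. Something along these lines is the effect I am after (dat and catvar are placeholders for my data frame and the 10-level factor; in this sketch any draw of a or d for catvar is replaced by a draw from the allowed levels, which is one simple way of keeping those two levels out of the imputations):

library(mice)

ini  <- mice(dat, maxit = 0)   # dry run, just to pick up the default settings
post <- ini$post

# after each draw for catvar, replace any imputed 'a' or 'd' with a value
# sampled from the remaining levels
post["catvar"] <- "{
  bad <- imp[[j]][, i] %in% c('a', 'd')
  if (any(bad)) imp[[j]][bad, i] <- sample(
    c('b', 'c', 'e', 'f', 'g', 'h', 'i', 'j'), sum(bad), replace = TRUE)
}"

imp <- mice(dat, post = post, seed = 1)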
I am new to R and have the following problem: I am working on a dataset that has not only numerical values but also non-numerical ones (gender, state). I wanted to start by looking through the data and finding some correlations first. This obviously works only for numerical values, and no correlations are found for the non-numerical ones. I tried it out with ggcorr and it omits the non-numerical columns.
My question is: how do you treat such datasets? How do you find correlations if you have many non-numerical categories? Also, what is the workflow for creating a linear model for such a dataset? The model should predict whether a person earns more or less than 50k a year.
Thanks for any help!
Edit: This is the dataset I am talking about. I was thinking about converting the categories into numerical values and then correlating through cor.test(), but I am not sure if I would get a valid correlation number this way. So basically my question is: how do I check the correlation between non-numerical and numerical data?
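To make it concrete, this is roughly the kind of check and model I have in mind, assuming the data frame is called adult with columns income (coded ">50K"/"<=50K"), age, gender and state (all placeholder names, to be adjusted to the actual dataset):

# numeric vs. binary categorical: the point-biserial correlation, which is
# just cor() on a 0/1 coding of the factor
adult$income <- factor(adult$income)
cor(adult$age, as.numeric(adult$income) - 1)

# categorical vs. categorical: contingency table plus a chi-squared test
chisq.test(table(adult$gender, adult$income))

# for the prediction itself, a logistic regression handles factor columns
# directly; R dummy-codes them for you
fit <- glm(income ~ age + gender + state, data = adult, family = binomial)
summary(fit)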