Using missForest in R with categorical variables - r

I am trying to use the missForest package to impute missing data into a fairly large dataset. Most of my variables are categorical with many factors. When I run missForest, it imputes decimal values and sometimes even negative values. Obviously, I'm doing something wrong. Here is my process below:
FIRST TRY: Entering the predictor data directly. I got decimal values imputed into my dataset. I know that missForest only takes matrices, but I'm not sure how to force it to recognize which columns are factors. Someone on another post recommended dummy coding, so I tried that next, with the same results. The code is below.
SECOND TRY: Dummy coding each predictor (so time consuming) and then running that.
homt_sub_dummy<-homt_sub[c("Psyprob.yes", "Psyprob.no","SUB2.2.0", "SUB2.2.1", "SUB2.2.2", "SUB2.2.3", "SUB2.2.4", "SUB2.2.5", "SUB2.2.6", "SUB2.2.7","Freq1.1", "Freq1.2", "Freq1.3", "Freq1.4","FRSTUSE1.0", "FRSTUSE1.1", "FRSTUSE1.2", "FRSTUSE1.3", "FRSTUSE1.4", "FRSTUSE1.5", "FRSTUSE1.6","FRSTUSE1.7", "FRSTUSE1.8", "FRSTUSE1.9", "FRSTUSE1.10", "FRSTUSE1.11","Freq2.1", "Freq2.2", "Freq2.3", "Freq2.4","AGEcont","Gender_male", "Gender_female", "Race2.0", "Race2.1", "Race2.2", "Arrests.0", "Arrests.1", "Arrests.2")]
homt_dummy_matrix<-data.matrix(homt_sub_dummy, rownames.force = NA)
homt_dummp.imp <- missForest(homt_dummy_matrix, verbose= TRUE, maxiter = 3, ntree = 20)
homt_dummy.imp.df<-as.data.frame(homt_dummp.imp$ximp)
View(homt_dummy.imp.df)
This is a chunk of the data.frame I saved with the imputed values.
Any help would be appreciated. I'm pretty new to imputation. I wanted to compare results of MICE with this but I just can't seem to get missForest to work!!!

You can use the as.factor function to convert each categorical column to a factor before imputing. For example:
cleveland_t <- transform(cleveland, V2 = as.factor(V2), V3 = as.factor(V3), V6 = as.factor(V6), V7 = as.factor(V7), V9 = as.factor(V9), V11 = as.factor(V11), V12 = as.factor(V12), V13 = as.factor(V13), V14 = as.factor(V14))
Then use sapply to check the class of each column, e.g. sapply(cleveland_t, class).
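Applied to data like the asker's, the idea might look like the sketch below. The data frame and its values are made up for illustration; only the column names are borrowed from the question. Note that data.matrix() turns factors into their integer codes, which is one likely reason the imputed values came back as decimals. The missForest call is guarded because it assumes the package is installed; passing a data frame that contains factor columns is how its documentation handles mixed-type data.

```r
# Hypothetical toy data: two categorical variables stored as numbers,
# one genuinely continuous variable, and some missing values.
set.seed(42)
n <- 100
df <- data.frame(
  Psyprob = sample(c(0, 1), n, replace = TRUE),
  Freq1   = sample(1:4, n, replace = TRUE),
  AGEcont = round(runif(n, 18, 65))
)
df$Psyprob[sample(n, 10)] <- NA
df$Freq1[sample(n, 10)] <- NA

# Convert only the categorical columns to factors.
cat_cols <- c("Psyprob", "Freq1")
df[cat_cols] <- lapply(df[cat_cols], as.factor)

# Check the class of every column.
classes <- sapply(df, class)
print(classes)

# Impute on the data frame itself (not on data.matrix(df)), so the
# factor columns stay categorical. Assumes missForest is installed.
if (requireNamespace("missForest", quietly = TRUE)) {
  suppressPackageStartupMessages(library(missForest))
  imp <- missForest(df, maxiter = 3, ntree = 20)
  str(imp$ximp)
}
```

With the columns stored as factors, there should be no need for manual dummy coding.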

Related

How to deal with NaN values in R?

I'm testing for random intercepts as a preparation for growth curve modeling.
Therefore, I first created a wide subset and then converted it to a long data set.
When I fit ModelM1 <- gls(ent_act~1, data=school_l) on the long data set, I get an error message because I have missing values. In my long subset these values appear as NaN.
When I apply temp<-na.omit(school_l$ent_act), I can fit ModelM1. But when I fit ModelM2 <- lme(temp~1, random=~1|ID, data=school_l), I get an error message saying that my variables are of unequal lengths.
How can I deal with those missing values?
Any ideas or recommendations?
You might have success by making a temp data frame in which you remove entire rows, indexed by the negation of the missing condition, !is.na(school_l$ent_act):
temp<-school_l[ !is.na(school_l$ent_act), ]
Then re-run the lme call on temp. There should now be no mismatch of variable lengths.
ModelM2 <- lme(ent_act ~1, random= ~1|ID, data=temp)
Note that using school_l is going to be potentially confusing because it looks so much like school_1 when viewed in Times font.
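A minimal sketch of the row-filtering approach, using simulated data (the values and the number of IDs are invented; only the names school_l, ent_act and ID come from the question). is.na() returns TRUE for NaN as well as NA, so the same filter covers both. The model fits are guarded on nlme being available.

```r
# Simulated long-format data: 10 subjects, 4 measurements each.
set.seed(1)
school_l <- data.frame(
  ID   = factor(rep(1:10, each = 4)),
  time = rep(1:4, times = 10)
)
school_l$ent_act <- 3 + rep(rnorm(10), each = 4) + rnorm(40, sd = 0.5)
school_l$ent_act[c(3, 17, 28)] <- NaN   # introduce missing values

# Drop whole rows, so the outcome and ID stay aligned.
temp <- school_l[!is.na(school_l$ent_act), ]

if (requireNamespace("nlme", quietly = TRUE)) {
  # Fit both models on the same filtered data frame.
  ModelM1 <- nlme::gls(ent_act ~ 1, data = temp)
  ModelM2 <- nlme::lme(ent_act ~ 1, random = ~ 1 | ID, data = temp)
  print(anova(ModelM1, ModelM2))
}
```

An alternative that avoids the temporary data frame is to pass na.action = na.omit to gls() and lme() directly.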

Generating data with a loop to use the predict function in R

I've built a model with numeric and factor variables to predict sales based on advertising, using weekly data from 2017 to 2019, and I am trying to write code that will predict the monthly sales for 2020 for each combination of variables.
For that, I need to input the right factor variables, and I was wondering what would be the best way to go about it, here is what my regression looks like:
regads2 = lm(dolsales~flavour+brand+packaging+month+organization+manufacturer+region+displaydummie+addummie+maddummie+laddummie+discount5to10+discount10to15
+discount15to20+discount20to25+discount25to30+discount30to35+discount35to40+discount40to45+discount45to50+
discount50to55+discount55to60+discount60to65+discount65to70+discount70to75, na.action = na.exclude,data = df)
Each of the factor variables has many levels (from 5 to 30), while the "discount" variables are dummies. I tried to write a loop that would generate the data based on the levels of the variables and store them, but I haven't been able to fully get there and I am a little stuck. Here is what I have written so far (it works for one variable, but not for many):
input <- matrix(ncol = 1, nrow = nlevels(region))
for (i in seq_len(nlevels(region))) {
  input[i, ] <- levels(region)[i]
}
input
I imagine there is a simpler way to go about it.
Thanks so much, I've been stuck on that for a good week now.
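One possible way to avoid the loop is expand.grid(), which builds every combination of the supplied levels in a single call. The sketch below uses a small simulated data set and a reduced model; the data values and the three predictors chosen are assumptions for illustration, not the asker's real data.

```r
# Simulated stand-in for the sales data (values are made up).
set.seed(1)
df <- data.frame(
  region   = factor(sample(c("East", "West", "North"), 60, replace = TRUE)),
  month    = factor(sample(month.abb[1:4], 60, replace = TRUE)),
  addummie = sample(0:1, 60, replace = TRUE)
)
df$dolsales <- 100 + 10 * df$addummie + rnorm(60)

fit <- lm(dolsales ~ region + month + addummie, data = df)

# One row per combination of factor levels and dummy values.
newdata <- expand.grid(
  region   = levels(df$region),
  month    = levels(df$month),
  addummie = 0:1
)

# Predicted sales for every combination.
newdata$pred <- predict(fit, newdata = newdata)
head(newdata)
```

With many factors of 5 to 30 levels each, the full grid can become very large, so it may be worth restricting it to the combinations that actually occur, for example with unique() on the relevant columns of the original data.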

Mixed Anova in R

I am trying to do an ANOVA in R on a data set with one within factor and one between factor. The data is from an experiment to test the similarity of two testing methods. Each subject was tested with Method 1 and Method 2 (the within factor) and belonged to one of 4 different groups (the between factor). I have tried using aov, Anova (in the car package), and ezANOVA. I am getting wrong values with every method I try. I am not sure where my mistake is, or whether it's a lack of understanding of R or of the ANOVA itself. I have included the code that I feel should be working. I have tried a ton of variations of this, hoping to stumble on the answer. This set of data is balanced, but I have a lot of similar data sets and many are unbalanced. Thanks for any help you can provide.
library(car)
library(ez)
#set up data
sample_data <- data.frame(Subject=rep(1:20,2),Method=rep(c('Method1','Method2'),each=20),Level=rep(rep(c('Level1','Level2','Level3','Level4'),each=5),2))
sample_data$Result <- c(4.76,5.03,4.97,4.70,5.03,6.43,6.44,6.43,6.39,6.40,5.31,4.54,5.07,4.99,4.79,4.93,5.36,4.81,4.71,5.06,4.72,5.10,4.99,4.61,5.10,6.45,6.62,6.37,6.42,6.43,5.22,4.72,5.03,4.98,4.59,5.06,5.29,4.87,4.81,5.07)
sample_data[, 'Subject'] <- as.factor(sample_data[, 'Subject'])
#Set the contrasts if needed to run type 3 sums of squares for unbalanced data
#options(contrasts=c("contr.sum","contr.poly"))
#With aov method as I understand it 'should' work
anova_aov <- aov(Result ~ Method*Level + Error(Subject/Method),data=sample_data)
print(summary(anova_aov))
#ezAnova method,
anova_ez = ezANOVA(data=sample_data, wid=Subject, dv = Result, within = Method, between=Level, detailed = TRUE, type=3)
print(anova_ez)
Also, these are the values I should be getting, as output by SAS:
[Image: SAS Anova output]
Actually, your R code is correct in both cases. Running these data through SPSS yielded the same result. SAS, like SPSS, seems to require that the levels of the within factor appear in separate columns. You will end up with 20 rows instead of 40. An arrangement like the one below might give you the desired result in SAS:
Subject Level Method1 Method2
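If it helps, that wide layout could be produced in R before exporting. This is a minimal sketch using base reshape() on the data from the question; the renaming step at the end is just a convenience to match the column names suggested above.

```r
# Data from the question, in long format (40 rows).
sample_data <- data.frame(
  Subject = factor(rep(1:20, 2)),
  Method  = rep(c("Method1", "Method2"), each = 20),
  Level   = rep(rep(c("Level1", "Level2", "Level3", "Level4"), each = 5), 2)
)
sample_data$Result <- c(
  4.76, 5.03, 4.97, 4.70, 5.03, 6.43, 6.44, 6.43, 6.39, 6.40,
  5.31, 4.54, 5.07, 4.99, 4.79, 4.93, 5.36, 4.81, 4.71, 5.06,
  4.72, 5.10, 4.99, 4.61, 5.10, 6.45, 6.62, 6.37, 6.42, 6.43,
  5.22, 4.72, 5.03, 4.98, 4.59, 5.06, 5.29, 4.87, 4.81, 5.07
)

# One row per subject, one Result column per method.
wide <- reshape(
  sample_data,
  idvar     = c("Subject", "Level"),
  timevar   = "Method",
  v.names   = "Result",
  direction = "wide"
)

# Rename Result.Method1 / Result.Method2 to Method1 / Method2.
names(wide) <- sub("^Result\\.", "", names(wide))
head(wide)
```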

How to extract aggregated imputed data from R-package 'mice'?

I have a question regarding the aggregation of imputed data as created by the R-package 'mice'.
As far as I understand it, the 'complete' command of 'mice' is used to extract the imputed values of, e.g., the first imputation. However, when running a total of ten imputations, I am not sure which imputed values to extract. Does anyone know how to extract the (aggregate) imputed data across all imputations?
Since I would like to enter the data into MS Excel and perform further calculations in another software tool, such a command would be very helpful.
Thank you for your comments. A simple example (from 'mice' itself) can be found below:
R> library("mice")
R> nhanes
R> imp <- mice(nhanes, seed = 23109) #create imputation
R> complete(imp) #extraction of the five imputed datasets (row-stacked matrix)
How can I aggregate the five imputed data sets and extract the imputed values to Excel?
I had a similar issue.
I used the code below, which is good enough for numeric variables.
For other variable types, I thought about randomly choosing one of the imputed results (because averaging can distort them).
My suggested code (for numeric variables) is:
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
completedData <- complete(tempData, 'long')
a<-aggregate(completedData[,3:6] , by = list(completedData$.id),FUN= mean)
You should then join the results back to the rest of the data.
I think 'Hmisc' is a better package.
If you have already found a nicer, more elegant, or built-in solution, please share it with us.
You should use complete(imp,action="long") to get values for each imputation. If you see ?complete, you will find
complete(x, action = 1, include = FALSE)
Arguments
x
An object of class mids as created by the function mice().
action
If action is a scalar between 1 and x$m, the function returns the data with imputation number action filled in. Thus, action=1 returns the first completed data set, action=2 returns the second completed data set, and so on. The value of action can also be one of the following strings: 'long', 'broad', 'repeated'. See 'Details' for the interpretation.
include
Flag to indicate whether the original data with the missing values should be included. This requires that action is specified as 'long', 'broad' or 'repeated'.
So, the default is to return the first completed data set. In addition, the argument action can also be a string: long, broad, or repeated. If you enter long, it will give you the data in long format. You can also set include = TRUE if you want the original data with the missing values included.
OK, but you still have to choose one imputed dataset for further analyses... I think the best option is to run the analysis on every imputed dataset and pool the results afterwards:
fit <- with(data=imp,exp=lm(bmi~hyp+chl))
pool(fit)
But I also assume it's not forbidden to use just one of the imputed datasets ;)
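For the export-to-Excel part of the question, the averaging idea from the earlier answer might be sketched as follows. The data frame below is a hand-made stand-in for what complete(imp, "long") returns (with .imp and .id as the first two columns); the values and the output file are hypothetical. Averaging only makes sense for numeric variables, and pooling model results as shown above remains the usual recommendation for inference.

```r
# Hypothetical stand-in for complete(imp, "long"):
# 2 imputations of a 3-row data set.
long <- data.frame(
  .imp = rep(1:2, each = 3),
  .id  = rep(1:3, times = 2),
  bmi  = c(22.0, 30.1, 27.2, 22.0, 28.7, 27.2),
  chl  = c(187, 199, 204, 187, 187, 204)
)

# Average each numeric variable across imputations, per original row.
avg <- aggregate(long[c("bmi", "chl")], by = list(.id = long$.id), FUN = mean)
print(avg)

# Write a CSV file that Excel can open (temporary file used here).
out_file <- tempfile(fileext = ".csv")
write.csv(avg, out_file, row.names = FALSE)
```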

problems with mice in R: cannot coerce class '"mids"' into a data.frame

I have a dataset with about 11,500 rows and 15 factors. I only need to impute values for 3 of the factors, with only 2 of the factors having any significant number of missing values. I have been trying to use mice to create imputed datasets, and I am using the following code:
dataset<-read.csv("filename.csv",header=TRUE)
model<-success~1+course+medium+ethnicity+gender+age+enrollment+HSGPA+GPA+Pell+ethnicity*medium
library(mice)
vempty<-c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
v12<-c(0,0,0,0,0,0,0,1,1,1,1,0,1,1,1)
v13<-c(0,0,0,0,0,0,0,1,1,1,1,1,0,1,1)
v14<-c(0,0,0,0,0,0,0,1,1,1,1,1,1,0,1)
list<-list(vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,vempty,v12,v13,v14,vempty)
predmatrix<-do.call(rbind,list)
MIdataset<-mice(dataset,m=2,predictorMatrix=predmatrix)
MIoutput<- pool(glm(model, data=MIdataset, family=binomial))
After this code, I get the error message:
Error in as.data.frame.default(data) :
cannot coerce class '"mids"' into a data.frame
I'm totally at a loss as to what this means. I had no trouble doing this same analysis by just deleting the missing data and using regular glm. I'd also like to fit a multilevel logistic model on the imputed datasets using lmer (that's the next step after I get this to work with glm), so if there is anything I am doing wrong that will also affect that next step, that would be good to know, too. I've tried searching for this error on the internet, and I'm not getting anywhere. I'm only just learning R, so I'm also not that familiar with the environment yet.
Thanks for your time!
You need to apply the with.mids function. I think the last line in your code should look like this:
pool(with(MIdataset, glm(formula(model), family = binomial)))
You could also try this:
expr <- 'glm(success ~ course, family = binomial)'
pool(with(MIdataset, parse(text = expr)))
