I am new to R and I am using the rms package (which loads Hmisc, where aregImpute lives) to handle imputations. I noticed that the aregImpute function only returns values for the columns that have NA values.
impute <- aregImpute(Y~X1+X2+X3+X4+X5+X6,data= train_data, n.impute=5, nk=0)
impute$imputed$Y
When I tried to find the target variable in the imputed data using impute$imputed$Y, it returned NULL. As I understand it, I got NULL because the target variable had no NA values. My question is: how can I combine the imputed values with the original dataset so that I end up with a complete dataset with no NA values? I want to try out various algorithms, such as Decision Tree and Random Forest, on the imputed data. Does anyone have any suggestions? It would be a great help.
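One way to approach this can be sketched in base R. The sketch assumes that impute$imputed holds, for each variable, either NULL (no missing values) or a matrix with one row per missing observation, row names giving the original row numbers, and one column per imputation. The helper fill_imputed and the mock data are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: copy the k-th set of imputed values into the original data.
fill_imputed <- function(data, imputed, k = 1) {
  completed <- data
  for (v in names(imputed)) {
    vals <- imputed[[v]]
    if (is.null(vals)) next              # variable had no NAs (e.g. Y)
    rows <- as.integer(rownames(vals))   # original rows that were missing
    completed[rows, v] <- vals[, k]      # fill them with imputation k
  }
  completed
}

# Mock data standing in for train_data and impute$imputed (made-up numbers)
train_data <- data.frame(Y = c(1, 0, 1, 0), X1 = c(2.5, NA, 3.1, NA))
mock_imputed <- list(
  Y  = NULL,
  X1 = matrix(c(2.7, 3.0, 2.9, 3.3), nrow = 2,
              dimnames = list(c("2", "4"), NULL))
)

completed_1 <- fill_imputed(train_data, mock_imputed, k = 1)
completed_1
```

With a real aregImpute result you would pass impute$imputed instead of mock_imputed and repeat for k = 1, ..., n.impute. For factor variables the stored values may be codes rather than labels, so check them before modelling. Hmisc also provides impute.transcan, which as far as I know can return the filled-in variables directly; see its help page for the exact arguments.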
This is very similar to the following question: R SVM return NA for predictions with missing data
However, the response suggested there does not work (at least for me). Therefore I would like to be more general and try a different approach (or adjust the one proposed there). I can predict using my svm model on the complete.cases() of my data frame. However, it is very important for me to have NA values for all rows with missing data.
My theoretical approach is the following: predict on the complete.cases() of my data frame; find the indices of the complete cases; then cbind the column of predictions back onto my data frame, filling in NA for every row that is not a complete case. In essence, I need to create a column in the data frame by combining two vectors: one of predictions and one of NA values, placed according to the known indices. However, I have not managed to write the few lines of code that do this.
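The approach described above can be sketched as follows. To keep the example free of extra packages, lm stands in for the svm model; the data and column names are made up, and the same indexing should carry over to any model whose predict() returns one value per row of newdata.

```r
# Made-up data with missing predictors in rows 3 and 4
df <- data.frame(
  x1 = c(1, 2, NA, 4, 5, 6),
  x2 = c(2, 1, 4, NA, 6, 5),
  y  = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.0)
)

# Logical index of rows whose predictors are all observed
ok <- complete.cases(df[, c("x1", "x2")])

# Fit on the complete cases only (lm is a stand-in for svm here)
fit <- lm(y ~ x1 + x2, data = df[ok, ])

# Start with a column of NAs, then fill in predictions at the complete rows
df$pred <- NA_real_
df$pred[ok] <- predict(fit, newdata = df[ok, ])
df
```

Rows 3 and 4 keep NA in pred, and every other row receives its prediction in the original row order.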
Suppose I run one of the missing variable imputation R packages, amelia or mice (or similar), on a large data frame -- let's say 100000 rows and 50 columns -- to get imputations for one particular column with some (let's say 200) NAs in it.
Is there a way to save the derived imputation algorithm so that when I get new data with 1000 new rows, I can simply apply the algorithm to that new data?
The goal is to impute any new NAs in the new data set using the same algorithm that was fitted on the base data.
Thank you in advance -- if this isn't clear, I'm happy to answer any questions.
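caret comes close to what you want. This assumes the new data has the same variables as the base data. In my experience, though, imputations from caret and mice differ in accuracy.
library(caret)
# base data: both columns contain NAs (B is recycled to the length of A)
mydata <- data.frame(A = c(rep(NA, 900), rep(3, 900)), B = c(rep(NA, 200), rep(3, 400)))
# new data must use the same column names as the base data (illustrative values)
mydata1 <- data.frame(A = c(NA, 5, 7), B = c(2, NA, 4))
# learn the imputation (column medians) from the base data
prep <- preProcess(mydata, method = "medianImpute")
# apply the stored imputation to the base data
df_new <- predict(prep, mydata)
head(df_new)
# apply the same stored imputation to the new data
df_new2 <- predict(prep, mydata1)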
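The "fit once, apply to new data" idea behind this answer can be illustrated without any package. This is only a minimal sketch for median imputation, not how caret, mice, or Amelia work internally; the function names fit_median_imputer and apply_median_imputer are hypothetical, and the data is made up.

```r
# Learn one median per column from the base data
fit_median_imputer <- function(data) {
  vapply(data, function(col) median(col, na.rm = TRUE), numeric(1))
}

# Fill NAs in new data with the medians learned from the base data
apply_median_imputer <- function(medians, newdata) {
  for (v in names(medians)) {
    miss <- is.na(newdata[[v]])
    newdata[[v]][miss] <- medians[[v]]
  }
  newdata
}

base_data <- data.frame(A = c(1, 2, NA, 4), B = c(10, NA, 30, 50))
new_data  <- data.frame(A = c(NA, 7), B = c(5, NA))

# Fit on the base data and save the fitted values to disk
medians <- fit_median_imputer(base_data)
path <- tempfile(fileext = ".rds")
saveRDS(medians, path)

# Later: reload the fitted values and apply them to the new rows
new_filled <- apply_median_imputer(readRDS(path), new_data)
new_filled
```

The same saveRDS/readRDS pattern should work for a caret preProcess object, since it is an ordinary R object that predict() can be called on after reloading.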
From the multiple imputation output (e.g., object of class mids for mice) I want to extract several imputed values for some of the imputed variables into a single dataset that also includes original data with the missing values.
Here are sample dataset and code:
library("mice")
nhanes
tempData <- mice(nhanes, seed = 23109)
Using the code below I can extract these values for each variable into separate datasets:
age_imputed<-as.data.frame(tempData$imp$age)
bmi_imputed<-as.data.frame(tempData$imp$bmi)
hyp_imputed<-as.data.frame(tempData$imp$hyp)
chl_imputed<-as.data.frame(tempData$imp$chl)
But I want to extract several variables into one dataset while preserving the order of the rows for further analysis.
I would appreciate any help.
Use the complete function from the mice package to extract the complete data set including the imputations:
complete(tempData, action = 1)
The action argument takes either the imputation number or a format keyword such as "all" or "long". See the R documentation (?complete) for the full list.
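As a hedged illustration of the "long" format mentioned above, the sketch below stacks the original data and every imputation in one data frame. It assumes a mice version in which complete() accepts action = "long" together with include = TRUE; printFlag = FALSE only silences the iteration log.

```r
library(mice)

# Same sample data and seed as in the question
tempData <- mice(nhanes, seed = 23109, printFlag = FALSE)

# Stack the original data (.imp = 0) and all imputations (.imp = 1, 2, ...)
long <- complete(tempData, action = "long", include = TRUE)

# .id identifies the original row, so row order is preserved within each .imp
head(long)

# Pull out a single completed data set, e.g. the first imputation
imp1 <- subset(long, .imp == 1)
```

Because every block shares the same .id values, the imputed values for several variables stay aligned with the original rows.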
I have a short question:
I imputed item data using multiple imputation with the MICE package.
After imputation, I would like to sum items to a total score.
However, my data is now in a mids object, and I can't figure out how to do this simple task.
Does anyone have experience with this "problem"?
Best, Leonhard
I figured it out:
1. Create an object that contains all imputed datasets and the original dataset
2. Apply rowSums()
3. Reconstruct the .mids object
Example code:
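# load the .mids object
library("miceadds")
# note: sep="" appends "imp" directly to the working directory path;
# use file.path(getwd(), "imp") instead if imp is a subfolder
Dmi <- load.Rdata2("imp.Rdata", paste(getwd(), "imp", sep=""))
# create an object that contains all imputed datasets and the original dataset
D <- complete(Dmi, "long", include = TRUE)
# use rowSums; columns 2:11 are specific to this data, so check names(D) first,
# because complete() puts the .imp and .id columns in front of the data columns
D$T <- rowSums(D[2:11])
# reconstruct the .mids object
Dmi <- as.mids2(D)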
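Selecting the item columns by name rather than by position may be safer, because the long format adds .imp and .id in front of the data. The sketch below uses a small mock data frame in place of the real complete(Dmi, "long", include = TRUE) output; the item names and values are made up.

```r
# Mock long-format data: 2 original rows (.imp = 0) plus 2 imputations
D <- data.frame(
  .imp  = rep(0:2, each = 2),
  .id   = rep(1:2, times = 3),
  item1 = c(1, NA, 1, 2, 1, 3),
  item2 = c(NA, 4, 2, 4, 3, 4)
)

# Pick the item columns by name so .imp and .id are never summed
item_cols <- grep("^item", names(D), value = TRUE)

# Total score per row; rows of the original data with missing items stay NA
D$total <- rowSums(D[, item_cols])
D
```

Keeping NA totals in the .imp = 0 rows matters, since those rows represent the original incomplete data when the long format is turned back into a mids object.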
I am using the Amelia package in R to handle missing values. I get the error below when I try to train a random forest on the imputed data. I am not sure how to convert an object of class amelia to a data frame, which would be the right input for the randomForest function in R.
train_data<-read.csv("train.csv")
sum(is.na(train_data))
impute<- amelia(x=train_data,m=5,idvars=c("X13"), interacs=FALSE)
impute<= as.data.frame(impute)
for(i in 1:impute$m) {
model <- randomForest(Y ~X1+X2+X3+X4+X5+X6,
data= as.data.frame(impute))
}
Error in as.data.frame.default(impute) :
cannot coerce class "amelia" to a data.frame
If I pass impute$imputations[[i]] as the input to randomForest, I get the error below:
model <- randomForest(Y ~X1+X2+X3+X4+X5+X6,
impute$imputations[[i]])
Error: $ operator is invalid for atomic vectors
Can anyone suggest how I can solve this problem? It would be a great help.
So, I think the first problem is this line here:
impute<= as.data.frame(impute)
Should be:
impute <- as.data.frame(impute)
That corrected line will still throw an error, because an amelia object cannot be coerced to a data frame.
Multiple imputation replaces the data with multiple datasets, each with different replacements for the missing values. This reflects the uncertainty in the predictions for those missing values. By turning the Amelia object into a data frame you are trying to make one data frame out of 5 data frames, and it's not obvious how to do this.
You might want to look into simpler forms of imputation (like imputing by the mean).
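This is happening because you are trying to train on the amelia object itself, which holds information about the imputation rather than a single data set to train on. You need one of the completed data sets, which Amelia stores as data frames in impute$imputations. (The complete function belongs to mice and expects a mids object, so it does not apply here.)
# note: Amelia spells this argument intercs, not interacs
impute <- amelia(x=train_data, m=5, idvars=c("X13"), intercs=FALSE)
# each element of impute$imputations is one completed data frame
train_imputed <- impute$imputations[[1]]
After this you should be able to train and predict on train_imputed.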
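Rather than picking a single completed data set, one option is to fit a model on each of them and combine the predictions. The sketch below is only an illustration: a mock list of data frames stands in for impute$imputations, lm stands in for randomForest, and make_completed is a hypothetical helper that fakes one imputed value per data set.

```r
set.seed(1)

# Hypothetical helper: build one "completed" data set where row 3 of X1
# is pretended to have been imputed (so it differs between data sets)
make_completed <- function() {
  x <- c(1, 2, 3, 4, 5, 6)
  x[3] <- rnorm(1, mean = 3, sd = 0.2)
  data.frame(Y = c(1.2, 1.9, 3.1, 4.2, 4.8, 6.1), X1 = x)
}

# Stand-in for impute$imputations: a list of 5 completed data frames
imputations <- lapply(1:5, function(i) make_completed())

# Fit one model per completed data set (lm used here instead of randomForest)
models <- lapply(imputations, function(d) lm(Y ~ X1, data = d))

# Predict new rows with every model, then average across imputations
new_rows <- data.frame(X1 = c(2.5, 4.5))
preds <- sapply(models, predict, newdata = new_rows)  # one column per model
avg_pred <- rowMeans(preds)
avg_pred
```

Averaging predictions is just one simple way to combine the models; the spread across the columns of preds gives a rough sense of how much the imputation uncertainty affects each prediction.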