Is it possible to get an imputation with the mice package even when all the observed values in a column are the same? It would then simply impute that constant value.
Example:
library(mice)
test <- data.frame(var1=c(2.3,2.3,2.3,2.3,2.3,NA), var2=c(5.3,5.6,5.9,6.4,4.5,NA))
miceImp <- mice(test)
testImp <- complete(miceImp)
This only imputes var2. I would like it to replace the NA in var1 with 2.3 as well.
You can use passive imputation for this. For a full explanation, see section 3.4 on page 25 of this article. As applied to constant variables, the objective here would be to set the imputation method for any constant variable x to the constant value of x. If the constant value of x is y, then the imputation method for x should be "~I(y)".
test = data.frame(
var1=c(2.3,2.3,2.3,2.3,2.3,NA,2.3),
var2=c(5.3,5.6,5.9,6.4,4.5,5.1,NA),
var3=c(NA,1:6))
cVars = which(sapply(test, sd, na.rm = TRUE) == 0) # determine which vars are constant (props to SimonG)
allMeans = colMeans(test, na.rm = TRUE)            # get the column means
miceImp.ini = mice(test, maxit = 0, print = FALSE) # initial mids object with no imputations
meth = miceImp.ini$method                          # extract the imputation method vector
meth[cVars] = paste0("~I(", allMeans[cVars], ")")  # set the imputation method to a constant (the current column mean)
miceImp = mice(test, method = meth)                # run the imputation with the user-defined imputation methods
testImp = complete(miceImp)                        # extract an imputed, complete dataset
View(testImp)                                      # take a look at it
All that being said, constant values tend not to be of great use in statistics, so it might be more efficient to drop any constant variables before imputation (since imputation is such a costly process).
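For completeness, here is a minimal sketch of that alternative, using the example data above: drop the constant columns, impute the rest, and then add the constants back afterwards (this assumes at least two non-constant columns remain for mice to work with).
library(mice)
constCols <- which(sapply(test, sd, na.rm = TRUE) == 0)    # constant columns
constVals <- colMeans(test, na.rm = TRUE)[constCols]       # their single observed value
miceImp <- mice(test[, -constCols, drop = FALSE])          # impute only the non-constant columns
testImp <- complete(miceImp)
for (v in names(constCols)) testImp[[v]] <- constVals[[v]] # restore the constants
testImp <- testImp[names(test)]                            # back to the original column order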
Related
I have a dataset of about 2000 observations for further analyses. There are 4 variables with a lot of missing values (the percentage missing is over 50%). I'm trying to use the mice package to impute the missing values. Here are my questions:
The final dataset contains variables merged from different source datasets. Should I use the final dataset to impute the missing values for those variables, or should I use the original dataset (the one those 4 variables come from), whose data are more relevant to those 4 variables?
I saw two different code snippets online:
imputed_Data <- mice(iris.mis, m=5, maxit = 50, method = 'pmm', seed = 500)
completeData <- complete(imputed_Data,2)
Another one:
mice(anesimp2, maxit = 5,
predictorMatrix = predM,
method = meth, print = FALSE)
I'm wondering what the difference is between these two snippets and which one I should use. If I use the first one, I'm also wondering what value of seed I should set in my case.
Should I do any preprocessing of the data before running these codes?
Thanks a lot for the help!
I want to plot how the estimated survival from a Cox model depends on the value of a covariate of interest, while the remaining variables are fixed at their average values (for continuous variables) or their lowest values (for dummies). Following this example http://www.sthda.com/english/wiki/cox-proportional-hazards-model , I have constructed a new data frame with three rows, one for each value of my variable of interest, with the other covariates held fixed. Among these covariates there are two factor vectors. The new data frame is then passed to survfit() via the newdata argument.
When I pass the data frame to survfit(), I get the error Error in relevel.default(occupation) : 'relevel' only for factors. What is the source of the problem? If it is related to the factor vectors, how can I solve it? Below is an example of the code. Unfortunately, I cannot share the data or find a dataset that reproduces the error:
I have transformed the factor variables into integer vectors in the Cox model and in the new dataset; it did not work.
I have deleted all the factor variables, and then it works.
I have tried to implement this strategy, but it did not work: Plotting predicted survival curves for continuous covariates in ggplot
fit <- coxph(Surv(entry, exit, event == 1) ~ status_plot +
exp_national + relevel(occupation, 5) + age + gender + EDUCATION , data = data)
data_rank <- with(data,
data.frame(status_plot = c(1,2,3), # factor vector of interest
exp_national=rep(mean(exp_national, na.rm = TRUE), 3),
occupation = c(5,5,5), # factor with 6 categories, number 5 is the category of reference in the cox model
age=rep(mean(age, na.rm = TRUE), 3),
gender = c(1,1,1),
EDUCATION=rep(mean(EDUCATION, na.rm = TRUE), 3) ))
surv.fin <- survfit(fit, newdata=data_rank) # this produces the error
Looking at the code, it appears you probably attempted to take the mean of a factor, so do post at least str(data) as an edit to the body of your question. You should also realize that you can give a single value to a column in a data.frame() call and have it recycled to the correct length, so all the means could be entered as single values rather than rep()-ing them.
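To illustrate, here is a sketch of how data_rank could be built with recycled single values, keeping the factor covariates as factors with the same levels they had when the model was fitted (which levels are appropriate here is an assumption; any covariate that is a factor in data must be supplied this way rather than as a number):
data_rank <- with(data,
  data.frame(status_plot  = factor(c(1, 2, 3), levels = levels(status_plot)), # factor of interest
             exp_national = mean(exp_national, na.rm = TRUE),                 # single value, recycled to 3 rows
             occupation   = factor(5, levels = levels(occupation)),           # reference category, recycled
             age          = mean(age, na.rm = TRUE),
             gender       = 1,
             EDUCATION    = mean(EDUCATION, na.rm = TRUE)))
surv.fin <- survfit(fit, newdata = data_rank)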
I'm having trouble with my first forecasting implementation in R. What I'd like to achieve is to predict the variable Y with 2 exogenous variables X1 and X2. The 3 datasets are each represented as a single column with 12 rows.
From another Stack Overflow post I followed a similar approach:
DataSample <- data.frame(Y=Y[,1],Month=rep(1:12,1),
X1=X1[,1],X2=X2[,1])
predictor_matrix <- cbind(Month=model.matrix(~as.factor(DataSample$Month)),
X1=DataSample$X1,
X2=DataSample$X2)
# Remove intercept
predictor_matrix <- predictor_matrix[,-1]
# Rename columns
colnames(predictor_matrix) <- c("January","February","March","April","May","June","July","August","September","October","November","X1","X2")
# Variable to be modeled
var <- ts(DataSample$Y, frequency=12)
#Find ARIMA
modArima <- auto.arima(var, xreg = predictor_matrix)
At this line I get the following error:
Error in optim(init[mask], armaCSS, method = optim.method, hessian =
FALSE, : non-finite value supplied by optim
I presume that my predictor_matrix is not in the correct format but I can't find the error.
Any help would be appreciated.
You have indicated that your "datasets are ... 12 rows". Your predictor matrix has 13 columns (11 monthly dummy variables plus the 2 other variables), so with only 12 observations you necessarily have a linear dependence among the columns and the optimization procedure fails.
You need (ideally much) more data to support the number of predictor variables and/or a sparser set of predictors.
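One hedged sketch of the "sparser predictors" route, using the names from the question: drop the monthly dummies, keep only X1 and X2 as external regressors, and leave any seasonal pattern to the ARIMA model itself (with only 12 observations even this fit will be fragile).
library(forecast)
xreg_small <- cbind(X1 = DataSample$X1, X2 = DataSample$X2) # only the two exogenous regressors
var <- ts(DataSample$Y, frequency = 12)
modArima <- auto.arima(var, xreg = xreg_small)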
I want to impute values from a data set (14 variables, 200 observations) and then split it into a 70% training data set and a 30% testing data set.
Every time I work with Amelia to impute I get different types of error messages. I'm looking for the simplest way to have Amelia impute this entire data set.
colnames(mydata) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
mydata <- subset(mydata, select=-c(ca,thal))
I also get this error and I'm unsure of what it means:
Amelia Error Code: 36
The number of categories in the nominal variable 'chol' is greater than one-third of the observations.
Warning messages:
1: In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, :
The number of categories in one of the variables marked nominal has greater than 10 categories. Check nominal specification.
Amelia checks whether a variable marked as nominal has too many categories. It does this by counting the unique values of the variable and comparing that count to one-third of the number of rows.
For example, if you have 300 rows of data and more than 100 unique values in a column (excluding NAs), Amelia will return this error. Imputing a categorical variable with that many distinct values on so few records is nearly impossible; you might as well fill in random values. Either think about whether you really need this column, get more data, or see whether you can fill in the missing data based on domain knowledge.
For more information on Amelia, check the vignette; if you want to read through the code, check the GitHub page (you can find it here). The error-checking code in amcheck.r in particular may be handy to read through.
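In your case chol looks like a continuous measurement rather than a categorical one, so the simplest fix may be not to declare it as nominal at all. A minimal sketch (which variables to mark as nominal, and m = 5, are assumptions; adjust them to your data):
library(Amelia)
# mark only the genuinely categorical variables as nominal; chol stays continuous
a.out <- amelia(mydata, m = 5,
                noms = c("sex", "cp", "fbs", "restecg", "exang", "slope"))
imputed1 <- a.out$imputations[[1]] # first completed dataset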
Splitting your data in 70/30 can be done in multiple ways. Two that I use are either:
library(caTools)
# set.seed for reproducibility.
set.seed(144)
split <- sample.split(dataframe$"Variable to split on", SplitRatio = 0.7)
train <- subset(dataframe, split == TRUE)
test <- subset(dataframe, split == FALSE)
or
library(caret)
# set.seed for reproducibility.
set.seed(42)
split <- createDataPartition(y = dataframe$"Variable to split on", p = 0.7, list = FALSE)
train <- dataframe[split, ]
test <- dataframe[-split, ]
How can I perform an operation (like subsetting or adding a calculated column) on each imputed dataset in an object of class mids from R's package mice? I would like the result to still be a mids object.
Edit: Example
library(mice)
data(nhanes)
# create imputed datasets
imput = mice(nhanes)
The imputed values are stored as a list of data frames, one per variable:
imput$imp
Each of these has rows only for the observations that were imputed for that variable.
The original (incomplete) dataset is stored here:
imput$data
For example, how would I create a new variable calculated as chl/2 in each of the imputed datasets, yielding a new mids object?
This can be done easily as follows:
Use complete() to convert a mids object to a long-format data.frame:
long1 <- complete(midsobj1, action='long', include=TRUE)
Perform whatever manipulations needed:
long1$new.var <- long1$chl/2
long2 <- subset(long1, age >= 5)
Use as.mids() to convert the manipulated data back to a mids object:
midsobj2 <- as.mids(long2)
Now you can use midsobj2 as required. Note that include=TRUE (which includes the original data with missing values) is needed for as.mids() to compress the long-format data properly. Also note that prior to mice v2.25 there was a bug in the as.mids() function (see this post: https://stats.stackexchange.com/a/158327/69413).
EDIT: According to this answer https://stackoverflow.com/a/34859264/4269699 (from what is essentially a duplicate question) you can also edit the mids object directly by accessing $data and $imp. So for example
midsobj2<-midsobj1
midsobj2$data$new.var <- midsobj2$data$chl/2
midsobj2$imp$new.var <- midsobj2$imp$chl/2
You will run into trouble though if you want to subset $imp or if you want to use $call, so I wouldn't recommend this solution in general.
Another option is to calculate the variables before the imputation and place restrictions on them.
library(mice)
# Create the additional variable - this will have missing
nhanes$extra <- nhanes$chl / 2
# Change the method of imputation for extra, so that it always equals chl/2
# Change the predictor matrix so only chl predicts extra
ini <- mice(nhanes, maxit = 0, print = FALSE)
meth <- ini$method
meth["extra"] <- "~I(chl / 2)"
pred <- ini$predictorMatrix
pred[, "extra"] <- 0      # don't use extra as a predictor elsewhere (it duplicates chl)
pred["extra", ] <- 0      # extra is derived passively...
pred["extra", "chl"] <- 1 # ...from chl only
# Imputations
imput <- mice(nhanes, seed = 1, pred = pred, meth = meth, print = FALSE)
There are examples in mice: Multivariate Imputation by Chained Equations in R.
There is a with() method for mids objects that can help you here:
with(imput, chl/2)
The documentation is given at ?with.mids. Note that this returns a mira object (one result per imputed dataset) rather than a new mids object.
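The usual with.mids workflow is to fit a model in the same call and then pool the results; a small illustration with the nhanes data (the particular lm() model is just an example):
library(mice)
imput <- mice(nhanes, print = FALSE)
fit <- with(imput, lm(bmi ~ I(chl / 2) + age)) # fit on each imputed dataset
summary(pool(fit))                             # pool estimates with Rubin's rules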
There's a function for this in the basecamb package:
library(basecamb)
apply_function_to_imputed_data(mids_object, fun) # fun: a function applied to each imputed dataset
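For the chl/2 example from the question, usage would presumably look something like the sketch below; this assumes fun takes a completed data frame and returns the modified data frame, so check ?apply_function_to_imputed_data for the exact behaviour.
library(mice)
library(basecamb)
imput <- mice(nhanes, print = FALSE)
# assumed usage: transform each imputed dataset and get a mids-like object back
midsobj2 <- apply_function_to_imputed_data(imput, function(df) {
  df$new.var <- df$chl / 2
  df
})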