R - Combine two mice mids objects when data frames have different columns

I'm using the mice package on two different but related data frames.
While the large majority of the variables are the same in both data frames, a small number of variables are unique to each, and the imputation is run on each data frame separately (they have slightly different imputation models, predictor matrices, etc.).
In the end, I would like to combine the two resulting mids objects,
but as the columns differ, the standard procedure via rbind() (which dispatches to rbind.mids) does not work.
Is there an easy way around this?
Two alternative approaches I could think of:
Combine the two dfs one time before imputation via dplyr::bind_rows and split them again. Now each data frame has all columns and rbind() would work after the imputations. However, that would also require defining the predictor matrix and method section again for both data frames to tell mice to ignore the new columns.
Use the mice::complete(imp_df, "long", include = TRUE) function on both mids objects, combine the resulting data frames, and use mice::as.mids() to convert them back into a single mids object. But I'm not sure if that would work or mess something else up, e.g.
Here is some example data to illustrate the issue:
library(mice)
data("nhanes2") # load test data from mice package
# make test_df 1
df_1 <- nhanes2[1:14,c(1,2,3)]
# make test_df 2
df_2 <- nhanes2[15:25, c(1,2,4)]
# quick and dirty imputation for test purpose
test_imp1 <- mice(df_1)
test_imp2 <- mice(df_2)
rbind(test_imp1, test_imp2)
Error in rbind.mids.mids(x, y, call = call) :
datasets have different variable names
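The second alternative above could look roughly like the sketch below. It is only a sketch under stated assumptions, not a verified recipe: it assumes both mids objects use the same number of imputations, that row names are unique across the two data frames, and that as.mids() accepts a where matrix (as it does in recent mice versions). The names long_1, long_2, long_all, orig, from_1 and combined are made up for illustration. The custom where matrix is there so that columns that simply do not exist in one data frame are not treated as imputed cells.

```r
library(mice)
library(dplyr)

data("nhanes2")
df_1 <- nhanes2[1:14, c(1, 2, 3)]   # age, bmi, hyp
df_2 <- nhanes2[15:25, c(1, 2, 4)]  # age, bmi, chl

test_imp1 <- mice(df_1, seed = 1, printFlag = FALSE)
test_imp2 <- mice(df_2, seed = 1, printFlag = FALSE)

# Long format, keeping the original (incomplete) data as .imp == 0
long_1 <- complete(test_imp1, action = "long", include = TRUE)
long_2 <- complete(test_imp2, action = "long", include = TRUE)

# Stack both; columns that exist in only one data frame are filled with NA
long_all <- bind_rows(long_1, long_2)

# Sort by imputation number. order() is stable, so inside every .imp block
# the df_1 rows still come first, followed by the df_2 rows.
long_all <- long_all[order(long_all$.imp), ]

# Build a `where` matrix that flags only cells that were really imputed,
# not the structurally absent columns (assumption: this is the desired behaviour)
orig <- long_all[long_all$.imp == 0, setdiff(names(long_all), c(".imp", ".id"))]
where <- is.na(orig)
from_1 <- seq_len(nrow(orig)) <= nrow(df_1)
where[from_1, setdiff(colnames(where), names(df_1))] <- FALSE
where[!from_1, setdiff(colnames(where), names(df_2))] <- FALSE

combined <- as.mids(long_all, where = where)
```

The resulting object carries the imputed values but not the original imputation models, so diagnostics that depend on the chains (for example trace plots) would presumably still have to be run on test_imp1 and test_imp2 separately.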

Related

convert lists to data frames

I'm a beginner in R. I'm working with a data set to expand my knowledge, especially in data manipulation.
The task is to split my data set based on a parameter (column), then calculate the standard deviation for each group, and then produce some graphs. I split my data set into a list of about 3000 elements, but I'm stuck on converting the list elements into separate data sets so I can collect the SD for each one. Or is there a more efficient way to do it in one step?
This is what I did so far.
library(dplyr) # provides select()
xx <- read.table("NetworkRail1.csv", sep = ",", header = TRUE)
selected <- select(xx, ID, Location, Top70m)
splitNR <- split(selected, selected$Location %% 0.125)
If your problem is to save the different components of the list separately, you can try this:
list2env(splitNR, envir=.GlobalEnv)
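If the goal is only the standard deviation per group, the list does not need to be unpacked at all. The following is a minimal sketch on made-up data: the toy data frame and its values are hypothetical stand-ins for the CSV, and the grouping uses Location directly rather than the modulo expression from the question.

```r
# Hypothetical stand-in for the data read from the CSV file
toy <- data.frame(
  ID       = 1:6,
  Location = c(0.1, 0.1, 0.1, 0.3, 0.3, 0.3),
  Top70m   = c(1, 2, 3, 10, 20, 30)
)

# Split the measurement column by group and apply sd() to every piece;
# the result is a named numeric vector with one SD per group
sd_by_group <- sapply(split(toy$Top70m, toy$Location), sd)
sd_by_group
```

The same idea should scale to thousands of groups, and the named vector can be passed straight to plotting functions such as barplot() or hist().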

How to extract several imputed values from several variables using mice or another package in R into a single dataset?

From the multiple imputation output (e.g., object of class mids for mice) I want to extract several imputed values for some of the imputed variables into a single dataset that also includes original data with the missing values.
Here are sample dataset and code:
library("mice")
nhanes
tempData <- mice(nhanes, seed = 23109)
Using the code below I can extract these values for each variable into separate datasets:
age_imputed<-as.data.frame(tempData$imp$age)
bmi_imputed<-as.data.frame(tempData$imp$bmi)
hyp_imputed<-as.data.frame(tempData$imp$hyp)
chl_imputed<-as.data.frame(tempData$imp$chl)
But I want to extract several variables together, preserving the order of the rows for further analysis.
I would appreciate any help.
Use the complete() function from the mice package to extract the completed data set, including the imputations:
complete(tempData, action = 1)
The action argument takes the imputation number, or a keyword such as "all" or "long" to return all imputations in the corresponding format. See the mice documentation for complete() for details.
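To get all imputations of all variables next to the original data in one data frame, the "long" action with include = TRUE may be the closest fit. This is a small sketch; long_data and incomplete_rows are names invented here, and the comparison goes through as.character() on the assumption that .id holds either the row names or the row numbers, depending on the mice version.

```r
library(mice)

tempData <- mice(nhanes, seed = 23109, printFlag = FALSE)

# One stacked data frame: .imp == 0 is the original data with its NAs,
# .imp == 1, 2, ... are the completed data sets; .id identifies the row
long_data <- complete(tempData, action = "long", include = TRUE)

# Keep only the rows that had at least one missing value originally,
# so the imputed values can be compared across imputations
incomplete_rows <- rownames(nhanes)[!complete.cases(nhanes)]
imputed_only <- long_data[as.character(long_data$.id) %in% incomplete_rows, ]
head(imputed_only)
```

Because every imputation block keeps the rows in the original order, the row order is preserved within each value of .imp.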

Separating data frame based on column values

I am having a bit of trouble with trying to script a code in R so that it separates a data frame based on the character in a data frame column without manually specifying a subset command. Below is the script for reproduction in R:
a=c("Model_A","R1",358723.0,171704.0,1.0,36.818500,4.0222700,1.38895000)
b=c("Model_A","R2",358723.0,171704.0,2.6,36.447300,4.0116100,1.37479000)
c=c("Model_A","R3",358723.0,171704.0,5.0,35.615400,3.8092600,1.34301000)
d=c("Model_B","R1",358723.0,171704.0,1.0,39.818300,2.4475600,1.50384000)
e=c("Model_B","R2",358723.0,171704.0,2.6,39.391600,2.4209900,1.48754000)
f=c("Model_B","R3",358723.0,171704.0,5.0,38.442700,2.3618400,1.45126000)
g=c("Model_C","R1",358723.0,171704.0,1.0,31.246400,2.2388000,1.30652000)
h=c("Model_C","R2",358723.0,171704.0,2.6,30.911600,2.2144800,1.29234000)
i=c("Model_C","R3",358723.0,171704.0,5.0,30.166700,2.1603000,1.26077000)
df=data.frame(a,b,c,d,e,f,g,h,i)
df=t(df)
df=data.frame(df)
col_list=list("Model","Receptor.name","X(m.)","Y(m.)","Z(m.)",
"nox","PM10","PM2.5")
colnames(df)=col_list
Essentially what I am trying is to separate the data frame (df) by the Model names ("Model_A", "Model_B", and "Model_C") and store them in new and different data frames. I have been trying to use the following command
df_test=split(df,with(df,interaction(Model,Model)), drop = TRUE)
This command separates the data frame but stores the pieces in a list, and I don't know how to extract the list elements individually and store them as data frames. Is there a simpler solution (avoiding the subset command if possible, as I need the script to be dynamic), or does anyone know how to turn the list produced by the last command into individual data frames? Also, is it possible to name each data frame after its model?
I apologize if these are a lot of questions but any help would be hugely appreciated! Thank you!
list2env(split(df, df$Model), envir = .GlobalEnv) will give you three dataframes in your global environment, named after the models, containing the relevant rows.
> Model_A
Model Receptor.name X(m.) Y(m.) Z(m.) nox PM10 PM2.5
a Model_A R1 358723 171704 1 36.8185 4.02227 1.38895
b Model_A R2 358723 171704 2.6 36.4473 4.01161 1.37479
c Model_A R3 358723 171704 5 35.6154 3.80926 1.34301
Although I would just keep the list of three dataframes by only using dflist <- split(df, df$Model).
Why a list? Lists allow you the use of lapply - a looping function that applies an operation over every list element. A quick example: Let's say you'd want to get a frequency table for both PM variables in your data for all three datasets.
For single elements in your global environment this would be
table(Model_A$PM10)
table(Model_A$PM2.5)
...
table(Model_C$PM2.5)
With a list, it would be
lapply(dflist, function(x) table(x["PM10"]))
lapply(dflist, function(x) table(x["PM2.5"]))
Right now, it seems to only save some lines of code, but better yet, the output of lapply is again a list, which you can store in an object and further use for different operations. Due to this, you can have a global environment with only a few objects in it, each being lists which contain certain similar objects, like dataframes, tables, summaries or even plots.
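As a rough illustration of storing and reusing the lapply output, here is a sketch on a cut-down, hypothetical version of the data (only two models and the PM10 column, already numeric). The names pm10_means and pm10_summary are made up for this example.

```r
# Hypothetical, reduced version of the question's data
df <- data.frame(
  Model = rep(c("Model_A", "Model_B"), each = 2),
  PM10  = c(4.02, 4.01, 2.45, 2.42)
)

# One data frame per model, named after the model
dflist <- split(df, df$Model)

# The result of lapply is again a list with the same names ...
pm10_means <- lapply(dflist, function(x) mean(x$PM10))

# ... which can be reused, here to build a small summary table
pm10_summary <- data.frame(
  Model     = names(pm10_means),
  mean_PM10 = unlist(pm10_means)
)
pm10_summary
```

Note that in the question's own data every column is character, because each row was built with c() from mixed values, so the numeric columns would need as.numeric() before a summary like this.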

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' family of functions, which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists (one of data and one of chars), use reshape to produce a common column number, and then recombine the sets in each list. A simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows, but I have been stumped by the syntax for applying the following code to every element of the list, though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
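The one-liner above drops every column that contains at least one empty string. A variant that keeps the question's own test (drop a column only when all of its values are empty) could be sketched as follows; the name trimmed is made up, and stringsAsFactors = FALSE is set explicitly so the behaviour does not depend on the R version.

```r
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", c(as.character(1:6), rep("", 4))),
                            c("v2", c(letters[1:6], rep("", 4)))),
                      stringsAsFactors = FALSE)

# Apply the per-set column filter from the question to every list element;
# drop = FALSE keeps the result a data frame even if one column remains
trimmed <- lapply(split(myDF, myDF$V1), function(x) {
  x[, !sapply(x, function(col) all(col == "")), drop = FALSE]
})

sapply(trimmed, ncol)
```

For the simplified example both versions should give the same result; they differ only when a column is empty in some rows of a set but not in others.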

Nested data frame

I have a technical problem which, it seems, I am not able to solve by myself. I ran an estimation with the MCMCglmm package. Through results$Sol I get access to the estimated posterior distributions. Applying class() tells me that the object is of class "mcmc". Using as.data.frame() results in a nested data frame which contains other data frames (one data frame which contains many other data frames). I would like to rbind() all data frames within the main data frame in order to produce one data frame (or rather a vector) with all values of all posterior distributions and the name of the (secondary) data frame as a row name. Any ideas? I would be grateful for every hint!
Update: I didn't manage to produce a useful data set for the purposes of Stack Overflow; with all these sampling chains the data sets would always be too large. If you want to help me, please consider running the following example model
require(MCMCglmm)
data(PlodiaPO)
result <- MCMCglmm(PO ~ plate + FSfamily, data = PlodiaPO, nitt = 50, thin = 2, burn = 10, verbose = FALSE)
result$Sol (an mcmc object) is where all the chains are stored. I want to rbind all chains in order to have a vector with all values of all posterior distributions and the variable names as rownames (or since no duplicated rownames are allowed, as an additional character vector).
I can't (using the example code from MCMCglmm) construct an example where as.data.frame(model$Sol) gives me a dataframe of dataframes. So although there's probably a simple answer I can't check it very easily.
That said, here's an example that might help. Note that if your child dataframes don't have the same colnames then this won't work.
# create a nested data.frame example to work on
a.df <- data.frame(c1=runif(10),c2=runif(10))
b.df <- data.frame(c1=runif(10),c2=runif(10))
full.df <- data.frame(1:10)
full.df$a <- a.df
full.df$b <- b.df
full.df <- full.df[,c("a","b")]
# the solution
res <- do.call(rbind,full.df)
EDIT
Okay, using your new example,
require(MCMCglmm)
require(reshape2) # melt() is assumed to come from reshape2
data(PlodiaPO)
result <- MCMCglmm(PO ~ plate + FSfamily, data = PlodiaPO, nitt = 50, thin = 2, burn = 10, verbose = FALSE)
melt(do.call(rbind, as.data.frame(result$Sol)))
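A base R alternative that avoids melt() might be stack(), which turns the columns of a data frame into one value column plus a factor holding the column name. The sketch below uses a tiny made-up matrix as a stand-in for result$Sol (an assumption: the real object would first be converted with as.data.frame() as in the answer above); sol and long_chains are hypothetical names.

```r
# Hypothetical stand-in for result$Sol: one column per parameter,
# one row per stored iteration
sol <- matrix(1:6, nrow = 3,
              dimnames = list(NULL, c("(Intercept)", "plate")))

# stack() returns a data frame with the columns `values` and `ind`,
# where `ind` records which parameter each value belongs to
long_chains <- stack(as.data.frame(sol))
long_chains
```

This keeps the variable names in a separate column rather than as row names, which sidesteps the restriction that row names must be unique.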
