Simulating data from a data frame using ddply - r

I have some plant data almost identical to the 'iris' data set. I would like to simulate new data using a normal distribution. So for each variable~species in the iris data set I would create 10 new observations from a normal distribution. Basically it would just create a new data frame with the same structure as the old one, but it would contain simulated data. I feel that the following code should get me started (I think the data frame would be in the wrong form), but it will not run.
ddply(iris, c("Species"), function(x) data.frame(rnorm(n=10, mean=mean(x), sd=sd(x))))
rnorm is returning an atomic vector so ddply should be able to handle it.

ddply will subset the rows by Species, but you're doing nothing in the function to iterate over the columns of each subset data.frame. You cannot get rnorm() to return a list or data.frame for you; you will need to assist with the shaping. How about
ddply(iris, c("Species"), function(x) {
data.frame(lapply(x[,1:4], function(y) rnorm(10, mean(y), sd(y))))
})
Here we calculate new values for the first 4 columns in each group.
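For comparison, here is a minimal base-R sketch of the same idea, in case plyr is not available. It assumes every numeric column should be simulated independently from a normal distribution with that column's per-species mean and sd. The seed and the name `simulated` are just illustrative choices. Note that in current R versions mean() applied to a whole data frame returns NA with a warning, which is why the work has to happen column by column.

```r
set.seed(42)  # arbitrary seed so the draws are reproducible

# Split iris by species, then simulate each numeric column separately.
simulated <- do.call(rbind, lapply(split(iris, iris$Species), function(g) {
  numeric_cols <- g[sapply(g, is.numeric)]
  # 10 draws per column, using that column's mean and sd within the species
  sims <- as.data.frame(lapply(numeric_cols, function(y) rnorm(10, mean(y), sd(y))))
  sims$Species <- g$Species[1]  # re-attach the grouping label
  sims
}))
rownames(simulated) <- NULL

str(simulated)
```

The result has the same column layout as iris, with 10 simulated rows per species.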

Related

R - Combine two mice mids objects when data frames have different columns

I'm using the mice package on two different but related data frames.
While the large majority of the variables are the same for both data frames, a small number of variables are unique to each data frame and the imputation happens for both data frames separately (they have slightly different imputation models/ predictor matrices, etc.).
In the end, I would like to combine the two resulting mids objects,
but as the columns differ, the standard procedure via rbind() (the rbind.mids method is actually called) does not work.
Is there an easy way around this?
Two alternative approaches I could think of:
Combine the two dfs one time before imputation via dplyr::bind_rows and split them again. Now each data frame has all columns and rbind() would work after the imputations. However, that would also require defining the predictor matrix and method section again for both data frames to tell mice to ignore the new columns.
Use the mice::complete(imp_df, "long", include = TRUE) function on both mids objects, combine the resulting data frames, and use mice::as.mids() to convert them back into a single mids object. But I'm not sure if that would work or mess something else up, e.g.
Here is some example data to illustrate the issue
library(mice)
data("nhanes2") # load test data from mice package
# make test_df 1
df_1 <- nhanes2[1:14,c(1,2,3)]
# make test_df 2
df_2 <- nhanes2[15:25, c(1,2,4)]
# quick and dirty imputation for test purpose
test_imp1 <- mice(df_1)
test_imp2 <- mice(df_2)
rbind(test_imp1, test_imp2)
Error in rbind.mids.mids(x, y, call = call) :
datasets have different variable names

Calculate sd and mean for many data frames in R

I have data frames called bmw_1, bmw_2, ..., bmw_9 and I want to calculate the standard deviation and mean for each data frame, but I don't want to write
mean(bmw_1)
mean(bmw_2)
mean(bmw_3)
...
mean(bmw_9)
many times, so any help please
As mentioned in the comment, the best way is to put the data frames into a list so you can apply a function over each.
Get all dfs into a list by name pattern:
ls_bmw <- mget(ls(pattern = "bmw_"))
Then apply the mean.
result <- lapply(ls_bmw, mean)
Difficult to go much further without a data example, but to get the results alongside the data frame name use:
names(ls_bmw)
... to get a vector of the df names and:
unlist(result)
... to get a vector of the results. The order of the names and the result elements will match, so you can combine them into a single result data frame.
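Since no example data was given, here is a rough sketch with made-up data. The two toy data frames and the `price` column are hypothetical. It computes the mean and sd per numeric column, because in current R versions mean() called directly on a data frame returns NA with a warning.

```r
# Hypothetical stand-ins for the real bmw_* data frames
bmw_1 <- data.frame(price = c(10, 12, 14))
bmw_2 <- data.frame(price = c(20, 25, 30))

# Collect every object whose name looks like bmw_<number> into a named list
ls_bmw <- mget(ls(pattern = "^bmw_[0-9]+$"))

# One output row per data frame and numeric column
stats <- do.call(rbind, lapply(names(ls_bmw), function(nm) {
  num <- ls_bmw[[nm]][sapply(ls_bmw[[nm]], is.numeric)]
  data.frame(df = nm,
             column = names(num),
             mean = sapply(num, mean),
             sd = sapply(num, sd),
             row.names = NULL)
}))

print(stats)
```

Anchoring the pattern with `^` and `$` is a precaution so that unrelated objects whose names merely contain "bmw_" are not picked up.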

For Loop Over List of Data Frames and Create New Data Frames from Every Iteration Using Variable Name

I cannot for the life of me figure out where the simple error is in my for loop to perform the same analyses over multiple data frames and output each iteration's new data frame utilizing the variable used along with extra string to identify the new data frame.
Here is my code:
john and jane are 2 data frames among many I am hoping to loop over and compare to bcm to find duplicate results in rows.
x <- list(john,jane)
for (i in x) {
test <- rbind(bcm,i)
test$dups <- duplicated(test$Full.Name,fromLast=T)
test$dups2 <- duplicated(test$Full.Name)
test <- test[which(test$dups==T | test$dups2==T),]
newname <- paste("dupl",i,sep=".")
assign(newname, test)
}
Thus far, I can either get the naming to work correctly without including the x data or the loop to complete correctly without naming the new data frames correctly.
Intended Result: I am hoping to create new data frames dupl.john and dupl.jane to show which rows are duplicated in comparison to bcm.
I understand that lapply() might be better to use and am very open to that form of solution. I could not figure out how to use it to solve my problem, so I turned to the more familiar for loop.
EDIT:
Sorry if I'm not being more clear. I have about 13 data frames in total that I want to run the same analysis over to find the duplicate rows in $Full.Name. I could do the first 4 lines of my loop and then dupl.john <- test 13 times (for each data frame), but I am purposely trying to write a for loop or lapply() to gain more knowledge in R and because I'm sure it is more efficient.
If I understand your intended result correctly, match_df could be an option.
library(plyr)
dupl.john <- match_df(john, bcm)
dupl.jane <- match_df(jane, bcm)
dupl.john and dupl.jane will be both data frames and both will have the rows that are in these data frames and bcm. Is this what you are trying to achieve?
EDITED after the first comment
library(plyr)
l <- list(john, jane)
res <- lapply(l, function(x) {match_df(x, bcm, on = "Full.Name")} )
dupl.john <- res[[1]]
dupl.jane <- res[[2]]
Now, res will have a list of the data frames with the matches, based on the column "Full.Name".
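If you would rather stay close to the original loop, here is a base-R sketch that uses a named list, so the output names come from the list names rather than from the data frames themselves. In the loop, `i` is the data frame, not its name, so paste("dupl", i, sep = ".") cannot produce a single object name. The toy data below is made up for illustration.

```r
# Made-up example data; only the Full.Name column matters here
bcm  <- data.frame(Full.Name = c("Ann Lee", "Bob Ray", "Cy Dow"))
john <- data.frame(Full.Name = c("Bob Ray", "Dee Fox"))
jane <- data.frame(Full.Name = c("Ann Lee", "Cy Dow", "Eve Gil"))

# A named list keeps each data frame together with its label
frames <- list(john = john, jane = jane)

dupl <- lapply(frames, function(d) {
  test <- rbind(bcm, d)
  # flag a row if its name occurs more than once, in either direction
  is_dup <- duplicated(test$Full.Name) | duplicated(test$Full.Name, fromLast = TRUE)
  test[is_dup, , drop = FALSE]
})
names(dupl) <- paste("dupl", names(frames), sep = ".")

dupl[["dupl.john"]]
```

Keeping the results in the list `dupl` avoids assign(); with 13 data frames you only need to extend `frames`.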

how to make groups of variables from a data frame in R?

Dear Friends I would appreciate if someone can help me in some question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to produce a specific statistic based on these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time.
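To get every non-empty subset rather than only the subsets of size 3, one option is to call combn once per subset size and flatten the result. This is a small sketch; the names `varnames` and `subsets` are only illustrative.

```r
varnames <- paste0("V", 1:8)

# For each size m = 1..8, list all subsets of that size, then flatten one level
subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)  # number of non-empty subsets, 2^8 - 1
subsets[[9]]     # first subset of size 2
```

Each element of `subsets` is a character vector of column names, so yourdataframe[, subsets[[k]], drop = FALSE] would select the corresponding columns.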
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})

Nested data frame

I have a technical problem which, it seems, I am not able to solve by myself. I ran an estimation with the MCMCglmm package. Through results$Sol I get access to the estimated posterior distributions. Applying class() tells me that the object is of class "mcmc". Using as.data.frame() results in a nested data frame which contains other data frames (one data frame which contains many other data frames). I would like to rbind() all data frames within the main data frame in order to produce one data frame (or rather a vector) with all values of all posterior distributions and the name of the (secondary) data frame as a rowname. Any ideas? I would be grateful for every hint!
Update: I didn't manage to produce a useful data set for the purpose of stackoverflow; with all these sampling chains the data sets would always be too large. If you want to help me, please consider running the following example model
require(MCMCglmm)
data(PlodiaPO)
result <- MCMCglmm(PO ~ plate + FSfamily, data = PlodiaPO, nitt = 50, thin = 2, burn = 10, verbose = FALSE)
result$Sol (an mcmc object) is where all the chains are stored. I want to rbind all chains in order to have a vector with all values of all posterior distributions and the variable names as rownames (or since no duplicated rownames are allowed, as an additional character vector).
I can't (using the example code from MCMCglmm) construct an example where as.data.frame(model$Sol) gives me a dataframe of dataframes. So although there's probably a simple answer I can't check it very easily.
That said, here's an example that might help. Note that if your child dataframes don't have the same colnames then this won't work.
# create a nested data.frame example to work on
a.df <- data.frame(c1=runif(10),c2=runif(10))
b.df <- data.frame(c1=runif(10),c2=runif(10))
full.df <- data.frame(1:10)
full.df$a <- a.df
full.df$b <- b.df
full.df <- full.df[,c("a","b")]
# the solution
res <- do.call(rbind,full.df)
EDIT
Okay, using your new example,
library(reshape2) # provides melt(); assumed here, the original did not say which package
require(MCMCglmm)
data(PlodiaPO)
result<- MCMCglmm(PO ~ plate + FSfamily, data=PlodiaPO,nitt=50,thin=2,burn=10,verbose=FALSE)
melt(do.call(rbind,(as.data.frame(result$Sol))))
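If you only need a long format with one value column and one name column, base R's stack() may be enough. The sketch below uses a small made-up matrix as a stand-in for result$Sol, and it assumes that as.data.frame() on your object yields ordinary numeric columns (one per parameter) rather than nested data frames.

```r
# Made-up stand-in for result$Sol: 3 samples of 2 parameters
sol <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2,
              dimnames = list(NULL, c("intercept", "plate")))

# stack() puts all values in one column and the source column name in `ind`
long <- stack(as.data.frame(sol))

long
```

The `ind` column plays the role of the row names you asked for, which sidesteps the restriction on duplicated row names.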
