Combining lists into a dataframe more efficiently - r

I am running multiple chains of a MCMCglmm() model and I am trying to find the most efficient way to synthesize my output.
I am using mclapply() to run 4 chains and then combining each of the 4 chains into a list with lapply().
Here is my model and code to clean up and combine the chains. I am using this helpful tutorial for running the chains: https://github.com/tmalsburg/MCMCglmm-intro
Model:
library(parallel)
library(MCMCglmm)
chains <- mclapply(1:4, function(i) {
  MCMCglmm(outcome ~ 1 + pretest + race + satisfaction*race, data = data,
           random = ~ provider,
           prior = prior.1,
           verbose = TRUE,
           family = "gaussian",
           nitt = 10000,
           burnin = 5000,
           thin = 10)
}, mc.cores=4)
My cleanup is a little clunky. Is there a way to run a single lapply() call (or perhaps mapply() is what is needed) over both the fixed and random effects, so that they end up in the same list and then in one data frame? In the end, I am hoping to have a data frame so I can add or subtract posterior distributions and run summary statistics on them.
fixed <- lapply(chains, function(m) m$Sol) # Sol = fixed effects
fixed <- do.call(mcmc.list, fixed)
summary(fixed)
random <- lapply(chains, function(m) m$VCV) # VCV = variance
random <- do.call(mcmc.list, random)
summary(random)
fixed_df <- do.call(rbind, Map(data.frame, fixed))
random_df <- do.call(rbind, Map(data.frame, random))
chains_df <- cbind(fixed_df, random_df)
Ultimately, I am hoping to run one lapply() or mapply() and have a single fixed.random list of lists. I believe I can use the Map(data.frame, fixed.random) on that to create my data frame. My knowledge of the apply function is limited, so I'm hoping to learn more and apply it (no pun intended) to my datasets.
Unfortunately, the models output MCMC objects, so I am unable to create the exact structure. This is the best I can come up with:
list1 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list2 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list3 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list4 <- list(a = rnorm(100, 0, 1), b = rnorm(100, 0, 1))
list5 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list6 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list7 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
list8 <- list(d = rnorm(100, 0, 1), e = rnorm(100, 0, 1))
fixed <- list(list1, list2, list3, list4)
random <- list(list5, list6, list7, list8)

Would the following do?
Say your four_mcmc is a list of models of the class "MCMCglmm" (chain1, chain2, etc.) and extract is the list of elements you want to read from the chains (in your case the fixed ("Sol") and random terms ("VCV")).
## The list of mcmcs
four_mcmc <- list(chain1, chain2, chain3, chain4)
## Which elements to extract from the MCMCs
extract <- c("VCV", "Sol")
You can use a get.elements function to extract a list of the selected elements from a single chain:
## Extracting some specific elements from a chain
get.elements <- function(extract, mcmc) {
  ## Extracting each requested element by name
  sapply(extract, function(extract) mcmc[which(names(mcmc) == extract)])
}
## Extracting the VCV and Sol from one chain
str(get.elements(extract, chain1))
You can then simply apply this function to your list of chains:
## Applying get.elements to each chain, for each element to extract
all_elements <- lapply(four_mcmc, function(mcmc, extract) get.elements(extract, mcmc), extract)
You can then easily summarise each set of terms as a data frame, with the chains side by side as columns:
## Fixed terms table ("Sol" is the second entry of extract)
fixed_terms <- as.data.frame(lapply(all_elements, function(X) X[[2]]))
## Random terms table ("VCV" is the first entry of extract)
random_terms <- as.data.frame(lapply(all_elements, function(X) X[[1]]))
This code is simplified from the read.mulTree function from https://github.com/TGuillerme/mulTree.
[edit]
@headpoint suggested simply using:
as.data.frame(lapply(chains, function(m) cbind(m$Sol, m$VCV)))
This is more elegant but could be less portable.
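If the goal is one long data frame of posterior draws that still records which chain each draw came from, the one-liner above can be extended along these lines. This is only a minimal sketch on mock data: mock_chain() is a hypothetical stand-in for a fitted model, it returns plain matrices named Sol and VCV rather than real MCMCglmm output, and the column names are made up.

```r
set.seed(42)

# Hypothetical stand-in for one fitted chain: plain matrices named like the
# Sol (fixed effects) and VCV (variance components) slots of an MCMCglmm fit.
mock_chain <- function(n_draws = 500) {
  list(
    Sol = matrix(rnorm(n_draws * 2), ncol = 2,
                 dimnames = list(NULL, c("(Intercept)", "pretest"))),
    VCV = matrix(rexp(n_draws * 2), ncol = 2,
                 dimnames = list(NULL, c("provider", "units")))
  )
}
chains <- lapply(1:4, function(i) mock_chain())

# One pass over the chains: bind fixed and random draws column-wise,
# tag each row with its chain number, then stack the chains row-wise.
chains_df <- do.call(rbind, lapply(seq_along(chains), function(i) {
  m <- chains[[i]]
  data.frame(chain = i, cbind(m$Sol, m$VCV), check.names = FALSE)
}))

# Posterior arithmetic and summaries then work on ordinary columns.
summary(chains_df[["(Intercept)"]] + chains_df[["pretest"]])
```

Keeping a chain column makes it possible to summarise per chain later (for example with split() or aggregate()) without rebuilding the list.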

Related

Sequentially multiply matrices from a split list by a list of storage arrays at time t-1 (recursively) in r

See this original post for more details:
Sequentially multiply matrices from a list by a vector at time t-1 (recursively)
I am trying to accomplish the same thing, this time with sublists of matrices contained within a list, each of which is then matrix-multiplied recursively against a list of storage arrays. Additionally, in the commented-out line of the for loop, I am trying to place a cap on the resulting 4th row in all resulting list items. Help extending that concept would also be appreciated. Thanks.
library(plyr)
library(popbio)
library(ggplot2)
library(tidyr)
library(dplyr)
library(reshape2)
library(simpleboot)
library(boot)
library(reshape)
library(vctrs)
# setting seed for replication purposes and creating function to project. replicating with bootstrap
set.seed(123)
# vector of egg survival morts
egg.to.fry.s <- vec_rep(c(seq(from = 0.40, to = 0.89,by=0.01)),10)
# vector of fry survival morts
fry.to.one.s <- vec_rep(c(seq(from = 0.30, to = 0.79,by=0.01)),10)
# one.to.two rates
one.to.two.s <- rbeta(500,5,4)
A <- lapply(1:500, function(x) # construct list of matrices
  matrix(c(0, 0, 0, 0,
           0, 0, 0, 0,
           0, 0, 0, 0,
           0, 0, 0, 0), nrow = 4, ncol = 4, byrow = TRUE))
Anew <- A
for(t in seq_along(Anew)) {
  Anew[[t]][2,1] <- egg.to.fry.s[t]
  Anew[[t]][3,2] <- fry.to.one.s[t]
  Anew[[t]][4,3] <- one.to.two.s[t]
}
AnewSplit <- split(Anew, rep(1:10, each = 50)) # split list into lists to represent each sim
n <- c(10000,1000,100,10) # initial vector of abundances
nYears = 50 # define the number of years to project over
allYears <- matrix(0,nrow=4,ncol=nYears+1) # build a storage array for all abundances
allYears[,1] <- n # set the year 0 abundance
allYearsarray<-replicate(10,allYears)
allYearslist <- alply(allYearsarray,3)
i1 <- length(allYearslist)
i2 <- 2:51
# matrix multiply each list of 50 sequentially in AnewSplit * each list in allYears list and each column - 1
for(t in 1:i1) {
for(j in seq_along(i2)){
allYearslist[[t]][,i2[j]] <- AnewSplit[[t]]%*% allYearslist[[t]][,i2[j]-1]
# allYearslist[4,i1[t]] <- ifelse (allYearslist[4,i1[t]] > 100000, 100000, allYearslist[4,i1[t]]) # to add density dependence to 4th row
}
}
I believe this to be the correct answer for the loop. The "cap" also appears to be working iteratively.
i1 <- length(allYearslist)
i2 <- 2:51
for(t in 1:i1) {
  for(j in seq_along(i2)){
    allYearslist[[t]][,i2[j]] <- AnewSplit[[t]][[j]]%*% allYearslist[[t]][,i2[j]-1]
    allYearslist[[t]][4,i2[j]] <- ifelse (allYearslist[[t]][4,i2[j]] > 2000, 2000, allYearslist[[t]][4,i2[j]]) # to add density dependence to 4th row
  }
}
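The same recursion can be checked on a small case where the numbers are easy to verify by hand. The sketch below is illustrative only: make_mat() and project() are hypothetical helper names, the survival rates are constant rather than drawn at random, and the cap of 40 is chosen just to make the effect visible.

```r
# Hypothetical helper: 4x4 matrix with survival rates on the sub-diagonal.
make_mat <- function(s) {
  m <- matrix(0, nrow = 4, ncol = 4)
  m[2, 1] <- s[1]
  m[3, 2] <- s[2]
  m[4, 3] <- s[3]
  m
}

# Project one simulation: column j + 1 is matrix j times column j,
# with the 4th row capped before it is stored.
project <- function(mats, n0, cap = Inf) {
  out <- matrix(0, nrow = length(n0), ncol = length(mats) + 1)
  out[, 1] <- n0
  for (j in seq_along(mats)) {
    nxt <- mats[[j]] %*% out[, j]
    nxt[4] <- min(nxt[4], cap)
    out[, j + 1] <- nxt
  }
  out
}

# Two simulations of three years each, stored as a list of lists of matrices.
sims <- list(
  sim1 = replicate(3, make_mat(c(0.5, 0.5, 0.5)), simplify = FALSE),
  sim2 = replicate(3, make_mat(c(0.25, 0.25, 0.25)), simplify = FALSE)
)
res <- lapply(sims, project, n0 = c(10000, 1000, 100, 10), cap = 40)
res$sim1
```

Wrapping the inner loop in a function lets lapply() handle the outer loop over simulations, and because the cap is applied before the column is stored, it carries into the next year's multiplication.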

How to use mapply to perform function on columns of two lists in R?

I am trying to calculate a population genetic statistic (mean FST) on the nth elements of two lists in R. This statistic requires the population variance, which R does not have a native function for. I cannot figure out how to convert my code, which works for a dataframe (2 rows), into code that works for 2 lists.
Here is an example of code that works for the two row dataframe:
#Generate dataframe data
popa <- abs(rnorm(10, mean = 0, sd = 0.5))
popb <- 1-abs(rnorm(10, mean = 0, sd = 0.5))
totalpop <- rbind(popa, popb)
#Make population variance function
pvar <- function(x) {
  sum((x - mean(x))^2) / length(x)
}
#Calculate statistic on each column, and take the global mean
MeanFST <- mean(apply(totalpop, 2, pvar)/((apply(totalpop, 2, mean))*(1-(apply(totalpop, 2, mean)))))
But I don't know how to convert this code to one that operates on lists
#Generate lists data
listA <- list()
listB <- list()
for(i in 1:30){
listA[[i]] <- abs(rnorm(10, mean = 0, sd = 0.5))
listB[[i]] <- 1 - abs(rnorm(10, mean = 0, sd = 0.5))
}
I've tried using Map like this,
results <- Map(function(X,Y) {
mean(apply(totalpop, 2, pvar)/((apply(totalpop, 2, mean))*(1-(apply(totalpop, 2, mean)))))
}, X = listA, Y = listB)
but obviously this won't work, because it requires "totalpop" which doesn't exist when I am calculating stats on two lists (only when I combine two dataframes).
How can I make this MeanFST calculation on two lists?
We can use lapply/sapply after splitting the data by column
mean(sapply(asplit(totalpop, 2), function(x) pvar(x)/(mean(x) * (1 - mean(x)))))
Or with apply and a lambda function
mean(apply(totalpop, 2, function(x) pvar(x)/(mean(x) * (1 - mean(x)))))
For the Map part, rbind the two list elements (this mirrors how totalpop was built, one row per population) and use the same code with apply or sapply
unlist(Map(function(u, v) mean(apply(rbind(u, v), 2,
  function(x) pvar(x)/(mean(x) * (1 - mean(x))))), listA, listB))
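A self-contained version of the list calculation can be sketched as follows. It assumes each pair of list elements should be stacked as two rows, the way totalpop was built in the question; fst_stat() is a hypothetical helper name, and the lists are generated with replicate() instead of the for loop.

```r
set.seed(1)

# Population variance (divides by n rather than n - 1).
pvar <- function(x) sum((x - mean(x))^2) / length(x)

# Hypothetical helper: mean FST-like statistic for one pair of populations.
fst_stat <- function(a, b) {
  pops <- rbind(a, b)        # one row per population, one column per locus
  p_bar <- colMeans(pops)    # mean across the two populations at each locus
  mean(apply(pops, 2, pvar) / (p_bar * (1 - p_bar)))
}

listA <- replicate(30, abs(rnorm(10, mean = 0, sd = 0.5)), simplify = FALSE)
listB <- replicate(30, 1 - abs(rnorm(10, mean = 0, sd = 0.5)), simplify = FALSE)

# mapply() walks the two lists in parallel and simplifies to a numeric vector.
results <- mapply(fst_stat, listA, listB)
head(results)
```

mapply() returns a plain numeric vector here, so the unlist() step needed with Map() is not required.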

Extract co-linear columns name - R

Based on the answer for this question and its script, how can I print to the console the co-linear columns names?
Script:
library(corrplot)
library(caret)
x <- seq(0, 100, 1)
# collinear with x
y <- x + 2.3
# almost collinear with x / some small gaussian noise
z <- x + rnorm(mean = 0, sd = 5, n = 101)
# uncorrelated gaussian
w <- rnorm(mean = 0, sd = 1, n = 101)
# every term needs 101 values so that a has as many rows as the other columns
a <- z + seq(100, 200, 1)/.33 + rnorm(mean = 0, sd = 5, n = 101)
b <- a -2.3
# this frame is made to exemplify the procedure
df <- data.frame(x = x, y = y, z = z, w = w, a=a, b=b)
corrplot(cor(df))
#drop perfectly multicollinear variables
constant<-rep(1,nrow(df))
tmp<-lm(constant ~ ., data=df)
to_keep<-tmp$coefficients[!is.na(tmp$coefficients)]
to_keep<-names(to_keep[-which(names(to_keep) == "(Intercept)")])
df_result<-df[to_keep]
corrplot(cor(df_result))
You want the variables not included in to_keep. Based on how to_keep is defined, you can write to_drop <- tmp$coefficients[is.na(tmp$coefficients)] to get the coefficients with NA values (there are no estimates for those variables because they are collinear with others). To print their names, use print(names(to_drop)).
However, keep in mind that (1) this only drops perfectly collinear variables, and does so in a hacky way, and (2) the choice of which variable to drop from a set of perfectly collinear variables is rather arbitrary: it depends on the order of the variables in your data.
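A reduced example makes the mechanism easier to see. This sketch uses a hypothetical three-column data frame rather than the one from the question, and needs neither corrplot nor caret.

```r
set.seed(1)
x <- seq(0, 100, 1)
df <- data.frame(
  x = x,
  y = x + 2.3,          # perfectly collinear with x (plus the intercept)
  w = rnorm(length(x))  # unrelated noise
)

# Regress a constant on every column; lm() leaves an NA coefficient for any
# column that is a perfect linear combination of earlier terms.
constant <- rep(1, nrow(df))
tmp <- lm(constant ~ ., data = df)

# Names of the columns whose coefficients could not be estimated.
to_drop <- names(tmp$coefficients)[is.na(tmp$coefficients)]
print(to_drop)
```

Swapping the order of x and y in the data frame changes which of the two is reported, which is the arbitrariness mentioned above.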

How to run all "N take 2" iterations of a function?

I've written a simple correlation function that takes in three variables. "A" and "B" are numerical vectors of equal length, and "n" is the length.
Corr.fxn <- function(A, B, n){
Correlation <- (sum((A - mean(A))*(B - mean(B))) / (n-1)) / (sd(A)*sd(B))
return(Correlation)
}
The function works well enough, but I have many vectors I want to process. What's the best way to modify this code to process all "N take 2" unique analyses for my set of vectors "N"?
EDIT:
Example data showing the structure of the vectors:
A <- c(-1, 0, 1, -1, 0, 1, -1, 0, 1)
B <- c(1, 1, -1, 0, 1, -1, 0, 0, 1)
...
n <- length(A)
So let's say I have vectors A through Z and I want to modify my code to output a new vector containing all {26 take 2} correlation values.
Here is one possible way you can do it assuming you have a bunch of numeric vectors in a list v as follows:
v <- list()
for (i in 1:10) {
v[[i]] <- sample(1:10, 10, replace = TRUE)
}
apply(combn(1:10, 2), 2, function(x) Corr.fxn(v[[x[1]]], v[[x[2]]], length(v[[x[1]]])))
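When the vectors are named, the same combn() pattern can label each result with its pair. The sketch below is one way to do that under assumed data: five random vectors named A to E stand in for the real ones, and the original three-argument Corr.fxn is repeated so the block runs on its own.

```r
# The three-argument function from the question.
Corr.fxn <- function(A, B, n) {
  (sum((A - mean(A)) * (B - mean(B))) / (n - 1)) / (sd(A) * sd(B))
}

set.seed(7)
v <- replicate(5, rnorm(20), simplify = FALSE)  # assumed example data
names(v) <- LETTERS[1:5]

# Each column of pairs holds the names of one unique pair.
pairs <- combn(names(v), 2)

pair_cor <- apply(pairs, 2, function(p) {
  Corr.fxn(v[[p[1]]], v[[p[2]]], length(v[[p[1]]]))
})
names(pair_cor) <- apply(pairs, 2, paste, collapse = "-")
pair_cor
```

Because Corr.fxn computes the Pearson correlation, the values can be checked against cor() on the same pairs.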
In this answer, I assume 2 things. First, you want to write a function yourself, since otherwise you can use Hmisc::rcorr. Second, you want the "N take 2" part to be inside the function, otherwise the ways suggested earlier are correct. In that case, you can do this:
Corr.fxn <- function(vectors, n){
  pairs <- combn(length(vectors), 2)
  npairs <- ncol(pairs)
  # square matrix: one row and one column per vector
  cor.mat <- matrix(NA, nrow = length(vectors), ncol = length(vectors))
  for (i in seq_len(npairs)){
    A <- vectors[[pairs[1, i]]]
    B <- vectors[[pairs[2, i]]]
    cor.mat[pairs[1, i], pairs[2, i]] <- (sum((A - mean(A))*(B - mean(B))) / (n-1)) /(sd(A)*sd(B))
  }
  # mirror the upper triangle; the transpose keeps each pair aligned
  cor.mat[lower.tri(cor.mat)] <- t(cor.mat)[lower.tri(cor.mat)] ###
  diag(cor.mat) <- 1 ###
  cor.mat <- data.frame(cor.mat) ###
  row.names(cor.mat) <- colnames(cor.mat) <- names(vectors) ###
  return(cor.mat)
}
The lines that end in ### are there for decorative reasons. The main input is a list called "vectors". So it works as follows:
A<- runif(100, 1, 100)
B<- runif(100, 30, 50)
C<- runif(100, 120, 200)
> Corr.fxn(list(A=A, B=B, C=C), n=100)
           A           B           C
A  1.0000000 -0.11800104 -0.13980458
B -0.1180010  1.00000000  0.04933581
C -0.1398046  0.04933581  1.00000000

Passing multiple arguments to Reduce

I have a list of data.frames, and would like to operate on their columns, using various weights.
For example, subtracting the first columns from the second column (solved, see below); or subtracting the first and third from twice the second (unsolved).
Thanks to the generous help obtained in response to this question, I have a solution to the problem in two dimensions without weights using Reduce.
I would like to have the flexibility to operate with weights, and in higher dimensions.
What I have so far is:
priceList <- data.frame(aaa = rnorm(100, 100, 10), bbb = rnorm(100, 100, 10),
ccc = rnorm(100, 100, 10), ddd = rnorm(100, 100, 10),
eee = rnorm(100, 100, 10), fff = rnorm(100, 100, 10),
ggg = rnorm(100, 100, 10)
)
colDiff <- function(x)
{
Reduce('-', rev(x))
}
tradeLegsList <- combn(names(priceList), 3, function(x) priceList[x], simplify = FALSE)
tradeList <- lapply(tradeLegsList, colDiff)
From what I can tell, Reduce is not designed to take multiple arguments.
I can do this the long way with 2* tradeLegsList[[1]]$bbb - tradeLegsList[[1]]$aaa - tradeLegsList[[1]]$ccc, and some loops, but it doesn't seem like the R way.
Is there a way to pass in a weight vector?
Ideally, I would like to pass an argument such as w = c(-1, 2, -1) to the colDiff (or Reduce) function, or something similar.
True, Reduce is not geared to allow multiple arguments, just two for each reduction. Therefore it is easiest to premultiply the elements in the list you are Reduce-ing.
Below is a solution that does this using mapply within your colDiff function definition.
Change your definition of colDiff to allow a weight vector, and apply the weights using mapply with SIMPLIFY = FALSE.
EDIT
In light of the comments, the weighting depends on the number of columns, and there is no need for rev.
The weighting by length:
length(x) == 1 -> w = 1
length(x) == 2 -> w = c(-1, 1),
length(x) == 3 -> w = c(-1, 2, -1),
length(x) == 4 -> w = c(-1, 1, -1, +1)
weighting <- function(i){
  switch(i, 1, c(-1,1), c(-1,2,-1), c(-1,1,-1, 1))
}
colDiff <- function(x)
{
  w <- weighting(length(x))
  Reduce('+', mapply('*', x, e2 = w, SIMPLIFY = FALSE))
}
Then something like this would work
tradeList <- lapply(tradeLegsList, colDiff)
You could also keep with the functional programming theme and use Map, which is a simple wrapper for mapply with SIMPLIFY = FALSE
colDiff <- function(x)
{
  w <- weighting(length(x))
  Reduce('+', Map('*', x , e2 = w))
}
You could also predefine the weighting within the function colDiff (which may be easier).
Here weighting is a list: weighting[[2]] is the weighting for when there are 2 columns, weighting[[3]] for when there are 3.
colDiff <- function(x)
{
  weighting <- list(1, c(-1,1), c(-1,2,-1), c(-1,1,-1, 1))
  w <- weighting[[length(x)]]
  Reduce('+', Map('*', x , e2 = w))
}
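To see that the weighted Reduce gives the expected result, it can be compared with the long-hand expression from the question on a small data frame. This is a sketch on assumed data (five rows, three columns); the matrix-product line is an alternative formulation added for comparison and is not part of the answer above.

```r
set.seed(3)
prices <- data.frame(aaa = rnorm(5, 100, 10),
                     bbb = rnorm(5, 100, 10),
                     ccc = rnorm(5, 100, 10))
w <- c(-1, 2, -1)

# Scale each column by its weight, then sum the scaled columns.
via_reduce <- Reduce(`+`, Map(`*`, prices, w))

# The same weighted sum written as a matrix product.
via_matrix <- drop(as.matrix(prices) %*% w)

# Long-hand version from the question, for comparison.
long_hand <- 2 * prices$bbb - prices$aaa - prices$ccc
all.equal(via_reduce, long_hand)
```

The matrix form needs all columns to be numeric, while the Map/Reduce form works on any list of equal-length vectors.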
