Quickly sum a big list of lists? - r

I have 10000 lists (the results of a simulation), each containing 22500 lists (one per pixel of an image), each of which holds a numeric vector of length 55.
# Simple Example
m <- replicate(2, list(runif(55)))
m2 <- replicate(3, list(m))
str(m2,list.len = 3)
List of 3
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
# my function
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
#by hand
identical(m2[[1]][[1]] + m2[[2]][[1]] + m2[[3]][[1]], m3[[1]] )
I wrote a nested lapply with Reduce to sum the lists. On a small example like the one above it's fast, but on my real data it's really slow.
#slow code
m <- replicate(22500, list(runif(55)))
m2 <- replicate(10000, list(m))
str(m2,list.len = 3)
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
How can I speed this up, or should I change data structures?
Thanks.

This gives some improvement (>2x):
split(Reduce(`+`, lapply(m2, unlist)), rep(seq_along(m2[[1]]), lengths(m2[[1]])))
Since your data is essentially rectangular, had you stored it in this shape:
library(data.table)
d = rbindlist(lapply(m2, function(x) transpose(as.data.table(x))), idcol = TRUE
)[, id.in := 1:.N, by = .id]
# .id V1 V2 V55 id.in
#1: 1 0.4605065 0.09744975 ... 0.8620728 1
#2: 1 0.6666742 0.10435471 ... 0.3991940 2
#3: 2 0.4605065 0.09744975 ... 0.8620728 1
#4: 2 0.6666742 0.10435471 ... 0.3991940 2
#5: 3 0.4605065 0.09744975 ... 0.8620728 1
#6: 3 0.6666742 0.10435471 ... 0.3991940 2
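you could do the aggregation even faster with:
# sum only the value columns; .id is the simulation index and should not be summed
d[, lapply(.SD, sum), by = id.in, .SDcols = setdiff(names(d), c(".id", "id.in"))]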
But if the list is your starting point, the conversion would take up the majority of the time.
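If changing the data structure is an option, a 3-D array is another possibility. The sketch below is not taken from the answer above: it assumes every pixel vector has the same length, and the sizes `n_sim`, `n_pix` and `len` are small illustrative values rather than the real dimensions.

```r
set.seed(1)
n_sim <- 3; n_pix <- 4; len <- 55   # small illustrative sizes (assumption)

# nested list with the same shape as in the question
m2 <- replicate(n_sim,
                replicate(n_pix, runif(len), simplify = FALSE),
                simplify = FALSE)

# unlist() walks simulations, then pixels, then vector elements,
# so the values fill a len x n_pix x n_sim array in the right order
arr <- array(unlist(m2), dim = c(len, n_pix, n_sim))

# sum over the third (simulation) dimension: a len x n_pix matrix
total <- rowSums(arr, dims = 2)

# back to a list of per-pixel vectors, if that shape is still needed
m3_alt <- lapply(seq_len(n_pix), function(j) total[, j])

# reference result using the nested lapply from the question
m3 <- lapply(seq_along(m2[[1]]), function(j)
  Reduce(`+`, lapply(seq_along(m2), function(i) m2[[i]][[j]])))
all.equal(m3, m3_alt)
```

The full array has to fit in memory, so for very large runs it may be more practical to keep one `len x n_pix` matrix per simulation and accumulate them with `Reduce`.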

Related

Merging two lists of elements into one with the same elements but two variables per element

I have two lists of 48 elements. Each element in the list has one variable (DiffINT or DiffEXT below), with differing numbers of observations. The names of all of the elements are the same in both lists.
What I would like to do is merge the two lists of elements based on the element name and end up with two variables per element.
Bonus question: I have two lists of 48 elements, and both lists have the same elements. One list has one variable with one observation in it; the other list has six variables per element, with varied numbers of observations. Can I somehow merge these to accomplish the same as above?
I have reviewed other questions and tried append() and cbind() and other functions, but none of them accomplish what I want. Example of what I am looking for is below.
> str(DiffsMerged)
List of 48
$ Element1:List of 2
..$ DiffINT : num 1 0.642 0.27 -0.102 -0.123 ...
..$ DiffEXT : num 1 0.1397 -0.1045 -0.0751 -0.1414 ...
$ Element 2:List of 2
..$ DiffINT : num 1 0.5842 0.3453 0.158 -0.0259 ...
..$ DiffEXT : num 1 -0.0312 -0.0321 -0.033 -0.0339 ...
$ Element 3:List of 2
..$ DiffINT : num 1 0.908 0.816 0.724 0.632 ...
..$ DiffEXT : num 1 0.584 0.21 -0.163 -0.406
Many thanks in advance.
Edit to add: Whenever I want to view the individual lists (DiffINT and DiffEXT), I get the following error. Thoughts?
> View(DiffEXT)
Error in if (more || nchar(output) > 80) { :
missing value where TRUE/FALSE needed
You can get a simple "merge" with a lapply loop:
all_names <- union(names(DiffINT), names(DiffEXT))
DiffsMerged <- lapply(
X = all_names,
FUN = function(name) {
list(DiffINT[[name]], DiffEXT[[name]])
}
)
names(DiffsMerged) <- all_names
str(DiffsMerged)
# List of 3
# $ Element1:List of 2
# ..$ : num [1:5] 1 0.642 0.27 -0.102 -0.123
# ..$ : num [1:5] 1 0.1397 -0.1045 -0.0751 -0.1414
# $ Element2:List of 2
# ..$ : num [1:5] 1 0.1397 -0.1045 -0.0751 -0.1414
# ..$ : num [1:5] 1 -0.0312 -0.0321 -0.033 -0.0339
# $ Element3:List of 2
# ..$ : num [1:5] 1 0.908 0.816 0.724 0.632
# ..$ : num [1:5] 1 0.584 0.21 -0.163 -0.406
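The inner elements in the output above are unnamed, whereas the structure asked for in the question labels them `DiffINT` and `DiffEXT`. A minimal sketch of one way to keep those labels, assuming both lists carry the same element names; the two small lists here are made-up stand-ins for the real data:

```r
# hypothetical stand-ins for the real 48-element lists
DiffINT <- list(Element1 = c(1, 0.642, 0.27), Element2 = c(1, 0.5842))
DiffEXT <- list(Element2 = c(1, -0.0312, -0.0321), Element1 = c(1, 0.1397))

# pair elements by name (DiffEXT is reordered to match DiffINT),
# and name the two components of each merged element
DiffsMerged <- Map(function(int, ext) list(DiffINT = int, DiffEXT = ext),
                   DiffINT, DiffEXT[names(DiffINT)])
str(DiffsMerged)
```

`Map` takes the outer names from its first list argument, so the merged list keeps the element names, and the vectors may have different lengths.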
I don't know what you plan to use this data for, but it could help to keep it tidy. Only do this if both lists have the same names, and all elements have the same length.
int_df <- data.frame(DiffINT)
int_df[["source"]] <- "int"
ext_df <- data.frame(DiffEXT)
ext_df[["source"]] <- "ext"
merged_df <- rbind(int_df, ext_df)
merged_df
# Element1 Element2 Element3 source
# 1 1.0000 1.0000 1.000 int
# 2 0.6420 0.1397 0.908 int
# 3 0.2700 -0.1045 0.816 int
# 4 -0.1020 -0.0751 0.724 int
# 5 -0.1230 -0.1414 0.632 int
# 6 1.0000 1.0000 1.000 ext
# 7 0.1397 -0.0312 0.584 ext
# 8 -0.1045 -0.0321 0.210 ext
# 9 -0.0751 -0.0330 -0.163 ext
# 10 -0.1414 -0.0339 -0.406 ext

Processing nested lists in R

I have a nested list with the below structure:
Each $mu is a vector of 15 numbers. How can I calculate the average of each of the 15 elements across 1000 $mu's?
I know I can access the $mu's with hb.post.PT1$compdraw[[1]]$mu, but how can I extract or loop over them to execute functions?
I've been trying stuff like this without success:
lapply(hb.post.PT1$compdraw, function(x) lapply(hb.post.PT1$compdraw[[x]]$mu, function(x) mean))
My apologies, I cannot post example list data.
You are already looping through each list element with the first lapply. So based on your structure each iteration of the loop will be a list of 2 elements (mu and rooti). So you only need one lapply:
ll <- list(list(mu = runif(15), root = runif(10)), list(mu = runif(15), runif(10)))
str(ll)
#List of 2
# $ :List of 2
# ..$ mu : num [1:15] 0.314 0.614 0.228 0.594 0.447 ...
# ..$ root: num [1:10] 0.73 0.496 0.266 0.599 0.917 ...
# $ :List of 2
# ..$ mu: num [1:15] 0.134 0.244 0.507 0.771 0.934 ...
# ..$ : num [1:10] 0.818 0.138 0.83 0.509 0.914 ...
lapply(ll, function(x) mean(x$mu))
#[[1]]
#[1] 0.4264001
#
#[[2]]
#[1] 0.4526724
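Note that the call above returns one mean per `mu` vector. If the goal is instead the average of each of the 15 positions across all draws, as the question's wording suggests, the `mu` vectors can be bound into a matrix first. This is a sketch on simulated data; `draws` is a hypothetical stand-in for `hb.post.PT1$compdraw`:

```r
set.seed(42)
# hypothetical stand-in for hb.post.PT1$compdraw: 1000 draws with mu and rooti
draws <- replicate(1000, list(mu = runif(15), rooti = runif(10)),
                   simplify = FALSE)

# 15 x 1000 matrix: one column per draw
mu_mat <- vapply(draws, function(d) d$mu, numeric(15))

# average of each of the 15 positions across the 1000 draws
mu_avg <- rowMeans(mu_mat)
length(mu_avg)
```

`vapply` with `numeric(15)` also checks that every `mu` really has length 15, which `sapply` would not.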
You can use
library(purrr)  # provides map() and re-exports %>%
hb.post.PT1$compdraw %>%
purrr::map(function(x) {
mean(x$mu)
}) %>% unlist
This is how I would do it with purrr::imap from the tidyverse. We can apply a function that creates a tbl with the list index and the mean of that index's mu vector, and bind them into one dataframe for easy display.
library(tidyverse)
set.seed(12345)
testlist <- list(
list(mu = rnorm(15), rooti = rnorm(15)),
list(mu = rnorm(15), rooti = rnorm(15)),
list(mu = rnorm(15), rooti = rnorm(15)),
list(mu = rnorm(15), rooti = rnorm(15)),
list(mu = rnorm(15), rooti = rnorm(15))
)
str(testlist[1:2])
#> List of 2
#> $ :List of 2
#> ..$ mu : num [1:15] 0.586 0.709 -0.109 -0.453 0.606 ...
#> ..$ rooti: num [1:15] 0.817 -0.886 -0.332 1.121 0.299 ...
#> $ :List of 2
#> ..$ mu : num [1:15] 0.812 2.197 2.049 1.632 0.254 ...
#> ..$ rooti: num [1:15] 1.461 -1.413 0.567 0.583 -1.307 ...
testlist %>%
imap_dfr(.f = ~ tibble(index = .y, mu_mean = mean(.x$mu)))
#> # A tibble: 5 x 2
#> index mu_mean
#> <int> <dbl>
#> 1 1 0.0341
#> 2 2 0.448
#> 3 3 0.146
#> 4 4 -0.130
#> 5 5 0.244
Created on 2018-05-07 by the reprex package (v0.2.0).
This could be overkill for this case, but for this kind of problem in general it's worth remembering purrr::transpose:
map(transpose(hb.post.PT1$compdraw)$mu,mean)

Subsetting and replacing in a list variable nested in a dataframe

Here is my example data frame. It includes a list column named "dta" that holds, for each scenario, the n values I want to keep:
set.seed(777)
df <- data.frame(theo = numeric(),
size = numeric(),
dta = I(list()))
df[ 1: 5,"theo"] <- qlnorm(0.1, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 6:10,"theo"] <- qlnorm(0.2, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 1: 5,"size"] <- 10
df[ 6:10,"size"] <- 20
for(i in 1:10){
df$dta[i] <- list(rlnorm(df$size[i], meanlog = 0, sdlog = 1))
}
df
str(df)
This should give a df like:
theo size dta
1 0.2776062 10 1.631967....
2 0.2776062 10 0.737667....
3 0.2776062 10 0.131252....
4 0.2776062 10 1.937334....
5 0.2776062 10 0.739868....
6 0.4310112 20 4.631176....
7 0.4310112 20 2.610180....
8 0.4310112 20 0.175918....
9 0.4310112 20 3.501670....
10 0.4310112 20 0.588178....
or:
'data.frame': 10 obs. of 3 variables:
$ theo: num 0.278 0.278 0.278 0.278 0.278 ...
$ size: num 10 10 10 10 10 20 20 20 20 20
$ dta :List of 10
..$ : num 1.632 0.671 1.667 0.671 5.148 ...
..$ : num 0.738 1.056 0.152 0.967 10.089 ...
..$ : num 0.131 1.256 0.457 3.574 4.211 ...
..$ : num 1.937 2.359 3.496 0.297 4.587 ...
..$ : num 0.74 0.66 0.481 0.434 1.874 ...
..$ : num 4.631 0.298 10.28 0.933 1.286 ...
..$ : num 2.61 0.472 0.251 1.61 0.303 ...
..$ : num 0.176 0.566 2.156 0.407 3.52 ...
..$ : num 3.502 1.748 1.283 0.648 1.359 ...
..$ : num 0.588 0.392 2.447 1.926 0.86 ...
..- attr(*, "class")= chr "AsIs"
Now I want to modify that list so that:
- within each list element, every value is compared with the fixed value "theo" stored in the same row of the data frame;
- any value below or equal to "theo" is recoded as NA.
Here is working code that gives me exactly what I want:
df$dta2 <- df$dta
for(i in 1:10){
df$dta2[[i]] [ df$dta2[[i]] <= df$theo[i] ] <- NA
}
However, I was wondering if there is a way to get the same result with a single line of code and no "for loop", that is, to do a conditional replacement of values contained in a list that is nested in a data frame.
We can use Map
df$dta3 <- Map(function(x,y) replace(x, x<=y, NA), df$dta, df$theo)
all.equal(df$dta2, df$dta3, check.attributes=FALSE)
#[1] TRUE
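To see what `Map` does here on its own, the toy example below pairs each vector with its own threshold, exactly as `df$dta` is paired with `df$theo`. The names `vals` and `thresholds` are made up for illustration:

```r
# two vectors and one threshold per vector (illustrative values)
vals <- list(c(0.1, 0.5, 2), c(0.3, 0.4, 5))
thresholds <- c(0.2, 0.4)

# x is the i-th vector, y the i-th threshold
out <- Map(function(x, y) replace(x, x <= y, NA), vals, thresholds)
out
# [[1]]
# [1]  NA 0.5 2.0
#
# [[2]]
# [1] NA NA  5
```

Values equal to the threshold are replaced as well, because the comparison is `<=`.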

apply create columns function to a list r

I am new to using apply and functions together, and I am stuck and frustrated. I have two different lists, and I need to add a certain number of columns to each data frame in the first one, based on the second one. Below is the structure of the first list, which has one data frame per station; every data frame has two or more columns, one for each pressure:
> str(KDzlambdaEG)
List of 3
$ 176:'data.frame': 301 obs. of 3 variables:
..$ 0 : num [1:301] 0.186 0.182 0.18 0.181 0.177 ...
..$ 5 : num [1:301] 0.127 0.127 0.127 0.127 0.127 ...
..$ 20: num [1:301] 0.245 0.241 0.239 0.236 0.236 ...
$ 177:'data.frame': 301 obs. of 2 variables:
..$ 0 : num [1:301] 0.132 0.132 0.132 0.13 0.13 ...
..$ 25: num [1:301] 0.09 0.092 0.0902 0.0896 0.0896 ...
$ 199:'data.frame': 301 obs. of 2 variables:
..$ 0 : num [1:301] 0.181 0.182 0.181 0.182 0.179 ...
..$ 10: num [1:301] 0.186 0.186 0.185 0.183 0.184 ...
On the other hand, I have a second list that holds the number of columns I need to add after every column of each data frame in the first list:
> str(dif)
List of 3
[[176]]
[1] 4 15 28
[[177]]
[1] 24 67
[[199]]
[1] 9 53
I've tried tonnes of things, even this, using the append_col function that appears in:
How to add a new column between other dataframe columns?
for (i in 1:length(dif)){
A<-lapply(KDzlambdaEG,append_col,rep(list(NA),dif[[i]][1]),after=1)
}
but nothing seems to work so far... I have searched for answers here, but it's difficult to find specific ones as a newcomer.
Try:
indxlst <- lapply(dif, function(x) c(1, x[-length(x)]+1, x[length(x)]))
newdflist <- lapply(indxlst, function(x) data.frame(matrix(0, 2, sum(x))))
for(i in 1:length(newdflist)) {
newdflist[[i]][indxlst[[i]]] <- KDzlambdaEG[[i]]
}
Reproducible Data Test
df1 <- data.frame(x=1:2, y=c("Jan", "Feb"), z=c("A", "B"))
df3 <- df2 <- df1[,-3]
KDzlambdaEG <- list(df1,df2,df3)
x1 <- c(4,15,28)
x2 <- c(24,67)
x3 <- c(9, 53)
dif <- list(x1,x2,x3)
indxlst <- lapply(dif, function(x) c(1, x[-length(x)]+1, x[length(x)]))
newdflist <- lapply(indxlst, function(x) data.frame(matrix(0, 2, sum(x))))
for(i in 1:length(newdflist)) {
newdflist[[i]][indxlst[[i]]] <- KDzlambdaEG[[i]]
}
newdflist
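One reading of the question is that `dif[[i]][k]` empty columns should follow the k-th column of the i-th data frame. Under that assumption, the target position of each original column is its own index plus the number of blanks inserted before it. The helper `insert_blank_cols` below is a hypothetical name, and the sketch is an alternative to the code above rather than a restatement of it:

```r
# hypothetical helper: put n_after[k] NA columns after the k-th column of df
insert_blank_cols <- function(df, n_after) {
  stopifnot(length(n_after) == ncol(df))
  # position of each original column in the widened data frame
  pos <- seq_along(n_after) + cumsum(c(0, head(n_after, -1)))
  out <- as.data.frame(matrix(NA_real_, nrow(df), ncol(df) + sum(n_after)))
  out[pos] <- df                # drop the original columns into place
  names(out)[pos] <- names(df)  # keep their names; blanks stay V1, V2, ...
  out
}

# small made-up inputs with the same shape as KDzlambdaEG and dif
dfs <- list(data.frame(a = 1:2, b = 3:4, c = 5:6),
            data.frame(a = 1:2, b = 3:4))
gaps <- list(c(4, 15, 28), c(24, 67))

widened <- Map(insert_blank_cols, dfs, gaps)
sapply(widened, ncol)
```

For the first data frame the original columns end up at positions 1, 6 and 22 of 50.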

Building a list in a loop in R - getting item names correct

I have a function which contains a loop over two lists and builds up some calculated data. I would like to return these data as a list of lists, indexed by some value, but I'm getting the assignment wrong.
A minimal example of what I'm trying to do, and where I'm going wrong, would be:
mybiglist <- list()
for(i in 1:5){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
name <- paste('item:',i,sep='')
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[name]] <- append(mybiglist, tmp)
}
If you run this and look at the output mybiglist, you will see that something is going very wrong in the way each item is being named.
Any ideas on how I might achieve what I actually want?
Thanks
ps. I know that in R there is a sense in which one has failed if one has to resort to loops, but in this case I do feel justified ;-)
It works if you don't use the append command:
mybiglist <- list()
for(i in 1:5){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
name <- paste('item:',i,sep='')
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[name]] <- tmp
}
# List of 5
# $ item:1:List of 3
# ..$ uniform : num [1:10] 0.737 0.987 0.577 0.814 0.452 ...
# ..$ normal : num [1:16] -0.403 -0.104 2.147 0.32 1.713 ...
# ..$ binomial: num [1:8] 0 0 0 0 1 0 0 1
# $ item:2:List of 3
# ..$ uniform : num [1:10] 0.61 0.62 0.49 0.217 0.862 ...
# ..$ normal : num [1:16] 0.945 -0.154 -0.5 -0.729 -0.547 ...
# ..$ binomial: num [1:8] 1 2 2 0 2 1 0 2
# $ item:3:List of 3
# ..$ uniform : num [1:10] 0.66 0.094 0.432 0.634 0.949 ...
# ..$ normal : num [1:16] -0.607 0.274 -1.455 0.828 -0.73 ...
# ..$ binomial: num [1:8] 2 2 3 1 1 1 2 0
# $ item:4:List of 3
# ..$ uniform : num [1:10] 0.455 0.442 0.149 0.745 0.24 ...
# ..$ normal : num [1:16] 0.0994 -0.5332 -0.8131 -1.1847 -0.8032 ...
# ..$ binomial: num [1:8] 2 3 1 1 2 2 2 1
# $ item:5:List of 3
# ..$ uniform : num [1:10] 0.816 0.279 0.583 0.179 0.321 ...
# ..$ normal : num [1:16] -0.036 1.137 0.178 0.29 1.266 ...
# ..$ binomial: num [1:8] 3 4 3 4 4 2 2 3
Change
mybiglist[[name]] <- append(mybiglist, tmp)
to
mybiglist[[name]] <- tmp
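The reason the original line misbehaves is that `append(mybiglist, tmp)` returns the whole list built so far with `tmp` tacked on, and that combined list is then stored under the new name. A stripped-down illustration, with `acc` and the two-element `tmp` as made-up stand-ins:

```r
acc <- list()
tmp <- list(uniform = 1, normal = 2)

acc[["item:1"]] <- append(acc, tmp)  # acc is still empty, so this stores tmp
acc[["item:2"]] <- append(acc, tmp)  # stores a copy of acc plus tmp

names(acc[["item:1"]])
# [1] "uniform" "normal"
names(acc[["item:2"]])
# [1] "item:1"  "uniform" "normal"
```

Each later item therefore nests everything accumulated before it, which is why the names look wrong and the structure keeps growing.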
To show that an explicit for loop is not required:
unif_norm <- replicate(5, list(uniform = runif(10),
normal = rnorm(16)), simplify = FALSE)
# n = 8 draws with size = 5 trials, matching rbinom(8, 5, i/10) in the question
binomials <- lapply(seq_len(5)/10, function(prob) {
list(binomial = rbinom(n = 8, size = 5, prob = prob))})
biglist <- setNames(mapply(c, unif_norm, binomials, SIMPLIFY = FALSE),
paste0('item:',seq_along(unif_norm)))
In general, if you go down the for loop path, it is better to preallocate the list beforehand. This is more memory efficient.
mybiglist <- vector('list', 5)
names(mybiglist) <- paste0('item:', seq_along(mybiglist))
for(i in seq_along(mybiglist)){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[i]] <- tmp
}
