I wish to apply a function only to some elements of a nested list
l <- list()
l$a$forecast <- rnorm(3)
l$a$model <- "arima"
l$b$forecast <- rnorm(3)
l$b$model <- "prophet"
The desired output would be like this:
# apply the sum function to the $forecast element of each sub-list
fcst <- c(sum(l$a$forecast),sum(l$b$forecast))
mdl <- c(l$a$model,l$b$model)
df <- data.frame(fcst,mdl)
I tried something like this, but it does not produce the desired result:
df <- lapply(l$forecast, function(x) sum(x))
df <- do.call(rbind, Map(cbind, sku = names(df)))
Another approach uses rrapply() from the rrapply package combined with dplyr's bind_rows(). This also extends to lists with deeper levels of nesting.
library(magrittr)  # provides the %>% pipe used below
rrapply::rrapply(l, condition = function(x, .xname) .xname == "forecast", f = sum) %>%
  dplyr::bind_rows()
#> # A tibble: 2 x 2
#> forecast model
#> <dbl> <chr>
#> 1 -1.28 arima
#> 2 1.10 prophet
Data
set.seed(1)
l <- list(
a = list(forecast = rnorm(3), model = "arima"),
b = list(forecast = rnorm(3), model = "prophet")
)
do.call(
  rbind,
  lapply(
    l,
    function(x) list(fcst = sum(x$forecast), model = x$model)
  )
)
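If a plain data.frame is preferred over the matrix of list elements that rbind produces here, a small variation (a sketch along the same lines, not part of the original answer) builds one directly:
do.call(
  rbind,
  lapply(l, function(x) data.frame(fcst = sum(x$forecast), model = x$model))
)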
Since you know the exact dimensions of your returned object you can use vapply in cases like this for a minor performance improvement:
vapply(
l,
FUN = function(x) list(fcst = sum(x$forecast), model = x$model),
FUN.VALUE = list(fcst = numeric(1), model = character(1))
)
However, the resulting object can be hard to work with.
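For reference, one way to reshape it into a data frame (a sketch; res is just a name introduced here for the vapply result):
res <- vapply(l,
              function(x) list(fcst = sum(x$forecast), model = x$model),
              FUN.VALUE = list(fcst = numeric(1), model = character(1)))
# rows of `res` are fcst and model, columns are the names of l
data.frame(fcst = unlist(res["fcst", ]), mdl = unlist(res["model", ]))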
You can get the letters from the built-in letters vector and use them in a loop:
n <- 2  # number of sub-lists you have
sumfore <- model <- vector()
for (i in letters[seq_len(n)]) {
  sumfore[i] <- sum(l[[i]]$forecast)
  model[i] <- l[[i]]$model
}
newdf <- data.frame(sumfore, model)
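A small variation (a sketch of the same idea) loops over names(l) instead of hard-coding letters, so it does not assume the sub-lists are named a, b, ...:
sumfore <- model <- vector()
for (nm in names(l)) {
  sumfore[nm] <- sum(l[[nm]]$forecast)
  model[nm] <- l[[nm]]$model
}
newdf <- data.frame(sumfore, model)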
I have a list of lists containing data.frames, from which I want to select only a few rows. I can achieve this in a for loop, where I create a sequence based on the number of rows and select only the row indices matching that sequence.
But with more deeply nested lists it no longer works. I am also sure there is a better way of doing this without a loop.
What would be an efficient and generic approach to sampling from nested lists that vary in their dimensions and contain data.frames or matrices?
## Dummy Data
n1=100;n2=300;n3=100
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
## Code to optimize
FiltRef <- list()
filterBy = 10
for (r in 1:length(crdOrig)) {
tmp <- do.call(rbind, crdOrig[[r]])
filterInd <- seq(1,nrow(tmp), by = filterBy)
FiltRef[[r]] <- tmp[filterInd,]
}
crdResult <- do.call(rbind, FiltRef)
# Plotting
crdOrigPl <- do.call(rbind, unlist(crdOrig, recursive = F))
plot(crdOrigPl[,1], crdOrigPl[,2], col="red", pch=20)
points(crdResult[,1], crdResult[,2], col="green", pch=20)
The code above also works if a list contains several data.frames (data below).
## Dummy Data (Multiple DF)
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)),
data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
But if a list contains multiple lists, it throws an error when trying to bind the result (FiltRef) together.
The result can be a data.frame with 2 columns (x, y) like crdResult, or a one-dimensional list like FiltRef (from the first example).
## Dummy Data (Multiple Lists)
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
+1 and thank you all for your brilliant answers! They all work and there is a lot to learn from each one of them. I will give this one to @Gwang-Jin Kim, as his solution is the most flexible and extensive, although they all deserve to be checked out!
Preparation and implementation of flatten
Well, there are many other answers which are in principle the same.
In the meantime I implemented, for fun, the flattening of nested lists.
Since I am thinking in Lisp:
I first implemented car and cdr from Lisp.
car <- function(l) {
  if (is.list(l)) {
    if (null(l)) {   # null() is the empty-list predicate defined below
      list()
    } else {
      l[[1]]
    }
  } else {
    stop("Not a list.")
  }
}
cdr <- function(l) {
  if (is.list(l)) {
    if (null(l) || length(l) == 1) {
      list()
    } else {
      l[2:length(l)]
    }
  } else {
    stop("Not a list.")
  }
}
Some predicate functions:
null <- function(l) length(l) == 0
# this is Lisp's `null` checking whether list is empty (`length(l) == 0`)
# R's `is.null()` checks for the value NULL and not `length(obj) == 0`
# following @Martin Morgan's comment I removed the other predicate functions
# thank you @Martin Morgan!
# instead I use `is.data.frame()` and `is.list()`, since they are
# not only already there but also safer.
These are needed to build flatten (for data frame lists):
flatten <- function(nested.list.construct) {
  # Implemented Lisp's flatten tail call recursively. (`..flatten()`)
  # Instead of (atom l) (is.df l).
  ..flatten <- function(l, acc.l) {
    if (null(l)) {
      acc.l
    } else if (is.data.frame(l)) { # originally one checks here for is.atom(l)
      acc.l[[length(acc.l) + 1]] <- l
      acc.l # kind of (list* l acc.l)
    } else {
      ..flatten(car(l), ..flatten(cdr(l), acc.l))
    }
  }
  ..flatten(nested.list.construct, list())
}
# an atom is in the widest sense a non-list object
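A quick sanity check of flatten (a throwaway example with made-up data frames df1, df2, df3, not from the original post):
df1 <- data.frame(v = 1); df2 <- data.frame(v = 2); df3 <- data.frame(v = 3)
length(flatten(list(df1, list(df2, list(df3)))))
# 3: every data frame is found, regardless of nesting depth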
After this, the actual function is defined using a sampling function.
Defining sampling function
# helper function
nrow <- function(df) dim(df)[1L]
# sampling function
sample.one.nth.of.rows <- function(df, fraction = 1/10) {
# Randomly selects a fraction of the rows of a data frame
nr <- nrow(df)
df[sample(nr, fraction * nr), , drop = FALSE]
}
The actual collector function (from nested data-frame-lists)
collect.df.samples <- function(df.list.construct, fraction = 1/10) {
  do.call(rbind,
          lapply(flatten(df.list.construct),
                 function(df) sample.one.nth.of.rows(df, fraction)))
}
# thanks for the improvement with `do.call(rbind, [list])` @Ryan!
# and for the hint that, with `require(data.table)`,
# `data.table::rbindlist([list])` would be even faster.
collect.df.samples first flattens the nested list construct of data frames df.list.construct into a flat list of data frames. It applies the function sample.one.nth.of.rows to each element of that list (lapply). Thereby it produces a list of sampled data frames (each containing the given fraction, here 1/10th, of the original data frame's rows). These sampled data frames are rbind-ed across the list. The resulting data frame is returned; it consists of the sampled rows of each of the data frames.
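Since the comments above mention data.table::rbindlist() as a faster alternative, here is a minimal sketch of that variant (assuming the data.table package is installed; note it returns a data.table rather than a plain data.frame):
collect.df.samples.dt <- function(df.list.construct, fraction = 1/10) {
  data.table::rbindlist(
    lapply(flatten(df.list.construct),
           function(df) sample.one.nth.of.rows(df, fraction))
  )
}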
Testing on example
## Dummy Data (Multiple Lists)
n1=100;n2=300;n3=100
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
collect.df.samples(crdOrig, fraction = 1/10)
Refactoring for later modifications
By rewriting the collect.df.samples function as:
# sampler function
sample.10th.fraction <- function(df) sample.one.nth.of.rows(df, fraction = 1/10)
# refactored:
collect.df.samples <-
function(df.list.construct,
df.sampler.fun = sample.10th.fraction) {
do.call(rbind,
lapply(flatten(df.list.construct), df.sampler.fun))
}
one can make the sampler function replaceable.
(And even without swapping it out, changing the fraction parameter increases or reduces the number of rows collected from each data frame.)
With this definition the sampler function is easily exchangeable.
To choose every nth (e.g. every 10th) row of each data frame, instead of sampling rows at random, you could use a sampler function whose body is:
df[seq(from = 1, to = nrow(df), by = nth), , drop = FALSE]
and pass it as df.sampler.fun = in collect.df.samples. This function is then applied to every data frame in the nested list and the results are collected into one data frame.
every.10th.rows <- function(df, nth = 10) {
df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
}
a.10th.of.all.rows <- function(df, fraction = 1/10) {
sample.one.nth.of.rows(df, fraction)
}
collect.df.samples(crdOrig, a.10th.of.all.rows)
collect.df.samples(crdOrig, every.10th.rows)
I would just flatten the whole darn thing and work on a clean list.
library(rlist)
out <- list.flatten(crdOrig)  # flatten the question's nested list
# prepare a vector for which columns belong together
vc <- rep(1:(length(out)/2), each = 2)
vc <- split(1:length(vc), vc)
# prepare the final list
ll <- vector("list", length(unique(vc)))
for (i in 1:length(vc)) {
ll[[i]] <- as.data.frame(out[vc[[i]]])
}
result <- lapply(ll, FUN = function(x) {
x[sample(1:nrow(x), size = 10, replace = FALSE), ]
})
do.call(rbind, result)
x y
98 10.32912 52.87113
52 16.42912 46.07026
92 18.85397 46.26403
90 12.04884 57.79290
23 18.20997 40.57904
27 18.98340 52.55919
...
Here's an answer in base R, borrowing from a custom "rapply" function mentioned here: rapply to nested list of data frames in R
df_samples <- list()
i <- 0
# f() collects a 10-row sample from each data frame into df_samples,
# using global assignment (<<-) as a side effect
f <- function(x) {
  i <<- i + 1
  df_samples[[i]] <<- x[sample(rownames(x), 10), ]
}
recurse <- function(L, f) {
  if (inherits(L, "data.frame")) {
    f(L)
  } else {
    lapply(L, recurse, f)
  }
}
recurse(crdOrig, f)
res<-do.call("rbind", df_samples)
I too would flatten the list-of-lists into a standard representation (and do all analysis on the flattened representation, not just the subsetting), but keep track of relevant indexing information, e.g.,
flatten_recursive = function(x) {
  i <- 0L
  .f = function(x, depth) {
    if (is.data.frame(x)) {
      i <<- i + 1L
      cbind(i, depth, x)
    } else {
      x = lapply(x, .f, depth + 1L)
      do.call(rbind, x)
    }
  }
  .f(x, 0L)
}
The internal function .f() visits each element of a list. If the element is a data.frame, it adds a unique identifier to index it. If it's a list, it calls itself on each element of the list (incrementing a depth counter in case this is useful; one could also add a 'group' counter) and then row-binds the elements. I use an internal function so that I can have a variable i to increment across function calls. The end result is a single data frame with an index to use for referencing the original results.
> tbl <- flatten_recursive(crdOrig) %>% as_tibble()
> tbl %>% group_by(i, depth) %>% summarize(n())
# A tibble: 4 x 3
# Groups: i [?]
i depth `n()`
<int> <int> <int>
1 1 3 100
2 2 3 100
3 3 2 300
4 4 2 100
> tbl %>% group_by(i) %>% slice(seq(1, n(), by = 10)) %>% summarize(n())
# A tibble: 4 x 2
i `n()`
<int> <int>
1 1 10
2 2 10
3 3 30
4 4 10
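To illustrate the 'group' counter mentioned above, a possible variation (a sketch, not from the original answer) also tags each data frame with the index of its top-level element:
flatten_recursive_grouped <- function(x) {
  i <- 0L
  .f <- function(x, group, depth) {
    if (is.data.frame(x)) {
      i <<- i + 1L
      cbind(i, group, depth, x)
    } else {
      do.call(rbind, lapply(seq_along(x), function(k)
        .f(x[[k]], if (depth == 0L) k else group, depth + 1L)))
    }
  }
  .f(x, 0L, 0L)
}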
The overall pattern of .f() can be adjusted for additional data types, e.g., (some details omitted)
.f <- function(x) {
  if (is.data.frame(x)) {
    x
  } else if (is.matrix(x)) {
    x <- as.data.frame(x)
    setNames(x, c("x", "y"))
  } else {
    do.call(rbind, lapply(x, .f))
  }
}
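For concreteness, one possible completion of that sketch (an assumption on my part: matrices carry two unnamed columns corresponding to x and y, and the i/depth bookkeeping from above is kept):
flatten_recursive2 <- function(x) {
  i <- 0L
  .f <- function(x, depth) {
    if (is.data.frame(x)) {
      i <<- i + 1L
      cbind(i, depth, x)
    } else if (is.matrix(x)) {
      i <<- i + 1L
      cbind(i, depth, setNames(as.data.frame(x), c("x", "y")))
    } else {
      do.call(rbind, lapply(x, .f, depth + 1L))
    }
  }
  .f(x, 0L)
}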
Consider a recursive call that conditionally checks whether the first item is of class data.frame or list.
stack_process <- function(lst){
  if(class(lst[[1]]) == "data.frame") {
    tmp <- lst[[1]]
  }
  if(class(lst[[1]]) == "list") {
    inner <- lapply(lst, stack_process)
    tmp <- do.call(rbind, inner)
  }
  return(tmp)
}
new_crdOrig <- lapply(crdOrig, function(x) {
  df <- stack_process(x)
  filterInd <- seq(1, nrow(df), by = filterBy)
  return(df[filterInd,])
})
final_df <- do.call(rbind, new_crdOrig)
I have a dataframe with a set of objects (df$data) and a set of rules (df$rules) to be applied to every object.
df <- data.frame(
data = c(1,2,3),
rules = c("rule1", "rule1, rule2, rule3", "rule3, rule2"),
stringsAsFactors = FALSE
)
The rules are
rule1 <- function(data) {
data * 2
}
rule2 <- function(data) {
data + 1
}
rule3 <- function(data) {
data ^ 3
}
For every row in the dataframe I want to apply all the rules specified in the rules column. The rules should be applied in series.
What I figured out:
apply_rules <- function(data, rules) {
  for (i in 1:length(data)) {
    rules_now <- unlist(strsplit(rules[i], ", "))
    for (j in 1:length(rules_now)) {
      data[i] <- apply_rule(data[i], rules_now[j])
    }
  }
  return(data)
}
apply_rule <- function(data, rule) {
  return(sapply(data, rule))
}
apply_rules(df$data, df$rules)
# [1] 2 125 28
Although this works, I'm pretty sure there must be more elegant solutions. On SO I could find lots of material about the apply functions, one post about applying many functions to a vector, and something about chaining functions. The Compose idea looks promising, but I couldn't figure out how to call Compose with my rules given as strings (parse() didn't work).
Any hints?
Some good answers already, but I'll throw in another option: build a pipe chain as a string, then evaluate it. For example, for row 1, eval(parse(text = "1 %>% rule1")) gives 2.
library(magrittr)  # provides %>%, which the evaluated chain relies on
eval_chain <- function(df) {
  eval(parse(text = paste(c(df$data, unlist(strsplit(df$rules, ", "))), collapse = " %>% ")))
}
df$value <- sapply(1:nrow(df), function(i) df[i, ] %>% eval_chain)
# data rules value
# 1 1 rule1 2
# 2 2 rule1, rule2, rule3 125
# 3 3 rule3, rule2 28
You can use mapply and Reduce together with mget in this case.
mapply(function(d, r) Reduce(function(lhs, rhs) rhs(lhs),
                             c(d, mget(strsplit(r, ", ")[[1]], envir = globalenv()))),
       df$data,
       df$rules)
# [1] 2 125 28
You might have to adjust the envir argument of mget to your specific case. It would probably be more robust to explicitly pass the environment where your rules are defined to mget.
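For instance (a hedged sketch of that suggestion; rule_env is a name introduced here, not in the original answer):
rule_env <- globalenv()   # replace with the environment where rule1/rule2/rule3 actually live
mapply(function(d, r) Reduce(function(lhs, rhs) rhs(lhs),
                             c(d, mget(strsplit(r, ", ")[[1]], envir = rule_env))),
       df$data,
       df$rules)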
I think you have to change the approach a little (expressions will only make things worse in this case):
df <- data.frame(
data = c(1,2,3),
rules = c("rule1", "rule1, rule2, rule3", "rule3, rule2"),
stringsAsFactors = FALSE
)
# list of functions
fun_list <- list(
rule1 = function(x) x*2,
rule2 = function(x) x+1,
rule3 = function(x) x^3
)
# function to call list of functions
call_funs <- function(x, fun_vec) {
for (i in seq_along(fun_vec)) {
x <- fun_list[[fun_vec[[i]]]](x)
}
x
}
(want <- unlist(Map(call_funs, df$data, strsplit(gsub(" ", "", df$rules), ","))))
# 2 125 28
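The same idea can also be written with Reduce instead of the explicit for loop (a sketch, equivalent under the definitions above):
call_funs2 <- function(x, fun_vec) {
  Reduce(function(acc, nm) fun_list[[nm]](acc), fun_vec, init = x)
}
unlist(Map(call_funs2, df$data, strsplit(gsub(" ", "", df$rules), ",")))
# 2 125 28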
I have a large dataset in R (say >40,000 rows and >20 categorical columns) that I repeatedly subset, so I would like to speed this up as much as possible. It needs to be a general function (each categorical column has a discrete number of possible values, say in string format).
Each time I subset, I need to identify the subset of rows that satisfy multiple logical set-membership conditions (e.g. >10 conditions), i.e. I need to check several columns and test whether the values in each column belong to a certain set (hence the use of %in%).
# simple dataset example
library(dplyr)
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
dat_list[[i]] <- data_frame(sample(letters[1:10], size = num_row, r = T))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq(15))
I've looked around the internet and SO a lot, but haven't found the discussion of performance I'm looking for. I mostly code using dplyr, so apologies if there's a clear performance improvement here in data.table; I've tried some simple benchmarks between the two (but without using any data.table indexing, etc.) and it's not obvious if one is faster.
Example options I've considered (since I'm not great at data.table, I've excluded data.table options from here):
library(lazyeval)  # for interp(), used below
base_filter <- function(dat) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dat <- dat[dat[[col_name]] %in% sample(letters[1:10], size = 4), ]
  }
  dat
}
dplyr_filter1 <- function(dat) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dat <- filter_(dat,
                   .dots = interp(~ colname %in% vals,
                                  colname = as.name(col_name),
                                  vals = sample(letters[1:10], size = 4)))
  }
  dat
}
dplyr_filter2 <- function(dat) {
  dots_filter <- list()
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dots_filter[[i]] <- interp(~ colname %in% vals,
                               colname = as.name(col_name),
                               vals = sample(letters[1:10], size = 4))
  }
  filter_(dat, .dots = dots_filter)
}
Note: In practice, on my real datasets, dplyr_filter2 actually works fastest. I've also tried dtplyr or converting my data to a data.table, but this seems slower than without.
Note: On the other hand, in practice, the base R function outperforms the dplyr examples when data has fewer rows and fewer columns (perhaps due to copying speed?).
Thus, I'd like to ask SO what the general, most efficient way(s) to subset a categorical dataframe under multiple (set-membership) conditions are, and, if possible, to explain the mechanics of why. Does the answer differ for smaller datasets? Does it depend on copying time or search time?
Useful related links
fast lookup for one key
using hash tables in R for key-value pairs
I understand that you prefer not to use data.table; I'm just providing some timings for reference below. With indexing, subsetting can be performed much faster, and an inner join of the 2 tables can also be done easily in data.table.
# simple dataset example
library(dplyr)
library(lazyeval)
set.seed(0L)
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
dat_list[[i]] <- data_frame(sample(letters[1:10], size = num_row, r = T))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq(15))
selection <- lapply(1:7, function(n) sample(letters[1:10], size = 4))
base_filter <- function(df) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    df <- df[df[[col_name]] %in% selection[[i]], ]
  }
  df
}
dplyr_filter1 <- function(df) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    df <- filter_(df,
                  .dots = interp(~ colname %in% vals,
                                 colname = as.name(col_name),
                                 vals = selection[[i]]))
  }
  df
}
dplyr_filter2 <- function(df) {
  dots_filter <- list()
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dots_filter[[i]] <- interp(~ colname %in% vals,
                               colname = as.name(col_name),
                               vals = selection[[i]])
  }
  filter_(df, .dots = dots_filter)
}
library(data.table)
#convert data.frame into data.table
dt <- data.table(dat, key=names(dat)[1:7])
#create the sets of selection
dtSelection <- data.table(expand.grid(selection, stringsAsFactors=FALSE))
library(microbenchmark)
microbenchmark(
base_filter(dat),
dplyr_filter1(dat),
dplyr_filter2(dat),
dt[dtSelection, nomatch=0], #perform inner join between dataset and selection
times=5L)
#Unit: milliseconds
# expr min lq mean median uq max neval
# base_filter(dat) 27.084801 27.870702 35.849261 32.045900 32.872601 59.372301 5
# dplyr_filter1(dat) 23.130100 24.114301 26.922081 24.860701 29.804301 32.701002 5
# dplyr_filter2(dat) 29.641101 30.686002 32.363681 31.103000 31.884701 38.503601 5
# dt[dtSelection, nomatch = 0] 3.626001 3.646201 3.829341 3.686601 3.687001 4.500901 5
In addition to chinsoon12's alternatives, one thing to consider is to avoid subsetting the data.frame in each iteration. So, instead of
f0 = function(x, cond)
{
for(j in seq_along(x)) x = x[x[[j]] %in% cond[[j]], ]
return(x)
}
one alternative is to accumulate a logical vector of whether to include each row in the final subset:
f1 = function(x, cond)
{
i = rep_len(TRUE, nrow(x))
for(j in seq_along(x)) i = i & (x[[j]] %in% cond[[j]])
return(x[i, ])
}
or, another alternative, is to iteratively reduce the amount of comparisons, but by reducing the row indices instead of the data.frame itself:
f2 = function(x, cond)
{
i = 1:nrow(x)
for(j in seq_along(x)) i = i[x[[j]][i] %in% cond[[j]]]
return(x[i, ])
}
And a comparison with data:
set.seed(1821)
dat = as.data.frame(replicate(30, sample(c(letters, LETTERS), 5e5, TRUE), FALSE),
stringsAsFactors = FALSE)
conds = replicate(ncol(dat), sample(c(letters, LETTERS), 48), FALSE)
system.time({ ans0 = f0(dat, conds) })
# user system elapsed
# 3.44 0.28 3.86
system.time({ ans1 = f1(dat, conds) })
# user system elapsed
# 0.66 0.01 0.68
system.time({ ans2 = f2(dat, conds) })
# user system elapsed
# 0.34 0.01 0.39
identical(ans0, ans1)
#[1] TRUE
identical(ans1, ans2)
#[1] TRUE
I'm subsampling rows from a dataframe with columns c("x", "y", "density") over a range of c("s_size", "reps") values. reps = number of replicates, s_size = number of rows subsampled from the whole dataframe.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
#Subsampling###################
subsample_loop <- function(s_size, reps, int) {
  tm1 <- system.time({ #start timer
    subsample_bound <- data.frame()
    #Perform Subsampling of the general
    for (s_size in seq(1, s_size, int)) {
      for (reps in 1:reps) {
        subsample <- sample.df.rows(s_size, data_xyz)
        assign(paste("sample", "_", "n", s_size, "_", "r", reps, sep = ""), subsample)
        subsample_replicate <- subsample[,] #temporary variable
        subsample_replicate <- cbind(subsample,
                                     rep(s_size, length(subsample_replicate[,1])),
                                     rep(reps, length(subsample_replicate[,1])))
        subsample_bound <- rbind(subsample_bound, subsample_replicate)
      }
    }
  }) #end timer
  colnames(subsample_bound) <- c("x","y","density","s_size","reps")
  subsample_bound
} #end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples a number of rows in a dataframe, outputs a dataframe of the same # of columns
# df Data Frame
# N number of samples to be taken
sample.df.rows <- function (N, df, ...)
{
df[sample(nrow(df), N, replace=FALSE,...), ]
}
It's way too slow; I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i, j, data){
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1, s_size, by = int), each = reps)
  id <- rep(seq_len(reps), times = s_size/int)
  out <- do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}
Calling the function
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
outputs data, unfortunately with this warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
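A likely reason for the warning: id is built from s_size/int, which is not a whole number when s_size = 206 and int = 5, so ss and id end up with different lengths inside mapply(). One possible fix (a sketch) derives both vectors from the same sequence so they always stay in sync:
resampling_custom <- function(dat, s_size, int, reps) {
  sizes <- seq(1, s_size, by = int)
  ss <- rep(sizes, each = reps)
  id <- rep(seq_len(reps), times = length(sizes))
  do.call(rbind, mapply(foo, i = ss, j = id,
                        MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}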
I put very little thought into actually optimizing this; I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and then expand that object using rbind, cbind or c, you can be very sure that the resulting code is essentially the slowest possible way of doing whatever task is being attempted.
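A toy illustration of the difference (a hypothetical example, not from the original benchmark):
# growing a data frame row by row copies it on every iteration...
grow_slow <- function(n) {
  out <- data.frame()
  for (k in seq_len(n)) out <- rbind(out, data.frame(x = k, y = k^2))
  out
}
# ...whereas collecting the pieces in a list and binding once avoids that
bind_once <- function(n) {
  do.call(rbind, lapply(seq_len(n), function(k) data.frame(x = k, y = k^2)))
}
# system.time(grow_slow(2000)) is typically far slower than system.time(bind_once(2000))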
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of this if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
foo <- function(i, j, data){
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.