Efficient sampling from nested lists - R

I have a list of lists containing data.frames, from which I want to select only a few rows. I can achieve this in a for loop, where I create a sequence based on the number of rows and select only the row indices in that sequence.
But with more deeply nested lists it no longer works. I am also sure there is a better way of doing this without a loop.
What would be an efficient and generic approach to sampling from nested lists that vary in their dimensions and contain data.frames or matrices?
## Dummy Data
n1=100;n2=300;n3=100
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
## Code to optimize
FiltRef <- list()
filterBy = 10
for (r in 1:length(crdOrig)) {
tmp <- do.call(rbind, crdOrig[[r]])
filterInd <- seq(1,nrow(tmp), by = filterBy)
FiltRef[[r]] <- tmp[filterInd,]
}
crdResult <- do.call(rbind, FiltRef)
# Plotting
crdOrigPl <- do.call(rbind, unlist(crdOrig, recursive = F))
plot(crdOrigPl[,1], crdOrigPl[,2], col="red", pch=20)
points(crdResult[,1], crdResult[,2], col="green", pch=20)
The code above also works if a list contains several data.frames (data below).
## Dummy Data (Multiple DF)
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)),
data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
But if a list contains multiple lists, it throws an error when trying to bind the result (FiltRef) together.
The result can be a data.frame with 2 columns (x, y), like crdResult, or a one-dimensional list like FiltRef (from the first example).
## Dummy Data (Multiple Lists)
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
+1 and thank you all for your brilliant answers! They all work and there is a lot to learn from each one of them. I will give this one to @Gwang-Jin Kim, as his solution is the most flexible and extensive, although they all deserve to be checked!

Preparation and implementation of flatten
There are many other answers that are in principle the same.
In the meantime I implemented, for fun, the flattening of nested lists.
Since I think in Lisp, I first implemented car and cdr from Lisp:
car <- function(l) {
  if (is.list(l)) {
    if (null(l)) {  # `null` (defined below) tests for the empty list
      list()
    } else {
      l[[1]]
    }
  } else {
    stop("Not a list.")  # base R signals errors with stop(), not error()
  }
}
cdr <- function(l) {
  if (is.list(l)) {
    if (null(l) || length(l) == 1) {
      list()
    } else {
      l[2:length(l)]
    }
  } else {
    stop("Not a list.")
  }
}
Some predicate functions:
null <- function(l) length(l) == 0
# this is Lisp's `null` checking whether list is empty (`length(l) == 0`)
# R's `is.null()` checks for the value NULL and not `length(obj) == 0`
# following @Martin Morgan's comment I removed the other predicate functions
# (thank you, @Martin Morgan!) and instead use `is.data.frame()` and
# `is.list()`, since they are not only already there but also safer.
These predicates are needed to build flatten (for data frame lists):
flatten <- function(nested.list.construct) {
# Lisp's flatten, implemented recursively with an accumulator (`..flatten()`).
# Instead of Lisp's (atom l), we check (is.data.frame l).
..flatten <- function(l, acc.l) {
if (null(l)) {
acc.l
} else if (is.data.frame(l)) { # originally one checks here for is.atom(l)
acc.l[[length(acc.l) + 1]] <- l
acc.l # kind of (list* l acc.l)
} else {
..flatten(car(l), ..flatten(cdr(l), acc.l))
}
}
..flatten(nested.list.construct, list())
}
# an atom is, in the widest sense, a non-list object
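A quick sanity check of car, cdr, and flatten on toy inputs (my own illustration, not part of the original answer):
car(list(1, 2, 3))  # 1
cdr(list(1, 2, 3))  # list(2, 3)
# flatten collects data frames from any nesting depth into one flat list
df <- data.frame(x = 1:2, y = 3:4)
length(flatten(list(list(df), list(df, list(df)))))  # 3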
After this, the actual function is defined using a sampling function.
Defining sampling function
# helper function (same behaviour as base::nrow for data frames)
nrow <- function(df) dim(df)[1L]
# sampling function
sample.one.nth.of.rows <- function(df, fraction = 1/10) {
# Randomly selects a fraction of the rows of a data frame
nr <- nrow(df)
df[sample(nr, fraction * nr), , drop = FALSE]
}
The actual collector function (from nested data-frame-lists)
collect.df.samples <- function(df.list.construct, fraction = 1/10) {
do.call(rbind,
lapply(flatten(df.list.construct),
function(df) sample.one.nth.of.rows(df, fraction)
)
)
}
# thanks for the improvement with `do.call(rbind, [list])`, @Ryan,
# and for the hint that, with `require(data.table)`,
# `data.table::rbindlist([list])` would be even faster.
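For what it's worth, a minimal sketch of that data.table variant (assuming data.table is installed; collect.df.samples.dt is my own name):
library(data.table)
collect.df.samples.dt <- function(df.list.construct, fraction = 1/10) {
  # rbindlist() binds a list of data frames faster than do.call(rbind, ...)
  rbindlist(lapply(flatten(df.list.construct),
                   function(df) sample.one.nth.of.rows(df, fraction)))
}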
collect.df.samples first flattens the nested list construct of data frames, df.list.construct, into a flat list of data frames. It then applies the function sample.one.nth.of.rows to each element of that list (lapply), thereby producing a list of sampled data frames (each containing the given fraction, here 1/10th, of the original data frame's rows). These sampled data frames are row-bound across the list, and the resulting data frame, consisting of the sampled rows of each input data frame, is returned.
Testing on example
## Dummy Data (Multiple Lists)
n1=100;n2=300;n3=100
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
collect.df.samples(crdOrig, fraction = 1/10)
Refactoring for later modifications
By rewriting the collect.df.samples function as:
# sampler function
sample.10th.fraction <- function(df) sample.one.nth.of.rows(df, fraction = 1/10)
# refactored:
collect.df.samples <-
function(df.list.construct,
df.sampler.fun = sample.10th.fraction) {
do.call(rbind,
lapply(flatten(df.list.construct), df.sampler.fun))
}
one can make the sampler function replaceable.
(And even without that: by changing the fraction parameter, one can increase or reduce the number of rows collected from each data frame.)
With this definition the sampler function is easily exchangeable.
To choose every nth (e.g. every 10th) row of the data frame, instead of sampling randomly,
you could e.g. use the sampler function:
df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
and pass it as df.sampler.fun = in collect.df.samples. This function will then be applied to every data frame in the nested list object, and the results collected into one data frame.
every.10th.rows <- function(df, nth = 10) {
df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
}
a.10th.of.all.rows <- function(df, fraction = 1/10) {
sample.one.nth.of.rows(df, fraction)
}
collect.df.samples(crdOrig, a.10th.of.all.rows)
collect.df.samples(crdOrig, every.10th.rows)

I would just flatten the whole darn thing and work on a clean list.
library(rlist)
# note: list.flatten also descends into the data.frames themselves (a
# data.frame is a list of columns), so `out` holds bare x/y column vectors
out <- list.flatten(crdOrig)
# prepare a vector marking which columns belong together
vc <- rep(1:(length(out)/2), each = 2)
vc <- split(1:length(vc), vc)
# prepare the final list
ll <- vector("list", length(unique(vc)))
for (i in 1:length(vc)) {
ll[[i]] <- as.data.frame(out[vc[[i]]])
}
result <- lapply(ll, FUN = function(x) {
x[sample(1:nrow(x), size = 10, replace = FALSE), ]
})
do.call(rbind, result)
x y
98 10.32912 52.87113
52 16.42912 46.07026
92 18.85397 46.26403
90 12.04884 57.79290
23 18.20997 40.57904
27 18.98340 52.55919
...
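Note that the column re-pairing above assumes every data.frame has exactly two columns. If that feels fragile, a hedged alternative sketch that flattens only down to the data.frame level, so no reassembly is needed (flatten_to_df is my own helper, not part of rlist):
flatten_to_df <- function(x) {
  # stop recursing as soon as we hit a data.frame, keeping it intact
  if (is.data.frame(x)) list(x)
  else do.call(c, lapply(x, flatten_to_df))
}
out2 <- flatten_to_df(crdOrig)  # a flat list of intact data.frames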

Here's an answer in base R, borrowing from a custom "rapply" function mentioned here: rapply to nested list of data frames in R
df_samples <- list()
i <- 0  # start at 0 so the first assignment lands at index 1
f <- function(x) {
  i <<- i + 1
  df_samples[[i]] <<- x[sample(rownames(x), 10), ]
}
recurse <- function (L, f) {
if (inherits(L, "data.frame")) {
f(L) }
else lapply(L, recurse, f)
}
recurse(crdOrig, f)
res<-do.call("rbind", df_samples)

I too would flatten the list-of-lists into a standard representation (and do all analysis on the flattened representation, not just the subsetting), but keep track of relevant indexing information, e.g.,
flatten_recursive = function(x) {
i <- 0L
.f = function(x, depth) {
if (is.data.frame(x)) {
i <<- i + 1L
cbind(i, depth, x)
} else {
x = lapply(x, .f, depth + 1L)
do.call(rbind, x)
}
}
.f(x, 0L)
}
The internal function .f() visits each element of a list. If the element is a data.frame, it adds a unique identifier to index it. If it's a list, it calls itself on each element of the list (incrementing a depth counter, in case this is useful; one could also add a 'group' counter) and then row-binds the elements. I use an internal function so that I can have a variable i that increments across function calls. The end result is a single data frame with an index to use for referencing the original results.
> library(dplyr)  # for %>%, as_tibble(), and the verbs below
> tbl <- flatten_recursive(crdOrig) %>% as_tibble()
> tbl %>% group_by(i, depth) %>% summarize(n())
# A tibble: 4 x 3
# Groups: i [?]
i depth `n()`
<int> <int> <int>
1 1 3 100
2 2 3 100
3 3 2 300
4 4 2 100
> tbl %>% group_by(i) %>% slice(seq(1, n(), by = 10)) %>% summarize(n())
# A tibble: 4 x 2
i `n()`
<int> <int>
1 1 10
2 2 10
3 3 30
4 4 10
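As a small follow-on sketch (mine, using the same verbs), to keep the sampled rows themselves rather than just counting them:
sampled <- tbl %>%
  group_by(i) %>%
  slice(seq(1, n(), by = 10)) %>%
  ungroup()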
The overall pattern of .f() can be adjusted for additional data types, e.g., (some details omitted)
.f <- function(x) {
if (is.data.frame(x)) {
x
} else if (is.matrix(x)) {
x <- as.data.frame(x)
setNames(x, c("x", "y"))
} else {
do.call(rbind, lapply(x, .f))
}
}
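A quick, hypothetical check of the matrix branch (the mixed input below is my own illustration; note the setNames() call assumes two-column matrices):
# mix a matrix and a data.frame and flatten them into one data frame
mixed <- list(
  list(matrix(runif(10), ncol = 2)),
  data.frame(x = runif(5), y = runif(5))
)
str(.f(mixed))  # one data frame with columns x and y, 10 rows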

Consider a recursive call that conditionally checks whether the first item is of data.frame or list class.
stack_process <- function(lst){
  if(class(lst[[1]]) == "data.frame") {
    # bind all sibling data.frames, not just lst[[1]], so none are dropped
    tmp <- do.call(rbind, lst)
  }
  if(class(lst[[1]]) == "list") {
    inner <- lapply(lst, stack_process)
    tmp <- do.call(rbind, inner)
  }
  return(tmp)
}
new_crdOrig <- lapply(crdOrig, function(x) {
df <- stack_process(x)
filterInd <- seq(1, nrow(df), by = filterBy)
return(df[filterInd,])
})
final_df <- do.call(rbind, new_crdOrig)

Related

Randomly subsampling seurat object

I've been trying to randomly subsample my Seurat object.
I'm interested in subsampling based on 2 columns: condition and cell type. I have 5 conditions and 5 cell types. The main goal is to have 1000 cells for each cell type in each condition.
I've tried this so far:
The first thing is subsetting my Seurat object:
my.list <- list(hipo.c1.neurons = hipo %>%
subset(., condition %in% "c1" & group %in% "Neurons"),
hipo.c1.oligo = hipo %>%
subset(., condition %in% "c1" & group %in% "Oligod")...etc...)
And then subsample it using the sample function:
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
x <- x[,sample(ncol(x), 1000, replace = FALSE)]
})
And I get this error, since some objects have fewer than 1000 cells:
error in evaluating the argument 'j' in selecting a method for function '[': cannot take a sample larger than the population when 'replace = FALSE'
Then I've tried with this function:
lapply_with_error <- function(X,FUN,...){
lapply(X, function(x, ...) tryCatch(FUN(x, ...),
error = function(e)NULL))
}
But then it gives me 0 in those objects that have fewer than 1000 cells. What would be the way to skip the objects that have fewer than 1000 cells and leave them as they are (not sample those)?
Is there a simpler way to do this, so I don't have to subset all of my objects separately?
I can't say for certain without seeing your data, but could you just add an if statement in the function? It looks like you're sampling column-wise, so check the number of columns. Just return x if the number of columns is less than the number you'd like to sample.
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
if(ncol(x) > 1000){
x <- x[,sample(ncol(x), 1000, replace = FALSE)]
} else {
x
}
})
You could make it more flexible if you want to sample something other than 1000.
set.seed(0)
my.list.sampled <- lapply(X = my.list, B = 1000, FUN = function(x, B) {
if(ncol(x) > B){
x <- x[,sample(ncol(x), B, replace = FALSE)]
} else {
x
}
})
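For what it's worth, a variant sketch of the same idea that avoids the branch by sampling min(ncol(x), B) cells (my own suggestion, not from the answer above):
set.seed(0)
my.list.sampled <- lapply(X = my.list, B = 1000, FUN = function(x, B) {
  # min() caps the sample size at the number of available cells;
  # objects with at most B cells are kept whole, though in shuffled order
  x[, sample(ncol(x), min(ncol(x), B), replace = FALSE)]
})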

Return a named list with various elements from a function call

Question
I have a function like this:
myfunc <- function(x){
a1 = 1
a2 = c(2,4)
a3 = data.frame(x = 1:10)
...
an = 'str'
res = list(a1 = a1,a2 = a2,..., an=an)
return(res)
}
As we can see, I return my results in a named list. However, if the number of elements is large, I cannot type a_i = a_i one by one. I use the code snippet below to save half of my time (but I still need to type " around my elements' names, which is a waste of time):
res_short = sapply(c('a1','a2',...,'an'),FUN = function(x){list(get(x))})
return(res_short)
Note that there may be no pattern in my elements' names a1, a2, ..., an; I just use a1, a2, ..., an for simplicity.
I think returning a named list is good, since a list can store different types of elements. Are there any other methods for writing my function's return value? I want it to be clear and time-saving!
mget
Use mget as shown below. To return all variables, use mget(ls()); to return all variables except x, use mget(setdiff(ls(), "x")). ls will not return object names that begin with a dot unless the all.names argument is used, i.e. ls(all.names = TRUE), which could be used to prevent certain variables from being returned. Another possibility is to use the mode= argument of mget to restrict the objects returned to ones that are, say, numeric. See ?mget. Yet another approach to restricting the objects returned is to use Filter on the result of mget. For example, res <- Filter(is.data.frame, mget(ls())) only returns data frames.
myfunc <- function(x){
a1 = 1
a2 = c(2,4)
a3 = data.frame(x = 1:10)
an = 'str'
res = mget(ls(pattern = "^a"))
return(res)
}
myfunc(3) # test
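A minimal sketch of the Filter variant mentioned above (myfunc_df is a hypothetical name of mine):
myfunc_df <- function(x){
  a1 = 1
  a3 = data.frame(x = 1:10)
  # keep only the data.frame objects among the local variables
  Filter(is.data.frame, mget(ls()))
}
myfunc_df(3)  # returns only a3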
environment
Another possibility is to return the environment of the executing function. All objects in the function (not just the ones beginning with a) will be in the environment.
myfunc2 <- function(x) {
a1 = 1
a2 = c(2,4)
a3 = data.frame(x = 1:10)
an = 'str'
res = environment()
return(res)
}
out <- myfunc2(3) # test
out$a1
within
Another possibility is to use within. Only variables created in the within call will be returned. x is used in the within call but not created there, so it is not returned.
myfunc3 <- function(x) {
res <- within(list(), {
a1 <- x
a2 <- BOD
})
return(res)
}
myfunc3(3) # test
Multiple ls
Perform an ls() before and after the section creating the variables to be output, and then mget the difference.
myfunc4 <- function(x) {
.excl <- ls()
a1 <- x
a2 <- BOD
res <- mget(setdiff(ls(), .excl))
return(res)
}
myfunc4(3) # test
If I understand it correctly, your requirements are very flexible: you have a bunch of variables with names that have no pattern, and you want to apply a different computation for each variable. Well, you realize that you do need to type everything in at least once. One approach is to have a list of all possible variable names and their computations. You can then apply all of them, or a subset, to your input. Here is an example for 3 names with 3 different computations.
mycomputer = list(
add5 = function(x) {
x + 5
},
mymean = function(x) {
mean(x)
},
square = function(x) {
x*x
}
)
computeall = function(x) {
result = lapply(names(mycomputer), function(f) {
mycomputer[[f]](x)
})
names(result) = names(mycomputer)
result
}
computeall(c(1,2,3))
## $add5
## [1] 6 7 8
##
## $mymean
## [1] 2
##
## $square
## [1] 1 4 9
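To apply only a subset of the computations, as mentioned above, one might sketch (compute_some is my own name):
compute_some = function(x, which) {
  # simplify = FALSE keeps the result as a named list
  sapply(which, function(f) mycomputer[[f]](x), simplify = FALSE)
}
compute_some(c(1,2,3), c("add5", "square"))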

R loop to create data frames with 2 counters

What I want is to create 60 data frames with 500 rows each. I tried the code below and, while I get no errors, I am not getting the data frames. However, when I do a View on the as.data.frame, I get the view, but there is no data frame in my environment. I've been trying for three days with various versions of this code:
getDS <- function(x){
for(i in 1:3){
for(j in 1:30000){
ID_i <- data.table(x$ID[j: (j+500)])
}
}
as.data.frame(ID_i)
}
getDS(DATASETNAME)
We can use outer (on a small example)
out1 <- c(outer(1:3, 1:3, Vectorize(function(i, j) list(x$ID[j:(j + 5)]))))
lapply(out1, as.data.table)
--
The issue in the OP's function is that, inside the loop, ID_i gets updated each time, i.e. it is not stored. In order to do that, we can initialize a list and then store the results:
getDS <- function(x) {
ID_i <- vector('list', 3)
for(i in 1:3) {
for(j in 1:3) {
ID_i[[i]][[j]] <- data.table(x$ID[j:(j + 5)])
}
}
ID_i
}
do.call(c, getDS(x))
data
x <- data.table(ID = 1:50)
I'm not sure the description matches the code, so I'm a little unsure what the desired result is. That said, it is usually not helpful to split a data.table because the built-in by-processing makes it unnecessary. If for some reason you do want to split into a list of data.tables you might consider something along the lines of
getDS <- function(x, n=5, size = nrow(x)/n, column = "ID", reps = 3) {
x <- x[1:(n*size), ..column]
index <- rep(1:n, each = size)
replicate(reps, split(x, index),
simplify = FALSE)
}
getDS(data.table(ID = 1:20), n = 5)

Two data frames of the same type perform differently in a function

Below is my data
set.seed(100)
toydata <- data.frame(A = sample(1:50,50,replace = T),
B = sample(1:50,50,replace = T),
C = sample(1:50,50,replace = T)
)
Below is my swapping function
derangement <- function(x){
if(max(table(x)) > length(x)/2) return(NA)
while(TRUE){
y <- sample(x)
if(all(y != x)) return(y)
}
}
swapFun <- function(x, n = 10){
inx <- which(x < n)
y <- derangement(x[inx])
if(length(y) == 1) return(NA)
x[inx] <- y
x
}
In the first case, I get the new data toy by swapping the entire data frame. The code is below:
toydata<-as.matrix(toydata)
toy<-swapFun(toydata)
toy<-as.data.frame(toy)
In the second case, I get the new data toy by swapping each column separately. Below is the code:
toydata<-as.data.frame(toydata)
toy2 <- toydata # Work with a copy
toy2[] <- lapply(toydata, swapFun)
toy<-toy2
Below is a function that outputs the difference between the contingency tables before and after swapping.
# the function to compare contingency tables
f = function(x,y){
table1<-table(toydata[,x],toydata[,y])
table2<-table(toy[,x],toy[,y])
sum(abs(table1-table2))
}
# vectorise your function
f = Vectorize(f)
library(dplyr)  # needed for the %>% pipeline, filter() and mutate() below
combn(x=names(toydata),
y=names(toydata), 2) %>% # create all combinations of your column names
t() %>% # transpose
data.frame(., stringsAsFactors = F) %>% # save as dataframe
filter(X1 != X2) %>% # exclude pairs of same
# column
mutate(SumAbs = f(X1,X2)) # apply function
In the second case, this mutate pipeline works.
But in the first case, this mutate function does not work. It says:
+ filter(X1 != X2) %>% # exclude pairs of same column
+ mutate(SumAbs = f(X1,X2)) # apply function
Error in combn(x = names(toydata), y = names(toydata), 2) : n < m
However, in both cases the toy data are data frames with the same dimensions, the same row names and the same column names. I feel confused.
How can I fix it? Thanks.
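A likely explanation, offered as a hedged note rather than a confirmed answer: in the first case toydata is left as a matrix after as.matrix(toydata), and names() of a matrix is NULL (a matrix carries colnames(), not names()), so combn(x = NULL, 2) fails with n < m. A minimal check:
m <- as.matrix(data.frame(A = 1:2, B = 3:4))
names(m)     # NULL: a matrix has no names()
colnames(m)  # "A" "B"
Converting back with as.data.frame(toydata) before building the combinations, or using colnames(toydata), should avoid the error.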

R: Row resampling loop speed improvement

I'm subsampling rows from a data frame with columns c("x","y","density") at a variety of c("s_size","reps") combinations, where reps = number of replicates and s_size = number of rows subsampled from the whole data frame.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
#Subsampling###################
subsample_loop <- function(s_size, reps, int) {
tm1 <- system.time( #start timer
{
subsample_bound = data.frame()
#Perform Subsampling of the general
for (s_size in seq(1,s_size,int)){
for (reps in 1:reps) {
subsample <- sample.df.rows(s_size, data_xyz)
assign(paste("sample" ,"_","n", s_size, "_", "r", reps , sep=""), subsample)
subsample_replicate <- subsample[,] #temporary variable
subsample_replicate <- cbind(subsample, rep(s_size,(length(subsample_replicate[,1]))),
rep(reps,(length(subsample_replicate[,1]))))
subsample_bound <- rbind(subsample_bound, subsample_replicate)
}
}
}) #end timer
colnames(subsample_bound) <- c("x","y","density","s_size","reps")
subsample_bound
} #end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples a number of rows in a dataframe, outputs a dataframe of the same # of columns
# df Data Frame
# N number of samples to be taken
sample.df.rows <- function (N, df, ...)
{
df[sample(nrow(df), N, replace=FALSE,...), ]
}
It's way too slow; I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i,j,data){
res <- data[sample(nrow(data),i,replace = FALSE),]
res$s_size <- i
res$reps <- rep(j,i)
res
}
resampling_custom <- function(dat, s_size, int, reps) {
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
}
Calling the function
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
outputs data, unfortunately with this warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
I put very little thought into actually optimizing this, I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and expand that object using rbind, cbind or c, you can be very sure that the resulting code will be essentially the slowest possible way of doing whatever task is being attempted.
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of this if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
foo <- function(i,j,data){
res <- data[sample(nrow(data),i,replace = FALSE),]
res$s_size <- i
res$reps <- rep(j,i)
res
}
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.
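As a side note on the warning in the update above: it most likely arises because ss and id end up with different lengths when s_size/int is not a whole number (206/5 gives a 42-step size grid but only 41 whole replicate groups). A hedged sketch that derives both vectors from the same grid of sizes (resampling_custom2 is my own name):
resampling_custom2 <- function(dat, s_size, int, reps) {
  sizes <- seq(1, s_size, by = int)                # the actual grid of sample sizes
  ss <- rep(sizes, each = reps)
  id <- rep(seq_len(reps), times = length(sizes))  # guaranteed same length as ss
  do.call(rbind, mapply(foo, i = ss, j = id,
                        MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}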
