What is the most memory-efficient way to initialize a list before a loop in R?

I am wondering what the most memory-efficient way to initialize a list is in R, if that list is going to be used in a loop to store results. I know that growing an object in a loop can cause a serious hit in computational efficiency, so I am trying to avoid that as much as possible.
My problem is as follows. I have several groups of data that I want to process individually. The gist of my code is a loop that runs through each group one at a time, does some t-tests, and then returns only the statistically significant results (thus results of variable length for each group). So far I am initializing a list of length(groups) to store the results of each iteration.
My main question is how I should be initializing the list so that the object is not grown in the loop.
Is it good enough to do list = vector(mode = "list", length=length(groups)) for the initialization?
I am skeptical about this because it just creates a list of length(groups) but each entry is equal to NULL. My concern is that during each iteration of the loop when I go to store data into the list, it is going to recopy the object each time as the entry goes from NULL to my results vector, in which case initializing the list doesn't really do much good. I don't know how the internals of a list work, however, so it is possible that it just stores the reference to the vector being stored in the list, meaning recopying is not necessary.
The other option would be to initialize each element of the list to a vector of the maximum possible length the results could have.
This is feasible, since the maximum number of valid results is known. If I took this approach, I would simply overwrite each vector with the results vector inside the loop. Since the maximum amount of memory would already be reserved, hopefully no recopying/growth would occur. I don't want to take this approach, however, if it is not necessary and the first option above is good enough.
Below is some pseudocode describing my problem:
# initialize variables
results <- vector(mode = "list", length = length(groups))  # the line of code in question
y <- 1
tTests <- vector(length = length(singleGroup))

# perform analysis on each group in groups
for (group in groups) {
  # tTestFunction returns a vector of p values, one entry per element in group
  tTests <- tTestFunction(group)
  results[[y]] <- tTests <= 0.05
  y <- y + 1
}

Your code does not work, so it is a bad example. Consider this:
x <- vector("list", length = 4)
tracemem(x) ## trace memory copies of "x"
for (i in 1:4) x[[i]] <- rnorm(4)
No extra copy of x is made during the update, so there is nothing to worry about.
As suggested by @lmo, even if you use x <- list() to initialize this list, no memory copy will be incurred either.
Comment
The aim of my answer is to point you to tracemem, which you can use to trace (possible) memory copies made during code execution. Had you known this function, you would not have needed to ask here.
I have given another answer related to using tracemem, in a different context; there you can see what tracemem returns when memory copies are made.
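For contrast, a minimal sketch (not from the linked answer) of a case where tracemem does report a duplication, namely R's copy-on-modify when two names share one object:
x <- numeric(4)
tracemem(x)   ## start tracing "x"
y <- x        ## "x" and "y" now refer to the same object
x[1] <- 1     ## tracemem reports a copy: "x" is duplicated before the write
untracemem(x)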

Related

Extracting Nested Elements of an R List Generated by Loops

For lists within lists produced by a loop in R (in this example, a list of caret models), I get an object with an unpredictable length and unpredictable names for the inner elements, such as list[[1]][[n repeats of 1]][[2]], where the internal [[1]] is repeated multiple times according to the function's input. In some cases n is not known, e.g. when accessing older stored lists where the input was not saved. While there are ways to work with a list index, like list[length(list)], there appears to be no way to do this with repeated nested elements. This has made accessing them and passing them to various jobs awkward. I assume there is an efficient way to access them that I have missed, so I'm asking for help, with an example case given below.
The function I'm working with creates several outputs and returns them as a single list. For a function with this complicated output structure, the final list is produced by returning something like:
return(list(listOfModels, trainingData, testingData))
listOfModels has variable length, depending on the models given as input and potentially on other conditions evaluated inside the function. It is built by:
listOfModels <- list(c(listOfModels, list(trainedModel)))  # c() appends the new model, then list() wraps everything one level deeper each iteration
Where the "trainedModel" refers to the most recently trained model generated in the loop. The models used and the number of them may vary each time depending on choice. An unfortunate result is a complicated nested lists within a list.
That is, output[[1]] contains the models I want to access more efficiently, which are themselves list objects, while output[[2]] and output[[3]] are the dataframes used to train and evaluate the models. While accessing the dataframes is simple and has a defined, reproducible structure each time (simply being output[[2]], output[[3]] every time), output[[1]] becomes a mess. E.g., something like the following follows the "output[[1]]":
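To make the shape concrete, here is a hedged reconstruction (using placeholder strings instead of actual caret models) of what two iterations of that construction produce:
listOfModels <- list()
for (trainedModel in c("m1", "m2")) {
  listOfModels <- list(c(listOfModels, list(trainedModel)))
}
str(listOfModels)
# List of 1
#  $ :List of 2
#   ..$ :List of 1
#   .. ..$ : chr "m1"
#   ..$ : chr "m2"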
The only handle I have on this is the fact that [[1]] is attached to output[[1]] before [[2]], and that all of the nested elements except one have a [[2]] at the end. Given that pattern, there is an ugly solution that works, but it is not a desirable format to work with. E.g., after evaluating n models given by a vector of strings called inputList, with the function's output stored in "output", I can have [[1]] repeated tens to hundreds of times:
for (i in (1:length(inputList)) - 1) {
  # evaluates output[[1]][[2]], output[[1]][[1]][[2]], ... for i = 0, 1, ...
  eval(rlang::parse_expr(
    paste0(c("output", rep("[[1]]", 1 + i), "[[2]]"), collapse = "")))
}
This could be used to apply all models to some downstream task, like making predictions on new data. In cases where the length of inputList is not known, it could be found by repeating this until hitting an error, or something similar. The approach can also be modified to call on a specific part of the list, for example a certain model within inputList, if I know the original input and can find the number for that model. But besides the bulkiness of code written this way, I would much rather call something like output[[1]][[n]] in a predictable format for varying n. One of the big problems is accessing older saved runs where the input list of models was not kept, leaving n unknown. I don't know of any way of using something like length() or lengths() to count how many nested elements exist within a list. (For my example, output[[1]] is of length 1, no matter how many [[1]] repeat elements there are.)
I believe the simplest solution is to change the way the list is saved by the function, so that I can access it by a systematic reference, however, I have a bunch of old lists which I still want to access and perform some work with, and I'd also like to be able to have better control of working with lists in any case. So any help would be greatly appreciated.
I expected there would be some way to query the structure of nested R lists, which could be used to pass nested elements to separate functions, without having to use very long repetition of brackets.
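No accepted answer is recorded here, but given that structure, one option is a small recursive helper. This is a hypothetical sketch (collect_models is not from the original post): it assumes every wrapper is a two-element list holding the inner wrapper at [[1]] and a model at [[2]], with the innermost wrapper holding only the first model, so it never descends into the models themselves.
collect_models <- function(x) {
  if (length(x) < 2) return(x)      # innermost wrapper: just the first model
  c(collect_models(x[[1]]), x[2])   # recurse inward, then append this level's model
}

collect_models(listOfModels[[1]])   # list("m1", "m2") for the sketch above
length() of the result then recovers n even for old saved runs where the input list was not kept.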

How to modify elements of a vector based on other elements in parallel?

I'm trying to parallelize part of my code, but I can't figure out how, since everywhere I read about parallelization the object is split into chunks that are then processed using only the information inside each chunk. My problem cannot be split into independent chunks, but the updates can be done independently.
Here's a simplified version:
a <- runif(10000)
indexes <- c(seq(1,9999,2), seq(2,10000,2))
for (i in indexes) {
  .prev <- ifelse(i > 1, a[i - 1], 0)
  .next <- ifelse(i < 10000, a[i + 1], 1)
  a[i] <- runif(1, min(.prev, .next), max(.prev, .next))
}
While each iteration of the for loop depends on the current values in a, the order defined in indexes makes the problem parallelizable within the even and the odd indexes: e.g., if indexes = seq(1, 9999, 2), the dependence is not a problem, since the values read in each iteration are never the ones being modified. On the other hand, it is impossible to execute this on the subset a[indexes] alone, so the "splitting" strategy in the guides I read cannot perform this operation.
How can I parallelize a problem like this, where I need to modify the object and "look" at the whole vector every time? Instead of each worker returning some output, each worker would have to modify a "shared" object in memory.
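No answer is recorded here, but the even/odd observation above is exactly the classic red-black (checkerboard) update scheme. A minimal sketch of that idea in plain vectorized R (an illustration under the poster's assumptions, not a true multi-worker solution; each sweep updates independent positions at once, so it could equally be split across workers):
set.seed(1)
a <- runif(10000)

## Update every position in idx at once; padding with the boundary values 0
## and 1 yields a[i - 1] and a[i + 1] without special-casing the endpoints.
sweep_update <- function(a, idx) {
  prev <- c(0, a)[idx]        # a[i - 1], with 0 standing in at i = 1
  nxt  <- c(a[-1], 1)[idx]    # a[i + 1], with 1 standing in at i = length(a)
  a[idx] <- runif(length(idx), pmin(prev, nxt), pmax(prev, nxt))
  a
}

a <- sweep_update(a, seq(1, 9999, 2))   # odd sweep: reads only even entries
a <- sweep_update(a, seq(2, 10000, 2))  # even sweep: reads the updated odds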

R language: how to work with dynamically sized vector?

I'm learning R programming and trying to understand the best approach to work with a vector when you don't know the final size it will end up being. For example, in my case I need to build the vector inside a for loop, but only on some iterations, which aren't known beforehand.
METHOD 1
I could run through the loop a first time to determine the final vector length, initialize the vector to that length, and then run through the loop a second time to populate it. This would be ideal from a memory standpoint, since the vector would occupy exactly the amount of memory required.
METHOD 2
Or I could use one for loop and simply append to the vector as needed, but this would be inefficient from a memory allocation standpoint, since a new block may need to be allocated each time an element is appended. If you're working with big data, this could be a problem.
METHOD 3
In C or MATLAB, I usually initialize the vector to the largest length the final result could possibly occupy, populate a subset of elements in the for loop, and then re-size the vector appropriately once the loop completes.
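For concreteness, a minimal sketch of METHOD 3 in R (the filtering condition is a hypothetical stand-in, not from the original post):
max_n <- 100
out <- numeric(max_n)        # reserve the maximum possible length up front
k <- 0
for (x in runif(max_n)) {
  if (x > 0.5) {             # only some iterations produce a result
    k <- k + 1
    out[k] <- x
  }
}
out <- out[seq_len(k)]       # one resize after the loop completes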
Since R is used a lot in data science, I thought this would be a topic others would have encountered and there may be a best practice that was recommended. Any thoughts?
Canonical R code would use lapply or similar to run the function on each element, then combine the results in some way. This avoids the need to grow a vector or know the size ahead of time. This is the functional programming approach to things. For example,
set.seed(5)
x <- runif(10)
some_fun <- function(x) {
  if (x > 0.5) {
    return(x)
  } else {
    return(NULL)
  }
}
unlist(lapply(x, some_fun))
The size of the result vector is not specified, but is determined automatically by combining results.
Keep in mind that this is a trivial example for illustration. This particular operation could be vectorized.
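For reference, the vectorized form of that particular example is a one-line logical subset:
x[x > 0.5]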
I think Method 1 is the best approach if you have a very large amount of data. But in general you might want to read this chapter before you make a final decision:
http://adv-r.had.co.nz/memory.html

Yet another apply question

I am totally convinced that an efficient R program should avoid using loops whenever possible and instead should use the big family of apply functions.
But this cannot happen without pain.
For example, I am facing a problem whose solution involves a sum in the applied function; as a result, the list of results is reduced to a single value, which is not what I want.
To be concrete, I will try to simplify my problem. Assume N = 100:
sapply(list(1:N), function(n) (
  choose(n, (floor(n/2) + 1):n) *
    eps^((floor(n/2) + 1):n) *
    (1 - eps)^(n - ((floor(n/2) + 1):n))))
As you can see, the function inside causes the length of the built vector to explode, whereas putting sum inside would collapse everything to a single value:
sapply(list(1:N), function(n) sum(
  choose(n, (floor(n/2) + 1):n) *
    eps^((floor(n/2) + 1):n) *
    (1 - eps)^(n - ((floor(n/2) + 1):n))))
What I would like to have is a list of length N.
So what do you think? How can I repair it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not incredibly slow. For loops are incredibly slow when used improperly because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in a field has a tiny cost - but expanding the /length/ of the vector is fairly costly because what you're actually doing is creating an entirely new object, finding space for that object, copying the name over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
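A quick way to see the cost described above (a hedged illustration, not part of the original answer; the exact gap depends on your R version, since recent versions over-allocate when a vector is grown by subassignment):
n <- 1e5
system.time({        # growing: repeated reallocation as the vector lengthens
  grown <- numeric(0)
  for (i in 1:n) grown[i] <- i
})
system.time({        # preallocated: every write is a cheap in-place update
  pre <- numeric(n)
  for (i in 1:n) pre[i] <- i
})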
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
# Create the output object. We're specifying the length in advance
# so that writing to it is cheap.
output <- numeric(length = length(N))
# Start the for loop
for (i in seq_along(output)) {
  output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.
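Tying this back to the original snippet, a minimal sketch of a list-valued version (eps is assumed to be some probability, since the question never defines it): iterating over 1:N directly, instead of list(1:N), which passes the whole vector to the function in a single call, and using lapply keeps one vector of terms per n without collapsing anything:
eps <- 0.1   # assumed value; not defined in the question
N <- 100
out <- lapply(1:N, function(n) {
  k <- (floor(n / 2) + 1):n
  choose(n, k) * eps^k * (1 - eps)^(n - k)   # vector of terms for this n
})
length(out)  # a list of length N, one element per n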

How to structure R code for inherently nested problems to be easily readible?

There are problems that inherently require several layers of nesting to be solved. In a current project, I frequently find myself using three nested apply calls in order to do something with the elements contained in the deepest layer of a nested list structure.
R's list handling and apply family allow quite concise code for this type of problem, but writing it still gives me headaches, and I'm pretty sure that anyone else reading it will need several minutes to understand what I am doing. This is although I'm basically doing nothing complicated; it only seems complicated because of the multiple layers of lists to be traversed.
Below I provide a snippet of code that I wrote that shall serve as an example. I would consider it concise, yet hard to read.
Context: I am writing a simulation of surface electromyographic data, i.e. changes in the electrical potential that can be measured on the human skin and which is invoked by muscular activity. For this, I consider several muscles (first layer of lists) which each consist of a number of so-called motor units (second layer of lists) which each are related to a set of electrodes (third layer of lists) placed on the skin. In the example code below, we have the objects .firing.contribs, which contains the information on how strong the action of a motor unit influences the potential at a specific electrode, and .firing.instants which contains the information on the instants in time at which these motor units are fired. The given function then calculates the time course of the potential at each electrode.
Here's the question: what are possible options to render this type of code easily understandable for the reader? Particularly: how can I make clearer what I am actually doing, i.e. performing some computation on each (MU, electrode) pair and then summing up all potential contributions to each electrode? I feel this is currently not very obvious in the code.
Notes:
I know of plyr. So far, I have not been able to see how I can use this to render my code more readable, though.
I can't convert my structure of nested lists to a multidimensional array since the number of elements is not the same for each list element.
I searched for R implementations of the MapReduce algorithm, since this seems to be applicable to the problem I give as an example (although I'm not a MapReduce expert, so correct me if I'm wrong...). I only found packages for Big Data processing, which is not what I'm interested in. Anyway, my question is not about this particular (Map-Reducible) case but rather about a general programming pattern for inherently nested problems.
I am not interested in performance for the moment, just readability.
Here's the example code.
sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
  calc.MU.contribs <- function(.MU.firing.contribs, .MU.firing.instants)
    lapply(.MU.firing.contribs, calc.MU.electrode.contrib,
           .MU.firing.instants)
  calc.muscle.contribs <- function(.MU.firing.contribs, .MU.firing.instants) {
    MU.contribs <- mapply(calc.MU.contribs, .MU.firing.contribs,
                          .MU.firing.instants, SIMPLIFY = FALSE)
    muscle.contribs <- reduce.contribs(MU.contribs)
  }
  muscle.contribs <- mapply(calc.muscle.contribs, .firing.contribs,
                            .firing.instants, SIMPLIFY = FALSE)
  surface.potentials <- reduce.contribs(muscle.contribs)
}
## Takes a list (one element per object) of lists (one element per electrode)
## that contain the time course of the contributions of that object to the
## surface potential at that electrode (as numerical vectors).
## Returns a list (one element per electrode) containing the time course of the
## contributions of this list of objects to the surface potential at all
## electrodes (as numerical vectors again).
reduce.contribs <- function(obj.list) {
  contribs.by.electrode <- lapply(seq_along(obj.list[[1]]), function(i)
    sapply(obj.list, `[[`, i))
  contribs <- lapply(contribs.by.electrode, rowSums)
}
calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
  ## This will in reality be more complicated since then .MU.firing.contrib
  ## will have a different (more complicated) structure.
  .MU.firing.contrib * .MU.firing.instants
}
firing.contribs <- list(list(list(1, 2), list(3, 4)),
                        list(list(5, 6), list(7, 8), list(9, 10)))
firing.instants <- list(list(c(0, 0, 1, 0), c(0, 1, 0, 0)),
                        list(c(0, 0, 0, 0), c(1, 0, 1, 0), c(0, 1, 1, 0)))
surface.potentials <- sum.MU.firing.contributions(firing.contribs, firing.instants)
As suggested by user @alexis_laz, a good option is to not use a nested list structure for representing the data at all, but rather a flat, 2D, non-nested data.frame. This greatly simplified the code and increased conciseness and readability for the above example, and it seems promising for other cases, too.
It completely eliminates the need for nested apply-foo and complicated list traversing. Moreover, it allows for the usage of a lot of built-in R functionality that works on data frames.
I think there are two main reasons why I did not consider this solution before:
It does not feel like the natural representation of my data, since the data actually are nested. By fitting the data into a data.frame, we're losing direct information on this nesting. It can of course be reconstructed, though.
It introduces redundancy, as I will have a lot of equivalent entries in the categorical columns. Of course this doesn't matter by any measure since it's just indices that don't consume a lot of memory or something, but it still feels kind of bad.
Here is the rewritten example. Comments are most welcome.
sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {
  firing.info <- merge(.firing.contribs, .firing.instants)
  firing.info$contribs.time <- mapply(calc.MU.electrode.contrib,
                                      firing.info$contrib,
                                      firing.info$instants,
                                      SIMPLIFY = FALSE)
  surface.potentials <- by(I(firing.info$contribs.time),
                           factor(firing.info$electrode),
                           function(list) colSums(do.call(rbind, list)))
  surface.potentials
}
calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {
  ## This will in reality be more complicated since then .MU.firing.contrib
  ## will have a different (more complicated) structure.
  .MU.firing.contrib * .MU.firing.instants
}
firing.instants <- data.frame(muscle = c(1, 1, 2, 2, 2),
                              MU = c(1, 2, 1, 2, 3),
                              instants = I(list(c(F, F, T, F), c(F, T, F, F),
                                                c(F, F, F, F), c(T, F, T, F),
                                                c(F, T, T, F))))
firing.contribs <- data.frame(muscle = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                              MU = c(1, 1, 2, 2, 1, 1, 2, 2, 3, 3),
                              electrode = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
                              contrib = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
surface.potentials <- sum.MU.firing.contributions(firing.contribs, firing.instants)
