How to modify elements of a vector based on other elements in parallel?

I'm trying to parallelize part of my code, but I can't figure out how, since everywhere I read about parallelization the object is completely split up and each chunk is then processed using only the "information" in that chunk. My problem cannot be split into independent chunks, but the updates can be done independently.
Here's a simplified version:
a <- runif(10000)
indexes <- c(seq(1,9999,2), seq(2,10000,2))
for (i in indexes) {
  .prev <- ifelse(i > 1, a[i-1], 0)
  .next <- ifelse(i < 10000, a[i+1], 1)
  a[i] <- runif(1, min(.prev, .next), max(.prev, .next))
}
While each iteration of the for loop depends on the current values in a, the order defined in indexes makes the problem parallelizable within the even and the odd indexes, e.g., if indexes = seq(1,9999,2) the dependence is not a problem, since the values used in each iteration will never be modified within that pass. On the other hand, it is impossible to execute this on the subset a[indexes] alone, so the "splitting" strategy in the guides I have read cannot perform this operation.
How can I parallelize a problem like this, where each iteration needs to modify the object and "look" at the whole vector? Instead of each worker returning some output, each worker has to modify a "shared" object in memory?
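For what it's worth, here is a minimal sketch of the two-phase idea described above using the parallel package; the helper update_index and the cluster size are illustrative, not part of the original post. Within each phase an index only reads its neighbours, which belong to the other phase and are not modified during that pass, so all updates of a phase can be computed from a snapshot of a on the workers and written back at once.
library(parallel)

a <- runif(10000)
cl <- makeCluster(2)

update_index <- function(i, a) {
  .prev <- if (i > 1) a[i-1] else 0
  .next <- if (i < length(a)) a[i+1] else 1
  runif(1, min(.prev, .next), max(.prev, .next))
}

# phase 1: odd indexes (read only even neighbours); phase 2: even indexes
for (idx in list(seq(1, 9999, 2), seq(2, 10000, 2))) {
  a[idx] <- unlist(parLapply(cl, idx, update_index, a = a))
}

stopCluster(cl)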

Related

What is the most memory efficient way to initialize a list before a loop in R?

I am wondering what the most memory efficient way to initialize a list is in R if that list is going to be used in a loop to store results. I know that growing an object in a loop can cause a serious hit in computational efficiency so I am trying to avoid that as much as possible.
My problem is as follows. I have several groups of data that I want to process individually. The gist of my code is I have a loop that runs through each group one at a time, does some t-tests, and then returns only the statistically significant results (thus variable length results for each group). So far I am initializing a list of length(groups) to store the results of each iteration.
My main question is how I should be initializing the list so that the object is not grown in the loop.
Is it good enough to do list = vector(mode = "list", length=length(groups)) for the initialization?
I am skeptical about this because it just creates a list of length(groups) but each entry is equal to NULL. My concern is that during each iteration of the loop when I go to store data into the list, it is going to recopy the object each time as the entry goes from NULL to my results vector, in which case initializing the list doesn't really do much good. I don't know how the internals of a list work, however, so it is possible that it just stores the reference to the vector being stored in the list, meaning recopying is not necessary.
The other option would be to initialize each element of the list to a vector of the maximum possible length the results could have.
This is not a big issue as the maximum number of possible valid results is known. If I took this approach I would just overwrite each vector with the results vector within the loop. Since the maximum amount of memory would already be reserved hopefully no recopying/growth would occur. I don't want to take this approach, however, if it is not necessary and the first option above is good enough.
Below is some pseudo code describing my problem:
# initialize variables
results = vector(mode = "list", length = length(groups))  # the line of code in question
y = 1
tTests = vector(length = length(singleGroup))

# perform analysis on each group in groups
for (group in groups) {
  # returns a vector of p values with one entry per element in group
  tTests = tTestFunction(group)
  results[[y]] = tTests <= 0.05
  y = y + 1
}
Your code does not work, so it is a bad example. Consider this:
x <- vector("list", length = 4)
tracemem(x) ## trace memory copies of "x"
for (i in 1:4) x[[i]] <- rnorm(4)
No extra copy of x is made during the update, so there is nothing to worry about.
As @lmo suggested, even if you use x <- list() to initialize this list, no memory copy will be incurred either.
Comment
The aim of my answer is to point you to tracemem, which you can use when you want to trace (possible) memory copies made during code execution. Had you known this function, you would not have needed to ask here.
I have written another answer related to using tracemem, albeit in a different context. There you can see what tracemem returns when memory copies are made.
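For contrast (this example is mine, not from the linked answer), here is the standard copy-on-modify situation where tracemem does report a copy: when a second name points at the same vector, modifying it through either name forces a duplicate.
x <- rnorm(4)
y <- x          # a second name now points at the same vector
tracemem(x)
x[1] <- 0       # modifying x forces a copy, which tracemem reports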

R language: how to work with dynamically sized vector?

I'm learning R programming, and trying to understand the best approach to working with a vector when you don't know the final size it will end up being. For example, in my case I need to build the vector inside a for loop, but only for some iterations, which aren't known beforehand.
METHOD 1
I could run through the loop a first time to determine the final vector length, initialize the vector to that length, and then run through the loop a second time to populate it. This would be ideal from a memory usage standpoint, since the vector would occupy exactly the required amount of memory.
METHOD 2
Or, I could use one for loop and simply append to the vector as needed, but this would be inefficient from a memory allocation standpoint, since a new block may need to be allocated each time an element is appended to the vector. If you're working with big data, this could be a problem.
METHOD 3
In C or Matlab, I usually initialize the vector length to the largest possible length that I know the final vector could occupy, then populate a subset of elements in the for loop. When the loop completes, I'll re-size the vector length appropriately.
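For reference, a minimal sketch of METHOD 3 in R (the maximum length and the per-iteration condition are made up for illustration): allocate the largest possible length up front, fill a prefix, then truncate once at the end.
n_max <- 1000              # largest length the result could possibly have
out <- numeric(n_max)
k <- 0                     # number of elements actually written
for (i in seq_len(n_max)) {
  value <- runif(1)
  if (value > 0.5) {       # only some iterations produce a result
    k <- k + 1
    out[k] <- value
  }
}
out <- out[seq_len(k)]     # resize to the filled length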
Since R is used a lot in data science, I thought this would be a topic others would have encountered and there may be a best practice that was recommended. Any thoughts?
Canonical R code would use lapply or similar to run the function on each element, then combine the results in some way. This avoids the need to grow a vector or know the size ahead of time. This is the functional programming approach to things. For example,
set.seed(5)
x <- runif(10)
some_fun <- function(x) {
  if (x > 0.5) {
    return(x)
  } else {
    return(NULL)
  }
}
unlist(lapply(x, some_fun))
The size of the result vector is not specified, but is determined automatically by combining results.
Keep in mind that this is a trivial example for illustration. This particular operation could be vectorized.
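For instance, assuming the same x as above, the vectorised equivalent is simply:
x[x > 0.5]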
I think Method1 is the best approach if you have a very large amount of data. But in general you might want to read this chapter before you make a final decision:
http://adv-r.had.co.nz/memory.html

Yet another apply question

I am totally convinced that an efficient R program should avoid using loops whenever possible and should instead use the big family of apply functions.
But this cannot happen without pain.
For example, I am facing a problem whose solution involves a sum inside the applied function; as a result, the list of results is reduced to a single value, which is not what I want.
To be concrete, I will try to simplify my problem.
Assume N = 100:
sapply(list(1:N), function(n) (
  choose(n, (floor(n/2)+1):n) *
    eps^((floor(n/2)+1):n) *
    (1 - eps)^(n - ((floor(n/2)+1):n))))
As you can see, the function inside causes the length of the built vector to explode, whereas using sum inside would collapse everything to a single value:
sapply(list(1:N), function(n) sum(
  choose(n, (floor(n/2)+1):n) *
    eps^((floor(n/2)+1):n) *
    (1 - eps)^(n - ((floor(n/2)+1):n))))
What I would like to have is a list of length N (one value per n).
So what do you think? How can I repair it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not incredibly slow. For loops are incredibly slow when used improperly because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in a field has a tiny cost - but expanding the /length/ of the vector is fairly costly because what you're actually doing is creating an entirely new object, finding space for that object, copying the name over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
# Create output object. We're specifying the length in advance so that
# writing to it is cheap
output <- numeric(length = length(N))

# Start the for loop
for (i in seq_along(output)) {
  output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.
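As a small illustration of that pattern (the data.frame and the computation are made up), fill a pre-allocated vector in the loop and attach it to the data.frame once at the end:
df <- data.frame(x = rnorm(10))
output <- numeric(length = nrow(df))
for (i in seq_along(output)) {
  output[i] <- df$x[i]^2     # stand-in for the real per-row computation
}
df$new_col <- output         # the data.frame is modified only once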

How to simplify several for loops into a single loop or function in R

I am trying to combine several for loops into a single loop or function. Each loop evaluates whether an individual is present at a protected site and, based on that, assigns a number (numbers represent sites) at each time step. After that, the results for each time step are stored in a matrix and later used in other analyses. The problem I am having is that I am repeating the same loop several times to evaluate the different scenarios (10%, 50%, 100% of sites protected). Since I need to store the results for each scenario, I can't think of a better way to simplify this into a single loop or function. Any ideas or suggestions will be appreciated. This is a very small and simplified version of the problem. I would like to keep the structure of the loop, since my original loop uses several if statements. The only thing that changes is the proportion of sites that are protected.
N<-10 # number of sites
sites<-factor(seq(from=1,to=N))
sites10<-as.factor(sample(sites,N*1))
sites5<-as.factor(sample(sites,N*0.5))
sites1<-as.factor(sample(sites,N*0.1))
steps<-10
P.stay<-0.9
# storing results
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for (i in 1:(steps-1)) {
  if (j[i] %in% sites1) {
    if (rbinom(1, 1, P.stay) == 1) {
      time.step$event[i+1] <- j[i+1] <- j[i]
    } else {
      time.step$event[i+1] <- 0
    }
  }
  time.step$event[i+1] <- j[i+1] <- sample(1:N, 1)
}
results.sites1<-as.factor(result)
###
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for (i in 1:(steps-1)) {
  if (j[i] %in% sites5) {
    if (rbinom(1, 1, P.stay) == 1) {
      time.step$event[i+1] <- j[i+1] <- j[i]
    } else {
      time.step$event[i+1] <- 0
    }
  }
  time.step$event[i+1] <- j[i+1] <- sample(1:N, 1)
}
results.sites5<-as.factor(result)
###
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for (i in 1:(steps-1)) {
  if (j[i] %in% sites10) {
    if (rbinom(1, 1, P.stay) == 1) {
      time.step$event[i+1] <- j[i+1] <- j[i]
    } else {
      time.step$event[i+1] <- 0
    }
  }
  time.step$event[i+1] <- j[i+1] <- sample(1:N, 1)
}
results.sites10<-as.factor(result)
#
results.sites1
results.sites5
results.sites10
Instead of doing this:
sites10<-as.factor(sample(sites,N*1))
sites5<-as.factor(sample(sites,N*0.5))
sites1<-as.factor(sample(sites,N*0.1))
and running distinct loops for each of the three variables, you can make a general loop and put it in a function, then use one of the -apply functions to call it with specific parameters. For example:
N<-10 # number of sites
sites<-factor(seq(from=1,to=N))
steps<-10
P.stay<-0.9
simulate.n.sites <- function(n) {
  n.sites <- sample(sites, n)
  result <- matrix(0, nrow = steps)
  time.step <- seq(1, steps)
  time.step <- data.frame(time.step)
  time.step$event <- 0
  j <- numeric(steps)
  j[1] <- sample(1:N, 1)
  time.step$event[1] <- j[1]
  for (i in 1:(steps-1)) {
    if (j[i] %in% n.sites) {
      ...etc...
  return(result)
}
results <- lapply(c(1, 5, 10), simulate.n.sites)
Now results will be a list, with three matrix elements.
The key is to identify places where you repeat yourself, and then refactor those areas into functions. Not only is this more concise, but it's easy to extend in the future. Want to sample 2 sites? Put a 2 in the vector you pass to lapply.
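For example (a hypothetical extension of the call above):
results <- lapply(c(1, 2, 5, 10), simulate.n.sites)
names(results) <- c("sites1", "sites2", "sites5", "sites10")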
If you're unfamiliar with the -apply family of functions, definitely look into those.
I also suspect that much of the rest of your code could be simplified, but I think you've gutted it too much for me to make sense of it. For example, you define an element of time.step$event based on a condition, but then you overwrite that element. Surely this isn't what the actual code does?

In R, is there danger of communication between foreach loops (doSNOW) when using assignments to store intermediate output?

I want to create a function that uses assignments to store intermediate output (p). This intermediate output is used in statements further down. I want everything to be parallelized using doSNOW and foreach, and I do NOT want that intermediate output to be communicated between iterations of the foreach loop. I don't want to store the intermediate output in a list (e.g. p[[i]]) because then I would have to change a huge amount of code.
Question 1: Is there any danger that another iteration of the foreach loop will use the intermediate output (p)?
Question 2: If yes, when would there be danger of that happening and how to prevent it?
Here is an example of what I mean:
install.packages('foreach')
library('foreach')
install.packages('doSNOW')
library('doSNOW')
NbrCores <- 4
cl<-makeCluster(NbrCores)
registerDoSNOW(cl)
test <- function(value){
  foreach(i = 1:500) %dopar% {
    # some statement based on parameter 'value'
    p <- value
    # some statement that uses p
    v <- p
    # other statements
  }
}
test(value=1)
Each of the nodes used in the parallel computation runs in its own R process, I believe, so there is no risk of variables from one node influencing the results in another. In general it is possible to communicate between the processes, but foreach only iterates over the sequence it is given, executing each item in the sequence independently on one of the nodes.
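A small sketch (mine, not from the original answer) that makes this concrete: each task assigns its own p and returns a value derived from it, and the combined result shows no cross-talk between iterations.
library(foreach)
library(doSNOW)

cl <- makeCluster(2)
registerDoSNOW(cl)

out <- foreach(i = 1:4, .combine = c) %dopar% {
  p <- i * 10   # intermediate object, local to this task's worker process
  p + 1
}
out             # 11 21 31 41: each iteration only ever saw its own p

stopCluster(cl)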
