Directly assign results of doMC (foreach) to data frame - r

Let's say I have the example code
kkk <- data.frame(m.mean=1:1000, m.sd=1:1000/20)
kkk[, 3:502] <- NA
for (i in 1:nrow(kkk)) {
  kkk[i, 3:502] <- rnorm(n=500, mean=kkk[i, 1], sd=kkk[i, 2])
}
I would like to convert this loop to run in parallel with doMC. My problem is that foreach returns a list, whereas I need the result of each iteration to be a vector that can then be transferred to the data frame (which will later be exported as CSV for further processing).
Any ideas?

You don't need a loop for this, and putting a large matrix of numbers in a data frame only to treat it as a matrix is inefficient (although you may need to create a data frame at the end, after doing all your math, in order to write to a CSV file).
m.mean <- 1:1000
m.sd <- 1:1000/20
num.columns <- 500
x <- matrix(nrow=length(m.mean), ncol=num.columns,
            data=rnorm(n=length(m.mean) * num.columns))
x <- x * cbind(m.sd)[, rep(1, num.columns)] + cbind(m.mean)[, rep(1, num.columns)]
kkk <- data.frame(m.mean=m.mean, m.sd=m.sd, unname(x))
write.csv(kkk, "kkk.txt")
To answer your original question about directly assigning results to an existing data structure from a foreach loop: that is not possible. The foreach package's parallel backends are designed to perform each computation in a separate R process, so each one has to return a separate object to the parent process, which collects them with the .combine function provided to foreach. You could write a parallel foreach loop that assigns directly to the kkk variable, but it would have no effect, because each assignment would happen in a separate process and would not be shared with the main process.
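If you do want a foreach version, the idiomatic route is to have each iteration return its vector and let .combine assemble the results. A minimal sketch under the question's setup (assuming a Unix-like system, since doMC relies on forking; the core count of 4 is arbitrary):
library(doMC)
registerDoMC(cores=4)

kkk <- data.frame(m.mean=1:1000, m.sd=1:1000/20)

# Each iteration returns one numeric vector of length 500;
# .combine=rbind stacks them into a 1000 x 500 matrix.
res <- foreach(i=1:nrow(kkk), .combine=rbind) %dopar% {
  rnorm(n=500, mean=kkk[i, 1], sd=kkk[i, 2])
}

kkk[, 3:502] <- res
write.csv(kkk, "kkk.csv")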

Related

Send chunks of large dataset to specific cores for R parallel foreach

I am attempting to scale a script that applies a feature extraction function to an image file generated from each row of an R matrix. For faster computation I split the matrix into equally sized chunks and process each chunk in parallel using the foreach structure. To do this, I have to send the entire matrix to each core using clusterExport and subset it to the desired chunk within the foreach loop.
I am hoping to find a way to export only the matrix chunks to each core, instead of passing the full matrix to each core and then subsetting. I was only able to find one thread close to what I was looking for: this answer by Steve Weston, who sent individual chunks to each core using clusterCall (code pasted below).
library(parallel)
cl <- makeCluster(detectCores())
df <- data.frame(a=1:10, b=1:10)
ix <- splitIndices(nrow(df), length(cl))
for (i in seq_along(cl)) {
  clusterCall(cl[i], function(d) {
    assign('mydata', d, pos=.GlobalEnv)
    NULL  # don't return any data to the master
  }, df[ix[[i]], , drop=FALSE])
}
This answer worked as advertised; however, the cores in this example run in sequence rather than in parallel. My attempt to parallelize this using foreach instead of for was hamstrung by having to use clusterExport to transfer the dataset variable, which is exactly what I'm trying to avoid:
clusterExport(cl,c("df","ix"))
foreach() %dopar% {etc}
Is there a way to pass chunks of a variable to each core and operate on them in parallel? A foreach solution would be nice, or a parallelized adaptation of Steve Weston's structure. Note that I am developing for Windows, so forking is not an option for me.
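One approach consistent with this requirement is the isplitRows iterator from the itertools package: foreach consumes the iterator on the master and ships only the current chunk to each worker, so the full data frame is never exported. A minimal sketch (the rowSums call is a placeholder for the real per-chunk work; a PSOCK cluster, as created here, works on Windows):
library(doParallel)
library(itertools)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

df <- data.frame(a=1:10, b=1:10)

# isplitRows yields one row-chunk at a time; each %dopar% task
# receives only its own chunk, never the whole data frame.
res <- foreach(chunk=isplitRows(df, chunks=length(cl)),
               .combine=rbind) %dopar% {
  data.frame(rowsum=rowSums(chunk))  # placeholder computation
}

stopCluster(cl)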

Losing data frame cells in foreach loop

Similar questions have been posted, but I can't find one that actually addresses the problem I'm having, so sorry if this is not distinct enough.
I'm processing a for loop in parallel using doParallel and foreach. The core of my code is:
combinedOut <- foreach(i = 1:48, .combine=rbind) %dopar% {
  ## function that builds a data frame row with 6 columns,
  ## adding different columns separately; the data frame is called out18
  out18[i, ]
}
When I run this as a for loop, my output (out18) is correct and in this form.
However, when I run it as a foreach loop, only the first and last columns of combinedOut contain the right values. I have no idea why it is only the middle four columns that are empty.
Essentially I want to copy the entire ith row of every foreach iteration and combine them all into one data frame at the end.
Thanks for any responses.
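A minimal sketch of the pattern that avoids this problem: build the complete one-row result inside the loop body and return it, rather than indexing into a data frame created outside the loop (the columns here are hypothetical, since the code that builds out18 is not shown):
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

combinedOut <- foreach(i = 1:48, .combine=rbind) %dopar% {
  # Build and return the full row; writes into a data frame created
  # outside the loop are not visible across worker processes.
  data.frame(id=i, x1=rnorm(1), x2=runif(1))
}

stopCluster(cl)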

Getting the index of an iterator in R (in parallel with foreach)

I'm using the foreach function to iterate over columns of a data.frame. At each iteration, I would like to get the index of the iterator (i.e. the index or the name of the column considered) and the column itself.
However, the following code, which seems fine at first glance, doesn't work because i has no names or colnames attributes.
foreach(i=iter(base[1:N],by='col')) %dopar% c(colnames(i),i)
Now, if you wonder why I'm not iterating over indexes, the reason is that I'm using %dopar% and I don't want to send the whole of base to all workers, but only the columns each of them requires.
Question: how can I get the index of an iterator?
Thank you
I would just specify a second iteration variable in the foreach loop that acts as a counter:
library(foreach)
library(iterators)
df <- data.frame(a=1:10, b=rnorm(10), c=runif(10))
r <- foreach(d=df, i=icount()) %do% {
  list(d=d, i=i)
}
The "icount" function from the iterators package will return an unbounded counting iterator if no arguments are used, so this example works regardless of the number of columns in the data frame.
You could also include the column name as a third iteration variable:
r <- foreach(d=df, i=icount(), nm=colnames(df)) %do% {
  list(d=d, i=i, nm=nm)
}
Here are a few possibilities:
Modify the iter function (or write your own) so that instead of sending just the value of the column, it includes the name or other information.
You could iterate over the indexes but use a shared memory tool (such as the Rdsm package), so that each process only needs to grab the part of the data frame it needs rather than receiving the entire data frame.
You could convert your base data frame into a list where each element contains the corresponding column of base along with the column name, then iterate over that list, so that the entire element is sent, but not the other elements (see the sketch below).
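A hypothetical sketch of that last option, assuming base is the data frame from the question (the loop body just echoes the name and the column, as in the original attempt):
# Pair each column of 'base' with its name, then iterate over the list
# so each worker receives only one element.
cols <- Map(function(col, nm) list(col=col, nm=nm), base, names(base))
r <- foreach(el=cols) %dopar% {
  c(el$nm, el$col)
}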

Saving many subsets as dataframes using "for"-loops

This question might be very simple, but I cannot find a good way to solve it:
I have a dataset with many subgroups which need to be analysed both all together and on their own. Therefore, I want to create subsets for the groups and use them for the later analysis. Both the definition of the subsets and the analysis should be partly done with loops, in order to save space and to ensure that the same analysis is done on all subgroups.
Here is an example of my code using an example dataframe from the boot package:
library(boot)
data(aids)
qlist <- c("1","2","3","4")
for (i in length(qlist)) {
  paste("aids.sub.", qlist[i], sep="") <- subset(aids, quarter==qlist[i])
}
The variable which contains the subgroups in my dataset is stored as a string; that's why I added the qlist part, which would not be required otherwise.
Make a list of the subsets with lapply:
lapply(qlist, function(x) subset(aids, quarter==x))
Equivalently, avoiding the subset():
lapply(qlist, function(x) aids[aids$quarter==x,])
It is likely the case that using a list will make the subsequent code easier to write and understand. You can subset the list to get a single data frame (just as you could use one of the individually named subsets created below), but you can also iterate over it (using for or lapply) without having to construct variable names.
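For instance, a small sketch of iterating over the list, with nrow() standing in for whatever analysis is actually run on each subgroup:
subsets <- lapply(qlist, function(x) aids[aids$quarter == x, ])
names(subsets) <- paste0("aids.sub.", qlist)

# The same operation runs over every subgroup without constructing
# variable names; nrow() is a placeholder for the real analysis.
sapply(subsets, nrow)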
To do the job as you are asking, use assign:
for (i in qlist) {
  assign(paste("aids.sub.", i, sep=""), subset(aids, quarter==i))
}
Note the removal of the length() function, and that this is iterating directly over qlist.

In R, is there danger of communication between foreach loops (doSNOW) when using assignments to store intermediate output?

I want to create a function that uses assignments to store intermediate output (p). This intermediate output is used in statements below it. I want everything to be parallelized using doSNOW and foreach, and I do NOT want that intermediate output to be communicated between iterations of the foreach loop. I don't want to store the intermediate output in a list (e.g. p[[i]]), because then I would have to change a huge amount of code.
Question 1: Is there any danger that another iteration of the foreach loop will use the intermediate output (p)?
Question 2: If yes, when would there be danger of that happening and how to prevent it?
Here is an example of what I mean:
install.packages('foreach')
library('foreach')
install.packages('doSNOW')
library('doSNOW')

NbrCores <- 4
cl <- makeCluster(NbrCores)
registerDoSNOW(cl)

test <- function(value){
  foreach(i=1:500) %dopar% {
    # some statement based on parameter 'value'
    p <- value
    # some statement that uses p
    v <- p
    # other statements
  }
}
test(value=1)
Each of the nodes used in parallel computation runs in its own R process, I believe. Therefore there is no risk of a variable assigned on one node influencing the results on another. In general, it is possible to communicate between processes, but foreach only iterates over the sequence it is given, executing each item independently on one of the nodes.
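A small experiment consistent with that answer, reusing the cluster registered in the question's code (each task records its process ID next to its own p, so you can check that no value leaks between iterations):
res <- foreach(i=1:8, .combine=rbind) %dopar% {
  p <- i * 10              # intermediate output, local to this task
  c(pid=Sys.getpid(), p=p)
}
res  # each row's p matches its own i, regardless of which worker ran it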
