Losing data frame cells in foreach loop - r

Similar questions have been posted, but I can't find one that actually addresses the problem I'm having, so sorry if this is not distinct enough.
I'm running a for loop in parallel using doParallel and foreach. The core of my code is:
combinedOut <- foreach(i = 1:48, .combine=rbind) %dopar%
{
##function that builds a data frame row with 6 columns, adding different columns separately
##data frame is called out18
out18[i,]
}
When I run this as a for loop, my output (out18) is correct.
However, when I run it as a foreach, only the first and last columns contain the right values (referring to combinedOut here). I have no idea why it's only the middle four columns that are empty.
Essentially I want to copy the entire ith row of every foreach iteration and combine them all into one data frame at the end.
Thanks for any responses.
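One pattern that avoids depending on a data frame that lives outside the loop is to build each row as a fresh one-row data frame inside the foreach body and return it, letting .combine = rbind stack the rows. A minimal sketch, assuming a hypothetical helper build_row() with made-up column names in place of the real six-column logic, and using %do% so it runs without a registered backend (with doParallel registered, %dopar% takes the same body):

```r
library(foreach)

# Hypothetical stand-in for the real row-building logic (6 columns).
build_row <- function(i) {
  data.frame(id = i, a = i * 2, b = i^2, c = sqrt(i), d = i %% 3,
             label = paste0("row", i), stringsAsFactors = FALSE)
}

combinedOut <- foreach(i = 1:48, .combine = rbind) %do% {
  # Everything the row needs is computed inside the iteration,
  # and the finished row is the value of the block.
  build_row(i)
}

dim(combinedOut)
```

Because each iteration returns a complete row, nothing relies on modifications to a shared out18 being visible across iterations.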

Related

Looping through list of dataframes in R

I am relatively new to R, and have the following problem:
I have a list of data frames in R which was generated with lapply and cbind (code below, it works fine):
res<-lapply(1:35,function(i){cbind(df1[i],df2[i],df3[i])})
This generates a list of 35 data frames, each with 72 rows and 3 columns (shown as list[72*3] (S3: data.frame)).
Next, I want to save each of these data frames under a separate name. The names would be specific dates retrieved from an already stored list. Below is the code for it:
for (i in 1:length(res)) {
a<-res[[i]]
for (j in as.list(Date.table)){
newname<-paste(j)
d<-data.frame(a)
names(d)<-c("RIC","MV","BVMV")
assign(newname,d)
}
}
While 35 data frames are generated with different dates as names, the data in all of them is the same, i.e. that of the last data frame.
Could somebody please point out the error in the code? It is not saving each data frame, only the last one.
Many thanks!!!
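In the loop above, the inner loop runs over every date for each i, so on every pass all the date-named variables are overwritten with the current data frame, and after the last pass they all hold res[[35]]. One way to avoid this is to pair the i-th data frame with the i-th date and keep the result in a named list. A minimal sketch, assuming small made-up stand-ins for df1, df2 and df3 and a hypothetical character vector dates in place of Date.table:

```r
set.seed(1)
# Made-up stand-ins with 4 columns each (assumption for illustration).
df1 <- as.data.frame(matrix(1:12, nrow = 3))
df2 <- as.data.frame(matrix(rnorm(12), nrow = 3))
df3 <- as.data.frame(matrix(runif(12), nrow = 3))
dates <- c("2015-01-31", "2015-02-28", "2015-03-31", "2015-04-30")  # hypothetical

res <- lapply(seq_along(dates), function(i) {
  d <- data.frame(df1[[i]], df2[[i]], df3[[i]])
  names(d) <- c("RIC", "MV", "BVMV")
  d
})
names(res) <- dates   # the i-th data frame gets the i-th date

res[["2015-02-28"]]
```

Keeping the data frames in a named list also removes the need for assign, since each one can be looked up by its date.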

How to follow a variable value during runtime?

I have a large list of data frames, dataset, that I am trying to combine into one data frame. The list is large and the operation takes a long time, so I would like to monitor its progress to know whether I should keep waiting. How can I check which index do.call is currently working on during runtime?
dataset <- do.call(rbind, dataset)
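do.call(rbind, dataset) makes a single call to rbind with every list element as an argument, so there is no per-element index to watch from R code. If visibility matters more than the one-liner, one option is to bind the list in chunks and report progress between chunks. A minimal sketch, assuming a made-up list of small data frames and an arbitrary chunk size:

```r
# Made-up input: 1000 small data frames (assumption for illustration).
dataset <- lapply(1:1000, function(i) data.frame(id = i, x = i / 2))

chunk_size <- 100   # arbitrary choice
starts <- seq(1, length(dataset), by = chunk_size)
pieces <- vector("list", length(starts))

for (k in seq_along(starts)) {
  idx <- starts[k]:min(starts[k] + chunk_size - 1, length(dataset))
  pieces[[k]] <- do.call(rbind, dataset[idx])
  message("bound elements up to ", max(idx), " of ", length(dataset))
}

combined <- do.call(rbind, pieces)
```

The final do.call only has to combine one data frame per chunk, and the messages show how far the loop has got.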

Getting the index of an iterator in R (in parallel with foreach)

I'm using the foreach function to iterate over columns of a data.frame. At each iteration, I would like to get the index of the iterator (i.e. the index or the name of the column considered) and the column itself.
However, the following code, which seems fine at first glance, doesn't work because i has no names or colnames attributes.
foreach(i=iter(base[1:N],by='col')) %dopar% c(colnames(i),i)
Now, if you wonder why I'm not iterating over indexes, the reason is that I'm using %dopar% and I don't want to send the whole base to all workers, only the columns each of them requires.
Question: How can I get the index of an iterator?
Thank you
I would just specify a second iteration variable in the foreach loop that acts as a counter:
library(foreach)
library(iterators)
df <- data.frame(a=1:10, b=rnorm(10), c=runif(10))
r <- foreach(d=df, i=icount()) %do% {
list(d=d, i=i)
}
The "icount" function from the iterators package will return an unbounded counting iterator if no arguments are used, so this example works regardless of the number of columns in the data frame.
You could also include the column name as a third iteration variable:
r <- foreach(d=df, i=icount(), nm=colnames(df)) %do% {
list(d=d, i=i, nm=nm)
}
Here are a couple of possibilities:
Modify the iter function (or write your own) so that instead of sending just the value of the column, it also includes the names or other information.
You could iterate over the indexes, but use a shared memory tool (such as the Rdsm package) so that each process only needs to grab the part of the data frame it needs rather than distributing the entire data frame.
You could convert your base data frame into a list where each element contains the corresponding column of base along with the column name, then iterate over that list (so the entire element is sent, but not the other elements).
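The last suggestion can be sketched as follows, assuming a small made-up data frame named base and using %do% so it runs without a registered backend. Each list element carries one column together with its name:

```r
library(foreach)

# Made-up data (assumption for illustration).
base <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5), c = c(10, 20, 30))

# One list element per column, holding the column name and its values.
cols <- Map(function(nm, values) list(name = nm, values = values),
            names(base), base)

r <- foreach(el = cols) %do% {
  # The name travels with the values, so no lookup in base is needed.
  list(name = el$name, total = sum(el$values))
}
```

The sum is only a placeholder for whatever per-column work is actually needed.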

Custom R function returning weird output

So I'm trying to create a list of lists of data frames, basically for the purpose of passing them to multiple cores via mclapply. But that's not the part I'm having trouble with. I wrote a function to create a list of smaller data frames from one large data frame, and then applied it sequentially to break a large data frame down into a list of lists of small data frames. The problem is that when the function is called the second time (via lapply on the first list of data frames), it adds extra small data frames to each list of data frames in the larger list. I have no idea why. I don't think it's the lapply, since when I ran the function manually on one frame from the first list it also didn't work. Here's the code:
create_frame_list<-function(mydata,mystep,elnames){
datalim<-dim(mydata)[1]
mylist<-list()
init<-1
top<-mystep
i<-1
repeat{
if(top < datalim){
mylist[[i]]<-assign(paste(elnames,as.character(i),sep=""),data.frame(mydata[init:top,]))
}
else {
mylist[[i]]<-assign(paste(elnames,as.character(i),sep=""),data.frame(mydata[init:datalim,]))
}
if(top > datalim){break}
i<-i+1
init<-top+1
top<-top+mystep
}
return(mylist)
}
test_data<-data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
#Create the first list of data frames, works fine
master_list<-create_frame_list(test_data,300,"bd")
#check the dimensions of the data frames created, they are correct
lapply(master_list,dim)
#create a list of lists of data frames, doesn't work right
list_list<-lapply(master_list,create_frame_list,50,"children")
#check the dimensions of the data frames in the various lists. The function when called again is making extra data frames of length 2 for no reason I can see
lapply(list_list,lapply,dim)
So that's it. Any help is appreciated as always.
Okay, so your code only has one small bug, but there are definitely better ways of doing this. Your code doesn't work when the number of rows is an exact multiple of step. This has to do with the position of your break. Here is a fix:
create_frame_list<-function(mydata,mystep,elnames){
datalim<-dim(mydata)[1]
mylist<-list()
init<-1
top<-mystep
i<-1
repeat{
if(top < datalim)
# mylist[[i]]<-assign(paste0(elnames,as.character(i)),data.frame(mydata[init:top,]))
mylist[[i]]<-mydata[init:top,]
else
mylist[[i]]<-mydata[init:datalim,]
# if(top > datalim) break
i<-i+1
init<-top+1
top<-top+mystep
if(init > datalim) break
}
return(mylist)
}
The main fix was to move the if and make it reliant on init, and not top.
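The extra two-row data frames come from how the colon operator behaves when the start exceeds the end: it counts down rather than returning an empty range. A short illustration with a made-up 300-row data frame, the case where the row count is an exact multiple of the step:

```r
mydata <- data.frame(x = 1:300)   # made-up data, 300 rows
datalim <- nrow(mydata)
init <- 301                       # one past the last row

init:datalim                      # counts down: 301 300
nrow(mydata[init:datalim, , drop = FALSE])   # 2 rows: an NA row and row 300

# An index built with seq_len() is empty in the same situation
idx <- seq_len(max(0, datalim - init + 1)) + init - 1
length(idx)                       # 0
```

Breaking as soon as init passes datalim, as in the fix, means the loop never reaches that countdown index.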
You'll note that I cleaned up your code and removed the assign statements. One good rule of thumb is: if you think you need to use assign or get, you're doing it wrong. In your case, the assign was completely redundant, and did not assign the names in the way you wanted.
If you're looking for a better way to do this, here is one option:
n<-nrow(test_data)
step<-300
split.var<-rep(1:ceiling(n/step),each=step,length.out=n)
master_list<-split(test_data,split.var)
names(master_list)<-paste0('bd',seq_along(master_list))
# If you didn't care about the order of the rows you could just do
# split(test_data,seq(ceiling(n/step)))
If you want to get fancy, you could do something like:
special.split<-function(data,step)
split(data,rep(1:ceiling(nrow(data)/step),each=step,length.out=nrow(data)))
lapply(special.split(test_data,300),special.split,step=50)
And that would do everything in one step.
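A quick check of the split-based approach, using the same kind of 1000-row test data as in the question, shows the resulting chunk sizes. The names sizes and children are introduced here only for the check:

```r
set.seed(42)
test_data <- data.frame(replicate(10, sample(0:1, 1000, rep = TRUE)))

special.split <- function(data, step) {
  split(data, rep(1:ceiling(nrow(data) / step), each = step,
                  length.out = nrow(data)))
}

# Rows per top-level chunk
sizes <- sapply(special.split(test_data, 300), nrow)

# Number of 50-row pieces inside each top-level chunk
children <- lengths(lapply(special.split(test_data, 300),
                           special.split, step = 50))
```

Here sizes is 300, 300, 300, 100 and children is 6, 6, 6, 2, with no extra two-row pieces when a chunk is an exact multiple of the step.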

Directly assign results of doMC (foreach) to data frame

Let's say I have the example code
kkk<-data.frame(m.mean=1:1000, m.sd=1:1000/20)
kkk[,3:502]<-NA
for (i in 1:nrow(kkk)){
kkk[i,3:502]<-rnorm(n=500, mean=kkk[i,1], sd=kkk[i,2])
}
I would like to convert this loop to run in parallel with doMC. My problem is that foreach returns a list, whereas I need the result of each iteration to be a vector that can then be transferred to the data frame (which will later be exported as CSV for further processing).
Any ideas?
You don't need a loop for this, and putting a large matrix of numbers in a data frame only to treat it as a matrix is inefficient (although you may need to create a data frame at the end, after doing all your math, in order to write to a CSV file).
m.mean <- 1:1000
m.sd <- 1:1000/20
num.columns <- 500
x <- matrix(nrow=length(m.mean), ncol=num.columns,
data=rnorm(n=length(m.mean) * num.columns))
x <- x * cbind(m.sd)[,rep(1,num.columns)] + cbind(m.mean)[,rep(1,num.columns)]
kkk <- data.frame(m.mean=m.mean, m.sd=m.sd, unname(x))
write.csv(kkk, "kkk.txt")
To answer your original question about directly assigning results to an existing data structure from a foreach loop: that is not possible. The foreach package's parallel backends are designed to perform each computation in a separate R process, so each one has to return a separate object to the parent process, which collects them with the .combine function provided to foreach. You could write a parallel foreach loop that assigns directly to the kkk variable, but it would have no effect, because each assignment would happen in a separate process and would not be shared with the main process.
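If a foreach version is still wanted, the usual shape is to return one vector per iteration and let .combine = rbind stack them into a matrix, then build the data frame afterwards. A minimal sketch with deliberately small sizes, using %do% so it runs without a registered backend (with doMC registered, %dopar% takes the same body); the name draws is introduced here for illustration:

```r
library(foreach)

m.mean <- 1:10
m.sd <- 1:10 / 20
num.columns <- 5   # small sizes for illustration

draws <- foreach(i = seq_along(m.mean), .combine = rbind) %do% {
  # Each iteration returns one numeric vector; rbind stacks them as rows.
  rnorm(n = num.columns, mean = m.mean[i], sd = m.sd[i])
}

# Assemble the data frame once, in the main process.
kkk <- data.frame(m.mean = m.mean, m.sd = m.sd, unname(draws))
dim(kkk)
```

The result comes back as a return value rather than through assignment inside the loop, which is what the separate-process design requires.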
