Getting the index of an iterator in R (in parallel with foreach)

I'm using the foreach function to iterate over the columns of a data.frame. At each iteration, I would like to get the index of the iterator (i.e. the index or the name of the column being considered) as well as the column itself.
However, the following code, which seems fine at first glance, doesn't work because i has no names or colnames attribute:
foreach(i=iter(base[1:N],by='col')) %dopar% c(colnames(i),i)
Now, if you wonder why I'm not iterating over indexes, the reason is that I'm using %dopar% and I don't want to send the whole of base to all the workers, only the columns each of them requires.
Question: how can I get the index of an iterator?
Thank you

I would just specify a second iteration variable in the foreach loop that acts as a counter:
library(foreach)
library(iterators)
df <- data.frame(a=1:10, b=rnorm(10), c=runif(10))
r <- foreach(d=df, i=icount()) %do% {
  list(d=d, i=i)
}
The "icount" function from the iterators package will return an unbounded counting iterator if no arguments are used, so this example works regardless of the number of columns in the data frame.
You could also include the column name as a third iteration variable:
r <- foreach(d=df, i=icount(), nm=colnames(df)) %do% {
  list(d=d, i=i, nm=nm)
}
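If the goal is to run this with %dopar%, as in the original question, the same pattern works once a parallel backend is registered. A minimal sketch, assuming the doParallel backend and two local cores:
library(doParallel)
registerDoParallel(cores = 2)  # assumption: 2 local cores; adjust to your machine
r <- foreach(d = df, i = icount(), nm = colnames(df)) %dopar% {
  list(d = d, i = i, nm = nm)
}
foreach stops at the shortest iterator, so the finite colnames(df) vector also bounds the unbounded icount().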

Here are a few possibilities:
Modify the iter function (or write your own) so that instead of sending just the value of the column, it also includes the column name or other information.
You could iterate over the indexes, but use a shared memory tool (such as the Rdsm package) so that each process only needs to grab the part of the data frame it needs rather than distributing the entire data frame.
You could convert your base data frame into a list in which each element contains the corresponding column of base along with its column name, then iterate over that list (so the entire element is sent, but not the other elements), as sketched below.
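A minimal sketch of that last option, assuming base is your data frame (the cols helper list is made up for illustration):
cols <- lapply(seq_along(base), function(j) list(name = colnames(base)[j], value = base[[j]]))
foreach(x = cols) %dopar% {
  c(x$name, x$value)  # each worker receives only one (name, column) pair
}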

Related

Lapply on a list of a list

This is all about some code in R.
I have separated a big data file, "All_data.csv", into smaller data frames per individual in a particular year.
So the list looks like this:
All_individuals is a list of 25 individuals. If you take the first element of that list you get:
Individual 1: dataframe_year1, dataframe_year2.
If you take the second element you get, for instance:
Individual 2: dataframe_year1, dataframe_year2, dataframe_year3.
etc., so the lists within the list differ in length.
Now I want to apply an analysis function to the data frames; I do not need to store the output in the list again per se.
I solved it by calling lapply on the list All_data with a function I defined myself, which in turn calls lapply again and then my analysis function. But I was wondering whether there is another way, because this seems a bit inefficient.
split <- function(All_data) {
  # function that splits the file by date and individual
  # returns a list of individuals; within that list is another list of
  # data frames per year, called All_individuals
}
Make_analysis <- function(All_individuals) {
  Listfiles <- split(All_individuals)
  HRE <- lapply(Listfiles, Analysis)
}
Analysis <- function(files) {
  ...
}
Function call:
lapply(All_data, Make_analysis)
Could anyone help?
Also, is this the best way to go if I want to parallelise the analysis with rslurm to run it on an HPC? In that case I could replace lapply with slurm_map, right?
My function works as it is, but it seems very inefficient. I would like some tips on how to make the code more efficient, and also on how to parallelise it with rslurm.
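One way to avoid the nested lapply is to flatten the nested list once, then run a single lapply (or, on the cluster, rslurm::slurm_map) over the flat list. A hedged sketch, assuming All_individuals is the nested list produced by the splitting step and Analysis is your analysis function:
flat <- unlist(All_individuals, recursive = FALSE)  # one element per individual-year data frame
results <- lapply(flat, Analysis)                   # serial version
# on a Slurm cluster (untested sketch):
# library(rslurm)
# sjob <- slurm_map(flat, Analysis, jobname = "analysis", nodes = 2, cpus_per_node = 4)
# results <- get_slurm_out(sjob, outtype = "raw")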

Performing column select over multiple dataframes

I have looked around a lot for an answer to this; the ones I found get close, but no cigar. I am trying to perform a selection of columns over multiple data frames. I can do this and return a list, but I wish to preserve the data frames in the global environment. I want to keep the data frames separate for ease of use and visibility in RStudio. For example, I am selecting columns based on their names like so, for one data frame:
E07 <- E07[,c("Block","Name","F635.Mean","F532.Mean","B635.Mean","B532")]
I have a number of data frames listed in dflist, so I have written this function:
columnselect<-function(df){df[,c("Block","Name","F635.Mean","F532.Mean","B635.Mean","B532")];df}
I then wish to apply this over the dflist as so:
lapply(X=dflist,FUN=columnselect)
This applies the function over dflist, but the data frames remain unchanged. How do I apply the function over multiple data frames without returning them in a list?
Many thanks
M
Your function returns the data frames unchanged because the unmodified df is the last thing evaluated in your function. Instead of:
columnselect <- function(df){
  df[,c("Block","Name","F635.Mean","F532.Mean","B635.Mean","B532")]
  df
}
It should be:
columnselect <- function(df){
  df[,c("Block","Name","F635.Mean","F532.Mean","B635.Mean","B532")]
}
Having df as the last expression in your function simply returned the full df that you passed into the function.
As for your second point, that you would like to have the data.frames in the global environment rather than in the list (which is bad practice, just so you know; it is usually better to keep them in the list), you need the list2env function, i.e.:
mylist <- lapply(X=dflist,FUN=columnselect)
list2env(mylist, envir = globalenv())
Using this, the data.frames in the global environment will be updated.
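A minimal sketch of the whole flow; note that list2env needs a named list, so build dflist with names (E07 and E08 here are just placeholder data frame names):
dflist <- list(E07 = E07, E08 = E08)              # the names become the variable names again
mylist <- lapply(X = dflist, FUN = columnselect)
list2env(mylist, envir = globalenv())             # overwrites E07 and E08 in the global environment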

R: Define data frames in a similar space

Is there a way in R to define data frames in a similar space?
So let's say I have an unknown number of data frames to be created (say there will be n data frames).
I want to define a space as such:
space<-data.frame.space()
for(i in 1:n) (
  space[i] <- some.func(var1, var2)
)
where some.func creates certain data.frames (in this case it downloads information from the internet), and then I get to call these data frames by saying
space[1] #or
space[2]
#etc
I know people somehow use environments for this, and in functions I see something of the sort. I just don't know how they do that.
I think you just want a simple list:
space <- list()
for(i in 1:n) {
  space[[i]] <- some.func(var1, var2)
}
and then
space[[1]]
space[[2]]
Note the double bracket indexing. Using double brackets will return the data.frame. Using single brackets will return a list containing the data.frame.
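As a side note, the same thing can be written without the explicit loop, and the elements can be given names so they can be retrieved by name rather than by position. A small sketch (the element names are made up for illustration):
space <- lapply(1:n, function(i) some.func(var1, var2))
names(space) <- paste0("frame", 1:n)  # optional: name the elements
space[["frame1"]]                     # same as space[[1]]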

Saving many subsets as dataframes using "for"-loops

This question might be very simple, but I have not found a good way to solve it:
I have a dataset with many subgroups which need to be analysed both all together and on their own. Therefore, I want to create subsets for the groups and use them in the later analysis. Both the definition of the subsets and the analysis should partly be done with loops, in order to save space and to ensure that the same analysis is done for all subgroups.
Here is an example of my code using an example dataframe from the boot package:
data(aids)
qlist <- c("1","2","3","4")
for (i in length(qlist)) {
paste("aids.sub.",qlist[i],sep="") <- subset(aids, quarter==qlist[i])
}
The variable which contains the subgroups in my dataset is stored as a string; therefore I added the qlist part, which would not be required otherwise.
Make a list of the subsets with lapply:
lapply(qlist, function(x) subset(aids, quarter==x))
Equivalently, avoiding subset():
lapply(qlist, function(x) aids[aids$quarter==x,])
It is likely the case that using a list will make the subsequent code easier to write and understand. You can subset the list to get a single data frame (just as you can use one of the subsets, as created below). But you can also iterate over it (using for or lapply) without having to construct variable names.
To do the job as you are asking, use assign:
for (i in qlist) {
  assign(paste("aids.sub.", i, sep=""), subset(aids, quarter==i))
}
Note the removal of the length() function, and that this is iterating directly over qlist.
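As an aside, base R's split() does this kind of grouping in one call and returns a named list; a small sketch using the same aids data from the boot package:
library(boot)
data(aids)
aids.list <- split(aids, aids$quarter)  # one data frame per quarter, named "1", "2", ...
aids.list[["1"]]                        # the subset for quarter 1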

Directly assign results of doMC (foreach) to data frame

Let's say I have this example code:
kkk <- data.frame(m.mean=1:1000, m.sd=1:1000/20)
kkk[,3:502] <- NA
for (i in 1:nrow(kkk)){
  kkk[i,3:502] <- rnorm(n=500, mean=kkk[i,1], sd=kkk[i,2])
}
I would like to convert this to run in parallel with doMC. My problem is that foreach returns a list, whereas I need the result of each iteration to be a vector that can then be transferred into the data frame (which will later be exported as CSV for further processing).
Any ideas?
You don't need a loop for this, and putting a large matrix of numbers into a data frame only to treat it as a matrix is inefficient (although you may need to create a data frame at the end, after doing all your math, in order to write to a CSV file).
m.mean <- 1:1000
m.sd <- 1:1000/20
num.columns <- 500
x <- matrix(nrow=length(m.mean), ncol=num.columns,
            data=rnorm(n=length(m.mean) * num.columns))
x <- x * cbind(m.sd)[, rep(1, num.columns)] + cbind(m.mean)[, rep(1, num.columns)]
kkk <- data.frame(m.mean=m.mean, m.sd=m.sd, unname(x))
write.csv(kkk, "kkk.txt")
To answer your original question about directly assigning results to an existing data structure from a foreach loop: that is not possible. The foreach package's parallel backends are designed to perform each computation in a separate R process, so each one has to return a separate object to the parent process, which collects them with the .combine function provided to foreach. You could write a parallel foreach loop that assigns directly to the kkk variable, but it would have no effect, because each assignment would happen in the separate processes and would not be shared with the main process.
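If you do want the foreach version, have each iteration return its row vector, combine them with rbind, and assign the combined matrix into the data frame in the main process. A hedged sketch with the doMC backend:
library(doMC)
registerDoMC(cores = 2)  # assumption: 2 cores; adjust as needed
res <- foreach(i = 1:nrow(kkk), .combine = rbind) %dopar% {
  rnorm(n = 500, mean = kkk[i, 1], sd = kkk[i, 2])
}
kkk[, 3:502] <- res  # the combined matrix is assigned once, in the parent process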
