R: how to split dataframe in foreach %dopar%

This is a very simple example.
df = c("already ","miss you","haters","she's cool")
df = data.frame(df)
library(doParallel)
cl = makeCluster(4)
registerDoParallel(cl)
foreach(i = df[1:4,1], .combine = rbind, .packages='tm') %dopar% classification(i)
stopCluster(cl)
In the real case I have a dataframe with n = 400,000 rows.
I don't know how to send nrow/ncluster rows to each cluster in one step, i.e. what i should be.
I tried isplitRows from library(itertools) without success.

You should try to work with indices to create subsets of your data.
foreach(i = 1:nrow(df), .combine = rbind, .packages = 'tm') %dopar% {
  tmp <- df[i, ]
  classification(tmp)
}
This will take a new row of the data.frame each iteration.
Furthermore, note that a foreach loop returns its result rather than modifying variables in place, so you should assign it like this:
res <- foreach(i = 1:10, .combine = c, ....) %dopar% {
  # things you want to do
  x <- someFancyFunction()
  # the last value will be returned and combined by the .combine function
  x
}
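With 400,000 rows, scheduling one task per row adds a lot of overhead. A chunked variant along these lines may help (a sketch, assuming classification() takes one text at a time, as in the question):
n <- 4  # number of workers
chunks <- split(1:nrow(df), cut(1:nrow(df), n, labels = FALSE))  # ~equal index chunks
res <- foreach(idx = chunks, .combine = rbind, .packages = 'tm') %dopar% {
  # each worker processes roughly nrow(df)/n rows and rbinds its own results
  do.call(rbind, lapply(df[idx, 1], classification))
}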

Try using a combination of split and mclapply as proposed in Approach 1 here: https://www.r-bloggers.com/trying-to-reduce-the-memory-overhead-when-using-mclapply/
split lets you split data into groups defined by a factor, or you can just use 1:nrow(df) if you want to do the operation on each row separately.
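A rough sketch of that approach (assuming classification() can handle a block of rows; mclapply relies on forking, so it works on Linux/macOS but not on Windows):
library(parallel)
n <- 4
groups <- cut(seq_len(nrow(df)), n, labels = FALSE)  # contiguous row groups
res_list <- mclapply(split(df, groups), classification, mc.cores = n)
res <- do.call(rbind, res_list)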

My solution after your comments:
n = 8  # number of clusters
library(foreach)
library(doParallel)
cl = makeCluster(n)
registerDoParallel(cl)
z = nrow(df)       # total rows
y = floor(z/n)     # rows per chunk
x = nrow(df) %% n  # leftover rows
ris = foreach(i = split(df[1:(z-x), ], rep(1:n, each = y)),
              .combine = rbind, .packages = 'tm') %dopar% someFancyFunction(i)
stopCluster(cl)
# sequential
if (x != 0)
  ris = rbind(ris, someFancyFunction(df[(z-x+1):z, 1]))
Note:
I run the leftover rows sequentially at the end because, if x is not zero, split would put the remaining x rows into the first group and change the order of the result.
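For completeness, the isplitRows iterator mentioned in the question handles this chunking, remainder included, so the sequential step is not needed; a sketch, assuming someFancyFunction() accepts a block of rows:
library(itertools)
ris <- foreach(chunk = isplitRows(df, chunks = n),
               .combine = rbind, .packages = 'tm') %dopar%
  someFancyFunction(chunk)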

Related

R converting nested for loops: the nested parallel foreach doesn't work

I have a table with ~1M rows (each line is an insurance contract, i.e. one client can have multiple contracts) and columns client_id, names and adresses.
The problem I am trying to solve is that the same client can have a different client_id for each new contract.
To resolve this I have done the following:
Create a New_ID as a 4th column in the table
Iterate twice over names and calculate name similarity for each combination
Iterate twice over adresses and calculate address similarity for each combination
Inside each iteration: if name_similarity > 0.9 & adresses_similarity > 0.85 then New_ID takes the value of j
Used packages + fake data:
library(tidyverse)
library(stringdist) # strings' similarities
library(parallel) # parallel programming
library(foreach) # parallel programming
library(doParallel) # parallel programming
library(doSNOW) # parallel programming
# Fake data
client_id <- 1:6
names <- c("Name", "Naaame", "Name", "Namee", "Nammee", "Nammee")
adresses <- c("Adress", "Adressss", "Adress", "Adresss", "Aadressss", "Aadressss")
A <- data.frame(cbind(client_id, names, adresses)) %>%
  mutate(New_ID = NA)
Script
The nested for loops below work well:
for(i in seq_along(A$client_id)){
  for(j in seq_along(A$client_id)){
    # calculate names similarities
    name_similarity <- stringdist::stringsim(A$names[i], A$names[j],
                                             method = "osa", useBytes = TRUE)
    # calculate adresses similarities
    adresses_similarity <- stringdist::stringsim(A$adresses[i], A$adresses[j],
                                                 method = "qgram", useBytes = TRUE)
    # Decision & New_ID attribution
    if(name_similarity > 0.9) {
      if(adresses_similarity > 0.85){
        A[i, 4] = j  # New ID
      }
    } # decision end
  } # Close j loop
} # Close i loop
Although the script above produces the expected result, it would take days to iterate over the real data (~1M rows). So I thought of parallel programming.
Parallel programming:
I have tried to nest two foreach loops using the operator %:% and run them in parallel with the %dopar% operator of the doParallel package.
cl <- makeCluster(detectCores()) # initiate clusters (I have 8 cores on my local machine)
registerDoSNOW(cl) # relate foreach to a parallel mechanism from {parallel}
clusterExport(cl, list("A")) # export data to clusters
clusterEvalQ(cl, c(library(tidyverse),
                   library(stringdist))) # load used packages on child clusters
foreach(i = seq_along(A$client_id)) %:%
  foreach(j = seq_along(A$client_id)) %dopar% {
    # calculate names similarities
    name_similarity <- stringdist::stringsim(A$names[i], A$names[j],
                                             method = "osa", useBytes = TRUE)
    # calculate adresses similarities
    adresses_similarity <- stringdist::stringsim(A$adresses[i], A$adresses[j],
                                                 method = "qgram", useBytes = TRUE)
    # Decision & New_ID attribution
    if(name_similarity > 0.9) {
      if(adresses_similarity > 0.85){
        A[i, 4] = j  # New ID
      }
    } # decision end
  }
stopCluster(cl)
However, after running the parallel nested foreach loops, the New_ID column is still empty.
I've tried to unlist() the result, since the foreach loop returns its values in a list, but that doesn't work either.
How can I write the nested parallel foreach to obtain the same result as with the nested for loops? Thanks
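One note on why the column stays empty: each %dopar% worker modifies its own copy of A, so assignments to A[i, 4] never reach the master session; the loop has to return the matches instead. A minimal sketch of that pattern (keeping the question's thresholds and methods):
A$New_ID <- foreach(i = seq_along(A$client_id), .combine = c,
                    .packages = "stringdist") %dopar% {
  match <- NA
  for (j in seq_along(A$client_id)) {
    if (stringdist::stringsim(A$names[i], A$names[j],
                              method = "osa", useBytes = TRUE) > 0.9 &&
        stringdist::stringsim(A$adresses[i], A$adresses[j],
                              method = "qgram", useBytes = TRUE) > 0.85) {
      match <- j  # like the original loop, the last matching j wins
    }
  }
  match
}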

R loop to create data frames with 2 counters

What I want is to create 60 data frames with 500 rows each. I tried the code below and, while I get no errors, I am not getting the data frames. When I call View on the as.data.frame result I do get the view, but no data frame appears in my environment. I've been trying for three days with various versions of this code:
getDS <- function(x){
  for(i in 1:3){
    for(j in 1:30000){
      ID_i <- data.table(x$ID[j:(j + 500)])
    }
  }
  as.data.frame(ID_i)
}
getDS(DATASETNAME)
We can use outer (on a small example)
out1 <- c(outer(1:3, 1:3, Vectorize(function(i, j) list(x$ID[j:(j + 5)]))))
lapply(out1, as.data.table)
--
The issue in the OP's function is that ID_i is overwritten on each iteration of the loop, i.e. it is not stored. In order to store it, we can initialize a list and then fill it:
getDS <- function(x) {
  ID_i <- vector('list', 3)
  for(i in 1:3) {
    for(j in 1:3) {
      ID_i[[i]][[j]] <- data.table(x$ID[j:(j + 5)])
    }
  }
  ID_i
}
do.call(c, getDS(x))
data
x <- data.table(ID = 1:50)
I'm not sure the description matches the code, so I'm a little unsure what the desired result is. That said, it is usually not helpful to split a data.table because the built-in by-processing makes it unnecessary. If for some reason you do want to split into a list of data.tables you might consider something along the lines of
getDS <- function(x, n = 5, size = nrow(x)/n, column = "ID", reps = 3) {
  x <- x[1:(n*size), ..column]
  index <- rep(1:n, each = size)
  replicate(reps, split(x, index), simplify = FALSE)
}
getDS(data.table(ID = 1:20), n = 5)
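To illustrate the by-processing mentioned above (a sketch with a toy summary in place of the real per-group work):
library(data.table)
x <- data.table(ID = 1:20)
x[, group := rep(1:5, each = 4)]        # tag each row with its group
x[, .(mean_id = mean(ID)), by = group]  # one result row per group, no split needed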

Parallel foreach with my defined function in R

I'd like to parallelize the process below using foreach in R. The process compares the value of column 'a' with the values of columns 'b' and 'c': if 'a' falls between these two values, the flag column is set to TRUE, otherwise FALSE. I have defined a sample dataset and written a function to do the comparison. I understand the function implementation may not be appropriate for parallel processing because different nodes return at different times, so I'd appreciate help with how to modify the function and how to get my desired result by running it in parallel.
id <- 1:10  # id must be defined before the cbind() below
a <- sample(seq(1:30), 10)
b <- sample(seq(20:30), 10)
c <- sample(seq(1:20), 10)
data <- cbind(id, a, b, c)
data <- as.data.frame(data)
setFlag <- function(id, data){
  if(data$a[id] > data$b[id] & data$a[id] < data$c[id]){
    flag <- TRUE
  } else {
    flag <- FALSE
  }
  return(flag)
}
NUM_OF_PROCESSOR <- 6
library(doSNOW)  # provides registerDoSNOW() and SOCK clusters
cluster <- makeCluster(NUM_OF_PROCESSOR, type = "SOCK")
registerDoSNOW(cluster)  # registerDoSNOW() returns nothing, so don't reassign 'cluster'
data$flag <- foreach(i = 1:10, .combine = 'c') %dopar% setFlag(i, data)
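Two things worth noting here. foreach combines results in the order of the iterator, not in the order the workers finish, so varying return times do not affect correctness. And for this particular comparison no loop is needed at all, since comparisons in R are vectorized; a sketch:
# vectorized equivalent of looping setFlag over every row
data$flag <- data$a > data$b & data$a < data$c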

How to store data from for loop inside of for loop? (rolling correlation in r)

require(quantmod)
require(TTR)
iris2 <- iris[1:4]
b <- NULL
for (i in 1:ncol(iris2)){
  for (j in 1:ncol(iris2)){
    a <- runCor(iris2[,i], iris2[,j], n = 21)
    b <- cbind(b, a)
  }
}
I want to calculate a rolling correlation between the different columns of a dataframe and store the results separately by column. Although the code above stores the data in variable b, it just dumps all the results into one object, which is not very useful. What I would like is a separate dataframe for each i.
In this case, as I have 4 columns, I ultimately want 4 dataframes, each containing 4 columns of rolling correlations, i.e. df1 = corr of col 1 vs cols 1,2,3,4; df2 = corr of col 2 vs cols 1,2,3,4; etc.
I thought of using lapply or rollapply, but ran into the same problem.
d <- NULL
for (i in 1:ncol(iris2)){
  for (j in 1:ncol(iris2)){
    c <- rollapply(iris2, 21, function(x) cor(x[,i], x[,j]), by.column = FALSE)
    d <- cbind(d, c)
  }
}
Would really appreciate any inputs.
If you want to keep the expanded loop, how about a list of dataframes?
e <- vector("list", ncol(iris2))  # preallocate one slot per column
for (i in 1:ncol(iris2)) {
  d <- matrix(0, nrow = nrow(iris2), ncol = ncol(iris2))
  for (j in 1:ncol(iris2)) {
    d[,j] <- runCor(iris2[,i], iris2[,j], n = 21)
  }
  e[[i]] <- d
}
It's also a good idea to allocate the amount of space you want with placeholders and put items into that space rather than use rbind or cbind.
Although it is not good practice to create dataframes on the fly in R (you should prefer putting them in a list, as in the other answer), the way to do so is to use the assign and get functions.
for (i in 1:ncol(iris2)) {
  d <- NULL
  for (j in 1:ncol(iris2)){
    d <- cbind(d, runCor(iris2[,i], iris2[,j], n = 21))
  }
  # Assign the accumulated columns to the name df1, df2, ...
  assign(paste0("df", i), as.data.frame(d))
}
# to have access to the dataframe:
get("df1")
# or inside a loop
get(paste0("df", i))
Since you stated your computation was slow, I wanted to provide you with a parallel solution. If you have a modern computer, it probably has 2 cores, if not 4 (or more!). You can easily check this via:
require(parallel) # for parallelization
detectCores()
Now the code:
require(quantmod)
require(TTR)
iris2 <- iris[,1:4]
Parallelization requires the functions and variables to be placed in an environment that is created and destroyed with each process. That means a wrapper function must be created to define those variables and functions.
wrapper <- function(data, n) {
  # variables placed into the environment
  force(data)
  force(n)
  # functions placed into the environment
  # same inner loop as written in the earlier answer
  runcor <- function(data, n, i) {
    d <- matrix(0, nrow = nrow(data), ncol = ncol(data))
    for (j in 1:ncol(data)) {
      d[,j] <- TTR::runCor(data[,i], data[,j], n = n)
    }
    return(d)
  }
  # call function to loop over iterator i
  worker <- function(i) {
    runcor(data, n, i)
  }
  return(worker)
}
Now create a cluster on your local computer. This allows the multiple cores to run separately.
parallelcluster <- makeCluster(parallel::detectCores())
models <- parallel::parLapply(parallelcluster, 1:ncol(iris2),
wrapper(data = iris2, n = 21))
stopCluster(parallelcluster)
Stop and close the cluster when finished.
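models is then an unnamed list with one correlation matrix per column of iris2; if the goal is the per-column data frames described in the question, a final conversion along these lines should do:
dfs <- lapply(models, as.data.frame)       # one data frame per column of iris2
names(dfs) <- paste0("df", seq_along(dfs))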

how to return a list of tables parallel processing in R

I'm trying to see how I can return a list of four tables via my parallel loop. Currently, I have a function calculatef1(k,i) which returns a vector of four elements c(score1,score2,score3,score4).
aggregate_scores <-
  foreach(k = 1:num_metrics, .combine = 'cbind') %:%
  foreach(i = 1:10, .combine = 'c') %dopar% {
    x <- c("boot", "dtw", "sqldf")
    lapply(x, require, character.only = TRUE)
    f1 <- NA
    try(f1 <- calculatef1(k, i), silent = TRUE)
    f1
  }
Right now, this parallel loop returns a table of 40 rows and num_metrics columns, but how would I change my foreach loops so I return a list of 4 tables, each corresponding to a different score, instead (or at least something similar)?
Thanks!
*For reproducibility, use this for the calculatef1 function...
calculatef1 <- function(k, i){
  score1 <- k + i
  score2 <- k - i
  score3 <- k * i
  score4 <- k ^ i
  c(score1, score2, score3, score4)
}
I would like to return a list of four data frames where, for example, the 3rd data frame has score3 for the following pairs of k,i.
1*1=1 2*1=2 3*1=3 ...
1*2=2 2*2=4 3*2=6 ...
1*3=3 2*3=6 3*3=9 ...
. . .
. . .
And similarly the 4th data frame would have 1^1, 2^1, 3^1, 1^2, 2^2, and etc. You get the idea.
Of course the trick is to create the appropriate combine functions. You could combine the vectors from the inner foreach loop with rbind so it returns matrices, which are then processed by the outer foreach loop. The combine function for the outer foreach loop could be:
comb <- function(x, ...) {
  lapply(1:4, function(i) c(x[[i]], lapply(list(...), function(y) y[,i])))
}
This returns a list of four lists, which can be converted into a list of four data frames using a "final" function:
final <- function(x) {
  lapply(x, function(y) {
    attr(y, 'names') <- paste('X', seq_along(y), sep = '.')
    attr(y, 'row.names') <- .set_row_names(length(y[[1]]))
    class(y) <- 'data.frame'
    y
  })
}
You can test these two functions using:
x <- list(list(), list(), list(), list())
x <- comb(x, matrix(1, 10, 4), matrix(2, 10, 4))
x <- comb(x, matrix(3, 10, 4), matrix(4, 10, 4))
print(final(x))
Finally, you can test them in an actual foreach loop:
x <- list(list(), list(), list(), list())
aggregate_scores <-
  foreach(k = 1:6, .init = x, .final = final,
          .combine = comb, .multicombine = TRUE) %:%
  foreach(i = 1:10, .combine = 'rbind') %dopar% {
    c(k + i, k - i, k * i, k ^ i)
  }
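A quick sanity check of the result: with k = 1:6 and i = 1:10, each of the four data frames should have 10 rows and 6 columns, the third holding k*i and the fourth k^i.
sapply(aggregate_scores, dim)       # every element: 10 rows, 6 columns
aggregate_scores[[3]][1:3, 1:3]     # upper-left corner of the k*i table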
