I'd like to parallelize the process below using foreach in R. The process compares the value of column 'a' with the values of columns 'b' and 'c': if 'a' lies between these two values, set the flag column to TRUE, otherwise set it to FALSE. I have defined a sample dataset and written a function to do the comparison. I understand the function implementation may not be appropriate for a parallel process because different nodes return at different times. So I'd appreciate your help on how to modify the function and how to get my desired result by running it in parallel.
id <- 1:10  # row ids ('id' was used below but never defined)
a <- sample(1:30, 10)
b <- sample(20:30, 10)  # note: seq(20:30) would give 1:11, not 20:30
c <- sample(1:20, 10)
data <- cbind(id, a, b, c)
data <- as.data.frame(data)
setFlag <- function(id, data){
  # TRUE when 'a' lies strictly between 'b' and 'c' for this row
  if(data$a[id] > data$b[id] & data$a[id] < data$c[id]){
    flag <- TRUE
  }else{
    flag <- FALSE
  }
  return(flag)
}
library(doSNOW)  # also loads foreach and snow
NUM_OF_PROCESSOR <- 6
cluster <- makeCluster(NUM_OF_PROCESSOR, type = "SOCK")
registerDoSNOW(cluster)  # registers the backend; returns NULL, so don't assign it
data$flag <- foreach(i = 1:10, .combine = 'c') %dopar% setFlag(i, data)
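For what it's worth, foreach combines results in iteration order by default (.inorder = TRUE), so differing return times across nodes don't scramble the output. And for an elementwise comparison like this one, a vectorized one-liner is a simpler alternative (a minimal sketch using the sample data above):

# The row-by-row comparison is fully vectorized, so no loop or cluster
# is needed for this particular operation:
data$flag <- data$a > data$b & data$a < data$c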
I have a table with ~1M entries (each line is an insurance contract, i.e. one client can have multiple contracts) and columns client_id, names and adresses.
The problem I am trying to solve is that the same client can have a different client_id for each new contract.
To resolve this I have done the following:
Creating a New_ID as a 4th column in the table
Iterating over all pairs of names and calculating the name similarity for each combination
Iterating over all pairs of adresses and calculating the address similarity for each combination
Inside each iteration: if name_similarity > 0.9 & adresses_similarity > 0.85 then New_ID takes the value of j
Used packages + fake data:
library(tidyverse)
library(stringdist) # strings' similarities
library(parallel) # parallel programming
library(foreach) # parallel programming
library(doParallel) # parallel programming
library(doSNOW) # parallel programming
# Fake data
client_id <- 1:6
names <- c("Name", "Naaame", "Name", "Namee", "Nammee", "Nammee")
adresses <- c("Adress", "Adressss", "Adress", "Adresss", "Aadressss", "Aadressss")
A <- data.frame(cbind(client_id, names, adresses)) %>%
  mutate(New_ID = NA)
Script
The nested for loops below work well:
for(i in seq_along(A$client_id)){
  for(j in seq_along(A$client_id)){
    # calculate name similarities
    name_similarity <- stringdist::stringsim(A$names[i],
                                             A$names[j],
                                             method = "osa",
                                             useBytes = TRUE)
    # calculate address similarities
    adresses_similarity <- stringdist::stringsim(A$adresses[i],
                                                 A$adresses[j],
                                                 method = "qgram",
                                                 useBytes = TRUE)
    # Decision & New_ID attribution
    if(name_similarity > 0.9) {
      if(adresses_similarity > 0.85){
        A[i, 4] <- j # New ID
      }
    } # decision end
  } # Close j loop
} # Close i loop
Although the script above produces the expected result, it would take days to run over the real data size (~1M rows). So I thought of parallel programming.
Parallel programming:
I have tried to nest two foreach loops using the %:% operator and run them in parallel using the %dopar% operator from the foreach package.
cl <- makeCluster(detectCores()) # initiate clusters (I have 8 cores on my local machine)
registerDoSNOW(cl) # point foreach at the parallel backend
clusterExport(cl, list("A")) # export data to the workers
clusterEvalQ(cl, c(library(tidyverse),
                   library(stringdist))) # load the used packages on the workers
foreach(i = seq_along(A$client_id)) %:%
  foreach(j = seq_along(A$client_id)) %dopar% {
    # calculate name similarities
    name_similarity <- stringdist::stringsim(A$names[i],
                                             A$names[j],
                                             method = "osa",
                                             useBytes = TRUE)
    # calculate address similarities
    adresses_similarity <- stringdist::stringsim(A$adresses[i],
                                                 A$adresses[j],
                                                 method = "osa",
                                                 useBytes = TRUE)
    # Decision & New_ID attribution
    if(name_similarity > 0.9) {
      if(adresses_similarity > 0.85){
        A[i, 4] <- j # New ID
      }
    } # decision end
  }
stopCluster(cl)
However, after running the parallel nested foreach loops, the New_ID column is still empty.
I've tried to unlist() the result, since the foreach loop returns its values in a list, but that doesn't work.
How can I write the nested parallel foreach to obtain the same result as the nested for loops? Thanks
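For reference, the usual fix is to have each iteration return a value and let foreach combine the results, rather than assigning into A inside %dopar% (each worker modifies its own copy of A, so those assignments are lost). A minimal sketch using the thresholds and methods from the working for loop; new_ids and match_j are illustrative names:

new_ids <- foreach(i = seq_along(A$client_id), .combine = c) %dopar% {
  match_j <- NA
  for (j in seq_along(A$client_id)) {
    name_sim <- stringdist::stringsim(A$names[i], A$names[j],
                                      method = "osa", useBytes = TRUE)
    adr_sim  <- stringdist::stringsim(A$adresses[i], A$adresses[j],
                                      method = "qgram", useBytes = TRUE)
    if (name_sim > 0.9 && adr_sim > 0.85) match_j <- j  # last match wins, as in the loop
  }
  match_j
}
A$New_ID <- new_ids  # assign once, outside the parallel region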
I am new to R and programming. I want to store values from a loop in a data frame in R: the ker, cValues and accuracyValues values from the code below. I am not able to achieve this; the data frame only saves the last value, not all the values.
Can you please help me with this?
library(kernlab) # for ksvm
# Define a vector which has different kernel methods
kerna <- c("rbfdot","polydot","vanilladot","tanhdot","laplacedot",
           "besseldot","anovadot","splinedot")
# Define a for loop to calculate accuracy for different values of C and kernel
for (ker in kerna){
  cValues <- c()
  accuracyValues <- c()
  for (c in 1:100) {
    model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
                  data = credit_card_data,
                  type = "C-svc",
                  kernel = ker,
                  C = c,
                  scaled = TRUE)
    pred <- predict(model, credit_card_data[,1:10])
    #pred
    accuracy <- sum(pred == credit_card_data$V11)/nrow(credit_card_data)
    cValues[c] <- c
    accuracyValues[c] <- accuracy
  }
  for(i in 1:100) {
    print(paste("kernel:", ker, "c=", cValues[i], "accuracy=", accuracyValues[i]))
  }
}
Starting from your base code, set up the structure of the output data frame first, then loop through and fill in the accuracy values on each iteration. This method also "flattens" the nested loop and gets rid of your c variable, which conflicts with the built-in c() function.
kerna <- c("rbfdot","polydot","vanilladot","tanhdot","laplacedot",
           "besseldot","anovadot","splinedot")
# Create dataframe to store output data
df <- data.frame(kerna = rep(kerna, each = 100),
                 cValues = rep(1:100, times = length(kerna)),
                 accuracyValues = NA,
                 stringsAsFactors = FALSE)
# Define a for loop to calculate accuracy for different values of C and kernel
for (i in 1:nrow(df)){
  ker <- df$kerna[i]
  j <- df$cValues[i]
  model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
                data = credit_card_data,
                type = "C-svc",
                kernel = ker,
                C = j,
                scaled = TRUE)
  pred <- predict(model, credit_card_data[,1:10])
  accuracy <- sum(pred == credit_card_data$V11)/nrow(credit_card_data)
  # Insert accuracy into df$accuracyValues
  df$accuracyValues[i] <- accuracy
}
Consider Map to build a list of data frames, one per pairing of ker and cValues (1:100) generated by expand.grid, then row-bind all elements together.
k_c_pairs_df <- expand.grid(kerna = kerna, c_value = 1:100, stringsAsFactors = FALSE)
model_fct <- function(ker, c) {
  model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
                data = credit_card_data,
                type = "C-svc",
                kernel = ker,
                C = c,
                scaled = TRUE)
  pred <- predict(model, credit_card_data[,1:10])
  accuracy <- sum(pred == credit_card_data$V11)/nrow(credit_card_data)
  print(paste("kernel:", ker, "c=", c, "accuracy=", accuracy))
  return(data.frame(kernel = ker, cValues = c, accuracyValues = accuracy))
}
df_list <- Map(model_fct, k_c_pairs_df$kerna, k_c_pairs_df$c_value)
final_df <- do.call(rbind, df_list)
The following are functions for bootstrapping, but how can I make the result reproducible? I tried set.seed(), but that does not work because every time lapply calls the function boot.lm.vector, the function produces one simulated set and calculates the coefficients once. Is there anything in R that can act like a seed list? Or any other way to make the result reproducible?
boot.lm.vector <- function(index, inputData) {
  d <- inputData[sample.int(nrow(inputData), replace = TRUE), ]
  a <- ncol(inputData) - 1
  X <- d[, 1:a]
  y <- d[, a + 1]
  solve(crossprod(X), crossprod(X, y))
}
rtest <- lapply(1:10000, FUN = boot.lm.vector, inputData = boot_set)
rtestdf <- plyr::ldply(rtest)
If you set the seed using an index inside your function, you should be able to reproduce it. Dummy boot.lm.vector function below:
## samples 1 item from inputData
boot.lm.vector <- function(index, inputData) {
  set.seed(index)
  return(sample(inputData, 1))
}
## iterating 5 times: use lapply as per your requirement
test <- sapply(1:5, FUN = boot.lm.vector, inputData = 1:10)
test
[1] 3 2 2 6 3 # reproducible result
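Applied to the original bootstrap function, the same pattern would look like the sketch below; the seed is tied to each iteration's index, so rerunning lapply(1:10000, ...) reproduces the same draws:

boot.lm.vector <- function(index, inputData) {
  set.seed(index)  # seed depends only on the iteration index
  d <- inputData[sample.int(nrow(inputData), replace = TRUE), ]
  a <- ncol(inputData) - 1
  X <- d[, 1:a]
  y <- d[, a + 1]
  solve(crossprod(X), crossprod(X, y))
}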
This is a very simple example.
df = c("already ","miss you","haters","she's cool")
df = data.frame(df)
library(doParallel)
cl = makeCluster(4)
registerDoParallel(cl)
foreach(i = df[1:4,1], .combine = rbind, .packages='tm') %dopar% classification(i)
stopCluster(cl)
In the real case I have a data frame with n = 400,000 rows.
I don't know how to send nrow/ncluster rows to each cluster in one step; what should i be?
I tried isplitRows from library(itertools) without success.
You should try to work with indices to create subsets of your data.
foreach(i = 1:nrow(df), .combine = rbind, .packages = 'tm') %dopar% {
  tmp <- df[i, ]
  classification(tmp)
}
This will take a new row of the data.frame each iteration.
Furthermore, you should notice that the result of a foreach loop will be written to a new variable. Thus, you should assign it like this:
res <- foreach(i = 1:10, .combine = c, ....) %dopar% {
  # things you want to do
  x <- someFancyFunction()
  # the last value will be returned and combined by the .combine function
  x
}
Try using a combination of split and mclapply, as proposed in Approach 1 here: https://www.r-bloggers.com/trying-to-reduce-the-memory-overhead-when-using-mclapply/
split lets you divide data into groups defined by a factor, or you can just use 1:nrow(df) if you want to do the operation on each row separately.
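A rough sketch of that approach, assuming the row-wise classification() function from the question (n = 4 is illustrative; note that mclapply uses forking, so it runs in parallel on Unix-alikes but falls back to serial on Windows):

library(parallel)
n <- 4  # number of workers
# split the text column into n contiguous chunks, preserving row order
chunks <- split(df[[1]], rep(1:n, each = ceiling(nrow(df)/n), length.out = nrow(df)))
res <- mclapply(chunks, function(txts) {
  do.call(rbind, lapply(txts, classification))
}, mc.cores = n)
result <- do.call(rbind, res)  # same shape as the foreach rbind result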
My solution after your comments:
n <- 8 # number of clusters
library(foreach)
library(doParallel)
cl <- makeCluster(n)
registerDoParallel(cl)
z <- nrow(df)
y <- floor(z/n)
x <- nrow(df) %% n
ris <- foreach(i = split(df[1:(z-x),], rep(1:n, each = y)), .combine = rbind, .packages = 'tm') %dopar% someFancyFunction(i)
stopCluster(cl)
# handle the remaining x rows sequentially
if (x != 0)
  ris <- rbind(ris, someFancyFunction(df[(z-x+1):z, 1]))
Note:
I used sequential execution at the end because, if x is not zero, split puts the remaining rows (z - (z - x)) in the first cluster and changes the order of the result.
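That behaviour is easy to see on a toy vector; with 10 elements and 4 groups of floor(10/4) = 2, the grouping factor is recycled (with a warning) and the two leftover elements land in the first group:

split(1:10, rep(1:4, each = 2))
#> $`1`
#> [1]  1  2  9 10
#> $`2`
#> [1] 3 4
#> $`3`
#> [1] 5 6
#> $`4`
#> [1] 7 8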
require(quantmod)
require(TTR)
iris2 <- iris[1:4]
b=NULL
for (i in 1:ncol(iris2)){
  for (j in 1:ncol(iris2)){
    a <- runCor(iris2[,i], iris2[,j], n = 21)
    b <- cbind(b, a)
  }
}
I want to calculate rolling correlations between the columns of a dataframe and store the results separately per column. Although the code above stores the data in variable b, it is not very useful as it just dumps all the results together. What I would like is to create a different dataframe for each i.
In this case, as I have 4 columns, what I ultimately want is 4 dataframes, each containing 4 columns of rolling correlations, i.e. df1 = corr of col 1 vs cols 1,2,3,4; df2 = corr of col 2 vs cols 1,2,3,4; etc.
I thought of using lapply or rollapply, but ran into the same problem.
d <- NULL
for (i in 1:ncol(iris2))
  for (j in 1:ncol(iris2)) {
    c <- rollapply(iris2, 21, function(x) cor(x[,i], x[,j]), by.column = FALSE)
    d <- cbind(d, c)
  }
Would really appreciate any inputs.
If you want to keep the expanded loop, how about a list of dataframes?
e <- vector("list", ncol(iris2))  # pre-allocate one list slot per column
for (i in 1:ncol(iris2)) {
  d <- matrix(0, nrow = nrow(iris2), ncol = ncol(iris2))
  for (j in 1:ncol(iris2)) {
    d[,j] <- runCor(iris2[,i], iris2[,j], n = 21)
  }
  e[[i]] <- d
}
It's also a good idea to allocate the amount of space you want with placeholders and put items into that space rather than use rbind or cbind.
Although it is not good practice to create dataframes on the fly in R (you should prefer putting them in a list, as in the other answer), the way to do so is to use the assign and get functions.
for (i in 1:ncol(iris2)) {
  d <- NULL
  for (j in 1:ncol(iris2)) {
    # accumulate the columns first; assigning inside the j loop
    # would keep only the last column
    d <- cbind(d, runCor(iris2[,i], iris2[,j], n = 21))
  }
  # Assign the result to the name df1, df2...
  assign(paste0("df", i), as.data.frame(d))
}
# to access a dataframe:
get("df1")
# or inside a loop
get(paste0("df", i))
Since you stated your computation was slow, I wanted to provide you with a parallel solution. If you have a modern computer, it probably has 2 cores, if not 4 (or more!). You can easily check this via:
require(parallel) # for parallelization
detectCores()
Now the code:
require(quantmod)
require(TTR)
iris2 <- iris[,1:4]
Parallelization requires that the functions and variables be placed into a special environment that is created and destroyed with each process. That means a wrapper function must be created to define those variables and functions.
wrapper <- function(data, n) {
  # variables placed into environment
  force(data)
  force(n)
  # functions placed into environment
  # same inner loop as written in the earlier answer
  runcor <- function(data, n, i) {
    d <- matrix(0, nrow = nrow(data), ncol = ncol(data))
    for (j in 1:ncol(data)) {
      d[,j] <- TTR::runCor(data[,i], data[,j], n = n)  # fill column j, not i
    }
    return(d)
  }
  # call function to loop over iterator i
  worker <- function(i) {
    runcor(data, n, i)
  }
  return(worker)
}
Now create a cluster on your local computer. This allows the multiple cores to run separately.
parallelcluster <- makeCluster(parallel::detectCores())
models <- parallel::parLapply(parallelcluster, 1:ncol(iris2),
                              wrapper(data = iris2, n = 21))
stopCluster(parallelcluster)
Stop and close the cluster when finished.
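If you want the four named data frames from the original question, the list returned by parLapply converts directly; a short sketch (dfs is an illustrative name):

# models is a list of matrices, one per column i of iris2
dfs <- lapply(models, as.data.frame)
names(dfs) <- paste0("df", seq_along(dfs))
head(dfs$df1)  # rolling correlations of column 1 vs columns 1 to 4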