I am trying to translate the language of a text using parallel processing in R. This is my first time using parallel processing. My code is:
install.packages("RYandexTranslate")
install.packages("textcat")
install.packages("plyr")
install.packages("parallel")
library("RYandexTranslate")
library("textcat")
library("dplyr")
library("parallel")
api_key <- "trnsl.1.1.20160707T103515Z.90fa575d702ae81e.6ec78e064eb94a1c00a9bc506c615f223cf0cf5b"
cl <- makeCluster(4)
Query_L_German <- c("5 euro muenze stempelglanz","2 euro muenzen uebersicht")
Par_Conversion <- function(QUery_L_German)
{
  for(i in 1:length(Query_L_German))
  {
    x <- translate(api_key,Query_L_German[i], "de-en")$text
    return(x)
  }
}
a <- length(Query_L_German)
parLapply(cl, seq(a), function(i,Query_L_German,Par_Conversion)
  for(i in 1:length(Query_L_German)){
    x <- Par_Conversion(Query_L_German)
    return(x)
  }, Query_L_German, Par_Conversion)
But I am getting the following error:
Error in checkForRemoteErrors(val) : 3 nodes produced errors; first
error: object 'Query_L_German' not found
When you use parLapply, you need to make the functions and variables used inside it available to the workers explicitly. This can be done by listing them in the varlist argument of clusterExport. Here is an in-depth question/answer on how to do this and other things with parLapply if you want to understand more.
Your example can be solved by inserting the following line before parLapply is called:
clusterExport(cl, varlist = c("api_key","Query_L_German","translate"))
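As a minimal sketch of the same idea (not the original code): the body of Par_Conversion refers to Query_L_German and api_key as globals, because its argument is spelled QUery_L_German, so the workers must be given those objects. The example below uses a hypothetical stand-in, fake_translate, in place of the real translate call so that it runs without an API key, and it iterates over the queries directly instead of looping inside each task. The names queries, prefix and fake_translate are assumptions made for illustration.

```r
library(parallel)

cl <- makeCluster(2)  # two workers, chosen only for illustration

queries <- c("5 euro muenze stempelglanz", "2 euro muenzen uebersicht")
prefix  <- "de-en"    # stand-in for a global such as api_key

# Hypothetical stand-in for translate(): uses the global `prefix`
fake_translate <- function(q) paste(prefix, toupper(q))

# Without this export, the workers would fail with: object 'prefix' not found
clusterExport(cl, varlist = "prefix", envir = environment())

# One task per query; no inner for loop is needed
res <- parLapply(cl, queries, fake_translate)

stopCluster(cl)
print(res)
```

With the real package, the package itself would presumably also need to be loaded on the workers, for example with clusterEvalQ(cl, library(RYandexTranslate)).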
I'm currently developing an R package that will use parallel computing to solve some tasks, by means of the "parallel" package.
I'm getting some really awkward behavior when using clusters defined inside functions of my package: parLapply appears to assign a job to one worker and wait for it to finish before assigning a job to the next worker.
Or at least this is what appears to be happening, judging from the log file "cluster.log" and the list of running processes in the Unix shell.
Below is a mock-up version of the original function declared inside my package:
.parSolver <- function( varMatrix, var1 ) {
  no_cores <- detectCores()
  #Rows in varMatrix
  rows <- 1:nrow(varMatrix[,])
  # Split rows in n parts
  n <- no_cores
  parts <- split(rows, cut(rows, n))
  # Initiate cluster
  cl <- makePSOCKcluster(no_cores, methods = FALSE, outfile = "/home/cluster.log")
  clusterEvalQ(cl, library(raster))
  clusterExport(cl, "varMatrix", envir=environment())
  clusterExport(cl, "var1", envir=environment())
  rParts <- parLapply(cl = cl, X = 1:n, fun = function(x){
    part <- rasterize(varMatrix[parts[[x]],], raster(var1), .....)
    print(x)
    return(part)
  })
  do.call(merge, rParts)
}
NOTES:
I'm using makePSOCKcluster because I want the code to run on Windows and Unix systems alike, although this particular problem only manifests itself on a Unix system.
The functions rasterize and raster are defined in library(raster), which is loaded on the cluster workers.
The weird part to me is that if I execute the exact same code of the function parSolver in the global environment, everything works smoothly: all workers take one job at the same time and the task completes in no time.
However if I do something like:
library(myPackage)
varMatrix <- (...)
var1 <- (...)
result <- parSolver(varMatrix, var1)
the described problem appears.
It appears to be a load-balancing problem; however, that does not explain why it works fine in one situation and not in the other.
Am I missing something here?
Thanks in advance.
I don't think parLapply is running sequentially. More likely, it's just running inefficiently, making it appear to run sequentially.
I have a few suggestions to improve it:
Don't define the worker function inside parSolver
Don't export all of varMatrix to each worker
Create the cluster outside of parSolver
The first point is important, because as your example now stands, all of the variables defined in parSolver will be serialized along with the anonymous worker function and sent to the workers by parLapply. By defining the worker function outside of any function, the serialization won't capture any unwanted variables.
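To illustrate the capture effect, here is a small sketch that compares the serialized size of a function created inside another function with one defined at top level. The names make_inner, inner_fn and outer_fn are hypothetical; the exact byte counts will vary, but the closure built inside make_inner drags the large local variable along with it.

```r
# Hypothetical helper: creates a worker function inside another function
make_inner <- function() {
  big <- numeric(1e6)   # large local variable that the worker never uses
  function(x) x + 1     # closure whose environment still contains `big`
}

# Same body, but defined outside of any function
outer_fn <- function(x) x + 1

inner_fn <- make_inner()

# serialize() is what the parallel package relies on to ship objects to workers
size_inner <- length(serialize(inner_fn, NULL))
size_outer <- length(serialize(outer_fn, NULL))

print(c(inner = size_inner, outer = size_outer))
```

When run at top level, the first number should be in the megabytes (the 8 MB vector is serialized with the closure), while the second should be far smaller, because the global environment is serialized only as a reference.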
The second point avoids unnecessary socket I/O and uses less memory, making the code more scalable.
Here's a fake, but self-contained example that is similar to yours that demonstrates my suggestions:
# Define worker function outside of any function to avoid
# serialization problems (such as unexpected variable capture)
workerfn <- function(mat, var1) {
  library(raster)
  mat * var1
}
parSolver <- function(cl, varMatrix, var1) {
  parts <- splitIndices(nrow(varMatrix), length(cl))
  varMatrixParts <- lapply(parts, function(i) varMatrix[i,,drop=FALSE])
  rParts <- clusterApply(cl, varMatrixParts, workerfn, var1)
  do.call(rbind, rParts)
}
library(parallel)
cl <- makePSOCKcluster(3)
r <- parSolver(cl, matrix(1:20, 10, 2), 2)
print(r)
Note that this takes advantage of the clusterApply function to iterate over a list of row-chunks of varMatrix so that the entire matrix doesn't need to be sent to everyone. It also avoids calls to clusterEvalQ and clusterExport, simplifying the code, as well as making it a bit more efficient.
I'm having some trouble with the variables inside a foreach loop. I create the cluster and set up a couple of vectors:
library(doParallel)
ncores <- detectCores() - 2
cl <- makeCluster(ncores, outfile="", port=11439)
registerDoParallel(cl)
results <- rep(NA,10)
values <- 20:30
Then, it does not work:
# Error: object 'i' not found
foreach(i=1:10) %dopar%
results[i] <- i
stopCluster(cl)
While this does:
# ok
foreach(i=1:10) %dopar%
values[i]
stopCluster(cl)
How come it finds i when it is used as an index in values[i], but does not find it when it appears on the right-hand side of an assignment?
From my comment:
Try it with curly braces.
foreach(i=1:10) %dopar% {
  results[i] <- i
}
This is not limited to this example; in my experience it is better to use curly braces in R, since many problems can be avoided by using them. There are apparently some further advantages to these little helpers, as you may see while browsing the Internet (e.g. see here).
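The reason the braces matter here is operator precedence: custom %op% operators such as %dopar% bind more tightly than <-, so without braces the whole foreach call ends up on the left-hand side of the assignment. A small sketch that inspects the parsed expressions with quote (nothing is evaluated, so foreach does not need to be loaded):

```r
no_braces   <- quote(foreach(i = 1:10) %dopar% results[i] <- i)
with_braces <- quote(foreach(i = 1:10) %dopar% { results[i] <- i })

# Without braces the outermost call is the assignment ...
top_no_braces <- as.character(no_braces[[1]])
# ... and its left-hand side is the whole %dopar% call
lhs_no_braces <- as.character(no_braces[[2]][[1]])
# With braces the outermost call is %dopar% itself
top_with_braces <- as.character(with_braces[[1]])

print(c(top_no_braces, lhs_no_braces, top_with_braces))
# prints "<-" "%dopar%" "%dopar%"
```

This is consistent with the error message: in the version without braces, the i on the right-hand side of the assignment is evaluated in the calling session, outside the loop, where no i exists. Note also that even with braces, the assignment to results happens on a worker's copy; the values come back as the list returned by foreach.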
I would like to split a large data.frame into chunks and pass each individually to the different members of the cluster.
Something like:
library(parallel)
cl <- makeCluster(detectCores())
for (i in 1:detectCores()) {
clusterExport(cl, mydata[indices[[i]]], <extra option to specify a thread/process>)
}
Is this possible?
Here is an example that uses clusterCall inside a for loop to send a different chunk of the data frame to each of the workers:
library(parallel)
cl <- makeCluster(detectCores())
df <- data.frame(a=1:10, b=1:10)
ix <- splitIndices(nrow(df), length(cl))
for (i in seq_along(cl)) {
  clusterCall(cl[i], function(d) {
    assign('mydata', d, pos=.GlobalEnv)
    NULL # don't return any data to the master
  }, df[ix[[i]],,drop=FALSE])
}
Note that the call to clusterCall is subsetting cl in order to execute the function on a single worker each time through the for loop.
You can verify that the workers were properly initialized in this example using:
r <- do.call('rbind', clusterEvalQ(cl, mydata))
identical(df, r)
There are easier ways to do this, but this example minimizes the memory used by the master and the amount of data sent to each of the workers. This is important when the data frame is very large.
My challenge is to compute a recursive function in parallel. However, the recursion is quite deep, and therefore (in my own novice words) there is an issue with allocating a worker when all the workers are busy. In short, it crashes.
Here is some reproducible code. The code is very stupid, but the structure is what counts. This is a simplified version of what is going on.
I work on a Windows machine; if the solution is to go to Linux, just say the word. Because the real function can be quite deep, managing the number of workers that are called for at the upper level will not solve the issue. Is there perhaps a way to know at what level the recursion is?
FUN <- function(optimizer,neighbors,considered,x){
  considered <- c(considered,optimizer)
  neighbors <- setdiff(x=neighbors,y=considered)
  if (length(neighbors)==0) {
    # this loop is STUPID, but it is just an example.
    z <- numeric(10)
    for (i in 1:100)
    {
      z[i] <- sample(x,1)
    }
    return(max(z))
  } else {
    # Something embarrassingly parallel,
    # but cannot be vectorized.
    z <- numeric(10)
    z <- foreach(i=1:10, .combine='c') %dopar%{
      FUN(optimizer=neighbors[1],neighbors=neighbors,
          considered=considered,x=x)}
    return(max(z))
  }
}
require(doParallel,quietly=T)
cl <- makeCluster(3)
clusterExport(cl, c("FUN"))
registerDoParallel(cl)
getDoParWorkers()
>FUN(optimizer=1,neighbors=c(2),considered=c(),x=1:500)
[1] 500
>FUN(optimizer=1,neighbors=c(2,3),considered=c(),x=1:500)
Error in { : task 1 failed - "could not find function "%dopar%""
>FUN(optimizer=1,neighbors=c(2,3),considered=c(),x=1:500)
Error in { : task 1 failed - "could not find function "%dopar%""
Is this error really because the recursion is too deep, or is it just because you haven't got require(doParallel) in your FUN function? When FUN is called on the workers, that instance of R hasn't got the package loaded.
Your first example doesn't hit this because it's simple enough not to reach the inner %dopar% loop.
I'm trying to run a NetLogo simulation (using RNetLogo package) in R using parallel processing on my laptop. I'm trying to assess "t-feeding of females" using 3 (i.e., 0, 25, and 50) different "minimum-separation" values. For each "minimum-separation" value, I'd like to replicate the simulation 10 times. I can run everything correctly just using lapply but I'm having trouble with parLapply. I've just started using the package "parallel" so I'm sure it is something in the syntax.
#Set up clusters for parallel
processors <- detectCores()
cl <- makeCluster(processors)
#Simulation
sim3 <- function(min_sep) {
  NLCommand("set minimum-separation ", min_sep, "setup")
  ret <- NLDoReport(720, "go", "[t-feeding] of females", as.data.frame=TRUE)
  tot <- sum(ret[,1])
  return(tot)
}
#Replicate simulations 10 times using lapply and create boxplots. This one works.
rep.sim3 <- function(min_sep, rep) {
  return(
    lapply(min_sep, function(min_sep) {
      replicate(rep, sim3(min_sep))
    })
  )
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
boxplot(res,names=d, xlab="Minimum Separation", ylab="Time spent feeding")
#Replicate simulations 10 times using parLapply. This one does not work.
rep.sim3 <- function(min_sep, rep) {
  return(
    parLapply(cl, min_sep, function(min_sep) {
      replicate(rep, sim3(min_sep))
    })
  )
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
# Error in checkForRemoteErrors(val) : 3 nodes produced errors; first error: could not find function "sim3"
#Replicate simulations 10 times using parLapply. This one does work but creates a list of the wrong length and therefore the boxplot cannot be plotted correctly.
rep.sim3 <- function(min_sep, rep) {
  return(
    parLapply(cl, replicate(rep, d), sim3))
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
Ideally I'd like to make the first parLapply work. Alternatively, I guess I could modify res from the parLapply that works so that the list has a length of max_sep instead of 30. However, I can't seem to do that. Any help would be much appreciated!
Thanks in advance.
You need to initialize the cluster workers before executing rep.sim3. The error message indicates that your workers can't execute the sim3 function because you haven't exported it to them. Also, I noticed that you haven't loaded the RNetLogo package on the workers, either.
The easiest way to initialize the workers is with the clusterEvalQ and clusterExport functions:
clusterEvalQ(cl, library(RNetLogo))
clusterExport(cl, 'sim3')
Note that you shouldn't do this in your rep.sim3 function, since that would be inefficient and unnecessary. Do it just once, after the cluster object has been created and sim3 has been defined.
This initialization is necessary because the workers started via makeCluster don't know anything about your variables or functions, or anything else about your R session. And parLapply doesn't analyze the function that you pass to it any more than lapply does. The difference is that lapply executes in your local R session where sim3 is defined and the RNetLogo package is loaded. parLapply executes the specified function in remote R sessions that have not been initialized by executing your R script.
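A small sketch of that difference, using a hypothetical stand-in sim_stub in place of sim3 (so that it runs without NetLogo) and a two-worker cluster chosen for illustration. It checks whether the function exists on the workers before and after clusterExport, then runs a replicated parLapply that returns one element per separation value:

```r
library(parallel)

cl <- makeCluster(2)

# Hypothetical stand-in for sim3(): no NetLogo involved
sim_stub <- function(min_sep) min_sep * 2

# Fresh workers know nothing about objects defined in the master session
before <- unlist(clusterEvalQ(cl, exists("sim_stub")))

# Initialize the workers once, after the cluster and the function exist
clusterExport(cl, "sim_stub", envir = environment())
after <- unlist(clusterEvalQ(cl, exists("sim_stub")))

# One list element per separation value, each holding 3 replicates
d <- seq(0, 50, 25)
res <- parLapply(cl, d, function(m) replicate(3, sim_stub(m)))

stopCluster(cl)
print(before)
print(after)
print(length(res))
```

Because each task handles one separation value and replicates inside the task, the result has one element per value of d, which is the shape the boxplot call in the question expects.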