R foreach parallel processing with ffdf and mapply

I have a large ffdf named 'Scenarios' to which I am applying a function from the NGA package. I am already using chunks (mychunks) to try to speed things up, but it is still slow. Could I also run it with parallel processing, say with the foreach package? My current code is shown below:
PGA <- rep(NA, Nevs)                        # pre-allocate the result vector
mychunks <- chunk(Scenarios)                # ff chunk indices
for (myblock in mychunks) {
  ScenariosINRAM <- Scenarios[myblock, ]    # pull one chunk into RAM
  PGA[seq(min(myblock), max(myblock))] <-
    mapply(Sa.ba, ScenariosINRAM$Magnitude, ScenariosINRAM$Rjb, Vs30,
           ScenariosINRAM$Epsilon,
           T = 0, rake = NA, U = 0, SS = 1, NS = 0, RS = 0, AB11 = 1)
}
I have not had much success with foreach so far, and I really need the speed-up. Any help would be greatly appreciated. Thanks.
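One way to try this (a sketch only, not tested against the real data; Scenarios, Vs30 and Sa.ba are assumed to exist exactly as above, and .packages = "nga" is an assumption about where Sa.ba lives): pull each chunk into RAM on the master using the chunk() indices you already have, then let foreach with a doParallel backend spread the mapply() calls over several cores.

library(ff)             # chunk(), as already used above
library(foreach)
library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

## extract each chunk into an ordinary data frame on the master,
## so the workers never have to touch the ff files themselves
chunk_list <- lapply(chunk(Scenarios), function(myblock) Scenarios[myblock, ])

PGA_list <- foreach(ScenariosINRAM = chunk_list, .packages = "nga") %dopar% {
  mapply(Sa.ba,
         ScenariosINRAM$Magnitude, ScenariosINRAM$Rjb, Vs30,
         ScenariosINRAM$Epsilon,
         T = 0, rake = NA, U = 0, SS = 1, NS = 0, RS = 0, AB11 = 1)
}

stopCluster(cl)

## results come back in chunk order, so they simply concatenate
PGA <- unlist(PGA_list)

If RAM is too tight to hold all the extracted chunks at once, the Scenarios[myblock, ] step can instead be moved inside the %dopar% body, at the cost of having the ffdf opened on every worker.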

Related

read_html malfunctions when used with parLapply in R

So I am trying to scrape an API with millions of individual pages. To do so I am using parallel processing and the rvest package. The problem arises because the read_html function returns an empty XML document when used inside the cluster. Does anyone have a solution for this? So far I have been using the getURL function, but the problem is that the size of the object grows from one function to the next, and when talking about tens of millions of websites that makes quite a difference. My sample (toy) code that shows the problem is:
library(parallel)
library(rvest)                       # needed on the master for the first call

docss <- c('https://stackoverflow.com/', 'https://stackoverflow.com/')

read_html(docss[1])                  # works as expected on the master

cl <- makeCluster(2)
clusterEvalQ(cl, {require(rvest)})
clusterExport(cl, 'docss')

dats <- parLapply(cl, docss, function(j) {
  read_html(j)                       # comes back as an empty xml document
})
dats
stopCluster(cl)
This happens not only with the parLapply function but also with doParallel's foreach, and since I am working on Windows I cannot use many of the other options for parallel processing.
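For what it is worth, the usual explanation is that read_html() returns an external pointer (an xml_document), and external pointers cannot be serialised back from the workers, so the master receives an empty document. A minimal sketch of the common workaround, which also works with PSOCK clusters on Windows: do the extraction (or an as.character() conversion) on the worker and return only ordinary R objects.

library(parallel)
library(rvest)

docss <- c('https://stackoverflow.com/', 'https://stackoverflow.com/')

cl <- makeCluster(2)
clusterEvalQ(cl, library(rvest))

dats <- parLapply(cl, docss, function(url) {
  page <- read_html(url)
  list(
    title = html_text(html_node(page, "title")),  # extract on the worker
    html  = as.character(page)                    # serialisable copy of the page
  )
})
stopCluster(cl)

str(dats)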

Is there a way to combine Rmpi & mclapply?

I have some R code that applies a function to a list of objects. The function is simple but involves a bootstrapping calculation, which can be easily sped up using mclapply. When run on a single node, everything is fine.
However, I have a cluster and what I've been trying to do is to distribute the application of the function to the list of objects across multiple nodes. To do this I've been using Rmpi (0.6-6).
The code below runs fine
library(Rmpi)

cl <- parallel::makeCluster(10, type = 'MPI')
parallel::clusterExport(cl, varlist = c('as.matrix'), envir = environment())

descriptor <- parallel::parLapply(1:5, function(am) {
  # the inner lapply is the part I would like to speed up with mclapply
  val <- mean(unlist(lapply(1:120, function(x) mean(rnorm(1e7)))))
  return(c(val, Rmpi::mpi.universe.size()))
}, cl = cl)

print(do.call(rbind, descriptor))
snow::stopCluster(cl)
However, if I convert the lapply to mclapply and set mc.cores=10, MPI warns that forking will lead to bad things, and the job hangs.
(In all cases jobs are being submitted via SLURM)
Based on the MPI warning, it seems that I should not be using mclapply within Rmpi jobs. Is this a correct assessment?
If so, does anybody have suggestions on how I can parallelize the function that is being run on each node?
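One possible answer (a sketch under the assumption that the individual bootstrap replicates are independent and that SLURM can grant one MPI rank per core): avoid forking inside MPI processes altogether by flattening both levels into a single layer of MPI workers, then folding the replicates back together on the master. The computation below only mirrors the toy rnorm() example above.

library(Rmpi)
library(parallel)

n_workers <- mpi.universe.size() - 1          # e.g. 10 nodes x n cores, minus the master
cl <- makeCluster(n_workers, type = 'MPI')

## 5 objects x 120 bootstrap replicates, expanded into one flat task list
tasks <- expand.grid(object = 1:5, rep = 1:120)

vals <- parLapply(cl, seq_len(nrow(tasks)), function(i) {
  mean(rnorm(1e7))                            # stand-in for one bootstrap replicate
})

## average the replicates back per object
descriptor <- tapply(unlist(vals), tasks$object, mean)
print(descriptor)

stopCluster(cl)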

How can I list only functions and those that come from a package

I use the foreach package to parallelize some stuff, and I am tired of indicating 5 functions in .export every time I need to use it.
I know I can do foreach(..., .export=ls(.GlobalEnv)), but this transfers a lot of data to the workers and slows me down (there can be big tables defined).
So the question is: how can I list only the functions in .GlobalEnv?
I did this:
getAllFunctions <- function(envir = .GlobalEnv) {
  # skip operators like %op%, then take the class of every remaining object
  allClasses <- sapply(grep(x = ls(envir), pattern = '^%', value = TRUE, invert = TRUE),
                       FUN = function(x) class(eval(parse(text = x))))
  fnNames <- names(allClasses)[allClasses == 'function']
  return(fnNames)
}
But that's ugly (and it evaluates everything), and I'm sure there is a more idiomatic way.
From the comments:
as.list(.GlobalEnv)[sapply(.GlobalEnv, is.function)]
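A hedged variant that plugs straight into foreach: .export expects a character vector of names rather than the objects themselves, so pulling out just the function names (here via eapply(), which tests each object in place) keeps the big tables out of the transfer.

fn_names <- names(Filter(isTRUE, eapply(.GlobalEnv, is.function)))

library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

res <- foreach(i = 1:4, .export = fn_names) %dopar% {
  i^2          # the workers can now call any function defined in .GlobalEnv
}

stopCluster(cl)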

Error: could not find function "includePackage"

I am trying to execute the Random Forest algorithm with SparkR (Spark 1.5.1 installed). I do not understand why I am getting the error:
Error: could not find function "includePackage"
Further, even if I use the mapPartitions function in my code, I get the error:
Error: could not find function "mapPartitions"
Please find the code below:
rdd <- SparkR:::textFile(sc, "http://localhost:50070/explorer.html#/Datasets/Datasets/iris.csv", 5)
includePackage(sc, randomForest)
rf <- mapPartitions(rdd, function(input) {
  ## my function code for RF
})
This is more of a comment and a cross-question than an answer (I am not allowed to comment because of reputation), but just to take this further: if we are using the collect method to convert the RDD back to an R data frame, isn't that counterproductive, since if the data is too large it would take too long to process in R?
Also, does it mean that we could possibly use any R package, say markovChain or neuralnet, using the same methodology?
Kindly check the functions that can be used in SparkR: http://spark.apache.org/docs/latest/api/R/index.html
This does not include the functions mapPartitions() or includePackage().
# Reading a csv in SparkR (Spark 1.5 with the spark-csv package)
sparkRdf <- read.df(sqlContext, "./nycflights13.csv",
                    "com.databricks.spark.csv", header = "true")

# One possible way to use `randomForest` is to convert the SparkR data frame
# to an `R` data frame
Rdf <- collect(sparkRdf)

# compute as usual in `R` code
install.packages("randomForest")
library(randomForest)
......

# convert back to a SparkR data frame
sparkRdf <- createDataFrame(sqlContext, Rdf)
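To make the "compute as usual in R" step concrete, here is a hypothetical sketch that assumes the collected data frame holds the iris data from the question (a Species label plus numeric predictors); the column name and model settings are illustrative only.

library(randomForest)

Rdf <- collect(sparkRdf)                      # plain local data.frame
Rdf$Species <- as.factor(Rdf$Species)         # assumed label column
rf <- randomForest(Species ~ ., data = Rdf, ntree = 500)
print(rf)

# push the data back to Spark afterwards if it is needed there
sparkRdf <- createDataFrame(sqlContext, Rdf)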

How to parallelize a for loop in R

I have this loop and I'm wondering about the different ways to parallelize it:
for (i in 1:nrow(dataset)) {
  dataset$dayDiff[i] <- dataset$close[i] - dataset$open[i]
}
I was thinking of using lapply, but I don't see how to use a list in this context. Maybe I could use the foreach package with a parallel backend, but I don't know how to use it.
There is no good reason to use a loop here. Simply do dataset$dayDiff <- dataset$close - dataset$open. R is vectorized.
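To spell that out, and to show what a foreach version would look like if the per-row work really were expensive and not vectorisable (slowRowFun is a hypothetical stand-in, and the toy data frame is only there to make the snippet self-contained):

# the vectorised one-liner is all that is needed for this particular case
dataset <- data.frame(open = c(10, 12, 11), close = c(11, 13, 10))
dataset$dayDiff <- dataset$close - dataset$open

# foreach/doParallel pattern, worthwhile only for genuinely heavy per-row work
library(foreach)
library(doParallel)

slowRowFun <- function(close, open) close - open   # hypothetical expensive function

cl <- makeCluster(2)
registerDoParallel(cl)

dataset$dayDiff <- foreach(i = seq_len(nrow(dataset)), .combine = c) %dopar% {
  slowRowFun(dataset$close[i], dataset$open[i])
}

stopCluster(cl)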
