I'm using the MSGARCH package in R on Windows 10.
I need to fit a Markov-switching model many times (12,500 fits, driven by while and for loops) with this code:
X = CreateSpec(variance.spec = list(model = c()),
               distribution.spec = list(distribution = c()))
Y = FitML(data, spec = X)
How can I parallelize the last function (FitML())? I'd like to run many FitML() calls for various values of X.
Assuming you already have a list of X specifications, let us call it Xs, then you can call FitML() on each of the elements as:
Ys <- lapply(Xs, FUN = FitML)
The above applies the function to the elements sequentially. To do the same in parallel, you can use the future.apply package, part of the future ecosystem (I'm the author). The following parallelizes on your local machine and works on all operating systems:
library(future.apply)
plan(multiprocess)
Ys <- future_lapply(Xs, FUN = FitML)
If there is a random-number-generation (RNG) component to FitML(), then you need to use:
Ys <- future_lapply(Xs, FUN = FitML, future.seed = TRUE)
to make sure you use proper random numbers.
If you don't specify plan(), or specify plan(sequential), it will run sequentially.
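For completeness, here is a hedged sketch of how a list of specifications Xs could be built and then fitted in parallel. The two example specifications and the multisession backend are placeholders (not taken from the question), and data is assumed to be your return series:
library(MSGARCH)
library(future.apply)
plan(multisession)  # a parallel backend that also works on Windows

# Placeholder two-regime specifications; replace with your own combinations
Xs <- list(
  CreateSpec(variance.spec     = list(model        = c("sGARCH", "sGARCH")),
             distribution.spec = list(distribution = c("norm", "norm"))),
  CreateSpec(variance.spec     = list(model        = c("gjrGARCH", "gjrGARCH")),
             distribution.spec = list(distribution = c("std", "std")))
)

# Fit every specification to the same data in parallel
Ys <- future_lapply(Xs, FUN = FitML, data = data, future.seed = TRUE)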
A few days ago I asked a question about calling a custom-made function within a loop, which was nicely resolved by a combination of
eval(parse(text = Function text))
Here is the link: Automatic creation and use of custom made function in R.
This allowed me to work with a for loop and automatically call the function I need from a data frame storing the body of the function to create.
Now I would like to take the question to the next level. My problem is computation time. I need to evaluate something like 52 indices from a hyperspectral image; in R the hyperspectral image is loaded as a 3D array of 512x512x204 bands.
What I would like to do is run the evaluation of the indices in parallel to reduce the computation time.
Here is a dummy example of what I would like to emulate, but without parallel computing:
library(fields)  # needed for image.plot()
# create a fake array representing my hyperspectral image
HYPR_IMG = array(NA, dim = c(5, 3, 4))
HYPR_IMG[,,1]=1
HYPR_IMG[,,2]=2
HYPR_IMG[,,3]=3
HYPR_IMG[,,4]=4
image.plot(HYPR_IMG[,,1], zlim=c(0,20))
image.plot(HYPR_IMG[,,2], zlim=c(0,20))
image.plot(HYPR_IMG[,,3], zlim=c(0,20))
image.plot(HYPR_IMG[,,4], zlim=c(0,20))
#create a fake DF for simulating my indices stored in the dataframe
IDXname=c("IDX1","IDX2","IDX3","IDX4")
IDXFunc = c("HYPR_IMG[,,1] + 3*HYPR_IMG[,,2]",
            "HYPR_IMG[,,3] + HYPR_IMG[,,2]",
            "HYPR_IMG[,,4] + HYPR_IMG[,,2] - HYPR_IMG[,,3]",
            "HYPR_IMG[,,1] + HYPR_IMG[,,4] + 4*HYPR_IMG[,,2] + HYPR_IMG[,,3]")
IDX_DF=as.data.frame(cbind(IDXname,IDXFunc))
# that was what I did before
Store_DF = data.frame(NA)
for (i in 1:length(IDX_DF$IDXname)) {
  IDX_ID = IDX_DF$IDXname[i]
  IDX_Fun_tmp = IDX_DF$IDXFunc[which(IDX_DF$IDXname == IDX_ID)] # extra care to select the right function
  IDXFunc_call = paste("IDXfun_tmp = function(HYPR_IMG){", IDX_Fun_tmp, "}", sep = "")
  eval(parse(text = IDXFunc_call))
  IDX_VAL = IDXfun_tmp(HYPR_IMG)
  image.plot(IDX_VAL, zlim = c(0, 20)); title(main = IDX_ID)
  temp_DF = as.vector(IDX_VAL)
  Store_DF = cbind(Store_DF, temp_DF)
  names(Store_DF)[i + 1] <- as.vector(IDX_ID)
}
My final goal is to obtain the very same Store_DF, storing all the index values. Here I have a for loop, but using a foreach loop things should speed up. If it matters, I am working on Windows 8 or newer.
Is it really possible?
Will I be able, in the end, to reduce the overall computation time and still get the same Store_DF data frame, or something similar such as a matrix with column names?
Thanks a lot!
For this specific example, using either the built-in parallelization of a package like data.table or a parallel apply might be more beneficial.
Below is a minimal example of how to achieve the result using parApply from the parallel package. Note the output is a matrix, which actually yields slightly better performance in base R (not necessarily the case in the tidyverse or data.table). If the data.frame structure is vital, you will have to convert it with as.data.frame().
cl <- parallel::makeCluster( parallel::detectCores() )
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
  IDX_ID <- x[["IDXname"]]
  eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
  IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
  names(IDX_VAL) <- IDX_ID
  IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) <- IDXname
parallel::stopCluster(cl)
Please note the stopCluster(cl) call, which is important to shut down the worker R sessions.
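If the original Store_DF layout (one column per index in a data.frame) is needed, a quick hedged sketch of the conversion could be:
# Turn the matrix returned by parApply into a data.frame similar to Store_DF
Store_DF <- as.data.frame(result)
str(Store_DF)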
Benchmark results (4 tiny cores):
Unit: milliseconds
expr min lq mean median uq max neval
Loop 8.420432 9.027583 10.426565 9.272444 9.943783 26.58623 100
Parallel 1.382324 1.491634 2.038024 1.554690 1.907728 18.23942 100
For replications of benchmarks the code has been provided below:
cl <- parallel::makeCluster( parallel::detectCores() )
microbenchmark::microbenchmark(
  Loop = {
    Store_DF = data.frame(NA)
    for (i in 1:length(IDX_DF$IDXname)) {
      IDX_ID = IDX_DF$IDXname[i]
      IDX_Fun_tmp = IDX_DF$IDXFunc[which(IDX_DF$IDXname == IDX_ID)] # extra care to select the right function
      eval(parse(text = paste0("IDXfun_tmp = function(HYPR_IMG){", IDX_Fun_tmp, "}")))
      IDX_VAL = IDXfun_tmp(HYPR_IMG)
      # Plotting in parallel is not a good idea. It will most often not work, and it might
      # make the R session crash or slow down significantly (at best the latter, at worst the former)
      #image.plot(IDX_VAL, zlim = c(0,20)); title(main = IDX_ID)
      temp_DF = as.vector(IDX_VAL)
      Store_DF = cbind(Store_DF, temp_DF)
      names(Store_DF)[i + 1] <- as.vector(IDX_ID)
    }
    rm(Store_DF)
  },
  Parallel = {
    result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
      IDX_ID <- x[["IDXname"]]
      eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
      IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
      names(IDX_VAL) <- IDX_ID
      IDX_VAL
    }, IMAGES = HYPR_IMG)
    colnames(result) <- IDXname
    rm(result)
  }
)
parallel::stopCluster(cl)
Edit: Using the foreach package
After a few comments about performance issues (maybe due to memory), I decided to illustrate how one could obtain the same result using the foreach package. A few notes:
The foreach package uses iterators. By default it can be used like a for loop, where it will iterate over each column of a data.frame.
Like other parallel implementations in R on Windows, you will often have to export the data used in the calculations. This can sometimes be avoided with some fiddling, and foreach will sometimes let you get away without exporting data; when exactly that works is unclear from the documentation.
The output of foreach is combined into a list by default, or as defined by the .combine argument, which can be rbind, cbind or any other function.
There are a lot of comments, making the code seem much longer than it actually is. Removing comments and blank lines, it is 9 lines longer.
Below is the code which will yield the same output as above. Note I have used the data.table package; for more information about this package I suggest their wiki on GitHub.
cl <- parallel::makeCluster( parallel::detectCores() )
# foreach uses doParallel for the parallelization
doParallel::registerDoParallel(cl)
# To iterate over the rows, we need to use iterators
# (if foreach is given a matrix, it will be converted to a column iterator)
rowIterator <- iterators::iter(IDX_DF, by = "row")
library(foreach)
result <-
  foreach(
    # Supply the iterator
    row = rowIterator,
    # Specify whether the calculations need to be done in order. If not, we can get better performance by not doing so
    .inorder = FALSE,
    # In most foreach loops you will have to export the data you need for the calculations
    # (it worked without doing so for me; exporting matters most when the exported objects are large)
    #.export = c("HYPR_IMG"),
    # We need to say how the output should be merged. If nothing is given it will be output as a list
    # (data.table::rbindlist is faster than rbind and returns a data.table)
    .combine = function(...) data.table::rbindlist(list(...)),
    # otherwise we could have used:
    #.combine = rbind
    # if we don't use rbind or cbind (I used data.table::rbindlist for speed)
    # we have to tell foreach that the combine function can take more than 1 argument
    .multicombine = TRUE
  ) %dopar% { # %dopar% runs the loop in parallel; %do% runs it sequentially; %:% nests foreach loops
    IDX_ID <- row[["IDXname"]]
    eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", row[["IDXFunc"]], "}")))
    IDX_VAL <- as.vector(IDXfun_tmp(HYPR_IMG))
    data.frame(ID = IDX_ID, IDX_VAL)
  }
#output is saved in result
result
library(data.table)
result_reformatted <- dcast(result[, indx := 1:.N, by = ID],
                            indx ~ ID,
                            value.var = "IDX_VAL")
# if we don't want to use data.table we could use unstack instead
unstack(result, IDX_VAL ~ ID)
There are several packages in R to simplify running code in parallel, like foreach and future. Most of these have constructs which are like lapply or a for loop: they carry on until all the tasks have finished.
Is there a simple parallel version of Find? That is, I would like to run several tasks in parallel. I don't need all of them to finish, I just need to get the first one that finishes (maybe with a particular result). After that the other tasks can be killed, or left to finish on their own.
Conceptual code:
hunt_needle <- function (x, y) x %in% (y-1000):y
x <- sample.int(1000000, 1)
result <- parallel_find(seq(1000, 1000000, 1000), hunt_needle)
# should return the first value for which hunt_needle is true
You can use shared memory so that processes can communicate with one another.
For that, you can use package bigstatsr (disclaimer: I'm the author).
Choose a block size and do:
# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)
# Data example
cond <- logical(1e6)
cond[sample(length(cond), size = 1)] <- TRUE
ind.block <- bigstatsr:::CutBySize(length(cond), block.size = 1000)
cl <- parallel::makeCluster(nb_cores())
doParallel::registerDoParallel(cl)
# This value (in an on-disk matrix) is shared by processes
found_it <- FBM(1, 1, type = "integer", init = 0L)
library(foreach)
res <- foreach(ic = sample(rows_along(ind.block)), .combine = 'c') %dopar% {
  if (found_it[1]) return(NULL)
  ind <- bigstatsr:::seq2(ind.block[ic, ])
  find <- which(cond[ind])
  if (length(find)) {
    found_it[1] <- 1L
    return(ind[find[1]])
  } else {
    return(NULL)
  }
}
parallel::stopCluster(cl)
# Verification
all.equal(res, which(cond))
Basically, once a solution is found, you no longer need to do further computations, and the other processes know it because you put a 1 in found_it, which is shared between all processes.
As your question is not reproducible and I don't understand everything you need, you may have to adapt this solution a little bit.
I am pretty new to Spark. I have tried to look for something on the web, but I haven't found anything satisfactory.
I have always run parallel computations using mclapply, and I like its structure (i.e., the first parameter is the index to iterate over, the second argument is the function to be parallelized, and then other optional parameters are passed on to the function).
Now I am trying to do the same kind of thing via Spark, i.e., I would like to distribute my computations among all the nodes of the Spark cluster. This is, in short, what I have learned and how I think the code should be structured (I'm using the package sparklyr):
I create a connection to Spark using the command spark_connect;
I copy my data.frame in the Spark environment with copy_to and access it through its tibble;
I would like to implement a "Spark-friendly" version of mclapply, but I have seen there is no similar function in the package (I have seen there exists the function spark.lapply in the SparkR package, but unfortunately it is no longer on CRAN).
Below is a simple test script I have implemented that works using mclapply.
#### Standard code that works with mclapply #########
library(parallel)

dfTest = data.frame(X = rep(1, 10000), Y = rep(2, 10000))

.testFunc = function(X = 1, df, str) {
  rowSelected = df[X, ]
  y = as.numeric(rowSelected[1] + rowSelected[2])
  return(list(y = y, str = str))
}

lOutput = mclapply(X = 1:nrow(dfTest), FUN = .testFunc, df = dfTest,
                   str = "useless string", mc.cores = 2)
######################################################
###### Similar code that should work with Spark ######
library(sparklyr)
sc = spark_connect(master = "local")
dfTest = data.frame(X = rep(1, 10000), Y = rep(2, 10000))
.testFunc = function(X = 1, df, str) {
  rowSelected = df[X, ]
  nSum = as.numeric(rowSelected[1] + rowSelected[2])
  return(list(nSum = nSum, str = str))
}
dfTest_tbl = copy_to(sc, dfTest, "test_tbl", overwrite = TRUE)
# Apply similar function mclapply to dfTest_tbl, that works with
# Spark
# ???
######################################################
If someone has already found a solution for this, that would be great. Other references/guides/links are also more than welcome. Thanks!
sparklyr
spark_apply is the existing function you're looking for:
spark_apply(sdf, function(data) {
...
})
Please refer to the Distributed R section of the sparklyr documentation for details.
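Applied to the dfTest_tbl from the question, a minimal hedged sketch (the output column name nSum is an assumption) could look like this:
library(sparklyr)
sc <- spark_connect(master = "local")

dfTest     <- data.frame(X = rep(1, 10000), Y = rep(2, 10000))
dfTest_tbl <- copy_to(sc, dfTest, "test_tbl", overwrite = TRUE)

# Each worker receives its partition of the Spark table as an ordinary data.frame
result_tbl <- spark_apply(dfTest_tbl, function(df) {
  df$nSum <- df[[1]] + df[[2]]  # columns addressed by position to avoid name surprises
  df
})

head(result_tbl)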
SparkR
With SparkR, use the gapply / gapplyCollect or dapply / dapplyCollect UDFs:
gapply(df, groupingCols, function(data) {...}, schema)
dapply(df, function(data) {...}, schema)
Refer to the dapply docs and gapply docs for details.
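To make the dapply route concrete, here is a hedged sketch reusing the dfTest data frame from the question; the output column nSum and the schema types are assumptions:
library(SparkR)
sparkR.session()

dfTest <- data.frame(X = rep(1, 10000), Y = rep(2, 10000))
sdf <- createDataFrame(dfTest)

# dapply() requires the output schema to be declared up front
schema <- structType(structField("X", "double"),
                     structField("Y", "double"),
                     structField("nSum", "double"))

result <- dapply(sdf, function(df) {
  df$nSum <- df$X + df$Y
  df
}, schema)

head(collect(result))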
Be warned that all of these solutions are inferior to native Spark code and should be avoided when high performance is required.
sparklyr::spark_apply can now pass external variables, such as fitted models, to the workers via the context argument.
Here is my example of running an xgboost model with sparklyr:
bst <- xgboost::xgb.load("project/models/xgboost.model")

res3 <- spark_apply(x = ft_union_price %>% sdf_repartition(partitions = 1500, partition_by = "uid"),
                    f = inference_fn,   # user-defined scoring function (not shown here)
                    packages = FALSE,
                    memory = FALSE,
                    names = c("uid",
                              "action_1",
                              "pred"),
                    context = {model <- bst})
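To show the context mechanism in isolation, here is a hedged minimal sketch; mtcars and the lm fit are stand-ins (not the xgboost model above), and it assumes the worker function receives the context object as its second argument, as described in the sparklyr documentation:
library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
sdf <- copy_to(sc, mtcars, "mtcars_tbl", overwrite = TRUE)

# A stand-in for any fitted model object you want shipped to the workers
fit <- lm(mpg ~ wt, data = mtcars)

scored <- spark_apply(
  sdf,
  function(df, ctx) {
    # ctx is the deserialized `context` object on the worker
    df$pred <- predict(ctx$model, newdata = df)
    df
  },
  context = list(model = fit)
)

head(scored)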
I am trying to use foreach to run different classifiers on my data, but it doesn't work. In fact it doesn't return anything.
My goal is to parallelize the process. Here is a simplified version of my code:
library(foreach)
library(doParallel)
no_cores <- detectCores() - 1
cl<-makeCluster(no_cores)
registerDoParallel(cl)
registerDoParallel(no_cores)
model_list<-foreach(i = 1:2,
.combine = c,.packages=c("e1071","randomeForest")) %dopar%
if (i==1){
model1<-svm(x = X,y = as.factor(Y),type = "C-classification",probability = T)
}
if (i==2){
mode2<-randomForest(x = X,y = as.factor(Y), ntree=100, norm.votes=FALSE,importance = T)
}
Is my way of parallelizing correct overall?
Thanks indeed.
The main problem is that you're not enclosing the body of the foreach loop in curly braces. Because %dopar% is a binary operator, you have to be careful about precedence, which is why I recommend always using curly braces.
Also, you shouldn't use c as the combine function. Since svm and randomForest return objects, the default behavior of returning the results in a list is appropriate. Combining them with c will give you a garbage result.
Finally, it doesn't make sense to call registerDoParallel twice. It doesn't hurt, but it makes your code confusing.
I suggest:
library(doParallel)
no_cores <- detectCores() - 1
registerDoParallel(no_cores)
model_list <- foreach(i = 1:2,
                      .packages = c("e1071", "randomForest")) %dopar% {
  if (i == 1) {
    svm(x = X, y = as.factor(Y), type = "C-classification",
        probability = T)
  } else {
    randomForest(x = X, y = as.factor(Y), ntree = 100, norm.votes = FALSE,
                 importance = T)
  }
}
I also removed the two unnecessary variable assignments to model1 and model2. Those variables won't be defined correctly on the master, and they obscure how the foreach loop really works.
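Since the default combine keeps the results in a list, a short usage note (assuming the loop above has run against your X and Y):
# model_list is a plain list: element 1 is the svm fit, element 2 the random forest
svm_fit <- model_list[[1]]
rf_fit  <- model_list[[2]]
summary(svm_fit)
print(rf_fit)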
I'm doing cross-validation, so I wanted to split the data into 10 folds. Somebody posted the following code.
f_K_fold <- function(Nobs, K = 10){
  rs <- runif(Nobs)
  id <- seq(Nobs)[order(rs)]
  k <- as.integer(Nobs * seq(1, K - 1) / K)
  k <- matrix(c(0, rep(k, each = 2), Nobs), ncol = 2, byrow = TRUE)
  k[, 1] <- k[, 1] + 1
  l <- lapply(seq.int(K), function(x, k, d)
    list(train = d[!(seq(d) %in% seq(k[x, 1], k[x, 2]))],
         test  = d[seq(k[x, 1], k[x, 2])]),
    k = k, d = id)
  return(l)
}
However, I don't really understand what the lapply is doing. Could someone explain it to a newbie? I'd appreciate it.
It's really unfortunate that the formatting of the code in this example is poor, since having properly formatted code can aid in understanding it and catching mistakes.
The last three lines can be viewed as an anonymous function passed to lapply. In essence, lapply "climbs" a list and, for each list element, applies that (anonymous) function. In the example below, I've disambiguated the lines into a not-so-anonymous function and a call to lapply.
notSoanonymousFunction <- function(x, k, d) {
  list(train = d[!(seq(d) %in% seq(k[x, 1], k[x, 2]))],
       test  = d[seq(k[x, 1], k[x, 2])])
}
l <- lapply(seq.int(K), FUN = notSoanonymousFunction, k = k, d = id)
If you look at ?lapply, you'll notice that there are no k or d arguments. However, these arguments do belong to our notSoanonymousFunction, and lapply passes them through via its ... argument.
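As a tiny toy illustration (not part of the original code) of how extra arguments flow through lapply's ...:
add_k <- function(x, k) x + k
# k is not an argument of lapply itself; it is forwarded to add_k via ...
lapply(1:3, FUN = add_k, k = 10)
# returns a list containing 11, 12, 13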
As a mental exercise, here is one more trick for learning what a function is doing. If you need to see what is happening inside the function, place a browser() call inside it and run it. In your case, this would look like this:
notSoanonymousFunction <- function(x, k, d) {
  browser()
  list(train = d[!(seq(d) %in% seq(k[x, 1], k[x, 2]))],
       test  = d[seq(k[x, 1], k[x, 2])])
}
Once you run this, your console should say something along the lines of
Browser[1] >
You are now effectively inside the function. You can step to the next line by typing n, run the rest of the code with c, and quit the browser altogether by pressing Q (see ?browser). You can view and manipulate objects ad libitum. Try checking your workspace with ls() to see which objects are inside the function; you can bet your family farm that there will be objects x, k and d.