How do I find out how many workers my disk.frame is using?

I am using the disk.frame package and I want to know how many workers disk.frame is using to perform its operations. I looked through the disk.frame documentation and can't find such a function.

disk.frame uses the future package to manage the workers. So we can simply use future::nbrOfWorkers() to find out.
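For example (the setup_disk.frame() call below assumes the helper with a workers argument that recent disk.frame versions provide):

library(disk.frame)
# disk.frame delegates its parallelism to future, so the worker count is whatever future reports
future::nbrOfWorkers()
# to change it, re-run the setup with an explicit worker count
setup_disk.frame(workers = 4)
future::nbrOfWorkers()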

Related

tidyverse VS dplyr in R - Processing Power / Performance

I'm relatively new to R programming and I've been doing research, but I can't find the answer to this topic.
Does it take more processing power to load the full tidyverse at the beginning of the code rather than just the dplyr package? For example, I might only need functions that can be found in dplyr. Am I reducing the speed/performance of my code by loading the full tidyverse, which must be a larger package considering that it contains several other packages? Or would the processing speed be the same regardless of which package I choose to load? From an ease-of-coding standpoint, I'd rather use the tidyverse since it's more comprehensive, but if I'm using more processing power, then perhaps loading the less comprehensive package is more efficient.
As NelsonGon commented, your processing speed is not reduced by loading packages. The packages themselves take time to load, but that time is usually negligible, especially if you already want to load dplyr, tidyr, and purrr, for example.
Loading more libraries onto the search path (using library(dplyr), for example) might not hurt your speed, but it may cause namespace conflicts (masked functions) down the road.
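A minimal illustration of the masking issue, using only dplyr and base R:

library(dplyr)                    # masks stats::filter() and stats::lag()
conflicts(detail = TRUE)          # lists which objects are masked, and where
stats::filter(1:10, rep(1, 3))    # an explicit namespace call still reaches the base version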
Now, there are some benchmarks out there comparing dplyr, data.table, and base R, and dplyr tends to be slower, but YMMV. Here's one I found: https://www.r-bloggers.com/2018/01/tidyverse-and-data-table-sitting-side-by-side-and-then-base-r-walks-in/. So, if you are doing operations that take a long time, it might be worthwhile to use data.table instead.
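If you'd rather measure this on your own data than rely on blog posts, a rough benchmarking sketch (assuming the microbenchmark package is installed) could look like this:

library(dplyr)
library(data.table)
library(microbenchmark)

df <- data.frame(g = sample(letters, 1e6, replace = TRUE), x = rnorm(1e6))
dt <- as.data.table(df)

# compare a grouped mean in both idioms; times = 10 keeps the run short
microbenchmark(
  dplyr      = df %>% group_by(g) %>% summarise(m = mean(x)),
  data.table = dt[, .(m = mean(x)), by = g],
  times = 10
)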

Why is collect in SparkR so slow?

I have a 500K row spark DataFrame that lives in a parquet file. I'm using spark 2.0.0 and the SparkR package inside Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8gb of RAM.
To facilitate construction of a dataset I can work on in R, I use the collect() method to bring the spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it'd take to read an equivalently sized CSV file using the data.table package.
Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr doesn't have the ability to do date math inside joins and filters as easily as SparkR, so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those Spark objects using sparklyr).
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?
I don't know whether the following actually answers your question or not, but Spark does lazy evaluation: the transformations done in Spark (or SparkR) don't really create any data, they just build a logical plan to follow.
When you run an action like collect, Spark has to fetch the data directly from the source RDDs (assuming you haven't cached or persisted the data).
If your data is not large and can be handled by local R easily, then there is no need to go with SparkR. Another solution is to cache your data if you use it frequently.
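To make the lazy-evaluation and caching points concrete, a small SparkR sketch (the parquet path and the value column are placeholders, and a local Spark session is assumed):

library(SparkR)
sparkR.session(master = "local[4]")

df <- read.parquet("data/my_table.parquet")   # lazy: nothing is read yet
df <- filter(df, df$value > 0)                # still lazy: only a plan is built

cache(df)                  # mark the result for in-memory caching
local_df <- collect(df)    # this action triggers the work (and fills the cache)
local_df2 <- collect(df)   # later actions reuse the cached data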
Short: Serialization/deserialization is very slow.
See, for example, this post on my blog: http://dsnotes.com/articles/r-read-hdfs
However, it should be equally slow in both SparkR and sparklyr.

Do the R parallel extensions break the `apply` metaphor?

Every time I see a question on parallel processing in R, it uses the foreach function. Since for loops are not very R-like, is there a parallel version of apply, and if so why isn't it more popular?
There are numerous parallel versions of *apply, starting with
parLapply() in snow
mclapply() in multicore
mpi.apply() in Rmpi
as well as dedicated packages such as papply (possibly no longer maintained).
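The first two now live on in the parallel package that ships with R; a minimal sketch of both idioms:

library(parallel)

slow_square <- function(x) { Sys.sleep(0.1); x^2 }

# fork-based, Unix-alikes only (the mclapply() descended from multicore)
res1 <- mclapply(1:8, slow_square, mc.cores = 2)

# socket cluster, works on Windows too (the parLapply() descended from snow)
cl <- makeCluster(2)
res2 <- parLapply(cl, 1:8, slow_square)
stopCluster(cl)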
Dirk is correct. I'd add that the plyr package now has support for a parallel backend.
In the case of the plyr package, little may be mentioned simply because dropping in a parallel backend doesn't take any thought: it's just a flag.
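For instance, with a foreach backend such as doParallel registered, the flag is all that changes (a sketch, assuming doParallel is installed):

library(plyr)
library(doParallel)

registerDoParallel(cores = 2)
# the call is identical to the serial version, plus .parallel = TRUE
res <- llply(1:8, function(x) { Sys.sleep(0.1); x^2 }, .parallel = TRUE)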

R package that automatically uses several cores?

I have noticed that R only uses one core while executing one of my programs which requires lots of calculations. I would like to take advantage of my multi-core processor to make my program run faster.
I have not yet investigated the question in depth, but I would appreciate your comments, because I do not have a strong background in computer science and it is difficult for me to find easily understandable information on the subject.
Is there a package that allows R to automatically use several cores when needed?
I guess it is not that simple.
R can only make use of multiple cores with the help of add-on packages, and only for some types of operation. The options are discussed in detail in the High Performance Computing task view on CRAN.
Update: From R version 2.14.0, add-on packages are not strictly required, because the parallel package is now shipped with R as a recommended package. parallel includes functionality from the multicore and snow packages, largely unchanged.
The easiest way to take advantage of multiprocessors is the multicore package which includes the function mclapply(). mclapply() is a multicore version of lapply(). So any process that can use lapply() can be easily converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The package Revolution Analytics created, doSMP, is NOT a multi-threaded version of R. It's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() type of structuring. That will give you easy segue into mclapply() and even distributed computing using the same abstraction.
Things get much more difficult for operations that are not "embarrassingly parallel".
[EDIT]
As a side note, RStudio is getting increasingly popular as a front end for R. I love RStudio and use it daily. However, it needs to be noted that RStudio does not play nicely with multicore (at least as of Oct 2011... I understand that the RStudio team is going to fix this). This is because RStudio does some forking behind the scenes and these forks conflict with multicore's attempts to fork. So if you need multicore, you can write your code in RStudio, but run it in a plain-Jane R session.
On this question you always get very short answers. In my opinion, the easiest solution is the snowfall package, which is based on snow; that is, for a single Windows computer with multiple cores. See also the article by Knaus et al. for a simple example. snowfall is a wrapper around the snow package and lets you set up a multicore cluster with a few commands. It's definitely less hassle than most of the other packages (I didn't try all of them).
On a side note, there are indeed only a few kinds of tasks that can be parallelized, for the very simple reason that you have to be able to split up the work before multicore calculation makes sense. The apply family is the obvious, logical choice for this: multiple, independent computations, which is crucial for multicore use. Anything else is not always that easy to parallelize.
Read also this discussion on sfApply and custom functions.
Microsoft R Open includes multi-threaded math libraries to improve the performance of R. It works on Windows, Unix, and Mac. It's open source and can be installed in a separate directory if you have an existing R (from CRAN) installation, and you can use the popular RStudio IDE with it as well. From its inception, R was designed to use only a single thread (processor) at a time. Even today, R works that way unless linked with multi-threaded BLAS/LAPACK libraries.
The multi-core machines of today offer parallel processing power. To take advantage of this, Microsoft R Open includes multi-threaded math libraries.
These libraries make it possible for so many common R operations, such as matrix multiply/inverse, matrix decomposition, and some higher-level matrix operations, to compute in parallel and use all of the processing power available to reduce computation times.
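A quick way to see whether your own R is linked against such a library, and the kind of operation that benefits (the BLAS/LAPACK lines in sessionInfo() appear in R >= 3.4):

sessionInfo()          # check the "BLAS:" and "LAPACK:" lines
n <- 2000
m <- matrix(rnorm(n * n), n, n)
system.time(m %*% m)   # multi-core with MKL/OpenBLAS, single-threaded with reference BLAS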
Please check the links below:
https://mran.revolutionanalytics.com/rro/#about-rro
http://www.r-bloggers.com/using-microsoft-r-open-with-rstudio/
As David Heffernan said, take a look at the Revolution Analytics blog. But you should know that most packages are for Linux, so if you use Windows it will be much harder.
Anyway, take a look at these sites:
Revolution. Here you will find a lecture about parallelization in R. The lecture is actually very good, but, as I said, most tips are for Linux.
And this thread here at Stack Overflow discusses some implementations on Windows.
The package future makes it extremely simple to work in R using parallel and distributed processing. More info here. If you want to apply a function to elements in parallel, the future.apply package provides a quick way to use the "apply" family functions (e.g. apply(), lapply(), and vapply()) in parallel.
Example:
library("future.apply")
library("stats")
x <- 1:10
# Single core
y <- lapply(x, FUN = quantile, probs = 1:3/4)
# Multicore in parallel ("multisession" replaces the now-deprecated "multiprocess" and works on all platforms)
plan(multisession)
y <- future_lapply(x, FUN = quantile, probs = 1:3/4)

is there another way of loading extra packages in workers (parallel computing)?

One way of parallelization in R is through the snowfall package. To send custom functions to workers you can use sfExport() (see Joris' post here).
I have a custom function that depends on functions from non-base packages that are not loaded automagically. Thus, when I run my function in parallel, R craps out because certain functions are not available (think of the packages spatstat, splancs, sp...). So far I've solved this by calling library() in my custom function, which loads the packages on the first run and presumably just skips them on subsequent iterations. Still, I was wondering if there's another way of telling each worker to load the package on the first iteration and be done with it (or am I missing something and each iteration starts as a tabula rasa?).
I don't understand the question.
Packages are loaded via library(), and most of the parallel execution functions support that. For example, the snow package uses
clusterEvalQ(cl, library(boot))
to 'quietly' (i.e., without returning the value) evaluate the given expression, here a call to library(), on each node. Most of the parallel execution frameworks have something like that.
Why again would you need something different, and what exactly does not work here?
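A slightly fuller sketch of that pattern, using the parallel package's socket clusters (same API as snow) and boot purely as an example package:

library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, library(boot))   # load the package on every worker
# a toy stand-in for a custom function that needs boot::logit() on the workers
my_custom_fun <- function(i) logit(i / (i + 1))
res <- parLapply(cl, 1:10, my_custom_fun)
stopCluster(cl)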
There's a specific command for that in snowfall, sfLibrary(). See also ?"snowfall-tools". Calling library manually on every node is strongly discouraged. sfLibrary is basically a wrapper around the solution Dirk gave based on the snow package.
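Putting the pieces together, a minimal snowfall sketch (MASS is only an example package, and my_custom_fun is a hypothetical stand-in for the question's function):

library(snowfall)
sfInit(parallel = TRUE, cpus = 2)
sfLibrary(MASS)                      # loads MASS on every worker; use spatstat etc. the same way
# a toy function relying on MASS::fractions() being available on the workers
my_custom_fun <- function(i) as.character(fractions(1 / i))
sfExport("my_custom_fun")            # ship the custom function to the workers
res <- sfLapply(1:10, my_custom_fun)
sfStop()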
