tidyverse vs. dplyr in R - Processing Power / Performance

I'm relatively new to R programming and I've been doing research, but I can't find an answer to this question.
Does it take more processing power to load the full tidyverse at the beginning of the code rather than to load just the dplyr package? For example, I might only need functions that can be found in dplyr. Am I reducing the speed/performance of my code by loading the full tidyverse, which must be a larger package considering that it contains several other packages? Or would the processing speed be the same regardless of which package I choose to load? From an ease-of-coding perspective, I'd rather use the tidyverse since it's more comprehensive, but if I'm using more processing power, then perhaps loading the less comprehensive package is more efficient.

As NelsonGon commented, your processing speed is not reduced by loading packages. The packages themselves take a moment to load, but that one-off cost is usually negligible, especially if you already want to load dplyr, tidyr, and purrr, for example.
Loading more libraries onto the search path (with library(dplyr), for example) might not hurt your speed, but it may cause namespace conflicts (masking) down the road.
Now, there are some benchmarks out there comparing dplyr, data.table, and base R, and dplyr tends to be slower, but YMMV. Here's one I found: https://www.r-bloggers.com/2018/01/tidyverse-and-data-table-sitting-side-by-side-and-then-base-r-walks-in/. So, if you are doing operations that take a long time, it might be worthwhile to use data.table instead.
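If you want a feel for the difference on your own machine, a minimal sketch like the one below (a made-up one-million-row grouped mean, not the benchmark from the linked post) is usually enough; system.time() is crude but adequate here:
library(dplyr)
library(data.table)
set.seed(1)
n  <- 1e6
df <- data.frame(g = sample(letters, n, replace = TRUE), x = runif(n))
dt <- as.data.table(df)
## the one-off package load already happened above; only the operations are timed
system.time(df %>% group_by(g) %>% summarise(mean_x = mean(x)))   # dplyr
system.time(dt[, .(mean_x = mean(x)), by = g])                    # data.table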

Related

tidyr / dplyr performance: CRAN R vs. MRAN Microsoft R Open

When I read about Microsoft R Open I usually read that it is faster in matrix calculations than R from CRAN due to multicore support.
I understand that this can increase performance, e.g. when running regressions. Does it also significantly speed up calculations from tidyr or dplyr? The underlying question is, I guess, whether these packages rely on matrix calculations or not. More generally, do data.frames work with matrix calculations under the hood? As far as I know, data.frames are a special kind of list...
Does anyone have an answer to this, theoretically and (ideally) with some benchmarks?
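Not a full answer, but one rough way to test this empirically is to run the same script under CRAN R and under Microsoft R Open and compare a BLAS-bound matrix operation with a typical dplyr aggregation; the sizes below are arbitrary and only meant as a sketch:
library(dplyr)
set.seed(1)
m  <- matrix(rnorm(2000 * 2000), nrow = 2000)
df <- data.frame(g = sample(1:100, 1e6, replace = TRUE), x = rnorm(1e6))
system.time(crossprod(m))                                  # BLAS-bound: should benefit from multi-threaded math libraries
system.time(df %>% group_by(g) %>% summarise(s = sum(x)))  # dplyr: mostly compiled dplyr code, not BLAS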

How can I prevent a library from masking functions

A typical situation is the following:
library(dplyr)
library(xgboost)
When I import the library xgboost, the slice function from dplyr is masked, and I have to write dplyr::slice even though I never use xgboost::slice explicitly.
The obvious solution to the problem is to import xgboost before dplyr. But it is impractical to import, in advance, every library that could affect dplyr's functions. Moreover, this problem often happens when I use the caret library: its train function automatically loads the required packages, and some functions get masked at that point.
Is it possible to prevent some functions from being masked?
Is it possible to mask "the masking function" (e.g. xgboost::slice) with an earlier-imported function (e.g. dplyr::slice)?
Notes
I am NOT asking how to disable warning message.
I am NOT asking how to use the masked functions.
The next version of R has this in the NEWS.Rd file (quoted from the NEWS file post-build):
• The import() namespace directive now accepts an argument except
which names symbols to exclude from the imports. The except
expression should evaluate to a character vector (after
substituting symbols for strings). See Writing R Extensions.
The referenced text from the manual is here (in raw texi format).
So soon we can. Right now one cannot, and that is a huge pain in the aRse, particularly when functions from base R packages are being masked: lag(), filter(), ...
We have used the term anti-social for this behaviour in the past. I don't think it is too strong.
To illustrate the problem, here is a snippet of code I wrote a decade ago (and had it posted on the now-vanished R Graph Gallery) which uses a clever and fast way to compute a moving average:
## create a (normalised, but that's just candy) weight vector
weights <- rep(1/ndays, ndays)
## and apply it as a one-sided moving average calculation, see help(filter)
bbmiddle <- as.vector(filter(dat$Close, weights,
                             method="convolution", sides=1))
If you do library(dplyr) as you might in an interactive session, you're dead in the water as filter() is now something completely different. Not nice.
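For completeness, once that NEWS item lands, the NAMESPACE directive in a package would read roughly like this (a sketch based only on the quoted entry, not tested against a released R):
import(dplyr, except = c("filter", "lag"))
importFrom(stats, filter, lag)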
Is it possible to prevent some functions from being masked?
I don't believe so, but I could be wrong. I'm not sure what this would look like.
Is it possible to mask "the masking function" (e.g. xgboost::slice) with an earlier-imported function (e.g. dplyr::slice)?
If you're asking about use in an interactive session, you can always just define slice to be the function you actually want to use, like so:
slice <- dplyr::slice
and then you can use slice as if it is the dplyr version (because now it is).
The solution is to manage your namespace as is common in other languages. You can selectively import dplyr functions:
select <- dplyr::select
For convenience you can also import the whole package and selectively reimport functions from previously attached packages:
library("dplyr")
filter <- stats::filter
R has a great module system, and attaching whole namespaces is especially handy for interactive use. It does require a bit of manual adjustment if the preferences of the package authors do not match yours.
Note that in packages and long-term maintenance scripts you should favour selective imports, in part because it is hard to predict which functions future releases will export. Having several packages imported in bulk might give rise to unexpected masking over time.
More generally a good rule is to rely on a single attached package and selectively import the rest. To this end the tidyverse package might be handy if you're a heavy tidyverse user because it provides a single import point for several packages.
Finally it seems from your question that you think that the order of attached packages might have side effects inside other packages. This is nothing to worry about because all packages have their own contexts. The import scheme will only affect your script.
You can also now use the conflict_prefer() function from the conflicted package to specify which package's function should "win" and which should be masked when there are conflicting function names (details here). In your example, you would run
conflict_prefer("slice", "dplyr", "xgboost")
right after loading your libraries. Then when you run slice, it will default to using dplyr::slice rather than xgboost::slice. Or you can simply run
conflict_prefer("slice", "dplyr")
if you want to give dplyr::slice precedence over all other packages' slice functions.
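Putting the pieces together, a minimal interactive session might look like this (assuming the conflicted package is installed):
library(conflicted)
library(dplyr)
library(xgboost)
conflict_prefer("slice", "dplyr")   # dplyr::slice now wins over xgboost::slice
slice(mtcars, 1:3)                  # returns the first three rows via dplyr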
I know this is a silly answer and this thread is very old (but I only ran into the same issue today): I changed the order in which the packages are loaded. Personally, I was having a problem with the select function from MASS and dplyr, and I always want to use the dplyr version, so I loaded MASS first!
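As a sketch of that loading-order trick: the package attached last sits earliest on the search path, so its version of the function is found first.
library(MASS)    # attach MASS first...
library(dplyr)   # ...then dplyr, so dplyr::select() is the one found first
select(starwars, name, height)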

How does the number of libraries in an R script impact performance?

First of all, does using libraries have anything to do with performance at all?
If yes, how big is the impact on performance?
Context: I noticed that different packages contain (very) similar functions. I can basically do many things with just one library, but I can also import different libraries to do the same task. I am asking this question because I would like to know whether having several packages versus only a few has an impact on performance, so that I know whether I should care about optimizing my list of libraries or not.
library(syuzhet)
library(lubridate)
library(plyr)
library("rjson")
library(compare)
...
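For what it's worth, the cost of attaching a package is a one-off at the start of the session, and you can measure it yourself; a quick sketch:
## time the attach step itself (results depend on your machine and library cache)
system.time(library(lubridate))
system.time(library(plyr))
## after this point, calling functions from these packages is not slowed down
## by how many other packages happen to be attached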

reshape vs. reshape2 in R

I am attempting to understand why development has shifted from the reshape package to the reshape2 package. They seem to be functionally the same; however, I am currently unable to upgrade to reshape2 due to an older version of R running on the server. I am concerned about the possibility of a major bug that would have shifted development to a whole new package instead of simply continuing development of reshape. Does anyone know if there is a major flaw in the reshape package?
reshape2 let Hadley make a rebooted reshape that was way, way faster, while avoiding busting up people's dependencies and habits.
https://stat.ethz.ch/pipermail/r-packages/2010/001169.html
Reshape2 is a reboot of the reshape package. It's been over five years
since the first release of the package, and in that time I've learned
a tremendous amount about R programming, and how to work with data in
R. Reshape2 uses that knowledge to make a new package for reshaping
data that is much more focussed and much much faster.
This version improves speed at the cost of functionality, so I have
renamed it to reshape2 to avoid causing problems for existing users.
Based on user feedback I may reintroduce some of these features.
What's new in reshape2:
• considerably faster and more memory efficient thanks to a much better underlying algorithm that uses the power and speed of subsetting to the fullest extent, in most cases only making a single copy of the data.
• cast is replaced by two functions depending on the output type: dcast produces data frames, and acast produces matrices/arrays.
• multidimensional margins are now possible: grand_row and grand_col have been dropped; now the name of the margin refers to the variable that has its value set to (all).
• some features have been removed such as the | cast operator, and the ability to return multiple values from an aggregation function. I'm reasonably sure both these operations are better performed by plyr.
• a new cast syntax which allows you to reshape based on functions of variables (based on the same underlying syntax as plyr).
• better development practices like namespaces and tests.
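As a small illustration of the dcast/acast split described above, here is a sketch using the built-in airquality data set (not part of the announcement):
library(reshape2)
aq <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)
dcast(aq, Month ~ variable, mean)   # data frame: one row per Month, one column per variable
acast(aq, Month ~ variable, mean)   # the same summary as a matrix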

R package that automatically uses several cores?

I have noticed that R only uses one core while executing one of my programs which requires lots of calculations. I would like to take advantage of my multi-core processor to make my program run faster.
I have not yet investigated the question in depth, but I would appreciate your comments because I do not have a strong background in computer science and it is difficult for me to find easily understandable information on the subject.
Is there a package that allows R to automatically use several cores when needed?
I guess it is not that simple.
R can only make use of multiple cores with the help of add-on packages, and only for some types of operation. The options are discussed in detail on the High Performance Computing Task View on CRAN.
Update: From R version 2.14.0, add-on packages are not necessarily required, due to the inclusion of the parallel package, which ships with R. parallel includes functionality from the multicore and snow packages, largely unchanged.
The easiest way to take advantage of multiprocessors is the multicore package which includes the function mclapply(). mclapply() is a multicore version of lapply(). So any process that can use lapply() can be easily converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The package Revolution Analytics created, doSMP, is NOT a multi-threaded version of R. It's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() type of structuring. That will give you an easy segue into mclapply() and even distributed computing using the same abstraction.
Things get much more difficult for operations that are not "embarrassingly parallel".
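To make the lapply()-to-mclapply() switch concrete, here is a minimal sketch (using the parallel package that now ships with R; mc.cores > 1 will not work on Windows):
library(parallel)
slow_sqrt <- function(x) { Sys.sleep(0.5); sqrt(x) }          # stand-in for real work
system.time(res1 <- lapply(1:8, slow_sqrt))                   # sequential: roughly 4 s
system.time(res2 <- mclapply(1:8, slow_sqrt, mc.cores = 4))   # forked: roughly 1 s on 4 cores
identical(res1, res2)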
[EDIT]
As a side note, RStudio is getting increasingly popular as a front end for R. I love RStudio and use it daily. However, it needs to be noted that RStudio does not play nicely with multicore (at least as of Oct 2011... I understand that the RStudio team is going to fix this). This is because RStudio does some forking behind the scenes and these forks conflict with multicore's attempts to fork. So if you need multicore, you can write your code in RStudio, but run it in a plain-Jane R session.
On this question you always get very short answers. In my opinion the easiest solution is the package snowfall, which is based on snow, at least on a single Windows computer with multiple cores. See also the article by Knaus et al. for a simple example. snowfall is a wrapper around the snow package and lets you set up a multicore cluster with a few commands. It's definitely less hassle than most of the other packages (I didn't try all of them).
On a side note, there are indeed only a few tasks that can be parallelized, for the very simple reason that you have to be able to split up the work before multicore calculation makes sense. The apply family is obviously a logical choice for this: multiple, independent computations, which is crucial for multicore use. Anything else is not always that easy to run on multiple cores.
Read also this discussion on sfApply and custom functions.
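A bare-bones snowfall session, for the sake of illustration (the number of cpus is arbitrary and my_const is just a hypothetical object the workers need):
library(snowfall)
my_const <- 10                      # hypothetical object used by the workers
sfInit(parallel = TRUE, cpus = 2)   # socket cluster, so it also works on Windows
sfExport("my_const")                # ship the object to the workers
result <- sfLapply(1:100, function(i) i^2 + my_const)
sfStop()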
Microsoft R Open includes multi-threaded math libraries to improve the performance of R. It works on Windows, Unix, and Mac. It's open source and can be installed in a separate directory if you have an existing R (from CRAN) installation. You can also use the popular RStudio IDE with it. From its inception, R was designed to use only a single thread (processor) at a time. Even today, R works that way unless linked with multi-threaded BLAS/LAPACK libraries.
The multi-core machines of today offer parallel processing power. To take advantage of this, Microsoft R Open includes multi-threaded math libraries.
These libraries make it possible for many common R operations, such as matrix multiply/inverse, matrix decomposition, and some higher-level matrix operations, to compute in parallel and use all of the processing power available to reduce computation times.
Please check the below link:
https://mran.revolutionanalytics.com/rro/#about-rro
http://www.r-bloggers.com/using-microsoft-r-open-with-rstudio/
As David Heffernan said, take a look at the Revolution Analytics blog. But you should know that most packages are for Linux, so if you use Windows it will be much harder.
Anyway, take a look at these sites:
Revolution. Here you will find a lecture about parallelization in R. The lecture is actually very good, but, as I said, most tips are for Linux.
And this thread here at Stack Overflow discusses some implementations on Windows.
The package future makes it extremely simple to work in R using parallel and distributed processing. More info here. If you want to apply a function to elements in parallel, the future.apply package provides a quick way to use the "apply" family functions (e.g. apply(), lapply(), and vapply()) in parallel.
Example:
library("future.apply")
library("stats")
x <- 1:10
# Single core
y <- lapply(x, FUN = quantile, probs = 1:3/4)
# Multicore in parallel
plan(multiprocess)
y <- future_lapply(x, FUN = quantile, probs = 1:3/4)
