How to use data.table::setDTthreads() in my own package? - r

I'm developing very small package for the first time (and - perhaps it is important in context of my question - would like to publish it on CRAN). This package uses functions from data.table and base R. I would like to take benefits of paralell computations provided by data.table::setDTthreads() function.
When user loads data.table package, this function is calling immediately, but I'm not doing this when developing my package. What I did now is just: (1) in the DESCRIPTION file I have added data.table to Imports field; (2) in the NAMESPACE I have included import(data.table).
As I know this is not the same as library(data.table) and I don't want this, because I don't want to load data.table when user will load my package. But still I would like to use data.table::setDTthreads() function. Where should I include this in my script? Or maybe I'm using it since in the NAMESPACE I have included import(data.table)?
My package contains only one .R file in R/ directory, with few functions, but only one will be exported to be visible for the user (so the other ones are just helper functions). Let's say, it looks like this:
#' roxygen2 skeleton
main_function <- function(x) {
x <- helper_function_1(x)
x
}
helper_function_1 <- function(x) {
x
}
What I really worry about is that when I will use data.table::setDTthreads() in my package it will have an impact on user's environment, i.e. I will enable parallel computation and set threads not just for my function in my package, but generally for user's session.

Very good question.
Yes, it will affect all data.table calls (including those from other packages) in user environment and not just those from your package.
General advise is to not set this value in your package but let users know that they could set it themselves. If you want to set it in your package you should document it really well.
Note that 50% vs. 100% is often very small difference (can be less than 5%, or even slow down on a shared environments) so I suggest you to measure if it is really worth to mess with user environment if benefits are small.
Check those timings for example
https://github.com/h2oai/db-benchmark/issues/202
You could also fill a feature request for a possibility to set number of threads just for calls from a single package. It technically possible by checking top environment of a call.

Related

Including `library()` or `require()` calls in functions

This question has been asked before (Include library calls in functions?) but I wonder if there are any new takes on it.
Putting library() calls inside functions is something I am often tempted to do. It means the function is slightly more portable; it will run without errors if a colleague copies it into their own project, for example.
The main downside seems to be "global effects", which I guess refers to namespace clashes. Are there other worries?
The alternatives seem to be:
Load packages at the top of the script containing the functions. But how does a colleague who wants to copy the function know which package is required for which function? (I guess the function documentation should cover this?).
Build a package. But the ROI on this may be low for small-scale projects.

Using functions from other packages - when to use package::function?

When making your own package for R, one often wants to make use of functions from a different package.
Maybe it's a plotting library like ggplot2, dplyr, or some niche function.
However, when making a function that depends on functions in other packages, what is the appropriate way to call them? In particular, I am looking for examples of when to use
myFunction <- function(x) {
example_package::function(x)
}
or
require(example_package)
myFunction <- function(x) {
function(x)
}
When should I use one over the other?
If you're actually creating an R package (as opposed to a script to source, R Project, or other method), you should NEVER use library() or require(). This is not an alternative to using package::function(). You are essentially choosing between package::function() and function(), which as highlighted by #Bernhard, explicitly calling the package ensures consistency if there are conflicting names in two or more packages.
Rather than require(package), you need to worry about properly defining your DESCRIPTION and NAMESPACE files. There's many posts about that on SO and elsewhere, so won't go into details, see here for example.
Using package::function() can help with above if you are using roxygen2 to generate your package documentation (it will automatically generate a proper NAMESPACE file.
The douple-colon variant :: has a clear advantage in the rare situations, when the same function name is used by two packages. There is a function psych::alpha to calculate Cronbach's alpha as a measure of internal consistency and a function scales::alpha to modify color transparency. There are not that many examples but then again, there are examples. dplyr even masks functions from the stats and base package! (And the tidyverse is continuing to produce more and more entries in our namespaces. Should you use dyplr you do not know, if the base function you use today will be masked by a future version of dplyr thus leading to an unexpected runtime problem of your package in the future.)
All of that is no problem if you use the :: variant. All of that is not a problem if in your package the last package opened is the one you mean.
The require (or library) variant leads to overall shorter code and it is obvious, at what time and place in the code the problem of a not-available package will lead to an error and thus become visible.
In general, both work well and you are free to choose, which of these admittedly small differences appears more important to you.

Why does calling `detach` cause R to "forget" functions?

Sometimes I use attach with some subset terms to work with odd dimensions of study data. To prevent "masking" variables in the environment (really the warning message itself) I simply call detach() to just remove whatever dataset I was working with from the R search path. When I get muddled in scripting, I may end up calling detach a few times. Well, interestingly if I call it enough, R removes functions that are loaded at start-up as part of packages like from utils, stats, and graphics. Why does "detach" remove these functions?
R removes base functions from the search path, like plot and ? and so on.
These functions that were removed are often called “base” functions but they are not part of the actual ‹base› package. Rather, plot is from the package ‹graphics›, and ? is from the package ‹utils›, both of which are part of the R default packages, and are therefore attached by default. Both packages are attached after package:base, and you’re accidentally detaching these packages with your too many detach calls (package:base itself cannot be detached; this is important because if it were detached, you couldn’t reattach it: the functions necessary for that are inside package:base).
To expand on this, attach and detach are usually used in conjunction with package environments rather than data sets: to enable the uses functions from a package without explicitly typing the package name (e.g. graphics::plot), the library function attaches these packages. When loading R, some packages are attached by default. You can find more information about this in Hadley Wickham’s Advanced R.
As you noticed, you can also attach and detach data sets. However, this is generally discouraged (quite strongly, in fact). Instead, you can use data transformation functions from the base package (e.g. with and transform, as noted by Moody_Mudskipper in a comment) or from data manipulation package (‹dplyr› is state of the art; an alternative is ‹data.table›).

How can I prevent a library from masking functions

A typical situation is the following:
library(dplyr)
library(xgboost)
When I import the library xgboost, the function slice of dplyr is masked, and I have to write dplyr::slice even though I never use xgboost::slice explicitly.
The obvious solution to the problem is to import xgboost before dplyr. But it is crazy to import all libraries which can affect the functions of dplyr in advance. Moreover this problem often happens when I use caret library. Namely train function imports automatically required libraries and some functions are masked at the time.
It is possible to prevent some functions from being masked?
Is it possible to mask "the masking function" (e.g. xgboost::slice) with an early imported function (e.g. dplyr::slice)?
Notes
I am NOT asking how to disable warning message.
I am NOT asking how to use the masked functions.
The next version of R has this in the NEWS{.Rd} file (quoted from the NEWS file post-build):
• The import() namespace directive now accepts an argument except
which names symbols to exclude from the imports. The except
expression should evaluate to a character vector (after
substituting symbols for strings). See Writing R Extensions.
There referenced text from the manual is here (in raw texi format).
So soon we can. Right now one cannot, and that is a huge pain in the aRse particular when functions from Base R packages are being masked: lag(), filter(), ...
We have used the term anti-social for this behaviour in the past. I don't think it is too strong.
To illustrate the problem, here is a snippet of code I wrote a decade ago (and had it posted on the now-vanished R Graph Gallery) which uses a clever and fast way to compute a moving average:
## create a (normalised, but that's just candy) weight vector
weights <- rep(1/ndays, ndays)
## and apply it as a one-sided moving average calculations, see help(filter)
bbmiddle <- as.vector(filter(dat$Close, weights,
method="convolution", side=1))
If you do library(dplyr) as you might in an interactive session, you're dead in the water as filter() is now something completely different. Not nice.
It is possible to prevent some functions from being masked?
I don't believe so but I could be wrong. I'm not sure what this would look like
Is it possible to mask "the masking function" (e.g. xgboost::slice) with an early imported function (e.g. dplyr::slice)?
If you're asking about just or use in an interactive session you can always just define slice to be the function you actually want to use like so
slice <- dplyr::slice
and then you can use slice as if it is the dplyr version (because now it is).
The solution is to manage your namespace like it is common to do in other languages. You can selectively import dplyr functions:
select <- dplyr::select
For convenience you can also import the whole package and selectively reimport functions from previously attached packages:
library("dplyr")
filter <- stats::filter
R has a great module system and attaching whole namespaces is especially handy for interactive use. It does requires a bit of manual adjusting if the preferences of the package authors do not match yours.
Note that in packages and long-term maintenance scripts you should privilege selective imports, in part because it is hard to predict new exported functions in future releases. Having several packages imported in bulk might give rise to unexpected masking over time.
More generally a good rule is to rely on a single attached package and selectively import the rest. To this end the tidyverse package might be handy if you're a heavy tidyverse user because it provides a single import point for several packages.
Finally it seems from your question that you think that the order of attached packages might have side effects inside other packages. This is nothing to worry about because all packages have their own contexts. The import scheme will only affect your script.
You can also now use the conflict_prefer() function from the conflicted package to specify which package's function should "win" and which should be masked when there are conflicting function names (details here). In your example, you would run
conflict_prefer("slice", "dplyr", "xgboost")
right after loading your libraries. Then when you run slice, it will default to using dplyr::slice rather than xgboost::slice. Or you can simply run
conflict_prefer("slice", "dplyr")
if you want to give dplyr::slice precedence over all other packages' slice functions.
I know this is a silly answering and this thread is very old (but I only had the same issue today): I changed the sequence of loading the packages. Personally, I was having a problem with MASS and Dplyr "select" function. I would like to use always Dplyr version. So I loaded MASS first!

Cleaning up function list in an R package with lots of functions

[Revised based on suggestion of exporting names.]
I have been working on an R package that is nearing about 100 functions, maybe more.
I want to have, say, 10 visible functions and each may have 10 "invisible" sub-functions.
Is there an easy way to select which functions are visible, and which are not?
Also, in the interest of avoiding 'diff', is there a command like "all.equal" that can be applied to two different packages to see where they differ?
You can make a file called NAMESPACE in the base directory of your package. In this you can define which functions you want to export to the user, and you can also import functions from other packages. Exporting will make a function usable, and import will transfer a function from another package to you without making it available to the user (useful if you just need one function and don't want to require your users to load another package when they load yours).
A trunctuated part of my packages NAMESPACE :
useDynLib(qgraph)
export(qgraph)
(...)
importFrom(psych,"principal")
(...)
import(plyr)
which respectively loads the compiled functions, makes the function qgraph() available, imports from psych the principal function and imports from plyr all functions that are exported in plyr's NAMESPACE.
For more details read:
http://cran.r-project.org/doc/manuals/R-exts.pdf
I think you should organise your package and code the way you feel most comfortable with; it is your package after all. NAMESPACE can be used to control what gets exposed or not to the user up-front, as other's have mentioned, and you don't need to document all the functions, just the main user-called functions, by adding \alias{} tags to the Rd files for all the support functions you don't want people to know too much about, or hide them on an package.internals.Rd man page.
That being said, if you want people to help develop your package, or run with it and do amazing things, the better organised it is the easier that job will be. So lay out your functions logically, perhaps one file per function, named after the function name, or group all the related functions into a single R file for example. But be consistent in which approach you do.
If you have generic functions that have more general use, consider splitting those functions out into a separate package that others can use, without having to depend on your mega package with the extra cruft that is more specific. Your package can then depend on this generic package, as can packages of other authors. But don't split packages up just for the sake of making them smaller.
The answer is almost certainly to create a package. Some rules of thumb may help in your design choice:
A package should solve one problem
If you have functions that solve a different problem, put them in a separate package
For example, have a look at the ggplot2 package:
ggplot2 is a package that creates wonderful graphics
It imports plyr, a package that gives a consistent syntax and approach to solve the Split, Apply, Combine problem
It depends on reshape2, a package with only few functions that turns wide data into long, and vice-versa.
The point is that all of these packages were written by a single author, i.e. Hadley Wickham.
If you do decide to make a package, you can control the visibility of your functions:
Only functions that are exported are directly visible in the namespace
You can additionally mark some functions with the keyword internal, which will prevent them appearing in automatically generated lists of functions.
If you decide to develop your own package, I strongly recommend the devtools package, and reading the devtools wiki
If your reformulated question is about 'how to organise large packages', then this may apply:
NAMESPACE allows for very fine-grained exporting of functions: your user would see 10 visisble functions
even the invisible function are accessible if you or the users 'known', that is done via the ::: triple colon operator
packages do come in all sizes and shapes; one common rule about 'when to split' may be that as soon as you have functionality of use in different contexts
As for diff on packages: Huh? Packages are not usually all that close so that one would need a comparison function. The diff command is indeed quite useful on source code. You could use a hash function on binary code if you really wanted to but I am still puzzled as to why one would want to.

Resources