R: Searching package code in CRAN or installed locally - r

Suppose I wish to find instances of the use of one or more functions in the code of base or submitted packages, for purpose of better understanding idiomatic use of those functions. That is to say, I want to do a code search for the places where a function is used, not a search for places where that function is defined. So I would want to include e.g., unexported functions.
Ideally I would like to do RegEx matching so as to find functions with similar names that might serve a parallel function. I would also like to be able to restrict the output based on R's logical tests of output type, to find, e.g., only functions, or some finer subdivisions, such as is.primitive() or is.closure(), or (from rlang) is_primitive_eager() or is_primitive_lazy().
I note that some of the kinds of search I am asking about exist for package documentation in the sos package. Also, I know that grep searches can be done on the names of exported functions of loaded packages, as here: Searching functions using grep over multiple loaded packages in R, and Jim Hester's lookup package finds function definitions in CRAN packages even if they are not installed. See also Ben Bolker's answer, here: Name of a package for a given function in R But none of these methods will search for function usage as opposed to function definition.

Related

What is the difference between :: and library? [duplicate]

I'm writing some R functions that employ some useful functions in other packages like stringr and base64enc. Is it good not to call library(...) or require(...) to load these packages first but to use :: to directly refer to the function I need, like stringr::str_match(...)?
Is it a good practice in general case? Or what problem might it induce?
It all depends on context.
:: is primarily necessary if there are namespace collisions, functions from different packages with the same name. When I load the dplyr package, it provides a function filter, which collides with (and masks) the filter function loaded by default in the stats package. So if I want to use the stats version of the function after loading dplyr, I'll need to call it with stats::filter.
This also gives motivation for not loading lots of packages. If you really only want one function from a package, it can be better to use :: than load the whole package, especially if you know the package will mask other functions you want to use.
Not in code, but in text, I do find :: very useful. It's much more concise to type stats::filter than "the filter function from the stats package".
From a performance perspective, there is a (very) small price for using ::. Long-time R-Core development team member Martin Maechler wrote (on the r-devel mailing list (Sept 2017))
Many people seem to forget that every use of :: is an R
function call and using it is inefficient compared to just using
the already imported name.
The performance penalty is very small, on the order of a few microseconds, so it's only a concern when you need highly optimized code. Running a line of code that uses :: one million times will take a second or two longer than code that doesn't use ::.
As far as portability goes, it's nice to explicitly load packages at the top of a script because it makes it easy to glance at the first few lines and see what packages are needed, installing them if necessary before getting too deep in anything else, i.e., getting halfway through a long process that now can't be completed without starting over.
Aside: a similar argument can be made to prefer library() over require(). Library will cause an error and stop if the package isn't there, whereas require will warn but continue. If your code has a contingency plan in case the package isn't there, then by all means use if (require(package)) ..., but if your code will fail without a package you should use library(package) at the top so it fails early and clearly.
Within your own package
The general solution is to make your own package that imports the other packages you need to use in the DESCRIPTION file. Those packages will be automatically installed when your package is installed, so you can use pkg::fun internally. Or, by also importing them in the NAMESPACE file, you can import an entire package or selectively importFrom specific functions and not need ::. Opinions differ on this. Martin Maechler (same r-devel source as above) says:
Personally I've got the impression that :: is
much "overused" nowadays, notably in packages where I'd strongly
advocate using importFrom() in NAMESPACE, so all this happens
at package load time, and then not using :: in the package
sources itself.
On the other hand, RStudio Chief Scientist Hadley Wickham says in his R Packages book:
It's common for packages to be listed in Imports in DESCRIPTION, but not in NAMESPACE. In fact, this is what I recommend: list the package in DESCRIPTION so that it’s installed, then always refer to it explicitly with pkg::fun(). Unless there is a strong reason not to, it's better to be explicit.
With two esteemed R experts giving opposite recommendations, I think it's fair to say that you should pick whichever style suits you best and meets your needs for clarity, efficiency, and maintainability.
If you frequently find yourself using just one function from another package, you can copy the code and add it to your own package. For example, I have a package for personal use that borrows %nin% from the Hmisc package because I think it's a great function, but I don't often use anything else from Hmisc. With roxygen2, it's easy to add #author and #references to properly attribute the code for a borrowed function. Also make sure the package licenses are compatible when doing this.

Using functions from other packages - when to use package::function?

When making your own package for R, one often wants to make use of functions from a different package.
Maybe it's a plotting library like ggplot2, dplyr, or some niche function.
However, when making a function that depends on functions in other packages, what is the appropriate way to call them? In particular, I am looking for examples of when to use
myFunction <- function(x) {
example_package::function(x)
}
or
require(example_package)
myFunction <- function(x) {
function(x)
}
When should I use one over the other?
If you're actually creating an R package (as opposed to a script to source, R Project, or other method), you should NEVER use library() or require(). This is not an alternative to using package::function(). You are essentially choosing between package::function() and function(), which as highlighted by #Bernhard, explicitly calling the package ensures consistency if there are conflicting names in two or more packages.
Rather than require(package), you need to worry about properly defining your DESCRIPTION and NAMESPACE files. There's many posts about that on SO and elsewhere, so won't go into details, see here for example.
Using package::function() can help with above if you are using roxygen2 to generate your package documentation (it will automatically generate a proper NAMESPACE file.
The douple-colon variant :: has a clear advantage in the rare situations, when the same function name is used by two packages. There is a function psych::alpha to calculate Cronbach's alpha as a measure of internal consistency and a function scales::alpha to modify color transparency. There are not that many examples but then again, there are examples. dplyr even masks functions from the stats and base package! (And the tidyverse is continuing to produce more and more entries in our namespaces. Should you use dyplr you do not know, if the base function you use today will be masked by a future version of dplyr thus leading to an unexpected runtime problem of your package in the future.)
All of that is no problem if you use the :: variant. All of that is not a problem if in your package the last package opened is the one you mean.
The require (or library) variant leads to overall shorter code and it is obvious, at what time and place in the code the problem of a not-available package will lead to an error and thus become visible.
In general, both work well and you are free to choose, which of these admittedly small differences appears more important to you.

R: When to use package::function() vs function()? [duplicate]

I'm writing some R functions that employ some useful functions in other packages like stringr and base64enc. Is it good not to call library(...) or require(...) to load these packages first but to use :: to directly refer to the function I need, like stringr::str_match(...)?
Is it a good practice in general case? Or what problem might it induce?
It all depends on context.
:: is primarily necessary if there are namespace collisions, functions from different packages with the same name. When I load the dplyr package, it provides a function filter, which collides with (and masks) the filter function loaded by default in the stats package. So if I want to use the stats version of the function after loading dplyr, I'll need to call it with stats::filter.
This also gives motivation for not loading lots of packages. If you really only want one function from a package, it can be better to use :: than load the whole package, especially if you know the package will mask other functions you want to use.
Not in code, but in text, I do find :: very useful. It's much more concise to type stats::filter than "the filter function from the stats package".
From a performance perspective, there is a (very) small price for using ::. Long-time R-Core development team member Martin Maechler wrote (on the r-devel mailing list (Sept 2017))
Many people seem to forget that every use of :: is an R
function call and using it is inefficient compared to just using
the already imported name.
The performance penalty is very small, on the order of a few microseconds, so it's only a concern when you need highly optimized code. Running a line of code that uses :: one million times will take a second or two longer than code that doesn't use ::.
As far as portability goes, it's nice to explicitly load packages at the top of a script because it makes it easy to glance at the first few lines and see what packages are needed, installing them if necessary before getting too deep in anything else, i.e., getting halfway through a long process that now can't be completed without starting over.
Aside: a similar argument can be made to prefer library() over require(). Library will cause an error and stop if the package isn't there, whereas require will warn but continue. If your code has a contingency plan in case the package isn't there, then by all means use if (require(package)) ..., but if your code will fail without a package you should use library(package) at the top so it fails early and clearly.
Within your own package
The general solution is to make your own package that imports the other packages you need to use in the DESCRIPTION file. Those packages will be automatically installed when your package is installed, so you can use pkg::fun internally. Or, by also importing them in the NAMESPACE file, you can import an entire package or selectively importFrom specific functions and not need ::. Opinions differ on this. Martin Maechler (same r-devel source as above) says:
Personally I've got the impression that :: is
much "overused" nowadays, notably in packages where I'd strongly
advocate using importFrom() in NAMESPACE, so all this happens
at package load time, and then not using :: in the package
sources itself.
On the other hand, RStudio Chief Scientist Hadley Wickham says in his R Packages book:
It's common for packages to be listed in Imports in DESCRIPTION, but not in NAMESPACE. In fact, this is what I recommend: list the package in DESCRIPTION so that it’s installed, then always refer to it explicitly with pkg::fun(). Unless there is a strong reason not to, it's better to be explicit.
With two esteemed R experts giving opposite recommendations, I think it's fair to say that you should pick whichever style suits you best and meets your needs for clarity, efficiency, and maintainability.
If you frequently find yourself using just one function from another package, you can copy the code and add it to your own package. For example, I have a package for personal use that borrows %nin% from the Hmisc package because I think it's a great function, but I don't often use anything else from Hmisc. With roxygen2, it's easy to add #author and #references to properly attribute the code for a borrowed function. Also make sure the package licenses are compatible when doing this.

The main use(s) of "pkg::name" [duplicate]

This question already has an answer here:
What is the benefit of import in a namespace in R?
(1 answer)
Closed 6 years ago.
I noticed that some answers on SO contain the use of pkg::name where name is typically a function.
What is the advantage of this over library(pkg); ... name() or require(pkg); ... name()? R help, (help("::")) says
For a package pkg, pkg::name returns the value of the exported variable name in namespace pkg, ... The namespace will be loaded if it was not loaded before the call, but the package will not be attached to the search path.
Does this mean that the function is used without the additional memory loss of loading the entire package, (ie, is it equivalent to import <function> from <package>) in python? Or is it simply a means of telling R use the function from this package when there may be ambiguities?
My question relates the use of :: in an Rscript or directly in the console and so is not a duplicate of the linked question as the OP in that question is discussing the use of of functions from the stats4 package during a package development project. On the other hand, there appear to be answers within this post that shed some light on my question, however. Thanks for the link. (Note the following discussion on Meta: duplicates flag)
It avoids namespace collisions but it still has to load the pkg.
Example => I did this:
pryr::mem_used()
dplyr::filter(mtcars, cyl==4)
pryr::mem_used()
in one R instance and:
pryr::mem_used()
library(dplyr)
filter(mtcars, cyl==4)
pryr::mem_used()
in another.
mem before/after for the 1st was: 27.7 MB / 30.6 MB
mem before/after for the 2nd was: 27.7 MB / 30.7 MB
I didn't do multiple tests or see if the difference was rounding or something else, but no there were no real savings there IMO.
There are two main reasons why I use this notation:
Disambiguation: Some packages provide functions that have the same name as those of base R or of functions from other packages. Loading such libraries therefore replaces certain functions. This effect is referred to as "masking". In such cases I consider it a better way of coding to use the notation package::function(), in order to clarify which of the homonymous functions is used - even if it is the function that has been loaded most recently into the namespace and it is therefore not necessary for obtaining the desired output. If a function is masked by another package it is the only way to address it.
For example, library(raster)
contains (and loads into the namespace) a function called stack() which is different from the base R function stack() in the utils package. In order to still use the function stack() from base R it should be called with utils::stack() once the raster library has been loaded.
Accessing functions that are not exported into the namespace: This case is much less frequent and slightly different. Some libraries foo.pkg contain functions which are not loaded into the namespace with library(foo.pkg). As a result, in such cases library(foo.pkg) does not help in order to access these functions.
The only example where I encounter this situation on a regular basis is the quite useful function cbind.na() from the qpcR package. It can only be accessed by specifying qpcR:::cbind.na(). Note that three colons are required in this case.
Upon reconsideration, there could also be a third reason for me to use this notation: code compactness. If I know that I will only need one specific function of a package, possibly only once in the code, without being interested in the rest of the functionalities offered by that package, then I find that the notation package::function() is preferable. This may not give any tangible advantage in R, but I adopted this style from the general advice in other programming languages to avoid namespace pollution.

Cleaning up function list in an R package with lots of functions

[Revised based on suggestion of exporting names.]
I have been working on an R package that is nearing about 100 functions, maybe more.
I want to have, say, 10 visible functions and each may have 10 "invisible" sub-functions.
Is there an easy way to select which functions are visible, and which are not?
Also, in the interest of avoiding 'diff', is there a command like "all.equal" that can be applied to two different packages to see where they differ?
You can make a file called NAMESPACE in the base directory of your package. In this you can define which functions you want to export to the user, and you can also import functions from other packages. Exporting will make a function usable, and import will transfer a function from another package to you without making it available to the user (useful if you just need one function and don't want to require your users to load another package when they load yours).
A trunctuated part of my packages NAMESPACE :
useDynLib(qgraph)
export(qgraph)
(...)
importFrom(psych,"principal")
(...)
import(plyr)
which respectively loads the compiled functions, makes the function qgraph() available, imports from psych the principal function and imports from plyr all functions that are exported in plyr's NAMESPACE.
For more details read:
http://cran.r-project.org/doc/manuals/R-exts.pdf
I think you should organise your package and code the way you feel most comfortable with; it is your package after all. NAMESPACE can be used to control what gets exposed or not to the user up-front, as other's have mentioned, and you don't need to document all the functions, just the main user-called functions, by adding \alias{} tags to the Rd files for all the support functions you don't want people to know too much about, or hide them on an package.internals.Rd man page.
That being said, if you want people to help develop your package, or run with it and do amazing things, the better organised it is the easier that job will be. So lay out your functions logically, perhaps one file per function, named after the function name, or group all the related functions into a single R file for example. But be consistent in which approach you do.
If you have generic functions that have more general use, consider splitting those functions out into a separate package that others can use, without having to depend on your mega package with the extra cruft that is more specific. Your package can then depend on this generic package, as can packages of other authors. But don't split packages up just for the sake of making them smaller.
The answer is almost certainly to create a package. Some rules of thumb may help in your design choice:
A package should solve one problem
If you have functions that solve a different problem, put them in a separate package
For example, have a look at the ggplot2 package:
ggplot2 is a package that creates wonderful graphics
It imports plyr, a package that gives a consistent syntax and approach to solve the Split, Apply, Combine problem
It depends on reshape2, a package with only few functions that turns wide data into long, and vice-versa.
The point is that all of these packages were written by a single author, i.e. Hadley Wickham.
If you do decide to make a package, you can control the visibility of your functions:
Only functions that are exported are directly visible in the namespace
You can additionally mark some functions with the keyword internal, which will prevent them appearing in automatically generated lists of functions.
If you decide to develop your own package, I strongly recommend the devtools package, and reading the devtools wiki
If your reformulated question is about 'how to organise large packages', then this may apply:
NAMESPACE allows for very fine-grained exporting of functions: your user would see 10 visisble functions
even the invisible function are accessible if you or the users 'known', that is done via the ::: triple colon operator
packages do come in all sizes and shapes; one common rule about 'when to split' may be that as soon as you have functionality of use in different contexts
As for diff on packages: Huh? Packages are not usually all that close so that one would need a comparison function. The diff command is indeed quite useful on source code. You could use a hash function on binary code if you really wanted to but I am still puzzled as to why one would want to.

Resources