R: what's the proper way to overwrite a function from a package?

I am using an R package in which there are two functions, f1 and f2 (with f2 calling f1).
I wish to overwrite function f1.
Since R 2.15 and the mandatory usage of namespaces in packages, if I just source the new function, it is indeed available in the global environment (i.e. just calling f1(x) in the console returns the new result). However, calling f2 will still use the packaged f1, because the namespace modifies the search path and seals it, as explained here in the Writing R Extensions manual.
What is the proper way to completely replace f1 with the new one (apart from rebuilding the package)? This can be useful in several situations: for instance, if there is a bug in a package that you have not developed, or if you don't want to rebuild your packages every day while they are still under development.
I know about the function
assignInNamespace("f1",f1,ns="mypackage")
However, the help page ?assignInNamespace is a bit enigmatic and seems to discourage people from using it without giving more information, and I couldn't find any best-practice recommendations in the official CRAN manuals. And after calling this function:
# Either of these two calls returns the new function
mypackage::f1
getFromNamespace(x = "f1", envir = as.environment("package:mypackage"))
# while this one still returns the old packaged version
getFunction(name = "f1", where = as.environment("package:mypackage"))
This is very disturbing. How is the search path affected?
For now I am doing some ugly things, such as modifying the lockEnvironment function so that library doesn't lock the package namespace, which lets me lock it at a later stage once I have replaced f1 (which really doesn't seem like good practice).
So basically I have two questions:
What exactly does assignInNamespace do in the case of a package namespace (which is supposed to be locked)?
What are the good practices?
Many thanks for sharing your experience.
EDIT: People interested in this question might find this blog post extremely interesting.

There are lots of different cases here.
If it's a bug in someone else's package
Then the best practice is to contact the package maintainer and persuade them to fix it. That way everyone gets the fix, not just you.
If it's a bug while developing your own package
Then you need to find a workflow where rebuilding packages is easy. Like using the devtools package and typing build(mypackage), or clicking a button ("Build & Reload" in RStudio; "R CMD build" in Architect).
If you just want different behaviour to an existing package
If it isn't a bug as such, or the package maintainer won't make the fix that you want, then you'll have to maintain your own copy of f1. Using assignInNamespace to override it in the existing package is OK for exploring, but it's a bit hacky, so it isn't really suitable as a permanent solution.
Your best bet is to create your own package containing copies of f1 and f2. This is less effort than it sounds, since you can just define f2 <- existingpackage::f2.
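For illustration, a minimal sketch of what that wrapper package's R/ code could look like (all names are the question's placeholders). One subtlety worth noting: a plain alias keeps f2's enclosing environment in the original sealed namespace, so a small extra step is needed if f2 should pick up your patched f1:
# R/patched.R in your wrapper package (all names are placeholders)

#' Your fixed version of f1
#' @export
f1 <- function(x) {
  # ... patched implementation ...
  x
}

#' The original f2, re-exported
#' @export
f2 <- existingpackage::f2
# Aliasing alone would leave f2 looking up f1 in existingpackage's namespace;
# repointing its environment makes it use the f1 defined above:
environment(f2) <- environment(f1)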
In response to the comment:
The second and third cases make sense if you are alone, but they require building and installing the packages, which is tricky in the case of my organisation, as the packages are deployed on dozens of computers and I need root access to update them.
So take a copy of the existing package source, apply your patch, and host it on your company network, GitHub, or Bitbucket. Then the updated package can be installed programmatically via
install.packages("//some/network/path/mypackage_0.0-1.tar.gz", repos = NULL)
or
library(devtools)
install_github("mygithubusername/mypackage")
Since the installation is just a line of code, you can easily push it to as many machines as you like. You don't need root access either - just install the package to a library folder that doesn't require root access to write to. (Read the Startup and .libPaths help pages for how to define a new library.) You'll need network access to those machines, but I can't help you with that. Speak to your network administrator or your boss or whoever can get you permission.
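Putting those pieces together, a sketch of the no-root install on each machine (the library folder is a placeholder):
dir.create("~/Rlibs", showWarnings = FALSE)  # a user-writable library folder
.libPaths(c("~/Rlibs", .libPaths()))         # put it first on the library search path
install.packages("//some/network/path/mypackage_0.0-1.tar.gz",
                 repos = NULL, lib = "~/Rlibs")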

In case the function has no explicit binding within the package:
rlang::env_unlock(env = asNamespace('mypackage'))          # unlock the namespace environment itself
rlang::env_binding_unlock(env = asNamespace('mypackage'))  # unlock its (sealed) bindings
assign('f1', f1, envir = asNamespace('mypackage'))         # overwrite f1 inside the namespace
rlang::env_binding_lock(env = asNamespace('mypackage'))    # restore the binding locks
rlang::env_lock(asNamespace('mypackage'))                  # re-seal the namespace


Run-time vs develop-time dependencies in R

I'm developing a package (golem) in R, and checking it returns a NOTE about too many packages in Imports (DESCRIPTION):
checking package dependencies … NOTE
Imports includes 34 non-default packages.
Importing from so many packages makes the package vulnerable to any of
them becoming unavailable. Move as many as possible to Suggests and
use conditionally.
I have moved some packages to Suggests (DESCRIPTION), like this:
usethis::use_package(package = "ggplot2", type = "Suggests")
usethis::use_package(package = "MASS", type = "Suggests")
I would like to know:
What is the difference between Imports (run-time) and Suggests (develop-time), and whether the latter has anything to do with the term "compile time" in other programming languages.
How do I know whether a package is needed by the user at runtime? Is there any universal rule for this (like a phrase to help you decide)? And for Suggests?
In R, packages listed in the Imports clause of the DESCRIPTION file must be available or your package won't load. Normally they will all be loaded when your package is loaded, though it's possible to delay that by not importing anything, just using :: notation to access them.
Packages listed in the Suggests clause don't need to be available, and won't be automatically loaded. To access their functions, you normally call requireNamespace() to find out if the package is available, and if so use :: for access. If it is not available, your package should fail gracefully in whatever the user was trying to do, letting them know that they need to install the missing package if they want the task to succeed.
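For instance, a minimal sketch of that pattern (the function is hypothetical; ggplot2 here stands in for any suggested package):
my_plot <- function(data) {
  # fail gracefully if the suggested package is missing
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("Package 'ggplot2' is required for this plot; please install it.")
  }
  ggplot2::ggplot(data, ggplot2::aes(x, y)) + ggplot2::geom_point()
}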
These aren't really "run-time" versus "develop-time" differences. It's all run-time.
There are two things in R that might be called "compile-time" in other languages. The best match is installing your package. That configures it to the particular R version and platform it is running on. R also has a "just-in-time" compiler that optimizes functions, but other than a bit of a speed increase that is pretty much invisible to the user.
I think @r2evans answered your second question clearly in a comment: the user needs a package to use functions that use that package. If some of your functions that use it are unlikely to be used by most users, put it in Suggests, and add the requireNamespace() test.

Searching for a way to use `linearKEuclid` and corresponding functions of `spatstat`

My goal is to analyse simple point patterns on linear networks with respect to Euclidean distance instead of the shortest-path distance implemented in linearK and related functions of spatstat and its sub-packages. Browsing the web, I found the promisingly named function linearKEuclid() and related functions here.
Unfortunately, I could not bring those functions to life on my Windows machine; e.g. I ran into errors like this
Error in xysegMcircle(Y$x, Y$y, D, df$x0, df$y0, df$x1, df$y1) :
object 'C_circMseg' not found
or
Error in tapply(stuff$sinalpha, list(ii, jj), harmonicsum) :
object 'harmonicsum' not found
There is always something missing. For me, this means that simply copying missing functions from the web, where available, does not help.
Probably, a reason for this is that the functions are merely written for internal purposes and are under internal development; see, for instance, here under "Details".
However, I am hoping for a recommendation that makes the fascinating code around linearKEuclid() runnable on my machine. Maybe someone can draw my attention to a downloadable developer version or something comparable. Many thanks in advance!
I understand your confusion, and it is unnecessarily complicated to get this to work at the moment, since problems with another package on CRAN currently prevent spatstat and its subpackages from being updated. You do indeed need to install development versions of spatstat.linnet and its dependencies. This is most easily done if you have the remotes package installed (plus the tools needed to compile packages from source, which on Windows means Rtools).
First run (in sequence):
remotes::install_github("spatstat/spatstat.random")
remotes::install_github("spatstat/spatstat.sparse")
remotes::install_github("baddstats/spatstat.explore")
remotes::install_github("baddstats/spatstat.model")
remotes::install_github("spatstat/spatstat.linnet")
Now the function should work (you may have to restart R if an old version of spatstat.linnet was already loaded when you updated). Try e.g. the example from the help file:
library(spatstat.linnet)
X <- rpoislpp(5, simplenet)
K <- linearKEuclid(X)

Are there any good resources/best-practices to "industrialize" code in R for a data science project?

I need to "industrialize" the R code for a data science project, because the project will be rerun several times in the future with fresh data. The new code should be really easy to follow, even for people who have not worked on the project before, and they should be able to redo the whole workflow quite quickly. Therefore, I am looking for tips, suggestions, resources and best practices on how to achieve this objective.
Thank you for your help in advance!
You can make an R package out of your project, because a package has everything you need for a standalone project that you want to share with others:
Easy to share, download and install
R has a very efficient documentation system for your functions and objects when you work within RStudio. Combined with roxygen2, it enables you to document every function precisely, and it makes the code clearer since you can avoid inline comments (but please do add them anyway if needed)
You can specify quite easily which dependencies your package needs, so that everyone knows what to install for your project to work. You can also use packrat if you want to mimic Python's virtualenv
R also provides a long-format documentation system called vignettes, which are similar to a printed notebook: you can display code, text, code results, etc. This is where you will write guidelines and methods on how to use the functions, provide detailed instructions for a certain method, etc. Once the package is installed, they are automatically included and available to all users.
The only downside is the following: since R is a functional programming language, a package consists mainly of functions, plus some other relevant objects (data, for instance), but not really scripts.
More details about the last point: if your project consists of a script that calls a set of functions to do something, it cannot directly appear within the package. There are two options here: a) you make a dispatcher function that runs a set of functions to do the job, so that users just have to call one function to run the whole method (not great for maintenance); b) you make the whole script appear in a vignette (see above). With the latter method, people just have to write a single R file (which can be copy-pasted from the vignette), which may look like this:
library(mydatascienceproject)
library(...)
...
dothis()
dothat()
finishwork()
That enables you to execute the whole workflow from a terminal or a remote machine with Rscript, with a call like the following (using argparse to handle the arguments):
Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
And finally if you write a bash file calling Rscript, you can automate everything !
Feel free to read Hadley Wickham's book about R packages, it is super clear, full of best practices and of great help in writing your packages.
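As a concrete starting point, the usethis package can scaffold the layout described above (the names are placeholders):
usethis::create_package("mydatascienceproject")  # DESCRIPTION, NAMESPACE, R/ skeleton
usethis::use_vignette("project-workflow")        # creates vignettes/project-workflow.Rmd
usethis::use_package("data.table")               # declares a dependency in Imports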
One can get lost in the multiple files in the project's folder, so it should be structured properly: link
Naming conventions that I use: first, second.
Set the random seed, so that the outputs are reproducible.
Documentation is important: you can use the roxygen skeleton in RStudio (default Ctrl+Alt+Shift+R).
I usually separate the code into smaller, logically cohesive scripts, and use a main.R script, that uses the others.
If you use a special set of libraries, consider using packrat. Once you set it up, you can manage the installed project-specific libraries.
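A minimal sketch of that packrat setup (the project path is a placeholder):
install.packages("packrat")
packrat::init("~/myproject")  # turn the project into a packrat project with a private library
packrat::snapshot()           # record the exact package versions in use
packrat::restore()            # recreate that library on another machine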

Use data.table in functions/packages (With roxygen)

I am quite new to R, but it seems this question is closely related to the following posts 1, 2, 3 and to the slightly different topic 4. Unfortunately, I do not have enough reputation to comment right there. My problem is that after going through all the suggestions, the code still does not work:
I included data.table under Depends in the DESCRIPTION file
I tried the second method, including a change to NAMESPACE (not reproducible)
I created an example package here containing a very small part of the code, which showed a slightly different error ("J" not found in routes[J(lat1, lng1, lat2, lng2), .I, roll = "nearest", by = .EACHI] instead of 'lat1' not found in routes[order(lat1, lng1, lat2, lng2, time)])
I tested everything from the console and from R scripts; there, the code ran without problems.
Thank you very much for your support!
Edit: @Roland
You are right, roxygen overwrites the NAMESPACE. You have to include #' @import data.table above the function. Do you understand why only inserting Depends: data.table in the DESCRIPTION file does not work? This might be a useful hint for the documentation, or did I miss it?
It was misleading that changing to routes <- routes[order("lat1", "lng1", "lat2", "lng2", "time")] helped at least a bit, as this line was suddenly no problem any more. Is it correct that in this case the data.frame order is used? I will see how far I get now. I will let you know the final result...
Answering your questions (after edit).
Quoting the Writing R Extensions manual:
Almost always packages mentioned in ‘Depends’ should also be imported from in the NAMESPACE file: this ensures that any needed parts of those packages are available when some other package imports the current package.
So you should still have the import in NAMESPACE regardless of whether you put data.table in Depends or Imports.
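In roxygen2 terms, that means adding an @import tag so that the directive is regenerated into NAMESPACE each time; a sketch using the question's routes example (the function name is hypothetical):
#' Order routes by coordinates and time
#' @import data.table
#' @export
sort_routes <- function(routes) {
  routes[order(lat1, lng1, lat2, lng2, time)]
}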
The order call doesn't seem to do what you expect; try the following:
order("lat1", "lng1", "lat2", "lng2", "time")
library(data.table)
data.table(a=2:1,b=1:2)[order("a","b")]
In case of issues, I recommend starting to debug by writing a unit test for your expected results. The most basic way to put unit tests in a package is a plain R script in the tests directory containing stopifnot(...) calls. Be aware that you need to library/require your package at the start of the script.
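A minimal sketch of such a plain-script test, reusing the hypothetical sort_routes() from above (stopifnot() errors, and thereby fails the check, when an expectation is violated):
# tests/test-order.R
library(mypackage)  # the package under test (placeholder name)
library(data.table)
routes <- data.table(lat1 = c(2, 1), lng1 = c(1, 2),
                     lat2 = c(1, 2), lng2 = c(2, 1),
                     time = c(2, 1))
stopifnot(identical(sort_routes(routes)$lat1, c(1, 2)))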
This is more an addition to the answers above: I found the following to be really useful.
From the docs [Hadley-description](http://r-pkgs.had.co.nz/description.html)
Imports packages listed here must be present for your package to
work. In fact, any time your package is installed, those packages
will, if not already present, be installed on your computer
(devtools::load_all() also checks that the packages are installed).
Adding a package dependency here ensures that it’ll be installed.
However, it does not mean that it will be attached along with your
package (i.e., library(x)). The best practice is to explicitly refer
to external functions using the syntax package::function(). This
makes it very easy to identify which functions live outside of your
package. This is especially useful when you read your code in the
future.
If you use a lot of functions from other packages this is rather
verbose. There’s also a minor performance penalty associated with
:: (on the order of 5 µs, so it will only matter if you call the
function millions of times).
From the docs Hadley-namespace
NAMESPACE also controls which external functions can be used by your
package without having to use ::. It’s confusing that both
DESCRIPTION (through the Imports field) and NAMESPACE (through import
directives) seem to be involved in imports. This is just an
unfortunate choice of names. The Imports field really has nothing to
do with functions imported into the namespace: it just makes sure the
package is installed when your package is. It doesn’t make functions
available. You need to import functions in exactly the same way
regardless of whether or not the package is attached.
... this is what I recommend: list the package in DESCRIPTION so that it’s
installed, then always refer to it explicitly with pkg::fun().
Unless there is a strong reason not to, it’s better to be explicit.
It’s a little more work to write, but a lot easier to read when you
come back to the code in the future. The converse is not true. Every
package mentioned in NAMESPACE must also be present in the Imports or
Depends fields.
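Applied to this thread's package, the recommendation looks like this (function name hypothetical). Note one data.table-specific wrinkle: :: covers ordinary function calls, but data.table's [ syntax additionally needs the import discussed above:
# DESCRIPTION:  Imports: data.table
# NAMESPACE:    import(data.table)   (generated by #' @import data.table)
count_by_time <- function(routes) {
  dt <- data.table::as.data.table(routes)  # qualified call: works with Imports alone
  dt[, .N, by = time]                      # [.data.table syntax: needs the import() directive
}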

Convenient way to load (and if needed install) a package in R

A user can work on many PCs. Good code runs no matter what PC it is running on. Assuming one does not want to rely on preference and option files, what is the best way to make sure a package is loaded (and installed if needed)?
The library command is cool, but the require command is much better. But even require does not get the job done.
Triggering a re-install that is not needed (e.g., in RStudio) causes an interesting prompt to restart the R session, and this is why unnecessary installs are best avoided.
One possible trick A is to do this (to avoid typing the package name too often):
doInstall <- TRUE
toInstall <- c("downloader")
if (doInstall) install.packages(toInstall)
lapply(toInstall, library, character.only = TRUE)
or a worse trick B would be
if (!require(downloader)) {install.packages("downloader"); require(downloader)}
Is there a "2015 way" of doing it with one command - something like
justdoitall(c("downloader","dplyr"))
Here is an example of installing package zipcode using the pacman approach.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(zipcode)
Assuming one does not want to rely on preference and option files
That rules out putting anything in .Rprofile or using external packages, so we're stuck with base R to solve your problem. If that's the case, then the answer is that you can't do much better than what you have written in your question (I prefer B to A).
If you're willing to bend a little and require the user to load a package first (which could be done on startup by using .Rprofile), there are a few options that do exactly what you want.
installr::require2 and pacman::p_load do what you ask. Disclosure: I am an author/maintainer of pacman. I agree with your sentiment that we shouldn't rely on options or external files, especially if we plan on sharing the code. I use pacman pretty much every day (it has much more use than just installing/loading packages), but for the most part these types of functions should be treated as useful for interactive work. If you want portable, shareable code without worries about whether packages will be available, you will have to resort to something along the lines of what you have in your question.
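For completeness, a base-R-only sketch of the one-call helper the question asks for, essentially variant B wrapped in a loop (justdoitall is the asker's name for it):
justdoitall <- function(pkgs) {
  for (p in pkgs) {
    # install only when loading fails, to avoid unnecessary re-installs
    if (!require(p, character.only = TRUE)) {
      install.packages(p)
      library(p, character.only = TRUE)
    }
  }
  invisible(pkgs)
}
justdoitall(c("downloader", "dplyr"))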
