How do I rebuild data-raw using roxygen2?

I have a package that uses a pre-built dataset that may be modified over time by other parts of the package. Specifically, I have files that I add to the inst directory that are indexed to make a data.frame. The indexing does not take very long (about 15-30 seconds), but it is longer than I want to spend on every package load.
Is there a way to automate the indexing so that it occurs with roxygen2::roxygenize()? What I'd really love is for an R function from the package to run any time I run devtools::document(). I think this is possible by creating a custom roclet, but I don't quite understand how I would do this in practice.

After reading more documentation, I found the answer to the question. The @examples idea from @r2evans helped me find something that targets this more directly. I used #' @eval my_function(), where my_function() is an exported function in the package that builds the data and generates the documentation for the data.
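For reference, a minimal sketch of how that @eval setup can look. The helper name rebuild_index(), the package name mypkg, and the file layout are illustrative, not the poster's actual code:

# R/index.R -- hypothetical exported helper: rebuilds the dataset and returns
# a character vector of roxygen tags documenting it.
#' @export
rebuild_index <- function() {
  files <- list.files(system.file("extdata", package = "mypkg"),
                      full.names = TRUE)
  file_index <- data.frame(path = files, size = file.size(files),
                           stringsAsFactors = FALSE)
  save(file_index, file = "data/file_index.rda")  # refresh the shipped data
  c("@title Index of files shipped in inst/extdata",
    "@format A data.frame with columns `path` and `size`.")
}

# R/data.R -- the tags returned above are spliced into this block every time
# devtools::document() / roxygen2::roxygenize() runs, so the index is rebuilt
# as a side effect of documenting.
#' @eval rebuild_index()
"file_index"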

Related

Understanding Environments within Packages

I am writing a package for the first time and I am trying to understand how it works in principle.
I need a variable to be available to many functions within the package. The common suggestion is to use a new environment that you specifically create for this purpose from within the package. For example, here:
How to define "hidden global variables" inside R packages?
or in this snippet of code:
https://github.com/eddelbuettel/rcppgsl/blob/master/R/inline.R#L19-L34
Based on this suggestion, right now my code starts with:
.pkg_env <- new.env(parent=emptyenv())
Yet, I cannot reconcile this with the principle that "the code in a package is run when the package is built" as mentioned here (https://r-pkgs.org/r.html#understand-when-code-is-executed). Perhaps, I am missing something in the way environments are managed in R, but I would guess you need that line to be run at load to create the environment.
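For what it's worth, a minimal sketch of the pattern being referenced (function names are illustrative): the environment is created once when the package's R code is evaluated at build/load time, and the package's functions then read from and write to it at run time.

.pkg_env <- new.env(parent = emptyenv())

set_cache <- function(value) {
  assign("cache", value, envir = .pkg_env)   # write into the package environment
}

get_cache <- function() {
  get("cache", envir = .pkg_env, inherits = FALSE)
}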

Should one load the dataset before or inside of the function?

Q. How does one write an R package function that references an R package dataset in a way that is simple/friendly/efficient for an R user. For example, how does one handle afunction() that calls adataset from a package?
What I think may not be simple/friendly/efficient:
The user is required to run data(adataset) before running afunction(...), or else receives Error: ... object 'adataset' not found. I have noticed some packages have built-in datasets that can be called any time the package is loaded, for example iris, which one can use without bringing it into the Global Environment.
Possible options which I have entertained:
Write data(NamedDataSet) directly into the function. Is this a bad idea? I thought perhaps it could be, considering memory use and given my limited understanding of function environments.
Code the structure of the dataset directly into the function. I think this works depending on the size of the data, but it makes me wonder how to go about proper documentation in the package.
Change nothing. Given a large enough dataset, maybe it does not make sense to do anything other than reading it in before calling the function.
Any comments are appreciated.
You might find these resources about writing R data packages useful:
the "External Data" section of R Packages
Creating an R data package
Creating an R data package (overview)
In particular, take note of the DESCRIPTION file and usage of the line LazyData: true. This is how datasets are made available without having to use data(), as in the iris example that you mention.
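A minimal sketch of that setup (file names, object names, and the use of usethis are illustrative):

# data-raw/adataset.R -- run once by the developer, not at install time
adataset <- data.frame(x = 1:3, y = c("a", "b", "c"))
usethis::use_data(adataset, overwrite = TRUE)   # writes data/adataset.rda

# DESCRIPTION must contain the line:
#   LazyData: true

# R/afunction.R -- with lazy loading the function can reference the dataset
# directly; the user never needs to call data(adataset)
afunction <- function() {
  nrow(adataset)
}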

How to retrieve R function script from a specific package?

I realize there are generic functions like plot and predict that many packages provide methods for. I am wondering how I could get the R source of these generic functions for a specific package, like lme4::predict. I tried lme4::predict, but it gives an error:
> lme4::predict
Error: 'predict' is not an exported object from 'namespace:lme4'
Since you state that my suggestion above was helpful, I will tell you my process. I used my own co-authored package called pacman. This package was developed because we had a hard time remembering all the obscurely named functions for getting information on and working with add-on packages.
I used this to figure out what you wanted:
library(pacman)
p_funs(lme4, all=TRUE)
I set all = TRUE because predict here is a method for a specific class (like print, summary and plot methods). Generally, these methods are not exported, so p_funs won't return them unless you set all = TRUE. Then I scrolled down to the p section and found only a single predict method: predict.merMod
Next I realized it wasn't exported, so :: won't show it and extra colon power is needed, hence: lme4:::predict.merMod
As pointed out by David and rawr above, some functions can be stubborn little snippets (methods etc.), thus methods and getAnywhere are helpful.
Here is an example of that:
library(tm)
dissimilarity #The good stuff is hid
methods(dissimilarity) #I want the good stuff
getAnywhere("dissimilarity.DocumentTermMatrix")
Small end note
And of course you don't need pacman to look at a package's functions; it's what I use and find helpful, but it just wraps base R tools. Use the source to figure out exactly what is going on.
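For completeness, the base-R-only equivalent of the steps above might look like this (assuming lme4 is installed):

library(lme4)
methods("predict")              # lists registered predict methods, incl. predict.merMod*
getAnywhere("predict.merMod")   # prints the unexported method's source
lme4:::predict.merMod           # or reach it directly with the triple colon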

Cleaning up function list in an R package with lots of functions

[Revised based on suggestion of exporting names.]
I have been working on an R package that is nearing about 100 functions, maybe more.
I want to have, say, 10 visible functions and each may have 10 "invisible" sub-functions.
Is there an easy way to select which functions are visible, and which are not?
Also, in the interest of avoiding 'diff', is there a command like "all.equal" that can be applied to two different packages to see where they differ?
You can make a file called NAMESPACE in the base directory of your package. In it you can define which functions you want to export to the user, and you can also import functions from other packages. Exporting makes a function available to the user, and importing makes a function from another package available to your code without exposing it to the user (useful if you just need one function and don't want to require your users to load another package when they load yours).
A truncated part of my package's NAMESPACE:
useDynLib(qgraph)
export(qgraph)
(...)
importFrom(psych,"principal")
(...)
import(plyr)
which respectively loads the compiled functions, makes the function qgraph() available, imports from psych the principal function and imports from plyr all functions that are exported in plyr's NAMESPACE.
For more details read:
http://cran.r-project.org/doc/manuals/R-exts.pdf
I think you should organise your package and code the way you feel most comfortable with; it is your package after all. NAMESPACE can be used to control what gets exposed to the user up-front, as others have mentioned, and you don't need to document every function in detail, just the main user-called functions; support functions you don't want people to worry about can be covered by adding \alias{} tags to existing Rd files, or hidden on a package.internals.Rd man page.
That being said, if you want people to help develop your package, or run with it and do amazing things, the better organised it is the easier that job will be. So lay out your functions logically, perhaps one file per function, named after the function name, or group all the related functions into a single R file, for example. But be consistent in whichever approach you take.
If you have generic functions that have more general use, consider splitting those functions out into a separate package that others can use, without having to depend on your mega package with the extra cruft that is more specific. Your package can then depend on this generic package, as can packages of other authors. But don't split packages up just for the sake of making them smaller.
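If you use roxygen2, one way to realise the package.internals.Rd idea is to point all the support functions at a single man page and mark it internal. A sketch, with illustrative names:

#' Internal helpers for mypkg
#' @name mypkg-internals
#' @keywords internal
NULL

#' @rdname mypkg-internals
.parse_input <- function(x) strsplit(x, ",", fixed = TRUE)[[1]]

#' @rdname mypkg-internals
.check_input <- function(x) stopifnot(is.character(x))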
The answer is almost certainly to create a package. Some rules of thumb may help in your design choice:
A package should solve one problem
If you have functions that solve a different problem, put them in a separate package
For example, have a look at the ggplot2 package:
ggplot2 is a package that creates wonderful graphics
It imports plyr, a package that gives a consistent syntax and approach to solve the Split, Apply, Combine problem
It depends on reshape2, a package with only a few functions that turn wide data into long, and vice versa.
The point is that all of these packages were written by a single author, i.e. Hadley Wickham.
If you do decide to make a package, you can control the visibility of your functions:
Only functions that are exported are directly visible in the namespace
You can additionally mark some functions with the keyword internal, which will prevent them from appearing in automatically generated lists of functions (a short sketch follows below).
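A minimal sketch of both points using roxygen2 tags (function names are illustrative); devtools::document() turns these into NAMESPACE and Rd entries:

#' Fit the model (user-facing)
#' @export
fit_model <- function(x) {
  .standardise(x)          # helper below is still reachable via mypkg:::.standardise
}

#' Standardise the input (not exported)
#' @keywords internal
.standardise <- function(x) {
  scale(x)
}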
If you decide to develop your own package, I strongly recommend the devtools package, and reading the devtools wiki
If your reformulated question is about 'how to organise large packages', then this may apply:
NAMESPACE allows for very fine-grained exporting of functions: your user would see 10 visible functions
even the invisible functions are accessible if you or your users know about them; that is done via the ::: triple-colon operator
packages do come in all sizes and shapes; one common rule about 'when to split' may be to split as soon as you have functionality that is of use in different contexts
As for diff on packages: huh? Packages are not usually so similar that one would need a comparison function. The diff command is indeed quite useful on source code. You could use a hash function on binary code if you really wanted to, but I am still puzzled as to why one would want to.

Is there another way of loading extra packages in workers (parallel computing)?

One way of parallelization in R is through the snowfall package. To send custom functions to workers you can use sfExport() (see Joris' post here).
I have a custom function that depends on functions from non-base packages that are not loaded automagically. Thus, when I run my function in parallel, R craps out because certain functions are not available (think of the packages spatstat, splancs, sp...). So far I've solved this by calling library() inside my custom function. This loads the packages on the first run and is presumably a no-op on subsequent iterations. Still, I was wondering if there's another way of telling each worker to load the package on the first iteration and be done with it (or am I missing something and does each iteration start as a tabula rasa?).
I don't understand the question.
Packages are loaded via library(), and most of the parallel execution functions support that. For example, the snow package uses
clusterEvalQ(cl, library(boot))
to 'quietly' (i.e., not returning the value) evaluate the given expression---here a call to library()---on each node. Most of the parallel execution frameworks have something like that.
Why again would you need something different, and what exactly does not work here?
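A fuller sketch around that one-liner (cluster size and the worked example are illustrative):

library(snow)
cl <- makeCluster(2, type = "SOCK")
clusterEvalQ(cl, library(boot))             # run on every node, return value discarded
res <- parLapply(cl, 1:4, function(i) i^2)  # ordinary parallel call; boot is now attached on each worker
stopCluster(cl)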
There's a specific command for that in snowfall, sfLibrary(). See also ?"snowfall-tools". Calling library manually on every node is strongly discouraged. sfLibrary is basically a wrapper around the solution Dirk gave based on the snow package.
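A minimal snowfall sketch of the same idea (the custom function and its dependency on boot are illustrative):

library(snowfall)

my_fun <- function(i) {
  # relies on the non-base 'boot' package, so workers must have it loaded
  b <- boot(rnorm(20), function(d, idx) mean(d[idx]), R = 100)
  b$t0
}

sfInit(parallel = TRUE, cpus = 2)
sfLibrary(boot)            # load the dependency on every worker (not plain library())
sfExport("my_fun")         # ship the custom function to the workers
res <- sfSapply(1:4, my_fun)
sfStop()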
