Global variable on load in R package

I'm writing a package in R and I'd like to include some sample objects in it that would be easily accessible for users to play with. The problem is they contain non-ASCII characters, and R CMD check won't allow this in .rda files in data. It will, however, allow Unicode in inst/extdata. I could just have these datasets read and wrapped in objects when the package is loaded. I tried assign and <<-, but I couldn't make either work.
Alternately, they could be loaded and saved as .rda files during the installation of the package. This would in fact be preferable, but from what I've read it doesn't seem to be supported.
Probably irrelevant but possibly interesting bit of history: I started the package on Debian unstable. I saved those datasets as .rda and they passed the check just fine. At one point I made a little correction, resaved them, and got a warning. I saved them again, and the warning disappeared. Then I moved to Debian stable, added some new datasets, resaved them all, and now I can't get rid of the warning in any way. When I save them from r-devel, however, I only get a note, not a warning.

The answer is embarrassingly simple: read the data and prepare the variables in one of the files in the R folder, and mark them with #' @export. No need for assign or anything else.
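For concreteness, a minimal sketch of that approach; the file name mydata.csv and the package name mypackage are placeholders for whatever your package actually uses:

#' Sample dataset, read from inst/extdata when the package is installed.
#' @export
mydata <- utils::read.csv(
  system.file("extdata", "mydata.csv", package = "mypackage"),
  stringsAsFactors = FALSE,
  fileEncoding = "UTF-8"
)

Because this lives in a file under R/, the object is created as part of the package's lazy-load database and exported like any other object, so users see mydata as soon as they attach the package.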

Related

Is there an R function to make a copy of all the source code used to generate an analysis?

I have a file run_experiment.rmd which performs an analysis on data using a bunch of .r scripts in another folder.
Every analysis is saved into its own timestamped folder. I save the outputs of the analysis, the inputs used, and if possible I would also like to save the code used to generate the analysis (including the contents of both the .rmd file and the .r files).
The reason for this is because if I make changes to the way my analyses are run, then if I re-run the analysis using the new updated file, I will get different results. If possible, I would like to keep legacy versions of the code so that I can always, if need be, re-run the original analysis.
Have you considered using a git repository to commit your code and output each time you update/run it? I think this is the optimal solution for what you are describing. Each commit has a timestamp associated with it, so you can roll back to a previous version when needed.
The best way to do this is to put all of those scripts into an R package, and in your Rmd file, print sessionInfo() to record the package versions used.
You should change the version number of the package each time you make a non-trivial change to it (or even better, with every change).
Then when you want to reproduce the analysis, you use the sessionInfo() listing to work out which version of R and of the packages to install, and you'll get the same environment.
There are packages to help with this (pak and renv, maybe others), but I haven't used them, so I can't give details or recommendations.
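A minimal sketch of both ideas, assuming the layout from the question (run_experiment.rmd plus a scripts/ folder of .r files; both paths are placeholders):

# At the end of run_experiment.rmd: record R and package versions
sessionInfo()

# Snapshot the code itself into the timestamped results folder
out_dir <- file.path("results", format(Sys.time(), "%Y%m%d_%H%M%S"))
dir.create(out_dir, recursive = TRUE)
file.copy("run_experiment.rmd", out_dir)
file.copy(list.files("scripts", pattern = "\\.[rR]$", full.names = TRUE), out_dir)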

Where to put R files that generate package data

I am currently developing an R package and want it to be as clean as possible, so I try to resolve all WARNINGs and NOTEs displayed by devtools::check().
One of these notes is related to some code I use for generating sample data to go with the package:
checking top-level files ... NOTE
Non-standard file/directory found at top level:
'generate_sample_data.R'
It's an R script currently placed in the package root directory and not meant to be distributed with the package (because it doesn't really seem useful to include).
So here's my question:
Where should I put such a file or how do I tell R to leave it be?
Is .Rbuildignore the right way to go?
Currently devtools::build() puts the R script in the final package, so I shouldn't just ignore the NOTE.
As suggested in http://r-pkgs.had.co.nz/data.html, it makes sense to use ./data-raw/ for scripts/functions that are necessary for creating/updating data but are not needed in the package itself. After adding the line ^data-raw$ to ./.Rbuildignore, the package build will ignore everything in that directory. (And, as you commented, there is a helper function, devtools::use_data_raw(), that sets this up for you.)
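A sketch of that workflow; sample_data is a placeholder name, and in current tooling the helper lives in usethis rather than devtools:

# One-time setup: creates data-raw/ and adds ^data-raw$ to .Rbuildignore
usethis::use_data_raw("sample_data")

# data-raw/sample_data.R then holds the generation code, for example:
sample_data <- data.frame(id = 1:10, value = rnorm(10))
usethis::use_data(sample_data, overwrite = TRUE)  # writes data/sample_data.rda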

Automatic loading of data from sysdata.rda in package

I have spent a lot of time searching for an answer to what is probably a very basic question, but I just can't find the solution to my issue. The closest that I found was this exchange from a few years ago.
In that case, the issue was the location of the sysdata.rda file in the correct directory within the package. That is not my issue.
I have some variables that store things like color palettes that I am using inside a package. These variables are only used inside my functions, so I am storing them in R/sysdata.rda. However, when I load the package, the variables are not loaded into the package environment. If I load the data manually from sysdata.rda, then everything works fine.
My impression from reading everything that I could find on internal data in R packages was that the data in R/sysdata.rda would load automatically.
Here is the code that I am using to store my data.
devtools::use_data(tmpBrks, tmpColors, prcpBrks, prcpChgBrks,
prcpChgBrkLabels, prcpColors, prcpChgColors,
internal = TRUE, overwrite = TRUE)
That successfully creates the data file at R/sysdata.rda and the data is in the file when I load it manually.
What do I need to do to have the data load automatically so the functions in my package can use them?
As usual, this was a bad combination of user ignorance and poor R documentation. The data was being loaded and was available to the functions. Where I went wrong was in assuming that the data would be visible in the package environment. That is not the case.
As far as I can tell, internal data in the R/sysdata.rda file is available to the functions within the package, but not visible in any other way. After I created the internal data file, I was looking for the data in the package environment. When I didn't see it there, I assumed it wasn't loaded. As I kept pushing forward with my package development, I finally realized that the data was loading silently and was accessible to the functions in the package.
As evidenced by the two upvotes my question got, I am not the only one who didn't understand the behavior of the R/sysdata.rda internal data. Hopefully this explanation will save someone else a bunch of time searching for an answer to an issue that doesn't really exist.
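A quick illustration of that behavior, using the tmpColors object from the question and a hypothetical package name mypackage:

library(mypackage)
exists("tmpColors")      # FALSE: internal data is not attached to the search path

# A function defined inside the package (e.g. in R/plot.R) sees it directly,
# because package code is evaluated in the package namespace:
# plot_temps <- function(x) graphics::barplot(x, col = tmpColors)

# From outside, the object can still be inspected through the namespace:
mypackage:::tmpColors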

Can I load an RData file while bypassing loading the namespaces?

Let's say some of my users cannot alter their R environments, but I need them to be able to open up RData files. These environment files require a package to be loaded (httpuv to be exact). We don't care about the package, we don't need its capabilities, we just need to get at the data. Is there a way to either force R to bypass loading namespaces when loading the RData file, or force it to save it without namespace dependencies at the originating end? Thanks.
To reproduce, install Shiny. Create and save some R objects to the server's file system from within a Shiny app as an RData file. Copy the file over to a computer that doesn't have Shiny or the httpuv package installed. Try loading the RData file, even if the actual objects you saved are completely ordinary data.frames that have nothing to do with Shiny or httpuv.
I ran strings on the RData file, and the damn thing is full of references to httpuv. The software is loading the file and then actively deciding not to continue in the internal loadFromConn2() function. Therefore there must be a way to make it stop doing so.
Really @baptiste should get credit for the link in his comment to some general solutions, especially the R CMD INSTALL --fake trick, and I will accept that if he reposts it as an answer. That is why I am not accepting the following answer of my own to the specific problem that caused the error in my case, but I am posting it in case it helps someone else.
Some of the objects I was saving were lm fitted objects. Those contain formula/terms objects (at least two each, for some reason... maybe because they've been through stepAIC), and each of those formulas in turn has an environment attribute. That environment attribute is .GlobalEnv, which probably does contain copies of package functions someplace. When I dug through the objects inside the fitted models, and then the objects inside all the attributes of those objects, and then the objects inside the attributes of the attributes of those objects... and set every environment attribute I could find to NULL, eventually I was able to save that fitted model to a file that could be opened from a different R installation without getting the error about not being able to load a namespace.
I suppose I could also write a function to iterate through the objects within a fitted model, and their attributes, and remove environments, but that sounds ugly and dangerous; a rough sketch is below. Maybe there is a way to force formulas and fitted models not to retain environments, and that would be better. For the time being, instead of saving fitted models, I will save their call attributes after scrubbing any environment attributes I find there. If that doesn't work, I'll deparse them into character strings.
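For what it's worth, a sketch of that scrubbing idea for the plain lm case; it only clears the environment attributes mentioned above (the terms and the model frame's terms), so other model classes may need more digging:

# Remove environment attributes from a fitted lm before saving, so the
# file does not drag .GlobalEnv (and hence loaded namespaces) along.
# WARNING: this can break predict() and update() on the scrubbed object.
scrub_envs <- function(fit) {
  attr(fit$terms, ".Environment") <- NULL
  if (!is.null(fit$model)) {
    attr(attr(fit$model, "terms"), ".Environment") <- NULL
  }
  fit
}

fit <- lm(mpg ~ wt, data = mtcars)
saveRDS(scrub_envs(fit), "model.rds")  # loads cleanly on another installation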
PS: I used the RDS format and haven't yet tested it with RData, but I suspect that the problem was the saving of the evaluation environment in some of the attributes, and had nothing to do with the format in which the objects get saved. I'll post an update if it turns out that this doesn't also work with RData.
PPS: I suspect I'm not the only one here who's hearing about the R CMD INSTALL --fake trick for the first time, and perhaps the word should be spread about this... because to the extent other R users don't know about it, this remains an obvious vector for denial-of-service attacks against R!
I will accept my own answer to get rid of the SO auto-nagger, but will unaccept it and accept @baptiste's if they make it possible for me to do so by posting it as an answer. Thanks.

Exporting data in Roxygen2 so that they are available without requiring data()

After reading questions such as this SO question on documenting a data set with Roxygen, I have managed to document a dataset (which I will refer to as cells), and it now appears in the list generated by data(package="mypackage") and is loaded if I run the command data(cells). After this, cells will appear when ls() is run.
However, in many packages the data is immediately available without requiring a data() call. Also, the data names do not appear when ls() is run. An example is the baseball data set that comes with plyr. I have looked at the source for plyr and I cannot see how this is done.
In the DESCRIPTION file of your package make sure that there is a field called LazyData that is set to TRUE.
From the "Writing R Extensions" guide:
The ‘data’ subdirectory is for data files, either to be made available via lazy-loading or for loading using data(). (The choice is made by the ‘LazyData’ field in the ‘DESCRIPTION’ file: the default is not to do so.)
I think the exact syntax changed with R version 2.14; before that it was LazyLoad not LazyData.
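For reference, the relevant DESCRIPTION line looks like this (mypackage and the version number are placeholders):

Package: mypackage
Version: 0.1.0
LazyData: true

With that field set, datasets in data/ (like cells here, or plyr's baseball) are lazy-loaded and usable as soon as the package is attached, with no data() call and no entry cluttering ls().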
