I am writing an R package with very large internal data consisting of several models created with caret, which add up to almost 2 GB. The idea is that this package will live on my computer exclusively, and several other packages I built will be able to use it to make predictions. My computer has plenty of memory, but I can't figure out how to set up the package so that the models work efficiently.
I can install the package successfully if I store the large models in inst/extdata, set lazy loading to false in the DESCRIPTION file, and load the models inside the function that uses them. (I think I could also do this by putting the models in the data directory, turning off lazy loading, and loading them inside the function.) But this is very slow, since my other packages call the prediction function repeatedly and it has to load the models every time. It would work much better if the models were loaded along with the package, and just stayed in memory.
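One pattern that avoids reloading on every call is a cached loader: keep a package-level environment and read each .rds from disk only the first time it is requested. This is only a sketch; the package name `mymodelpkg` and the getter name are hypothetical.

```r
# Cache environment, created once when the package's R code is loaded.
.model_cache <- new.env(parent = emptyenv())

# Read the model from inst/extdata on first use; return the cached
# copy on every later call, so repeated predictions reuse one object.
get_model <- function(name,
                      dir = system.file("extdata", package = "mymodelpkg")) {
  if (!exists(name, envir = .model_cache, inherits = FALSE)) {
    assign(name,
           readRDS(file.path(dir, paste0(name, ".rds"))),
           envir = .model_cache)
  }
  get(name, envir = .model_cache, inherits = FALSE)
}
```

The prediction function would then call `get_model("my_model")` instead of `readRDS()` directly, paying the disk cost once per session.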
Other things I have tried made it so that the package couldn't be installed at all; when I try, I get the error "long vectors not supported yet." These include:
storing the models in inst/extdata with lazy loading
storing the models in R/sysdata.rda (with or without lazy loading)
storing the models in the data directory (so they're exported) with lazy loading
Is there a better way to do this, that keeps the models loaded when the package is loaded? Or is there some better alternative to using an R package?
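The setup I have in mind would look something like the sketch below: an `.onLoad` hook in R/zzz.R that reads every .rds model into a package-level environment once, at load time. Since `readRDS()` does not go through the lazy-load database, this route should also sidestep the "long vectors" limit, though I'm not certain of that. Names here are hypothetical.

```r
# Environment holding the models for the lifetime of the session.
.models <- new.env(parent = emptyenv())

# Read every .rds file in a directory into the given environment,
# keyed by file name without extension.
load_models <- function(dir, envir = .models) {
  for (f in list.files(dir, pattern = "\\.rds$", full.names = TRUE)) {
    assign(tools::file_path_sans_ext(basename(f)), readRDS(f), envir = envir)
  }
  invisible(envir)
}

# Called automatically when the package namespace is loaded.
.onLoad <- function(libname, pkgname) {
  load_models(system.file("extdata", package = pkgname))
}
```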
I have been working on an ML project (done inside an R project) that resulted in some ML models (built with caret) ALONG WITH code that uses those models for additional analysis.
As the next phase, I am "deploying" these models by creating an R package that my collaborators can use to analyze new data, where that analysis includes USING the trained ML models. This package includes functions that generate reports, and embedded in each report is the application of the trained ML models to the new data sets.
I am trying to identify the "right" way to include those trained models in the package. (Note, currently each model is saved in its own .rds file).
I want to be able to use those models inside of package functions.
I also want to consider the possibility of "updating" the models to a new version at a later date.
So ... should I:
Include the .rda files in inst/extdata
Include as part of sysdata.rda
Put them in an external data package (which seems reasonable, except almost all examples in tutorials expect a data package to include data.frame-ish objects)
With regard to that third option ... I note that these models likely imply that there are some additional "NAMESPACE" issues at play, as the models will require a whole bunch of caret related stuff to be useable. Is that NAMESPACE modification required to be in the "data" package or the package that I am building that will "use" the models?
My first intention would be to go for 1. There is no need for other formats such as PMML, as you only want to run it within R, so I consider .rda natively best. As long as your models are not huge, it should be fine to share with collaborators (but maybe not for a CRAN package). I see that 3. sounds convenient, but why separate models and functions? Freshly trained models would then come with a new package version anyway, as you would need to release the data package again too. I don't see gaining much this way, but I don't have much experience with data packages.
I am currently building an R package and I want to use a trained model in one of my R scripts. Is it possible to load the model (saved in .rds form)?
Yes, it's exactly the way you described it. Save the object with saveRDS and load it with readRDS. You have to remember to load every package you used for the model, and to prepare the new data in exactly the same way for prediction.
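A minimal roundtrip sketch, using a base `lm` model as a stand-in for a caret model; in a package, the .rds file would typically live under inst/extdata and be located with `system.file()`:

```r
# Train and save a model (stand-in for your caret model).
model <- lm(dist ~ speed, data = cars)
path <- file.path(tempdir(), "model.rds")
saveRDS(model, path)

# Later, or in another script: load it and predict on new data,
# prepared with the same columns used in training.
loaded <- readRDS(path)
preds <- predict(loaded, newdata = data.frame(speed = c(10, 20)))
```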
Sometimes I use attach with some subset terms to work with odd dimensions of study data. To prevent "masking" variables in the environment (really the warning message itself) I simply call detach() to just remove whatever dataset I was working with from the R search path. When I get muddled in scripting, I may end up calling detach a few times. Well, interestingly if I call it enough, R removes functions that are loaded at start-up as part of packages like from utils, stats, and graphics. Why does "detach" remove these functions?
R removes base functions from the search path, like plot and ? and so on.
These functions that were removed are often called “base” functions but they are not part of the actual base package. Rather, plot is from the package graphics, and ? is from the package utils, both of which are among R's default packages and are therefore attached by default. Both packages are attached after package:base, and you're accidentally detaching these packages with your too many detach calls (package:base itself cannot be detached; this is important because if it were detached, you couldn't reattach it: the functions necessary for that are inside package:base).
To expand on this, attach and detach are usually used in conjunction with package environments rather than data sets: to enable the use of functions from a package without explicitly typing the package name (e.g. graphics::plot), the library function attaches these packages. When R starts, some packages are attached by default. You can find more information about this in Hadley Wickham's Advanced R.
As you noticed, you can also attach and detach data sets. However, this is generally discouraged (quite strongly, in fact). Instead, you can use data transformation functions from the base package (e.g. with and transform, as noted by Moody_Mudskipper in a comment) or from a data manipulation package (dplyr is state of the art; an alternative is data.table).
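To illustrate the base-R alternatives, both of these leave the search path untouched, so there is nothing to detach afterwards:

```r
# Evaluate an expression with the data frame's columns in scope,
# instead of attach(mtcars); mean(mpg); detach().
m <- with(mtcars, mean(mpg))

# Add a derived column without attaching anything.
mt2 <- transform(mtcars, kpl = mpg * 0.425)
```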
Q. How does one write an R package function that references an R package dataset in a way that is simple/friendly/efficient for an R user. For example, how does one handle afunction() that calls adataset from a package?
What I think may not be simple/friendly/efficient:
User is required to run data(adataset) before running afunction(...), or else receive an Error: ... object 'adataset' not found. I have noticed some packages have built-in datasets that can be called anytime the package is loaded, for example iris, which one can call without bringing it into the Global Environment.
Possible options which I have entertained:
Write data(NamedDataSet) directly into the function. Is this a bad idea? I thought perhaps it could be, given memory considerations and my limited understanding of function environments.
Code the structure of the dataset directly into the function. I think this works depending on the size of the data, but it makes me wonder how to go about proper documentation in the package.
Change Nothing. Given a large enough dataset, maybe it does not make sense to implement a way different from reading it before calling the function.
Any comments are appreciated.
You might find these resources about writing R data packages useful:
the "External Data" section of R Packages
Creating an R data package
Creating an R data package (overview)
In particular, take note of the DESCRIPTION file and usage of the line LazyData: true. This is how datasets are made available without having to use data(), as in the iris example that you mention.
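You can see the behavior in any R session with iris, which ships in the datasets package (attached by default) and is exposed as a lazy-loaded promise rather than a Global Environment object:

```r
# No data(iris) call is needed; the promise is resolved on first use,
# and the object does not appear in the Global Environment.
head(iris, 2)
```

A function in a package with `LazyData: true` can reference its dataset by name in exactly the same way.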
I use unit tests via testthat for many of my simpler functions in R, but for more complex functions that combine simple functions and business logic, I'd like to test the cumulative impact via a before-and-after view for a sample of inputs. Ideally, I'd like to do this for a variety of candidate changes.
At the moment I'm:
Using Rmarkdown documents
Loading the package as-is
Getting my sample
Running my sample through the package as-is and outputting table of results
Sourcing new copies of functions
Running my sample again and outputting table of results
Reloading the package and sourcing different copies of functions as required
This has proven difficult because some functions in the package namespace still call the package versions of other functions, making results unreliable unless I thoroughly reload all downstream dependencies of the changed functions. Additionally, the mechanism is complex to manage and difficult to reuse.
Is there a better strategy for testing candidate changes in R packages?
I've reduced this issue by creating a sub-package in my impact analysis folder that contains the amended functions.
I then use devtools::load_all to load these new function versions up.
I can then compare and contrast results by accessing the originals via the namespace, e.g. myoriginalpackage:::testfunction, whilst looking at the new ones with testfunction.
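The before/after comparison itself can be wrapped in a small helper (hypothetical names throughout): run the same sample through `myoriginalpackage:::testfunction` and the `devtools::load_all()` version, then diff the outputs.

```r
# Compare an installed-version result with a candidate-version result;
# returns "no change" when they match, otherwise the all.equal() diff.
compare_versions <- function(old, new) {
  cmp <- all.equal(old, new)
  if (isTRUE(cmp)) "no change" else cmp
}

# e.g. compare_versions(myoriginalpackage:::testfunction(sample_input),
#                       testfunction(sample_input))
```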
Maybe you can replace steps 5 and 7 with R CMD build YourPackage on each version. Then with
install.packages("Path/To/MyPackage_1.1.tar.gz", repos = NULL, type = "source")
test the old version, and then replace 1.1 with 1.2.