R: What's actually loaded when library() is called? - r

Here is a snippet of R script doing beta regression on data "GasolineYield":
library("betareg")
data("GasolineYield", package = "betareg")
gy_logit <- betareg(yield ~ batch + temp, data = GasolineYield)
It works fine but if I run the code with the second line deleted, it errs with message:
Error in terms.formula(form, ...) : object 'GasolineYield' not found
But isn't the data.frame GasolineYield in the package betareg ? What's actually happening when I call library("betareg")? Aren't all the data inside the package automatically loaded into current environment? Could anybody help me understand the mechanism behind this?

For the most part, data is included in R packages for the purposes of providing examples and other stuff that is not mission critical. That is why datasets are not automatically loaded into the environment for most packages and you have to load them using the data() command. This is a good thing. It would be a waste of memory, time, and namespace for packages that primarily provide functions to load their data all the time when users don't use it very frequently.
When you load a package, only the stuff that is exported in the "NAMESPACE" file by the package designer is made available. And the "DESCRIPTION" file has a field called "LazyData" that determines the data behavior as well. By the way, packages often have functions in them that are for internal use as well and are not exported in the NAMESPACE file.
TL;DR, the package writer determines what stuff will be available when the package is loaded and they specify those items in the NAMESPACE and DESCRIPTION files.

Related

How to declare a dependency on an R package from which you only use S3/S4 methods, but no exports?

Currently I have in my package DESCRIPTION, a dependency on dbplyr:
Imports:
dbplyr,
dplyr
dbplyr is useful almost solely because of the S3 methods it defines: https://github.com/tidyverse/dbplyr/blob/main/NAMESPACE. The actual functions you call to use dbplyr are almost entirely from dplyr.
By putting dbplyr in my Imports, it should automatically get loaded, but not attached, which should be enough to register its S3 methods: https://r-pkgs.org/dependencies-mindset-background.html#sec-dependencies-attach-vs-load.
This seems to work fine, but whenever I R CMD check, it tells me:
N checking dependencies in R code (10.8s)
Namespace in Imports field not imported from: ‘dbplyr’
All declared Imports should be used.
Firstly, why does R CMD check even check this, considering that it often makes sense to load packages without importing them. Secondly, how am I supposed to satisfy R CMD check without loading things into my namespace that I don't want or need?
I am pretty sure two of your assumptions are false.
First, putting Imports: dbplyr into your DESCRIPTION file won't load it, so its methods won't be loaded from that alone. Basically the Imports field in the DESCRIPTION file just guarantees that dbplyr is available to be loaded when requested. If you import something via the NAMESPACE file, that will cause it to be loaded. If you evaluate dbplyr::something that will cause it to be loaded. Executing loadNamespace("dbplyr") is another way, and there are a few others. You may also load some other package that loads it.
Second, I think you have misinterpreted the error message. It isn't saying that you loaded it without importing it (though it would complain about that too), it is saying that it can't detect any use of it in your package, so maybe it shouldn't be a requirement for installing your package.
Unfortunately, the code to detect uses is fallible, so it sometimes misses uses. Examples I've heard about are:
if the package is only used in the default value for a function argument. This has been fixed in R-devel.
if the package is only used during the build to construct some object, e.g. code like someclass <- R6::R6Class( ... ) needs R6, but the check code won't see it because it looks at someclass, not at the source code that created it.
if the use of the package is hidden by specifying the name of the package in a character variable.
if the need for the package is indirect, e.g. you need to use ggplot2::geom_hex. That needs the hexbin package, but ggplot2 only declares it as "Suggested".
These examples come from this discussion: https://github.com/hadley/r-pkgs/issues/828#issuecomment-1421353457 .
The recommended workaround there is to create an object that refers to the imported package explicitly, e.g. putting the line
dummy_r6 <- function() R6::R6Class
into your package is enough to suppress the note without actually loading R6. (It will be loaded if you ever call this function.)
However, your requirement is stronger: you do need to make sure dbplyr is loaded if you want its methods to be used. I'd put something in your .onLoad() function that triggers the load. For example,
.onLoad <- function(lib, pkg) {
# Make sure the dbplyr methods are loaded
loadNamespace("dbplyr")
}
EDITED TO ADD: As pointed out in the comments, there's a bug in the check code that means it won't detect this as being a use of dbplyr. You really need to do both things, e.g.
.onLoad <- function(lib, pkg) {
# Make sure the dbplyr methods are loaded
loadNamespace("dbplyr")
# Work around bug in code checking in R 4.2.2 for use of packages
dummy <- function() dbplyr::across_apply_fns
}
The function used in the dummy construction is arbitrary; it probably doesn't even need to exist, but I chose one that does.

can't find a function from loaded package

I created a local package with personal functions to be easily used within R. One of these is aimed to be used in the lidR package within a wrapper function (i.e. grid_metrics). For this reason I took the scheme of this script as a reference, exporting both the long name (e.g. my_metrics(param1, param2,...)) and the lazy one (e.g. .my_metrics), because I really like its ease of use.
Nevertheless, if I load my package and then call the lazy function
library(mypackage)
test = grid_metrics(las, .my_metrics, 20)
it does not work, so I have to load in memory the function by running its code from the file. At this stage, I can use it in both forms.
Within the NAMESPACE file I can see that both forms are exported so my last guess is that this might be related somehow to lazyeval but I don't get how.
It seems that the problem is related to the DESCRIPTION section in which the lidR package was included. Since when I moved from Imports to Depends the issue is solved.

How to use an R package's own data when writing a vignette

I have written most of an R package and now wish to write a vignette that uses my own data, that is already in the package. The data is correctly stored as my_data.Rda in the Data folder, and when the package is loaded I can access it in the Console, for instance by using data(my_data).
My problem comes when, using usethis::use_vignette("my_vignette") , I want to include something like this (much more complex in practice, of course) in the vignette:
The mean of my_data is given by
```{r} data(my_data)
mean(my_data)
```
When I knit the vignette I get the message
"Error in assert_engine(is_numeric, x, .xname = get_name_in_parent(x),
: object 'my_data' not found"
I have looked at this post: How to add external data file into developing R package? but that addresses external data.
What am I doing wrong?
I have created a minimal R package with the relevant Rmd file in the vignettes folder. link to Github
I think you are supposed to use
data(my_dataset, package = "my_package")
to load your package's data into the session where your vignette is built.
Could you confirm that your datasets are stored inside the ./data directory of your package as *.rda files

R Package Build/Install Error: "object not found" even though I have it in R/sysdata.rda

Similar Question
accessing sysdata.rda within package functions
Why This Similar Question Does Not Apply To Me
They were able to actually build it, and apparently it was a Github error for them (not related)
R VERSION
3.4.2 (I tried using 3.4.3 as well but the same problem occurred)
EDIT: I am on Windows 10
Context
I have fully read the following tutorial on R packages and how to include .Rda files in them. I have LazyData in my DESCRIPTION file set as true as well. I have tried both the data/ folder implementation and the R/sysdata.rda implementation using the function devtools::use_data() with the respective options of internal = FALSE and internal = TRUE.
However, when I try to build the package, or use devtools::install (which builds as well I assume), it fails and gives me the following error message:
Error in predict(finalModel, newInput) : object 'finalModel' not found
Where finalModel is stored within my .rda file.
Does anyone know any possible reasons why this might occur?
I also asked a coworker to install the package on his machine, but unfortunately he got the exact same error.
I made another test package as a 'sanity-check' by creating a simple linear model using the lm() function on datasets::swiss, and then made a test package with this newly created model as a .rda file. When I referenced this test model in a function within this test package, it eerily worked, despite the fact that (to the best of my knowledge) I used the exact same steps to create this new R package.
Also, I unfortunately cannot share the code for the package I am creating, but I am willing to share the code for the test package that uses the swiss dataset.
Thank you in advance.
EDIT: My .rda file I am putting in the package was created last year, if that has anything to do with it.
I just solved a similar issue of having object 'objectName' not found that arose during package management. In my case, the issue related to losing the context of variables when working with parallelization.
When using parallel::clusterExport(cl, varlist=c("function-name")), clusterExport looks at .GlobalEnv for variable definitions. This wouldn't come up during debugging, as I always the variables defined in .GlobalEnv. The solution was to state the environment explicitly: parallel::clusterExport(cl, varlist=c("function-name"), envir=environment()). This ensures that the parallel processes have context of the variables within the data/ folder and R/sysdata.rda.
Source
If you have more than one internal file, you must save them together:
usethis::use_data(file_1,
file_2,
file_3,
internal = TRUE,
overwrite = TRUE)

Load data object when package is loaded

Is there a way to automatically load a data object from a package in memory when the package is loaded (but not yet attached)? I.e. the opposite of lazy loading? The object is used in one of the package functions, so it needs to be available at all time.
When the package is set to lazydata=false, the data object is not exported by the package at all, and needs to be loaded manually with data(). We could use something like:
.onLoad <- function(lib, pkg){
data(mydata, package = pkg)
}
However, data() loads the object in the global environment. I strongly prefer to load it in the package environment (which is what lazydata does) to prevent masking conflicts.
A workaround is to bypass the data mechanics completely, and simply hardcode the object in the package. So the package myscore.R would look like
mymodel <- readRDS("inst/mymodel.rds")
myscore <- function(newdata){
predict(mymodel, newdata)
}
But this will lead to a huge packagedb for large data objects, and I am not sure what are the consequences of that.
As you say
The object is used in one of the package functions, so it needs to be available at all time.
I think the author of that package should really NOT use data(.) for that.
Instead he should define the object inside his /R/ either by simple R code in an R/*.R file,
or by using the sysdata.rda approach that is explained in the famous first reference for all these question,
"Writing R Extensions". In both cases the package author can also export the object which is often desirable for other users as in your case.
Of course this needs a polite conversation between you and the package author, and will only apply to the next version of that package.
I'm going to post this since it seems to work for my use case.
.onLoad() is:
function(lib,pkg)
data(mydata, package=pkg,
environment=parent.env(environment()))
Also need Imports: utils in DESCRIPTION and importFrom(utils, data) in NAMESPACE in order to pass R CMD check.
In my case I don't need the data object to be visible to the user, I need it to be visible to one of the functions in the package. If you need it visible to the user, that's going to be even harder (I think) because as far as I can tell you can't export data, just functions. The only way I've thought of to export data is to export a wrapper function for the data.

Resources