R function example requires nonstandard dataset, doesn't jibe with devtools

I've been struggling to get the example code for a function to pass devtools::check(), because the data required for the example is not in .RData format. Unfortunately, given the way the function is written, an .RData file cannot simply be loaded and still have the function work properly: the function takes in a list of filenames and performs an action on them collectively.
Therefore, example code must be written in a way that check() is able to access a folder and list the files therein. Using the function on my own computer, I input
setwd("/Users/mydirectory")
myfilelist <- list.files(pattern = "mypattern")
output <- myfunction(myfilelist, ...)
and everything is groovy. But this doesn't work with devtools because @examples doesn't know how to access subdirectories on my computer. check() throws the following error:
base::assign(".ptime", proc.time(), pos = "CheckExEnv")
This is almost undoubtedly because check() doesn't know where to look for the data. I'd like it to look toward github to access the online data repository.
I found this brief conversation regarding a similar roxygen-related problem, but overall I haven't seen much advice on how to work through it. I think that perhaps this issue starts to get a little closer to my situation, but here the user failed to export a function, rather than bind data to an example.
I don't think I'm looking for a pull function (though the end goal is to pull data...), does anyone have advice moving forward? I have the data stored in the inst/extdata folder on github, so while I don't really have something reproducible for you all I'm hoping you might have some thoughts.
Edit: I worked around the problem using @alistaire's advice below, pointing the roxygen example at the package directory (updated on GitHub) and also using \dontrun{}. However, I am leaving the question unanswered for now because I think accessing data stored on GitHub should still be possible somehow, and we haven't yet addressed that.
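For reference, a rough sketch of what such a workaround can look like in the roxygen block. The package name yourpackage is a placeholder, and the pattern/function names are the ones from the question; system.file() resolves the path to inst/extdata in the installed package at run time.
#' @examples
#' \dontrun{
#' # point the example at the files shipped in inst/extdata of the installed package
#' extdata_dir <- system.file("extdata", package = "yourpackage")
#' myfilelist  <- list.files(extdata_dir, pattern = "mypattern", full.names = TRUE)
#' output      <- myfunction(myfilelist)
#' }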


Automatic loading of data from sysdata.rda in package

I have spent a lot of time searching for an answer to what is probably a very basic question, but I just can't find the solution to my issue. The closest that I found was this exchange from a few years ago.
In that case, the issue was the location of the sysdata.rda file in the correct directory within the package. That is not my issue.
I have some variables that store things like color palettes that I am using inside a package. These variables are only used inside my functions, so I am storing them in R/sysdata.rda. However, when I load the package, the variables are not loaded into the package environment. If I load the data manually from sysdata.rda then everything works fine.
My impression from reading everything that I could find on internal data in R packages was that the data in R/sysdata.rda would load automatically.
Here is the code that I am using to store my data.
devtools::use_data(tmpBrks, tmpColors, prcpBrks, prcpChgBrks,
                   prcpChgBrkLabels, prcpColors, prcpChgColors,
                   internal = TRUE, overwrite = TRUE)
That successfully creates the data file at R/sysdata.rda and the data is in the file when I load it manually.
What do I need to do to have the data load automatically so the functions in my package can use them?
As usual, this was a bad combination of user ignorance and poor R documentation. The data was being loaded and was available to the functions. Where I went wrong was in assuming that the data would be visible in the package environment. That is not the case.
As far as I can tell, internal data in the R/sysdata.rda file is available to the functions within the package, but not visible in any other way. After I created the internal data file I was looking for the data in the package environment. When I didn't see it I assumed that it wasn't loaded. As I kept pushing forward with my package development I finally realized that the data was loading silently and was accessible to the functions in the package.
As evidenced by the two upvotes that my question got, I am not the only one who didn't understand the behavior of the R/sysdata.rda internal data. Hopefully this explanation will save someone else a bunch of time searching for an answer to this issue that doesn't really exist.
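For illustration, a minimal sketch of how internal data is typically consumed by package code. It borrows one of the object names from the question; the function itself is hypothetical.
# R/plot_helpers.R -- lives inside the package source
# tmpColors comes from R/sysdata.rda: it is available to package functions
# without any explicit load(), but is never visible in the user's workspace.
plot_temperature <- function(x) {
  plot(x, col = tmpColors, type = "h")
}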

how to use utils::globalVariables

Following your recommendations (or trying to follow them, at least), I have tried some options, but the problem remains, so there must be something I am missing.
I have included more complete code below.
setwd("C:/naapp")
#' @import utils
#' @import devtools
I have tried with and without using suppressForeignCheck
if(getRversion() >= "2.15.1"){
  utils::globalVariables(c("eleven"))
  utils::suppressForeignCheck(c("eleven"))
}
myFunctionSum <- function(X){print(X + eleven)}
myFunctionMul <- function(X){print(X * eleven)}
myFunction11 <- function(X){
  assign("eleven", 11, envir = environment(myFunctionMul))
}
maybe I should use a particular environment?
package.skeleton(name = "myPack11", list = ls(),
                 path = "C:/naapp", force = TRUE,
                 code_files = character())
I remove the "man" directory from the directory myPack11,
otherwise I would get an error because the help files are empty.
I add the imports utils and devtools to the DESCRIPTION file.
Then I run check
devtools::check("myPack11")
And I still get this note
#checking R code for possible problems ... NOTE
#myFunctionMul: no visible binding for global variable 'eleven'
#myFunctionSum: no visible binding for global variable 'eleven'
#Undefined global functions or variables:eleven
I have also tried making an environment, combining Tomas Kalibera's suggestion with an example I found on the Internet.
myEnvir <- new.env()
myEnvir$eleven <- 11
etc
In this case, I get the same note, but with "myEnvir", instead of "eleven"
First version of the question
I am trying to use "globalVariables" from the package utils. I am building an interface in R and I am planning to submit it to CRAN. This is my first time, so sorry if the question is very basic. I have read the help and I have tried to find examples, but I still don't know how to use it.
I have made a little silly example to illustrate my question, which is:
Where do I have to place this line exactly?:
if(getRversion() >= "2.15.1"){utils::globalVariables("eleven")}
My example has three functions. myFunction11 creates the global variable "eleven" and the other two functions manipulate it. In my real code, I cannot use arguments in the functions that are called by means of a button. Consider that this is just a silly example to learn how to use globalVariables (to avoid binding notes).
myFunction11 <- function(){
  assign("eleven", 11, envir = environment(myFunctionSum))
}
myFunctionSum <- function(X){
  print(X + eleven)
}
myFunctionMul <- function(X){
  print(X * eleven)
}
Thank you in advance
I thought that the file globals.R would be automatically generated when using globalVariables. The problem was that I needed to create the package skeleton first, then create the file globals.R myself, add it to the R directory of the package, and check the package.
So, I needed to place this in a different file:
#' @import utils
utils::globalVariables(c("eleven"))
and save it
The documentation clearly says:
## In the same source file (to remind you that you did it) add:
if(getRversion() >= "2.15.1") utils::globalVariables(c(".obj1", "obj2"))
so put it in the same source file as your functions. It can go in any of your R source files, but the comment above recommends you put it close to your code. Looking at a number of GitHub packages reveals another common pattern: a globals.R file that contains nothing but this call. That is probably a bad idea, though; if you later remove the global from your package but neglect to update globals.R, you could mask a problem. Putting it right next to the functions that use it will hopefully remind you to update it when you edit those functions.
Make sure you put it outside any function definitions in the file, or it won't get seen.
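A minimal sketch of that layout, reusing the eleven example from the question (the file name is arbitrary):
# R/eleven.R
# top-level call, outside any function, in the same file as the code that
# uses the variable; it only silences the NOTE, it does not create 'eleven'
if (getRversion() >= "2.15.1") utils::globalVariables("eleven")

myFunctionSum <- function(X) {
  print(X + eleven)
}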
You cannot modify bindings in a package namespace once the package is loaded (and namespace sealed, and bindings locked). The check tool helps you to spot violations of this restriction, so you find out about the problem when checking the package rather than while running it. globalVariables is just a call to silence check when looking for these violations, which is undesirable in almost all cases. If you really need mutable state in a package, you can create a new environment (using new.env) and bind it to an (unexported) "global" variable in your namespace. This binding will be locked, but this is ok, because in R you can change an environment in place (add/remove elements, effectively modifying the elements).
The best situation is however when you can keep all mutable state in user objects (passed in as arguments into functions, and their modified versions returned as output values of functions).
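A minimal sketch of that environment-based pattern, again using the eleven example (all names are illustrative):
# R/state.R -- an unexported environment that holds the package's mutable state
.pkg_state <- new.env(parent = emptyenv())

set_eleven <- function(value = 11) {
  # the binding .pkg_state is locked when the namespace is sealed, but the
  # environment it points to can still be modified in place
  .pkg_state$eleven <- value
  invisible(value)
}

myFunctionSum <- function(X) {
  print(X + .pkg_state$eleven)
}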

How to use/install merge method for data.sets from memisc package?

I have two data.sets (from the memisc package) all set for merge, and the merge goes through without error or warning, but the output is a data.frame, not a data.set. The command is:
datTS <- merge(datT1, datT2, by.x="ryear", by.y="ryear")
(Sorry I don't have a more convenient example with toy data handy.) The following pages seem to make it very clear that there should be a method built into memisc that properly merges the data.sets into one data.set:
http://rpackages.ianhowson.com/rforge/memisc/man/dataset-manip.html
https://github.com/melff/memisc/blob/master/pkg/R/dataset-methods.R
...but it just doesn't seem to be properly triggering on my machine (sorry also for my clumsy lingo). Note the similarity of my code and the example code from the very end of the first page I linked:
ds6 <- merge(ds1,ds5,by.x="a",by.y="c")
I've verified that I have the most recent versions of R, RStudio, memisc, and all dependencies. I've used a number of other memisc methods so far (within, transform, missing.values, etc.) without issue.
So my question is: what else does one need to do to get the merge function to properly produce a data.set when the source data are in data.set form, as per the memisc package? (There's no explicit addressing of this merge capability in the official package documentation.) Since the code in the second link above seems to provide the method for this, is there some workaround, at least, for installing and utilizing that code? Maybe there's just some separate "methods installation" I'm not aware of (but why would it be separate from the main package?).
The help page for pkg:memisc in the released version 0.97 does not describe a merge function method for data.sets. You are pointing us to the github version which may not be the one that has been released. You need to install the github version. See: https://github.com/melff/memisc/releases
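A rough sketch of that install step; note the linked repository keeps the package sources in a pkg/ subdirectory, so the subdir argument is presumably needed:
# install the development version of memisc straight from GitHub
devtools::install_github("melff/memisc", subdir = "pkg")
library(memisc)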

Using glmm_funs.R?

I am fitting a GLMM and I have seen some examples that use the function overdisp_fun, defined in glmm_funs.R, but I don't know which package contains it or how I can call it from R. Can somebody help me?
Thanks,
If you google for glmm_funs.R, you'll find links to the script (eg here: http://glmm.wdfiles.com/local--files/trondheim/glmm_funs.R).
You can save the file on your local machine, then call it in your R session with source("path to file/glmm_funs.R").
You will then be able to use the functions contained in the script, including overdisp_fun().
You can think of it a little bit like loading a package, except the functions are just presented in a script.
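Putting those steps together; the URL is the one from the link above, and the object passed to overdisp_fun() is assumed to be a fitted GLMM (e.g. from lme4::glmer):
# download the script once, then source it into the current session
download.file("http://glmm.wdfiles.com/local--files/trondheim/glmm_funs.R",
              destfile = "glmm_funs.R")
source("glmm_funs.R")

# the functions defined in the script are now available, e.g.
# overdisp_fun(fitted_glmer_model)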

Cache expensive operations in R

A very simple question:
I am writing and running my R scripts using a text editor to make them reproducible, as has been suggested by several members of SO.
This approach is working very well for me, but I sometimes have to perform expensive operations (e.g. read.csv or reshape on 2M-row databases) that I'd better cache in the R environment rather than re-run every time I run the script (which is usually many times as I progress and test the new lines of code).
Is there a way to cache what a script does up to a certain point so every time I am only running the incremental lines of code (just as I would do by running R interactively)?
Thanks.
## load the file from disk only if it
## hasn't already been read into a variable
if(!exists("mytable")){
  mytable <- read.csv(...)
}
Edit: fixed typo - thanks Dirk.
Some simple approaches are doable with combinations of:
exists("foo") to test whether a variable already exists, and re-load or re-compute it if not;
file.info("foo.Rd")$ctime, which you can compare to Sys.time() to see whether the cached file is newer than a given age: if it is, load it, else recompute (see the sketch below).
There are also caching packages on CRAN that may be useful.
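A rough sketch of the second idea above; the cache file name and the one-hour threshold are arbitrary choices, and saveRDS()/readRDS() stand in for whatever serialization you prefer:
cache_file <- "mytable.rds"
max_age    <- 60 * 60  # one hour, in seconds

if (file.exists(cache_file) &&
    difftime(Sys.time(), file.info(cache_file)$ctime, units = "secs") < max_age) {
  mytable <- readRDS(cache_file)          # cache is fresh enough: reuse it
} else {
  mytable <- read.csv("mytable.csv")      # otherwise recompute ...
  saveRDS(mytable, cache_file)            # ... and refresh the cache
}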
After you do something you discover to be costly, save the results of that costly step in an R data file.
For example, if you loaded a csv into a data frame called myVeryLargeDataFrame and then created summary stats from that data frame into a df called VLDFSummary then you could do this:
save(myVeryLargeDataFrame, VLDFSummary,
     file="~/myProject/cachedData/VLDF.RData",
     compress="bzip2")
The compress option there is optional and to be used if you want to compress the file being written to disk. See ?save for more details.
After you save the RData file you can comment out the slow data loading and summary steps as well as the save step and simply load the data like this:
load("~/myProject/cachedData/VLDF.RData")
This answer is not editor dependent. It works the same for Emacs, TextMate, etc. You can save to any location on your computer. I recommend keeping the slow code in your R script file, however, so you can always know where your RData file came from and be able to recreate it from the source data if needed.
(Belated answer, but I began using SO a year after this question was posted.)
This is the basic idea behind memoization (or memoisation). I've got a long list of suggestions, especially the memoise and R.cache packages, in this query.
You could also take advantage of checkpointing, which is also addressed as part of that same list.
I think your use case mirrors my second: "memoization of monstrous calculations". :)
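For illustration, a minimal sketch using the memoise package (the file name is a placeholder):
library(memoise)

# wrap the expensive loader; repeated calls with the same argument
# return the cached result instead of re-reading the file
read_big_csv <- memoise(function(path) read.csv(path))

dat <- read_big_csv("big_table.csv")   # slow the first time
dat <- read_big_csv("big_table.csv")   # served from the in-memory cache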
Another trick I rely on heavily is memory-mapped files for storing data. The nice thing about this is that multiple R instances can access the shared data, so I can have many instances working on the same problem.
I want to do this too when I'm using Sweave. I'd suggest putting all of your expensive functions (loading and reshaping data) at the beginning of your code. Run that code, then save the workspace. Then, comment out the expensive functions, and load the workspace file with load(). This is, of course, riskier if you make unwanted changes to the workspace file, but in that event, you still have the code in comments if you want to start over from scratch.
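In code, that workflow is roughly the following (file names are placeholders):
# first run: do the expensive loading/reshaping, then snapshot the workspace
big_data <- read.csv("huge_file.csv")
save.image("cached_workspace.RData")

# later runs: comment out the expensive lines above and restore the snapshot
# load("cached_workspace.RData")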
Without going into too much detail, I usually follow one of three approaches:
Use assign to assign a unique name for each important object throughout my execution. Then include an if(exists(...)) get(...) check at the top of each function to fetch the cached value or else recompute it (same as Dirk's suggestion; see the sketch after this list).
Use cacheSweave with my Sweave documents. This does all the work of caching computations for you and retrieves them automatically. It's really trivial to use: just use the cacheSweave driver and add this flag to each block: <<..., cache=TRUE>>=
Use save and load to save the environment at crucial moments, again making sure that all names are unique.
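A sketch of the first approach (the exists()/get() guard at the top of a function); expensive_summary_step() is a stand-in for the costly computation:
compute_summary <- function() {
  cache_name <- "summary_table_cache"              # unique name for this result
  if (exists(cache_name, envir = .GlobalEnv)) {
    return(get(cache_name, envir = .GlobalEnv))    # reuse the cached object
  }
  result <- expensive_summary_step()               # hypothetical costly step
  assign(cache_name, result, envir = .GlobalEnv)   # cache it for next time
  result
}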
The 'mustashe' package is great for this kind of problem. In addition to caching the results, it also can include links to dependencies so that the code is re-run if the dependencies change.
Disclosure: I wrote this tool ('mustashe'), though I do not make any financial gains from others using it. I made it for this exact purpose for my own work and want to share it with others.
Below is a simple example. The foo variable is created and "stashed" for later. If the same code is re-run, the foo variable is loaded from disk and added to the global environment.
library(mustashe)
stash("foo", {
foo <- some_long_running_opperation(1e3)
}
#> Stashing object.
The documentation has additional examples of more complex use-cases and a detailed explanation of how it works under the hood.
