When should data go in /data, and when should it go in /inst/extdata?

The Writing R Extensions manual states:
The data subdirectory is for data files, either to be made available via lazy-loading or for loading using data(). (The choice is made by the ‘LazyData’ field in the DESCRIPTION file: the default is not to do so.) It should not be used for other data files needed by the package, and the convention has grown up to use directory inst/extdata for such files.
But it is still not clear what data is "required" by a package. I would like to use data for the following (not always mutually exclusive) reasons:
documentation
function examples
function tests
vignettes
to provide access to an original data set
to make data available to functions within the package (e.g. a lookup table / dictionary)
But it is not clear which of these should go in the data folder, and which should go in inst/extdata. And are there any conditions under which "data" should go elsewhere?
Related questions: Previous questions (e.g. inst and extdata folders in R Packaging and Using inst/extdata with vignette during package checking R 2.14.0) give some instructions on use, but don't tell me how to decide which directory to use. Another question, R - where should I place RDA file - /R, /data, /inst/extdata?, gets the closest, but seems to focus specifically on RDA and RData files.

The data directory supplies data for the data() function and is expected to follow certain customs in terms of file formats and extensions.
The inst/extdata directory becomes extdata/ when the package is installed. It is more of a wild west: you can store whatever you want there, and you are expected to write your own accessors.
It may be useful to look at empirics. On my machine, of some 240 installed packages, 77 (not quite a third) have data/, but only 4 (including one of mine) have extdata/.
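The head count above can be reproduced on any machine with a few lines. This is only a rough sketch: has_dir is a hypothetical helper name, and it relies on system.file() returning an empty string when the requested directory is absent from the installed package. The numbers will of course differ per installation.

```r
# TRUE when the installed package contains the given top-level directory.
# (has_dir is a hypothetical helper name, not part of any package.)
has_dir <- function(pkg, dir) nzchar(system.file(dir, package = pkg))

pkgs <- rownames(installed.packages())
with_data    <- vapply(pkgs, has_dir, logical(1), dir = "data")
with_extdata <- vapply(pkgs, has_dir, logical(1), dir = "extdata")

# Named vector: total packages, and how many ship data/ and extdata/.
c(installed = length(pkgs),
  data      = sum(with_data),
  extdata   = sum(with_extdata))
```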

Related

Renaming files in RStudio that have been sourced other places

I have a few R files that contain functions imported and used by several other R files. I import these functions with the source function. Naturally, the scope of a particular file might change over time, and recently I wanted to rename a file I had already sourced in many other places.
I'm using RStudio, and I have been unable to find a way to do this except for either manually updating each dependent file, or creating some external code to scan through the files.
Is there no way to do consistent renaming in RStudio? Alternatively, am I doing something wrong by using source to add functions?
You may or may not find this satisfactory. Create a parent script with the old name that sources the script with the new name.
Extending this, you could just create a general preamble script, called something like "preamble.R", that sources all general utility scripts you have. Such an approach is common (I believe) with TeX. Then you only have one place to update file names.
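A minimal sketch of the preamble idea follows. The file names (math_utils.R, preamble.R) and the function add_one are made up for illustration, and the scripts are written to a temporary directory only so that the example is self-contained.

```r
# Hypothetical project folder; a temp dir keeps the sketch self-contained.
script_dir <- file.path(tempdir(), "scripts")
dir.create(script_dir, showWarnings = FALSE)

# The (renamed) utility script holding the real function definitions.
writeLines("add_one <- function(x) x + 1",
           file.path(script_dir, "math_utils.R"))

# preamble.R is the only file that names the utility scripts.
writeLines('source("math_utils.R")',
           file.path(script_dir, "preamble.R"))

# Each analysis script sources only the preamble. chdir = TRUE makes the
# relative path inside preamble.R resolve against the preamble's own folder.
source(file.path(script_dir, "preamble.R"), chdir = TRUE)

add_one(41)
```

If math_utils.R is renamed again, only the single line inside preamble.R has to change.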

Example input data with example output using relative pathway in vignette of R package?

I'm putting together an R package. I would like to show example code in the vignette, where example data files (included in the package) are used to generate an (example) output file.
I read about using example data in Hadley Wickham's post (http://r-pkgs.had.co.nz/data.html), and believe I should keep my example data as raw data, as it must be parsed to generate the output.
So, I created a directory in my package structure
/Users/userName/myPackage/inst/extdata/
with subdirectories InputFiles and OutputFiles.
And I put the example file (exampleData.csv) inside of the InputFiles subdirectory (/Users/userName/myPackage/inst/extdata/InputFiles).
My vignette is located in:
/Users/userName/myPackage/vignettes/myPackage.Rnw
It contains the following syntax:
<<eval=FALSE>>=
fileString = "/Users/userName/myPackage/inst/extdata/InputFiles/exampleData.csv"
doFunction1(fileString)
doFunction2(fileString)
doFunction3(fileString, output = "/Users/userName/myPackage/inst/extdata/OutputFiles")
@
I am having two problems with developing this vignette and its example datasets:
1) I am unsure if my use of the extdata file is appropriate. This seemed to be the best directory name and location to place my example files, according to the aforementioned Hadley Wickham reference.
2) I am unsure how to make the pathways relative, instead of absolute, as I have them currently. This example code does not run automatically, as you can see. Instead, I have it under an R chunk of eval=FALSE so that it is simply listed there for the users to test themselves. After running the example code, the users can also check that the output file was indeed created in (/Users/userName/myPackage/inst/extdata/OutputFiles). What is the best way for me to allow the user to not have to use an absolute path when following the example? Is it possible to just follow a relative path from within the package directory myPackage?
My data files consist of .csv, .htm, and .text files. In the past, when constructing a package, I have saved a data frame as .rda file, and then the user could simply use:
data(example.rda)
to read that file. They would not have to write the entire pathway. Is there a similar function that can be used to read .csv, .html, and .text files, and then output them to an example output location - without having to use the full pathway? Would it be possible to have help functions that also read in the input files and write to the output files? Would this cause a conflict in CRAN if various example help functions in the /man folder physically save the example output file to the example output folder?
The standard way to refer to a file in a package is:
# gives root package directory
system.file(package="myPackage")
# specific file
system.file("extdata/InputFiles/exampleData.csv", package="myPackage")
# best is to use cross-platform way to write a file path:
system.file("extdata", "InputFiles", "exampleData.csv", package="myPackage")
When developing with devtools, system.file() is shimmed so that files under inst/ in the source tree are found as well, so you never need to worry about absolute paths. This should work in a vignette. Note that a vignette, I think, only ever uses the installed version of a package, not the one you may have loaded in your development environment (specifically, devtools::load_all() does not change the code that is used to build the vignette; you must install() the package first).
Finally, using data() is a bit old-fashioned. Hadley and others recommend using lazy data, so the datasets in data/ become available automatically when the package is attached. Try the following in your DESCRIPTION:
LazyData: true
LazyDataCompression: xz
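For the output side of the question, one option is to write example output to the session's temporary directory rather than into the package tree, since CRAN policy does not allow examples to write into the installed package or the user's files. Here is a rough sketch of what the vignette chunk could look like. The DESCRIPTION file of the base package stats is only a stand-in input (it exists in every R installation), and the output file name is made up. In your vignette you would use the system.file("extdata", "InputFiles", "exampleData.csv", package="myPackage") call from above instead.

```r
# Stand-in input: in the real vignette this would be
# system.file("extdata", "InputFiles", "exampleData.csv", package = "myPackage").
input_path <- system.file("DESCRIPTION", package = "stats", mustWork = TRUE)

# Write example output under tempdir(), never into the package itself.
output_dir <- file.path(tempdir(), "OutputFiles")
dir.create(output_dir, showWarnings = FALSE)
output_path <- file.path(output_dir, "exampleOutput.csv")  # hypothetical name

# Parse the input and save a result, standing in for doFunction1() etc.
parsed <- as.data.frame(read.dcf(input_path), stringsAsFactors = FALSE)
write.csv(parsed, output_path, row.names = FALSE)

# Readers can inspect the result wherever their tempdir() happens to be.
file.exists(output_path)
```

Because neither path is hard-coded, the chunk no longer needs eval=FALSE.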

Retrieve path of supplementary data file of developed package

While developing a package I encountered the problem of supplementary data import - this has been 'kind of' solved here.
Nevertheless, I need to make use of a function of another package, which needs a path to the used file. Sadly, using GlobalEnvironment variables here is not an option.
[By the way: the file needs to be .txt, while supplementary data should be .RData. The function is quite picky.]
So I need to know how to get the path to a supplementary data file of a package. Is this even possible?
I had the idea of reading the .RData into the global environment and then saving it into a tmpfile for further processing. I would really like to know a clean way - the supplementary data is ~100MB large...
Thank you very much!
Use system.file() to reliably find the path to the installed package and its sub-directories. Typically the files are created in your-pkg-source/inst/extdata/your-file.txt and then referenced as
system.file(package="your-pkg", "extdata", "your-file.txt")
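One thing to watch for when handing the result to another package's function: by default system.file() returns an empty string when the file is not found, and the picky function may then fail with a confusing message. A small sketch of the difference, using the base package stats as a stand-in for your-pkg (the file no-such-file.txt is deliberately non-existent):

```r
# A file that exists in every R installation, standing in for
# system.file("extdata", "your-file.txt", package = "your-pkg").
found <- system.file("DESCRIPTION", package = "stats")

# A file that does not exist: the default is to return "" silently.
not_found <- system.file("extdata", "no-such-file.txt", package = "stats")

# With mustWork = TRUE the same lookup raises an error instead,
# captured here as a message so the sketch keeps running.
failed <- tryCatch(
  system.file("extdata", "no-such-file.txt", package = "stats",
              mustWork = TRUE),
  error = function(e) conditionMessage(e)
)

file.exists(found)
nzchar(not_found)
```

Passing mustWork = TRUE is probably the safer choice inside package code, because the failure then happens at the lookup rather than somewhere downstream.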

Trouble including data in package

I've created my first package, hosted on github. I'm trying to include three data frames in the package. Per the Writing R Extensions guide, I saved each data frame as a separate .RData file in the data subdirectory. However, I can't seem to access the data when I load the package.
When I install the package from a clean R session using
require(devtools)
install_github("Lloyd.et.al.Cell.abundance.metaanalysis", "adsteen")
the package functions, documentation, and vignette seem to load correctly. Better yet,
data(package="Lloyd.et.al.Cell.abundance.metaanalysis")
shows the three data frames that are encoded as .RData files, named all_data, corrected_seds, and corrected_sw.
The problem is, I can't seem to actually access the data. I have LazyData: true at line 21 in the DESCRIPTION file, so I would expect head(all_data) to show the data frame, but it returns an error: object 'all_data' not found. I can't seem to find a way to use load() to load the data either.
What am I doing wrong?

How to point to a directory in an R package?

I am making my first attempts to write an R package. I am loading one csv file from my hard drive, and I am hoping to bundle my R code and my csv files into one package later.
My question is: how can I load my csv file once my package is built? Right now my file address is something like c:\R\mydirectory....\myfile.csv, but after I send it to someone else, how can I have a relative address to that file?
Feel free to correct this question if it is not clear to others!
You can put your csv files in the data directory or in inst/extdata.
See the Writing R Extensions manual - Section 1.1.5 Data in packages.
To import the data you can use, e.g.,
R> data("achieve", package="flexclust")
or
R> read.table(system.file("data/achieve.txt", package = "flexclust"))
Look at the R help for package.skeleton: this function
automates some of the setup for a new source package. It creates directories, saves functions, data, and R code files to appropriate places, and creates skeleton help files and a ‘Read-and-delete-me’ file describing further steps in packaging.
The directory structure created by package.skeleton includes a data directory. If you put your data here it will be distributed with the package.
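To see the two routes side by side without installing flexclust, here is a sketch that uses only files shipped with base R. The mtcars dataset from the datasets package stands in for a dataset in your own data/ directory, and exDIF.csv from the misc directory of the utils package stands in for a raw csv file under inst/extdata.

```r
# Route 1: a dataset shipped in data/, loaded by name with data().
# Loading into a fresh environment avoids cluttering the workspace.
env <- new.env()
data("mtcars", package = "datasets", envir = env)
head(env$mtcars, 3)

# Route 2: a raw file located with system.file() and parsed by hand.
# No absolute path appears anywhere, so it works on any machine.
csv_path <- system.file("misc", "exDIF.csv", package = "utils",
                        mustWork = TRUE)
raw_df <- read.csv(csv_path, header = TRUE)
str(raw_df)
```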
