Unpacking compressed package file after R package installation

I'd like to include a large data file in my R package. The file is located in the inst directory and is compressed. My goal is a smaller package on our local repository without paying for decompression at attachment time.
Currently, every time my package is attached, it must decompress the file, which takes a few seconds.
Is there a way to decompress this file permanently upon installation of my package?

Don't save in inst/; use usethis to save the data instead
I'd strongly recommend using usethis::use_data() to save data in an R package. use_data() saves to data/ and lets you choose the compression method through its compress argument. For your purpose, I'd suggest compress = "xz" (see the notes on compression in the save documentation).
The other thing to do is set LazyData: false in the DESCRIPTION file; then, when you want to access the data, load it with data("dataname", package = "yourpackage").
See the chapter on Data in the R Packages book. It helps clarify many things.
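To get a feel for the compression options before touching the package, here is a minimal sketch using base save(), which accepts the same compress values. The data frame and file names are made up for illustration; inside a package you would call usethis::use_data(big_data, compress = "xz") from the package root instead.

```r
# Hypothetical stand-in for the large data set.
big_data <- data.frame(
  id    = seq_len(1e5),
  value = rep(c(1.5, 2.5, 3.5, 4.5), 25000)
)

gz_file <- tempfile(fileext = ".rda")
xz_file <- tempfile(fileext = ".rda")

# Same object, two compression methods.
save(big_data, file = gz_file, compress = "gzip")
save(big_data, file = xz_file, compress = "xz")

# On-disk sizes in bytes; the ratio depends entirely on the data.
print(c(gzip = file.size(gz_file), xz = file.size(xz_file)))

# load() detects the compression and restores the object under its original name.
restored <- new.env()
load(xz_file, envir = restored)
print(identical(restored$big_data, big_data))
```

Whether xz is worth it depends on the data: it usually gives smaller files than gzip but takes longer to compress, so it is worth checking both sizes on the real file.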

Related

What's the hash in packrat.lock for?

I'm setting up a git workflow with my R project using packrat. Every time I packrat::snapshot() my workspace, the file packrat.lock changes with the new packages/versions etc., but the Hash line for each package changes too, which is a bit annoying when checking file diffs to see what changed from one commit to another.
Is this Hash really necessary? If not, is there any way to disable it?
The hash is generated by the hidden hash() function in the packrat package, and it serves as a package consistency check.
The algorithm generates an md5sum based on the DESCRIPTION file included in the package tarball, but there is additional logic involved; see lines #103-#107 of packrat/R/cache.R in the source on GitHub.
To obtain the hash that packrat expects to find in the packrat.lock file, you must call the hash() function directly. This function is not exposed in the compiled package, so the only way to access it is to use the packrat source.
Obtain a copy of the source of the packrat library from CRAN with the correct version
Extract it into a folder (in my example it is packrat-0.5.0)
Start an R session
The following lines demonstrate how to generate the hash for the package BH-1.66.0-1 (4cc8883584b955ed01f38f68bc03af6d):
# md5sum() function is needed
library(tools)
# relevant source code files are loaded
source('packrat-0.5.0/R/utils.R') # readDcf() function
source('packrat-0.5.0/R/cache.R') # packrat's hash() function
# execute the hash() function on the DESCRIPTION file in the package
print(hash('/usr/local/lib/R/site-library/BH/DESCRIPTION'))
This should return the correct HASH of 4cc8883584b955ed01f38f68bc03af6d.
I am not aware of any options in packrat that would allow you to disable HASH checking. If your goal is to manually modify the packrat.lock file to alter a package version, it is certainly possible by performing this trick.
This could help overcome some minor dependency issues. However, there are two dangers:
such a package version change may start a cascade of dependency upgrade requirements
errors appear in your app because of compatibility issues
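If the goal is only to keep the Hash lines out of the way when comparing two snapshots, one option is to filter them out before comparing. The sketch below does not disable the hash; it assumes each hash sits on its own line starting with "Hash:", and both lockfile snippets are made up for illustration.

```r
# Two made-up lockfile snapshots that differ only in the Hash line.
old_lock <- c(
  "PackratFormat: 1.4",
  "PackratVersion: 0.5.0",
  "",
  "Package: BH",
  "Source: CRAN",
  "Version: 1.66.0-1",
  "Hash: 4cc8883584b955ed01f38f68bc03af6d"
)
new_lock <- sub("^Hash: .*", "Hash: 00000000000000000000000000000000", old_lock)

# Drop every line that starts with "Hash:" (assumed layout, see above).
strip_hash <- function(lines) lines[!grepl("^Hash:", lines)]

# The raw files differ, but nothing meaningful changed once hashes are ignored.
print(identical(old_lock, new_lock))
print(identical(strip_hash(old_lock), strip_hash(new_lock)))
```

With real files you would pass readLines("packrat.lock") for each revision; the committed lockfile itself stays untouched.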

Load my own R package

I made an R package for personal use, but the way I load it is by individual files. Such as:
source("../compr/R/compr.R")
source("../compr/R/error_df.R")
source("../compr/R/rmse.R")
I would like to load the entire package, which is called compr, as I would other libraries.
If you are using RStudio, I would suggest creating a project and setting it to your compr directory. After that, you will be able to use devtools::load_all() to load your package directly.
If you don't want to do this, or you don't use RStudio, devtools::load_all('path/to/compr') will also work.
P.S. The compr directory needs to be the root of the package, i.e. the place where your DESCRIPTION file is.
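If devtools is not available, a crude fallback is to source every file in the R/ directory in one loop instead of naming each file. This is only a sketch: unlike load_all(), it ignores DESCRIPTION, NAMESPACE and any compiled code. The directory and function bodies below are made up so the example runs on its own.

```r
# Throwaway directory standing in for ../compr/R (contents are hypothetical).
r_dir <- file.path(tempfile("compr_"), "R")
dir.create(r_dir, recursive = TRUE)
writeLines("rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))",
           file.path(r_dir, "rmse.R"))
writeLines("error_df <- function(obs, pred) data.frame(error = obs - pred)",
           file.path(r_dir, "error_df.R"))

# Source every .R file in the directory rather than listing them one by one.
r_files <- list.files(r_dir, pattern = "\\.[Rr]$", full.names = TRUE)
for (f in r_files) source(f)

print(rmse(c(1, 2, 3), c(1, 2, 5)))
```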

Submitting a package to CRAN: .tar.gz file

I have a package I am ready to submit to CRAN (everything checks out). However, in the spot where it says Choose File, I am unsure what file to choose, as it says it requires a .tar.gz file, which I gather is some kind of compressed file?
Do I need to compress everything into a .tar.gz file? If so, how?
If not, I have a .Rproj file, and various files like namespace and description and license, so it is unclear to me which file to submit.
I apologize if this is a simple question, this is my first package to be submitted.
You have two options here. Use R's command-line tool from a shell:
$ R CMD build /path/to/package/directory
Or use devtools::build from within R:
R> devtools::build("path/to/package/directory")
Both result in a .tar.gz file on your local file system. The name will look like: mypackage_[Version].tar.gz
It is this file that you upload to CRAN.
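The tarball name is made up of the Package and Version fields of the DESCRIPTION file, so you can work out which file to look for before building. A small sketch, using a made-up DESCRIPTION written to a temporary directory:

```r
# Hypothetical package directory with a minimal DESCRIPTION file.
pkg_dir <- tempfile("mypackage_")
dir.create(pkg_dir)
writeLines(
  c("Package: mypackage", "Version: 0.1.0", "Title: Example Package"),
  file.path(pkg_dir, "DESCRIPTION")
)

# Read the two fields that make up the tarball name.
desc <- read.dcf(file.path(pkg_dir, "DESCRIPTION"), fields = c("Package", "Version"))
tarball_name <- paste0(desc[1, "Package"], "_", desc[1, "Version"], ".tar.gz")
print(tarball_name)
```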

What type of object is an R package?

Probably a pretty basic question, but a friend and I tried to run str(package_name) and R threw us an error. Now that I'm looking at it, I'm wondering if an R package is like a .zip file in that it is a collection of objects, say pictures and songs, but not a picture or song itself.
If I tried to open a zip of pictures with an image viewer, it wouldn't know what to do until I unzipped it - just like I can't call str(forecast) but I can call str(ts) once I've loaded the forecast package into my library...
Can anyone set me straight?
R packages are generally distributed as compressed bundles of files. They can be in "binary" form, preprocessed at a repository to compile any C or Fortran source and create the proper headers, or in source form, where the various required files are available to the installation process. Installing from source requires that users have the necessary compilers and tools installed at locations where the R build process, using OS system resources, can get at them.
If you read the documentation for a package at CRAN, you see it is distributed in a set of compressed formats that vary depending on the OS targets:
Package source: Rcpp_0.11.3.tar.gz # the Linux/UNIX targets
Windows binaries: r-devel: Rcpp_0.11.3.zip, r-release: Rcpp_0.11.3.zip, r-oldrel: Rcpp_0.11.3.zip
OS X Snow Leopard binaries: r-release: Rcpp_0.11.3.tgz, r-oldrel: Rcpp_0.11.3.tgz
OS X Mavericks binaries: r-release: Rcpp_0.11.3.tgz
Old sources: Rcpp archive # not really a file but a web link
Once installed an R package will have a specified directory structure. The DESCRIPTION file is a text file with specific entries for components that determine whether the local installation meets the dependencies of the package. There are NAMESPACE, LICENSE, and INDEX files. There are directories named '/help', '/html', '/Meta', '/R', and possibly '/libs', '/demo', '/data', '/unitTests', and others.
This is the tree at the top of the ../library/Rcpp package directory:
$ ls
CITATION NAMESPACE THANKS examples libs
DESCRIPTION NEWS.Rd announce help prompt
INDEX R discovery html skeleton
Meta README doc include unitTests
So in the "life-cycle" of a package, there will initially be a series of required and optional files, which get processed by the BUILD and CHECK mechanisms into an installed package, which can then be compressed for distribution and later unpacked into a specified directory tree on the user's machine. See these help pages:
?.libPaths # also describes .Library()
?package.skeleton
?install.packages
?INSTALL
And of course read Writing R Extensions, a document that ships with every installation of R.
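You can look at this directory structure from inside R. The sketch below uses the stats package only because it ships with every R installation; any installed package name works the same way.

```r
# Locate the installed package directory inside the library tree.
pkg_dir <- find.package("stats")
print(pkg_dir)

# Top-level files and directories of the installed package.
print(list.files(pkg_dir))

# Fields parsed from the installed DESCRIPTION file.
desc <- utils::packageDescription("stats")
print(desc$Package)
print(desc$Version)
```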
Your question is:
What type of object is an R package?
Somehow, I’m still missing an answer to this exact question. So here goes:
As far as R is concerned, an R package is not an object. That is, it’s not an object in R’s type system. R is being a bit difficult, because it allows you to write
library(pkg_name)
Without requiring you to define pkg_name anywhere prior. In contrast, other objects which you are using in R have to be defined somewhere – either by you, or by some package that’s loaded either explicitly or implicitly.
This is unfortunate, and confuses people. Therefore, when you see library(pkg_name), think
library('pkg_name')
That is, imagine the package name in quotes. This does in fact work just as expected. The fact that the code also works without quotes is a peculiarity of the library function, known as non-standard evaluation. In this case, it’s mostly an unfortunate design decision (but there are reasons).
So, to repeat the answer: a package isn’t a type of R object [1]. For R, it’s simply a name which refers to a known location in the file system, similar to what you’ve assumed. BondedDust’s answer goes into detail to explain that structure, so I shan’t repeat it here.
[1] For super technical details, see Joshua’s and Richard’s comments below.
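A short sketch of the quoted form, plus the character.only argument that library() offers for the case where the package name really is stored in a variable. The tools and stats packages are used only because they ship with R.

```r
# Quoted form: the package name is an ordinary string.
library("tools")

# When the name is held in a variable, ask library() to evaluate it.
pkg_name <- "stats"
library(pkg_name, character.only = TRUE)

# Both packages now appear on the search path.
print(c("package:tools", "package:stats") %in% search())
```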
From R's own documentation:
Packages provide a mechanism for loading optional code, data and
documentation as needed.…A package is a directory of files which
extend R, a source package (the master files of a package), or a
tarball containing the files of a source package, or an installed
package, the result of running R CMD INSTALL on a source package. On
some platforms (notably OS X and Windows) there are also binary
packages, a zip file or tarball containing the files of an installed
package which can be unpacked rather than installing from sources. A
package is not a library.
So yes, a package is not the functions within it; it is a mechanism to have R be able to use the functions or data which comprise the package. Thus, it needs to be loaded first.
I am reading Hadley's book Advanced-R (Chapter 6.3 - functions, p.79) and this quote will cover you I think:
Every operation is a function call
“To understand computations in R, two slogans are helpful:
Everything that exists is an object.
Everything that happens is a function call.”
— John Chambers
According to that, library(name_of_library) is a function call that loads the package. Everything that has been loaded, i.e. functions or data sets, is an object which you can use by calling other functions. In that sense, a package is not an object in any of R's environments until it is loaded. Then you can say that it is a collection of the objects it contains and which are loaded.
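One way to see this distinction is to look at what an attached package becomes on the search path: an environment holding the package's objects, which str() and friends can then inspect. A minimal sketch, using stats because it is always installed:

```r
library(stats)

# The attached package is an environment on the search path.
pkg_env <- as.environment("package:stats")
print(is.environment(pkg_env))

# The environment contains ordinary objects such as the ts() function.
print(head(ls(pkg_env)))
str(get("ts", envir = pkg_env))
```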

How can I read gzip compressed grib files in R?

I am trying to open multi-sensor precipitation data from EUMETSAT in R. I can get these data only with GZIP compression, and the data format is GRIB. When I download the data, I get a tar file.
How can I open these data in R?
I tried to use code
> untar("1098496-1of1")
but got error message
Error in gzfile(path.expand(tarfile), "rb") : cannot open the connection
In addition: Warning message:
In gzfile(path.expand(tarfile), "rb") :
cannot open compressed file '1098496-1of1', probable reason 'No such file or directory'
But when I use the following code:
> dir.create("rainfalldataeumetstatR")
> getwd()
[1] "C:/Users/st/Documents"
> untar("1098496-1of1.tar")
> untar("1098496-1of1.tar", files="rainfalldataeumetstatR")
> list.files("rainfalldataeumetstatR")
I don't get any files in my directory; the output is:
character(0)
Maybe the error appears because the files in the tar archive are gz archives?
I, too, have grappled with opening GRIB files in R. You have several problems and can tackle them one by one.
For the untar and gzip issues, work from the command line. I don't know how the tar package is built/packaged from Eumetsat; does it create a directory and put all the data files in that directory? In that case, put the tarball in a top-level data directory and then
tar xvf tar_file_name
cd (to the directory that was just created)
gunzip *.gz
Note down the full path name of the files you will want to open for later use.
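The same two steps can also be done from within R. Note that in untar() the exdir argument sets the destination directory, while files only selects which archive members to extract, which is likely why the directory in the question stayed empty. The sketch below builds a tiny stand-in archive first so that it runs on its own; the file names and bytes are made up, and gunzip_file is a hypothetical helper written here with base R connections, not a library function.

```r
# Build a tiny stand-in archive: a directory holding one gzip-compressed file.
work <- tempfile("grib_demo_")
dir.create(file.path(work, "payload"), recursive = TRUE)
payload <- as.raw(1:255)  # made-up bytes, not real GRIB data
con <- gzfile(file.path(work, "payload", "field_001.grb.gz"), "wb")
writeBin(payload, con)
close(con)
old_wd <- setwd(work)
tar("demo.tar", files = "payload", tar = "internal")
setwd(old_wd)

# exdir sets the destination; files would only select archive members.
out_dir <- file.path(work, "extracted")
untar(file.path(work, "demo.tar"), exdir = out_dir)

# Hypothetical helper: decompress one .gz file in chunks, next to the original.
gunzip_file <- function(gz, out = sub("\\.gz$", "", gz), chunk = 1e6) {
  src <- gzfile(gz, "rb")
  on.exit(close(src))
  dst <- file(out, "wb")
  on.exit(close(dst), add = TRUE)
  repeat {
    bytes <- readBin(src, what = "raw", n = chunk)
    if (length(bytes) == 0L) break
    writeBin(bytes, dst)
  }
  invisible(out)
}

gz_files <- list.files(out_dir, pattern = "\\.gz$", recursive = TRUE, full.names = TRUE)
grib_files <- vapply(gz_files, gunzip_file, character(1))
print(unname(grib_files))
```

For the real download, replace the stand-in archive with the path to the tar file and keep the untar() and gunzip_file() steps.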
Are the files in GRIB1 or GRIB2? If in GRIB1, you need to install wgrib. If in GRIB2, you need to install wgrib2. Both are available from NCEP.
You can download them from:
http://www.cpc.ncep.noaa.gov/products/wesley/
In R 3.1 or later, install the rNOMADS package, version 2.0.1 or later.
NOAA National Operational Model Archive and Distribution System (NOMADS) distributes global grid data in GRIB format (currently in GRIB2).
rNOMADS helps you open GRIB1 and GRIB2 data in R by calling wgrib or wgrib2 to decode the binary GRIB data and pipe it (in csv format) for R to read in.
Open up R, load up rNOMADS, and then call the ReadGrib routine using the full path name of your data file in "data_file_name". This is not the way described in the rNOMADS documentation, but it works.
Installing wgrib and wgrib2 is the only hard part and it may not even be that hard, depending on your system. I'm writing tutorials on how to install wgrib, wgrib2 and use rNOMADS with local data files. When I am done, they will be posted here:
http://rda.ucar.edu/datasets/ds083.2/#!software
Now for some bad news:
You need to open each file sequentially. But, you can extract and save the subfields you need, and then read in the next datafile, overwriting the large data structure into which you read the previous file. If that is too much of a PITA, have you considered using the GRADS tool for displaying GRIB data?
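The read-extract-overwrite pattern described above could look roughly like this. It is only a sketch: read_field is a hypothetical stand-in for whatever reader you use (with rNOMADS, the ReadGrib call would go there), and the CSV files are made-up substitutes for decoded GRIB output so that the example runs without wgrib installed.

```r
# Create three small CSV files as stand-ins for decoded GRIB output (made-up values).
data_dir <- tempfile("decoded_")
dir.create(data_dir)
for (i in 1:3) {
  write.csv(
    data.frame(variable = c("precip", "temp"), value = c(i, i * 10)),
    file.path(data_dir, sprintf("field_%02d.csv", i)),
    row.names = FALSE
  )
}

# Hypothetical reader; swap in the real GRIB reader here.
read_field <- function(path) read.csv(path, stringsAsFactors = FALSE)

files <- sort(list.files(data_dir, pattern = "\\.csv$", full.names = TRUE))
kept <- vector("list", length(files))
for (i in seq_along(files)) {
  full <- read_field(files[i])                    # large object, overwritten each pass
  kept[[i]] <- full[full$variable == "precip", ]  # keep only the subfield needed
}

# Combine the small extracted pieces once all files have been read.
precip <- do.call(rbind, kept)
print(precip)
```

Only the extracted rows accumulate in memory; the full contents of each file are replaced on the next pass through the loop.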
There is no native way to read grib files into R. Use wgrib or wgrib2 depending on whether your file is in grib or grib2 format. I am the package manager for rNOMADS - and trust me, we tried to figure out a simple R way, and ended up dropping it. Maybe the folks at NCEP will do it someday, but it's out of our skill range.
Personally, I untar my files using Cygwin, partly because the wgrib package in Cygwin lets you get an inventory file, so you can tell R what data is contained in each layer. Assuming the data is GRIB1, R can read it directly. GRIB2 requires wgrib2 on your machine; rNOMADS is working on that challenge.
Alright I recently found a great website that shows how to install wgrib so that it can run in R in conjunction with rNOMADS.
https://bovineaerospace.wordpress.com/2015/04/26/how-to-install-rnomads-with-grib-file-support-on-windows/#comments
