I'm trying to include a (somewhat) large dataset in an R package. During the check in RStudio I keep getting a warning saying that I could save space with better compression:
* checking data for ASCII and uncompressed saves ... WARNING
Note: significantly better compression could be obtained
by using R CMD build --resave-data
        old_size new_size compress
slp.rda    499Kb    310Kb    bzip2
sst.rda    1.3Mb    977Kb       xz
I've tried adding --resave-data to RStudio's "Configure Build Tools" to no effect.
Another alternative, if you have a large dataset that you don't want to re-create, is to use tools::resaveRdaFiles from within R. Point it at the dataset file, or the entire data directory, and it will compress your data in a format of your choosing. See its manual page for more information.
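A minimal sketch of that approach, using only base R. The directory and the sst data frame are made up for the demo; in a real package you would point the calls at your own data directory.

# Sketch: recompress every .rda file in a (temporary) data directory.
data_dir <- file.path(tempdir(), "resave_demo_data")  # hypothetical location
dir.create(data_dir, showWarnings = FALSE)

# Save a compressible example object with the default (gzip) compression.
sst <- data.frame(id = rep(1:1000, each = 50),
                  value = rep(c(0.5, 1.5), 25000))
rda_path <- file.path(data_dir, "sst.rda")
save(sst, file = rda_path)

# Inspect, resave with xz, and inspect again.
before <- tools::checkRdaFiles(data_dir)
tools::resaveRdaFiles(data_dir, compress = "xz")
after <- tools::checkRdaFiles(data_dir)

print(before[, c("size", "compress")])
print(after[, c("size", "compress")])

Passing compress = "auto" instead lets resaveRdaFiles try the available methods and keep whichever it judges best.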
The devtools function use_data takes a compress parameter for the type of compression and makes adding data to packages much easier in general. Whether you use it or just call save on your own, use xz compression when you save your data (for save it's the compress argument, e.g. compress = "xz"; compression_level only sets the level).
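To see what the compress argument of save does for a given object, something like the following can help. It is only a sketch: slp_demo is an invented data frame, and the resulting sizes depend entirely on your data.

# Sketch: save the same object with each compression method and compare sizes.
slp_demo <- data.frame(step = seq_len(20000),
                       group = rep(letters[1:4], 5000))

sizes <- vapply(c("gzip", "bzip2", "xz"), function(method) {
  path <- tempfile(fileext = ".rda")
  save(slp_demo, file = path, compress = method)
  file.size(path)  # size in bytes
}, numeric(1))

print(sizes)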
If you want to use --resave-data, try --resave-data=best explicitly, since the gzip method (the default when the option is not supplied) gains you pretty much nothing in this case.
See the "Building package tarballs" section of Writing R Extensions for more information.
Related
My question is if an object in R saved to binary format using the save function can be different if saved from different (but recent) versions of R.
That is because I have a script that makes some calculations and saves its results to a file. When reproducing the same calculations later, I decided to compare the two files using
diff --binary -s mv3p.Rdata mv3p.Rdata.backup
To my surprise the two files are different. However when analysing the contents in R, they are identical.
The new version is 3.3.1. I believe the older file was created by R 3.3.0, but it could also have been 3.2.x; I am not 100% sure. I called save with only the object I wanted to save and the filename as arguments.
So my question is: is it normal that the same object is written differently by different versions of R? Is it documented somewhere? How can I be sure to reproduce exactly the same file? What can it depend on (R version, OS, processor architecture, etc.)?
Please note, I am NOT asking whether files can be read by another version of R, and I am NOT asking about very old R versions.
R data files also include the R version used to write them. That's one reason the files may be different. See the documentation here: http://biostat.mc.vanderbilt.edu/wiki/Main/RBinaryFormat
Also, you can use save(..., ascii = TRUE) to see the difference in plain text.
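If what you really care about is the contents rather than the bytes, you can compare the loaded objects instead of the files. A rough sketch follows; load_into_env and same_contents are helper names made up here, not existing functions.

# Sketch: compare two .RData files by content instead of byte by byte.
load_into_env <- function(path) {
  env <- new.env()
  load(path, envir = env)
  env
}

same_contents <- function(path_a, path_b) {
  a <- load_into_env(path_a)
  b <- load_into_env(path_b)
  names_a <- ls(a, all.names = TRUE)
  names_b <- ls(b, all.names = TRUE)
  identical(names_a, names_b) &&
    all(vapply(names_a, function(n) {
      identical(get(n, envir = a), get(n, envir = b))
    }, logical(1)))
}

# Demo: the same object written two different ways.
mv3p <- list(coef = c(1.5, 2.5), n = 10L)
file_a <- tempfile(fileext = ".RData")
file_b <- tempfile(fileext = ".RData")
save(mv3p, file = file_a)                   # default gzip
save(mv3p, file = file_b, compress = "xz")  # different bytes on disk

print(same_contents(file_a, file_b))

The two demo files differ on disk because of the compression method, yet the comparison only looks at what load() gives back.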
Is there a package or function that can be applied to a whole, heavy data object to get back a measure of changes in the file? Something based on hash keys would be great, so I can keep track of a shared file.
The digest package (digest function) lets you compute hashes of R objects (available algorithms: "md5", "sha1", "crc32", "sha256", "sha512", "xxhash32", "xxhash64"). You can also run external programs from R (e.g. md5sum on Linux) with the system command.
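For tracking a file on disk, base R already ships tools::md5sum, so a small sketch without extra packages could look like this. Note this swaps in md5sum for digest; file_fingerprint and the demo CSV are invented for illustration.

# Sketch: detect changes to a shared file via its MD5 checksum.
file_fingerprint <- function(path) {
  unname(tools::md5sum(path))  # 32-character hex string
}

shared <- tempfile(fileext = ".csv")  # stand-in for the shared file
writeLines(c("id,value", "1,10", "2,20"), shared)
fp_before <- file_fingerprint(shared)

# Someone edits the file ...
writeLines(c("id,value", "1,10", "2,21"), shared)
fp_after <- file_fingerprint(shared)

changed <- !identical(fp_before, fp_after)
print(changed)

Storing the previous fingerprint somewhere (a small text file next to the data, say) is enough to tell on the next run whether the file has been touched.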
At our site, we have a large amount of custom R code that is used to build a set of packages for internal use and distribution to our R users. We try to maintain the entire library in a versioning scheme so that the version numbers and the date are the same. The problem is that we've gotten to the point where the number of packages is substantial enough that manual modification of the DESCRIPTION file and the package .Rd file is very time consuming, and it would be nice to automate these pieces.
We could write a pre-script that goes through the full set of files and writes the current date and version number. This could be done without a lot of pain, but it would modify our current build chain and we would have to adapt the various steps.
Is there a way that this can be done without having to do a pre-build file modification step? In other words, can the DESCRIPTION file and the .Rd file contain something akin to an environment variable that will be substituted with the current information when called upon by R CMD build ?
You cannot use environment variables as R, when running R CMD build ... or R CMD INSTALL ..., sees the file as fixed.
But the saying that there is no problem that cannot be fixed by another layer of indirection remains true. Your R source code could simply be files within another layer in which you do text substitution according to some pattern. If you like autoconf, you could just have DESCRIPTION.in and have a configure script query the environment variables, or a meta-config file or database, or something else, and have that written out. Similarly you could have a sed or perl or python or R or ... script doing the textual substitution.
I used to let svn fill in the argument to Date: in DESCRIPTION, and also encoded revision numbers in an included header file. It's all scriptable to your heart's content.
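As a rough illustration of the DESCRIPTION.in idea in R itself: the @VERSION@/@DATE@ placeholders, the PKG_VERSION/PKG_DATE environment variables and the fill_template helper are all assumptions made up for this sketch, not anything R CMD build knows about.

# Sketch: generate DESCRIPTION from DESCRIPTION.in before calling R CMD build.
fill_template <- function(template, output, values) {
  text <- readLines(template)
  for (key in names(values)) {
    # Replace each literal @KEY@ marker with its value.
    text <- gsub(paste0("@", key, "@"), values[[key]], text, fixed = TRUE)
  }
  writeLines(text, output)
  invisible(text)
}

pkg_dir <- file.path(tempdir(), "demopkg")  # hypothetical package source dir
dir.create(pkg_dir, showWarnings = FALSE)
writeLines(c("Package: demopkg",
             "Version: @VERSION@",
             "Date: @DATE@"),
           file.path(pkg_dir, "DESCRIPTION.in"))

# Values come from the environment, with fallbacks for the demo.
values <- c(VERSION = Sys.getenv("PKG_VERSION", "1.2.3"),
            DATE = Sys.getenv("PKG_DATE", format(Sys.Date())))

fill_template(file.path(pkg_dir, "DESCRIPTION.in"),
              file.path(pkg_dir, "DESCRIPTION"),
              values)
print(readLines(file.path(pkg_dir, "DESCRIPTION")))

The same function can be pointed at an .Rd.in file, so one loop over the package directories covers both kinds of file.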
I have data on a server in the form of SAS data sets that are updated daily. I would like these to be packaged auto-magically into R packages and then dropped in a package repository on the server. This should allow my co-workers and me to easily work with this packaged data in R and keep up-to-date as it changes each day by simply calling install.packages and update.packages.
What is a good way to implement this automatic creation of data packages?
I have written some code that pulls in the data set, converts it and then uses package.skeleton() to dynamically create the package structure. I then have to overwrite the DESCRIPTION file to update the version along with some other edits. Finally, I have to call tools::build and tools::check to package the whole lot and drop it in the repository. Is there a better way?
What you can do is to create an R file under your data folder to load data:
data
--sas_data.R
And in this sas_data.R you write your code to load the data from the server. The code should be something like :
## url and dest_file must be defined first, e.g. dest_file <- tempfile()
download.file(url, dest_file)
## process here
sas_data <- read.table(dest_file)
Then you call it using data:
data(sas_data)
I would recommend using a makefile to automate the conversion of datasets. This would be useful especially if there are multiple datasets and the conversion process is time consuming.
I am assuming that the sas files are in a directory called sas. Here is the makefile.
By typing make data, all the *.sas7bdat files are read from the sas directory, using the package sas7bdat and saved as *.rda files of the same name in the data directory of the package. You can add more automation by adding package installation to the makefile and using a continuous integration system like TravisCI so that your R package is always up-to-date.
I have created a sample repo to illustrate my idea. This is an interesting question and I think it makes sense to develop a simple, flexible and robust approach to data packaging.
SAS_FILES = $(wildcard sas/*.sas7bdat)
RDA_FILES = $(patsubst sas/%.sas7bdat, data/%.rda, $(SAS_FILES))

data: $(RDA_FILES)

# $< is the .sas7bdat input, $@ is the .rda target (the recipe line must start with a tab)
data/%.rda: sas/%.sas7bdat
	Rscript -e "library(sas7bdat); library(tools); fname = file_path_sans_ext(basename('$<')); assign(fname, read.sas7bdat('$<')); save(list = fname, file = '$@')"
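For update.packages to notice the daily rebuild, the Version field has to increase each time. One way to do that without rewriting the whole DESCRIPTION is read.dcf/write.dcf. This is just a sketch: bump_description and the "1.0.YYYYMMDD" scheme are assumptions for illustration.

# Sketch: stamp DESCRIPTION with a date-based version before each build.
bump_description <- function(path, date = Sys.Date(), base = "1.0") {
  desc <- read.dcf(path)  # one-row character matrix of fields
  fields <- c(Version = paste(base, format(date, "%Y%m%d"), sep = "."),
              Date = format(date, "%Y-%m-%d"))
  for (f in names(fields)) {
    if (f %in% colnames(desc)) {
      desc[1, f] <- fields[[f]]
    } else {
      # Field missing so far: append it as a new column.
      desc <- cbind(desc, matrix(fields[[f]], dimnames = list(NULL, f)))
    }
  }
  write.dcf(desc, file = path)
  invisible(desc)
}

# Demo on a throwaway DESCRIPTION file.
desc_path <- file.path(tempdir(), "DESCRIPTION_demo")
writeLines(c("Package: sasdata", "Version: 0.0.1"), desc_path)
bump_description(desc_path, date = as.Date("2024-01-31"))
print(read.dcf(desc_path))

A call to this could be added as one more Rscript step in the makefile, run just before the package is built.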
In recent efforts to develop a package, I'm including datasets in the data/ folder of my package. In my specific case I have 5 datasets all of which are in data.table format (although the issues I describe below persist if I keep them as data.frame). I've saved each one as individual .rda files and documented them appropriately.
When I run check() from package devtools, I get the following warnings:
checking data for ASCII and uncompressed saves ... WARNING
Warning: large data file(s) saved inefficiently:
           size ASCII compress
data1.rda 129Kb  TRUE     gzip
data2.rda 101Kb  TRUE     gzip
data3.rda 1.6Mb  TRUE     gzip
Note: significantly better compression could be obtained
by using R CMD build --resave-data
            old_size new_size compress
data1.rda      129Kb     34Kb       xz
data2.rda      101Kb     20Kb       xz
data4.rda       92Kb     35Kb       xz
data3.rda      1.6Mb    116Kb       xz
species.rda     12Kb      9Kb       xz
I've tried saving the data with resaveRdaFiles (package tools) with the recommended xz compression. Even after doing that, the warning persists.
OK, so I ran R CMD build --resave-data and the warning still persists.
What am I missing here and how do I overcome this issue (now and in the future)?
When you save your .rda file, please use the command: save(..., file='test.rda', compress='xz')
This will help to solve the problem!
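To confirm the result before running the check again, tools::checkRdaFiles reports both the ASCII flag and the compression of each file. A small sketch with an invented data3 object and temporary file paths:

# Sketch: compare an ASCII save with a binary xz save of the same object.
data3 <- data.frame(site = rep(c("a", "b"), 5000), count = rep(1:100, 100))

ascii_path <- tempfile(fileext = ".rda")
xz_path <- tempfile(fileext = ".rda")

save(data3, file = ascii_path, ascii = TRUE)  # the kind of file the check flags
save(data3, file = xz_path, compress = "xz")  # binary, xz-compressed

report <- tools::checkRdaFiles(c(ascii_path, xz_path))
print(report[, c("size", "ASCII", "compress")])

If a file in data/ still shows ASCII as TRUE or a compress value other than the one you asked for, it was probably rewritten by some other step after you saved it.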