How to effectively deal with uncompressed saves during package check?

In recent efforts to develop a package, I'm including datasets in the data/ folder of my package. In my specific case I have five datasets, all of which are in data.table format (although the issues I describe below persist if I keep them as data.frame). I've saved each one as an individual .rda file and documented them appropriately.
When I run check() from package devtools, I get the following warnings:
checking data for ASCII and uncompressed saves ... WARNING
Warning: large data file(s) saved inefficiently:
              size  ASCII  compress
data1.rda    129Kb   TRUE      gzip
data2.rda    101Kb   TRUE      gzip
data3.rda    1.6Mb   TRUE      gzip
Note: significantly better compression could be obtained
by using R CMD build --resave-data
             old_size  new_size  compress
data1.rda       129Kb      34Kb        xz
data2.rda       101Kb      20Kb        xz
data4.rda        92Kb      35Kb        xz
data3.rda       1.6Mb     116Kb        xz
species.rda      12Kb       9Kb        xz
I've tried re-saving the data with resaveRdaFiles (from the tools package) using the recommended xz compression. Even after doing that, the warning persists.
OK, so I run R CMD build --resave-data and the warning still persists.
What am I missing here and how do I overcome this issue (now and in the future)?

When you save your .rda file, use the compress argument: save(..., file='test.rda', compress='xz')
This will solve the problem!
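For example, a minimal sketch (the object and file names are illustrative) that re-saves a dataset with xz compression and then checks what R CMD check will see:

# Save the dataset with xz compression
save(mydata, file = "data/data1.rda", compress = "xz")
# Report size, ASCII flag, and compression for each .rda file
tools::checkRdaFiles("data/")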

Related

Convincing R that the .dbf file associated with a .shp file is not an executable during command checks

I am working on submitting an R package to CRAN. Right now I am trying to reduce the memory footprint of the package. Because this package deals with spatial data that has a very particular format, I want to include a properly formatted shapefile as an example. If I include the full-size original shapefile, there are no warnings (other than file size) in the R CMD checks. However, if I crop the file and include the cropped version in the package (in "inst/extdata") I get this warning:
W checking for executable files (389ms)
Found the following executable file:
inst/extdata/temp/temp.dbf
Source packages should not contain undeclared executable files.
See section ‘Package structure’ in the ‘Writing R Extensions’ manual.
This file is the database file associated with the shapefile. I have tried cropping the file and saving it using rgdal functions, sf functions, and using QGIS. I have also verified that the modes of the cropped files match the original file using chmod. I even tried changing .dbf to .DBF. Does anyone have any additional suggestions, other than listing it in BinaryFiles, which CRAN will not accept in a submission?
I'm running R version 4.0.2 via RStudio 2021.09.1 on Mac OSX 10.15.7. rgdal and sf are fully updated, as are all of their dependencies.
This is a known issue where the file utility mis-identifies DBF files whose last-update date falls in the year 2022. The easiest fix is to avoid a 2022 update date when saving the file. Alternatively, you can simply patch the second byte of the file after the fact, e.g.:
fn = "myfile.dbf"
sz = file.info(fn)$size
r = readBin(fn, raw(), sz)
r[2] = as.raw(121) ## make it 2021 instead of 2022
writeBin(r, fn)
(See also corresponding discussion on R-package-devel)

How to compress saves in R package build

I'm trying to include a (somewhat) large dataset in an R package. I keep getting a warning during the check in RStudio saying that I could save space with compression:
* checking data for ASCII and uncompressed saves ... WARNING
Note: significantly better compression could be obtained
by using R CMD build --resave-data
         old_size  new_size  compress
slp.rda     499Kb     310Kb     bzip2
sst.rda     1.3Mb     977Kb        xz
I've tried adding --resave-data to RStudio's "Configure Build Tools" to no effect.
Another alternative, if you have a large dataset that you don't want to re-create, is to use tools::resaveRdaFiles from within R. Point it at the dataset file, or the entire data directory, and it will compress your data in a format of your choosing. See its manual page for more information.
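For instance, a minimal sketch, assuming the package's datasets live in data/:

# Recompress every .rda file under data/ using xz
tools::resaveRdaFiles("data/", compress = "xz")
# Inspect the resulting sizes and compression formats
tools::checkRdaFiles("data/")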
The devtools function use_data takes a parameter for the type of compression and makes adding data to packages much easier in general. Whether you use it or just call save() on your own, use xz compression when you save your data (for save() the compression type is set by the compress argument; compression_level tunes how aggressively it compresses).
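A brief sketch of both routes (use_data now lives in usethis, which devtools re-exports; mydata is an illustrative name):

# Writes data/mydata.rda with xz compression
usethis::use_data(mydata, compress = "xz", overwrite = TRUE)
# Equivalent with base R
save(mydata, file = "data/mydata.rda", compress = "xz")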
If you want to use --resave-data then you can try --resave-data=best, since just using --resave-data defaults to gzip (gaining you pretty much nothing in this case).
See Building package tarballs for more information.

The cause of "bad magic number" error when loading a workspace and how to avoid it?

I tried to load my R workspace and received this error:
Error: bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning message:
file ‘WORKSPACE_Wedding_Weekend_September’ has magic number '#gets'
Use of save versions prior to 2 is deprecated
I'm not particularly interested in the technical details, but mostly in how I caused it and how I can prevent it in the future. Here are some notes on the situation:
I'm running R 2.15.1 on a MacBook Pro running Windows XP on a bootcamp partition.
There is something obviously wrong with this workspace file, since it weighs in at only ~80kb while all my others are usually >10,000kb.
Over the weekend I was running an external modeling program in R and storing its output to different objects. I ran several iterations of the model over the course of several days, e.g. output_Saturday <- call_model().
There is nothing special about the model output; it's just a list with slots for betas, VC-matrices, model specification, etc.
I got that error when I accidentally used load() instead of source() or readRDS().
Also worth noting is the following from a document by the R Core Team summarizing the changes in R versions after v3.5.0:
R has a new serialization format (version 3) which supports custom serialization of ALTREP framework objects... Serialized data in format 3 cannot be read by versions of R prior to version 3.5.0.
I encountered this issue when I saved a workspace in v3.6.0 and then shared the file with a colleague who was using v3.4.2. I was able to resolve the issue by adding version=2 to my save() call.
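A minimal sketch of that fix (the object and file names are illustrative):

# version = 2 writes the older serialization format, readable by R < 3.5.0
save(my_model, file = "my_model.RData", version = 2)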
Assuming your file is named "myfile.ext"
If the file you're trying to load is not an R-script, for which you would use
source("myfile.ext")
you might try the readRDS function and assign the result to a variable name:
my.data <- readRDS("myfile.ext")
The magic number comes from UNIX-type systems where the first few bytes of a file held a marker indicating the file type.
This error indicates you are trying to load a non-valid file type into R. For some reason, R no longer recognizes this file as an R workspace file.
If the file is actually a plain-text data file rather than a workspace, install the readr package, load it with library(readr), and read the file with one of its functions instead.
It also occurs when you try to load() an .rds object instead of using
object <- readRDS("object.rds")
I got the error because the file had been written with saveRDS() rather than save(). Re-saving with save(), e.g. save(iris, file="data/iris.RData"), fixed the issue for me.
Also note that with save()/load() the object is restored under the same name it was originally saved with (i.e. you can't rename it until it has been loaded into the R environment under its original name).
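An illustrative sketch of that naming behaviour:

x <- 1:5
save(x, file = "x.RData")
rm(x)
loaded <- load("x.RData")  # restores the object under its original name
loaded                     # "x" -- load() returns the restored names
y <- x                     # only now can it be bound to a new name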
I had this problem when I saved the .RData file in an older version of R and then tried to open it in a newer one. I solved it by updating my R version to the newest.
If you are working with devtools, try saving the files with:
devtools::use_data(x, internal = TRUE)
Then delete all files saved previously.
From the docs:
internal — If FALSE, saves each object in an individual .rda file in the data/ directory; these are available whenever the package is loaded. If TRUE, stores all objects in a single R/sysdata.rda file; these objects are only available within the package.
This error occurred when I updated my R and RStudio versions and loaded files I had created under my prior version. So I reinstalled my prior R version and everything worked as it should.

R.matlab/readMat : Error in readTag(this)

I am trying to read a matlab file into R using R.matlab but am encountering this error:
require(R.matlab)
r <- readMat("file.mat", verbose=T)
Trying to read MAT v5 file stream...
Error in readTag(this) : Unknown data type. Not in range [1,19]: 18569
In addition: Warning message:
In readMat5Header(this, firstFourBytes = firstFourBytes) :
Unknown MAT version tag: 512. Will assume version 5.
How can this issue be solved, or is there an alternative way to load MATLAB files? I can use hdf5load but have heard this can mess with the data. Thanks!
This is a bit late on the response, but I've recently been running into the same issues. For me, the issue was that I was saving MATLAB files using the default '-v7.3' option. After extensive searching, the R.matlab documentation (http://cran.r-project.org/web/packages/R.matlab/R.matlab.pdf) indicates the following:
Reading compressed MAT files
From MATLAB v7, compressed MAT version 5 files are used by default [3,4]. This function supports reading such files, if running R v2.10.0 or newer. For older versions of R, the Rcompression package is used. To install that package, please see instructions at http://www.omegahat.org/cranRepository.html. As a last resort, use save -V6 in MATLAB to write MAT files that are compatible with MATLAB v6, that is, to write non-compressed MAT version 5 files.
About MAT files saved in MATLAB using '-v7.3'
This function does not support MAT files saved in MATLAB as save('foo.mat', '-v7.3'). Such MAT files are of a completely different file format [5,6] compared to those saved with, say, '-v7'.
Adding the '-v7' option at the end of my MATLAB save command fixed this issue, i.e.:
save('filename', 'variable', '-v7')
I had a very similar problem until I pointed the function at an actual .mat file that existed. Before that I'd been specifying two files of the same name, but one was .mat and the other was .txt, so it may have been trying to open the other.
I realize this may not directly solve your issue (the only differences I saw in my error message were the absence of that first "Trying ..." line, the specific numbers thereafter, and the presence of a couple of similar warnings with odd numbers), but it might point to a simple filename problem as the cause.
I use the latest MATLAB on 64-bit Vista and the latest R on 32-bit XP.

Saving in hdf5save creates an unreadable file

I'm trying to save an array as an HDF5 file using R, but having no luck.
To try and diagnose the problem I ran example(hdf5save). This successfully created a HDF5 file that I could read easily with h5dump.
When I then ran the R code manually, I found that it didn't work. The code I ran was exactly the same as is run in the example script (except for a change of filename to avoid overwriting). Here is the code:
(m <- cbind(A = 1, diag(4)))
ll <- list(a=1:10, b=letters[1:8]);
l2 <- list(C="c", l=ll); PP <- pi
hdf5save("ex2.hdf", "m","PP","ll","l2")
rm(m,PP,ll,l2) # and reload them:
hdf5load("ex2.hdf",verbosity=3)
m # read from "ex2.hdf"; buglet: dimnames dropped
str(ll)
str(l2)
and here is the error message from h5dump:
h5dump error: unable to open file "ex2.hdf"
Does anyone have any ideas? I'm completely at a loss.
Thanks
I have had this problem. I am not sure of the cause and neither are the hdf5 maintainers. The authors of the R package have not replied.
Alternatives that work
In the time since I originally answered, the hdf5 package has been archived, and suitable alternatives (h5r, rhdf5, and ncdf4) have been created; I am currently using ncdf4 (a minimal sketch follows this list):
Since netCDF-4 uses hdf5 as a storage layer, the ncdf4 package provides an interface to both netCDF-4 and hdf5.
The h5r package works with R >= 2.10.
The rhdf5 package is available on Bioconductor.
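A minimal write/read sketch with ncdf4, assuming an HDF5-backed (netCDF-4) file is acceptable; all names are illustrative:

library(ncdf4)
d  <- ncdim_def("i", units = "", vals = 1:10)
v  <- ncvar_def("dat", units = "", dim = d, prec = "integer")
nc <- nc_create("test.nc", v, force_v4 = TRUE)  # force_v4 = TRUE gives an HDF5-based netCDF-4 file
ncvar_put(nc, v, 1:10)
nc_close(nc)  # closing the handle finalizes the file on disk

nc  <- nc_open("test.nc")   # read it back
dat <- ncvar_get(nc, "dat")
nc_close(nc)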
Workarounds
Two functional but unsatisfactory workarounds that I used prior to finding the alternatives above:
Install R 2.7, hdf5 version 1.6.6, R hdf5 v1.6.7, and zlib1g version 1:1.2.3.3, and use that setup when writing the files (this was my solution until migrating to the ncdf4 library).
Use h5totxt at the command line from the hdf5utils program (requires using bash and rewriting your R code).
A minimal, reproducible example that triggers the error:
First R session
library(hdf5)
dat <- 1:10
hdf5save("test.h5","dat")
q()
n # do not save workspace
Second R session:
library(hdf5)
hdf5load("test.h5")
output:
HDF5-DIAG: Error detected in HDF5 library version: 1.6.10 thread
47794540500448. Back trace follows.
#000: H5F.c line 2072 in H5Fopen(): unable to open file
major(04): File interface
minor(17): Unable to open file
#001: H5F.c line 1852 in H5F_open(): unable to read superblock
major(04): File interface
minor(24): Read failed
#002: H5Fsuper.c line 114 in H5F_read_superblock(): unable to find file
signature
major(04): File interface
minor(19): Not an HDF5 file
#003: H5F.c line 1304 in H5F_locate_signature(): unable to find a valid
file signature
major(05): Low-level I/O layer
minor(29): Unable to initialize object
Error in hdf5load("test.h5") : unable to open HDF file: test.h5
I've also run into the same issue and found a reasonable fix.
The problem seems to stem from when the hdf5 library finalizes the file: if it doesn't get a chance to finalize it, the file is corrupted. I believe this happens once the buffer is flushed, but the buffer doesn't always flush.
One solution I've found is to do the hdf5save in a separate function: assign the variables into globalenv(), then call hdf5save and exit the function. When the function completes, the memory seems to be cleaned up, which makes the hdf5 library flush the buffer and finalize the file.
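A hedged sketch of that workaround, assuming the archived hdf5 package (write_h5 is an illustrative name):

library(hdf5)

write_h5 <- function(file, value, name) {
  assign(name, value, envir = globalenv())  # hdf5save looks objects up by name
  hdf5save(file, name)
}  # returning from the function appears to flush the buffer and finalize the file

dat <- 1:10
write_h5("test.h5", dat, "dat")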
Hope this helps!
