R: possible truncation of >= 4GB file - r

I have a 370MB zip file and the content is a 4.2GB csv file.
I did:
unzip("year2015.zip", exdir = "csv_folder")
And I got this message:
1: In unzip("year2015.zip", exdir = "csv_folder") :
possible truncation of >= 4GB file
Have you experienced that before? How did you solve it?

I agree with @Sixiang.Hu's answer: R's unzip() won't work reliably with files greater than 4GB.
As for "How did you solve it?": I've tried a few different tricks, and in my experience anything using R's built-ins (almost) invariably misidentifies the end-of-file (EOF) marker before the actual end of the file.
I deal with this issue in a set of files I process on a nightly basis, and to deal with it consistently and in an automated fashion, I wrote the function below to wrap the UNIX unzip. This is basically what you're doing with system(unzip()), but gives you a bit more flexibility in its behavior, and allows you to check for errors more systematically.
decompress_file <- function(directory, file, .file_cache = FALSE) {
  if (.file_cache == TRUE) {
    print("decompression skipped")
  } else {
    # Set working directory for decompression;
    # simplifies unzip directory location behavior
    wd <- getwd()
    setwd(directory)

    # Run decompression
    decompression <-
      system2("unzip",
              args = c("-o",  # include override flag
                       file),
              stdout = TRUE)

    # Uncomment to delete archive once decompressed
    # file.remove(file)

    # Reset working directory
    setwd(wd); rm(wd)

    # Test for success criteria;
    # change the search depending on your implementation
    if (grepl("Warning message", tail(decompression, 1))) {
      print(decompression)
    }
  }
}
Notes:
The function does a few things, which I like and recommend:
- uses system2 over system, because the documentation says "system2 is a more portable and flexible interface than system"
- separates the directory and file arguments, and moves the working directory to the directory argument; depending on your system, unzip (or your choice of decompression tool) gets really finicky about decompressing archives outside the working directory
- it's not pure, but resetting the working directory is a nice step toward the function having fewer side effects
- you can technically do it without this, but in my experience it's easier to make the function more verbose than to have to deal with generating filepaths and remembering unzip CLI flags
- I set it to use the -o flag to automatically overwrite when rerun, but you could supply any number of arguments
- includes a .file_cache argument which allows you to skip decompression; this comes in handy if you're testing a process which runs on the decompressed file, since 4GB+ files tend to take some time to decompress
- commented out in this instance, but if you know you don't need the archive after decompressing, you can remove it inline
- the system2 command redirects the stdout to decompression, a character vector
- an if + grepl check at the end looks for warnings in the stdout, and prints the stdout if it finds that expression
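A quick usage sketch with the names from the question (this assumes year2015.zip has been placed inside the csv_folder directory, since the file path is resolved relative to the directory argument):
decompress_file(directory = "csv_folder", file = "year2015.zip")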

Checking ?unzip, I found the following comment in its Note section:
It does have some support for bzip2 compression and > 2GB zip files
(but not >= 4GB files pre-compression contained in a zip file: like
many builds of unzip it may truncate these, in R's case with a warning
if possible).
You can try to unzip it outside of R (using 7-Zip for example).
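If you want to keep the step inside an R script, here is a minimal sketch of driving an external tool via system2() (assuming unzip or 7-Zip's 7z executable is on your PATH; the file and folder names are the ones from the question):
# Info-ZIP's unzip: -o overwrites existing files, -d sets the output directory
system2("unzip", args = c("-o", "year2015.zip", "-d", "csv_folder"))

# or with 7-Zip: x extracts with full paths, -o<dir> sets the output directory
system2("7z", args = c("x", "year2015.zip", "-ocsv_folder"))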

To add to the list of possible solutions, in case you have Java (JDK) available on your machine, you can wrap jar xf into an R function similar to utils::unzip() in interface, a very simple example:
unzipLarge <- function(zipfile, exdir = getwd()) {
  oldWd <- getwd()
  on.exit(setwd(oldWd))
  setwd(exdir)
  system2("jar", args = c("xf", zipfile))
}
And then use:
unzipLarge("year2015.zip", exdir = "csv_folder")

Related

How to dump png to stdout?

Is it possible to get R to write a plot in bitmap format (e.g. PNG) to standard output? If so, how?
Specifically I would like to run Rscript myscript.R | other_prog_that_reads_a_png_from_stdin. I realise it's possible to create a temporary file and use that, but it's inconvenient as there will potentially be many copies of this pipeline running at the same time, necessitating schemes for choosing unique filenames and removing them afterwards.
I have so far tried setting outf <- file("stdout") and then running either bitmap(file=outf, ...) or png(filename=outf, ...), but both complain ("'file' must be a non-empty character string" and "invalid 'filename' argument", respectively), which is in line with the official documentation for these functions.
Since I was able to persuade R's read.table() function to read from standard input, I'm hoping there's a way. I wasn't able to find anything relevant here on SO by searching for [r] stdout plot, or any of the variations with stdout replaced by "standard output" (with or without double quotes), and/or plot replaced by png.
Thanks!
Unfortunately {grDevices} (and, by implication, {ggplot2}) seems fundamentally not to support this.
The obvious approach to work around this is: let a graphics device write to a temporary file, and then read that temporary file back into the R session and write it to stdout.
But this fails because, on the one hand, the data cannot be read into a string: character strings in R do not support embedded null characters (if you try, you'll get an error such as "nul character not allowed"). On the other hand, readBin and writeBin fail because writeBin categorically refuses to write to any connection that's hooked up to stdout, which is in text mode (ignoring the fact that, on POSIX systems, the two are identical).
This can only be circumvented in incredibly hacky ways, e.g. by opening a binary pipe to a command such as cat:
dev_stdout = function (underlying_device = png, ...) {
    filename = tempfile()
    underlying_device(filename, ...)
    filename
}

dev_stdout_off = function (filename) {
    dev.off()
    on.exit(unlink(filename))
    fake_stdout = pipe('cat', 'wb')
    on.exit(close(fake_stdout), add = TRUE)
    writeBin(readBin(filename, 'raw', file.info(filename)$size), fake_stdout)
}
To use it:
tmp_dev = dev_stdout()
contour(volcano)
dev_stdout_off(tmp_dev)
On systems where /dev/stdout exists (which is most, but not all, POSIX systems), the dev_stdout_off function can be simplified slightly by writing to /dev/stdout directly instead of piping through cat:
dev_stdout_off = function (filename) {
    dev.off()
    on.exit(unlink(filename))
    fake_stdout = file('/dev/stdout', 'wb')
    on.exit(close(fake_stdout), add = TRUE)
    writeBin(readBin(filename, 'raw', file.info(filename)$size), fake_stdout)
}
This might not be a complete answer, but it's the best I've got: can you open a connection using the stdout() command? I know that png() will change the output device to a file connection, but that's not what you want, so it might work to simply substitute png by stdout. I don't know enough about standard outputs to test this theory, however.
The help page suggests that this connection might be text-only. In that case, a solution might be to generate a random string to use as a filename, and pass the name of the file through stdout so that the next step in your pipeline knows where to find your file.
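A minimal sketch of that last idea, passing a temporary file's path through stdout instead of the image bytes (this assumes the next stage of the pipeline is willing to accept a path rather than raw PNG data):
f <- tempfile(fileext = ".png")
png(f)
contour(volcano)
dev.off()
cat(f, "\n", sep = "")  # the downstream program reads this path from stdout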

readxl::read_xls returns "libxls error: Unable to open file"

I have multiple .xls files (~100MB each) from which I would like to load multiple sheets (from each) into R as data frames. I have tried various functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile, both of which always run for a very long time (>15 min) and never finish, so I have to force-quit RStudio to keep working.
I also tried gdata::read.xls, which does finish, but it takes more than 3 minutes per one sheet and it cannot extract multiple sheets at once (which would be very helpful to speed up my pipeline) like XLConnect::loadWorkbook can.
The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?
In several places I have seen readxl::read_xls recommended for this task; it should also be faster per sheet. This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error:
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
The test_file.xls used above is available here.
Any advice would be appreciated in terms of making the first functions run faster or the read_xls run at all - thank you!
UPDATE:
It seems that some users are able to open the file above using the readxl::read_xls function, while others are not, both on Mac and Windows, using the most up-to-date versions of R, RStudio, and readxl. The issue has been posted on the readxl GitHub and has not been resolved yet.
I downloaded your dataset and read each excel sheet in this way (for example, for sheets "Overall" and "Area"):
install.packages("readxl")
library(readxl)
library(data.table)
dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
area_sheet <- as.data.table(read_excel("test_file.xls", sheet = "Area"))
Finally, I get a data.table for each sheet (the answer originally showed a preview of the "Area" sheet here).
You can just as well use the read_xls function instead of read_excel.
I checked: it also works correctly, and is even a little faster, since read_excel is a wrapper over the read_xls and read_xlsx functions from the readxl package.
Also, you can use the excel_sheets function from the readxl package to list all the sheets of your Excel file and then read each of them.
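For example, a sketch that reads every sheet of the file from the question into a named list (the sheet names are taken from the workbook itself):
library(readxl)

path <- "test_file.xls"
sheets <- excel_sheets(path)
all_sheets <- lapply(setNames(sheets, sheets),
                     function(s) read_excel(path, sheet = s))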
UPDATE
Benchmarking was done with the microbenchmark package for the following packages/functions: gdata::read.xls, XLConnect::readWorksheetFromFile, and readxl::read_excel.
Note that XLConnect is a Java-based solution, so it requires a lot of RAM.
I found that I was unable to open the file with read_xls immediately after downloading it, but if I opened the file in Excel, saved it, and closed it again, then read_xls was able to open it without issue.
My suggested workaround for handling hundreds of files is to build a little C# command-line utility that opens, saves, and closes an Excel file. Source code is below; the utility can be compiled with Visual Studio Community Edition.
using System.IO;
using Excel = Microsoft.Office.Interop.Excel;

namespace resaver
{
    class Program
    {
        static void Main(string[] args)
        {
            string srcFile = Path.GetFullPath(args[0]);
            Excel.Application excelApplication = new Excel.Application();
            excelApplication.Application.DisplayAlerts = false;
            Excel.Workbook srcworkBook = excelApplication.Workbooks.Open(srcFile);
            srcworkBook.Save();
            srcworkBook.Close();
            excelApplication.Quit();
        }
    }
}
Once compiled, the utility can be called from R using e.g. system2().
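A sketch of the R side, assuming the compiled binary ends up named resaver.exe and is on your PATH (both assumptions; adjust to wherever you place it), using the file name from the question:
# re-save the workbook via Excel, then read it with readxl
system2("resaver.exe", args = shQuote("test_file.xls"))
data <- readxl::read_xls("test_file.xls")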
I will propose a different workflow. If you happen to have LibreOffice installed, you can convert your Excel files to csv programmatically. I'm on Linux, so I do it in bash, but I'm sure it's possible on macOS as well.
So open a terminal and navigate to the folder with your excel files and run in terminal:
for i in *.xls
do soffice --headless --convert-to csv "$i"
done
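If you would rather drive the conversion from R itself, here is a sketch using system2() (assuming soffice is on your PATH; --outdir keeps the CSVs next to the source files, and "path/to/files" is a placeholder):
xls_files <- list.files("path/to/files", pattern = "\\.xls$", full.names = TRUE)
for (f in xls_files) {
  system2("soffice", args = c("--headless", "--convert-to", "csv",
                              "--outdir", shQuote(dirname(f)), shQuote(f)))
}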
Now in R you can use data.table::fread to read your files with a loop:
Scenario 1: the structure of files is different
If the structure of files is different, then you wouldn't want to rbind them together. You could run in R:
files <- dir("path/to/files", pattern = ".csv")
all_files <- list()
for (i in 1:length(files)){
fileName <- gsub("(^.*/)(.*)(.csv$)", "\\2", files[i])
all_files[[fileName]] <- fread(files[i])
}
If you want to extract your named elements within the list into the global environment, so that they can be converted into objects, you can use list2env:
list2env(all_files, envir = .GlobalEnv)
Please be aware of two things: First, in the gsub call, the direction of the slash. And second, list2env may overwrite objects in your Global Environment if they have the same name as the named elements within the list.
Scenario 2: the structure of files is the same
In that case it's likely you want to rbind them all together. You could run in R:
files <- dir("path/to/files", pattern = ".csv")
joined <- list()
for (i in 1:length(files)){
joined <- rbindlist(joined, fread(files[i]), fill = TRUE)
}
On my system, I had to use path.expand.
R> file = "~/blah.xls"
R> read_xls(file)
Error:
filepath: ~/Dropbox/signal/aud/rba/balsheet/data/a03.xls
libxls error: Unable to open file
R> read_xls(path.expand(file)) # fixed
Resaving your file solves the problem easily.
I also ran into this problem before, but I found the answer in this discussion: I used read_excel() to open those files.
I was seeing a similar error and wanted to share a short-term solution.
library(readxl)
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", "MLBPayrolls.xls")
MLBpayrolls <- read_excel("MLBpayrolls.xls", sheet = "MLB Payrolls", na = "n/a")
Yields (on some systems in my classroom but not others):
Error: filepath: MLBPayrolls.xls libxls error: Unable to open file
The temporary solution was to paste the URL of the xls file into Firefox and download it via the browser. Once this was done we could run the read_excel line without error.
This was happening today on Windows 10, with R 3.6.2 and R Studio 1.2.5033.
If you have downloaded the .xls data from the internet, then even when you open it in MS Excel it first shows a prompt asking you to confirm that you trust the source. I am guessing this is the reason R (read_xls) also can't open it, as it's considered unsafe. Save it as an .xlsx file and then use read_xlsx() or read_excel().
Even though this is not a code-based solution, I just changed the file type. For instance, instead of xls I saved it as csv or xlsx, and then opened it as a regular file.
It worked for me because, when I opened my xls file, a message popped up: "The file format and extension of 'file.xls' don't match. The file could be corrupted or unsafe..."

Test package function that writes to disk

I am trying to write a test for a package function in R.
Let's say we have a function that simply writes a string x to disk using writeLines():
exporting_function <- function(x, file) {
  writeLines(x, con = file)
  invisible(NULL)
}
One way of testing it would be to check if a file exists. Typically, it should not exist at first, but after the exporting function was run it should. Also, you might want to test the file size to be greater than 0:
library(testthat)

test_that("file is written to disk", {
  file = 'output.txt'
  expect_false(file.exists(file))
  exporting_function("This is a test",
                     file = file)
  expect_true(file.exists(file))
  expect_gt(file.info('output.txt')$size, 0)
})
Is this a good way to test it? In the CRAN Repository Policy it states that Packages should not write in the user’s home filespace (including clipboards), nor anywhere else on the file system apart from the R session’s temporary directory. Would this test violate this constraint?
There is an expect_output_file function. From the documentation and examples I am not sure whether this is a more appropriate expectation for testing the function. It requires, among other things, an object argument, which should be the object to test. What is the object to test in my case?
That looks as if it violates CRAN policy. Why not simply write to the temporary directory, using
file <- tempfile()
in place of
file = 'output.txt'
?
As to whether it is a good test: wouldn't it be better to try reading the file back in, and confirming that what was read matches what was written? That's easy in your toy example. It might be harder in the real one, but having an import function paired with your export function is always a good idea.
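A sketch combining both suggestions (tempfile() plus a read-back check; the particular expectations are my assumptions about what you want to verify, not something prescribed by testthat):
library(testthat)

test_that("exported file round-trips", {
  file <- tempfile(fileext = ".txt")
  on.exit(unlink(file))
  exporting_function("This is a test", file = file)
  expect_true(file.exists(file))
  expect_identical(readLines(file), "This is a test")
})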

Creating zip file from folders in R

I'm trying to create a zip file from one folder using R.
It mentioned "Rcompression" package here:
Creating zip file from folders
But I didn't find where I can download this package for Windows system.
Any suggestions, or other functions to create a zip file?
You can create a zip file with the function zip from utils package quite easily. Say you have a directory testDir and you wish to zip a file (or multiple files) inside the directory,
dir('testDir')
# [1] "cats.csv" "test.csv" "txt.txt"
zip(zipfile = 'testZip', files = 'testDir/test.csv')
# adding: testDir/test.csv (deflated 68%)
The zipped file is saved in the current working directory, unless a different path is specified in the zipfile argument. We can see its size relative to the original unzipped file with
file.info(c('testZip.zip', 'testDir/test.csv'))['size']
# size
# testZip.zip 805
# testDir/test.csv 1493
You can zip the whole directory of files (if no sub-folders) with
files2zip <- dir('testDir', full.names = TRUE)
zip(zipfile = 'testZip', files = files2zip)
# updating: testDir/test.csv (deflated 68%)
# updating: testDir/cats.csv (deflated 27%)
# updating: testDir/txt.txt (stored 0%)
And unzip it to view the files,
unzip('testZip.zip', list = TRUE)
# Name Length Date
# 1 testDir/test.csv 1493 2014-05-14 20:54:00
# 2 testDir/cats.csv 116 2014-05-14 20:54:00
# 3 testDir/txt.txt 32 2014-05-08 09:37:00
Note: From ?zip, regarding the zip argument.
On Windows, the default relies on a zip program (for example that from Rtools) being in the path.
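A quick way to check whether R can actually see a zip program before relying on zip() (a small sketch; Sys.which returns an empty string when nothing is found on the PATH):
Sys.which("zip")
# named character vector; "" means no zip executable was found on the PATH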
To avoid (a) an issue with relative paths (i.e., the zip file itself containing the full folder path of the files being zipped) and (b) for loops (well, for style), you may use:
my_wd <- getwd()   # save your current working directory path
dest_path <- "C:/.../folder_with_files_to_be_zipped"
setwd(dest_path)

files <- list.files(dest_path)
named <- paste0(files, ".zip")

mapply(zip, zipfile = named, files = files)

setwd(my_wd)       # reset working directory path
Unlike R's built-in unzip function, zip requires a zip program such as 7-Zip (on Windows) or the one shipped with Rtools to be present on your system path.
For people still looking for this: there is now a "zip" package that does not depend on external executables.
Alternatively, you can install Rcompression from the omegahat repos:
install.packages('Rcompression', repos = "http://www.omegahat.org/R", type = "source")
On Windows you will need to jump through hoops installing zlib and bzip2 and linking them appropriately.
utils::zip can be used in some cases, but it has a number of issues. One is that on Windows the maximum length of the command-line string is 8191 characters (2047 characters on some versions). If you are zipping a directory with a lot of characters in its directory/file names, this will cause issues; for example, if you zip your Firefox profile directory. Also, I found that the zip command needed to be issued relative to the directory I was zipping in order to get relative directory names. Rcompression has an altNames argument which handles this.
That being said, I have always had problems getting Rcompression to run on Windows.
It's worth noting that zip() will fail silently if it cannot find a zip program.
zip returns an error code (or exit code) invisibly. That is, it will not print, unless you explicitly ask it to.
You can run print(zip(output, input)) to print the exit code which, if no zip program is found, will be 127.
Alternatively you can do something along the lines of
# exit code 0 for success, all other codes are for failure;
# note the extra parentheses so the assignment happens before the comparison
if ((exit_code <- zip(output, input)) != 0) {
  stop("Zipping ", input, " failed with exit code: ", exit_code)
}
Another option is to zip each folder under a directory into its own archive:
# Convert every folder into a .zip
d <- "C:/Users/Eric/Documents/R/win-library/3.3"
array <- list.files(d)
for (i in 1:length(array)) {
  name <- paste0(array[i], ".zip")
  zip(name, files = paste0(d, "/", array[i]))
}

R: sourcing files using a relative path

Sourcing files using a relative path is useful when dealing with large codebases. Other programming languages have well-defined mechanisms for sourcing files using a path relative to the directory of the file being sourced into. An example is Ruby's require_relative. What is a good way to implement relative path sourcing in R?
Below is what I pieced together a while back using various recipes and R forum posts. It's worked well for me for straight development but is not robust. For example, it breaks when the files are loaded via the testthat library, specifically auto_test(). rscript_stack() returns character(0).
# Returns the stack of RScript files
rscript_stack <- function() {
  Filter(Negate(is.null), lapply(sys.frames(), function(x) x$ofile))
}

# Returns the current RScript file path
rscript_current <- function() {
  stack <- rscript_stack()
  r <- as.character(stack[length(stack)])
  first_char <- substring(r, 1, 1)
  if (first_char != '~' && first_char != .Platform$file.sep) {
    r <- file.path(getwd(), r)
  }
  r
}

# Sources relative to the current script
source_relative <- function(relative_path, ...) {
  source(file.path(dirname(rscript_current()), relative_path), ...)
}
Do you know of a better source_relative implementation?
After a discussion with @hadley on GitHub, I realized that my question goes against the common development patterns in R.
It seems that in R files that are sourced often assume that the working directory (getwd()) is set to the directory they are in. To make this work, source has a chdir argument whose default value is FALSE. When set to TRUE, it will change the working directory to the directory of the file being sourced.
In summary:
Assume that source is always relative because the working directory of the file being sourced is set to the directory where the file is.
To make this work, always set chdir=T when you source files from another directory, e.g., source('lib/stats/big_stats.R', chdir=T).
For convenient sourcing of entire directories in a predictable way I wrote sourceDir, which sources files in a directory in alphabetical order.
sourceDir <- function(path, pattern = "\\.[rR]$", env = NULL, chdir = TRUE) {
  files <- sort(dir(path, pattern, full.names = TRUE))
  lapply(files, source, chdir = chdir)
}
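Usage is then a one-liner (a sketch, assuming a lib/stats directory relative to the current working directory, as in the earlier example):
sourceDir("lib/stats")  # sources every *.R file in the directory, alphabetically, with chdir = TRUE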
