I'm trying to set up a workflow that involves downloading a zip file, extracting its contents, and applying a function to each one of its files.
There are a few issues I'm running into:
How do I set up an empty file system reproducibly? Namely, I'm hoping to create a system of empty directories to which files will later be downloaded. Ideally, I'd like to do something like tar_target(my_dir, fs::dir_create("data"), format = "file"), but I know from the documentation that empty directories cannot be used with format = "file". I know I could just call dir_create at every instance where I need it, but this seems clumsy.
In the reprex below I'd like to operate individually on each file using pattern = map(x). As the error suggests, I'd need to specify a pattern for the parent target, since format = "file". You can see that if I did specify a pattern for the parent target, I would again need to do it for its parent target. As far as I know, a pattern cannot be set for a target that has no parents (but I have been wrong many times before).
I have a feeling I'm going about this all wrong - thank you for your time.
library(targets)
tar_script({
  tar_option_set(packages = c("tidyverse", "fs"))
  download_file <- function(url, dest) {
    download.file(url, dest)
    dest
  }
  do_stuff <- function(file_path) {
    fs::file_copy(file_path, file_path, overwrite = TRUE)
  }
  list(
    tar_target(downloaded_zip,
               download_file("https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
                             path(dir_create("data"), "file", ext = "zip")),
               format = "file"),
    tar_target(extracted_files,
               unzip(downloaded_zip, exdir = dir_create("data")),
               format = "file"),
    tar_target(stuff_done,
               do_stuff(extracted_files),
               pattern = map(extracted_files), format = "file",
               iteration = "list"))
})
tar_make()
#> * start target downloaded_zip
#> trying URL 'https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip'
#> Content type 'application/zip' length 2036861 bytes (1.9 MB)
#> ==================================================
#> downloaded 1.9 MB
#>
#> * built target downloaded_zip
#> * start target extracted_files
#> * built target extracted_files
#> * end pipeline
#> Error : Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Error: callr subprocess failed: Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Visit https://books.ropensci.org/targets/debugging.html for debugging advice.
Created on 2021-12-08 by the reprex package (v2.0.1)
Original answer
Here's an idea: you could track that URL with format = "url" and then make the URL a dependency of all the file branches. Below, all of the files should rerun when the upstream online data changes. That's fine because all that does is re-hash them. But not all branches of stuff_done need to rerun if only some of those files actually changed.
Edit
On second thought, we probably need to hash the local files all in bulk. Not the most efficient, but it gets the job done. targets wants you to use its own built-in storage system instead of external files, so if you can read the data in and return it in a non-file format, dynamic branching will be easier.
# _targets.R file
library(targets)
tar_option_set(packages = c("tidyverse", "fs"))
download_file <- function(url, dest) {
  download.file(url, dest)
  dest
}
do_stuff <- function(file_path) {
  file.info(file_path)
}
download_and_unzip <- function(url) {
  downloaded_zip <- tempfile()
  download_file(url, downloaded_zip)
  unzip(downloaded_zip, exdir = dir_create("data"))
}
list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  tar_target(
    files_bulk,
    download_and_unzip(url),
    format = "file"
  ),
  tar_target(file_names, files_bulk), # not a format = "file" target
  tar_target(
    files, {
      files_bulk # Re-hash all the files separately if any file changes.
      file_names
    },
    pattern = map(file_names),
    format = "file"
  ),
  tar_target(stuff_done, do_stuff(files), pattern = map(files))
)
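To illustrate the point about non-file formats, here is a minimal sketch of that alternative: read each file's contents into an ordinary R object in one target and branch over the data, so nothing downstream needs format = "file". The use of readr::read_csv is an assumption for illustration (the example zip actually contains other file types); substitute whatever reader matches your data, and note that do_stuff would then receive a data frame rather than a file path.
# Sketch, not the answer's method: branch over data objects instead of files.
list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  tar_target(
    files_bulk,
    download_and_unzip(url),
    format = "file"
  ),
  tar_target(
    file_data,
    lapply(files_bulk, readr::read_csv), # return data, not paths (reader is an assumption)
    iteration = "list"
  ),
  tar_target(
    stuff_done,
    do_stuff(file_data), # do_stuff now gets one data object per branch
    pattern = map(file_data),
    iteration = "list"
  )
)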
Related
I am writing a package which will be used to create automated reports.
There is one function, createPdfReport, which basically looks as follows (I use RStudio):
createPdfReport <- function(dataset, save_path) {
  rmdName <- strsplit(x = basename(dataset), split = ".", fixed = TRUE)[[1]][1]
  # some code here which uses "dataset"
  relPath <- dirname(rstudioapi::getSourceEditorContext()$path)
  rmarkdown::render(input = paste0(relPath, "/myRMDfile.Rmd"),
                    output_dir = save_path,
                    output_file = paste0(rmdName, ".html"),
                    encoding = "UTF-8", quiet = TRUE)
}
Most likely, R will eventually run on a server, and it is not clear which operating system or editor will be used there.
Therefore, I would like to get rid of rstudioapi::getSourceEditorContext().
But how? I could not find anything.
createPdfReport is part of a typical package with the following structure:
DESCRIPTION
NAMESPACE
/man
/R
createPdfReport.R --> Contains the function createPdfReport() above
myRMDfile.Rmd
/tests
You could store myRMDfile.Rmd in inst/extdata; see the "package raw data" section of R Packages.
This allows you to get the file path after package installation with:
system.file("extdata", "myRMDfile.Rmd", package = "myPackage")
I found an old thread (How do you read a password protected excel file into r?) that recommended that I use the following code to read in a password protected file:
install.packages("excel.link")
library("excel.link")
dat <- xl.read.file("TestWorkbook.xlsx", password = "pass", write.res.password="pass")
dat
However, when I try to do this my R immediately crashes. I've tried removing the write.res.password argument, and that doesn't seem to be the issue. I have a hunch that excel.link might not work with the newest version of R, so if you know of any other ways to do this I'd appreciate the advice.
EDIT: Using read.xlsx generates this error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
org.apache.poi.poifs.filesystem.OfficeXmlFileException:
The supplied data appears to be in the Office 2007+ XML.
You are calling the part of POI that deals with OLE2 Office Documents.
You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
You can remove the password of the Excel file without knowing it with the following function (an adapted version of the code available at https://www.r-bloggers.com/2018/05/remove-password-protection-from-excel-sheets-using-r/):
remove_Password_Protection_From_Excel_File <- function(dir, file, bool_XLSXM = FALSE) {
  initial_Dir <- getwd()
  setwd(dir)
  # file name and path after removing protection
  if (bool_XLSXM == TRUE) {
    file_unlocked <- stringr::str_replace(basename(file), "\\.xlsm$", "_unlocked.xlsm")
  } else {
    file_unlocked <- stringr::str_replace(basename(file), "\\.xlsx$", "_unlocked.xlsx")
  }
  file_unlocked_path <- file.path(dir, file_unlocked)
  # create temporary directory in project folder
  # so we see what is going on
  temp_dir <- "_tmp"
  # remove and recreate _tmp folder in case it already exists
  unlink(temp_dir, recursive = TRUE)
  dir.create(temp_dir)
  # unzip Excel file into temp folder
  unzip(file, exdir = temp_dir)
  # get full paths to the XML files for all worksheets
  worksheet_paths <- list.files(paste0(temp_dir, "/xl/worksheets"),
                                full.names = TRUE, pattern = "\\.xml$")
  # remove the XML node which contains the sheet protection
  # We might of course use e.g. xml2 to parse the XML file,
  # but this simple approach will suffice here
  for (ws in worksheet_paths) {
    file_Content <- readLines(ws, encoding = "windows1")
    # the "sheetProtection" node contains the hashed password "<sheetProtection SOME INFO />"
    # we simply remove the whole node
    out <- stringr::str_replace(file_Content, "<sheetProtection.*?/>", "")
    writeLines(out, ws)
  }
  workbook_Path <- paste0(temp_dir, "/xl/workbook.xml")
  file_Content <- readLines(workbook_Path, encoding = "windows1")
  out <- stringr::str_replace(file_Content, "<workbookProtection.*?/>", "")
  writeLines(out, workbook_Path)
  # create a new zip, i.e. Excel file, containing the modified XML files
  old_wd <- setwd(temp_dir)
  files <- list.files(recursive = TRUE, full.names = FALSE, all.files = TRUE, no.. = TRUE)
  # as the Excel file is a zip file, we can directly replace the .zip extension by .xlsx
  zip::zip(file_unlocked_path, files = files) # utils::zip does not work for some reason
  setwd(old_wd)
  # clean up and remove temporary directory
  unlink(temp_dir, recursive = TRUE)
  setwd(initial_Dir)
}
Once the password is removed, you can read the Excel file. This approach works for me.
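For example (the directory and file names here are placeholders), you can unlock the workbook and then read the unlocked copy with any xlsx reader:
# Sketch: unlock a protected workbook, then read the "_unlocked" copy it writes
remove_Password_Protection_From_Excel_File(dir = "C:/data", file = "report.xlsx")
dat <- openxlsx::read.xlsx(file.path("C:/data", "report_unlocked.xlsx"))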
I want to, programmatically, source all .R files contained within a given array retrieved with the Sys.glob() function.
This is the code I wrote:
# fetch the different ETL parts
parts <- Sys.glob("scratch/*.R")
if (length(parts) > 0) {
  for (part in parts) {
    # source the ETL part
    source(part)
    # rest of code goes here
    # ...
  }
} else {
  stop("no ETL parts found (no data to process)")
}
The problem is that I cannot do this; or, at least, I get the following error:
simpleError in source(part): scratch/foo.bar.com-https.R:4:151: unexpected string constant
I've tried different combinations for the source() function like the following:
source(sprintf("./%s", part))
source(toString(part))
source(file = part)
source(file = sprintf("./%s", part))
source(file = toString(part))
No luck. As I'm globbing the contents of a directory, I need to tell R to source those files. As it's a custom-tailored ETL (extract, transform and load) script, I can manually write:
source("scratch/foo.bar.com-https.R")
source("scratch/bar.bar.com-https.R")
source("scratch/baz.bar.com-https.R")
But that's dirty, and right now there are 3 different extraction patterns. There could be 8, 80, or even 2000 different patterns, so writing them by hand is not an option.
How can I do this?
Try getting the list of files with dir and then using lapply:
For example, if your files are of the form t1.R, t2.R, etc., and are inside the path "StackOverflow" do:
d = dir(pattern = "^t\\d\\.R$", path = "StackOverflow/", recursive = T, full.names = T)
m = lapply(d, source)
The option recursive = T will search all subdirectories, and full.names = T will add the path to the filenames.
If you still want to use Sys.glob(), this works too:
d = Sys.glob(paths = "StackOverflow/t*.R")
m = lapply(d, source)
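One note on the error itself: "unexpected string constant" at scratch/foo.bar.com-https.R:4:151 is a parse error raised from inside that file, not a problem with how source() is called. If you want the loop to keep going and report which files fail to parse, a sketch:
parts <- Sys.glob("scratch/*.R")
results <- lapply(parts, function(part) {
  # try to source each file; report and skip any that fail to parse
  tryCatch(source(part),
           error = function(e) {
             message("failed to source ", part, ": ", conditionMessage(e))
             NULL
           })
})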
I would like to do text mining of the files on this website using the tm package. I am using the following code to download one of the files (i.e., abell.pdf) to my working directory and attempt to store the contents:
library("tm")
url <- "https://baltimore2006to2010acsprofiles.files.wordpress.com/2014/07/abell.pdf"
filename <- "abell.pdf"
download.file(url = url, destfile = filename, method = "curl")
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),
language = "en", id = "id1")
But I receive the following error and warnings:
Error in strptime(d, fmt) : input string is too long
In addition: Warning messages:
1: In grepl(re, lines) : input string 1 is invalid in this locale
2: In grepl(re, lines) : input string 2 is invalid in this locale
The pdfs aren't particularly long (5 pages, 978 KB), and I have been able to successfully use the readPDF function to read in other pdf files on my Mac OSX. The information I want most (the total population for the 2010 census) is on the first page of each pdf, so I've tried shortening the pdf to just the first page, but I get the same message.
I am new to the tm package, so I apologize if I am missing something obvious. Any help is greatly appreciated!
Based on what I've read, this error has something to do with the way the readPDF function tries to build metadata for the file you're importing. In any case, you can change the metadata handling by using the info option. For example, I usually circumvent this error by modifying the command in the following way (using your code):
doc <- readPDF(control = list(info="-f",text = "-layout"))(elem = list(uri = filename),language = "en", id = "id1")
The addition of info = "-f" is the only change. This doesn't really "fix" the problem, but it bypasses the error. Cheers :)
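If the call goes through, you can sanity-check the extracted text before mining it. content() comes from the NLP package, which tm attaches; the "Population" pattern below is just an illustration for pulling out the census line you mentioned:
txt <- content(doc)  # character vector of the document's text
head(txt)
grep("Population", txt, ignore.case = TRUE, value = TRUE)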
I want to save my tab-delimited files manually. By that I mean I want the user to choose the directory and file name when saving the data. (For example, I have merged individual files into a single file and want to save it.)
Usually I use write.table, but write.table requires the directory path and file name to be defined within the function call. I want a way for the user to save the file under any name in a directory of their choosing.
Just use the file.choose() function,like this:
write.table(yerdata, file = file.choose(new = TRUE))
On Windows, at least, that will bring up a dialog for save-as commands.
Annoyingly, the tcltk package doesn't have a function for 'Save As'; it only has a file selector for choosing an existing file.
Luckily you can take the DIY approach via some tcl calls:
require(tcltk)
write.table(yerdata,file = tclvalue(tcl("tk_getSaveFile")))
The source of the tk_choose.files function could be used as a template to write a nicer interface to tcl("tk_getSaveFile") if needed; a sketch follows. It does seem to be a glaring omission in package:tcltk, though...
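For instance, a minimal wrapper sketch (the function name and the filetypes default are my own choices, not part of tcltk):
require(tcltk)

# Minimal 'Save As' wrapper around Tk's tk_getSaveFile dialog
tk_save_file <- function(default = "", filetypes = "{ {All Files} * }") {
  tclvalue(tcl("tk_getSaveFile",
               initialfile = default,
               filetypes = filetypes))
}

write.table(yerdata, file = tk_save_file("yerdata.txt"))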
Using gWidgets:
gfile("Save yerdata", type = "save", handler = function(h, ...)
{
write.table(yerdata, file = h$file)
})
One (perhaps less than ideal) option would be to use readline to prompt the user for the full path and file name (or just the file name, if you want to choose the directory programmatically) and then simply pass that value on to write.table. Here's a sketch:
FILE_PATH <- readline(prompt = "Enter a full path and file name: ")
#Some checking to make sure you got a valid file path...
write.table(yerdata, file = FILE_PATH)
Note that as per ?readline this will really only work when running R interactively.
As it is 2017 now, the tcltk2 package offers an improvement over tcltk:
library(tcltk2)
filename <- tclvalue(tkgetSaveFile())
if (!nchar(filename)) {
  tkmessageBox(message = "No file was selected!")
} else {
  tkmessageBox(message = paste("The file selected was", filename))
}
And the use of filters, let's say one should only save as JPG/JPEG:
jpeg_filename <- tclvalue(
  tkgetSaveFile(initialfile = "foo.jpg",
                filetypes = "{ {JPEG Files} {.jpg .jpeg} } { {All Files} * }")
)