Decompress .gz file from Azure blob in R

I want to read a .csv.gz file from an Azure blob container, but I am struggling with the .gz format. When I download the file locally and then read it in R with readr, it works fine. But when I try to read it from Azure, the file isn't read properly; it seems that the file is not decompressed. This is the code I used to read the file locally (read_csv2 also works fine):
df <- read_delim("filename.csv.gz", delim = ";",
                 col_names = c('epoch','SegmentID','TT','Speed','LoS','Coverage'),
                 col_types = cols(epoch = col_integer(), SegmentID = col_integer(), TT = col_integer(),
                                  Speed = col_integer(), LoS = col_integer(), Coverage = col_integer()))
And this is what I tried in order to read the file directly from Azure:
blob_urls_with_sas <- paste("https://name.blob.core.windows.net",
                            "/container/filename.csv.gz", sas_token, sep = "")
dfAzure <- read_delim(blob_urls_with_sas, delim = ";",
                      col_names = c('epoch','SegmentID','TT','Speed','LoS','Coverage'),
                      col_types = cols(epoch = col_integer(), SegmentID = col_integer(), TT = col_integer(),
                                       Speed = col_integer(), LoS = col_integer(), Coverage = col_integer()))
or, using the AzureStor package:
test <- storage_read_delim(cont, "filename.csv.gz", delim = ";",
                           col_names = c('epoch','SegmentID','TT','Speed','LoS','Coverage'),
                           col_types = cols(epoch = col_integer(), SegmentID = col_integer(), TT = col_integer(),
                                            Speed = col_integer(), LoS = col_integer(), Coverage = col_integer()))

One option would be to use fread() from data.table, which handles .gz files natively. Make sure you install R.utils first.
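For example, you could download the blob to a temporary file with AzureStor and then read it locally. A sketch, assuming the container object cont and the file name from the question; fread() decompresses local .gz files transparently once R.utils is installed:
library(AzureStor)
library(data.table)
# download the blob to a temporary .csv.gz file, then read it locally
tmp <- tempfile(fileext = ".csv.gz")
storage_download(cont, src = "filename.csv.gz", dest = tmp)
df <- fread(tmp, sep = ";",
            col.names = c("epoch", "SegmentID", "TT", "Speed", "LoS", "Coverage"))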

Related

How to reference Rmd files which should be rendered (within a package function)

I am writing a package which will be used to create automated reports.
There is one function, createPdfReport, which basically looks as follows (I use RStudio):
createPdfReport <- function(dataset, save_path) {
  rmdName <- strsplit(x = basename(dataset), split = ".", fixed = TRUE)[[1]][1]
  # some code here which uses "dataset"
  relPath <- dirname(rstudioapi::getSourceEditorContext()$path)
  rmarkdown::render(input = paste0(relPath, "/myRMDfile.Rmd"),
                    output_dir = save_path,
                    output_file = paste0(rmdName, ".html"),
                    encoding = "UTF-8", quiet = TRUE)
}
Most likely, R will eventually run on a server, and it is not clear which operating system or editor will be used there.
Therefore, I would like to get rid of rstudioapi::getSourceEditorContext().
But how? I could not find anything.
createPdfReport is part of a typical package with the following structure:
DESCRIPTION
NAMESPACE
/man
/R
  createPdfReport.R  --> contains the function createPdfReport() above
  myRMDfile.Rmd
/tests
You could store myRMDfile.Rmd in inst/extdata; see the "raw data" section of the R Packages book.
This allows you to get the file path after package installation with:
system.file("extdata", "myRMDfile.Rmd", package = "myPackage")

Dealing with zip files in a targets workflow

I'm trying to set up a workflow that involves downloading a zip file, extracting its contents, and applying a function to each one of its files.
There are a few issues I'm running into:
How do I set up an empty file system reproducibly? Namely, I'm hoping to be able to create a system of empty directories to which files will later be downloaded. Ideally, I'd like to do something like tar_target(my_dir, fs::dir_create("data"), format = "file"), but I know from the documentation that empty directories cannot be used with format = "file". I know I could just do a dir_create at every point where I need it, but this seems clumsy.
In the reprex below I'd like to operate individually on each file using pattern = map(x). As the error suggests, I'd need to specify a pattern for the parent target, since format = "file". You can see that if I did specify a pattern for the parent target, I would again need to do it for its parent target. As far as I know, a pattern cannot be set for a target that has no parents (but I have been wrong many times before).
I have a feeling I'm going about this all wrong - thank you for your time.
library(targets)
tar_script({
  tar_option_set(packages = c("tidyverse", "fs"))
  download_file <- function(url, dest) {
    download.file(url, dest)
    dest
  }
  do_stuff <- function(file_path) {
    fs::file_copy(file_path, file_path, overwrite = TRUE)
  }
  list(
    tar_target(downloaded_zip,
               download_file("https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
                             path(dir_create("data"), "file", ext = "zip")),
               format = "file"),
    tar_target(extracted_files,
               unzip(downloaded_zip, exdir = dir_create("data")),
               format = "file"),
    tar_target(stuff_done,
               do_stuff(extracted_files),
               pattern = map(extracted_files), format = "file",
               iteration = "list"))
})
tar_make()
#> * start target downloaded_zip
#> trying URL 'https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip'
#> Content type 'application/zip' length 2036861 bytes (1.9 MB)
#> ==================================================
#> downloaded 1.9 MB
#>
#> * built target downloaded_zip
#> * start target extracted_files
#> * built target extracted_files
#> * end pipeline
#> Error : Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Error: callr subprocess failed: Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Visit https://books.ropensci.org/targets/debugging.html for debugging advice.
Created on 2021-12-08 by the reprex package (v2.0.1)
Original answer
Here's an idea: you could track that URL with format = "url" and then make the URL a dependency of all the file branches. Below, all of the files should rerun when the upstream online data changes. That's fine, because all that does is re-hash them. But not all branches of stuff_done should rerun if only some of those files actually changed.
Edit
On second thought, we probably need to hash the local files all in bulk. Not the most efficient, but it gets the job done. targets wants you to use its own built-in storage system instead of external files, so if you can read the data in and return it in a non-file format, dynamic branching will be easier (see the sketch after the code below).
# _targets.R file
library(targets)
tar_option_set(packages = c("tidyverse", "fs"))
download_file <- function(url, dest) {
  download.file(url, dest)
  dest
}
do_stuff <- function(file_path) {
  file.info(file_path)
}
download_and_unzip <- function(url) {
  downloaded_zip <- tempfile()
  download_file(url, downloaded_zip)
  unzip(downloaded_zip, exdir = dir_create("data"))
}
list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  tar_target(
    files_bulk,
    download_and_unzip(url),
    format = "file"
  ),
  tar_target(file_names, files_bulk), # not a format = "file" target
  tar_target(
    files, {
      files_bulk # Re-hash all the files separately if any file changes.
      file_names
    },
    pattern = map(file_names),
    format = "file"
  ),
  tar_target(stuff_done, do_stuff(files), pattern = map(files))
)
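And here is a sketch of the non-file alternative mentioned above, for files small enough to hold in memory: read each file's contents into R once, so downstream branching happens over ordinary R objects rather than external paths. read_one is a hypothetical reader; replace it with whatever suits your file type:
# _targets.R file
library(targets)
download_file <- function(url, dest) {
  download.file(url, dest)
  dest
}
download_and_unzip <- function(url) {
  downloaded_zip <- tempfile()
  download_file(url, downloaded_zip)
  unzip(downloaded_zip, exdir = fs::dir_create("data"))
}
read_one <- function(f) readBin(f, what = "raw", n = file.size(f))
list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  tar_target(files_bulk, download_and_unzip(url), format = "file"),
  # read the data in: branches now track content, not file paths
  tar_target(file_contents, lapply(files_bulk, read_one), iteration = "list"),
  tar_target(stuff_done, length(file_contents), pattern = map(file_contents))
)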

Importing a password protected xlsx file into R

I found an old thread (How do you read a password protected excel file into r?) that recommended that I use the following code to read in a password protected file:
install.packages("excel.link")
library("excel.link")
dat <- xl.read.file("TestWorkbook.xlsx", password = "pass", write.res.password="pass")
dat
However, when I try to do this, my R session immediately crashes. I've tried removing the write.res.password argument, and that doesn't seem to be the issue. I have a hunch that excel.link might not work with the newest version of R, so if you know of any other ways to do this, I'd appreciate the advice.
EDIT: Using read.xlsx generates this error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
org.apache.poi.poifs.filesystem.OfficeXmlFileException:
The supplied data appears to be in the Office 2007+ XML.
You are calling the part of POI that deals with OLE2 Office Documents.
You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
You can remove the password from the Excel file without knowing it, using the following function (an adapted version of the code available at https://www.r-bloggers.com/2018/05/remove-password-protection-from-excel-sheets-using-r/):
remove_Password_Protection_From_Excel_File <- function(dir, file, bool_XLSXM = FALSE)
{
  initial_Dir <- getwd()
  setwd(dir)
  # file name and path after removing protection
  if (bool_XLSXM == TRUE)
  {
    file_unlocked <- stringr::str_replace(basename(file), ".xlsm$", "_unlocked.xlsm")
  } else
  {
    file_unlocked <- stringr::str_replace(basename(file), ".xlsx$", "_unlocked.xlsx")
  }
  file_unlocked_path <- file.path(dir, file_unlocked)
  # create a temporary directory in the project folder so we can see what is going on
  temp_dir <- "_tmp"
  # remove and recreate the _tmp folder in case it already exists
  unlink(temp_dir, recursive = TRUE)
  dir.create(temp_dir)
  # unzip the Excel file into the temp folder
  unzip(file, exdir = temp_dir)
  # get the full paths to the XML files for all worksheets
  worksheet_paths <- list.files(paste0(temp_dir, "/xl/worksheets"), full.names = TRUE, pattern = ".xml")
  # remove the XML node which contains the sheet protection;
  # we could of course use e.g. xml2 to parse the XML file, but this simple approach suffices here
  for (ws in worksheet_paths)
  {
    file_Content <- readLines(ws, encoding = "UTF-8") # the XML parts of an xlsx file are UTF-8
    # the "sheetProtection" node contains the hashed password "<sheetProtection SOME INFO />";
    # we simply remove the whole node
    out <- stringr::str_replace(file_Content, "<sheetProtection.*?/>", "")
    writeLines(out, ws)
  }
  worksheet_Protection_Paths <- paste0(temp_dir, "/xl/workbook.xml")
  file_Content <- readLines(worksheet_Protection_Paths, encoding = "UTF-8")
  out <- stringr::str_replace(file_Content, "<workbookProtection.*?/>", "")
  writeLines(out, worksheet_Protection_Paths)
  # create a new zip, i.e. Excel file, containing the modified XML files
  old_wd <- setwd(temp_dir)
  files <- list.files(recursive = TRUE, full.names = FALSE, all.files = TRUE, no.. = TRUE)
  # as an Excel file is a zip file, we can directly replace the .zip extension with .xlsx
  zip::zip(file_unlocked_path, files = files) # utils::zip does not work for some reason
  setwd(old_wd)
  # clean up and remove the temporary directory
  unlink(temp_dir, recursive = TRUE)
  setwd(initial_Dir)
}
Once the password is removed, you can read the Excel file. This approach works for me.
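For example (the paths are placeholders), after stripping the protection you can open the unlocked copy with a regular reader such as openxlsx:
library(openxlsx)
# strip the sheet/workbook protection, then read the unlocked copy as usual
remove_Password_Protection_From_Excel_File(dir = "C:/my/folder", file = "TestWorkbook.xlsx")
dat <- read.xlsx("C:/my/folder/TestWorkbook_unlocked.xlsx")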

Loading multiple RDS files saved in the same directory using R

I'm trying to load multiple .rds files that are saved in the same directory. I have written a function for that, and I iterate over a list of the files in the directory to load them, but it doesn't work. See below what I wrote:
markerDir = "..."
markerFilesList <- list.files(markerDir, pattern = ".rds", recursive = TRUE, include.dirs = TRUE)
readRDSfct <- function(markerFilesList, markerDir, i) {
  print(paste0("Reading the marker file called: ", basename(markerFilesList[[i]])))
  nameVariableTmp = basename(markerFilesList[[i]])
  nameVariable = gsub(pattern = "\\.rds", '', nameVariableTmp)
  print(paste0("file saved in variable called: ", nameVariable))
  currentRDSfile = readRDS(paste0(markerDir, markerFilesList[[i]])) # nameVariable
  return(currentRDSfile)
}
for (i in 1:length(markerFilesList)) {
  readRDSfct(markerFilesList, markerDir, i)
}
Does anyone have a suggestion for how to do this?
Thanks in advance!
If I understand correctly, you want to load all the RDS files saved in the same directory into the R environment?
To load and bind all .RDS files in one directory, I use something like this:
List_RDS = list.files(pattern = "\\.RDS$")
List_using = lapply(List_RDS, readRDS)
Data_bind <- do.call("rbind", List_using)
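If you would rather keep each object separate (for example because the files cannot be row-bound), here is a small variation that returns a named list instead (a sketch, reusing markerDir from the question):
List_RDS <- list.files(markerDir, pattern = "\\.rds$", full.names = TRUE)
List_using <- lapply(List_RDS, readRDS)
# name each element after its file so the objects stay identifiable
names(List_using) <- gsub("\\.rds$", "", basename(List_RDS))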

Include query with R package

I have a SQL query that I would like to ship with an R package I have built, but when I try to include it in the inst, extdata, or data folders inside my R package, I don't know how to get the function to reference it. An example: the query file is myQuery.sql.
runDbQuery = function() {
  queryfile = 'folder/myQuery.sql'
  query = readChar(queryfile, file.info(queryfile)$size)
  require(RODBC)
  channel <- odbcConnect("mydb", uid = "uid", pwd = "pwd")
  dbResults = sqlQuery(channel = channel, query = query, as.is = T)
  close(channel)
  return(dbResults)
}
I put .sql files I use in packages in /inst/sql and then get the path to them in functions via:
system.file("sql/myquery.sql",package = "mypackage")
