Give a character string to define a file dependency in drake - r

I am learning drake to define my analysis workflow, but I have trouble getting drake to treat data files as dependencies.
I use the function file_in() inside drake_plan(), but it only works if I give the path to the file directly as a literal string. If I build the path with file.path() or store it in a variable, it doesn't work.
Examples:
# preparation
library(drake)
path.data <- "data"
dir.create(path.data)
write.csv(iris, file.path(path.data, "iris.csv"))
Working plan:
# working plan
working_plan <-
  drake_plan(iris_data = read.csv(file_in("data/iris.csv")),
             strings_in_dots = "literals")
working_config <- make(working_plan)
vis_drake_graph(working_config)
This plan works fine, and the file data/iris.csv is considered a dependency.
Not working plan:
# not working
notworking_plan <-
  drake_plan(iris_data = read.csv(file_in(file.path(path.data, "iris.csv"))),
             strings_in_dots = "literals")
notworking_config <- make(notworking_plan)
vis_drake_graph(notworking_config)
Here it is trying to read the file iris.csv instead of data/iris.csv.
Working but problem with dependency:
# working but "data/iris.csv" is not considered as a dependency
file.name <- file.path(path.data, "iris.csv")
notworking_plan <-
  drake_plan(iris_data = read.csv(file_in(file.name)),
             strings_in_dots = "literals")
notworking_config <- make(notworking_plan)
vis_drake_graph(notworking_config)
This last one works fine but the file is not considered a dependency, so drake doesn't re-run the plan if this file is changed.
So, is there a way to tell drake file dependencies from variables?

Thanks to tidy evaluation, if you add !! in front of the file.path() call, it is evaluated rather than quoted.
Also, in the new version of drake, the strings_in_dots = "literals" argument is deprecated.
library(drake)
path.data <- "data"
dir.create(path.data)
write.csv(iris, file.path(path.data, "iris.csv"))
# now working
notworking_plan <-
  drake_plan(iris_data = read.csv(file_in(!!file.path(path.data, "iris.csv"))))
notworking_plan
#> # A tibble: 1 x 2
#>   target    command
#>   <chr>     <expr>
#> 1 iris_data read.csv(file_in("data/iris.csv"))
Created on 2019-05-08 by the reprex package (v0.2.1)

Following an answer from the developers on GitHub: the code inside file_in() is not evaluated, which is why it is not possible to use file.path() in it.
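Building on that, a minimal sketch (my own illustration, not from the drake manual) of splicing a path stored in a variable into the plan with !!, so that the resolved literal path ends up inside file_in() and gets tracked:
library(drake)
file.name <- file.path("data", "iris.csv")   # computed outside the plan
spliced_plan <-
  drake_plan(iris_data = read.csv(file_in(!!file.name)))
spliced_plan
#> the command should now contain the literal call: read.csv(file_in("data/iris.csv"))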

Related

How can I input a single additional parameter to disk.frame's inmapfn at readin?

According to the article https://diskframe.com/articles/ingesting-data.html, a good use case for inmapfn as part of csv_to_disk.frame(...) is date conversion. In my data I know the name of the date column only at runtime and would like to feed it into a function that converts it at read-in time. The issue I am having is that it doesn't seem that any additional parameters can be passed into the inmapfn argument beyond the chunk itself. I can't hardcode the variable because the name of the column isn't known until runtime.
To clarify: inmapfn seems to run in its own environment to prevent data races and other parallelisation issues, but I know the variable won't be changed, so I am hoping there is some way to override this, as I can make sure that this is safe.
I know the function I am calling works when called on an arbitrary dataframe.
I have provided a reproducible example below.
library(tidyverse)
library(disk.frame)
setup_disk.frame()

a <- tribble(~dates,      ~val,
             "09feb2021",  2,
             "21feb2012",  2,
             "09mar2013",  3,
             "20apr2021",  4)
write_csv(a, "a.csv")

dates_col <- "dates"
tmp.df <- csv_to_disk.frame(
  "a.csv",
  outdir = file.path(tempdir(), "tmp.df"),
  in_chunk_size = 1L,
  inmapfn = function(chunk) {
    chunk[, sdate := as.Date(do.call(`$`, list(chunk, dates_col)), "%d%b%Y")]
  }
)
#> -----------------------------------------------------
#> Stage 1 of 2: splitting the file a.csv into smallers files:
#> Destination: C:\Users\joelk\AppData\Local\Temp\RtmpcFBBkr\file4a1876e87bf5
#> -----------------------------------------------------
#> Stage 1 of 2 took: 0.020s elapsed (0.000s cpu)
#> -----------------------------------------------------
#> Stage 2 of 2: Converting the smaller files into disk.frame
#> -----------------------------------------------------
#> csv_to_disk.frame: Reading multiple input files.
#> Please use `colClasses = ` to set column types to minimize the chance of a failed read
#> =================================================
#>
#> -----------------------------------------------------
#> -- Converting CSVs to disk.frame -- Stage 1 of 2:
#>
#> Converting 5 CSVs to 6 disk.frames each consisting of 6 chunks
#>
#> Error in do.call(`$`, list(chunk, dates_col)): object 'dates_col' not found
You can experiment with different backend and chunk_reader arguments. For example, if you set the backend to readr, the user-defined inmapfn will have access to previously defined variables. Furthermore, readr will do column-type guessing and will automatically impute Date columns if it recognizes the string format as a date (in your example data it wouldn't recognize that as a date, however).
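For illustration, a hedged sketch of that readr-backend idea (my own adaptation of the question's example; argument names may need adjusting for your disk.frame version):
library(disk.frame)
setup_disk.frame()

dates_col <- "dates"
tmp.df <- csv_to_disk.frame(
  "a.csv",
  outdir = file.path(tempdir(), "tmp.df.readr"),
  backend = "readr",
  inmapfn = function(chunk) {
    # with the readr backend, the closure can see dates_col defined above
    chunk[["sdate"]] <- as.Date(chunk[[dates_col]], format = "%d%b%Y")
    chunk
  }
)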
If you don't want to use the readr backend for performance reasons, then I would ask if your example correctly represents your actual scenario? I'm not seeing the need to pass in the date column as a variable in the example you provided.
There is a working solution in the Just-in-time transformation section of the link you provided, and I'm not seeing any added complexities between that example and yours.
If you really need to use the default backend and chunk_reader plan AND you really need to send the inmapfn function a previously defined variable, you can wrap the csv_to_disk.frame call in a wrapper function:
library(tibble)   # for tribble()
library(disk.frame)
setup_disk.frame()

df <- tribble(~dates,      ~val,
              "09feb2021",  2,
              "21feb2012",  2,
              "09mar2013",  3,
              "20apr2021",  4)
write.csv(df, file.path(tempdir(), "df.csv"), row.names = FALSE)

wrap_csv_to_disk <- function(col) {
  my_date_col <- col
  csv_to_disk.frame(
    file.path(tempdir(), "df.csv"),
    in_chunk_size = 1L,
    inmapfn = function(chunk, dates = my_date_col) {
      chunk[, dates] <- lubridate::dmy(chunk[[dates]])
      chunk
    })
}
date_col <- "dates"
df_disk_frame <- wrap_csv_to_disk(date_col)

str(collect(df_disk_frame)$dates)
#> Date[1:4], format: "2021-02-09" "2012-02-21" "2013-03-09" "2021-04-20"
I see. As a workaround, would it be possible to do something like this?
date_var = known_at_runtime()
saveRDS(date_var, "some/path/date_var.rds")
a = csv_to_disk.frame(files, inmapfn = function(chunk) {
  date_var = readRDS("some/path/date_var.rds")
  # do the rest
})
I think letting inmapfn accept other arguments is doable; see https://github.com/xiaodaigh/disk.frame/issues/377 for tracking.

How to find heavy objects that are not stored in .GlobalEnv?

I am trying to find which objects are taking a lot of memory in my R session, but the problem is that the object might have been invisibly created with an unknown name in an unknown environment.
If the object is stored in .GlobalEnv or a known environment, I can easily use a strategy like ls(envir) + get() + object.size() (see lsos on this post for example) to list all objects and their sizes, allowing me to identify the heavy ones.
However, the object in question might not be stored in .GlobalEnv; it might be in some obscure environment implicitly created by an external package. How can I, in that case, identify which object is using a lot of RAM?
The best case study is ggplot2 creating .last_plot in a dedicated environment. Looking under the hood one can find that it is stored in environment(ggplot2:::.store$get), so one can find it and eventually remove it. But if I didn't know that location or name a priori, would there be a way to find that there is a heavy object called .last_plot somewhere in memory?
pryr::mem_used()
#> 34.7 MB
## example: implicit creation of heavy and hidden object by ggplot
path <- tempfile()
if (!file.exists(path)) {
  saveRDS(as.data.frame(matrix(rep(1, 1e07), ncol = 5)), path)
}
pryr::mem_used()
#> 34.9 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
## Hidden object is not in .GlobalEnv
ls(.GlobalEnv, all.names = TRUE)
#> [1] "path"
## Here I know where to find it: environment(ggplot2:::.store$get)
ls(all.names = TRUE, envir = environment(ggplot2:::.store$get))
#> [1] ".last_plot"
pryr::object_size(get(".last_plot", environment(ggplot2:::.store$get))$data)
#> 80 MB
## But how could I have found this otherwise?
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't think there's any existing way to do this. If you combine @AllanCameron's answer with my comment, where you'd also run ls(y) for the environments y calculated as
ns <- loadedNamespaces()
for (x in ns) {
  y <- loadNamespace(x)
  # look at the size of everything in y
}
you still won't find all the environments. I think you could do it if you also examined every object that might contain a reference to an environment (e.g. every function, formula, list, and various exotic objects) but it would be tricky not to miss something or count things more than once.
Edited to add: Actually, pryr::object_size is pretty smart at reporting on the environments attached to objects, so we'd get close by searching namespaces. For example, to find the top 20 objects:
pryr::mem_used()
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#> 35 MB
path <- tempfile()
if (!file.exists(path)) {
  saveRDS(as.data.frame(matrix(rep(1, 1e07), ncol = 5)), path)
}
pryr::mem_used()
#> 35.2 MB
p1 <- ggplot2::ggplot(readr::read_rds(path), ggplot2::aes(V1))
rm(p1)
pryr::mem_used()
#> 127 MB
envs <- c(globalenv = globalenv(),
          sapply(loadedNamespaces(), function(ns) loadNamespace(ns)))
sizes <- lapply(envs, function(e) {
  objs <- ls(e, all = TRUE)
  sapply(objs, function(obj) pryr::object_size(get(obj, envir = e)))
})
head(sort(unlist(sizes), decreasing = TRUE), 20)
#> base..__S3MethodsTable__. utils..__S3MethodsTable__.
#> 96216872 83443704
#> grid..__S3MethodsTable__. ggplot2..__S3MethodsTable__.
#> 80945520 80636768
#> ggplot2..store methods..classTable
#> 80418936 10101152
#> graphics..__S3MethodsTable__. tools..check_packages
#> 9325608 5185880
#> compiler.inlineHandlers methods..genericTable
#> 3444600 2808440
#> Rcpp..__T__show:methods colorspace..__T__show:methods
#> 2474672 2447880
#> Rcpp..RcppClass Rcpp..__C__C++OverloadedMethods
#> 2127584 1990504
#> Rcpp..__C__RcppClass Rcpp..__C__C++Field
#> 1982576 1980176
#> Rcpp..__C__C++Constructor Rcpp..__T__$:base
#> 1979992 1939616
#> tools..install_packages Rcpp..__C__Module
#> 1904032 1899872
Created on 2020-11-03 by the reprex package (v0.3.0)
I don't know why those methods tables come out so large (I suspect it's because ggplot2 adds methods to those tables, so its environment gets captured); but somehow they are finding your object, because they aren't so big if I don't create it.
A hint about the issue is in the 5th object, listed as ggplot2..store (i.e. the object named .store in the ggplot2 namespace). That doesn't by itself tell you to look in the environments of the functions in .store, but at least it gets you started.
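As a hedged follow-up (mirroring what the question itself found), once ggplot2..store stands out you can drill into the environments of the functions it holds:
store <- get(".store", envir = asNamespace("ggplot2"))
ls(environment(store$get), all.names = TRUE)
#> [1] ".last_plot"
pryr::object_size(get(".last_plot", envir = environment(store$get)))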
Second edit:
Here are some tweaks to make the output a bit more readable.
# Unlist first, so we can clean up the names
sizes <- unlist(sizes)
# Replace the first dot with :::
names(sizes) <- sub(".", ":::", names(sizes), fixed = TRUE)
# Remove internal R objects
keep <- !grepl(".__", names(sizes), fixed = TRUE)
sizes <- sizes[keep]
With these changes, the output from sort(sizes, decreasing = TRUE) starts out as
ggplot2:::.store
80418936
base:::.userHooksEnv
47855920
base:::.Options
45016888
utils:::Rprof
44958416
If you do
unlist(lapply(search(), function(y) sapply(ls(y), function(x) object.size(get(x)))))
You will get a complete list of all the objects in all the environments on your search path, including their sizes. You can then sort these and find the offending objects.
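As a small sketch of that sorting step (my own variant; get() is given pos = so each name is looked up in its own search-path entry):
obj_sizes <- unlist(lapply(search(), function(e) {
  sapply(ls(e), function(x) object.size(get(x, pos = e)))
}))
head(sort(obj_sizes, decreasing = TRUE), 10)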

How to apply rma() normalization to a unique CEL file?

I have implemented a R script that performs batch correction on a gene expression dataset. To do the batch correction, I first need to normalize the data in each CEL file through the Affy rma() function of Bioconductor.
If I run it on the GSE59867 dataset obtained from GEO, everything works.
I define a batch as the data collection date: I put all the CEL files having the same date into a specific folder, and then consider that date/folder as a specific batch.
On the GSE59867 dataset, a batch/folder contains only 1 CEL file. Nonetheless, the rma() function works on it perfectly.
But if I try to run my script on another dataset (GSE36809), I run into trouble: when I apply the rma() function to a batch/folder containing only 1 file, I get the following error:
Error in `colnames<-`(`*tmp*`, value = "GSM901376_c23583161.CEL.gz") :
attempt to set 'colnames' on an object with less than two dimensions
Here's my specific R code, to let you understand.
You first have to download the file GSM901376_c23583161.CEL.gz:
setwd(".")
options(stringsAsFactors = FALSE)
fileURL <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM901nnn/GSM901376/suppl/GSM901376%5Fc23583161%2ECEL%2Egz"
fileDownloadCommand <- paste("wget ", fileURL, " ", sep="")
system(fileDownloadCommand)
Library installation:
source("https://bioconductor.org/biocLite.R")
list.of.packages <- c("easypackages")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
listOfBiocPackages <- c("oligo", "affyio","BiocParallel")
bioCpackagesNotInstalled <- which( !listOfBiocPackages %in% rownames(installed.packages()) )
cat("package missing listOfBiocPackages[", bioCpackagesNotInstalled, "]: ", listOfBiocPackages[bioCpackagesNotInstalled], "\n", sep="")
if( length(bioCpackagesNotInstalled) ) {
  biocLite(listOfBiocPackages[bioCpackagesNotInstalled])
}
library("easypackages")
libraries(list.of.packages)
libraries(listOfBiocPackages)
Application of rma()
thisFileDate <- "GSM901376_c23583161.CEL.gz"
thisDateRawData <- read.celfiles(thisFileDate)
thisDateNormData <- rma(thisDateRawData)
After the call to rma(), I get the error.
How can I solve this problem?
I also tried to skip this normalization by saving the thisDateRawData object directly. But then I have the problem that I cannot combine this thisDateRawData (which is an ExpressionFeatureSet) with the outputs of rma() (which are ExpressionSet objects).
(EDIT: I extensively edited the question, and added a piece of R code you should be able to run on your pc.)
Hmm. This is a puzzling problem. The oligo::rma() function might be buggy for class GeneFeatureSet with single samples. I got it to work with a single sample by using lower-level functions, but it means I also had to create the expression set from scratch by specifying the slots:
# source("https://bioconductor.org/biocLite.R")
# biocLite("GEOquery")
# biocLite("pd.hg.u133.plus.2")
# biocLite("pd.hugene.1.0.st.v1")
library(GEOquery)
library(oligo)
# # Instead of using .gz files, I extracted the actual CELs.
# # This is just to illustrate how I read in the files; your usage will differ.
# projectDir <- "" # Path to .tar files here
# setwd(projectDir)
# untar("GSE36809_RAW.tar", exdir = "GSE36809")
# untar("GSE59867_RAW.tar", exdir = "GSE59867")
# setwd("GSE36809"); gse3_cels <- dir()
# sapply(paste(gse3_cels, sep = "/"), gunzip); setwd(projectDir)
# setwd("GSE59867"); gse5_cels <- dir()
# sapply(paste(gse5_cels, sep = "/"), gunzip); setwd(projectDir)
#
# Read in CEL
#
# setwd("GSE36809"); gse3_cels <- dir()
# gse3_efs <- read.celfiles(gse3_cels[1])
# # Assuming you've read in the CEL files as a GeneFeatureSet or
# # ExpressionFeatureSet object (i.e. gse3_efs in this example),
# # you can now fit the RMA and create an ExpressionSet object with it:
exprsData <- basicRMA(exprs(gse3_efs), pnVec = featureNames(gse3_efs))
gse3_expset <- new("ExpressionSet")
slot(gse3_expset, "assayData") <- assayDataNew(exprs = exprsData)
slot(gse3_expset, "phenoData") <- phenoData(gse3_efs)
slot(gse3_expset, "featureData") <- annotatedDataFrameFrom(attr(gse3_expset, 'assayData'),
                                                           byrow = TRUE)
slot(gse3_expset, "protocolData") <- protocolData(gse3_efs)
slot(gse3_expset, "annotation") <- slot(gse3_efs, "annotation")
Hopefully the above approach will work in your code.
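As a quick sanity check (hedged; assumes gse3_expset was built exactly as above), the result should now behave like any other ExpressionSet and can be combined with the rma() output of your multi-sample batches:
class(gse3_expset)        # should be "ExpressionSet"
dim(exprs(gse3_expset))   # number of probesets x 1 sample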

Importing data into R (rdata) from Github

I want to put some R code plus the associated data file (RData) on Github.
So far, everything works okay. But when people clone the repository, I want them to be able to run the code immediately. At the moment this isn't possible, because they would have to change their working directory (setwd) to the directory that the RData file was cloned (i.e. downloaded) to.
Therefore, I thought it might be easier if I changed the R code so that it linked to the RData file on GitHub. But I cannot get this to work using the following snippet. I think perhaps there is some text/binary issue.
x <- RCurl::getURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
y <- load(x)
Any help would be appreciated.
Thanks
This works for me:
githubURL <- "https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData"
load(url(githubURL))
head(df)
# X Y Z
# 1 16602794 -4183983 94.92019
# 2 16602814 -4183983 91.15794
# 3 16602834 -4183983 87.44995
# 4 16602854 -4183983 83.79617
# 5 16602874 -4183983 80.19643
# 6 16602894 -4183983 76.65052
EDIT Response to OP comment.
From the documentation:
Note that the https:// URL scheme is not supported except on Windows.
So you could try this:
download.file(githubURL,"myfile")
load("myfile")
which works for me as well, but this will clutter your working directory. If that doesn't work, try setting method="curl" in the call to download.file(...).
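For completeness, a hedged sketch of that method = "curl" variant (mode = "wb" added to be safe with the binary RData file):
download.file(githubURL, destfile = "myfile", method = "curl", mode = "wb")
load("myfile")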
I've had trouble with this before as well, and the solution I've found to be the most reliable is to use a tiny modification of source_url from the fantastic devtools package [1]. This works for me (on a Mac).
load_url <- function(url, ..., sha1 = NULL) {
  # based very closely on code for devtools::source_url
  stopifnot(is.character(url), length(url) == 1)
  temp_file <- tempfile()
  on.exit(unlink(temp_file))
  request <- httr::GET(url)
  httr::stop_for_status(request)
  writeBin(httr::content(request, type = "raw"), temp_file)
  file_sha1 <- digest::digest(file = temp_file, algo = "sha1")
  if (is.null(sha1)) {
    message("SHA-1 hash of file is ", file_sha1)
  } else {
    if (nchar(sha1) < 6) {
      stop("Supplied SHA-1 hash is too short (must be at least 6 characters)")
    }
    file_sha1 <- substr(file_sha1, 1, nchar(sha1))
    if (!identical(file_sha1, sha1)) {
      stop("SHA-1 hash of downloaded file (", file_sha1,
           ")\n  does not match expected value (", sha1, ")", call. = FALSE)
    }
  }
  load(temp_file, envir = .GlobalEnv)
}
I use a very similar modification to get text files from github using read.table, etc. Note that you need to use the "raw" version of the github URL (which you included in your question).
[1] https://github.com/hadley/devtools
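For reference, a hedged sketch of that read.table variant (my own adaptation of the same pattern; not part of devtools):
read_table_url <- function(url, ...) {
  temp_file <- tempfile()
  on.exit(unlink(temp_file))
  request <- httr::GET(url)
  httr::stop_for_status(request)
  writeBin(httr::content(request, type = "raw"), temp_file)
  read.table(temp_file, ...)
}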
load() takes a filename, so write the downloaded content to a temporary file first:
x <- RCurl::getURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
writeLines(x, tmp <- tempfile())
y <- load(tmp)

Get function's title from documentation

I would like to get the title of a base function (e.g.: rnorm) in one of my scripts. That is included in the documentation, but I have no idea how to "grab" it.
I mean the line given in the RD files as \title{} or the top line in documentation.
Is there any simple way to do this without calling the Rd_db function from tools and parsing all Rd files, which has a very big overhead for this simple task? Another thing: I tried parse_Rd too, but:
I do not know which Rd file holds my function,
I have no Rd files on my system (just rdb, rdx and rds).
So a function to parse the (offline) documentation would be the best :)
POC demo:
> get.title("rnorm")
[1] "The Normal Distribution"
If you look at the code for help, you see that the function index.search seems to be what pulls in the location of the help files, and that the default for the associated find.package() argument is NULL. It turns out that there is neither a help page for that function nor is it exported, so I tested the usual suspects for which package it might be in (base, tools, utils) and ended up with utils:
utils:::index.search("+", find.package())
#[1] "/Library/Frameworks/R.framework/Resources/library/base/help/Arithmetic"
So:
ghelp <- utils:::index.search("+", find.package())
gsub("^.+/", "", ghelp)
#[1] "Arithmetic"
ghelp <- utils:::index.search("rnorm", find.package())
gsub("^.+/", "", ghelp)
#[1] "Normal"
What you are asking for is \title{Title}, but here I have shown you how to find the specific Rd file to parse, and it sounds as though you already know how to do that.
EDIT: @Hadley has provided a method for getting all of the help text once you know the package name, so applying that to the index.search() value above:
target <- gsub("^.+/library/(.+)/help.+$", "\\1",
               utils:::index.search("rnorm", find.package()))
doc.txt <- pkg_topic(target, "rnorm") # assuming both of Hadley's functions are here
print(doc.txt[[1]][[1]][1])
#[1] "The Normal Distribution"
It's not completely obvious what you want, but the code below will get the Rd data structure corresponding to the topic you're interested in; you can then manipulate that to extract whatever you want.
There may be simpler ways, but unfortunately very little of the needed code is exported and documented. I really wish there was a base help package.
pkg_topic <- function(package, topic, file = NULL) {
  # Find "file" name given topic name/alias
  if (is.null(file)) {
    topics <- pkg_topics_index(package)
    topic_page <- subset(topics, alias == topic, select = file)$file
    if (length(topic_page) < 1)
      topic_page <- subset(topics, file == topic, select = file)$file
    stopifnot(length(topic_page) >= 1)
    file <- topic_page[1]
  }
  rdb_path <- file.path(system.file("help", package = package), package)
  tools:::fetchRdDB(rdb_path, file)
}

pkg_topics_index <- function(package) {
  help_path <- system.file("help", package = package)
  file_path <- file.path(help_path, "AnIndex")
  if (length(readLines(file_path, n = 1)) < 1) {
    return(NULL)
  }
  topics <- read.table(file_path, sep = "\t",
                       stringsAsFactors = FALSE, comment.char = "",
                       quote = "", header = FALSE)
  names(topics) <- c("alias", "file")
  topics[complete.cases(topics), ]
}
