Using code_to_plan and target(..., format = "fst") in drake

I really like using the code_to_plan() function when constructing drake plans. I also really like using target(..., format = "fst") for big files. However, I am struggling to combine these two workflows. For example, if I have this _drake.R file:
# Data --------------------------------------------------------------------
data_plan = code_to_plan("code/01-data/data.R")
join_plan = code_to_plan("code/01-data/merging.R")
# Cleaning ----------------------------------------------------------------
cleaning_plan = code_to_plan("code/02-cleaning/remove_na.R")
# Model -------------------------------------------------------------------
model_plan = code_to_plan("code/03-model/model.R")
# Combine Plans
dplan = bind_plans(
  data_plan,
  join_plan,
  cleaning_plan,
  model_plan
)
config <- drake_config(dplan)
This works fine when called with r_make(r_args = list(show = TRUE))
As I understand it, though, target() can only be used within drake_plan(). If I try something like this:
dplan2 <- drake_plan(full_plan = target(dplan, format = "fst"))
config <- drake_config(dplan2)
I get an r_make error like this:
target full_plan
Error in fst::write_fst(x = value$value, path = tmp) :
Unknown type found in column.
In addition: Warning message:
You selected fst format for target full_plan, so drake will convert it from class c("drake_plan", "tbl_df", "tbl", "data.frame") to a plain data frame.
Error:
-->
in process 18712
See .Last.error.trace for a stack trace.
So ultimately my question is: where does one specify special data formats for targets when using code_to_plan?
Edit
Using @landau's helpful suggestion, I defined this function:
add_target_format <- function(plan) {
  # Get a list of named commands.
  commands <- plan$command
  names(commands) <- plan$target
  # Turn it into a good plan.
  do.call(drake_plan, commands)
}
So that this would work:
dplan = bind_plans(
  data_plan,
  join_plan,
  cleaning_plan,
  model_plan
) %>%
  add_target_format()

It is possible, but not convenient. Here is a workaround.
writeLines(
  c(
    "x <- small_data()",
    "y <- target(large_data(), format = \"fst\")"
  ),
  "script.R"
)
cat(readLines("script.R"), sep = "\n")
#> x <- small_data()
#> y <- target(large_data(), format = "fst")
library(drake)
# Produces a plan, but does not process target().
bad_plan <- code_to_plan("script.R")
bad_plan
#> # A tibble: 2 x 2
#>   target command
#>   <chr>  <expr>
#> 1 x      small_data()
#> 2 y      target(large_data(), format = "fst")
# Get a list of named commands.
commands <- bad_plan$command
names(commands) <- bad_plan$target
# Turn it into a good plan.
good_plan <- do.call(drake_plan, commands)
good_plan
#> # A tibble: 2 x 3
#>   target command      format
#>   <chr>  <expr>       <chr>
#> 1 x      small_data() <NA>
#> 2 y      large_data() fst
Created on 2019-12-18 by the reprex package (v0.3.0)

Related

sparklyr :: Error reading parquet file using sparklyr library in R

I am trying to read a parquet file from the Databricks FileStore.
library(sparklyr)

# parquet_dir has been pre-defined
parquet_dir = '/dbfs/FileStore/test/flc_next.parquet'

# List the files in the parquet dir
filenames <- dir(parquet_dir, full.names = TRUE)
[1] "/dbfs/FileStore/test/flc_next.parquet/_committed_6244562942368589642"
[2] "/dbfs/FileStore/test/flc_next.parquet/_started_6244562942368589642"
[3] "/dbfs/FileStore/test/flc_next.parquet/_SUCCESS"
[4] "/dbfs/FileStore/test/flc_next.parquet/part-00000-tid-6244562942368589642-0edceedf-7157-4cce-a084-0f2a4a6769e6-925-1-c000.snappy.parquet"
# Show the filenames and their sizes
data_frame(
  filename = basename(filenames),
  size_bytes = file.size(filenames)
)
Warning: `data_frame()` was deprecated in tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
# A tibble: 4 × 2
filename size_bytes
<chr> <dbl>
1 _committed_6244562942368589642 124
2 _started_6244562942368589642 0
3 _SUCCESS 0
4 part-00000-tid-6244562942368589642-0edceedf-7157-4cce-a084-0f2a4a6… 248643
# Import the data into Spark
timbre_tbl <- spark_read_parquet("flc_next.parquet", parquet_dir)
Error : $ operator is invalid for atomic vectors
I would appreciate any help or suggestions. Thanks in advance.
The first argument of spark_read_parquet() expects a Spark connection; see sparklyr::spark_connect(). If you are running the code in Databricks, then this should work:
sc <- spark_connect(method = "databricks")
timbre_tbl <- spark_read_parquet(sc, "flc_next.parquet", parquet_dir)
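To confirm the read worked, you could then preview a few rows and count them (a small illustrative check; it assumes dplyr is attached alongside sparklyr):
library(dplyr)

# Pull the first few rows of the Spark table back into R, then count all rows
timbre_tbl %>% head(5) %>% collect()
sdf_nrow(timbre_tbl)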

How can I input a single additional parameter to disk.frame's inmapfn at readin?

According to the article https://diskframe.com/articles/ingesting-data.html, a good use case for inmapfn as part of csv_to_disk.frame(...) is date conversion. In my data I only know the name of the date column at runtime, and I would like to feed that name into a function that converts the column at read-in time. One issue I am having is that it doesn't seem that any additional parameters can be passed to the inmapfn argument beyond the chunk itself. I can't hardcode the column name because it isn't known until runtime.
To clarify: inmapfn seems to run in its own environment, presumably to prevent data races and other parallelisation issues, but I know the variable won't be changed, so I am hoping there is some way to override this since I can make sure it is safe.
I know the function I am calling works when called on an arbitrary dataframe.
I have provided a reproducible example below.
library(tidyverse)
library(disk.frame)
setup_disk.frame()
a <- tribble(
  ~dates, ~val,
  "09feb2021", 2,
  "21feb2012", 2,
  "09mar2013", 3,
  "20apr2021", 4
)
write_csv(a, "a.csv")
dates_col <- "dates"
tmp.df <- csv_to_disk.frame(
  "a.csv",
  outdir = file.path(tempdir(), "tmp.df"),
  in_chunk_size = 1L,
  inmapfn = function(chunk) {
    chunk[, sdate := as.Date(do.call(`$`, list(chunk, dates_col)), "%d%b%Y")]
  }
)
#> -----------------------------------------------------
#> Stage 1 of 2: splitting the file a.csv into smallers files:
#> Destination: C:\Users\joelk\AppData\Local\Temp\RtmpcFBBkr\file4a1876e87bf5
#> -----------------------------------------------------
#> Stage 1 of 2 took: 0.020s elapsed (0.000s cpu)
#> -----------------------------------------------------
#> Stage 2 of 2: Converting the smaller files into disk.frame
#> -----------------------------------------------------
#> csv_to_disk.frame: Reading multiple input files.
#> Please use `colClasses = ` to set column types to minimize the chance of a failed read
#> =================================================
#>
#> -----------------------------------------------------
#> -- Converting CSVs to disk.frame -- Stage 1 of 2:
#>
#> Converting 5 CSVs to 6 disk.frames each consisting of 6 chunks
#>
#> Error in do.call(`$`, list(chunk, dates_col)): object 'dates_col' not found
You can experiment with different backend and chunk_reader arguments. For example, if you set the backend to readr, the inmapfn user-defined function will have access to previously defined variables. Furthermore, readr will do column type guessing and will automatically impute Date-type columns if it recognizes the string format as a date (in your example data it wouldn't recognize that as a date type, however). A rough sketch of that approach follows.
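A minimal sketch of the readr-backend idea (untested here; it assumes csv_to_disk.frame() accepts backend = "readr" and that the chunk then arrives as a plain data frame rather than a data.table):
tmp.readr.df <- csv_to_disk.frame(
  "a.csv",
  outdir = file.path(tempdir(), "tmp_readr.df"),
  backend = "readr",
  inmapfn = function(chunk) {
    # dates_col, defined in the calling environment, should be visible here
    chunk[[dates_col]] <- as.Date(chunk[[dates_col]], "%d%b%Y")
    chunk
  }
)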
If you don't want to use the readr backend for performance reasons, then I would ask whether your example correctly represents your actual scenario. I'm not seeing the need to pass in the date column as a variable in the example you provided.
There is a working solution in the Just-in-time transformation section of the link you provided, and I'm not seeing any added complexity between that example and yours.
If you really need to use the default backend and chunk_reader plan AND you really need to send the inmapfn function a previously defined variable, you can wrap the csv_to_disk.frame call in a wrapper function:
library(disk.frame)
setup_disk.frame()
df <- tribble(
  ~dates, ~val,
  "09feb2021", 2,
  "21feb2012", 2,
  "09mar2013", 3,
  "20apr2021", 4
)
write.csv(df, file.path(tempdir(), "df.csv"), row.names = FALSE)
wrap_csv_to_disk <- function(col) {
  my_date_col <- col
  csv_to_disk.frame(
    file.path(tempdir(), "df.csv"),
    in_chunk_size = 1L,
    inmapfn = function(chunk, dates = my_date_col) {
      chunk[, dates] <- lubridate::dmy(chunk[[dates]])
      chunk
    })
}
date_col <- "dates"
df_disk_frame <- wrap_csv_to_disk(date_col)
#> str(collect(df_disk_frame)$dates)
# Date[1:4], format: "2021-02-09" "2012-02-21" "2013-03-09" "2021-04-20"
I see. For a workaround, would it be possible to do something like this?
date_var = known_at_runtime()
saveRDS(date_var, "some/path/date_var.rds")
a = csv_to_disk.frame(files, inmapfn = function(chunk) {
  date_var = readRDS("some/path/date_var.rds")
  # do the rest
})
I think letting inmapfn have other options is doable; see https://github.com/xiaodaigh/disk.frame/issues/377 for tracking.

pull all elements with specific name from a nested list

I have some archived Slack data from which I am trying to get some key message properties. I'd done this by stupidly flattening the entire list, getting a data.frame or tibble with lists nested in some cells. As this dataset gets bigger, I want to pick elements out of the list more smartly, so that when the cache becomes large it doesn't take forever to create the data.frame or tibble with the elements I want.
Below is an example where I am trying to pull everything named "type" into a vector or flat list that I can use as a data frame variable. I named the folder and message levels for convenience. Does anyone have model code that can help?
library(tidyverse)
l <- list(folder_1 = list(
  `msg_1-1` = list(type = "message",
                   subtype = "channel_join",
                   ts = "1585771048.000200",
                   user = "UFUNNF8MA",
                   text = "<#UFUNNF8MA> has joined the channel"),
  `msg_1-2` = list(type = "message",
                   subtype = "channel_purpose",
                   ts = "1585771049.000300",
                   user = "UNFUNQ8MA",
                   text = "<#UNFUNQ8MA> set the channel purpose: Talk about xyz")),
  folder_2 = list(
    `msg_2-1` = list(type = "message",
                     subtype = "channel_join",
                     ts = "1585771120.000200",
                     user = "UQKUNF8MA",
                     text = "<#UQKUNF8MA> has joined the channel"))
)
# gets a specific element
print(l[[1]][[1]][["type"]])
# tried to get all elements named "type", but am not at the right list level to do so
print(purrr::map(l, "type"))
As OP mentioned, this can solve the issue:
#Code
unlist(l)[grepl('.type',names(unlist(l)),fixed=T)]
Output:
folder_1.msg_1-1.type folder_1.msg_1-2.type folder_2.msg_2-1.type
"message" "message" "message"
Another option (many thanks and credit to @Abdessabour Mtk):
#Code1
purrr::map(l, ~ purrr::map(.x, "type"))
Depending on the desired output, I would probably use a simple recursive function here.
get_elements <- function(x, element) {
  if(is.list(x))
  {
    if(element %in% names(x)) x[[element]]
    else lapply(x, get_elements, element = element)
  }
}
This allows:
get_elements(l, "type")
#> $folder_1
#> $folder_1$`msg_1-1`
#> [1] "message"
#>
#> $folder_1$`msg_1-2`
#> [1] "message"
#>
#>
#> $folder_2
#> $folder_2$`msg_2-1`
#> [1] "message"
Or if you want to get all "users":
get_elements(l, "user")
#> $folder_1
#> $folder_1$`msg_1-1`
#> [1] "UFUNNF8MA"
#>
#> $folder_1$`msg_1-2`
#> [1] "UNFUNQ8MA"
#>
#>
#> $folder_2
#> $folder_2$`msg_2-1`
#> [1] "UQKUNF8MA"
You could obviously unlist the result if you prefer it flattened into a vector.
unlist(get_elements(l, "type"))
#> folder_1.msg_1-1 folder_1.msg_1-2 folder_2.msg_2-1
#> "message" "message" "message"
Another option is to use rrapply() in the rrapply-package:
library(rrapply)
## return unlisted vector
rrapply(l, condition = function(x, .xname) .xname == "type", how = "unlist")
#> folder_1.msg_1-1.type folder_1.msg_1-2.type folder_2.msg_2-1.type
#> "message" "message" "message"
## return melted data.frame
rrapply(l, condition = function(x, .xname) .xname == "type", how = "melt")
#> L1 L2 L3 value
#> 1 folder_1 msg_1-1 type message
#> 2 folder_1 msg_1-2 type message
#> 3 folder_2 msg_2-1 type message
Related to the answers provided by @Duck & @Abdessabour Mtk yesterday, purrr has a function map_depth() that will let you get a named element if you know its name and how deep it sits in the hierarchy. It is really useful when crawling big nested lists like this, and it is a simpler alternative to the nested map() calls above.
purrr::map_depth(l, 2, "type")
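If you prefer the same flat named vector the other answers produce, you can unlist the result (a small follow-up, not part of the original answer):
# Expected shape: a named character vector such as
# folder_1.msg_1-1 folder_1.msg_1-2 folder_2.msg_2-1
#        "message"        "message"        "message"
unlist(purrr::map_depth(l, 2, "type"))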

Plot the longest transcript in GenomicRanges with ggbio

I am trying to plot a specific region using ggbio. I am using the code below, which produces my desired output, except that it contains several transcripts. Is it possible to only plot the longest transcript? I've not been able to access the genomic ranges object within Homo.sapiens that I assume contains this information.
library(ggbio)
library(Homo.sapiens)
range <- GRanges("chr10" , IRanges(start = 78000000 , end = 79000000))
p.txdb <- autoplot(Homo.sapiens, which = range)
p.txdb
Here is a solution that involves filtering TxDb.Hsapiens.UCSC.hg19.knownGene on the longest transcript by gene_id (which does remove genes without gene_id):
suppressPackageStartupMessages({
  invisible(lapply(c("ggbio", "biovizBase", "data.table",
                     "TxDb.Hsapiens.UCSC.hg19.knownGene",
                     "org.Hs.eg.db"),
                   require, character.only = TRUE))})
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# retrieve transcript lengths
txlen <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)
setDT(txlen)
txlen$len <- rowSums(as.matrix(txlen[, .(tx_len, utr5_len, utr3_len)]))
setkey(txlen, gene_id, len, tx_id)
# filter longest transcript by gene_id
ltx <- txlen[!is.na(gene_id)][, tail(.SD,1), by=gene_id]$tx_id
# filter txdb object
txb <- as.list(txdb)
txb$transcripts <- txb$transcripts[txb$transcripts$tx_id %in% ltx, ]
txb$splicings <- txb$splicings[txb$splicings$tx_id %in% ltx,]
txb$genes <- txb$genes[txb$genes$tx_id %in% ltx,]
txb <- do.call(makeTxDb, txb)
# plot according to vignette, chapter 2.2.5
range <- GRanges("chr10", IRanges(start = 78000000 , end = 79000000))
gr.txdb <- crunch(txb, which = range)
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
colnames(values(gr.txdb))[4] <- "model"
grl <- split(gr.txdb, gr.txdb$gene_id)
symbols <- select(org.Hs.eg.db, keys=names(grl), columns="SYMBOL", keytype="ENTREZID")
#> 'select()' returned 1:1 mapping between keys and columns
names(grl) <- symbols[match(symbols$ENTREZID, names(grl), nomatch=0),"SYMBOL"]
autoplot(grl, aes(type = "model"), gap.geom="chevron")
#> Constructing graphics...
Created on 2020-05-29 by the reprex package (v0.3.0)
Edit:
To get gene symbols instead of gene (or transcript) ids, just replace the names of grl with the associated gene symbols, e.g. via org.Hs.eg.db, or any other resource that matches them up.
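For example, that renaming step in isolation might look like this (a sketch reusing the grl object and org.Hs.eg.db from the answer above):
# Map the ENTREZ gene ids currently used as names of grl to gene symbols
symbols <- AnnotationDbi::select(org.Hs.eg.db,
                                 keys = names(grl),
                                 columns = "SYMBOL",
                                 keytype = "ENTREZID")
names(grl) <- symbols$SYMBOL[match(names(grl), symbols$ENTREZID)]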

Printing a single tibble row over multiple lines of text

It is sometimes desirable to print a string in a tibble over multiple lines. Example: https://github.com/ropensci/drake/issues/489. drake plans with long commands are hard to read.
library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)
#> # A tibble: 2 x 2
#>   target     command
#> * <chr>      <chr>
#> 1 u_auckland "make_place(Name = \"University of Auckland\", Latitude = -3…
#> 2 shapefile  "{\n file_out(\"u-auckland.prj\", \"u-auckland.shx\", \"u…
Can pillar::pillar_shaft() or a similar tool achieve something nicer? I am mainly concerned with line breaks and indentation (possibly with styler), but I am also interested in syntax highlighting, possibly with highlight and crayon.
# A tibble: 2 x 2
  target     command
* <chr>      <drake_cmd>
1 u_auckland make_place(
               Name = "University of Auckland",
               Latitude = -36.8521369,
               Longitude = 174.7688785
             )
2 shapefile  {
               file_out(
                 "u-auckland.prj",
                 "u-auckland.shx",
                 "u-auckland.dbf"
               )
               st_write(
                 obj = u_auckland,
                 dsn = file_out("u-auckland.shp"),
                 driver = "ESRI Shapefile",
                 delete_dsn = TRUE
               )
             }
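As a stopgap outside of tibble printing (this is not the tibble-printing solution asked about, just a sketch that assumes the command column holds character strings, as in the output above), cat() will honour the stored line breaks when you pull a single command out of the plan:
plan <- drake_plan(
  shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)
# cat() prints the "\n" characters stored in the command string literally,
# so the command reads roughly the way it was written in drake_plan()
cat(plan$command[plan$target == "shapefile"], "\n")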
So, apparently this is currently not possible as such, but a great workaround was suggested for my original use case.
