Snakemake syntax for multiple outputs with the use of checkpoint - pipeline

I'm using snakemake to build a pipeline. I have a checkpoint that should produce multiple output files. These output files are later used in my rule all within expand. The problem is that I don't know the number of files that will be produced, and therefore can't specify them in expand.
The files will be produced by an R script.
Example:
rule all:
    input:
        expand(["results/{output}"], output=????)
checkpoint rscript:
    input:
        "foo.input"
    output:
        report("somedir/{output}"),
    script:
        "../scripts/foo.R"
Of course this is only a small part, but I basically have a loop in my R script that outputs multiple files into somedir. Since I don't know how many there will be, and since they are only determined inside the R script, I can't set output in expand.
Maybe this is a really trivial question to some of you, or even a stupid one, and there are better ways to do this. If that's the case I'd still be thankful, because I had problems understanding most of the Snakemake functions, as my English comprehension is limited.
If there are more questions I'd gladly answer them. (The best case for me would be to give the output files names that I can determine at runtime within the R script.)
(I also can't aggregate the created files in another rule, because each file will show a different plot)
Edit: The main problem still seems to be that checkpoint rscript is not able to create multiple {output} files in "somedir/". The attempt with touch("rscript_finish.flag") seems to write only the SVG file as "rscript_finish.flag", or to overwrite "rscript_finish.flag" each time the loop in my R script writes into snakemake@output[[1]].

There are no stupid questions :). I hope I understood, and it was actually not a trivial question at all!
def all_input(wildcards):
    checkpoints.rscript.get()  # make sure that checkpoint rscript is executed
    filenames, = glob_wildcards("somedir/{filenames}.png")  # find all the output files of rscript
    return expand("somedir_cp/{fn}", fn=filenames)

rule all:
    input:
        all_input

rule add_to_report:
    input:
        "somedir/{filename}.png"
    output:
        report("somedir_cp/{filename}.png")
    shell:
        "cp {input} {output}"

checkpoint rscript:
    input:
        "foo.input"
    output:
        touch("rscript_finish.flag")
    script:
        "../scripts/foo.R"
I didn't really test the code, so I am not sure it works immediately, but I think the logic is correct.
The way this needs to be solved is with an extra rule, which I called add_to_report. All this rule does is make a copy of one existing output file of rscript and add it to the report. The input function of rule all first forces the execution of checkpoint rscript. Once that has run, it finds all the files the checkpoint generated. It then declares that rule all needs the copy of each of those files as input; those copies are made by rule add_to_report, and thus the files are added to the report.
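For reference, the R side could then look roughly like the sketch below. This is only a guess at what foo.R might do, assuming foo.input can be read with readRDS and that each element of the result becomes one PNG; the file names and plotting calls are placeholders, not the asker's actual code:

# ../scripts/foo.R -- hypothetical sketch, not the asker's actual script
# Snakemake exposes an S4 object called `snakemake` to R scripts.
input_file <- snakemake@input[[1]]        # "foo.input"
results <- readRDS(input_file)            # placeholder: load your data however you normally do

dir.create("somedir", showWarnings = FALSE)

for (name in names(results)) {
  # one plot per element; how many there are is only known at run time
  png(file.path("somedir", paste0(name, ".png")))
  plot(results[[name]])
  dev.off()
}
# rscript_finish.flag is created by touch() in the checkpoint, not inside this script.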

Related

Do .Rout files preserve the R working environment?

I recently started looking into Makefiles to keep track of the scripts inside my research project. To really understand what is going on, I would like to understand the contents of .Rout files produced by R CMD BATCH a little better.
Christopher Gandrud is using a Makefile for his book Reproducible research with R and RStudio. The sample project (https://github.com/christophergandrud/rep-res-book-v3-examples/tree/master/data) has only three .R files: two of them download and clean data, the third one merges both datasets. They are invoked by the following lines of the Makefile:
# Key variables to define
RDIR = .

# Run the RSOURCE files
$(RDIR)/%.Rout: $(RDIR)/%.R
	R CMD BATCH $<
Neither of the first two files outputs data, nor does the merge script explicitly import data; it just uses the objects created in the first two scripts. So how is the data preserved between the scripts?
To me it seems like the batch execution happens within the same R environment, preserving both objects and loaded packages. Is this really the case? And is it the .Rout file that transfers the objects from one script to the other or is it a property of the batch execution itself?
If the working environment is really preserved between the scripts, I see a lot of potential for issues if there are objects with the same names or functions with the same names from different packages. Another issue of this setup seems to be that the Makefile cannot propagate changes in the first two files downstream because there is no explicit input/prerequisite for the merge script.
I would appreciate learning whether my intuition is right and whether there are better ways to execute R files in a Makefile.
By default R CMD BATCH will save your workspace to a hidden .RData file after running, unless you pass --no-save. That's why it's not really the recommended way to run an R script. The recommended way is Rscript, which does not save by default; you must write code that explicitly saves things if that's what you want. This is different from the .Rout file, which should only contain the output from the commands run in the script.
So in this case, execution doesn't happen in the exact same environment: R is still started three times, but the workspace is serialized to .RData and reloaded between runs.
You are correct that there can be a lot of problems with saving and reloading workspaces by default. That's why most people recommend against it. But in this case, the author just figured it made things easier for their workflow, so they used it. It would be better to be more explicit about input and output files in general, though.
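If you want to keep the Makefile but make the data flow explicit instead of relying on the saved workspace, the idea looks roughly like this; the object and file names are illustrative, not taken from the book's scripts:

# in each cleaning script: save its result explicitly
cleaned1 <- data.frame(x = 1:3)        # placeholder for the real cleaning code
saveRDS(cleaned1, "cleaned1.rds")

# in the merge script: read the inputs explicitly instead of relying on leftover objects
cleaned1 <- readRDS("cleaned1.rds")
cleaned2 <- readRDS("cleaned2.rds")
merged <- merge(cleaned1, cleaned2)    # placeholder merge
saveRDS(merged, "merged.rds")

With explicit files like these, the merge target can also list them as prerequisites in the Makefile, so changes in the cleaning scripts propagate downstream.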

Very simple question on Console vs Script in R

I have just started learning to code in R, so I apologize for the very simple question. I understand it is best to type your code as a script so you can edit and save it. However, when I try to make an object in the script pane, it does not work. If I make an object in the console, R saves the object and it appears in my environment. I am typing a very simple line of code for a quick exercise on rolling dice:
die <- 1:6
But it only works in the console and not when typed as a script. Any help/explanation appreciated!
Essentially, you interact with the R environment differently when running an .R script via Rscript.exe than via the console with R.exe, Rterm, etc., or in GUI IDEs like RGui or RStudio. (This applies to any programming language with an interactive interpreter, not just R.)
The script does save the die object in the R environment, but only for the run or lifetime of that script (i.e., from its first to its last line of code). Your line is simply the assignment of an object; you do nothing with it. Apply some function, output results, or perform other actions in that script to see anything.
On the console, the R environment persists interactively until you quit it with q(), so assigned objects remain for the lifetime of your console session. After assigning, you can apply functions, output results, or perform other actions in line-by-line calls.
Ultimately, a script gathers all the line-by-line code in advance so it can run automatically without relying on the user to supply lines. Imagine running 1,000 lines of code with nested if/else or for/while loops and apply functions on the console! So have all your R coding needs handled in scripts.
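A minimal illustration (the script name is made up): in a script you have to do something visible with the object, because the script's environment disappears as soon as Rscript finishes.

# dice.R -- run with: Rscript dice.R
die <- 1:6                               # the assignment alone prints nothing
roll <- sample(die, 2, replace = TRUE)
print(roll)                              # explicitly print so the result shows up
cat("sum of the roll:", sum(roll), "\n")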
It is always better to have a script; as you say, you can save, edit, and correct it without having to rewrite the code to change a variable or number.
I recommend using RStudio; it is very practical, will help you program more efficiently, and lets you see, among other things, the different objects that you have created.

Is there a way to run julia script with arguments from REPL?

I can run a Julia script with arguments from PowerShell as > julia test.jl 'a' 'b'. I can run a script from the REPL with include("test.jl"), but include accepts just one argument: the path to the script.
From playing around with include it seems that it runs a script as a code block, with all the variables referencing the current(?) scope, so if I explicitly redefine the ARGS variable in the REPL the script picks it up and displays the corresponding results:
>ARGS="c","d"
>include("test.jl") # prints its arguments in a loop
c
d
This, however, gives a warning for redefining ARGS and doesn't seem to be the intended way of doing it. Is there another way to run a script from the REPL (or from another script) while stating its arguments explicitly?
You probably don't want to run a self-contained script by include-ing it. There are two options:
If the script isn't in your control and calling it from the command line is the canonical interface, just call it in a separate Julia process: run(`$JULIA_HOME/julia path/to/script.jl arg1 arg2`). See running external commands for more details.
If you have control over the script, it'd probably make more sense to split it into two parts: a library-like file that just defines Julia functions (but doesn't run any analyses), and a command-line file that parses the arguments and calls the functions defined by the library. Both the command-line interface and the script you're writing now can include the library; or, better yet, make the library-like file a full-fledged package.
This solution is not clean, nor the Julia way of doing things. But if you insist:
To avoid the warning when messing with ARGS, keep the original ARGS but mutate its contents, like the following:
empty!(ARGS)
push!(ARGS,"argument1")
push!(ARGS,"argument2")
include("file.jl")
This question is also a duplicate of, or related to, juliapassing-argument-to-the-includefile-jl, as @AlexanderMorley pointed out.
Not sure if it helps, but it took me a while to figure this out:
On your path "C:\Users\\.julia\config\" there may be a .jl file called startup.jl
The trick is that the Julia setup will not always create this. So, if neither the directory nor the .jl file exists, create them.
Julia will treat this .jl file as a list of commands to be executed every time you run the REPL. It is very handy for setting the directory of your projects (e.g. the folder of C:\MyJuliaProject\MyJuliaScript.jl, using cd("")) and for loading frequently used libraries (like using Pkg, using LinearAlgebra, etc.).
I wanted to share this as I didn't find anyone explicitly saying that this directory might not exist in your Julia installation. It took me longer than it should have to figure this out.

How to continuously restart/loop R script

I want an R script to continuously run and check for files in a folder and do something with those files.
The code simply checks for a file, then moves the file somewhere else, renames it, and deletes the old file (in reality it's a bit more elaborate than this).
If I run the script it works fine; however, I want R to detect the files automatically. In other words, is there a way to have R run the script continuously so that I don't have to run it manually whenever I put files in that folder?
In pure R you just need an infinite repeat loop...
repeat {
  print('Checking files')
  # Your code to do file manipulation
  Sys.sleep(time = 5)  # to stop execution for 5 sec
}
However there may be better tools suitable to do this kind of file manipulation depending on your OS.
You can use the function tclTaskSchedule from the tcltk2 package to schedule a function or expression to run at a regular interval. You can have multiple such tasks scheduled and still work in the R session (just be careful not to modify something that the scheduled task could also modify, or you can get unpredictable results).
Though an OS-based solution that runs a given Rscript may still be a better approach.
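A minimal sketch of that approach; the 5-second interval, the folder name, and the body of check_files() are placeholders, and the exact arguments are documented in ?tclTaskSchedule:

library(tcltk2)

check_files <- function() {
  files <- list.files("incoming", full.names = TRUE)
  if (length(files) > 0) {
    message("Found ", length(files), " file(s)")
    # placeholder: move/rename/process the files here
  }
}

# run check_files() every 5000 ms, repeatedly, while the session stays usable
tclTaskSchedule(5000, check_files(), id = "fileWatcher", redo = TRUE)

# stop it again with:
# tclTaskDelete("fileWatcher")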

Switch R script from non-interactive to interactive

I have an R script that takes command-line arguments, where the top line is:
#!/usr/bin/Rscript --slave
I wanted to interrupt execution in a function (so I can interactively use the data variables that have been loaded by that point to work out the next bit of code I need to write). I added this inside the function in question:
browser()
but it gets ignored. A bit of searching suggests it might be because the program is running in non-interactive mode, but even more searching has not turned up how to switch the script out of non-interactive mode so that browser() will work. Something like a browser_yes_I_really_mean_it() function.
P.S. I want to avoid altering the rest of the script if at all possible. My current approach is to copy and paste the code chunks, needed to prepare the data, into an interactive session; but as the script gets more and more complex this is getting more and more unreasonable.
UPDATE: for anyone else with the same question, it appears the answer to the actual question is that it is impossible. Once you start R in a non-interactive mode the die is cast. The given answers are therefore workarounds: either you hack your code (remembering to unhack it afterwards), or you refactor to make debugging easier. (This comment is not intended as a criticism of the answers; the suggested refactoring makes the code cleaner anyway.)
Can you just fire up R and source the file instead?
R
source("script.R")
Following mdsumner's answer, I edited my script like this:
if(!exists("argv")){
argv=commandArgs(TRUE)
if(length(argv)!=4)usage_and_exit()
}else{
if(length(argv)!=4){
stop("Must set argv as a 4 element vector. E.g. argv=c(...)")
}
}
Then no other change was needed, and I was able to do:
R
> argv=c('a','b','c','d')
> source("script.R")
In addition to the previous answer, I'd create a top-level function (e.g. doStuff) that performs the analysis you want to run in batch. The function takes the command-line options as input. In the batch script you source the file that contains this function and call it. This way you can easily run the function in interactive mode and use e.g. browser().
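A sketch of that structure; doStuff and the argument handling are illustrative (usage_and_exit is the helper from the question):

# script.R
doStuff <- function(argv) {
  if (length(argv) != 4) usage_and_exit()
  # ... load the data and run the analysis; a browser() call in here works
  #     whenever doStuff() is called from an interactive session
}

# only run automatically when executed via Rscript, not when source()d
if (!interactive()) {
  doStuff(commandArgs(trailingOnly = TRUE))
}

From an interactive session you can then source("script.R") and call doStuff(c('a','b','c','d')) yourself, and browser() will drop you into the debugger.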
In some cases the suggested solution (workaround) may not work, for example when the R code needs to be run as part of an existing bash script. For those cases I suggest writing your R code into the bash script using a here document:
#!/bin/bash
R --interactive << EOT
# R code starts here
argv=c('a','b','c','d')
print(interactive())
# Rest of script contents
quit("no")
# R code ends here
EOT
This way, print(interactive()) above will yield TRUE.
Side note: make sure to avoid the $ character in your R code, as bash will try to expand it inside the unquoted here document; for example, retrieve a column from a data.frame by using df[["X1"]] instead of df$X1.
