Use wildcard to capture different datasets in Snakemake

I would like to use wildcards in Snakemake in a very simple way to start a script for two datasets. Unfortunately, I cannot find the proper way of doing it.
My data folder contains three files: gene_list.txt, expression_JGI.txt, expression_UBC.txt.
Here is what my snakefile looks like:
rule extract:
    input:
        genes="data/gene_list.txt",
        expression="data/expression_{dataset}.txt"
    output:
        "data/expression_{dataset}_subset.txt"
    shell:
        "bash scripts/extract.sh {input.genes} {input.expression} {output}"
When I use snakemake -c1 extract I get the following error message:
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).
I tried adding a rule all at the beginning of the snakefile with the desired result files as input without success:
rule all:
    input:
        "data/expression_JGI_subset.txt",
        "data/expression_UBC_subset.txt"
I also tried with expand:
DATASETS = ["JGI", "UBC"]

rule all:
    input:
        expand("data/expression_{dataset}_subset.txt", dataset=DATASETS)
But I get the same error message.
The script works fine when I use it outside Snakemake.
How can I achieve what I want?

When you run snakemake -c1 extract, you ask Snakemake to execute only rule extract and its dependencies, if any. However, because extract contains wildcards, Snakemake doesn't know what to replace them with. (Note that rule all is not a dependency of extract.)
So either execute snakemake -c1 to run the whole pipeline or specify the concrete files you want to generate, e.g.:
snakemake -c1 -- data/expression_JGI_subset.txt data/expression_UBC_subset.txt
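Putting it together, here is a minimal sketch of the whole Snakefile, built only from the code in the question, with rule all placed first so that a plain snakemake -c1 resolves the wildcard from the requested files:

DATASETS = ["JGI", "UBC"]

rule all:
    input:
        expand("data/expression_{dataset}_subset.txt", dataset=DATASETS)

rule extract:
    input:
        genes="data/gene_list.txt",
        expression="data/expression_{dataset}.txt"
    output:
        "data/expression_{dataset}_subset.txt"
    shell:
        "bash scripts/extract.sh {input.genes} {input.expression} {output}"

Because rule all comes first, it is the default target, and snakemake -c1 builds both subset files.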

Related

Snakemake: wildcards do not expand in script line of rule

I am running a pipeline and was trying to optimize it by declaring the paths in a config file (config.yaml). The config.yaml file contains the path to find the scripts to run inside the pipeline, but when I expand the wildcard of the path, the pipeline does not run the script. The script itself runs fine.
To explain my problem:
rule with_script:
    input: someinput
    output: someoutput
    script: expand("{script_path}/scriptfile", script_path=config[scriptpath])
Neither input, output nor rule all contains the script-path wildcard, so this is the first place I declare it. The config.yaml line that contains the path looks like this:
scriptpath: /path/to/the/script
Is there a way to keep the wildcard and the config-file path (to make it easier for others to make changes if needed) and have the script work? As written, Snakemake doesn't even enter the script file. Or is it perhaps possible to declare global wildcards outside rule all?
Thank you for your help!
P.S.: I'm sorry if this question has already been answered, but I couldn't find anything to help me with this.
You cannot call a function like expand() in the script section; Snakemake expects a plain path to your script.
As the documentation states:
The script path is always relative to the Snakefile containing the directive (in contrast to the input and output file paths, which are relative to the working directory). It is recommended to put all scripts into a subfolder "scripts"
If you need to define different paths to your scripts, you can always do it in Python outside of your rules. Don't forget that all Python code outside of rules is executed before the DAG is built, so you can define whatever variables you want and use them in your rules.
SCRIPTSPATH = config["scriptpath"]

rule with_script:
    input: someinput
    output: someoutput
    script: f"{SCRIPTSPATH}/scriptfile"
Note:
Do not mix up wildcards and "variables". In an expand() function such as
expand("{script_path}/scriptfile", script_path = config[scriptpath])
{script_path} is not a wildcard but just a placeholder for the values given in the second parameter of the function.
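To see why the original line fails, note that expand() simply fills the placeholder with the supplied values and returns a list of strings, whereas script: expects a single path. A quick illustration, using the path from the question's config.yaml:

expand("{script_path}/scriptfile", script_path="/path/to/the/script")
# returns ['/path/to/the/script/scriptfile'] -- a list, not a plain path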

Snakemake: specify a new wildcard in a new rule

I have input files:
Bob_1.fastq.gz
Bob_2.fastq.gz
Bob_3.fastq.gz
Bob_4.fastq.gz
Ron_1.fastq.gz
Ron_2.fastq.gz
Ron_3.fastq.gz
Ron_4.fastq.gz
I am running demultiplexing and trimming steps in one snakefile, like this:
workdir: "/path/to/dir/"

(SAMPLES,) = glob_wildcards("/path/to/dir/raw/{sample}.fastq.gz")

rule all:
    input:
        expand("demultiplex/{sample}.fastq.gz", sample=SAMPLES),
        expand("trimmed/{sample}.trimmed.fastq.gz", sample=SAMPLES)

rule sabre:
    input:
        infile="/path/to/dir/raw/{sample}.fastq.gz",
        barcodefile="files/{sample}.txt"
    output:
        unknownfile=temp("demultiplex/unknown_barcode_{sample}.fastq.gz"),
    shell:
        """
        /Tools/sabre-master2/sabre se -f {input.infile} -b {input.barcodefile} -u {output.unknownfile}
        """

rule trimmomatic_se:
    input:
        r="{sample}.fastq.gz"
    output:
        r="trimmed/{sample}.trimmed.fastq.gz"
    threads: 10
    shell:
        """java -jar /Tools/Trimmomatic-0.36/trimmomatic-0.36.jar SE -threads {threads} {input.r} {output.r} ILLUMINACLIP:/Tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"""
The demultiplex output files are like this:
Bob_1_CL1.fastq.gz.... Bob_1_CL345.fastq.gz
Bob_2_CL1.fastq.gz.... Bob_1_CL248.fastq.gz
Ron_1_dad1.fastq.gz... Ron_1_dad67.fastq.gz
and so on
So, if I do not specify the demultiplexed output files, the program creates them by itself. My problem is how to specify/introduce a new wildcard from the output of the previous rule into the next trimming step, since the wildcards now differ from the initial sample names.
Wildcards just need to be consistent in a rule, not across the workflow. The issue here is that you have a rule generating 'unknown' outputs that you need to process further. For that you need to use checkpoints.
Read through the second block of code in the Snakemake documentation on data-dependent conditional execution (the aggregating example). Your checkpoint will be the demultiplexing, and if you don't have any other steps, all will be your aggregate step that calls checkpoints.demultiplex.get. If you search for checkpoint on Stack Overflow you will find lots of examples; it's a hard feature to use at first!
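For concreteness, here is a minimal sketch of that pattern. It assumes sabre is pointed at a per-sample output directory (the directory layout, the trim rule, and the .done marker are illustrative, not from the question):

checkpoint demultiplex:
    input:
        infile="raw/{sample}.fastq.gz",
        barcodefile="files/{sample}.txt"
    output:
        directory("demultiplex/{sample}")
    # sabre writes the split files to the names listed in the barcode file,
    # so those names must point into the output directory
    shell:
        "mkdir -p {output} && "
        "/Tools/sabre-master2/sabre se -f {input.infile} -b {input.barcodefile} "
        "-u {output}/unknown_barcode.fastq.gz"

rule trim_unit:
    input:
        "demultiplex/{sample}/{unit}.fastq.gz"
    output:
        "trimmed/{sample}_{unit}.trimmed.fastq.gz"
    threads: 10
    shell:
        "java -jar /Tools/Trimmomatic-0.36/trimmomatic-0.36.jar SE -threads {threads} "
        "{input} {output} ILLUMINACLIP:/Tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 "
        "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"

def trimmed_per_sample(wildcards):
    # Pause DAG building until demultiplexing has run for this sample,
    # then glob the files it actually produced to build the trim targets
    # (filter out unknown_barcode here if you don't want it trimmed).
    ckpt_dir = checkpoints.demultiplex.get(sample=wildcards.sample).output[0]
    (units,) = glob_wildcards(ckpt_dir + "/{unit}.fastq.gz")
    return expand("trimmed/{sample}_{unit}.trimmed.fastq.gz",
                  sample=wildcards.sample, unit=units)

rule aggregate:
    input:
        trimmed_per_sample
    output:
        touch("trimmed/{sample}.done")

rule all would then request expand("trimmed/{sample}.done", sample=SAMPLES) instead of the per-file outputs.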

Snakemake: how to use one integer from list each call as input to script?

I'm trying to practice writing workflows in snakemake.
The contents of my Snakefile:
configfile: "config.yaml"

rule get_col:
    input:
        expand("data/{file}.csv", file=config["datname"])
    output:
        expand("output/{file}_col{param}.csv", file=config["datname"], param=config["cols"])
    params:
        col=config["cols"]
    script:
        "scripts/getCols.R"
The contents of config.yaml:
cols:
  [2,4]
datname:
  "GSE3790_expression_data"
My R script:
getCols = function(input, output, col) {
    dat = read.csv(input)
    dat = dat[, col]
    write.csv(dat, output, row.names=F)
}

getCols(snakemake@input[[1]], snakemake@output[[1]], snakemake@params[['col']])
It seems like both columns are being passed at once. What I'm trying to accomplish is one column from the list per output file.
Since the second output never gets a chance to be created (both columns are used to create the first output), Snakemake throws an error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 3 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Job completed successfully, but some output files are missing.
On a slightly unrelated note, I thought I could write the input as:
'"data/{file}.csv"'
But that returns:
WildcardError in line 4 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Wildcards in input files cannot be determined from output files:
'file'
Any help would be much appreciated!
Looks like you want to run your R script twice per file, once for every value of col. In this case, the rule needs to be called twice as well.
The use of expand is also a bit too much here, in my opinion. expand fills your wildcards with all possible values and returns the list of resulting files. So the output of this rule would be all possible combinations of files and cols, which the simple script cannot create in one run.
This is also the reason why file cannot be inferred from the output: it gets expanded there.
Instead, write your rule for just one file and column, and expand on the resulting output in a rule that takes this output as input. If this is the final output of your workflow, put it as input of a rule all to tell the workflow what the ultimate goal is.
rule all:
    input:
        expand("output/{file}_col{param}.csv",
               file=config["datname"], param=config["cols"])

rule get_col:
    input:
        "data/{file}.csv"
    output:
        "output/{file}_col{param}.csv"
    params:
        col=lambda wc: wc.param
    script:
        "scripts/getCols.R"
Snakemake will infer from rule all (or any other rule to further use the output) what needs to be done and will call the rule get_col accordingly.
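One caveat (an assumption about the R side, not part of the original answer): wildcard values are always strings, so wc.param arrives as "2" rather than 2. If getCols indexes columns numerically, coerce the value in the lambda:

params:
    # wildcards are strings; coerce so R receives a numeric column index
    col=lambda wc: int(wc.param)

Alternatively, call as.integer() on snakemake@params[['col']] inside the R script.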

How to force wildcards into --report caption

I am using snakemake --report (v5.9.1) to create .html reports for pipeline/results. However I cannot use wildcards in the caption parameter of report().
Here is a short example that works, without using wildcards in caption
rule all:
    input: expand("doit.{role}", role=["founder","offspring"])

rule doit:
    output: report(touch("doit.{role}"), caption="doit.rst")
    run: print(output[0])
Now, what I want/need is a separate caption for founder and offspring.
I have tried to simply add the {role} wildcard to the caption:
output: report(touch("doit.{role}"), caption="doit.{role}.rst")
but that gives an error
FileNotFoundError: [Errno 2] No such file or directory: 'sandBox/doit.{role}.rst'
but only when generating the html file by running snakemake --report. (Running the pipeline itself is OK.)
It seems that output wildcards are not evaluated/substituted when caption is parsed.
I am using the caption functionality to display short results, as well as to order the results in the .html report (related to Snakemake report: How to show results in pipeline order).
Can anyone suggest a workaround or a better pattern for what I am trying to do?

How to force robot framework to pick robot files in sequential order?

I have robot files in a folder (tests) as shown below:
tests
1_robotfile1.robot
2_robotfile2.robot
3_robotfile3.robot
4_robotfile4.robot
5_robotfile5.robot
6_robotfile6.robot
7_robotfile7.robot
8_robotfile8.robot
9_robotfile9.robot
10_robotfile10.robot
11_robotfile11.robot
Now if I execute pybot root/user1/tests (from /root/users1/power), the robot files run in the following order:
tests
1_robotfile1.robot
10_robotfile10.robot
11_robotfile11.robot
2_robotfile2.robot
3_robotfile3.robot
4_robotfile4.robot
5_robotfile5.robot
6_robotfile6.robot
7_robotfile7.robot
8_robotfile8.robot
9_robotfile9.robot
I want to force Robot Framework to pick the robot files in sequential order, like 1, 2, 3, 4, 5...
Is there an option for this?
If you have the option of renaming your files, you just need to make sure that the prefix is sortable. For numbers, that means they should all have the same number of digits.
I recommend renaming your test cases to have three or four digits for the prefix:
001_robotfile1.robot
002_robotfile2.robot
003_robotfile3.robot
004_robotfile4.robot
005_robotfile5.robot
006_robotfile6.robot
007_robotfile7.robot
008_robotfile8.robot
009_robotfile9.robot
010_robotfile10.robot
011_robotfile11.robot
...
With that, they will sort in the order that you expect.
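If renaming by hand is tedious, a small script can zero-pad the prefixes for you. A minimal sketch in Python (the tests directory and the three-digit padding are assumptions):

# zero-pad the numeric prefixes of the .robot files in tests/
import re
from pathlib import Path

for path in Path("tests").glob("*.robot"):
    match = re.match(r"(\d+)_(.*)", path.name)
    if match:
        path.rename(path.with_name(f"{int(match.group(1)):03d}_{match.group(2)}"))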
Following Emna's answer, the RF docs (http://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#execution-order) provide some solutions.
So what you could do:
- rename all the files to consistent, zero-padded numbering (001-test.robot instead of 1-test.robot); this may break internal references to other files (resources), makes it hard to add tests in between, and is error-prone when the execution order needs to change
- tag the tests, as Emna suggests
- the idea from the RF docs: write a script that creates an argument file listing the files in the proper order, and pass it to the robot execution (see the sketch below); even for 1000+ files this should take no more than a few seconds
- try to design the tests so they do not depend on execution order, and use a suite setup instead
good luck ;)
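Here is a minimal sketch of that argument-file script in Python (the tests directory and the order.args name are illustrative):

# write the .robot files, sorted by numeric prefix, to an argument file
import re
from pathlib import Path

def numeric_prefix(path):
    match = re.match(r"(\d+)_", path.name)
    return int(match.group(1)) if match else float("inf")  # unnumbered files sort last

files = sorted(Path("tests").glob("*.robot"), key=numeric_prefix)
Path("order.args").write_text("\n".join(str(f) for f in files) + "\n")

Then run pybot --argumentfile order.args (short form -A).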
Tag the tests as foo and bar so you can run each test separately:
pybot -i foo tests
or
pybot -i bar tests
and decide the order you want
pybot -i bar tests || pybot -i foo tests
