Snakemake: specify a new wildcard in a new rule

I have input files:
Bob_1.fastq.gz
Bob_2.fastq.gz
Bob_3.fastq.gz
Bob_4.fastq.gz
Ron_1.fastq.gz
Ron_2.fastq.gz
Ron_3.fastq.gz
Ron_4.fastq.gz
I am running demultiplexing and trimming steps in one snakefile, like this:
workdir: "/path/to/dir/"

(SAMPLES,) = glob_wildcards('/path/to/dir/raw/{sample}.fastq.gz')

rule all:
    input:
        expand("demultiplex/{sample}.fastq.gz", sample=SAMPLES),
        expand("trimmed/{sample}.trimmed.fastq.gz", sample=SAMPLES)

rule sabre:
    input:
        infile="/path/to/dir/raw/{sample}.fastq.gz",
        barcodefile="files/{sample}.txt"
    output:
        unknownfile=temp("demultiplex/unknown_barcode_{sample}.fastq.gz"),
    shell:
        """
        /Tools/sabre-master2/sabre se -f {input.infile} -b {input.barcodefile} -u {output.unknownfile}
        """

rule trimmomatic_se:
    input:
        r="{sample}.fastq.gz"
    output:
        r="trimmed/{sample}.trimmed.fastq.gz"
    threads: 10
    shell:
        "java -jar /Tools/Trimmomatic-0.36/trimmomatic-0.36.jar SE -threads {threads} {input.r} {output.r} ILLUMINACLIP:/Tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
The demultiplex output files are like this:
Bob_1_CL1.fastq.gz ... Bob_1_CL345.fastq.gz
Bob_2_CL1.fastq.gz ... Bob_2_CL248.fastq.gz
Ron_1_dad1.fastq.gz ... Ron_1_dad67.fastq.gz
and so on
So, if I do not specify the demultiplex output files, the program creates them by itself. My problem is how to specify/introduce a new wildcard, derived from the output of the previous rule, in the trimming step, since the wildcards no longer match the initial sample names.

Wildcards only need to be consistent within a rule, not across the whole workflow. The issue here is that you have a rule generating outputs that are not known in advance, and that you need to process them further. For that you need to use checkpoints.
Read through the second block of code (the aggregation example) in the Snakemake documentation on data-dependent conditional execution. Your checkpoint will be the demultiplexing step, and if you don't have any other steps, all will be your aggregate step that calls checkpoints.demultiplex.get. If you search for checkpoint on Stack Overflow you will find lots of examples; it's a hard feature to use at first!
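A minimal sketch of that structure for this workflow. The per-sample output directory and the exact sabre invocation are assumptions (sabre is assumed to write one file per barcode into its working directory); adjust paths to your case:

import os

(SAMPLES,) = glob_wildcards("raw/{sample}.fastq.gz")

rule all:
    input:
        expand("trimmed/{sample}.done", sample=SAMPLES)

checkpoint demultiplex:
    input:
        infile="raw/{sample}.fastq.gz",
        barcodefile="files/{sample}.txt"
    output:
        outdir=directory("demultiplex/{sample}")
    shell:
        """
        mkdir -p {output.outdir}
        fq=$(realpath {input.infile}); bc=$(realpath {input.barcodefile})
        cd {output.outdir} && /Tools/sabre-master2/sabre se -f $fq -b $bc -u unknown_barcode.fastq.gz
        """

rule trimmomatic_se:
    input:
        r="demultiplex/{sample}/{unit}.fastq.gz"
    output:
        r="trimmed/{sample}/{unit}.trimmed.fastq.gz"
    threads: 10
    shell:
        "java -jar /Tools/Trimmomatic-0.36/trimmomatic-0.36.jar SE -threads {threads} {input.r} {output.r} ILLUMINACLIP:/Tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"

def trimmed_per_sample(wildcards):
    # .get() raises until the checkpoint has run, which makes Snakemake
    # execute the checkpoint first and re-evaluate this function afterwards.
    outdir = checkpoints.demultiplex.get(sample=wildcards.sample).output.outdir
    units = glob_wildcards(os.path.join(outdir, "{unit}.fastq.gz")).unit
    units = [u for u in units if not u.startswith("unknown")]  # skip the unknown-barcode bin
    return expand("trimmed/{sample}/{unit}.trimmed.fastq.gz",
                  sample=wildcards.sample, unit=units)

# The aggregate step: its input function is only resolved after the checkpoint.
rule aggregate:
    input:
        trimmed_per_sample
    output:
        touch("trimmed/{sample}.done")

rule all then only asks for the per-sample .done flags; the concrete CL*/dad* names are discovered by glob_wildcards once the checkpoint has produced them.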

Related

Use wildcard to capture different datasets in Snakemake

I would like to use wildcards in Snakemake in a very simple way to start a script for two datasets. Unfortunately, I cannot find the proper way of doing it.
My data folder contains three files: gene_list.txt, expression_JGI.txt, expression_UBC.txt.
Here is what my snakefile looks like:
rule extract:
    input:
        genes="data/gene_list.txt",
        expression="data/expression_{dataset}.txt"
    output:
        "data/expression_{dataset}_subset.txt"
    shell:
        "bash scripts/extract.sh {input.genes} {input.expression} {output}"
When I use snakemake -c1 extract I get the following error message:
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).
I tried adding a rule all at the beginning of the snakefile with the desired result files as input without success:
rule all:
    input:
        "data/expression_JGI_subset.txt",
        "data/expression_UBC_subset.txt"
I also tried with expand:
DATASETS = ["JGI", "UBC"]

rule all:
    input:
        expand("data/expression_{dataset}_subset.txt", dataset=DATASETS)
But I get the same error message.
The script works fine when I use it outside Snakemake.
How can I achieve what I want?
When you run snakemake -c1 extract you ask Snakemake to execute only rule extract and its dependencies, if any. However, because extract contains wildcards, Snakemake doesn't know what to replace them with. (Note that rule all is not a dependency of extract.)
So either execute snakemake -c1 to run the whole pipeline or specify the concrete files you want to generate, e.g.:
snakemake -c1 -- data/expression_JGI_subset.txt data/expression_UBC_subset.txt
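Putting it together: with the expand-based rule all placed at the top (this is exactly the code from the question, just combined), the whole pipeline runs with a plain snakemake -c1:

DATASETS = ["JGI", "UBC"]

rule all:
    input:
        expand("data/expression_{dataset}_subset.txt", dataset=DATASETS)

rule extract:
    input:
        genes="data/gene_list.txt",
        expression="data/expression_{dataset}.txt"
    output:
        "data/expression_{dataset}_subset.txt"
    shell:
        "bash scripts/extract.sh {input.genes} {input.expression} {output}"

The rule all here was already correct; the error only appeared because the wildcard-containing extract was named as the target on the command line.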

Snakemake wildcards: Using wildcarded files from directory output

I'm new to Snakemake and am trying to use specific files in a rule, taken from the directory() output of another rule that clones a git repo.
Currently, this gives me an error Wildcards in input files cannot be determined from output files: 'json_file', and I don't understand why. I have previously worked through the tutorial at https://carpentries-incubator.github.io/workflows-snakemake/index.html.
The difference between my workflow and the tutorial workflow is that I want to create the data I use later in the first step, whereas in the tutorial, the data was already there.
Workflow description in plain text:
Clone a git repository to path {path}
Run a script {script} on every single JSON file in the directory {path}/parsed/ in parallel to produce the aggregate result {result}
GIT_PATH = config['git_local_path']  # git/
PARSED_JSON_PATH = f'{GIT_PATH}parsed/'
GIT_URL = config['git_url']

# A single parsed JSON file
PARSED_JSON_FILE = f'{PARSED_JSON_PATH}{{json_file}}.json'
# Build a list of parsed JSON file names
PARSED_JSON_FILE_NAMES = glob_wildcards(PARSED_JSON_FILE).json_file
# All parsed JSON files
ALL_PARSED_JSONS = expand(PARSED_JSON_FILE, json_file=PARSED_JSON_FILE_NAMES)

rule all:
    input: 'result.json'

rule clone_git:
    output: directory(GIT_PATH)
    threads: 1
    conda: f'{ENVS_DIR}git.yml'
    shell: f'git clone --depth 1 {GIT_URL} {{output}}'

rule extract_json:
    input:
        cmd='scripts/extract_json.py',
        json_file=PARSED_JSON_FILE
    output: 'result.json'
    threads: 50
    shell: 'python {input.cmd} {input.json_file} {output}'
Running only clone_git works fine (if I set an all input of GIT_PATH).
Why do I get the error message? Is this because the JSON files don't exist when the workflow is started?
Also - I don't know if this matters - this is a subworkflow used with module.
What you need seems to be a checkpoint rule, which is executed first; only then does Snakemake determine which .json files are present and run your extract/aggregate rules. Here's an adapted example:
I'm struggling to fully understand the file and folder structure you get after cloning your git repo, so I have fallen back to the Snakemake best practice of using resources/ for downloaded files and results/ for created files.
You'll need to re-adjust those paths to match your case again:
GIT_PATH = config["git_local_path"]  # git/
GIT_URL = config["git_url"]

checkpoint clone_git:
    output:
        git=directory(GIT_PATH),
    threads: 1
    conda:
        f"{ENVS_DIR}git.yml"
    shell:
        f"git clone --depth 1 {GIT_URL} {{output.git}}"

rule extract_json:
    input:
        cmd="scripts/extract_json.py",
        json_file="resources/{file_name}.json",
    output:
        "results/parsed_files/{file_name}.json",
    shell:
        "python {input.cmd} {input.json_file} {output}"

def get_all_json_file_names(wildcards):
    # Asking the checkpoint for its output forces it to run first.
    git_dir = checkpoints.clone_git.get(**wildcards).output["git"]
    file_names = glob_wildcards(
        "resources/{file_name}.json"
    ).file_name
    return expand(
        "results/parsed_files/{file_name}.json",
        file_name=file_names,
    )

# Rule has checkpoint dependency: only after the checkpoint is executed
# is this rule executed, which then evaluates the function to determine all
# JSON files downloaded from the git repo.
rule aggregate:
    input:
        get_all_json_file_names
    output:
        "result.json",
    default_target: True
    shell:
        # TODO: Action which combines all JSON files
edit: Moved the expand(...) from rule aggregate into get_all_json_file_names.
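The key mechanism is the checkpoints.clone_git.get(**wildcards) call: until the checkpoint has completed it raises an exception, so Snakemake runs clone_git first and re-evaluates get_all_json_file_names afterwards, when glob_wildcards can actually see the cloned files and the input of aggregate becomes concrete.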

define SAMPLES for different dir names and sample names in Snakemake code

I have written Snakemake code to run bwa_map. The fastq files are in different folders with different sample names (paired end). It shows the error 'SAMPLES' is not defined. Please help.
Error:
$ snakemake --snakefile rnaseq.smk mapped_reads/EZ-123-B_IGO_08138_J_2_S101_R2_001.bam -np

NameError in line 2 of /Users/singhh5/Desktop/tutorial/rnaseq.smk:
name 'SAMPLES' is not defined
File "/Users/singhh5/Desktop/tutorial/rnaseq.smk", line 2, in <module>

#SAMPLE DIRECTORY
fastq/
    Sample_EZ-123-B_IGO_08138_J_2/
        EZ-123-B_IGO_08138_J_2_S101_R1_001.fastq.gz
        EZ-123-B_IGO_08138_J_2_S101_R2_001.fastq.gz
    Sample_EZ-123-B_IGO_08138_J_4/
        EZ-124-B_IGO_08138_J_4_S29_R1_001.fastq.gz
        EZ-124-B_IGO_08138_J_4_S29_R2_001.fastq.gz
#My Code
expand("~/Desktop/{sample}/{rep}.fastq.gz", sample=SAMPLES)

rule bwa_map:
    input:
        "data/genome.fa",
        "fastq/{sample}/{rep}.fastq"
    conda:
        "env.yaml"
    output:
        "mapped_reads/{rep}.bam"
    threads: 8
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
The specific error you are seeing is because the variable SAMPLES isn't set to anything before you use it in expand.
Some other issues you may run into:
The output file is missing the {sample} wildcard.
The value of threads isn't passed on to bwa or samtools.
You should place your expand in the input directive of the first rule in your snakefile, typically called all, to properly request the files from bwa_map.
You aren't pairing your reads (R1 and R2) in bwa.
You should look around Stack Overflow or some GitHub projects for similar rules to give you inspiration on how to do this mapping; a sketch addressing the points above follows.
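A minimal sketch addressing these points. It assumes each Sample_* folder holds exactly one R1/R2 pair named as in the listing above, and that a bwa index exists for data/genome.fa; the output path keeps both wildcards so Snakemake can resolve them:

# Constrain wildcards so neither can swallow a directory separator.
wildcard_constraints:
    sample="[^/]+",
    rep="[^/]+"

# sample = "EZ-123-B_IGO_08138_J_2", rep = "EZ-123-B_IGO_08138_J_2_S101", etc.
SAMPLES, REPS = glob_wildcards("fastq/Sample_{sample}/{rep}_R1_001.fastq.gz")

rule all:
    input:
        # zip keeps each sample paired with its own rep instead of the full product
        expand("mapped_reads/{sample}/{rep}.bam", zip, sample=SAMPLES, rep=REPS)

rule bwa_map:
    input:
        genome="data/genome.fa",
        r1="fastq/Sample_{sample}/{rep}_R1_001.fastq.gz",
        r2="fastq/Sample_{sample}/{rep}_R2_001.fastq.gz"
    output:
        "mapped_reads/{sample}/{rep}.bam"
    conda:
        "env.yaml"
    threads: 8
    shell:
        "bwa mem -t {threads} {input.genome} {input.r1} {input.r2} "
        "| samtools view -b -@ {threads} -o {output} -"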

Defining setup, teardown and variable in argumentfile in robotframework

Basically 2 issues:
1. I plan to execute multiple test cases from an argument file. The structure would look like this:

SOME_PATH/
    test_cases/
    some_keywords/
    argumentfile.txt
How should I define a suite setup and teardown for all those test cases executed from the argument file (-A file)?
From what I know:
a) I could define them in the files holding the first and last test cases, but the order of test cases may change, so this is not desired.
b) I could provide them in an __init__.robot file and put it somewhere without test cases, only to get the setup and teardown. This is because if I execute:
robot -i SOME_TAG -A argumentfile /path/to/init
and the __init__.robot is in the test_cases folder, it will execute the test cases with the specific tag plus those in the folder twice.
Is there any better way? Providing it, for example, in the argument file?
2. How do I provide a PATH variable in argument files in Robot Framework?
I know there is the possibility to do:
--variable PATH:some/path/to/files
but isn't that for the test suite environment? How do I get that variable to be visible in the argument file itself: ${PATH}/test_case_1.robot
For your 2nd question, you could create a temporary environment variable that you'd then use. Depending on the OS you're using, the way you'll do this will be different:
Windows:
set TESTS_PATH=some/path/here
robot -t %TESTS_PATH%/test_case_1.robot
Unix:
export TESTS_PATH="some/path/here"
robot -t $TESTS_PATH/test_case_1.robot
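As a side note, environment variables can also be read inside Robot Framework test data itself with the %{TESTS_PATH} syntax, so the same exported value can be reused both on the command line and within the suites.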
PS: you might want to avoid asking multiple, different questions in the same thread

How to force robot framework to pick robot files in sequential order?

I have robot files in a folder (tests) as shown below:
tests
1_robotfile1.robot
2_robotfile2.robot
3_robotfile3.robot
4_robotfile4.robot
5_robotfile5.robot
6_robotfile6.robot
7_robotfile7.robot
8_robotfile8.robot
9_robotfile9.robot
10_robotfile10.robot
11_robotfile11.robot
Now if I execute pybot /root/user1/tests, the robot files run in the following order:
tests
1_robotfile1.robot
10_robotfile10.robot
11_robotfile11.robot
2_robotfile2.robot
3_robotfile3.robot
4_robotfile4.robot
5_robotfile5.robot
6_robotfile6.robot
7_robotfile7.robot
8_robotfile8.robot
9_robotfile9.robot
I want to force Robot Framework to pick up the robot files in sequential order, like 1, 2, 3, 4, 5, ...
Do we have any option for this?
If you have the option of renaming your files, you just need to make sure that the prefix is sortable. For numbers, that means they should all have the same number of digits.
I recommend renaming your test cases to have three or four digits for the prefix:
001_robotfile1.robot
002_robotfile2.robot
003_robotfile3.robot
004_robotfile4.robot
005_robotfile5.robot
006_robotfile6.robot
007_robotfile7.robot
008_robotfile8.robot
009_robotfile9.robot
010_robotfile10.robot
011_robotfile11.robot
...
With that, they will sort in the order that you expect.
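If there are many files, the renaming itself can be scripted; a minimal sketch in Python (the tests/ folder name and the N_name.robot prefix pattern shown above are assumptions):

import re
from pathlib import Path

for f in Path("tests").glob("*.robot"):
    m = re.match(r"(\d+)_(.*)", f.name)
    if m:
        # e.g. 1_robotfile1.robot -> 001_robotfile1.robot
        f.rename(f.with_name(f"{int(m.group(1)):03d}_{m.group(2)}"))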
Following Emna's answer, the RF docs (http://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#execution-order) provide some solutions.
So what you could do:
rename all the files to have consecutive, computer-sortable numbering (001-test.robot instead of 1-test.robot). This may break internal references to other files (resources), it is hard to add a test in between, and it is error-prone when the execution order needs to change
tag the tests, as in Emna's answer
idea from the RF docs: write a script that creates an argument file which keeps the ordering the proper way, and pass it to the robot execution (see the sketch below). For 1000+ files it should not take longer than a few seconds.
try to design tests not to depend on execution order; use a suite setup instead.
good luck ;)
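A minimal sketch of such a generator script in Python (the tests/ folder, the numeric-prefix pattern, and the argumentfile.txt name are assumptions):

import re
from pathlib import Path

def numeric_key(path):
    # Sort by the numeric prefix so "2_x.robot" comes before "10_x.robot".
    match = re.match(r"(\d+)", path.name)
    return (int(match.group(1)) if match else float("inf"), path.name)

files = sorted(Path("tests").glob("*.robot"), key=numeric_key)
Path("argumentfile.txt").write_text("".join(f"{f}\n" for f in files))

The result can then be used as: robot -A argumentfile.txt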
Tag the tests as foo and bar so you can run each test separately:
pybot -i foo tests
or
pybot -i bar tests
and decide the order you want
pybot -i bar tests; pybot -i foo tests
