I have stored my SnakeMake rules in different .smk files. I have one file (a.smk) with following input function
def get_input(wildcards):
# Some processing
return input_list
rule some_rule_in_first_file:
input: get_input
# rest of the rule
Now in another file (b.smk) I want to do something like following,
rule another_rule_in_second_file:
input: get_input
# Rest of the rule
How can I achieve above?
I think you could use the include directive. I.e., in your b.smk you should add something like:
include: '/path/to/a.smk'
Related
I would like to use wildcards in Snakemake in a very simple way to start a script for two datasets. Unfortunately, I cannot find the proper way of doing it.
My data folder contains three files: gene_list.txt, expression_JGI.txt, expression_UBC.txt.
Here is what my snakefile looks like:
rule extract:
input:
genes="data/gene_list.txt",
expression="data/expression_{dataset}.txt"
output:
"data/expression_{dataset}_subset.txt"
shell:
"bash scripts/extract.sh {input.genes} {input.expression} {output}"
When I use snakemake -c1 extract I get the following error message:
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).
I tried adding a rule all at the beginning of the snakefile with the desired result files as input without success:
rule all:
input:
"data/expression_JGI_subset.txt",
"data/expression_UBC_subset.txt"
I also tried with expand:
DATASETS = ["JGI", "UBC"]
rule all:
input:
expand("data/expression_{dataset}_subset.txt", dataset=DATASETS)
But I get the same error message.
The script works fine when I use it outside Snakemake.
How can I achieve what I want?
When you do snakemake -c1 extract you ask snakemake to execute only rule extract and its dependencies, if any. However, because extract contains wildcards snakemake doesn't know what to replace them with. (Note that rule all is not a dependency of extract).
So either execute snakemake -c1 to run the whole pipeline or specify the concrete files you want to generate, e.g.:
snakemake -c1 -- data/expression_JGI_subset.txt data/expression_UBC_subset.txt
I am running a pipeline and was trying to optimize it by declaring the paths in a config file (config.yaml). The config.yaml file contains the path to find the scripts to run inside the pipeline, but when I expand the wildcard of the path, the pipeline does not run the script. The script itself runs fine.
To explain my problem:
rule with_script:
input: someinput
output: someoutput
script: expand("{script_path}/scriptfile", script_path = config[scriptpath])
input, output or rule all do not contain the script's path wildcard, so here is the first time I'm declaring it. The config.yaml line that contains the path looks like this:
scriptpath: /path/to/the/script
is there a way to maintain the wildcard and config file path (to make it easier for others to make changes if needed) and have the script work? Like this snakemake doesn't even enter the script file. Or maybe it is possible to declare global wildcards outside the rule all?
Thank you for your help!
P.S.: I'm sorry if this question has already been answered, but I couldn't find anything to help me with this.
You cannot define a function like expand() in the script section. Snakemake expects a path to your script.
Like the documentation states:
The script path is always relative to the Snakefile containing the directive (in contrast to the input and output file paths, which are relative to the working directory). It is recommended to put all scripts into a subfolder "scripts"
If you need to define different paths to your scripts, you can always do it in python outside of your rules. Don't forget, all python code outside of rules is executed before building the DAG. Thus, you can define all variables you want and use them in your rules.
SCRIPTSPATH = config["scriptpath"]
rule with_script:
input: someinput
output: someoutput
script: "{SCRIPTSPATH}/scriptfile"
Note:
Do not mix wildcards and "variables". In an expand function as
expand("{script_path}/scriptfile", script_path = config[scriptpath])
{script_path} is not a wildcard but just a placeholder for the values given in the second parameter of the function.
I am currently using Snakemake for a bioinformatics project. Given a human reference genome (hg19) and a bam file, I want to be able to specify that there will be multiple output files with the same name but different extensions. Here is my code
rule gridss_preprocess:
input:
ref=config['ref'],
bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
output:
expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext = config['workreq'], sample = "{sample}")
Currently config['workreq'] is a list of extensions that start with "."
For example, I want to be able to use expand to indicate the following files
S1.dedup.downsampled.bam.cigar_metrics
S1.dedup.downsampled.bam.computesamtags.changes.tsv
S1.dedup.downsampled.bam.coverage.blacklist.bed
S1.dedup.downsampled.bam.idsv_metrics
I want to be able to do this for multiple sample files, S_. Currently I am not getting an error when I try to do a dry run. However, I am not sure if this will run properly.
Am I doing this right?
expand() defines a list of files. If you're using two parameters, the cartesian product will be used. Thus, your rule will define as output ALL files with your extension list for ALL samples. Since you define a wildcard in your input, I think that what you want is all files with your extension for ONE sample. And this rule will be executed as many times as the number of samples.
You're mixing up wildcards and placeholders for the expand() function. You can define a wildcard inside an expand() by doubling the brackets:
rule all:
input: expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext = config['workreq'], sample=SAMPLELIST)
rule gridss_preprocess:
input:
ref=config['ref'],
bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
output:
expand(config['bamdir'] + "{{sample}}.dedup.downsampled.bam{ext}", ext = config['workreq'])
This expand function will expand in list
{sample}.dedup.downsampled.bam.cigar_metrics
{sample}.dedup.downsampled.bam.computesamtags.changes.tsv
{sample}.dedup.downsampled.bam.coverage.blacklist.bed
{sample}.dedup.downsampled.bam.idsv_metrics
and thus define the wildcard sample to match the files in the input.
I have input files:
Bob_1.fastq.gz
Bob_2.fastq.gz
Bob_3.fastq.gz
Bob_4.fastq.gz
Ron_1.fastq.gz
Ron_2.fastq.gz
Ron_3.fastq.gz
Ron_4.fastq.gz
I am running demultiplexing and trimming steps in one snakefile, like this:
workdir: "/path/to/dir/"
(SAMPLES,) =glob_wildcards('/path/to/dir/raw/{sample}.fastq.gz')
rule all:
input:
expand("demulptiplex/{sample}.fastq.gz", sample=SAMPLES),
expand("trimmed/{sample}.trimmed.fastq.gz", sample=SAMPLES)
rule sabre:
input:
infile="/path/to/dir/raw/{sample}.fastq.gz",
barcodefile= "files/{sample}.txt"
output:
unknownfile=temp("demulptiplex/unknown_barcode_{sample}.fastq.gz"),
shell:
"""
/Tools/sabre-master2/sabre se -f {input.infile} -b {input.barcodefile} -u {output.unknownfile}
"""
rule trimmomatic_se:
input:
r="{sample}.fastq.gz"
output:
r="trimmed/{sample}.trimmed.fastq.gz"
threads: 10
shell:
"""java -jar /Tools/Trimmomatic-0.36/trimmomatic-0.36.jar SE -threads {threads} {input.r} {output.r} ILLUMINACLIP:/Tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"""
The demultiplex output files are like this:
Bob_1_CL1.fastq.gz.... Bob_1_CL345.fastq.gz
Bob_2_CL1.fastq.gz.... Bob_1_CL248.fastq.gz
Ron_1_dad1.fastq.gz... Ron_1_dad67.fastq.gz
and so on
So,if I do not specify the demultiplex output file the program would create it by itself. My problem is how to specify/introduce a new wildcard from the output of the previous rule in the next trimming step, as the wildcards are different from initial sample now.
Wildcards just need to be consistent in a rule, not across the workflow. The issue here is that you have a rule generating 'unknown' outputs that you need to process further. For that you need to use checkpoints.
Read through the second block of code about aggregating. Your checkpoint will be demultiplexing and if you don't have any other steps, all will be your aggregate step that calls checkpoints.demultiplex.get. If you search for checkpoint on stackoverflow you will find lots of examples; it's a hard feature to use at first!
I'm trying to practice writing workflows in snakemake.
The contents of my Snakefile:
configfile: "config.yaml"
rule get_col:
input:
expand("data/{file}.csv",file=config["datname"])
output:
expand("output/{file}_col{param}.csv",file=config["datname"],param=config["cols"])
params:
col=config["cols"]
script:
"scripts/getCols.R"
The contents of config.yaml:
cols:
[2,4]
datname:
"GSE3790_expression_data"
My R script:
getCols=function(input,output,col) {
dat=read.csv(input)
dat=dat[,col]
write.csv(dat,output,row.names=F)
}
getCols(snakemake#input[[1]],snakemake#output[[1]],snakemake#params[['col']])
It seems like both columns are being called at once. What I'm trying to accomplish is one column being called from the list per output file.
Since the second output never gets a chance to be created (both columns are used to create first output), snakemake throws an error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 3 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Job completed successfully, but some output files are missing.
On a slightly unrelated note, I thought I could write the input as:
'"data/{file}.csv"'
But that returns:
WildcardError in line 4 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Wildcards in input files cannot be determined from output files:
'file'
Any help would be much appreciated!
Looks like you want to run your Rscript twice per file, once for every value of col. In this case, the rule needs to be called twice as well.
The use of expand is also a bit too much here, in my opinion. expand fills your wildcards with all possible values and returns a list of the resulting files. So the output for this rule would be all possible combinations between files and cols, which the simple script can not create in one run.
This is also the reason why file can not be inferred from the output - it gets expanded there.
Instead, try writing your rule easier for just one file and column and expand on the resulting output, in a rule which needs this output as an input. If you generated the final output of your workflow, put it as input in a rule all to tell the workflow what the ultimate goal is.
rule all:
input:
expand("output/{file}_col{param}.csv",
file=config["datname"], param=config["cols"])
rule get_col:
input:
"data/{file}.csv"
output:
"output/{file}_col{param}.csv"
params:
col=lambda wc: wc.param
script:
"scripts/getCols.R"
Snakemake will infer from rule all (or any other rule to further use the output) what needs to be done and will call the rule get_col accordingly.