Snakemake error "Not all output, log and benchmark files of rule <RULE> contain the same wildcards"

I have a bare-bones Snakefile as a test/demo, but it keeps producing this error:
Not all output, log and benchmark files of rule test contain the same wildcards.
Here is the snakefile content:
samples = ['A', 'B', 'C']

rule test:
    input:
        "mapped/{sample_name}.fsa", sample_name=samples
    output:
        "mapped/{sample_name}_out.fsa", sample_name=samples
    shell: "cp {input} {output}"
I cannot for the life of me figure out what is wrong.

It turns out I should not have used the samples list this way. Removing sample_name=samples from the rule and adding a rule all that uses the expand function solved it.
Sorry about the trouble.
samples = ['A', 'B', 'C']

rule all:
    input: expand("mapped/{sample_name}_out.fsa", sample_name=samples)

rule test:
    input:
        "mapped/{sample_name}.fsa"
    output:
        "mapped/{sample_name}_out.fsa"
    shell: "cp {input} {output}"
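For reference, expand() does nothing magical: it substitutes each value into the pattern and returns the list of concrete filenames. A rough shell sketch of what expand("mapped/{sample_name}_out.fsa", sample_name=samples) evaluates to:

```shell
# Build the same list of concrete filenames that expand() would return
# for samples = ['A', 'B', 'C']:
for s in A B C; do
  printf 'mapped/%s_out.fsa\n' "$s"
done
# -> mapped/A_out.fsa
#    mapped/B_out.fsa
#    mapped/C_out.fsa
```

Requesting those concrete files in rule all is what lets Snakemake resolve the {sample_name} wildcard for each job.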

Related

Execute a rule whose wildcard is obtained via the params in the rule

I have a rule that contains a wildcard whose value is taken from the params (decided by a function). Is it possible to run this single rule by calling it by name, e.g., snakemake a?
rule a:
    input: file1, file2
    output: directory("1"), directory("2/{domain}.txt")
    params:
        domain_cat = lambda wc, input: pd.read_csv(input[1], index_col=0).loc[wc.domain, "cat"]
    shell:
        """
        example.py {input} {output} {params.domain_cat}
        """
No. If a rule contains a wildcard, you can no longer run it by calling it by name alone: Snakemake needs to know the value of the wildcard, and that value is passed through a requested output filename.
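Instead, you can request a concrete output file on the command line; Snakemake then derives the wildcard value by matching the filename against the output pattern. A sketch of that matching, using a hypothetical domain value PF00001:

```shell
# Asking for "2/PF00001.txt" lets Snakemake match it against the output
# pattern "2/{domain}.txt" and set wildcards.domain = "PF00001".
# The match is essentially plain string capture:
target="2/PF00001.txt"
domain=${target#2/}       # strip the literal prefix "2/"
domain=${domain%.txt}     # strip the literal suffix ".txt"
echo "$domain"            # -> PF00001
```

So snakemake 2/PF00001.txt (rather than snakemake a) would run the rule for that one domain.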

Snakemake WorkflowError: Target rules may not contain wildcards

rule all:
    input:
        "../data/A_checkm/{genome}"

rule A_checkm:
    input:
        "../data/genomesFna/{genome}_genomic.fna.gz"
    output:
        directory("../data/A_checkm/{genome}")
    threads:
        16
    resources:
        mem_mb = 40000
    shell:
        """
        # set up a tmp working dir
        tmp=$(mktemp -d)
        mkdir $tmp/ref
        cp {input} $tmp/ref/genome.fna.gz
        cd $tmp/ref
        gunzip -c genome.fna.gz > genome.fna
        cd $tmp
        # run checkm
        checkm lineage_wf -t {threads} -x fna ref out > stdout
        # prepare the output folder
        cd {config[project_root]}
        mkdir -p {output}
        # copy results over
        cp -r $tmp/out/* {output}/
        cp $tmp/stdout {output}/checkm.txt
        # clean up
        rm -rf $tmp
        """
Thank you in advance for your help!
I would like to run checkm on a list of ~600 downloaded genome files having the extension '.fna.gz'. Each downloaded file is saved in a separate folder with the same name as the genome. I would also like the results for each genome in its own folder, which is why my output is a directory.
When I run this code with 'snakemake -s Snakefile --cores 10 A_checkm', I get the following error:
WorkflowError: Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).
Could anyone help me identify the error, please?
You need to provide snakemake with concrete values for the {genome} wildcard. You cannot just leave it open and expect snakemake to work on all the files in some folder of your project just like that.
Determine the filenames/genome values of the files which you want to work on, using glob_wildcards(...). See the documentation for further details.
Now you can use these values in rule all to request all the output folders (produced by your other rule) for those {genome} values:
# Determine the {genome} values for all downloaded files
(GENOMES,) = glob_wildcards("../data/genomesFna/{genome}_genomic.fna.gz")

rule all:
    input:
        expand("../data/A_checkm/{genome}", genome=GENOMES),

rule A_checkm:
    input:
        "../data/genomesFna/{genome}_genomic.fna.gz",
    output:
        directory("../data/A_checkm/{genome}"),
    threads: 16
    resources:
        mem_mb=40000,
    shell:
        # Your magic goes here
If the download is supposed to happen inside snakemake, add a checkpoint for that. Have a look at this answer then.
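glob_wildcards itself is just pattern matching against files that already exist: it scans for files matching the pattern and collects the part covered by the {genome} wildcard. A rough shell equivalent, using a temporary directory and two made-up genome names:

```shell
# Emulate glob_wildcards("../data/genomesFna/{genome}_genomic.fna.gz")
tmp=$(mktemp -d)
touch "$tmp/GCF_000001_genomic.fna.gz" "$tmp/GCF_000002_genomic.fna.gz"
for f in "$tmp"/*_genomic.fna.gz; do
  b=${f##*/}                    # drop the directory part
  echo "${b%_genomic.fna.gz}"   # drop the fixed suffix, leaving {genome}
done
rm -rf "$tmp"
# -> GCF_000001
#    GCF_000002
```

This is why the downloads must exist (or be produced by a checkpoint) before glob_wildcards can see them.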

Zsh glob: objects recursively under subdirectories, excluding current/base directory

I am trying to do a file name generation of all objects (files, directories, and so on) recursively under all subdirectories of the current directory. Excluding the objects in said current directory.
In other words, given:
.
├── dir1
│   ├── dir2.1
│   ├── dir2.2
│   │   └── file3.1
│   └── file2.1
└── file1
I want to generate:
./dir2.1
./dir2.2
./dir2.2/file3.1
./file2.1
I have set the EXTENDED_GLOB option, and I assumed that the following pattern would do the trick:
./**/*~./*
But it returns:
zsh: no matches found: ./**/*~./*
I don't know what the problem is, it should work.
./**/* gives:
./dir1
./dir1/dir2.1
./dir1/dir2.2
./dir1/dir2.2/file3.1
./dir1/file2.1
./file1
And ./* gives:
./dir1
./file1
How come ./**/*~./* fails? And more importantly, how can I generate the names of the elements recursively in the subdirectories, excluding the elements in the current/base directory?
Thanks.
The x~y glob operator (1) treats y as an ordinary pattern match rather than performing filename generation with it, so ./**/*~./* gives "no matches found":
% print -l ./**/*~./*
;# ./dir1          # <= './*' matches, so exclude this entry
;# ./dir1/dir2.1   # <= './*' matches, so exclude this entry
;# ...             # ditto
;# => finally, no matches found
The exclusion pattern ./* matches everything generated by the glob ./**/*, so zsh ultimately reports "no matches found". (zsh does not perform filename generation on the ~y part, and in the exclusion pattern / is not treated specially, so ./* matches nested paths such as ./dir1/dir2.1 as well.)
We can make the exclusion pattern a little more precise (if more complicated) so that it excludes only the entries of the current directory: paths that start with ./ followed by one or more characters other than /.
% print -l ./**/*~./[^/]## ;# use '~./[^/]##' rather than '~./*'
./dir1/dir2.1
./dir1/dir2.2
./dir1/dir2.2/file3.1
./dir1/file2.1
Then, to strip the current-directory component (/dir1 here), we could use the estring glob qualifier (2), which lets us rewrite each matched name; here it removes the first occurrence of /[^/]## (for example /dir1):
# $p for avoiding repetitive use of the exclusion pattern.
% p='./[^/]##'; print -l ./**/*~${~p}(e:'REPLY=${REPLY/${~p[2,-1]}}':)
./dir2.1
./dir2.2
./dir2.2/file3.1
./file2.1
Or, to strip it using an ordinary array substitution instead of the estring glob qualifier:
% p='./[^/]##'; a=(./**/*~${~p}) ; a=(${a/${~p[2,-1]}}); print -l $a
./dir2.1
./dir2.2
./dir2.2/file3.1
./file2.1
Finally, iterating over the current directory's subdirectories would do the job, too:
a=(); dir=
for dir in *(/); do
    pushd "$dir"
    a+=(./**/*)
    popd
done
print -l $a
#=> ./dir2.1
#   ./dir2.2
#   ./dir2.2/file3.1
#   ./file2.1
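For comparison, the same listing can be produced portably (outside zsh) with find, run from inside each first-level directory. This sketch recreates the example tree in a scratch directory first:

```shell
# Recreate the tree from the question in a temporary directory
tmp=$(mktemp -d)
mkdir -p "$tmp/dir1/dir2.1" "$tmp/dir1/dir2.2"
touch "$tmp/dir1/dir2.2/file3.1" "$tmp/dir1/file2.1" "$tmp/file1"
# List everything under dir1, relative to dir1, excluding dir1 itself
( cd "$tmp/dir1" && find . -mindepth 1 | sort )
rm -rf "$tmp"
# -> ./dir2.1
#    ./dir2.2
#    ./dir2.2/file3.1
#    ./file2.1
```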
Here are the relevant excerpts from the zsh documentation.
(1) The x~y glob operator:
x~y
(Requires EXTENDED_GLOB to be set.) Match anything that matches the pattern x but does not match y. This has lower precedence than any operator except ‘|’, so ‘*/*~foo/bar’ will search for all files in all directories in ‘.’ and then exclude ‘foo/bar’ if there was such a match. Multiple patterns can be excluded by ‘foo~bar~baz’. In the exclusion pattern (y), ‘/’ and ‘.’ are not treated specially the way they usually are in globbing.
--- zshexpn(1), x~y, Glob Operators
(2) The estring glob qualifier:
estring
+cmd
...
During the execution of string the filename currently being tested is available in the parameter REPLY; the parameter may be altered to a string to be inserted into the list instead of the original filename.
--- zshexpn(1), estring, Glob Qualifiers

Purpose of square brackets in shell scripts

I came across this line in one of the shell scripts:
[-f $host_something ] && .$host_something
What are the square brackets with the -f switch supposed to do, and what is the point of ANDing it with the same environment variable?
[ is actually a binary; it's another name for the test(1) command. It requires its last argument to be ], which is otherwise ignored. Run man test for further information. It's not really shell syntax.
The square bracket is really an alias for the test tool, so you can look at man test to find out how it works. The -f switch is one of many tests this tool can run; it checks whether a file exists and is a regular file.
You need some more spaces.
The command
[ -f $host_something ] && . $host_something
stands for
if [ -f $host_something ]; then
    source $host_something
fi
or in words:
If the file named by the variable host_something really is a regular file, execute the commands in that file without opening a subshell. You do not want a subshell here, since any settings made in a subshell are lost as soon as the subshell exits.
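A self-contained demonstration of the idiom, using a throwaway file in place of $host_something:

```shell
f=$(mktemp)                                 # stand-in for $host_something
[ -f "$f" ] && echo "is a regular file"     # bracket form
test -f "$f" && echo "same check via test"  # equivalent spelling
rm -f "$f"
[ -f "$f" ] || echo "gone now"              # the test fails once the file is removed
```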

Unix script error

I have the following line in my unix script file:
if [[ -f $DIR1/$FILE1 ] -a [ -f $DIR1/$FILE2 ]]; then
The line checks for the existence of two files in a directory; if both files are present, some logic will be executed.
However, on running the script I am getting the following error on above line:
test_script: line 30: syntax error at line 54: `]' unexpected
line 54 is where above line is present.
What does this error mean ? Where am I wrong ?
Thanks for reading!
For the most common shells at least, [ and ] are not like parentheses in C, where you use them to group subexpressions.
What you need is something like (for bash):
if [[ -f $DIR1/$FILE1 && -f $DIR1/$FILE2 ]]; then
If you want help with a specific (non-bash) shell, you should let us know which one you're using.
There is no need for the inner [ and ] around each -f test; a single [ ... ] with -a works:
if [ -f $DIR1/$FILE1 -a -f $DIR1/$FILE2 ]; then
Output:
shadyabhi@archlinux /tmp $ touch foo; touch foo2
shadyabhi@archlinux /tmp $ if [ -f "foo" -a -f "foo2" ]; then echo "Hello"; fi
Hello
shadyabhi@archlinux /tmp $
It's interesting that there are multiple answers explaining the subtle differences between [ and [[, but for some reason our culture seems to discourage people from providing the obvious solution: stop using [ entirely. Instead of [, use test:
if test -f $DIR1/$FILE1 && test -f $DIR1/$FILE2; then
test is cleaner syntax than [, which requires a final ] argument and continually confuses people into thinking the brackets are part of the language. [[ is not portable and confuses people who don't realize that many shells provide extra, non-standard functionality. There is a case to be made that [[ can be more efficient than [, but if run-time performance is a problem in your shell script, you probably shouldn't be solving the problem in sh.
You had an extra [ and an extra ].
if [ -f $DIR1/$FILE1 -a -f $DIR1/$FILE2 ]; then
Basically, you were mixing two syntaxes that aim to do the same thing: namely [ ] and [[ ]]. The former is more portable, but the latter is more powerful, and the majority of shells you will come across do support [[ ]].
But better still is the following, since you are already using the [[ ]] construct:
if [[ -f $DIR1/$FILE1 && -f $DIR1/$FILE2 ]]; then
As @paxdiablo stated, you can use it this way:
if [[ -f $DIR1/$FILE1 ]] && [[ -f $DIR1/$FILE2 ]]; then
or this way:
if [[ -f $DIR1/$FILE1 ]] || [[ -f $DIR1/$FILE2 ]]; then
