Execute a single rule by name when its params depend on a wildcard

I have a rule that contains a wildcard that is taken from the params (decided by a function). Is it possible to run this single rule by calling it by name, e.g., snakemake a
rule a:
    input: file1, file2
    output: directory(1), directory("2/{domain}.txt")
    params:
        domain_cat = lambda wc, input: pd.read_csv(input[1], index_col=0).loc[wc.domain, "cat"]
    shell:
        """
        example.py {input} {output} {params.domain_cat}
        """

No. Once a rule contains a wildcard, you can no longer run it by calling it by name. Snakemake needs to know the value of the wildcard, and that value is only passed through the filename of the requested output. Request a concrete output file as the target instead.
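To illustrate why the filename is needed, here is a minimal sketch of how a wildcard value can be recovered from a requested target. It is an emulation in plain Python, not Snakemake's own code; the helper names pattern_to_regex and wildcards_for and the target 2/example.txt are made up for the illustration. It assumes the default behaviour where a wildcard matches one or more arbitrary characters.

```python
import re

# Output pattern taken from rule a in the question.
OUTPUT_PATTERN = "2/{domain}.txt"


def pattern_to_regex(pattern):
    """Turn 'dir/{name}.txt' into a regex with one named group per wildcard."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text
        else:
            regex += "(?P<%s>.+)" % part      # wildcard: one or more characters
    return re.compile(regex)


def wildcards_for(target, pattern=OUTPUT_PATTERN):
    """Return the wildcard values for a target, or None if it does not match."""
    match = pattern_to_regex(pattern).fullmatch(target)
    return match.groupdict() if match else None


print(wildcards_for("2/example.txt"))  # {'domain': 'example'}
print(wildcards_for("a"))              # None: a bare rule name carries no value
```

A bare rule name gives the matcher nothing to bind {domain} to, which is why the params function cannot be evaluated.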


What is the meaning of each parameter for *(*ocNY1) from the shell command `echo`?

I could not find where the parameters of the command below are documented.
echo *(*ocNY1)
After some tests, I discovered that *(*oc) prints the executable files (files with the x permission) in the current directory, and that NY1 prints only the first of them. But I cannot find a manual for these options.
Where can I look up the explanation of each parameter of this pattern?
Is this a glob pattern or a regex that echo is using?
Sometimes it is really hard to take the first step if you do not know where you are heading.
*(*ocNY1) is a zsh glob pattern - see man zshexpn.
* is a glob operator that matches any string, including the null string.
The trailing (...) contains glob qualifiers:
* to match executable plain files
oc sort by time of last inode change, youngest first
N sets the nullglob option for the current pattern
Yn expand to at most n filenames
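As a rough cross-check of what the qualifiers select, the same listing can be sketched in Python. This is only an approximation under stated assumptions: the function name newest_executables is made up, symbolic links are not followed, and dotfiles are skipped because a plain * does not match them by default.

```python
import os
import stat


def newest_executables(directory=".", limit=1):
    """Approximate zsh's *(*ocNY<limit>) for the given directory."""
    entries = []
    for name in os.listdir(directory):
        if name.startswith("."):
            continue  # a plain * does not match dotfiles by default
        st = os.lstat(os.path.join(directory, name))
        # '*' qualifier: plain file with at least one execute bit set
        if stat.S_ISREG(st.st_mode) and st.st_mode & 0o111:
            entries.append((st.st_ctime_ns, name))
    # 'oc': sort by inode change time, most recent first
    entries.sort(key=lambda entry: entry[0], reverse=True)
    # 'Y<limit>': keep at most <limit> names; 'N': an empty result is not an error
    return [name for _, name in entries[:limit]]


print(newest_executables("."))
```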

Zsh glob: objects recursively under subdirectories, excluding current/base directory

I am trying to do a file name generation of all objects (files, directories, and so on) recursively under all subdirectories of the current directory. Excluding the objects in said current directory.
In other words, given:
.
|-- dir1
|   |-- dir2.1
|   |-- dir2.2
|   |   `-- file3.1
|   `-- file2.1
`-- file1
I want to generate:
./dir2.1
./dir2.2
./dir2.2/file3.1
./file2.1
I have set the EXTENDED_GLOB option, and I assumed that the following pattern would do the trick:
./**/*~./*
But it returns:
zsh: no matches found: ./**/*~./*
I don't know what the problem is, it should work.
./**/* gives:
./dir1
./dir2.1
./dir2.2
./dir2.2/file3.1
./file2.1
./file1
And ./* gives:
./dir1
./file1
How come ./**/*~./* fails? And more important, how can I generate the name of the elements recursively in the subdirectories excluding the elements in current/base directory?
Thanks.
The (1)x~y glob operator treats y as an ordinary shell pattern that is matched against each generated name, not as a filename generation of its own, so ./**/*~./* gives "no matches found":
% print -l ./**/*~./*
;# ./dir1 # <= './*' matches, so exclude this entry
;# ./dir2.1 # <= './*' matches, so exclude this entry
;# .. # ditto...
;# => finally, no matches found
The exclusion pattern ./* matches everything generated by the glob ./**/*, so zsh finally yields "no matches found". (zsh does not do filename generations for the ~y part.)
We can make the exclusion pattern more precise so that it excludes only the elements of the current directory: it should start with ./ and be followed by one or more characters other than /.
% print -l ./**/*~./[^/]## ;# use '~./[^/]##' rather than '~./*'
./dir1/dir2.1
./dir1/dir2.2
./dir1/dir2.2/file3.1
./dir1/file2.1
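The difference between the two exclusion patterns can be reproduced outside zsh. The sketch below is an analogy in Python, not zsh's matcher: fnmatch's * crosses / just as * does in the exclusion part of x~y, and the regular expression \./[^/]+ stands in for ./[^/]##. The list generated is typed in by hand from the example tree.

```python
import fnmatch
import re

# Names that ./**/* generates for the example tree.
generated = [
    "./dir1",
    "./dir1/dir2.1",
    "./dir1/dir2.2",
    "./dir1/dir2.2/file3.1",
    "./dir1/file2.1",
    "./file1",
]


def exclude_glob(names, pattern):
    """Drop every name matched by a pattern whose * also matches '/'."""
    return [n for n in names if not fnmatch.fnmatchcase(n, pattern)]


def exclude_regex(names, regex):
    """Drop every name that the regular expression matches in full."""
    return [n for n in names if not re.fullmatch(regex, n)]


# './*' swallows everything, which is why zsh reports "no matches found".
print(exclude_glob(generated, "./*"))
# Only the top-level names './dir1' and './file1' are excluded here.
print(exclude_regex(generated, r"\./[^/]+"))
```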
Then, to strip the current-dir-component /dir1, we could use the (2)estring glob qualifier, such that it removes the first occurrence of /[^/]## (for example /dir1):
# $p for avoiding repetitive use of the exclusion pattern.
% p='./[^/]##'; print -l ./**/*~${~p}(e:'REPLY=${REPLY/${~p[2,-1]}}':)
./dir2.1
./dir2.2
./dir2.2/file3.1
./file2.1
Or strip it with an ordinary array replacement instead of the estring glob qualifier:
% p='./[^/]##'; a=(./**/*~${~p}) ; a=(${a/${~p[2,-1]}}); print -l $a
./dir2.1
./dir2.2
./dir2.2/file3.1
./file2.1
Finally, iterating over the directories in the current directory does the job, too:
a=(); dir=;
for dir in *(/); do
    pushd "$dir"
    a+=(./**/*)
    popd
done
print -l $a
#=> ./dir2.1
#   ./dir2.2
#   ./dir2.2/file3.1
#   ./file2.1
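For comparison, the loop approach translates to a few lines of Python. This is a sketch under assumptions: the function name below_subdirectories is made up, hidden entries are skipped to mirror zsh's default handling of dotfiles, and symbolic links to directories are not treated as top-level directories, as with *(/).

```python
from pathlib import Path


def below_subdirectories(base="."):
    """List everything under each top-level directory, relative to that directory."""
    results = []
    for top in sorted(Path(base).iterdir()):
        # Like *(/): real directories only, and no dot directories.
        if top.name.startswith(".") or top.is_symlink() or not top.is_dir():
            continue
        for path in sorted(top.rglob("*")):
            relative = path.relative_to(top)
            # pathlib's * matches dotfiles, zsh's does not by default.
            if any(part.startswith(".") for part in relative.parts):
                continue
            results.append("./" + relative.as_posix())
    return results


for name in below_subdirectories("."):
    print(name)
```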
Here are the relevant excerpts from the zsh documentation.
(1)x~y glob operator:
x~y
(Requires EXTENDED_GLOB to be set.) Match anything that matches the pattern x but does not match y. This has lower precedence than any operator except ‘|’, so ‘*/*~foo/bar’ will search for all files in all directories in ‘.’ and then exclude ‘foo/bar’ if there was such a match. Multiple patterns can be excluded by ‘foo~bar~baz’. In the exclusion pattern (y), ‘/’ and ‘.’ are not treated specially the way they usually are in globbing.
--- zshexpn(1), x~y, Glob Operators
(2)estring glob qualifier:
estring
+cmd
...
During the execution of string the filename currently being tested is available in the parameter REPLY; the parameter may be altered to a string to be inserted into the list instead of the original filename.
--- zshexpn(1), estring, Glob Qualifiers

noglob function then use ls with param?

I just want to pass a glob through and then use it against ls directly. The simplest example would be:
test() { ls -d ~/$1 }
alias test="noglob test"
test D*
If I simply run ls D* in my home directory, it outputs three files. But if I run the snippet above, I get "/Users/jubi/D*": No such file or directory. What should I be doing? Thanks!
The authoritative and complete documentation of Zsh expansion mechanism is located at http://zsh.sourceforge.net/Doc/Release/Expansion.html.
Here's the reason your version doesn't work:
If a word contains an unquoted instance of one of the characters ‘*’, ‘(’, ‘|’, ‘<’, ‘[’, or ‘?’, it is regarded as a pattern for filename generation, unless the GLOB option is unset.
(Emphasis mine, on "unquoted".) Your glob operator is produced by parameter expansion, so it isn't considered unquoted.
You need the GLOB_SUBST option for the result of a parameter expansion to be evaluated as a glob pattern. A setopt globsubst / unsetopt globsubst pair works, of course, but the easiest way is to use the following expansion form, which exists specifically for this purpose:
${~spec}
Turn on the GLOB_SUBST option for the evaluation of spec; if the ‘~’ is doubled, turn it off. When this option is set, the string resulting from the expansion will be interpreted as a pattern anywhere that is possible, such as in filename expansion and filename generation and pattern-matching contexts like the right hand side of the ‘=’ and ‘!=’ operators in conditions.
In nested substitutions, note that the effect of the ~ applies to the result of the current level of substitution. A surrounding pattern operation on the result may cancel it. Hence, for example, if the parameter foo is set to *, ${~foo//\*/*.c} is substituted by the pattern *.c, which may be expanded by filename generation, but ${${~foo}//\*/*.c} substitutes to the string *.c, which will not be further expanded.
So:
t () { ls -d ~/${~1} }
alias t="noglob t"
By the way, test is a POSIX shell builtin (aka [). Don't shadow it.
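The same split between "receive the pattern as a plain string" and "expand it explicitly later" can be sketched in Python, where a string is never globbed unless you ask for it. This is an analogy rather than a description of zsh internals; the function name list_in_home and its home parameter are made up for the illustration.

```python
import glob
import os


def list_in_home(pattern, home=None):
    """Expand a glob pattern relative to the home directory, like ls -d ~/${~1}."""
    if home is None:
        home = os.path.expanduser("~")
    # The pattern arrives unexpanded (the role of noglob) and is expanded
    # here on purpose (the role of ${~1}). The directory part is escaped so
    # that only the caller's pattern is treated as a glob.
    return sorted(glob.glob(os.path.join(glob.escape(home), pattern)))


print(list_in_home("D*"))
```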

How to select all files from one sample?

I have a problem figuring out how to make the input directive only select all {samples} files in the rule below.
rule MarkDup:
    input:
        expand("Outputs/MergeBamAlignment/{samples}_{lanes}_{flowcells}.merged.bam", zip,
            samples=samples['sample'],
            lanes=samples['lane'],
            flowcells=samples['flowcell']),
    output:
        bam = "Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics = "Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk --java-options -Djava.io.tempdir=`pwd`/tmp \
        MarkDuplicates \
        $(echo ' {input}' | sed 's/ / --INPUT /g') \
        -O {output.bam} \
        --VALIDATION_STRINGENCY LENIENT \
        --METRICS_FILE {output.metrics} \
        --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 200000 \
        --CREATE_INDEX true \
        --TMP_DIR Outputs/MarkDuplicates/tmp"
Currently it will create correctly named output files, but it selects all files that match the pattern based on all wildcards. So I'm perhaps halfway there. I tried changing {samples} to {{samples}} in the input directive as such:
expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", zip,
    lanes=samples['lane'],
    flowcells=samples['flowcell']),
but this broke the previous rule somehow. So the solution is something like
input:
"{sample}_*.bam"
But clearly this doesn't work.
Is it possible to collect all files that match {sample}_*.bam with a function and use that as input? And if so, will the function still work with $(echo ' {input}' etc...) in the shell directive?
If you just want all the files in the directory, you can use a lambda function:
from glob import glob
rule MarkDup:
    input:
        lambda wcs: glob('Outputs/MergeBamAlignment/%s*.bam' % wcs.samples)
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
    shell:
        ...
Just be aware that this approach can't do any checking for missing files, since it will always report that the files needed are the files that are present. If you do need confirmation that the upstream rule has been executed, you can have the previous rule touch a flag, which you then require as input to this rule (though you don't actually use the file for anything other than enforcing execution order).
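The input function can also be written as a named helper, which makes it easy to try outside of Snakemake. The following is a sketch, assuming the file names follow the {samples}_{lanes}_{flowcells}.merged.bam layout from the question; the name merged_bams is made up. It keeps the underscore after the sample name in the pattern, so that a sample called S1 does not also pick up the files of S10.

```python
import glob
import os


def merged_bams(sample, root="Outputs/MergeBamAlignment"):
    """Return all merged BAM files on disk that belong to one sample."""
    # Escape the fixed parts so only the trailing * acts as a glob.
    pattern = os.path.join(
        glob.escape(root), glob.escape(sample) + "_*.merged.bam"
    )
    return sorted(glob.glob(pattern))


print(merged_bams("S1"))
```

Inside the rule this would be used as input: lambda wcs: merged_bams(wcs.samples). Because {input} still expands to a space-separated list of paths, the $(echo ' {input}' | sed ...) construct in the shell directive keeps working.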
If I understand correctly, zip needs to be applied only to {lanes} and {flowcells} and not to {samples}. In that case, two nested expand calls achieve that:
input:
    expand(expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam",
                  zip, lanes=samples['lane'], flowcells=samples['flowcell']),
           samples=samples['sample'])
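What the nested calls produce can be checked with plain string formatting. This is an emulation, not Snakemake's expand: the function name zip_then_product and the sample values are made up. The inner step zips lanes with flowcells and leaves {samples} in place (the doubled braces survive one round of formatting), and the outer step then combines every resulting template with every sample.

```python
TEMPLATE = "Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam"


def zip_then_product(template, lanes, flowcells, samples):
    """Zip lanes with flowcells, then combine each pair with every sample."""
    # Inner step: '{{samples}}' becomes the literal placeholder '{samples}'.
    inner = [
        template.format(lanes=lane, flowcells=flowcell)
        for lane, flowcell in zip(lanes, flowcells)
    ]
    # Outer step: fill in the sample for every zipped template.
    return [t.format(samples=sample) for t in inner for sample in samples]


paths = zip_then_product(TEMPLATE, ["L1", "L2"], ["FC1", "FC2"], ["A", "B"])
for path in paths:
    print(path)
```

Note that the result pairs every sample with every (lane, flowcell) combination, so it can name files that do not exist when samples were sequenced on different lanes.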
PS: output.tmp file uses {sample} instead of {samples}. Typo?

applying zsh qualifiers on array elements or directly on a result of a command substitution

I did
a=( $(pacman -Qlq packagename) )
to put the files belonging to a package into an array.
Why does this print only the first match, and how can I print them all in zsh?
print -l ${a[(r)*i*]}
Also, how can I apply zsh glob qualifiers to all array elements, say to list only plain files via (.)?
Is there an easier way that skips the intermediary array, so that the qualifier is applied directly to the result of a command substitution?
As per the documentation, the subscript flag (r) returns only the first matching array element.
To get all matching elements, you can use the ${name:#pattern} parameter expansion, which removes every element matching pattern from the expansion. To remove the non-matching elements instead, either use the flag (M) or negate the pattern with ^ (this requires the EXTENDED_GLOB option to be enabled):
print -l ${(M)a:#*i*}
setopt extendedglob
print -l ${a:#^*i*}
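The difference between the two lookups is the difference between "first match" and "filter". A small Python sketch of the two behaviours follows; the list of paths is made up, and fnmatchcase is used as a stand-in for zsh pattern matching.

```python
from fnmatch import fnmatchcase

# Hypothetical package file list, standing in for the pacman output.
a = ["/usr/", "/usr/bin/", "/usr/bin/vim", "/usr/share/doc/"]


def first_match(items, pattern):
    """Like ${a[(r)pattern]}: the first matching element, or an empty string."""
    return next((x for x in items if fnmatchcase(x, pattern)), "")


def all_matches(items, pattern):
    """Like ${(M)a:#pattern}: every matching element."""
    return [x for x in items if fnmatchcase(x, pattern)]


print(first_match(a, "*i*"))   # /usr/bin/
print(all_matches(a, "*i*"))   # ['/usr/bin/', '/usr/bin/vim']
```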
You can skip explicitly creating an intermediary array by just using the parameter expansion on the command substitution ($(...)) directly:
print -l ${(M)$(pacman -Qlq packagename):#*i*}
It seems that glob qualifiers do not work with patterns inside parameter expansions. But you can enable the RC_EXPAND_PARAM option to expand every single array element within a word instead of the whole array. So foo${xx}bar with xx=(a b c) will be expanded to fooabar foobbar foocbar instead of fooa b cbar. You can enable it either globally with setopt rcexpandparam or for a specific expansion by wrapping it in ${^...}. This way you can add a glob qualifier to each element of the filtered array. To print only elements that are paths to files, you can use
print -l ${^${(M)$(pacman -Qlq packagename):#*i*}}(.N)
This essentially takes each path and attaches (.N) as a glob qualifier (which works even though the path contains no glob operators). The resulting patterns are then evaluated as part of filename generation. The . tells zsh to match only plain files. N enables the NULL_GLOB option for these patterns; otherwise the command would abort with a "no matches found" error as soon as it encounters a path that is not a plain file (e.g., /usr is a directory, so /usr(.) does not match any plain file on your system).
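The effect of the combined expression, filter by pattern and then keep only plain files while silently dropping everything else, can be sketched in Python as follows. The function name plain_files is made up, and symbolic links are not followed here.

```python
import os
import stat
from fnmatch import fnmatchcase


def plain_files(paths, pattern="*i*"):
    """Keep paths that match the pattern and refer to existing plain files."""
    kept = []
    for path in paths:
        if not fnmatchcase(path, pattern):
            continue  # the (M)...:#pattern filter
        try:
            mode = os.lstat(path).st_mode
        except OSError:
            continue  # like N: a path that does not exist is dropped quietly
        if stat.S_ISREG(mode):  # like (.): plain files only
            kept.append(path)
    return kept


print(plain_files(["/usr", "/usr/bin", "/etc/hostname"], "*"))
```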
