How to select all files from one sample? - wildcard

I'm having trouble figuring out how to make the input directive select only the files belonging to a single {samples} value in the rule below.
rule MarkDup:
    input:
        expand("Outputs/MergeBamAlignment/{samples}_{lanes}_{flowcells}.merged.bam", zip,
               samples=samples['sample'],
               lanes=samples['lane'],
               flowcells=samples['flowcell']),
    output:
        bam = "Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics = "Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk --java-options -Djava.io.tempdir=`pwd`/tmp \
        MarkDuplicates \
        $(echo ' {input}' | sed 's/ / --INPUT /g') \
        -O {output.bam} \
        --VALIDATION_STRINGENCY LENIENT \
        --METRICS_FILE {output.metrics} \
        --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 200000 \
        --CREATE_INDEX true \
        --TMP_DIR Outputs/MarkDuplicates/tmp"
Currently it creates correctly named output files, but the input collects every file that matches the pattern across all wildcard values, not just the ones for that sample. So I'm perhaps halfway there. I tried changing {samples} to {{samples}} in the input directive, like so:
expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", zip,
lanes=samples['lane'],
flowcells=samples['flowcell']),`
but this broke the previous rule somehow. So the solution is something like

input:
    "{sample}_*.bam"

But clearly this doesn't work.
Is it possible to collect all files that match {sample}_*.bam with a function and use that as input? And if so, will the function still work with the $(echo ' {input}' ...) construct in the shell directive?

If you just want all the files in the directory, you can use a lambda function:
from glob import glob

rule MarkDup:
    input:
        lambda wcs: glob('Outputs/MergeBamAlignment/%s*.bam' % wcs.samples)
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
    shell:
        ...
Just be aware that this approach can't do any checking for missing files, since it will always report that the needed files are exactly the files that are present. If you do need confirmation that the upstream rule has been executed, you can have the previous rule touch a flag file, which you then require as input to this rule (though you don't actually use the file for anything other than enforcing execution order).
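A minimal sketch of that flag-file pattern (the upstream rule name and flag path here are hypothetical; Snakemake's touch() marks an output as an empty marker file):

rule upstream:                                   # hypothetical upstream rule
    output:
        touch("Outputs/flags/{samples}.done")    # creates an empty marker file on success
    ...

rule MarkDup:
    input:
        bams = lambda wcs: glob('Outputs/MergeBamAlignment/%s*.bam' % wcs.samples),
        # the flag is never used in the shell command; it only enforces execution order
        flag = "Outputs/flags/{samples}.done"
    ...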

If I understand correctly, zip needs to be applied only to {lanes} and {flowcells}, and not to {samples}. In that case, nesting two expand calls can achieve that: the inner expand zips lanes and flowcells while leaving the double-braced {{samples}} escaped, and the outer expand then fills in samples.
input:
    expand(expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam",
                  zip, lanes=samples['lane'], flowcells=samples['flowcell']),
           samples=samples['sample'])
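For concreteness, with hypothetical values samples['lane'] = ['L1', 'L2'] and samples['flowcell'] = ['F1', 'F2'], the inner expand leaves the double-braced wildcard escaped and produces

Outputs/MergeBamAlignment/{samples}_L1_F1.merged.bam
Outputs/MergeBamAlignment/{samples}_L2_F2.merged.bam

and the outer expand then substitutes each entry of samples['sample'] into {samples}.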
PS: output.tmp file uses {sample} instead of {samples}. Typo?

Related

Zsh completions with multiple repeated options

I am attempting to bend zsh, my shell of choice, to my will, and am completely at a loss on the syntax and operation of completions.
My use case is this: I wish to have completions for 'ansible-playbook' under the '-e' option support three variations:
Normal file completion: ansible-playbook -e vars/file_name.yml
Prepended file completion: ansible-playbook -e #vars/file_name.yml
Arbitrary strings: ansible-playbook -e key=value
I started out with https://github.com/zsh-users/zsh-completions/blob/master/src/_ansible-playbook which worked decently, but required modifications to support the prefixed file pathing. To achieve this I altered the following lines (the -e line):
...
"(-D --diff)"{-D,--diff}"[when changing (small files and templates, show the diff in those. Works great with --check)]"\
"(-e --extra-vars)"{-e,--extra-vars}"[EXTRA_VARS set additional variables as key=value or YAML/JSON]:extra vars:(EXTRA_VARS)"\
'--flush-cache[clear the fact cache]'\
to this:
...
"(-D --diff)"{-D,--diff}"[when changing (small files and templates, show the diff in those. Works great with --check)]"\
"(-e --extra-vars)"{-e,--extra-vars}"[EXTRA_VARS set additional variables as key=value or YAML/JSON]:extra vars:__at_files"\
'--flush-cache[clear the fact cache]'\
and added the '__at_files' function:
__at_files () {
    compset -P #; _files
}
This may be very noobish, but as someone who had never encountered this before, I was pleased that this solved my problem, or so I thought.
It fails if I have multiple '-e' parameters, which is a fully supported model (similar to how docker allows multiple -v or -p arguments). What this means is that the first '-e' parameter gets my prefixed completion, but any '-e' parameters after that become 'dumb' and only allow normal '_files' completion, from what I can tell. So the following will not complete properly:
ansible-playbook -e key=value -e #vars/file
but this would complete for the file itself:
ansible-playbook -e key=value -e vars/file
Did I mess up? I see the same type of behavior for this particular completion plugin's '-M' option (it also becomes 'dumb' and does basic file completion). I may simply not have searched for the correct terminology or combination of terms, or perhaps missed the relevant part of the rather complicated documentation, but with only a few days' experience digging into this, I'm lost.
If multiple -e options are valid, the _arguments specification should start with *, so instead of:
"(-e --extra-vars)"{-e,--extra-vars}"[EXTR ....
use:
"*"{-e,--extra-vars}"[EXTR ...
The (-e --extra-vars) part indicates a list of options that cannot follow the one being specified. That isn't needed anymore, because it is presumably valid to do, e.g.:
ansible-playbook -e key=value --extra-vars #vars/file
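Put together, the modified -e line from the question would then look something like this (a sketch; the rest of the _arguments call and the __at_files helper stay as above):

"*"{-e,--extra-vars}"[EXTRA_VARS set additional variables as key=value or YAML/JSON]:extra vars:__at_files"\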

How to use AWS CLI to only copy files in S3 bucket that match a given string pattern

I'm using the AWS CLI to copy files from an S3 bucket to my R machine using a command like below:
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '*trans*' --region us-east-1"
)
This works as expected, i.e. it copies all files in my_bucket_location that have "trans" in the filename at that location.
The problem that I am facing is that I have other files with similar naming conventions that I don't want to import in this step. As an example, in the list below I only want to copy the first two files, not the last two:
File list
trans_120215.csv
trans_130215.csv
sum_trans_120215.csv
sum_trans_130215.csv
If I were using regex I could make it more specific, like "^trans_\\d+", to bring in just the first two files, but this doesn't seem possible using the AWS CLI. So my question is: is there a way to get more complex pattern matching with the AWS CLI, like below?
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '^trans_\\d+' --region us-east-1"
)
Please note that I can only use information about the file I want, i.e. that it matches the pattern "^trans_\\d+". I can't rely on the unwanted files starting with sum_, because this is only an example; there could be other files with similar names, like "check_trans_120215.csv".
I have considered the alternatives below, but I'm hoping there is a way to adjust the copy command and avoid going down either of these routes:
Listing all items in the bucket > using regex in R to specify the files that I want > Only importing those files
Keeping the copy command as it is > delete unwanted files on the R machine after the copy
The alternatives that you have listed are the best options, because the S3 CLI doesn't support regex. From the documentation:
Use of Exclude and Include Filters:

Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.

*: Matches everything
?: Matches any single character
[sequence]: Matches any character in sequence
[!sequence]: Matches any character not in sequence
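Those pattern symbols do allow a rough approximation of the regex for this particular case. A sketch, assuming the date stamp is always exactly six digits and that the [sequence] form accepts ranges (as fnmatch-style patterns generally do):

aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive \
    --exclude '*' \
    --include 'trans_[0-9][0-9][0-9][0-9][0-9][0-9].csv' \
    --region us-east-1

Because the include pattern must match the whole key name relative to the source prefix, sum_trans_120215.csv and check_trans_120215.csv no longer match.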
Putting this here for others to find, since I just had to figure this out. Here's what I came up with:
s3cmd del $(s3cmd ls s3://[BUCKET]/ | grep '.*s3://[BUCKET]/[FILENAME]' | cut -c 41-)
You can put the regex in the grep search string. For instance, I was searching for specific files to delete (hence the s3cmd del). My regex looked like: '2016-11-04.*s3.*[DN][RS].*'. You may have to adjust the cut for your use. Should also work with s3cmd get.
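A variant that avoids the position-dependent cut: since s3cmd ls prints the object URL as the last whitespace-separated field of each line, awk can pull it out instead (a sketch under that assumption):

s3cmd del $(s3cmd ls s3://[BUCKET]/ | grep '2016-11-04.*[DN][RS]' | awk '{print $NF}')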

Trim a file name in Unix

I have a file with the name
ROCKET_25_08:00.csv
I want to trim the name of the file to
ROCKET_25_.csv
I tried mv, but mv is not what I need, because there may be more than one such file. I want to keep the name up to and including the second _. How can I do that in Unix? Please advise.
There are utilities that provide more flexible renaming, but one solution that uses nothing but standard UNIX tools (like sed) would be:
ls -d * | sed -re 's/^([^_]*_[^_]*_)(.*)(\....)$/mv -v \1\2\3 \1\3/' | bash
This will only work in one directory; it won't process subdirectories.
It's not at all clear what you are actually trying to do, but if you just want to remove the text between the last underscore and the period, you can do:
f=ROCKET_25_08:00.csv
echo ${f%_*}_.csv
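To apply that expansion to several files at once, a loop along these lines would work (a sketch, assuming every matching name has at least two underscores and ends in .csv):

for f in *_*_*.csv; do
    # ${f%_*} strips the shortest suffix matching '_*', i.e. everything
    # from the last underscore onward; then re-append '_' and the extension.
    # Note: names that collide after trimming will overwrite each other.
    mv -v "$f" "${f%_*}_.csv"
done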

How to edit path variable in ZSH

In my .bash_profile I have the following lines:
PATHDIRS="
/usr/local/mysql/bin
/usr/local/share/python
/opt/local/bin
/opt/local/sbin
$HOME/bin"
for dir in $PATHDIRS
do
    if [ -d $dir ]; then
        export PATH=$PATH:$dir
    fi
done
However, when I copied this to my .zshrc, $PATH was not being set.
First I put echo statements inside the "if directory exists" check, and I found that the if statement was evaluating to false even for directories that clearly existed.
Then I removed the directory-exists check, and $PATH was set incorrectly, like this:
/usr/bin:/bin:/usr/sbin:/sbin:
/usr/local/bin
/opt/local/bin
/opt/local/sbin
/Volumes/Xshare/kburke/bin
/usr/local/Cellar/ruby/1.9.2-p290/bin
/Users/kevin/.gem/ruby/1.8/bin
/Users/kevin/bin
None of the programs in the bottom directories were being found or executed.
What am I doing wrong?
Unlike other shells, zsh does not perform word splitting or globbing after variable substitution. Thus $PATHDIRS expands to a single string containing exactly the value of the variable, and not to a list of strings containing each separate whitespace-delimited piece of the value.
Using an array is the best way to express this (not only in zsh, but also in ksh and bash).
pathdirs=(
    /usr/local/mysql/bin
    …
    ~/bin
)
for dir in $pathdirs; do
    if [ -d $dir ]; then
        path+=$dir
    fi
done
Since you probably aren't going to refer to pathdirs later, you might as well write it inline:
for dir in \
    /usr/local/mysql/bin \
    … \
    ~/bin
; do
    if [[ -d $dir ]]; then path+=$dir; fi
done
There's even a shorter way to express this: add all the directories you like to the path array, then select the ones that exist.
path+=/usr/local/mysql/bin
…
path=($^path(N))
The N glob qualifier selects only the matches that exist. Add the -/ to the qualifier list (i.e. (-/N) or (N-/)) if you're worried that one of the elements may be something other than a directory or a symbolic link to one (e.g. a broken symlink). The ^ parameter expansion flag ensures that the glob qualifier applies to each array element separately.
You can also use the N qualifier to add an element only if it exists. Note that you need globbing to happen, so path+=/usr/local/mysql/bin(N) wouldn't work.
path+=(/usr/local/mysql/bin(N-/))
You can put
setopt shwordsplit
in your .zshrc. Then zsh will perform word splitting like all Bourne shells do. That the default appears to be noshwordsplit is a misfeature that causes many a head-scratching. I'd be surprised if it wasn't a FAQ. Let's see... yup:
http://zsh.sourceforge.net/FAQ/zshfaq03.html#l18
3.1: Why does $var where var="foo bar" not do what I expect?
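A minimal illustration of the difference, runnable in an interactive zsh:

words="a b c"
print -l $words      # zsh default: one line, "a b c" (no splitting)
setopt shwordsplit
print -l $words      # now three lines: a, b and c (Bourne-style splitting)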
Still not sure what the problem was (maybe the newlines in $PATHDIRS?), but changing to zsh array syntax fixed it:
PATHDIRS=(
    /usr/local/mysql/bin
    /usr/local/share/python
    /usr/local/scala/scala-2.8.0.final/bin
    /opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin
    /opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin
    /opt/local/etc
    /opt/local/bin
    /opt/local/sbin
    $HOME/.gem/ruby/1.8/bin
    $HOME/bin
)
and
path=($path $dir)

Why did my use of the read command not do what I expected?

I wreaked some havoc on my computer when I played with the commands suggested by vezult [1]. I expected the one-liner to ask for the file names to be removed. However, it immediately removed the files in a folder:
> find ./ -type f | while read x; do rm "$x"; done
I expected it to wait for me to type file names on stdin [2]. I cannot understand its action. How does the read command work, and where do you use it?
What happened there is that read reads from stdin. When you put it at the end of a pipe, it reads from that pipe.
So the output of your find becomes
file1
file2
and so on; read reads that and replaces x successively with file1 then file2, and so your loop becomes
rm "file1"
rm "file2"
and sure enough, that rm's every file starting at the current directory ".".
A couple hints.
You didn't need the "/".
It's better and safer to say
find . -type f
because should you happen to type ". /" (i.e., dot SPACE slash), find will start at the current directory and then go look starting at the root directory. That trick, given the right privileges, would delete every file on the computer. "." is already the name of a directory; you don't need to add the slash.
The find or rm commands will do this
It sounds like what you wanted to do was go through all the files in all the directories starting at the current directory ".", and have it ASK whether you want to delete each one. You could do that with
find . -type f -exec rm -i {} \;
or
find . -type f -ok rm {} \;
and not need a loop at all. You can also do
rm -r -i *
and get nearly the same effect, except that it will try to delete directories too. If the directory is empty, that'll even work.
Another thought
Come to think of it, unless you have a LOT of files, you could also do
rm -i `find . -type f`
Now the find in backquotes will become a bunch of file names on the command line, and the '-i' interactive flag on rm will ask the yes or no question.
Charlie Martin gives you a good dissection and explanation of what went wrong with your specific example, but doesn't address the general question of:
When should you use the read command?
The answer to that is - when you want to read successive lines from some file (quite possibly the standard output of some previous sequence of commands in a pipeline), possibly splitting the lines into several separate variables. The splitting is done using the current value of '$IFS', which normally means on blanks and tabs (newlines don't count in this context; they separate lines). If there are multiple variables in the read command, then the first word goes into the first variable, the second into the second, ..., and the residue of the line into the last variable. If there's only one variable, the whole line goes into that variable.
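A quick illustration of that splitting behavior (hypothetical input):

printf 'alpha beta gamma delta\n' |
while read first second rest; do
    # first word -> $first, second word -> $second, residue of line -> $rest
    echo "first=$first second=$second rest=$rest"
done
# prints: first=alpha second=beta rest=gamma delta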
There are many uses. This is one of the simpler scripts I have that uses the split option:
#!/bin/ksh
#
# @(#)$Id: mkdbs.sh,v 1.4 2008/10/12 02:41:42 jleffler Exp $
#
# Create basic set of databases
MKDUAL=$HOME/bin/mkdual.sql
ELEMENTS=$HOME/src/sqltools/SQL/elements.sql
cat <<! |
mode_ansi with log mode ansi
logged with buffered log
unlogged
stores with buffered log
!
while read dbs logging
do
    if [ "$dbs" = "unlogged" ]
    then bw=""; cw=""
    else bw="-ebegin"; cw="-ecommit"
    fi
    sqlcmd -xe "create database $dbs $logging" \
        $bw -e "grant resource to public" -f $MKDUAL -f $ELEMENTS $cw
done
The cat command with a here-document has its output sent to a pipe, so the output goes into the while read dbs logging loop. The first word goes into $dbs and is the name of the (Informix) database I want to create. The remainder of the line is placed into $logging. The body of the loop deals with unlogged databases (where begin and commit do not work), then runs a program sqlcmd (completely separate from the Microsoft newcomer of the same name; it's been around since about 1990) to create a database and populate it with some standard tables and data: a simulation of the Oracle 'dual' table, and a set of tables related to the 'table of elements'.
Other scripts that use the read command are bigger (by far), but generally read lines containing one or more file names and some other attributes of relevance, and then apply an appropriate transform to the files using the attributes.
Osiris JL: file * | grep 'sh.*script' | sed 's/:.*//' | xargs wgrep read
esqlcver:read version letter
jlss: while read directory
jlss: read x || exit
jlss: read x || exit
jlss: while read file type link owner group perms
jlss: read x || exit
jlss: while read file type link owner group perms
kb: while read size name
mkbod: while read directory
mkbod:while read dist comp
mkdbs:while read dbs logging
mkmsd:while read msdfile master
mknmd:while read gfile sfile version notes
publictimestamp:while read name type title
publictimestamp:while read name type title
Osiris JL:
'Osiris JL: ' is my command line prompt; I ran this in my 'bin' directory. 'wgrep' is a variant of grep that only matches entire words (to avoid words like 'already'). This gives some indication of how I've used it.
The 'read x || exit' lines are for an interactive script that reads a response from standard input, but exits if the command gets EOF (for example, if standard input comes from /dev/null).
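For completeness, a minimal sketch of that 'read x || exit' idiom:

printf 'Press return to continue (EOF aborts): '
read x || exit   # read fails at EOF (e.g. stdin redirected from /dev/null), so the script exits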
