Snakemake: samples with sub-items, how to access them? - wildcard

Here is a short example of a config.yaml:
samples:
  sample1:
    stranded: True
  sample2:
    stranded: False
As you can see, each sample has a sub-item (several, in fact), but I don't know how to access them.
My Snakefile:
configfile: "config.yaml"
rule all:
    input:
        expand("output/{sample}.bam", sample=config['samples']),
rule one:
    input:
        "input/{sample}.bam",
    output:
        "output/{sample}.bam",
    run:
        if config['samples']["{sample}"]['stranded']:  # How do I get the stranded value here?
            option = "--stranded",
        shell(
            'some_command '
            ' {option}'
            ' {input} > {output}'
        )
Thanks in advance for your help.
Hetica

Finally, I found a solution: use a lambda function in the params directive, and a condition in run:
configfile: "config.yaml"
rule all:
    input:
        expand("output/{sample}.bam", sample=config['samples']),
rule one:
    input:
        "input/{sample}.bam",
    output:
        "output/{sample}.bam",
    params:
        stranded = lambda wildcards: config['samples'][wildcards.sample]['stranded'],
    run:
        stranded = "--stranded" if params.stranded else ''
        shell(
            'echo '
            + stranded +
            ' {input} > {output};\n'
            'touch {output}'
        )
I hope this helps someone.

Related

How to introduce a new wildcard in a snakemake pipeline with several rules

I have run into this problem several times and would finally like to understand it: is it possible to introduce a new wildcard in a rule in a snakemake pipeline?
workdir: "/path/to/"
(SAMPLES,) = glob_wildcards('/path/to/trimmed/{sample}.trimmed.fastq.gz')
rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        expand("merged/{sample}.merged.bam", sample=SAMPLES)
rule bwa_mem:
    input:
        bwa_index_done = "ref",
        fastq="path/to/trimmed/{sample}.trimmed.fastq.gz"
    output:
        bam = "{sample}.bam"
    threads: 10
    shell:
        """/Tools/bwa-0.7.12/bwa mem -t {threads} ref {input.fastq} | /Tools/samtools-1.10/samtools sort -o {output.bam}"""
rule samtools_merge:
    input:
        lane1="{sample}_L1.bam",
        lane2="{sample}_L2.bam",
        lane3="{sample}_L3.bam",
        lane4="{sample}_L4.bam"
    output:
        outf = "merged/{sample}.merged.bam"
    threads: 4
    shell:
        """Tools/samtools-1.10/samtools merge -@ {threads} {output.outf} {input.lane1} {input.lane2} {input.lane3} {input.lane4}"""
My input files:
RD1_1_L1.fastq.gz - RD1_100_L1.fastq.gz
RD1_1_L2.fastq.gz - RD1_100_L2.fastq.gz
RD1_1_L3.fastq.gz - RD1_100_L3.fastq.gz
RD1_1_L4.fastq.gz - RD1_100_L4.fastq.gz
RD2_100_L1.fastq.gz - RD2_200_L1.fastq.gz
RD2_100_L2.fastq.gz - RD2_200_L2.fastq.gz
RD2_100_L3.fastq.gz - RD2_200_L3.fastq.gz
RD2_100_L4.fastq.gz - RD2_200_L4.fastq.gz
For trimming it is fine to treat each file as one single sample, but for merging I need to specify L1, L2, L3 and L4. So is it possible to introduce a new wildcard that is specific to one rule?
"Is it possible to introduce a new wildcard somehow specific for a rule?"
I'm not 100% sure what you mean by that, but I think the answer is yes.
Looking at your example, maybe this is what you are trying to do:
SAMPLES = ['RD1_1', 'RD2_100', 'RD1_100', 'RD2_200']
LANE = ['L1', 'L2', 'L3', 'L4']
rule all:
    input:
        expand("merged/{sample}.merged.bam", sample= SAMPLES)
rule trim:
    input:
        fastq= "{sample}_{L}.fastq.gz"
    output:
        fastq="trimmed/{sample}_{L}.trimmed.fastq.gz"
    shell:
        r"""
        trim {input} {output}
        """
rule bwa_mem:
    input:
        fastq="trimmed/{sample}_{L}.trimmed.fastq.gz"
    output:
        bam= "{sample}_{L}.bam"
    shell:
        r"""
        bwa mem {input} {output}
        """
rule samtools_merge:
    input:
        expand('{{sample}}_{L}.bam', L= LANE),
    output:
        outf= "merged/{sample}.merged.bam",
    shell:
        r"""
        samtools merge {output} {input}
        """
It assumes that all samples have lanes 1 to 4, which is not great, but hopefully you get the idea.

awk to parse text file issue

I have just started learning how to use awk to parse and print a text file.
I started with the code below; can anyone help me?
NB: the quotes are mandatory in the output file (see the desired output).
awk '/^IPDATA=/ && /A|B|C| '{print "ADD IP ="$0"\n(\n \Ref "$1",Type vlan="$2"\"\n)\n"}' file > file1
NB: Ref is the number of IPREF lines in each block; in this example there are three blocks, with [2], [2] and [1].
The real text file is huge, but I have summarized it:
IPDATA=A IPID A
IPDATA=A IPREF [2] =
--- IPREF = VRID=A_1
--- IPREF = VRID=A_2
IPDATA=B IPID B
IPDATA=B IPREF [2] =
--- IPREF = VRID=B_1
--- IPREF = VRID=B_2
IPDATA=C IPID C
IPDATA=C IPREF [1] =
--- IPREF = VRID=C_1
I want to get the result below:
"ADD IP=A "
( Ref 2
"Type vlan=VRID=A_1"
"Type vlan=VRID=A_2"
)
"ADD IP=B "
( Ref 2
"Type vlan=VRID=B_1"
"Type vlan=VRID=B_2"
)
"ADD IP=C "
( Ref 1
"Type vlan=VRID=C_1"
)
thanks
Could you please try the following? Written and tested with the shown samples only, in GNU awk.
awk -v s1="\"" '
/^IPDATA/ && /IPID .*/{
  if(FNR>1){ print ")" }
  print s1 "ADD IP" s1 "="s1 $NF OFS s1
  next
}
/^IPDATA.*IPREF.*\[[0-9]+\]/{
  match($0,/\[[^]]*/)
  print "( Ref sum of IPREF " substr($0,RSTART+1,RLENGTH-1)
  next
}
/^--- IPREF/{
  print s1 "Type vlan="$NF s1
}
END{
  print ")"
}
' Input_file
Explanation: a detailed, line-by-line explanation of the above.
awk -v s1="\"" '                      ##Start the awk program; s1 holds a double-quote character.
/^IPDATA/ && /IPID .*/{               ##If the line starts with IPDATA and contains IPID, do the following.
  if(FNR>1){ print ")" }              ##If this is not the first line, print ) to close the previous block.
  print s1 "ADD IP" s1 "="s1 $NF OFS s1   ##Print s1, ADD IP, s1, =, s1, the last field, OFS and s1.
  next                                ##next skips all further statements for this line.
}
/^IPDATA.*IPREF.*\[[0-9]+\]/{         ##If the line starts with IPDATA and contains IPREF followed by [digits].
  match($0,/\[[^]]*/)                 ##Use match to find the text from [ up to (not including) ].
  print "( Ref sum of IPREF " substr($0,RSTART+1,RLENGTH-1)   ##Print the label and the sub-string of length RLENGTH-1 starting at RSTART+1, i.e. the digits.
  next                                ##next skips all further statements for this line.
}
/^--- IPREF/{                         ##If the line starts with --- IPREF, do the following.
  print s1 "Type vlan="$NF s1         ##Print s1, the string Type vlan=, the last field and s1.
}
END{                                  ##Start the END block of this program.
  print ")"                           ##Print ) to close the last block.
}
' Input_file                          ##Mention the Input_file name here.
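The answer above prints `"ADD IP"="A "` and `( Ref sum of IPREF 2`, which differs slightly from the desired output shown in the question. The following is a minimal sketch of a variant that aims to reproduce the desired output exactly, assuming the input always has the three line shapes shown in the sample. The function name format_ipdata and the variable names q, n and started are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads the IPDATA text on stdin, writes the ADD IP blocks.
format_ipdata() {
    awk -v q='"' '
        # "IPDATA=X IPID X": close the previous block (if any), then open a new one.
        /^IPDATA=/ && $2 == "IPID" {
            if (started) print ")"
            print q "ADD IP=" $NF " " q
            started = 1
            next
        }
        # "IPDATA=X IPREF [n] =": keep only the digits of the third field.
        /^IPDATA=/ && $2 == "IPREF" {
            n = $3
            gsub(/[^0-9]/, "", n)
            print "( Ref " n
            next
        }
        # "--- IPREF = VRID=X_n": one quoted line per reference.
        /^--- IPREF/ {
            print q "Type vlan=" $NF q
        }
        # Close the last block, but only if at least one was opened.
        END { if (started) print ")" }
    '
}

# Sample input copied from the question.
format_ipdata <<'EOF'
IPDATA=A IPID A
IPDATA=A IPREF [2] =
--- IPREF = VRID=A_1
--- IPREF = VRID=A_2
IPDATA=B IPID B
IPDATA=B IPREF [2] =
--- IPREF = VRID=B_1
--- IPREF = VRID=B_2
IPDATA=C IPID C
IPDATA=C IPREF [1] =
--- IPREF = VRID=C_1
EOF
```

Passing the double quote in with -v keeps the awk program free of escaped quotes, which is the same idea as s1 in the answer.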

awk pattern matching with if

I'm trying to multiply field $2 either by .75 or .1
I have this data:
Disputed,279
Processed,12112
Uncollected NSF,4732
Declined,14
Invalid / Closed Account,3022
Awk statement:
#!/usr/local/bin/gawk -f
BEGIN { FPAT="([^,]*)|(\"[^\"]+\")"; FS=OFS=","; OFMT="%.2f"; }
{
    if ($1 "/Disputed|Uncollected|Invalid/")
        $3 = $2 * .75
    else
        if ($1 ~ "/Processed|Declined/")
            $3 = $2 * 0.10
    print
}
Expected output:
Disputed,279,209.25
Processed,12112,1211.2
Uncollected NSF,4732,3549
Declined,14,1.4
Invalid / Closed Account,3022,2266.5
Current results:
Disputed,279,209.25
Processed,12112,9084
Uncollected NSF,4732,3549
Declined,14,10.5
Invalid / Closed Account,3022,2266.5
These are multiplied by .75: Disputed, Uncollected NSF and Invalid / Closed Account
These are multiplied by .1: Processed and Declined
What's causing all records to be multiplied by .75?
Edit: this is my working solution:
#!/usr/local/bin/gawk -f
BEGIN {
    FPAT="([^,]*)|(\"[^\"]+\")"
    FS=OFS=","
    OFMT="%.2f"
    print "status","acct type","count","amount"
}
NF>1 {
    $4=$3 * ($1 ~ /Processed|Declined/ ? 0.10 : 0.75 )
    print
    trans+=$3
    fee+=$4
}
END {
    printf "------------\n"
    print "# of transactions: " trans
    print "processing fee: " fee
}
Yes, there are four fields here; $2 is a hidden special field!
status,acct type,count,amount
Processed,Savings,502,50.2
Uncollected NSF,Checking,4299,3224.25
Disputed,Checking,263,197.25
Processed,Checking,11610,1161
Uncollected NSF,Savings,433,324.75
Declined,Checking,14,1.4
Invalid / Closed Account,Checking,2868,2151
Disputed,Savings,16,12
Invalid / Closed Account,Savings,154,115.5
------------
# of transactions: 20159
processing fee: 7237.35
The way to write your code in awk would be with a ternary expression, e.g.:
$ awk 'BEGIN{FS=OFS=","} {print $0, $2 * ($1 ~ /Processed|Declined/ ? 0.10 : 0.75)}' file
Disputed,279,209.25
Processed,12112,1211.2
Uncollected NSF,4732,3549
Declined,14,1.4
Invalid / Closed Account,3022,2266.5
Note that regexp constants are delimited by / (see http://www.gnu.org/software/gawk/manual/gawk.html#Regexp) but awk can construct dynamic regexps from variables and/or string constants (see http://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps) so when you wrote:
"/Processed|Declined/"
in a context appropriate for a dynamic regexp ($1 ~ <regexp>), awk constructed a regexp from it as:
`/Processed` OR `Declined/`
(note the literal / chars as part of the regexp terms) instead of what you wanted:
`Processed` OR `Declined`
You can see that effect here:
$ echo 'abc' | awk '$0 ~ /b|x/'
abc
$ echo 'abc' | awk '$0 ~ "/b|x/"'
$ echo 'a/bc' | awk '$0 ~ "/b|x/"'
a/bc
Now, see if you can figure this out:
$ echo 'abc' | awk '$0 ~ "/b|x/"'
$ echo 'abc' | awk '"/b|x/"'
abc
i.e. why the first one prints nothing but the second one prints the input.
As the other poster said, you left out the ~ operator before the first regular expression.
Also, don't include slashes at the start and end of your regular expressions. Either enclose your regular expressions in slashes (as in Perl/Ruby/JavaScript) or in quotes - not both.
if ($1 ~ "Disputed|Uncollected|Invalid")
    $3 = $2 * .75
else
    if ($1 ~ "Processed|Declined")
        $3 = $2 * 0.10
print
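To make the "slashes or quotes, not both" point concrete, here is a small sketch that runs the three spellings side by side on one input line. The function name classify and the variable names slash, quoted and both are made up for this illustration; it uses only plain awk features.

```shell
#!/bin/sh
# Hypothetical helper: prints 1 or 0 for each of the three ways of writing the regexp.
classify() {
    printf '%s\n' "$1" | awk '{
        slash  = ($0 ~ /Processed|Declined/)     # regexp constant
        quoted = ($0 ~ "Processed|Declined")     # dynamic regexp built from a string
        both   = ($0 ~ "/Processed|Declined/")   # the slashes are now literal characters
        print slash, quoted, both
    }'
}

classify 'Processed'   # prints: 1 1 0
classify 'Declined/'   # prints: 1 1 1
classify 'Disputed'    # prints: 0 0 0
```

The third column is 1 only for the input that really contains a slash after Declined, which matches the explanation given earlier in this thread.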
Issue
You are missing a matching operator ~. This statement:
if ($1 "/Disputed|Uncollected|Invalid/")
always evaluates to true because it checks whether the concatenation of $1 with "/Disputed|Uncollected|Invalid/" is not empty — and it isn't.
Try instead:
if ($1 ~ /Disputed|Uncollected|Invalid/)
Examples
You can see this behavior using the following awk one-liners:
$ awk 'BEGIN { if ("" "a") { print "true" } else { print "false" }}'
true
$ awk 'BEGIN { if ("" "") { print "true" } else { print "false" }}'
false
$ awk 'BEGIN { if ("") { print "true" } else { print "false" }}'
false
$ awk 'BEGIN { if (RS FS "a") { print "true" } else { print "false" }}'
true
$ awk 'BEGIN { if (variable) { print "true" } else { print "false" }}'
false
$ awk 'BEGIN { var="0"; if (var) { print "true" } else { print "false" }}'
true

how to solve nested if issue

I am new to UNIX. I am trying to write a bash script that takes two integers from the user and prints the even numbers between them using an if condition. I am stuck on the nested if: the error message "an unexpected token near else" appears, and I do not know what it is about. Any help?
This is what I have done so far:
echo plz enter first number
read n1
echo plz enter second number
read n2
start=$n1
end=$n2
if [ start < end ] then
    for (c=start;c<=end;c++)
    do
        if [ $((c % 2 )) -eq 0 ]; then
            echo $c
        fi
    done
else
    echo "not bigger"
fi
I think I would recommend a different approach:
((start % 2)) && ((start = 1 + start))
while ((start < end))
do
    echo ${start}
    (( start += 2))
done
I have tried it like this:
echo "Enter first number"
read first
echo "Enter second number"
read second
start=$first
endLine=$second
while [ $start -le $endLine ]
do
    if [ $((start % 2 )) -eq 0 ]
    then
        echo $start "is an even number"
    #else
    #    echo $start "is an odd number"
    fi
    start=`expr $start + 1`
done
You need to insert either a semicolon or a newline before the first "then":
if [ start < end ] ; then
                   ^
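The missing semicolon is what triggers the reported error, but the script in the question has two further problems: inside [ ] a bare < is a redirection and the variables need a $, and the C-style for loop needs double parentheses in bash. Below is a minimal sketch of one way to write the whole thing in portable sh, using -lt for the comparison and a while loop in place of the for loop. The function name print_evens is made up for this illustration, and it takes the two numbers as arguments instead of reading them interactively.

```shell
#!/bin/sh
# Hypothetical helper: prints the even numbers from $1 to $2 (inclusive).
print_evens() {
    start=$1
    end=$2
    # -lt compares integers; a bare < inside [ ] would be treated as a redirection.
    if [ "$start" -lt "$end" ]; then
        c=$start
        while [ "$c" -le "$end" ]; do
            if [ $((c % 2)) -eq 0 ]; then
                echo "$c"
            fi
            c=$((c + 1))
        done
    else
        echo "not bigger"
    fi
}

print_evens 3 10   # prints 4, 6, 8 and 10, one per line
print_evens 5 2    # prints: not bigger
```

In bash, the loop could also be written as for ((c=start; c<=end; c++)), which is closest to what the question attempted.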

How to check if the variable value in AWK script is null or empty?

I am using an AWK script to process some logs.
At one point I need to check whether a variable's value is null or empty in order to make a decision.
Any idea how to do this?
awk '
{
    {
        split($i, keyVal, "#")
        key=keyVal[1];
        val=keyVal[2];
        if(val ~ /^ *$/)
            val="Y";
    }
}
' File
I have tried:
1) if(val == "")
2) if(val ~ /^ *$/)
Neither of them works.
The comparison with "" should have worked, so that's a bit odd.
As one more alternative, you could use the length() function: if it returns zero, your variable is null/empty. E.g.,
if (length(val) == 0)
Also, perhaps the built-in variable NF (number of fields) could come in handy? Since we don't have access to your input data it's hard to say, but it is another possibility.
You can use the variable directly, without a comparison: an empty/null/zero value is considered false, everything else is true.
For example:
# setting default tag if not provided
if (! tag) {
    tag="default-tag"
}
So in this script the variable tag gets the value default-tag unless the user calls it like this:
$ awk -v tag=custom-tag -f script.awk targetFile
This holds as of:
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
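One caveat with the truthiness test: since zero also counts as false, a value such as 0 passed with -v is treated like a missing value. The sketch below compares the three checks mentioned in this thread for a few inputs. The function name check_tag and the output labels not, eq and len are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: shows the result of (!tag), (tag == "") and (length(tag) == 0).
check_tag() {
    # A numeric-looking value passed with -v is treated as a number in boolean context.
    awk -v tag="$1" 'BEGIN {
        printf "not=%d eq=%d len=%d\n", (!tag), (tag == ""), (length(tag) == 0)
    }'
}

check_tag ""         # prints: not=1 eq=1 len=1
check_tag "0"        # prints: not=1 eq=0 len=0
check_tag "custom"   # prints: not=0 eq=0 len=0
```

If 0 is a legitimate value for the variable, the comparison with "" or the length() check is the safer choice.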
It works just fine for me:
$ awk 'BEGIN{if(val==""){print "null or empty"}}'
null or empty
You can't differentiate between a variable being empty and being null: when you access an "unset" variable, awk just initializes it with a default value (here "", the empty string). You can use some sort of workaround, for example keeping a val_accessed variable that starts at 0 and is set to 1 when you access val. Or, as a simpler (somewhat "hackish") approach, set val to "uninitialized" (or to some other value which can't appear when running your program).
PS: your script looks strange to me; what are the nested braces for?
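A minimal sketch of the default-value behaviour described above: a variable that was never assigned acts as the empty string in string context and as 0 in numeric context. The function name probe_unset and the variable names x and y are made up for this illustration. Note that in the awk implementations I am aware of, the comparison with the number 0 comes out differently for a never-assigned variable and for one explicitly set to "", which may be usable as an alternative to a sentinel value.

```shell
#!/bin/sh
# Hypothetical helper: compares a never-assigned variable with an explicitly empty one.
probe_unset() {
    awk 'BEGIN {
        # x was never assigned: it compares equal to both "" and 0.
        print (x == ""), (x == 0)
        y = ""
        # y holds an empty string: the comparison with 0 is done on strings and fails.
        print (y == ""), (y == 0)
    }'
}

probe_unset
# prints:
# 1 1
# 1 0
```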
I accidentally discovered a lesser-known gawk-specific function, typeof(), that can help tell these cases apart:
****** gawk-only ******
BEGIN {
    $0 = "abc"
    print NF, $0
    test_function()
    test_function($(NF + 1))
    test_function("")
    test_function($0)
}
function test_function(_) { print typeof(_) }
1 abc
untyped
unassigned
string
string
So it seems, for non-numeric-like data :
absolutely no input to function at all : untyped
non-existent or empty field, including $0 : unassigned
any non-numeric-appearing string, including "" : string
Here's the chaotic part - numeric data :
strangely enough, for absolutely identical input, only differing between using $0 vs. $1 in function call, you frequently get a different value for typeof()
even a combination of both leading and trailing spaces doesn't prevent gawk from identifying it as strnum
[123]:NF:1
$0 = number:123 $1 = strnum:123 +$1 = number:123
[ 456.33]:NF:1
$0 = string: 456.33 $1 = strnum:456.33 +$1 = number:456.33000
[ 19683 ]:NF:1
$0 = string: 19683 $1 = strnum:19683 +$1 = number:19683
[-20.08554]:NF:1
$0 = number:-20.08554 $1 = strnum:-20.08554 +$1 = number:-20.08554
+/- inf/nan (same for all 4):
[-nan]:NF:1
$0 = string:-nan $1 = strnum:-nan +$1 = number:-nan
this one is a string because it was made from sprintf() :
[0x10FFFF]:NF:1
$0 = string:0x10FFFF $1 = string:0x10FFFF +$1 = number:0
using -n / --non-decimal-data flag, all stays same except
[0x10FFFF]:NF:1
$0 = string:0x10FFFF $1 = strnum:0x10FFFF +$1 = number:1114111
Long story short, if you want your gawk function to be able to differentiate between
empty-string input (""), versus
actually no input at all
e.g. when original intention is to directly apply changes to $0
then typeof(x) == "untyped" seems to be the most reliable indicator.
It gets worse when comparing null-string padding with a non-empty string of all zeros:
function __(_) { return (!_) ":" (!+_) }
function ___(_) { return (_ == "") }
function ____(_) { return (!_) ":" (!""_) }
$0--->[ "000" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1000 ]
$0--->[ "" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 1:1 ]
___($0)-->{ $0=="" }-->[ 1 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1 ]
$0--->[ " -0.0 -0" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 -0.0 -0 ]
$0--->[ " 0x5" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 0x5 ]

Resources