I don't understand how to redefine my snakemake rule to fix the Wildcards issue below.
Ignore the logic of batches; it makes sense internally in the Python script. In theory, I want the rule to be run for each batch 1-20. I use the BATCHES list for {batch} in output, and in the shell command, I use {wildcards.batch}:
OUTDIR="my_dir/"
nBATCHES = 20
BATCHES = list(range(1,21)) # [1,2,3 ..20] list
[...]
rule step5:
    input:
        ids = expand('{IDLIST}', IDLIST=IDLIST)
    output:
        type1 = expand('{OUTDIR}/resources/{batch}_output_type1.csv.gz', OUTDIR=OUTDIR, batch=BATCHES),
        type2 = expand('{OUTDIR}/resources/{batch}_output_type2.csv.gz', OUTDIR=OUTDIR, batch=BATCHES),
        type3 = expand('{OUTDIR}/resources/{batch}_output_type3.csv.gz', OUTDIR=OUTDIR, batch=BATCHES)
    shell:
        "./some_script.py --outdir {OUTDIR} --idlist {input.ids} --total_batches {nBATCHES} --current_batch {wildcards.batch}"
Error:
RuleException in rule step5 in line 241 of Snakefile:
AttributeError: 'Wildcards' object has no attribute 'batch', when formatting the following:
./somescript.py --outdir {OUTDIR} --idlist {input.idlist} --totalbatches {nBATCHES} --current_batch {wildcards.batch}
Executing the script manually for a single batch looks like this (and works; total_batches is a constant, current_batch is supposed to iterate):
./somescript.py --outdir my_dir/ --idlist ids.csv --total_batches 20 --current_batch 1
You seem to want to run the rule step5 once for each batch in BATCHES. So you need to structure your Snakefile to do exactly that.
In the following Snakefile running the rule all runs your rule step5 for all combinations of OUTDIR and BATCHES:
OUTDIR = "my_dir"
nBATCHES = 20
BATCHES = list(range(1, 21)) # [1,2,3 ..20] list
IDLIST = ["a", "b"] # dummy data, I don't have the original
rule all:
input:
type1=expand(
"{OUTDIR}/resources/{batch}_output_type1.csv.gz",
OUTDIR=OUTDIR,
batch=BATCHES,
),
rule step5:
input:
ids=expand("{IDLIST}", IDLIST=IDLIST),
output:
type1="{OUTDIR}/resources/{batch}_output_type1.csv.gz",
type2="{OUTDIR}/resources/{batch}_output_type2.csv.gz",
type3="{OUTDIR}/resources/{batch}_output_type3.csv.gz",
shell:
"./some_script.py --outdir {OUTDIR} --idlist {input.ids} --total_batches {nBATCHES} --current_batch {wildcards.batch}"
In your earlier version, {batch} was just an expand placeholder, not a wildcard, so the rule was only run once.
Instead of the rule all, this could be a subsequent rule that uses one or more of the outputs generated by step5; see the sketch below.
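For example, here is a minimal sketch of such a downstream rule, assuming the per-batch type1 files should simply be concatenated (the rule name merge_type1 and the cat command are placeholders, not from the original question; concatenated gzip members form a valid gzip stream):

rule merge_type1:
    input:
        expand(
            "{OUTDIR}/resources/{batch}_output_type1.csv.gz",
            OUTDIR=OUTDIR,
            batch=BATCHES,
        ),
    output:
        f"{OUTDIR}/resources/all_batches_type1.csv.gz",
    shell:
        "cat {input} > {output}"

Because this rule requests the concrete per-batch filenames, Snakemake infers the batch wildcard for step5 and runs it once per batch, just as rule all does.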
I have a task whose output is a dictionary with a list as the value of each key:
@task(task_id="gen_dict")
def generate_dict():
    ...
    return output_dict  # output looks like this: {"A": ["aa", "bb", "cc"], "B": ["dd", "ee", "ff"]}
# my DAG (omitting the part that creates the DAG and its properties)
start = DummyOperator(task_id="st")
end = DummyOperator(task_id="ed")
output = generate_dict()
for key, values in output.items():
    for v in values:
        dm = DummyOperator(task_id=f"dm_{key}_{v}")
        dm >> end
start >> output
For the sample output above, it should create 6 dummy tasks: dm_A_aa, dm_A_bb, dm_A_cc, dm_B_dd, dm_B_ee, dm_B_ff.
But right now I'm facing this error:
AttributeError: 'XComArg' object has no attribute 'items'
Is it possible to do what I aim to do? If not, is it possible to do it using a list like ["aa", "bb", "cc", "dd", "ee", "ff"] instead?
The code in the question won't work as-is, because the loop shown runs when the DAG is parsed (which happens when the scheduler starts up, and periodically thereafter), but the data it would loop over isn't known until the task that generates it actually runs.
There are ways to do something similar though.
AIP-42 added the ability to map list data into task kwargs in Airflow 2.3:
@task
def generate_lists():
    # presumably the data below would come from a query executed at runtime
    return [["aa", "bb", "cc"], ["dd", "ee", "ff"]]

@task
def use_list(the_list):
    for item in the_list:
        print(item)

with DAG(...) as dag:
    use_list.expand(the_list=generate_lists())
The code above will create two tasks with output:
aa
bb
cc
dd
ee
ff
In 2.4 the expand_kwargs function was added. It's an alternative to expand (shown above) which operates on dicts instead.
It takes an XComArg referencing a list of dicts whose keys are the names of the arguments that you're mapping the data into. So the following code...
@task
def generate_dicts():
    # presumably the data below would come from a query made at runtime
    return [{"foo": 6, "bar": 7}, {"foo": 8, "bar": 9}]

@task
def two_things(foo, bar):
    print(foo, bar)

with DAG(...) as dag:
    two_things.expand_kwargs(generate_dicts())
... gives two tasks with output:
6 7
...and...
8 9
expand only lets you create tasks from the Cartesian product of the input lists; expand_kwargs lets you associate data with kwargs at runtime. A sketch of the flattened-list approach from the question follows.
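To get one mapped task per (key, value) pair, as the question asks, here is a hedged sketch for Airflow 2.3+ (flatten and process are made-up names): flatten the dict inside a task, then map over the result with expand. Note that dynamic mapping produces indexed instances of a single task, not six separately named task_ids like dm_A_aa.

@task
def generate_dict():
    return {"A": ["aa", "bb", "cc"], "B": ["dd", "ee", "ff"]}

@task
def flatten(d):
    # one entry per (key, value) pair: ["A_aa", "A_bb", ..., "B_ff"]
    return [f"{k}_{v}" for k, values in d.items() for v in values]

@task
def process(pair):
    print(pair)

with DAG(...) as dag:
    process.expand(pair=flatten(generate_dict()))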
I have run into this problem several times and would finally like to understand: is it possible to introduce a new wildcard in a rule in a Snakemake pipeline?
workdir: "/path/to/"

(SAMPLES,) = glob_wildcards('/path/to/trimmed/{sample}.trimmed.fastq.gz')

rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        expand("merged/{sample}.merged.bam", sample=SAMPLES)

rule bwa_mem:
    input:
        bwa_index_done = "ref",
        fastq = "/path/to/trimmed/{sample}.trimmed.fastq.gz"
    output:
        bam = "{sample}.bam"
    threads: 10
    shell:
        """/Tools/bwa-0.7.12/bwa mem -t {threads} ref {input.fastq} | /Tools/samtools-1.10/samtools sort -o {output.bam}"""

rule samtools_merge:
    input:
        lane1 = "{sample}_L1.bam",
        lane2 = "{sample}_L2.bam",
        lane3 = "{sample}_L3.bam",
        lane4 = "{sample}_L4.bam"
    output:
        outf = "merged/{sample}.merged.bam"
    threads: 4
    shell:
        """/Tools/samtools-1.10/samtools merge -@ {threads} {output.outf} {input.lane1} {input.lane2} {input.lane3} {input.lane4}"""
My input files:
RD1_1_L1.fastq.gz - RD1_100_L1.fastq.gz
RD1_1_L2.fastq.gz - RD1_100_L2.fastq.gz
RD1_1_L3.fastq.gz - RD1_100_L3.fastq.gz
RD1_1_L4.fastq.gz - RD1_100_L4.fastq.gz
RD2_100_L1.fastq.gz - RD2_200_L1.fastq.gz
RD2_100_L2.fastq.gz - RD2_200_L2.fastq.gz
RD2_100_L3.fastq.gz - RD2_200_L3.fastq.gz
RD2_100_L4.fastq.gz - RD2_200_L4.fastq.gz
While trimming, it is OK to treat each file as one single sample, but when merging I need to specify L1, L2, L3 and L4. So is it possible to introduce a new wildcard somehow specific to a rule?
is it possible to introduce a new wildcard somehow specific to a rule?
I'm not 100% sure what you mean by that but I think the answer is yes.
Looking at your example, maybe this is what you are trying to do:
SAMPLES = ['RD1_1', 'RD2_100', 'RD1_100', 'RD2_200']
LANE = ['L1', 'L2', 'L3', 'L4']

rule all:
    input:
        expand("merged/{sample}.merged.bam", sample=SAMPLES)

rule trim:
    input:
        fastq = "{sample}_{L}.fastq.gz"
    output:
        fastq = "trimmed/{sample}_{L}.trimmed.fastq.gz"
    shell:
        r"""
        trim {input} {output}
        """

rule bwa_mem:
    input:
        fastq = "trimmed/{sample}_{L}.trimmed.fastq.gz"
    output:
        bam = "{sample}_{L}.bam"
    shell:
        r"""
        bwa mem {input} {output}
        """

rule samtools_merge:
    input:
        expand('{{sample}}_{L}.bam', L=LANE),
    output:
        outf = "merged/{sample}.merged.bam",
    shell:
        r"""
        samtools merge {output} {input}
        """
It assumes that all samples have lanes 1 to 4, which is not great, but hopefully you get the idea. The sketch below shows one way to avoid hardcoding the lists.
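If the hardcoded lists bother you, here is a sketch that derives both from the files on disk instead (assuming the raw fastq files sit in the working directory and follow the RD1_1_L1.fastq.gz naming):

# the constraint L\d stops the lane wildcard from swallowing underscores
SAMPLES, LANES = glob_wildcards(r"{sample}_{lane,L\d}.fastq.gz")
SAMPLES = sorted(set(SAMPLES))
LANES = sorted(set(LANES))

The sorted(set(...)) calls are needed because glob_wildcards returns one entry per matching file, so each sample would otherwise appear once per lane.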
I wrote some R scripts, and I'd like to use Snakemake to integrate them into an analysis pipeline. I have almost finished this pipeline, except for one of the R scripts. In this R script, one of the parameters is a list, like this:
group=list(A=c("a","b","c"),B=c("d","e"),C=c("f","g","h"))
I don't know how to pass this kind of parameter in Snakemake.
The R script and Snakemake script I wrote are as follows:
R script:
library(optparse)
library(ggtree)
library(ggplot2)
library(colorspace)

# help doc
option_list = list(
    make_option("--input", type="character", help="<file> input file"),
    make_option("--output", type="character", help="<file> output file"),
    make_option("--families", type="character", help="<list> a list containing classified families"),
    make_option("--wide", type="numeric", help="<num> width of figure"),
    make_option("--high", type="numeric", help="<num> height of figure"),
    make_option("--labsize", type="numeric", help="<num> size of tip label")
)
opt_parser = OptionParser(usage="\n\nName: cluster_vil.r",
    description="\nDescription: This script is to visualize the result of cluster analysis.
    \nContact: huisu<hsu#kangpusen.com>
    \nDate: 9.5.2019",
    option_list=option_list,
    epilogue="Example: Rscript cluster_vil.r --input mega_IBSout_male.nwk
    --output NJ_IBS_male.ggtree.pdf
    --families list(Family_A=c('3005','3021','3009','3119'),Family_B=c('W','4023'),Family_C=c('810','3003'),Family_D=c('4019','1001','4015','4021'),Family_E=c('4017','3115'))
    --wide 18
    --high 8
    --labsize 7"
)
opt = parse_args(opt_parser)
input = opt$input
output = opt$output
families = opt$families
wide = opt$wide
high = opt$high
labsize = opt$labsize

# start plot
nwk = read.tree(input)
tree = groupOTU(nwk, families)
pdf(file=output, width=wide, height=high)  # 18,8 for male samples; 12,18 for all samples
ggtree(tree, aes(color=group), branch.length='none') + geom_tiplab(size=labsize) +
    theme(legend.position="left", legend.text=element_text(size=12), legend.title=element_text(size=18),
          legend.key.width=unit(0.5,"inches"), legend.key.height=unit(0.3,"inches")) +
    scale_color_manual(values=c("black", rainbow_hcl(length(families)))) +
    theme(plot.margin=unit(rep(2,4),'cm'))
dev.off()
snakemake:
rule cluster_visual:
    input:
        nwk = "mega_IBS.nwk",
    output:
        all = "mega_IBS.pdf",
    params:
        fam = collections.OrderedDict([('Family_A', ['3005', '3021', '3009', '3119']),
                                       ('Family_B', ['W', '4023']),
                                       ('Family_C', ['810', '3003']),
                                       ('Family_D', ['4019', '1001', '4015', '4021']),
                                       ('Family_E', ['4017', '3115'])])
    message:
        "====cluster analysis visualization===="
    shell:
        "Rscript Rfunction/cluster_vil.r "
        "--input {input.nwk} "
        "--output {output.all} "
        "--families {params.fam} "
        "--wide 12 "
        "--high 18 "
        "--labsize 3"
So, I want to know how to properly write the parameter fam in Snakemake.
I think in Python/Snakemake you can use an OrderedDict to represent an R list. So:
params:
    fam = list(A=c('a','b','c'), B=c('d','e'), C=c('f','g','h'))
Would be:
params:
    fam = collections.OrderedDict([('A', ['a', 'b', 'c']),
                                   ('B', ['d', 'e']),
                                   ('C', ['f', 'g', 'h'])])
Of course, add import collections to the top of your snakemake file (or wherever you want to import the collections module).
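One caveat worth adding (my addition, not part of the original answer): when {params.fam} is substituted into the shell command, Snakemake renders the OrderedDict with Python's str(), which is not R list syntax. Here is a sketch of a helper (r_list is a made-up name) that serializes the dict into the list(A=c(...)) string shown in the R script's usage example:

import collections

def r_list(d):
    # OrderedDict([('A', ['a', 'b'])]) -> "list(A=c('a','b'))"
    inner = ",".join(
        "{}=c({})".format(key, ",".join("'{}'".format(v) for v in vals))
        for key, vals in d.items()
    )
    return "list({})".format(inner)

fam = collections.OrderedDict([('A', ['a', 'b', 'c']), ('B', ['d', 'e'])])
print(r_list(fam))  # list(A=c('a','b','c'),B=c('d','e'))

The resulting string would still need quoting in the shell command (--families '{params.fam}'), and on the R side something like eval(parse(text=opt$families)) to turn the character value back into an actual list.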
How is it possible to join multiple lines of a log file into 1 dataframe row?
ADDED ONE LINE -- Example 4-line log file:
[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE (( mr.cap_em >
0 AND mr.cap_em > 5
)) ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
version = 2.0
filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]
[INFO ][2019-03-15 12:34:55,886][DefaultListableBeanFactory] - [Overriding bean definition for bean 'cpnreq': replacing [Generic bean: class [com.ar.moves.domain.bom.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/404.jar!/com/ar/moves/moves-context.xml]] with [Generic bean: class [com.ar.bl.bom.domain.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/Tools/Tomcatv8.5-appGit-master/404.jar!/com/ar/bl/bom/bl-bom-context.xml]]]
(See representative 8-line extract at https://pastebin.com/bsmWWCgw.)
The structure is clean:
[PRIOR][datetime][ClassName] - [Msg]
but the message is often multi-line, there may be multiple brackets in the message itself (even trailing ones), or ^M newlines, but not necessarily... That makes it difficult to parse. I don't know where to begin here.
So, in order to process such a file, and be able to read it with something like:
#!/usr/bin/env Rscript
df <- read.table('D:/logfile.log')
we really need to have that merge of lines happening first. How is that doable?
The goal is to load the whole log file for making graphics, analysis (grepping out stuff), and eventually writing it back into a file, so -- if possible -- newlines should be kept in order to respect the original formatting.
The expected dataframe would look like:
PRIOR Datetime ClassName Msg
----- ------------------- ------------------- ----------
WARN 2016-12-16 13:43:10 ConfigManagerLoader Low max...
DEBUG 2016-05-26 10:10:22 DataSourceImpl SELECT ...
And, ideally once again, this should be doable in R directly (?), so that we can "process" a live log file (opened in write mode by the server app), "à la tail -f".
This is a pretty wicked Regex bomb. I'd recommend using the stringr package, but you could do all this with grep style functions.
library(stringr)
library(magrittr)  # provides the %>% pipe used below
str <- c(
'[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE (( mr.cap_em >
0 AND mr.cap_em > 5
)) ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
version = 2.0
filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]'
)
Using regex we can split each line by checking for the pattern you mentioned. This regex checks for a [, followed by any non-line-feed character, line-feed character, or carriage-return character, followed by a ], but it does this in a lazy (non-greedy) way by using *?. Repeat that 3 times, then check for a -. Finally, check for a [, followed by any characters or a group that includes information within square brackets, then a ]. That's a mouthful; type it into a regex calculator. Just remember to remove the extra backslashes (in a regex calculator \ is used, but in R \\ is used).
# Split the text into each line without using \n or \r.
# pattern for each line is a lazy (non-greedy) [][][] - []
linesplit <- str %>%
# str_remove_all("\n") %>%
# str_extract_all('\\[(.|\\n|\\r)+\\]')
str_extract_all('\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\] - \\[(.|\\n|\\r|(\\[(.|\\n|\\r)*?\\]))*?\\]') %>%
unlist()
linesplit # Run this to view what happened
Now that we have each line separated, break them into columns. But we don't want to keep the [ or ], so we use a positive lookbehind and a positive lookahead in the regex to check that they are there without capturing them. Oh, and capture everything between them, of course.
# Split each line into columns
colsplit <- linesplit %>%
str_extract_all("(?<=\\[)(.|\\n|\\r)*?(?=\\])")
colsplit # Run this to view what happened
Now we have a list with one object per line. In each object are 4 items, one for each column. We need to convert those 4 items to a dataframe and then join those dataframes together.
# Convert each line to a dataframe, then join the dataframes together
df <- lapply(colsplit,
function(x){
data.frame(
PRIOR = x[1],
Datetime = x[2],
ClassName = x[3],
Msg = x[4],
stringsAsFactors = FALSE
)
}
) %>%
do.call(rbind,.)
df
# PRIOR Datetime ClassName Msg
# 1 WARN 2016-12-16 13:43:10,138 ConfigManagerLoader Low max memory=
# 2 DEBUG 2016-05-26 10:10:22,185 DataSourceImpl SELECT mr.lb_id
# 3 ERROR 2016-12-21 13:51:04,710 DWRWorkflowService Update Wizard -
# Note: there are extra spaces that probably should be trimmed,
# and the dates are slightly messed up. I'll leave those for the
# questioner to fix using a mutate and the string functions.
I will leave it to you to fix the extra spaces and the date field.
I have 5 functions working relatively well:
1- singleline_diff(line1, line2)
compares 2 lines in one file
Inputs:
line1 - first single line string
line2 - second single line string
Output:
the index of the first difference between the two lines,
or IDENTICAL if the two lines are the same.
2- singleline_diff_format(line1, line2, idx):
compares 2 lines in one file
Inputs:
line1 - first single line string
line2 - second single line string
idx - index at which to indicate difference (from 1st function)
Output:
abcd (first line)
==^ (= indicates an identical character, ^ indicates the difference)
abef (second line)
If either input line contains a newline or carriage return,
then returns an empty string.
If idx is not a valid index, then returns an empty string.
3- multiline_diff(lines1, lines2):
deals with two lists of lines
Inputs:
lines1 - list of single line strings
lines2 - list of single line strings
Output:
a tuple containing the line number (starting from 0) and
the index in that line where the first difference between lines1
and lines2 occurs.
Returns (IDENTICAL, IDENTICAL) if the two lists are the same.
4-get_file_lines(filename)
Inputs:
filename - name of file to read
Output:
a list of lines from the file named filename.
If the file does not exist or is not readable, then the
behavior of this function is undefined.
5- file_diff_format(filename1, filename2) (the function with the problem)
deals with two input files
Inputs:
filename1 - name of first file
filename2 - name of second file
Output:
four line string showing the location of the first
difference between the two files named by the inputs.
If the files are identical, the function instead returns the
string "No differences\n".
If either file does not exist or is not readable, then the
behavior of this function is undefined.
Testing the function: everything goes well until the test uses one empty file; then it gives me "list index out of range".
This is the code I use:
def file_diff_format(filename1, filename2):
    file_1 = get_file_lines(filename1)
    file_2 = get_file_lines(filename2)
    mli_dif = multiline_diff(file_1, file_2)
    min_lens = min(len(file_1), len(file_2))
    if mli_dif == (-1, -1):
        return "No differences" "\n"
    else:
        diff_line_indx = mli_dif[0]
        diff_str_indx = int(mli_dif[1])
        if len(file_1) >= 0:
            line_file_1 = ""
        else:
            line_file_1 = file_1[diff_line_indx]
        if len(file_2) >= 0:
            line_file_2 = ""
        else:
            line_file_2 = file_2[diff_line_indx]
        line_file_1 = file_1[diff_line_indx]
        line_file_2 = file_2[diff_line_indx]
        out_print = singleline_diff_format(line_file_1, line_file_2, diff_str_indx)
        return "Line {}{}{}".format(diff_line_indx, ":\n", out_print)
If one of the files is empty, file_1 or file_2 will be an empty list, so trying to access an element of either will cause the error you describe.
Your code tries to check for empty files when assigning line_file_1 and line_file_2 (note that len(file_1) >= 0 is always true, so those checks never take the else branch), but it then goes ahead and indexes both lists unconditionally on the next two lines anyway.
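Here is a hedged sketch of the fix, guarding the indexing itself so an empty file yields an empty string instead of an IndexError (it assumes the helper functions described above exist and that multiline_diff returns (-1, -1) for identical inputs):

def file_diff_format(filename1, filename2):
    file_1 = get_file_lines(filename1)
    file_2 = get_file_lines(filename2)
    mli_dif = multiline_diff(file_1, file_2)
    if mli_dif == (-1, -1):
        return "No differences\n"
    diff_line_indx, diff_str_indx = mli_dif
    # only index into a list if the differing line actually exists in it
    line_file_1 = file_1[diff_line_indx] if diff_line_indx < len(file_1) else ""
    line_file_2 = file_2[diff_line_indx] if diff_line_indx < len(file_2) else ""
    out_print = singleline_diff_format(line_file_1, line_file_2, diff_str_indx)
    return "Line {}:\n{}".format(diff_line_indx, out_print)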