How to pass a list-style parameter in Snakemake

I wrote some R scripts, and I'd like to use Snakemake to integrate them into an analysis pipeline. The pipeline is almost finished, except for one of the R scripts. In this script, one of the parameters is a list, like this:
group=list(A=c("a","b","c"),B=c("d","e"),C=c("f","g","h"))
I don't know how to pass this kind of parameter in Snakemake.
The R script and the Snakefile I wrote are as follows:
R script:
library(optparse)
library(ggtree)
library(ggplot2)
library(colorspace)
# help doc
option_list=list(
make_option("--input",type="character",help="<file> input file"),
make_option("--output",type="character",help="<file> output file"),
make_option("--families",type="character",help="<list> a list containing classified families"),
make_option("--wide",type="numeric",help="<num> width of figure"),
make_option("--high",type="numeric",help="<num> height of figure"),
make_option("--labsize",type="numeric",help="<num> size of tip label")
)
opt_parser=OptionParser(usage="\n\nName: cluster_vil.r",
description="\nDescription: This script is to visualize the result of cluster analysis.
\nContact: huisu<hsu#kangpusen.com>
\nDate: 9.5.2019",
option_list=option_list,
epilogue="Example: Rscript cluster_vil.r --input mega_IBSout_male.nwk
--output NJ_IBS_male.ggtree.pdf
--families list(Family_A=c('3005','3021','3009','3119'),Family_B=c('W','4023'),Family_C=c('810','3003'),Family_D=c('4019','1001','4015','4021'),Family_E=c('4017','3115'))
--wide 18
--high 8
--labsize 7"
)
opt=parse_args(opt_parser)
input=opt$input
output=opt$output
families=opt$families
wide=opt$wide
high=opt$high
labsize=opt$labsize
# start plot
nwk=read.tree(input)
tree=groupOTU(nwk, families)
pdf(file=output,width=wide,height=high) # 18,8 for male samples; 12,18 for all samples
ggtree(tree,aes(color=group),branch.length='none') + geom_tiplab(size=labsize) +
theme(legend.position=("left"),legend.text=element_text(size=12),legend.title=element_text(size=18),
legend.key.width=unit(0.5,"inches"),legend.key.height=unit(0.3,"inches")) +
scale_color_manual(values=c("black", rainbow_hcl(length(families)))) +
theme(plot.margin=unit(rep(2,4),'cm'))
dev.off()
snakemake:
rule cluster_virual:
    input:
        nwk="mega_IBS.nwk",
    output:
        all="mega_IBS.pdf",
    params:
        fam=collections.OrderedDict([('Family_A',['3005','3021','3009','3119']),
                                     ('Family_B',['W','4023']),
                                     ('Family_C',['810','3003']),
                                     ('Family_D',["4019","1001","4015","4021"]),
                                     ('Family_E',["4017","3115"])])
    message:
        "====cluster analysis virualization===="
    shell:
        "Rscript Rfunction/cluster_vil.r "
        "--input {input.nwk} "
        "--output {output.all} "
        "--families {params.fam} "
        "--wide 12 "
        "--high 18 "
        "--labsize 3"
So, I want to know how to properly write the parameter fam in Snakemake.

I think in Python/Snakemake you can use an OrderedDict to represent an R list. So this R list:
params:
    fam=list(A=c('a','b','c'),B=c('d','e'),C=c('f','g','h'))
would be:
params:
    fam=collections.OrderedDict([('A', ['a', 'b', 'c']),
                                 ('B', ['d', 'e']),
                                 ('C', ['f', 'g', 'h'])])
Of course, add import collections to the top of your Snakefile (or wherever you want to import the collections module).
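One caveat, offered as a sketch rather than a definitive fix: when an OrderedDict is formatted into a shell command, what most likely appears is its Python representation, not R syntax. A possible workaround is to render the mapping as an R list(...) literal string before passing it on. The helper name to_r_list below is hypothetical, and the R script would still have to turn the string into a list itself (for example with eval(parse(text=...)), which is an assumption about how the script could be adapted, not something shown in the question).

```python
from collections import OrderedDict


def to_r_list(groups):
    """Render a mapping of name -> list of strings as an R list(...) literal.

    Hypothetical helper for illustration; it does not escape quotes
    inside the member names.
    """
    parts = []
    for name, members in groups.items():
        quoted = ",".join("'{}'".format(member) for member in members)
        parts.append("{}=c({})".format(name, quoted))
    return "list({})".format(",".join(parts))


fam = OrderedDict([
    ("A", ["a", "b", "c"]),
    ("B", ["d", "e"]),
    ("C", ["f", "g", "h"]),
])

r_literal = to_r_list(fam)
print(r_literal)  # list(A=c('a','b','c'),B=c('d','e'),C=c('f','g','h'))
```

In a rule, the resulting string could be assigned to params and wrapped in double quotes in the shell command so that the shell passes it as a single argument.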

Related

snakemake error: 'Wildcards' object has no attribute 'batch'

I don't understand how to redefine my snakemake rule to fix the Wildcards issue below.
Ignore the logic of batches, it internally makes sense in the python script. In theory, I want the rule to be run for each batch 1-20. I use BATCHES list for {batch} in output, and in the shell command, I use {wildcards.batch}:
OUTDIR="my_dir/"
nBATCHES = 20
BATCHES = list(range(1,21)) # [1,2,3 ..20] list
[...]
rule step5:
    input:
        ids = expand('{IDLIST}', IDLIST=IDLIST)
    output:
        type1 = expand('{OUTDIR}/resources/{batch}_output_type1.csv.gz', OUTDIR=OUTDIR, batch=BATCHES),
        type2 = expand('{OUTDIR}/resources/{batch}_output_type2.csv.gz', OUTDIR=OUTDIR, batch=BATCHES),
        type3 = expand('{OUTDIR}/resources/{batch}_output_type3.csv.gz', OUTDIR=OUTDIR, batch=BATCHES)
    shell:
        "./some_script.py --outdir {OUTDIR} --idlist {input.ids} --total_batches {nBATCHES} --current_batch {wildcards.batch}"
Error:
RuleException in rule step5 in line 241 of Snakefile:
AttributeError: 'Wildcards' object has no attribute 'batch', when formatting the following:
./somescript.py --outdir {OUTDIR} --idlist {input.idlist} --totalbatches {nBATCHES} --current_batch {wildcards.batch}
Executing script for a single batch manually looks like this (and works): (total_batches is a constant; current_batch is supposed to iterate)
./somescript.py --outdir my_dir/ --idlist ids.csv --total_batches 20 --current_batch 1
You seem to want to run the rule step5 once for each batch in BATCHES. So you need to structure your Snakefile to do exactly that.
In the following Snakefile running the rule all runs your rule step5 for all combinations of OUTDIR and BATCHES:
OUTDIR = "my_dir"
nBATCHES = 20
BATCHES = list(range(1, 21)) # [1,2,3 ..20] list
IDLIST = ["a", "b"] # dummy data, I don't have the original
rule all:
    input:
        type1=expand(
            "{OUTDIR}/resources/{batch}_output_type1.csv.gz",
            OUTDIR=OUTDIR,
            batch=BATCHES,
        ),
rule step5:
    input:
        ids=expand("{IDLIST}", IDLIST=IDLIST),
    output:
        type1="{OUTDIR}/resources/{batch}_output_type1.csv.gz",
        type2="{OUTDIR}/resources/{batch}_output_type2.csv.gz",
        type3="{OUTDIR}/resources/{batch}_output_type3.csv.gz",
    shell:
        "./some_script.py --outdir {OUTDIR} --idlist {input.ids} --total_batches {nBATCHES} --current_batch {wildcards.batch}"
In your earlier version, {batch} was just an expand placeholder, not a wildcard, so the rule was only called once.
Instead of the rule all, this could be a subsequent rule which uses one or multiple of the outputs generated from step5.
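To see why the original rule had no batch wildcard, it can help to look at what expand produces. The function expand_like below is a simplified, hypothetical stand-in for Snakemake's expand (it only covers the plain keyword case), written so the sketch runs without Snakemake installed. Every placeholder is filled in, so the result is a list of concrete file names with nothing left for Snakemake to match as a wildcard.

```python
from itertools import product


def expand_like(pattern, **values):
    """Simplified stand-in for snakemake's expand(): fill every placeholder
    with each combination of the given values."""
    keys = list(values)
    combos = product(*(values[key] for key in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]


OUTDIR = "my_dir"
BATCHES = list(range(1, 21))

targets = expand_like(
    "{OUTDIR}/resources/{batch}_output_type1.csv.gz",
    OUTDIR=[OUTDIR],
    batch=BATCHES,
)

print(len(targets))  # 20 concrete paths
print(targets[0])    # my_dir/resources/1_output_type1.csv.gz
# No braces remain, so there is no {batch} left to act as a wildcard.
print(any("{" in target for target in targets))  # False
```

Used as the output of a rule, such a list describes one job that creates all 20 files; used as the input of rule all, it requests 20 separate jobs from a rule whose output still contains {batch}.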

How to introduce a new wildcard in a snakemake pipeline with several rules

I have run into this problem several times and would finally like to understand it: is it possible to introduce a new wildcard in a rule in a Snakemake pipeline?
workdir: "/path/to/"
(SAMPLES,) = glob_wildcards('/path/to/trimmed/{sample}.trimmed.fastq.gz')
rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        expand("merged/{sample}.merged.bam", sample=SAMPLES)
rule bwa_mem:
    input:
        bwa_index_done = "ref",
        fastq="path/to/trimmed/{sample}.trimmed.fastq.gz"
    output:
        bam = "{sample}.bam"
    threads: 10
    shell:
        """/Tools/bwa-0.7.12/bwa mem -t {threads} ref {input.fastq} | /Tools/samtools-1.10/samtools sort -o {output.bam}"""
rule samtools_merge:
    input:
        lane1="{sample}_L1.bam",
        lane2="{sample}_L2.bam",
        lane3="{sample}_L3.bam",
        lane4="{sample}_L4.bam"
    output:
        outf = "merged/{sample}.merged.bam"
    threads: 4
    shell:
        """Tools/samtools-1.10/samtools merge -@ {threads} {output.outf} {input.lane1} {input.lane2} {input.lane3} {input.lane4}"""
My input files:
RD1_1_L1.fastq.gz - RD1_100_L1.fastq.gz
RD1_1_L2.fastq.gz - RD1_100_L2.fastq.gz
RD1_1_L3.fastq.gz - RD1_100_L3.fastq.gz
RD1_1_L4.fastq.gz - RD1_100_L4.fastq.gz
RD2_100_L1.fastq.gz - RD2_200_L1.fastq.gz
RD2_100_L2.fastq.gz - RD2_200_L2.fastq.gz
RD2_100_L3.fastq.gz - RD2_200_L3.fastq.gz
RD2_100_L4.fastq.gz - RD2_200_L4.fastq.gz
While trimming it is ok to use it as one single sample, but when merging I need to specify L1, L2, L3 and L4. So is it possible to introduce a new wildcard somehow specific for a rule?
"Is it possible to introduce a new wildcard somehow specific for a rule?"
I'm not 100% sure what you mean by that, but I think the answer is yes.
Looking at your example, maybe this is what you are trying to do:
SAMPLES = ['RD1_1', 'RD2_100', 'RD1_100', 'RD2_200']
LANE = ['L1', 'L2', 'L3', 'L4']
rule all:
    input:
        expand("merged/{sample}.merged.bam", sample=SAMPLES)
rule trim:
    input:
        fastq="{sample}_{L}.fastq.gz"
    output:
        fastq="trimmed/{sample}_{L}.trimmed.fastq.gz"
    shell:
        r"""
        trim {input} {output}
        """
rule bwa_mem:
    input:
        fastq="trimmed/{sample}_{L}.trimmed.fastq.gz"
    output:
        bam="{sample}_{L}.bam"
    shell:
        r"""
        bwa mem {input} {output}
        """
rule samtools_merge:
    input:
        expand('{{sample}}_{L}.bam', L=LANE),
    output:
        outf="merged/{sample}.merged.bam",
    shell:
        r"""
        samtools merge {output} {input}
        """
This assumes that all samples have lanes 1 to 4, which is not great, but hopefully you get the idea.
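The key detail in the merge rule is the doubled braces around sample. As a minimal sketch, plain str.format is used here as a stand-in for expand, since both follow the same brace-escaping convention: doubled braces come out as literal single braces, so {sample} stays a wildcard while {L} is filled in for every lane. The sample name RD1_1 is taken from the example list above.

```python
LANE = ["L1", "L2", "L3", "L4"]

# {{sample}} is escaped and survives as the literal text {sample};
# {L} is substituted once per lane.
merge_inputs = ["{{sample}}_{L}.bam".format(L=lane) for lane in LANE]
print(merge_inputs)
# ['{sample}_L1.bam', '{sample}_L2.bam', '{sample}_L3.bam', '{sample}_L4.bam']

# Later, the remaining wildcard is resolved per requested sample.
resolved = [pattern.format(sample="RD1_1") for pattern in merge_inputs]
print(resolved)
# ['RD1_1_L1.bam', 'RD1_1_L2.bam', 'RD1_1_L3.bam', 'RD1_1_L4.bam']
```

This is why the lane can be a wildcard in the trim and bwa_mem rules, yet be enumerated explicitly in the merge rule, whose output only carries the sample wildcard.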

Data from hive to R to bash

I'm trying to execute a Hive query in R using the system command. This R function is called from a Bash script.
R Code
hivedata <- function(query)
{
    data <- system(paste0("hive -S -e ", query), wait = TRUE, intern = TRUE)
    if (identical(data, character(0))) { data = NULL }
    message("return value:")
    message(data)
    message("return value type:")
    message(class(data))
    return(cat(data))
}
if (length(query) > 0 && is.na(query) == FALSE) {
    data = hivedata(query)
    print(data)
}
Bash function
gethivedata(){
    set -f # disable asterisks in the SQL being expanded as filenames
    query=$1
    data=`Rscript hivedata.r "'$query'"`
    echo $data
}
Calling function in Bash
totalcount=$(gethivedata " select count(*) from hivedb.hivetable ")
The outputs
[usr@host dir]$ totalcount=$(execute_sql " select count(*) from hivedb.hivetable ")
return value:
0
return value type:
character
-------------------------------
[usr@host dir]$ echo $totalcount
0NULL
When cat is not used, the output comes out as [1] "0", because R prints the index along with the value. When cat is used, the output becomes 0NULL. I want only the actual value, which is 0.
What does [1] mean in the output of any command executed on R command line?

How to call exe program and input parameters using R?

I want to call an .exe program (spi_sl_6.exe) from R using system, but I can't pass parameters to the program this way. The following is my command: system("D:\\working\spi_sl_6.exe")
I have been searching the net for a long time without success. Please help or give me some ideas on how to achieve this. Thanks in advance.
This is using the Standardized Precipitation Index software from
http://drought.unl.edu/MonitoringTools/DownloadableSPIProgram.aspx.
This seems to give a working solution on Windows (though not without warnings!).
First, download the software and example files:
# Create directory to download software
mydir <- "C:\\Users\\david\\spi"
dir.create(mydir)
url <- "http://drought.unl.edu/archive/Programs/SPI"
download.file(file.path(url, "spi_sl_6.exe"), file.path(mydir, "spi_sl_6.exe"), mode="wb")
# Download example files
download.file(file.path(url, "SPI_samplefiles.zip"), file.path(mydir, "SPI_samplefiles.zip"))
# extract one example file, and write out
temp <- unzip(file.path(mydir, "SPI_samplefiles.zip"), "wymo.cor")
dat <- read.table(temp)
# Use this file as an example input
write.table(dat, file.path(mydir,"wymo.cor"), col.names = FALSE, row.names = FALSE)
According to page 3 of the help file basic-spi-program-information.pdf at the above link, the command line should be of the form spi 3 6 12 <infile.dat >outfile.dat. However, neither of the following worked (from the command line, not in R), nor did various other ways of passing the parameters:
C:\Users\david\spi\spi_sl_6 3 <C:\Users\david\spi\wymo.cor >C:\Users\david\spi\out.dat
cd C:\Users\david\spi && spi_sl_6 3 <wymo.cor >out.dat
However, the approach from the accepted answer to "Running .exe file with multiple parameters in c#" seems to work. Again, from the command line:
cd C:\Users\david\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out1.dat) | spi_sl_6
So to run this in R, you can wrap it in a call to shell (you will need to change the path to where you have saved the exe):
shell("cd C:\\Users\\david\\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out2.dat) | spi_sl_6", intern=TRUE)
out1.dat and out2.dat should be the same.
This throws warning messages, I think from the echo (in R, but not from the command line), but the output file is produced.
I suppose you can automate all the echo calls slightly, so that all you need to do is supply the time parameters:
timez <- c(2, 3, 6)
stime <- paste("echo", timez, collapse =" && ")
infile <- "wymo.cor"
outfile <- "out3.dat"
spiCall <- paste("cd", mydir, "&& (" , stime, "&& echo", infile, "&& echo", outfile, " ) | spi_sl_6")
shell(spiCall)
You can construct the command using sprintf:
cmd_name <- "D:\\working\\spi_sl_6.exe"
param1 <- "a"
param2 <- "b"
system(sprintf("%s %s %s", cmd_name, param1, param2))
Or using system2 (I prefer this option):
system2(cmd_name, args = c(param1, param2))

R: pass variable from R to unix

I am running an R script from a bash script and want to return the output of the R script to the bash script so I can keep working with it there.
The bash script is something like this:
#!/bin/bash
Rscript MYRScript.R
a=OUTPUT_FROM_MYRScript.R
do sth with a
and the R script is something like this:
for(i in 1:5){
i
sink(type="message")
}
I want bash to work with one variable from R at a time, meaning: bash receives i=1 and works with that; when that task is done, it receives i=2, and so on.
Any ideas how to do that?
One option is to make your R script executable with #!/usr/bin/env Rscript (setting the executable bit; e.g. chmod 0755 myrscript.r, chmod +x myrscript.r, etc...), and just treat it like any other command, e.g. assigning the results to an array variable below:
myrscript.r
#!/usr/bin/env Rscript
cat(1:5, sep = "\n")
mybashscript.sh
#!/bin/bash
RES=($(./myrscript.r))
for elem in "${RES[@]}"
do
    echo elem is "${elem}"
done
nrussell$ ./mybashscript.sh
elem is 1
elem is 2
elem is 3
elem is 4
elem is 5
Here is MYRScript.R:
for(iter in 1:5) {
cat(iter, ' ')
}
and here is your bash script:
#!/bin/bash
r_output=`Rscript ~/MYRScript.R`
for iter in `echo $r_output`
do
    echo Here is some output from R: $iter
done
Here is some output from R: 1
Here is some output from R: 2
Here is some output from R: 3
Here is some output from R: 4
Here is some output from R: 5
