Snakemake: how to use one integer from a list per call as input to a script?

I'm trying to practice writing workflows in snakemake.
The contents of my Snakefile:
configfile: "config.yaml"

rule get_col:
    input:
        expand("data/{file}.csv", file=config["datname"])
    output:
        expand("output/{file}_col{param}.csv", file=config["datname"], param=config["cols"])
    params:
        col=config["cols"]
    script:
        "scripts/getCols.R"
The contents of config.yaml:
cols:
  [2,4]
datname:
  "GSE3790_expression_data"
My R script:
getCols=function(input,output,col) {
  dat=read.csv(input)
  dat=dat[,col]
  write.csv(dat,output,row.names=F)
}
getCols(snakemake@input[[1]],snakemake@output[[1]],snakemake@params[['col']])
It seems like both columns are being called at once. What I'm trying to accomplish is one column being called from the list per output file.
Since the second output never gets a chance to be created (both columns are used to create first output), snakemake throws an error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 3 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Job completed successfully, but some output files are missing.
On a slightly unrelated note, I thought I could write the input as:
'"data/{file}.csv"'
But that returns:
WildcardError in line 4 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Wildcards in input files cannot be determined from output files:
'file'
Any help would be much appreciated!

Looks like you want to run your Rscript twice per file, once for every value of col. In this case, the rule needs to be called twice as well.
The use of expand is also a bit too much here, in my opinion. expand fills your wildcards with all possible values and returns a list of the resulting files. So the output of this rule would be every combination of files and cols, which the simple script cannot create in one run.
This is also the reason why file cannot be inferred from the output: it has already been expanded there, so no wildcard is left.
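To make the "all possible combinations" point concrete, here is a minimal plain-Python sketch. `expand_sketch` is a hypothetical stand-in written for illustration, not Snakemake's own implementation; it only mimics the behaviour described above using the values from the question's config.yaml.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Hypothetical stand-in for expand(): fill the pattern with every
    combination of the given values and return the resulting list."""
    names = list(wildcards)
    # Wrap single values in a list so a string is not iterated per character.
    values = [[v] if isinstance(v, (str, int)) else list(v)
              for v in wildcards.values()]
    return [pattern.format(**dict(zip(names, combo)))
            for combo in product(*values)]

# Values copied from the question's config.yaml
config = {"datname": "GSE3790_expression_data", "cols": [2, 4]}

outputs = expand_sketch("output/{file}_col{param}.csv",
                        file=config["datname"], param=config["cols"])
print(outputs)
# ['output/GSE3790_expression_data_col2.csv', 'output/GSE3790_expression_data_col4.csv']
```

The result is a list of two concrete file names with no wildcard left in it, so a rule that declares this list as its output promises to create both files in a single job.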
Instead, try writing your rule more simply, for just one file and one column, and call expand on the resulting output in a rule that needs this output as its input. If this is the final output of your workflow, put it as input in a rule all to tell the workflow what the ultimate goal is.
rule all:
    input:
        expand("output/{file}_col{param}.csv",
               file=config["datname"], param=config["cols"])

rule get_col:
    input:
        "data/{file}.csv"
    output:
        "output/{file}_col{param}.csv"
    params:
        # wildcard values are strings; convert so R indexes the column by position
        col=lambda wc: int(wc.param)
    script:
        "scripts/getCols.R"
Snakemake will infer from rule all (or from any other rule that uses this output) what needs to be done and will call the rule get_col once per requested file.
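Roughly, the inference works by matching each requested file name against the output pattern of a rule. The following is only a sketch of that idea in plain Python; `match_output` is a hypothetical helper, and Snakemake's real matching is more elaborate (wildcard constraints, ambiguity handling and so on).

```python
import re

def match_output(pattern, target):
    """Hypothetical helper: turn an output pattern with {wildcards} into a
    regex and return the wildcard values that make it equal to target."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])   # literal text between wildcards
        regex += f"(?P<{m.group(1)}>.+)"             # one named group per wildcard
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None

# One of the files requested by rule all
wildcards = match_output("output/{file}_col{param}.csv",
                         "output/GSE3790_expression_data_col4.csv")
print(wildcards)
# {'file': 'GSE3790_expression_data', 'param': '4'}

# The same values then fill the input pattern of the rule
input_file = "data/{file}.csv".format(**wildcards)
print(input_file)
# data/GSE3790_expression_data.csv
```

Note that the matched values come back as strings ('4', not 4), since they are cut out of a file name.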

Related

Use wildcard to capture different datasets in Snakemake

I would like to use wildcards in Snakemake in a very simple way to start a script for two datasets. Unfortunately, I cannot find the proper way of doing it.
My data folder contains three files: gene_list.txt, expression_JGI.txt, expression_UBC.txt.
Here is what my snakefile looks like:
rule extract:
    input:
        genes="data/gene_list.txt",
        expression="data/expression_{dataset}.txt"
    output:
        "data/expression_{dataset}_subset.txt"
    shell:
        "bash scripts/extract.sh {input.genes} {input.expression} {output}"
When I use snakemake -c1 extract I get the following error message:
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).
I tried adding a rule all at the beginning of the snakefile with the desired result files as input without success:
rule all:
    input:
        "data/expression_JGI_subset.txt",
        "data/expression_UBC_subset.txt"
I also tried with expand:
DATASETS = ["JGI", "UBC"]

rule all:
    input:
        expand("data/expression_{dataset}_subset.txt", dataset=DATASETS)
But I get the same error message.
The script works fine when I use it outside Snakemake.
How can I achieve what I want?
When you do snakemake -c1 extract you ask snakemake to execute only rule extract and its dependencies, if any. However, because extract contains wildcards snakemake doesn't know what to replace them with. (Note that rule all is not a dependency of extract).
So either execute snakemake -c1 to run the whole pipeline or specify the concrete files you want to generate, e.g.:
snakemake -c1 -- data/expression_JGI_subset.txt data/expression_UBC_subset.txt
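As a small illustration of why the rule name alone is not enough, this plain-Python sketch lists the wildcards left open in the output pattern and then builds the concrete file names that can be requested instead. `wildcard_names` is a hypothetical helper, not part of Snakemake.

```python
from string import Formatter

def wildcard_names(pattern):
    """Hypothetical helper: list the {placeholders} that appear in a pattern."""
    return [field for _, field, _, _ in Formatter().parse(pattern) if field]

output_pattern = "data/expression_{dataset}_subset.txt"

# The pattern still has an open wildcard, so "extract" alone is not a concrete target.
print(wildcard_names(output_pattern))
# ['dataset']

# Filling the wildcard gives file names that can be passed on the command line.
targets = [output_pattern.format(dataset=d) for d in ["JGI", "UBC"]]
print("snakemake -c1 -- " + " ".join(targets))
# snakemake -c1 -- data/expression_JGI_subset.txt data/expression_UBC_subset.txt
```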

What is the best way to use expand() with one unknown variable in Snakemake?

I am currently using Snakemake for a bioinformatics project. Given a human reference genome (hg19) and a BAM file, I want to be able to specify that there will be multiple output files with the same name but different extensions. Here is my code:
rule gridss_preprocess:
    input:
        ref=config['ref'],
        bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
        bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
    output:
        expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext = config['workreq'], sample = "{sample}")
Currently config['workreq'] is a list of extensions that start with "."
For example, I want to be able to use expand to indicate the following files
S1.dedup.downsampled.bam.cigar_metrics
S1.dedup.downsampled.bam.computesamtags.changes.tsv
S1.dedup.downsampled.bam.coverage.blacklist.bed
S1.dedup.downsampled.bam.idsv_metrics
I want to be able to do this for multiple sample files, S_. Currently I am not getting an error when I try to do a dry run. However, I am not sure if this will run properly.
Am I doing this right?
expand() defines a list of files. If you pass two parameters, their Cartesian product is used. Thus, your rule would define as output ALL files with your extension list for ALL samples. Since you define a wildcard in your input, I think what you want is all files with your extensions for ONE sample, with the rule executed once per sample.
You're mixing up wildcards and placeholders for the expand() function. You can keep a wildcard inside an expand() by doubling the braces:
rule all:
    input:
        expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext = config['workreq'], sample=SAMPLELIST)

rule gridss_preprocess:
    input:
        ref=config['ref'],
        bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
        bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
    output:
        expand(config['bamdir'] + "{{sample}}.dedup.downsampled.bam{ext}", ext = config['workreq'])
This expand call expands to the list
{sample}.dedup.downsampled.bam.cigar_metrics
{sample}.dedup.downsampled.bam.computesamtags.changes.tsv
{sample}.dedup.downsampled.bam.coverage.blacklist.bed
{sample}.dedup.downsampled.bam.idsv_metrics
and thus define the wildcard sample to match the files in the input.
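The doubled braces follow the same escaping rule as Python's str.format, which can be checked without Snakemake. In this sketch the value of bamdir is a made-up placeholder and only two of the extensions from the question are used.

```python
# Hypothetical config values for illustration; "bams/" is not from the question.
config = {"bamdir": "bams/", "workreq": [".cigar_metrics", ".idsv_metrics"]}

pattern = config["bamdir"] + "{{sample}}.dedup.downsampled.bam{ext}"

# {ext} is filled in, while {{sample}} is unescaped to a literal {sample}
outputs = [pattern.format(ext=ext) for ext in config["workreq"]]
for name in outputs:
    print(name)
# bams/{sample}.dedup.downsampled.bam.cigar_metrics
# bams/{sample}.dedup.downsampled.bam.idsv_metrics
```

Each resulting string still contains {sample}, which is what lets Snakemake treat it as a wildcard for the rule.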

Is there a way to check where R is 'stuck' within a for loop? (R)

I am using system() to run several files iteratively through a program via CMD. It deposits each output into a sub-directory designated specifically for that input file, so the number of inputs is exactly equal to the number of output directories.
My code works for the first iteration, but I can see in the console that it won't move on to the second file after completing the first. The stop sign remains active so I know R is still 'running', but since the for loop environment is unique I can't really tell what it's stuck on. It just stays like this for hours. Therefore I'm not sure how to begin to diagnose the issue I'm having. Is there a way of tracing what happened after cancelling the code, for example?
If you're curious, the code looks like this. I don't know how to make it reproducible, so I just commented each line:
for (i in 1:length(flist)) {
  ## flist is a vector of character strings. Each row of characters is both
  ## the name of the input file and the name of the output directory
  setwd(paste0(solutions_dir, "\\", flist[i]))
  # sets the appropriate dir
  system(paste0(program_dir, "\\program.exe I=",
                file_dir, "\\", flist[i], " O=", solutions_dir, "\\", flist[i],
                "\\solv"))
  ## line that inputs program's exe file and the appropriate input/output
  ## locations
}

Remove log information from report and save report in desired location

I am new to Robot Framework and am looking for simple example code for a custom report, or any other answer to my problem. I went through the existing questions about reports but could not find a specific answer. Currently my report contains log information, and I would like to remove it so that the report shows only PASS/FAIL information. I also need to know how to save the report in a different location. Any example would be helpful. Thank you in advance.
There is a tool called Rebot which is part of Robot Framework.
By default, Robot Framework creates XML reports. The XML reports are automatically converted into HTML reports by Rebot.
You can set the location of the output files in the execution by specifying the parameter --outputdir (and thus set a different base directory for outputs).
From the documentation:
All output files can be set using an absolute path, in which case they are created to the specified place, but in other cases, the path is considered relative to the output directory. The default output directory is the directory where the execution is started from, but it can be altered with the --outputdir (-d) option. The path set with this option is, again, relative to the execution directory, but can naturally be given also as an absolute path. Regardless of how a path to an individual output file is obtained, its parent directory is created automatically, if it does not exist already.
You can call Rebot yourself to control this conversion.
You can also run Rebot after the test was run in order to create new output on a different location.
See documentation in:
http://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#post-processing-outputs
The following example shows how to store the HTML reports in a different location and include only part of the data:
rebot --include smoke --name Smoke_Tests c:\results\output.xml --outputdir c:\version1.0\reports
In the example above, we process the file c:\results\output.xml, create a new report called Smoke_Tests that includes only tests with the tag smoke, and save it to the output folder c:\version1.0\reports.
In addition you can also set the location of the log file (HTML) from the execution.
The command line option --log (-l) determines where log files are created.
The command line option --report (-r) determines where report files are created.
Removing log lines can be done a bit differently. If you run rebot --help you'll get the following options:
--removekeywords all|passed|for|wuks|name:<pattern> *
        Remove keyword data from all generated outputs. Keywords containing
        warnings are not removed except in `all` mode.
        all:     remove data from all keywords
        passed:  remove data only from keywords in passed
                 test cases and suites
        for:     remove passed iterations from for loops
        wuks:    remove all but the last failing keyword
                 inside `BuiltIn.Wait Until Keyword Succeeds`
        name:<pattern>:  remove data from keywords that match
                 the given pattern. The pattern is matched
                 against the full name of the keyword (e.g.
                 'MyLib.Keyword', 'resource.Second Keyword'),
                 is case, space, and underscore insensitive,
                 and may contain `*` and `?` as wildcards.
                 Examples: --removekeywords name:Lib.HugeKw
                           --removekeywords name:myresource.*
--flattenkeywords for|foritem|name:<pattern> *
        Flattens matching keywords in all generated outputs. Matching
        keywords get all log messages from their child keywords and children
        are discarded otherwise.
        for:     flatten for loops fully
        foritem: flatten individual for loop iterations
        name:<pattern>:  flatten matched keywords using same
                 matching rules as with
                 `--removekeywords name:<pattern>`
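Putting the options above together, a post-processing step could be scripted along these lines. This is only a sketch: `build_rebot_command` is a hypothetical helper and the paths are placeholders. It also relies on one thing not mentioned above, namely that the Robot Framework User Guide describes NONE as a special value that disables an output file, so `--log NONE` skips the log entirely. The block only builds and prints the command; it does not run rebot.

```python
import shlex

def build_rebot_command(output_xml, outputdir, remove_keywords="all", keep_log=False):
    """Hypothetical helper: assemble a rebot call that writes the report to
    outputdir, strips keyword data, and optionally skips log.html."""
    args = ["rebot", "--outputdir", outputdir, "--removekeywords", remove_keywords]
    if not keep_log:
        # Assumption from the User Guide: the value NONE disables this output file.
        args += ["--log", "NONE"]
    args.append(output_xml)
    return args

# Placeholder paths for illustration only
cmd = build_rebot_command("results/output.xml", "reports")
print(shlex.join(cmd))
# rebot --outputdir reports --removekeywords all --log NONE results/output.xml
```

The resulting list could be handed to subprocess.run(cmd) on a machine where Robot Framework is installed.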

How to insert text into middle of text file in QT?

I'm writing a program that performs several tests on a hardware unit, and logs both the results of each test and the steps taken to perform the test. The trick is that I want the program to log these results to a text file as they become available, so that if the program crashes the results that had been obtained are not lost, and the log can help debug the crash.
For example, assume a program consisting of two tests. If the program has finished the first test and is working on the second, the log file would look like:
Results:
Test 1 Result A: Passed
Test 1 Result B: 1.5 Volts
Log:
Setting up instruments.
Beginning test 1.
[Steps in test 1]
Finished test 1.
Beginning test 2.
[whatever test 2 steps have been completed]
Once the second test has finished, the log file would look like this:
Results:
Test 1 Result A: Passed
Test 1 Result B: 1.5 Volts
Test 2 Result A: Passed
Test 2 Result B: 2.0 Volts
Log:
Setting up instruments.
Beginning test 1.
[Steps in test 1]
Finished test 1.
Beginning test 2.
[Steps in test 2]
Finished test 2.
All tests complete.
How would I go about doing this? I've been looking at the help files for QFile and QTextStream, but I'm not seeing a way to insert text in the middle of existing text. I don't want to create separate files and merge them at the end because I'd end up with separate files in the event of a crash. I also don't want to write the file from scratch every time a change is made because it seems like there should be a faster, more elegant way of doing this.
QFile::readAll will read the entire file into a QByteArray.
On the QByteArray you can then use insert to insert text in the middle,
and then write it back to the file again.
Or you could use the classic C-style approach, which can modify files in the middle with the help of file pointers.
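The read, insert, write-back idea is the same in any language. Here it is as a small Python sketch rather than Qt code; `insert_before_marker` is a hypothetical helper, and writing to a temporary file before swapping it in is an extra precaution added here, not something from the answer above.

```python
import os
import tempfile

def insert_before_marker(path, marker, new_text):
    """Insert new_text just before the first occurrence of marker."""
    with open(path, "r", encoding="utf-8") as fh:
        content = fh.read()
    index = content.index(marker)  # raises ValueError if the marker is absent
    updated = content[:index] + new_text + content[index:]
    # Write to a temporary file first, then swap it in, so a crash
    # in the middle of the write does not leave a truncated log behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(updated)
    os.replace(tmp_path, path)

# Demo on a throwaway file shaped like the log in the question
log_path = os.path.join(tempfile.mkdtemp(), "test_log.txt")
with open(log_path, "w", encoding="utf-8") as fh:
    fh.write("Results:\nTest 1 Result A: Passed\nLog:\nSetting up instruments.\n")

insert_before_marker(log_path, "Log:\n", "Test 2 Result A: Passed\n")

with open(log_path, encoding="utf-8") as fh:
    final_text = fh.read()
print(final_text)
```

The new result line ends up directly above the "Log:" header, at the cost of rewriting the whole file on each insertion.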
As @Roku pointed out, there is no built-in way to insert data in a file without a rewrite. However, if you know the size of the region, i.e., if the text you want to write has a fixed length, then you can write an empty space in the file and replace it later. Check
this discussion on overwriting part of a file.
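A sketch of the fixed-length placeholder idea, again in Python rather than Qt. The slot width, the number of slots and the helper names are all assumptions made up for this example; the point is only that overwriting bytes in place never shifts the rest of the file.

```python
import os
import tempfile

SLOT_WIDTH = 40  # hypothetical fixed width of one result line, newline excluded

def write_skeleton(path, n_slots):
    """Create the file with blank fixed-width result slots; return their offsets."""
    offsets = []
    with open(path, "wb") as fh:
        fh.write(b"Results:\n")
        for _ in range(n_slots):
            offsets.append(fh.tell())
            fh.write(b" " * SLOT_WIDTH + b"\n")
        fh.write(b"Log:\n")
    return offsets

def fill_slot(path, offset, text):
    """Overwrite one reserved slot in place without changing the file size."""
    data = text.encode("ascii")
    if len(data) > SLOT_WIDTH:
        raise ValueError("result does not fit in the reserved slot")
    with open(path, "r+b") as fh:
        fh.seek(offset)
        fh.write(data.ljust(SLOT_WIDTH))  # pad with spaces to keep the width fixed

def append_log(path, line):
    """Log lines are simply appended at the end as usual."""
    with open(path, "ab") as fh:
        fh.write(line.encode("ascii") + b"\n")

slot_path = os.path.join(tempfile.mkdtemp(), "slots_log.txt")
slots = write_skeleton(slot_path, n_slots=2)
append_log(slot_path, "Setting up instruments.")
fill_slot(slot_path, slots[0], "Test 1 Result A: Passed")

with open(slot_path, "rb") as fh:
    slot_lines = fh.read().decode("ascii").split("\n")
print(slot_lines)
```

The trade-off is that the number and maximum length of results must be known up front, and unfilled slots show up as blank lines.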
I ended up going with the "write the file from scratch" method that I mentioned being hesitant about in my question. The benefit of this technique is that it results in a single file, even in the event of a crash since the log and the results are never placed in different files to begin with. Additionally, rewriting the file only happens when adding new results (an infrequent occurrence), whereas updating the log means simply appending text to the file as usual. I'm still a bit surprised that there isn't a way to have the OS insert text into a file for you.
Oh, and for those of you who absolutely must have this functionality as efficiently as possible, the following might be of use:
http://www.codeproject.com/Articles/17716/Insert-Text-into-Existing-Files-in-C-Without-Temp
You just cannot add more stuff in the middle of a file. I would go with two separate files, one for the results and one for the logs.
