Snakemake WorkflowError: Target rules may not contain wildcards

rule all:
    input:
        "../data/A_checkm/{genome}"

rule A_checkm:
    input:
        "../data/genomesFna/{genome}_genomic.fna.gz"
    output:
        directory("../data/A_checkm/{genome}")
    threads:
        16
    resources:
        mem_mb = 40000
    shell:
        """
        # setup a tmp working dir
        tmp=$(mktemp -d)
        mkdir $tmp/ref
        cp {input} $tmp/ref/genome.fna.gz
        cd $tmp/ref
        gunzip -c genome.fna.gz > genome.fna
        cd $tmp
        # run checkm
        checkm lineage_wf -t {threads} -x fna ref out > stdout
        # prepare output folder
        cd {config[project_root]}
        mkdir -p {output}
        # copy results over
        cp -r $tmp/out/* {output}/
        cp $tmp/stdout {output}/checkm.txt
        # cleanup
        rm -rf $tmp
        """
I would like to run checkm on a list of ~600 downloaded genome files having the extension '.fna.gz'. Each downloaded file is saved in a separate folder with the same name as the genome. I would also like to have all the results in a separate folder for each genome, which is why my output is a directory.
When I run this code with 'snakemake -s Snakefile --cores 10 A_checkm', I get the following error:
WorkflowError: Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).
Could anyone help me identify the error, please? Thank you in advance for your help!

You need to provide Snakemake with concrete values for the {genome} wildcard. You cannot just leave it open and expect Snakemake to work on all the files in some folder of your project.
Determine the filenames/genome values of the files which you want to work on, using glob_wildcards(...). See the documentation for further details.
Now you can use these values in rule all to request all the output folders (created by your other rule) for those {genome} values:
# Determine the {genome} for all downloaded files
(GENOMES,) = glob_wildcards("../data/genomesFna/{genome}_genomic.fna.gz")

rule all:
    input:
        expand("../data/A_checkm/{genome}", genome=GENOMES),

rule A_checkm:
    input:
        "../data/genomesFna/{genome}_genomic.fna.gz",
    output:
        directory("../data/A_checkm/{genome}"),
    threads: 16
    resources:
        mem_mb=40000,
    shell:
        # Your magic goes here
If the download is supposed to happen inside snakemake, add a checkpoint for that. Have a look at this answer then.
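For illustration, here is a minimal sketch of the checkpoint variant, assuming a hypothetical download_genomes checkpoint (its name and shell command are placeholders, not from the original post) combined with the A_checkm rule above:

# Hedged sketch: download_genomes and its shell command are placeholders.
checkpoint download_genomes:
    output:
        directory("../data/genomesFna")
    shell:
        "mkdir -p {output} && my_download_script {output}"  # hypothetical command

def checkm_results(wildcards):
    # Only evaluated after the checkpoint has run, so the genome list
    # reflects whatever was actually downloaded.
    genomes_dir = checkpoints.download_genomes.get(**wildcards).output[0]
    (genomes,) = glob_wildcards(genomes_dir + "/{genome}_genomic.fna.gz")
    return expand("../data/A_checkm/{genome}", genome=genomes)

rule all:
    input:
        checkm_results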

Related

How to select all files from one sample?

I have a problem figuring out how to make the input directive select only the files belonging to one {samples} value in the rule below.
rule MarkDup:
    input:
        expand("Outputs/MergeBamAlignment/{samples}_{lanes}_{flowcells}.merged.bam", zip,
               samples=samples['sample'],
               lanes=samples['lane'],
               flowcells=samples['flowcell']),
    output:
        bam = "Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics = "Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk --java-options -Djava.io.tempdir=`pwd`/tmp \
        MarkDuplicates \
        $(echo ' {input}' | sed 's/ / --INPUT /g') \
        -O {output.bam} \
        --VALIDATION_STRINGENCY LENIENT \
        --METRICS_FILE {output.metrics} \
        --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 200000 \
        --CREATE_INDEX true \
        --TMP_DIR Outputs/MarkDuplicates/tmp"
Currently it will create correctly named output files, but it selects all files that match the pattern based on all wildcards. So I'm perhaps halfway there. I tried changing {samples} to {{samples}} in the input directive as such:
expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", zip,
lanes=samples['lane'],
flowcells=samples['flowcell']),`
but this broke the previous rule somehow. So the solution is something like
    input:
        "{sample}_*.bam"
But clearly this doesn't work.
Is it possible to collect all files that match {sample}_*.bam with a function and use that as input? And if so, will the function still work with $(echo ' {input}' etc...) in the shell directive?
If you just want all the files in the directory, you can use a lambda function
from glob import glob

rule MarkDup:
    input:
        lambda wcs: glob('Outputs/MergeBamAlignment/%s*.bam' % wcs.samples)
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
    shell:
        ...
Just be aware that this approach can't do any checking for missing files, since it will always report that the files needed are the files that are present. If you do need confirmation that the upstream rule has been executed, you can have the previous rule touch a flag, which you then require as input to this rule (though you don't actually use the file for anything other than enforcing execution order).
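As an illustration of that flag idea, here is a hedged sketch: the merge_done rule, the SAMPLE_UNITS mapping and the trimmed gatk call are made up for the example, not taken from the original pipeline.

from glob import glob

# Hedged sketch: SAMPLE_UNITS stands in for the 'samples' table in the post,
# mapping each sample name to its (lane, flowcell) pairs.
SAMPLE_UNITS = {"S1": [("L1", "F1"), ("L2", "F1")]}

rule merge_done:
    input:
        # all merged BAMs expected for this sample
        lambda wcs: [
            f"Outputs/MergeBamAlignment/{wcs.samples}_{lane}_{fc}.merged.bam"
            for lane, fc in SAMPLE_UNITS[wcs.samples]
        ]
    output:
        touch("Outputs/MergeBamAlignment/{samples}.done")
    shell:
        "true"  # no-op; Snakemake touches the flag after the rule succeeds

rule MarkDup:
    input:
        # the flag enforces execution order; the glob collects the actual files
        flag="Outputs/MergeBamAlignment/{samples}.done",
        bams=lambda wcs: glob("Outputs/MergeBamAlignment/%s_*.merged.bam" % wcs.samples),
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk MarkDuplicates $(echo ' {input.bams}' | sed 's/ / --INPUT /g') "
        "-O {output.bam} --METRICS_FILE {output.metrics}"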
If I understand correctly, zip needs to be applied only to {lanes} and {flowcells} and not to {samples}. In that case, using two expand instances can achieve that.
input:
    expand(expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam",
                  zip, lanes=samples['lane'], flowcells=samples['flowcell']),
           samples=samples['sample'])
PS: output.tmp file uses {sample} instead of {samples}. Typo?

How to make a single makefile that applies the same command to sub-directories?

For clarity, I am running this on windows with GnuWin32 make.
I have a set of directories with markdown files in them at several different levels - theoretically they could be in the branch nodes, but I think currently they are only in the leaf nodes. I have a set of pandoc/LaTeX commands to run to turn the markdown files into PDFs, and obviously I only want to recreate the PDFs if the markdown file has been updated, so a makefile seems appropriate.
What I would like is a single makefile in the root, which iterates over any and all sub-directories (to any depth) and applies the make rule I'll specify for running pandoc.
From what I've been able to find, recursive makefiles require you to have a makefile in each sub-directory (which seems like an administrative overhead that I would like to avoid) and/or require you to list out all the sub-directories at the start of the makefile (again, would prefer to avoid this).
Theoretical folder structure:
root
|-make
|-Folder A
| |-File1.md
| \-File2.md
|-Folder C
| \-File3.md
\-Folder D
  |-Folder E
  | \-File4.md
  \-Folder F
    \-File5.md
How do I write a makefile to deal with this situation?
Here is a small set of Makefile rules that hopefully would get you going:
%.pdf : %.md
	pandoc -o $@ --pdf-engine=xelatex $^

PDF_FILES=FolderA/File1.pdf FolderA/File2.pdf \
	FolderC/File3.pdf FolderD/FolderE/File4.pdf FolderD/FolderF/File5.pdf

all: ${PDF_FILES}
Let me explain what is going on here. First we have a pattern rule that tells make how to convert a Markdown file to a PDF file. The --pdf-engine=xelatex option is here just for the purpose of illustration.
Then we need to tell Make which files to consider. We put the names together in a single variable PDF_FILES. The value for this variable can be built via a separate script that scans all subdirectories for .md files.
Note that one has to be extra careful if filenames or directory names contain spaces.
Then we ask Make to check if any of the PDF_FILES should be updated.
If you have other targets in your makefile, make sure that all is the first non-pattern target, or call make as make all
Updating the Makefile
If the shell function works for you and basic utilities such as sed and find are available, you can make your makefile dynamic with a single line.
%.pdf : %.md
	pandoc -o $@ --pdf-engine=xelatex $^

PDF_FILES:=$(shell find -name "*.md" | xargs echo | sed 's/\.md/\.pdf/g' )

all: ${PDF_FILES}
MadScientist suggested just that in the comments.
Otherwise you could implement a script using the tools available on your operating system and add an additional target update: that would compute the list of files and replace the line starting with PDF_FILES with an updated list of files.
The final version of the code that worked for Windows, based on @DmitiChubarov's and @MadScientist's suggestions, is as follows:
%.pdf: %.md
	pandoc $^ -o $@

PDF_FILES:=$(shell dir /s /b *.md | sed "s/\.md/\.pdf/g")

all: ${PDF_FILES}

Rsync all files (recursively) from one dir to another, maintaining only a portion of the original dir structure

I have two directories:
Directory #1, 'C'
C's absolute path:
/A/B/C
Directory #2, 'T'
T's absolute path:
/Q/R/T
I want to use rsync to copy all files, recursively, from C into T, while maintaining the original directory structure, but only from B onwards.
Example to make it clearer: suppose 'B' has only 3 files nested within it:
/A/B/f1.txt
/A/B/C/f2.txt
/A/B/C/D/f3.txt
Then I want to end up with only f2.txt and f3.txt being copied over, with the final filepaths as follows (notice how I keep the directory structure, only from B onwards):
/Q/R/T/B/C/f2.txt
/Q/R/T/B/C/D/f3.txt
Here is the catch: I must execute the rsync cmd from within /Q/R/. So when I execute this command, my pwd must be /Q/R/.
Can anyone help me figure out how to do this?
[If I did not have this constraint on where my cwd must be, I could cd to /A/B and then execute: rsync . /Q/R/T/ --recursive --relative. Unfortunately, I cannot do that for reasons that would take a lot of pointless explaining here. And when I try to execute rsync /A/. /Q/R/T/ --recursive --relative, I end up with not only everything within A, but also that first part of the dir structure (/A/) that I don't want. (Note: in the real-life scenario the dir structure is much more complex than this; this is just the general problem.)]
The rsync command includes a couple of options which are suitable for this scenario. They are:
--include=PATTERN - Don't exclude files matching PATTERN
--exclude=PATTERN - Exclude files matching PATTERN
An excellent description and examples of the --exclude flag can be found here.
Solution
Given the directory structures provided in your question and your pwd being set to /Q/R/, running the following command will meet your requirement:
rsync ../../A/ T/ --recursive --include A/B/** --exclude B/*.*
Edit:
If you do want /A/B/f1.txt to be copied to /Q/R/T/B/f1.txt (this is unclear in your question because you don't show it in the "I want to end up with" example), then omit the --exclude B/*.* part, so the complete command is reduced to:
rsync ../../A/ T/ --recursive --include A/B/**
or reduced even further in complexity to just:
rsync ../../A/** T/ --recursive
Explanation of the command
../../A/
The first argument provides the path to the source directory, i.e. the relative position within the hierarchical tree of names (based on your pwd being /Q/R).
T/
The second argument provides the path to the destination directory. Again this is a relative position within the hierarchical tree of names (and is also based on the pwd being /Q/R).
--recursive
The first option is to recurse into the directories.
--include A/B/**
This says that you want to include all the assets (files/folders), however many levels deep, from within the folder named B which resides inside folder A.
--exclude B/*.*
This says that you want to exclude any assets (files/folders) whose name includes a dot [.] plus extension and which reside inside folder B (at the top level). This will prevent the file named f1.txt from being copied. You could be even more specific here and use --exclude B/f1.txt instead; however, I'm assuming in real life you perhaps have additional files you want to exclude here too.
Additional notes
Both the --include and --exclude options can be utilized multiple times. This can be very useful for some scenarios too as it enables you to be specific about what to include and/or exclude during the copy process.
For example, let's assume that your source directory /A/B/ (as described in your question) also contains a folder named X, so its path is A/B/X.
Let's say that we also do not want to copy this folder named X (in the same way as you currently do not want to copy /A/B/f1.txt).
For this scenario we add another --exclude option as follows:
rsync ../../A/ T/ --recursive --include A/B/** --exclude B/*.* --exclude X/
Note the additional --exclude X/ at the end.
You mention...
(Note: in the real-life scenario the dir structure is much more complex than this, this is just the general problem.)
... in your question, so you may find it necessary to add additional --exclude=PATTERN options to truly meet your requirements.
Grunt
As you have included the gruntjs tag with your question, you may want to consider utilizing plug-ins which can run shell commands like rsync, such as:
grunt-shell
grunt-exec

Makefile rule depend on directory content changes

Using Make, is there a nice way to depend on a directory's contents?
Essentially I have some generated code which the application code depends on. The generated code only needs to change if the contents of a directory changes, not necessarily if the files within change their content. So if a file is removed or added or renamed I need the rule to run.
My first thought is to generate a text file listing of the directory and diff that with the last listing. A change means rerun the build. I think I will have to pass off the generate-and-diff part to a bash script.
I am hoping someone in their infinite intelligence might have an easier solution.
Kudos to gjulianm who got me on the right track. His solution works perfectly for a single directory.
To get it working recursively I did the following.
ASSET_DIRS = $(shell find ../../assets/ -type d)
ASSET_FILES = $(shell find ../../assets/ -type f -name '*')

codegen: ../../assets/ $(ASSET_DIRS) $(ASSET_FILES)
	generate-my-code
It appears now any changes to the directory or files (add, delete, rename, modify) will cause this rule to run. There is likely some issue with file names here (spaces might cause issues).
Let's say your directory is called dir, then this makefile will do what you want:
FILES = $(wildcard dir/*)

codegen: dir # Add $(FILES) here if you want the rule to run on file changes too.
	generate-my-code
As the comment says, you can also add the FILES variable if you want the code to depend on file contents too.
A disadvantage of having the rule depend on a directory is that any change to that directory will cause the rule to be out-of-date — including creating generated files in that directory. So unless you segregate source and target files into different directories, the rule will trigger on every make.
Here is an alternative approach that allows you to specify a subset of files for which additions, deletions, and changes are relevant. Suppose for example that only *.foo files are relevant.
# replace indentation with tabs if copy-pasting
.PHONY: codegen
codegen:
    find . -name '*.foo' | sort > .filelist.new
    diff .filelist.current .filelist.new || cp -f .filelist.new .filelist.current
    rm -f .filelist.new
    $(MAKE) generate

generate: .filelist.current $(shell cat .filelist.current)
    generate-my-code

.PHONY: clean
clean:
    rm -f .filelist.*
The second line in the codegen rule ensures that .filelist.current is only modified when the list of relevant files changes, avoiding false-positive triggering of the generate rule.

Can I symlink multiple directories into one?

I have a feeling that I already know the answer to this one, but I thought I'd check.
I have a number of different folders:
images_a/
images_b/
images_c/
Can I create some sort of symlink such that a new directory has the contents of all those directories? That is, this new "images_all" would contain all the files in images_a, images_b and images_c?
No. You would have to symbolically link all the individual files.
What you could do is create a job to run periodically which basically removes all of the existing symbolic links in images_all, then re-creates the links for all files from the three other directories. It's a bit of a kludge, something like this:
rm -f images_all/*
for i in images_[abc]/* ; do ln -s "$PWD/$i" images_all/"$(basename "$i")" ; done
Note that, while this job is running, it may appear to other processes that the files have temporarily disappeared.
You will also need to watch out for the case where a single file name exists in two or more of the directories.
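For example, a quick way to spot such name collisions before creating the links (a hedged sketch, assuming the images_[abc] layout used above):

# print any basename that occurs in more than one of the source folders
for i in images_[abc]/* ; do basename "$i" ; done | sort | uniq -d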
Having come back to this question after a while, it also occurs to me that you can minimise the time during which the files are not available.
If you link them into a different directory first and then do two relatively fast mv operations, the files are only unavailable for a very short time. Something like:
mkdir images_new
for i in images_[abc]/* ; do
    ln -s "$PWD/$i" images_new/"$(basename "$i")"
done
# These next two commands are the minimal-time switchover.
mv images_all images_old
mv images_new images_all
rm -rf images_old
I haven't tested that so anyone implementing it will have to confirm the suitability or otherwise.
You could try a unioning file system like unionfs!
http://www.filesystems.org/project-unionfs.html
http://aufs.sourceforge.net/
To add on to paxdiablo's great answer, I think you could use cp -s (-s or --symbolic-link), which makes symbolic links instead of literal copies, to maybe speed up or simplify the bulk adding of symlinks to the "merge" folder for the files from the other folders (I have not tested this, though).
I can't recall off the top of my head, but I'm sure there is some option for cp to NOT overwrite existing files, so that only symlinks for new files will be created with cp -s.
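For example, a hedged sketch of that idea using GNU cp: -s makes symlinks and -n (--no-clobber) skips names that already exist in the destination. Note that cp -s needs absolute source paths unless the links are created in the current directory, so the /path/to/ prefixes below are placeholders to adjust:

# create (or update) a merged folder of symlinks without overwriting existing ones
mkdir -p images_all
cp -sn /path/to/images_a/* /path/to/images_b/* /path/to/images_c/* images_all/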
