Variable Template Files with Dictionary

I've looked through the variable-definition pages but couldn't find a working example of using a dictionary in a template file.
The following error is shown when I press Run pipeline:
Encountered error(s) while parsing pipeline YAML:
/var.yaml (Line: 3, Col: 5): A mapping was not expected
This is what I'm looking to do:
# var.yaml file
variables:
  appServicePlanObj:
    planName: "name-goes-here"
    skuName: S1
    skuTier: Standard
    skuSize: S1
    skuFamily: S
    skuCapacity: 1
# pipeline.yaml file
stages:
- stage: stage_Id
  displayName: Stage Name
  variables:
  - template: "var.yaml"
  jobs:
  - deployment: deploymentId
Ideally, I want to create more complex objects with nested arrays and objects.
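For context on the error: Azure Pipelines variables are flat name/value string pairs, which is why a nested mapping under variables fails to parse. Below is a minimal sketch of one commonly used alternative, keeping the object in a parameter (parameters accept type: object) instead of a variable; the job and step names are made up for illustration and are not from the question:

# pipeline.yaml - hypothetical sketch; complex objects go into parameters,
# which accept type: object, while variables stay flat strings
parameters:
- name: appServicePlanObj
  type: object
  default:
    planName: "name-goes-here"
    skuName: S1
    skuTier: Standard
    skuSize: S1
    skuFamily: S
    skuCapacity: 1

stages:
- stage: stage_Id
  displayName: Stage Name
  jobs:
  - job: show_plan
    steps:
    # individual fields are read with template expression syntax
    - script: echo "${{ parameters.appServicePlanObj.planName }} / ${{ parameters.appServicePlanObj.skuName }}"

Template files can likewise declare an object parameter of their own and receive the object when they are included, which is the usual route when the values need to be shared across pipelines.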

Related

Snakemake wildcards: Using wildcarded files from directory output

I'm new to Snakemake and am trying to use specific files in a rule, taken from the directory() output of another rule that clones a git repo.
Currently, this gives me the error "Wildcards in input files cannot be determined from output files: 'json_file'", and I don't understand why. I have previously worked through the tutorial at https://carpentries-incubator.github.io/workflows-snakemake/index.html.
The difference between my workflow and the tutorial workflow is that I want to create the data I use later in the first step, whereas in the tutorial, the data was already there.
Workflow description in plain text:
Clone a git repository to path {path}
Run a script {script} on every JSON file in the directory {path}/parsed/ in parallel to produce the aggregate result {result}
GIT_PATH = config['git_local_path']  # git/
PARSED_JSON_PATH = f'{GIT_PATH}parsed/'
GIT_URL = config['git_url']

# A single parsed JSON file
PARSED_JSON_FILE = f'{PARSED_JSON_PATH}{{json_file}}.json'
# Build a list of parsed JSON file names
PARSED_JSON_FILE_NAMES = glob_wildcards(PARSED_JSON_FILE).json_file
# All parsed JSON files
ALL_PARSED_JSONS = expand(PARSED_JSON_FILE, json_file=PARSED_JSON_FILE_NAMES)

rule all:
    input: 'result.json'

rule clone_git:
    output: directory(GIT_PATH)
    threads: 1
    conda: f'{ENVS_DIR}git.yml'
    shell: f'git clone --depth 1 {GIT_URL} {{output}}'

rule extract_json:
    input:
        cmd='scripts/extract_json.py',
        json_file=PARSED_JSON_FILE
    output: 'result.json'
    threads: 50
    shell: 'python {input.cmd} {input.json_file} {output}'
Running only clone_git works fine (if I make GIT_PATH the input of rule all).
Why do I get the error message? Is this because the JSON files don't exist when the workflow is started?
Also - I don't know if this matters - this is a subworkflow used with module.
What you need seems to be a checkpoint rule: the checkpoint is executed first, and only then does Snakemake determine which .json files are present and run your extract/aggregate rules. Here's an adapted example:
I'm struggling to fully understand the file and folder structure you get after cloning your git repo, so I have fallen back on the Snakemake best practice of using resources/ for downloaded files and results/ for created files.
You'll need to re-adjust those paths to match your case:
GIT_PATH = config["git_local_path"] # git/
GIT_URL = config["git_url"]
checkpoint clone_git:
output:
git=directory(GIT_PATH),
threads: 1
conda:
f"{ENVS_DIR}git.yml"
shell:
f"git clone --depth 1 {GIT_URL} {{output.git}}"
rule extract_json:
input:
cmd="scripts/extract_json.py",
json_file="resources/{file_name}.json",
output:
"results/parsed_files/{file_name}.json",
shell:
"python {input.cmd} {input.json_file} {output}"
def get_all_json_file_names(wildcards):
git_dir = checkpoints.clone_git.get(**wildcards).output["git"]
file_names = glob_wildcards(
"resources/{file_name}.json"
).file_name
return expand(
"results/parsed_files/{file_name}.json",
file_name=file_names,
)
# Rule has checkpoint dependency: Only after the checkpoint is executed
# the rule is executed which then evaluates the function to determine all
# json files downloaded from the git repo
rule aggregate:
input:
get_all_json_file_names
output:
"result.json",
default_target: True
shell:
# TODO: Action which combines all JSON files
edit: Moved the expand(...) from rule aggregate into get_all_json_file_names.
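To make the sketch runnable end to end, the placeholder under shell: has to be replaced with whatever aggregation you actually need. The run: block below is only a hypothetical example (merging everything into one JSON array); it is not part of the original answer:

rule aggregate:
    input:
        get_all_json_file_names
    output:
        "result.json",
    default_target: True
    run:
        # hypothetical aggregation: read every per-file result and write
        # them out together as a single JSON array
        import json
        merged = [json.load(open(path)) for path in input]
        with open(output[0], "w") as fh:
            json.dump(merged, fh, indent=2)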

How to use a config group multiple times, while overriding each instance

Here is my current config structure
hydra/
  pipeline/
    common/
      feature.yaml
  stage/
    train.yaml
with the following files:
train.yaml
# @package _global_
defaults:
  - _self_
  - ../pipeline/common@train: feature
  - ../pipeline/common@val: feature

train:
  conf:
    split: train

val:
  conf:
    split: val

pipeline:
  - ${oc.dict.values: train.steps}
  - ${oc.dict.values: val.steps}
feature.yaml
conf:
  split: train

steps:
  tabular:
    name: "${conf.split}-tabular"
    class: FeatureGeneration
    dataset:
      datasources: [ "${conf.split}_split" ]
What I've accomplished:
I've been able to figure out how to use the config group multiple times utilizing the defaults in train.yaml.
What I'm stuck on:
I'm getting an error: InterpolationKeyError 'conf.split' not found
I do realize that imports are absolute. If I put @package common.feature at the beginning of feature.yaml I can reference conf.split via common.feature.conf.split, but is there not a cleaner way? I tried relative imports but got the same error.
I can't seem to override conf.split from train.yaml. You can see where I set train.conf.split and val.conf.split but these do not get propagated. What I need to be able to do is have each instance of the config group utilize a different conf.split value. This is the biggest issue I'm facing.
What I've referenced so far:
The following resources have gotten me to where I am so far, but I'm still having trouble with what's listed above.
Hydra : how to assign config files from same group to two different fields
https://hydra.cc/docs/advanced/overriding_packages/
https://hydra.cc/docs/patterns/extending_configs/
Interpolation is not an import, and it's evaluated when you access the config node. At that point your config is already composed, so it should be straightforward to use either absolute interpolation (the default) or relative interpolation, depending on the structure of your final config.
Hard to be 100% sure, but I suspect this problem is because your defaults list has _self_ at the beginning. This means that the content of the config containing the defaults list is overridden by what comes after it in the defaults list.
Try to move _self_ to the end:
# @package _global_
defaults:
  - ../pipeline/common@train: feature
  - ../pipeline/common@val: feature
  - _self_

# ...
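For the InterpolationKeyError itself, one option (a sketch, assuming each instance of feature.yaml is packaged under train or val as in the defaults list above) is relative interpolation inside feature.yaml, so the reference resolves within whichever package the file lands in:

# feature.yaml - leading dots count levels up from the node holding the value,
# so ${...conf.split} under steps.tabular resolves to <package>.conf.split
# (train.conf.split or val.conf.split after composition)
conf:
  split: train

steps:
  tabular:
    name: "${...conf.split}-tabular"
    class: FeatureGeneration
    dataset:
      # one level deeper, so one extra dot is needed
      datasources: [ "${....conf.split}_split" ]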

define SAMPLE for different dir name and sample name in snakemake code

I have written Snakemake code to run bwa_map. The FASTQ files are in different folders and have different sample names (paired end). It shows the error 'SAMPLES' is not defined. Please help.
Error:
$ snakemake --snakefile rnaseq.smk mapped_reads/EZ-123-B_IGO_08138_J_2_S101_R2_001.bam -np
NameError in line 2 of /Users/singhh5/Desktop/tutorial/rnaseq.smk:
name 'SAMPLES' is not defined
  File "/Users/singhh5/Desktop/tutorial/rnaseq.smk", line 2, in <module>
#SAMPLE DIRECTORY
fastq
  Sample_EZ-123-B_IGO_08138_J_2
    EZ-123-B_IGO_08138_J_2_S101_R1_001.fastq.gz
    EZ-123-B_IGO_08138_J_2_S101_R2_001.fastq.gz
  Sample_EZ-123-B_IGO_08138_J_4
    EZ-124-B_IGO_08138_J_4_S29_R1_001.fastq.gz
    EZ-124-B_IGO_08138_J_4_S29_R2_001.fastq.gz
#My Code
expand("~/Desktop/{sample}/{rep}.fastq.gz", sample=SAMPLES)

rule bwa_map:
    input:
        "data/genome.fa",
        "fastq/{sample}/{rep}.fastq"
    conda:
        "env.yaml"
    output:
        "mapped_reads/{rep}.bam"
    threads: 8
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
The specific error you are seeing is because the variable SAMPLES isn't set to anything before you use it in expand.
Some other issues you may run into:
Output file is missing the {sample} wildcard.
The value of threads isn't passed into bwa or samtools
You should place your expand into the input directive of the first rule in your snakefile, typically called all, to properly request the files from bwa_map.
You aren't pairing your reads (R1 and R2) in bwa.
You should look around Stack Overflow or some GitHub projects for similar rules to give you inspiration on how to do this mapping; a rough sketch along those lines follows.
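A minimal sketch pulling these points together, assuming the fastq/Sample_*/..._R1_001.fastq.gz / ..._R2_001.fastq.gz layout shown above; the wildcard names and exact paths are assumptions you would need to adapt:

# Derive the directory and read-pair name from the files that exist, e.g.
# sample="Sample_EZ-123-B_IGO_08138_J_2", rep="EZ-123-B_IGO_08138_J_2_S101"
SAMPLES, REPS = glob_wildcards("fastq/{sample}/{rep}_R1_001.fastq.gz")

rule all:
    input:
        expand("mapped_reads/{sample}/{rep}.bam", zip, sample=SAMPLES, rep=REPS)

rule bwa_map:
    input:
        ref="data/genome.fa",
        r1="fastq/{sample}/{rep}_R1_001.fastq.gz",
        r2="fastq/{sample}/{rep}_R2_001.fastq.gz",
    output:
        # keep both wildcards in the output so Snakemake can resolve the inputs
        "mapped_reads/{sample}/{rep}.bam"
    conda:
        "env.yaml"
    threads: 8
    shell:
        "bwa mem -t {threads} {input.ref} {input.r1} {input.r2} "
        "| samtools view -Sb - > {output}"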

Error when inserting a UUID into YAML using "!!python/object"

For an automated test script, I would like to generate random UUID values at runtime, so I added some YAML that looks like this:
---
applicant:
  idNumbers:
    nationalId: !!python/object:uuid.uuid4
However, this generates an error when I try to yaml.load the value:
ConstructorError: expected a mapping node, but found scalar
  in "<unicode string>", line 4, column 17:
        nationalId: !!python/object:uuid.uuid4
                    ^
How do I inject a UUID value via YAML tags?
I found the error message a bit intimidating at first, but after some thought, I was able to unpack it.
The parser is expecting a "mapping" node, not a scalar. So, what happens if I add a mapping?
>>> yaml.load('''---
... applicant:
...   idNumbers:
...     nationalId: !!python/object:uuid.uuid4 {}''')
{'applicant': {'idNumbers': {'nationalId': UUID('71e09d1d-e84e-4ea6-855d-be1a2e91b60a')}}}
Additional info: http://yaml.org/type/map.html
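One caveat for newer installs: since PyYAML 5.1 the default and full loaders refuse arbitrary !!python/object tags, so this approach needs the unsafe loader explicitly. A small sketch of that (not part of the original answer):

import yaml

doc = """---
applicant:
  idNumbers:
    nationalId: !!python/object:uuid.uuid4 {}
"""

# !!python/object construction is only allowed by the unsafe loader in PyYAML >= 5.1
data = yaml.load(doc, Loader=yaml.UnsafeLoader)
print(data["applicant"]["idNumbers"]["nationalId"])  # prints a random UUID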

AFx Library library exception: File: ..\..\Dataset2\Dataset2.dataset cannot be found. . ( Error 1000 )

I have a simple Azure Machine Learning experiment, with two input blocks ("Enter Data Manually") that pass their input to an "Execute R Script" block that binds the two inputs.
When the two input values are the same, I get an AFx Library FileNotFound exception. When the two input values are different, everything works fine.
Here is the R code and the experiment outline.
d1 <- maml.mapInputPort(1) # class: data.frame
d2 <- maml.mapInputPort(2) # class: data.frame
print(d1)
print(class(d1))
print(d2)
print(class(d2))
The error I get when I set the same input data in the two input blocks, in more detail is the following:
[Critical] Error: Error 1000: AFx Library library exception: File: ..\..\Dataset2\Dataset2.dataset cannot be found.
[Critical] {"InputParameters":{"DataTable":[{"Rows":2,"Columns":1,"estimatedSize":12001280,"ColumnTypes":
{"System.Int32":1},"IsComplete":true,"Statistics":
{"0":[1.5,1.5,1.0,2.0,0.70710678118654757,2.0,0.0]}}]},"OutputParameters":
[],"ModuleType":"LanguageWorker","ModuleVersion":"
Version=6.0.0.0","AdditionalModuleInfo":"LanguageWorker, Version=6.0.0.0,
Culture=neutral, PublicKeyToken=69c3241e6f0468ca;
Microsoft.MetaAnalytics.LanguageWorker.LanguageWorkerClientRS;
RunRSNR","Errors":"Microsoft.Analytics.Exceptions.ErrorMapping+ModuleException:
Error 1000: AFx Library library exception: File: ..\\..\\Dataset2
\\Dataset2.dataset cannot be found. --->
Microsoft.Numerics.AFxLibraryFileNotFoundException: File: ..\\..\\Dataset2
\\Dataset2.dataset cannot be found.\r\n at
Microsoft.Analytics.IO.Local.DataTableReader..ctor(String filePath)\r\n at
Microsoft.MetaAnalytics.DllModuleHost.DataLab.Handlers.DataTableDatasetHandler.H
andleArgumentString(String argument, ParameterInfo paramInfo)\r\n at
Microsoft.MetaAnalytics.DllModuleHost.ParameterArgumentBinder.InitializeParamete
rValues(MethodInfo method, Dictionary`2 moduleDescription)\r\n at
Microsoft.MetaAnalytics.DllModuleHost.DllModuleMethod.Execute(Dictionary`2
moduleDescription)\r\n at
Microsoft.MetaAnalytics.DllModuleHost.Program.Main(String[] args)\r\n --- End
of inner exception stack trace ---","Warnings":[],"Duration":"00:00:00.5755180"}
Module finished after a runtime of 00:00:01.4722617 with exit code -2
Module failed due to negative exit code of -2
Any suggestion is much appreciated, Flo.
You should get rid of the R code referencing the second dataset because you only have one input dataset.
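For illustration only, a trimmed script along those lines might look like the sketch below; the single connected input port and the maml.mapOutputPort call are my assumptions, not part of the answer:

# sketch: read only the first input port and pass the data frame through
d1 <- maml.mapInputPort(1) # class: data.frame
print(d1)
print(class(d1))

# return the data frame on the module's output port
maml.mapOutputPort("d1")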

Resources