How to remove duplicate lines in YAML format configuration files?

I have a bunch of manifest/yaml files that may or may not have these key value pair duplicates:
...
app: activity-worker
app: activity-worker
...
I need to search through each of those files and find those duplicates so that I can remove one of them.
Note: I know that to replace a certain string (say, switch service: to app:) in all files of a directory (say, dev) I can run grep -l 'service:' dev/* | xargs sed -i "" 's/service:/app:/g'. Here, though, I'm looking for a relation between lines.

What you call YAML is not YAML. The YAML specification
very explicitly states that
keys in a mapping must be unique, and your keys are not:
The content of a mapping node is an unordered set of key: value node
pairs, with the restriction that each of the keys is unique. YAML
places no further restrictions on the nodes. In particular, keys may
be arbitrary nodes, the same node may be used as the value of
several key: value pairs, and a mapping could even contain itself as
a key or a value (directly or indirectly).
On the other hand, some libraries have implemented this incorrectly, choosing to overwrite
any previous value associated with a key with the later value. In your case, since
the values are the same, which value is taken doesn't really matter.
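PyYAML, for example, loads such input without complaint and silently keeps the last value (a minimal illustration, assuming PyYAML is installed):
import yaml  # PyYAML

# Duplicate keys are not rejected; the later value overwrites the earlier one.
print(yaml.safe_load("app: one\napp: two"))  # -> {'app': 'two'}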
Also, your block style representation is not the only way to represent the key-value pairs of a
mapping in "YAML"; these duplicates could also be represented in flow style, as
{...., app: activity-worker, app: activity-worker, .... }
with the two occurrences not necessarily next to each other, nor even on the same line. The
following is also semantically equivalent "YAML" to your input:
{...., app: activity-worker, app:
activity-worker, .... }
If you have such faulty "YAML" files, the best way to clean them up is
using the round-trip capabilities of
ruamel.yaml (disclaimer: I
am the author of that package), and its option to allow duplicate keys on loading, instead of the default raising an error (or warning) on such faulty input. You can install it for your Python (virtual environment) using:
pip install ruamel.yaml
Assuming your file is called input.yaml and it contains:
a: 1 # some duplicate keys follow
app: activity-worker
app: activity-worker
b: "abc"
You can run the following one-liner:
python -c "import sys; from ruamel.yaml import YAML; yaml = YAML(); yaml.preserve_quotes=yaml.allow_duplicate_keys=True; yaml.dump(yaml.load(open('input.yaml')), sys.stdout)"
to get:
a: 1 # some duplicate keys follow
app: activity-worker
b: "abc"
and if your input were like:
{a: 1, app: activity-worker, app:
activity-worker, b: "abc"}
the output would be:
{a: 1, app: activity-worker, b: "abc"}

Related

Set list of config nodes as value entries in yaml contrasting with structured configs in Hydra

I would like to get a list of configs as a (default) value entry
and use a structured schema to validate the input list.
E.g., in trainer.yaml:
defaults:
  - callbacks:
    - checkpointer
    - early_stopping
In callbacks/checkpointer.yaml and callbacks/early_stopping.yaml I have a link to appropriate structured configs as default values, e.g.:
# callbacks/checkpointer.yaml
defaults:
  - /trainer_lib/callbacks/base_checkpointer@_here_
The structured schema:
from dataclasses import dataclass
from typing import Any, List

from omegaconf import MISSING

@dataclass
class CheckpointerConfig:
    _target_: str = "some_library_class"
    data_dir: str = "folder"

@dataclass
class TrainerConfig:
    callbacks: List[Any] = MISSING
and config store:
cs = ConfigStore.instance()
cs.store(group="trainer_lib/callbacks", name="base_checkpointer", node=CheckpointerConfig)
I am not sure what the correct syntax is to accomplish this (what I tried fails); I get an omegaconf.errors.ConfigTypeError: Cannot merge DictConfig with ListConfig.
Is there a way to accomplish this? Thanks.
Discussion on this topic in this Hydra issue.
Are you on Hydra 1.0? This is actually supported in Hydra 1.1. Here is the documentation: https://hydra.cc/docs/next/patterns/select_multiple_configs_from_config_group

Snakemake specify a new wildcard in a new rule

I have input files:
Bob_1.fastq.gz
Bob_2.fastq.gz
Bob_3.fastq.gz
Bob_4.fastq.gz
Ron_1.fastq.gz
Ron_2.fastq.gz
Ron_3.fastq.gz
Ron_4.fastq.gz
I am running demultiplexing and trimming steps in one snakefile, like this:
workdir: "/path/to/dir/"

(SAMPLES,) = glob_wildcards('/path/to/dir/raw/{sample}.fastq.gz')

rule all:
    input:
        expand("demultiplex/{sample}.fastq.gz", sample=SAMPLES),
        expand("trimmed/{sample}.trimmed.fastq.gz", sample=SAMPLES)

rule sabre:
    input:
        infile="/path/to/dir/raw/{sample}.fastq.gz",
        barcodefile="files/{sample}.txt"
    output:
        unknownfile=temp("demultiplex/unknown_barcode_{sample}.fastq.gz"),
    shell:
        """
        /Tools/sabre-master2/sabre se -f {input.infile} -b {input.barcodefile} -u {output.unknownfile}
        """

rule trimmomatic_se:
    input:
        r="{sample}.fastq.gz"
    output:
        r="trimmed/{sample}.trimmed.fastq.gz"
    threads: 10
    shell:
        """java -jar /Tools/Trimmomatic-0.36/trimmomatic-0.36.jar SE -threads {threads} {input.r} {output.r} ILLUMINACLIP:/Tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"""
The demultiplex output files are like this:
Bob_1_CL1.fastq.gz.... Bob_1_CL345.fastq.gz
Bob_2_CL1.fastq.gz.... Bob_1_CL248.fastq.gz
Ron_1_dad1.fastq.gz... Ron_1_dad67.fastq.gz
and so on
So, if I do not specify the demultiplex output files, the program will create them by itself. My problem is how to specify/introduce a new wildcard from the output of the previous rule in the next trimming step, as the wildcards now differ from the initial samples.
Wildcards just need to be consistent within a rule, not across the workflow. The issue here is that you have a rule generating 'unknown' outputs that you need to process further; for that you need to use checkpoints.
Read through the second block of example code about aggregating in the Snakemake checkpoint documentation. Your checkpoint will be demultiplexing, and if you don't have any other steps, all will be your aggregate step that calls checkpoints.demultiplex.get. If you search for checkpoint on Stack Overflow you will find lots of examples; it's a hard feature to use at first! A rough sketch of the shape follows.
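Something like this (rule names, paths, and the directory() output are illustrative and untested, just the general checkpoint/aggregate pattern):
import os

checkpoint demultiplex:
    input:
        "raw/{sample}.fastq.gz"
    output:
        directory("demultiplex/{sample}")
    shell:
        "sabre se -f {input} ..."  # placeholder; writes an unknown number of files

def demux_units(wildcards):
    # Not evaluated until the checkpoint has finished, so the new {unit}
    # wildcard can be globbed from the files that actually exist.
    outdir = checkpoints.demultiplex.get(sample=wildcards.sample).output[0]
    return expand(
        "trimmed/{sample}_{unit}.trimmed.fastq.gz",
        sample=wildcards.sample,
        unit=glob_wildcards(os.path.join(outdir, "{unit}.fastq.gz")).unit,
    )

rule aggregate:
    input:
        demux_units
    output:
        touch("aggregated/{sample}.done")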

What *is* a salt formula, really?

I am trying to work through the Salt Formulas documentation and seem to be having a fundamental misunderstanding of what a salt formula really is.
Understandably, this question may seem like a duplicate of earlier questions, but because I'm failing to grasp the basic concepts, I'm also struggling to make use of their answers.
I thought that a salt formula is basically just a package that implements extra functions, a lot like
#include <string.h>
in C, or
import numpy as np
in Python. Thus, I thought, I could download the salt-formula-linux to /srv/formulas/salt-formula-linux/, add that to file_roots, restart the master (all as per the docs), and then write a file like swapoff.sls containing
disable_swap:
  linux:
    storage:
      swap:
        file:
          enabled: False
(the above is somewhat similar to the examples in the repo's root) in hope that the formula would then handle removing the swap entry from /etc/fstab and running swapoff -a for me. Needless to say, this didn't work, clearly because I'm not understanding what a salt formula is meant to be.
So, what is a salt formula and how do I use it? Can I make use of it as a library of functions too?
This answer might not be fully correct in all technicalities, but this is what solved my problem.
A salt formula is not a library of functions. It is, rather, a collection of state files. A state file can often be very simple, such as some of my user-defined ones:
--> top.sls <--
base:
  '*':
    - docker

--> docker.sls <--
install_docker_1703:
  pkgrepo.managed:
    # stuff
  pkg.installed:
    - name: docker-ce
creating a state file like
--> swapoff.sls <--
disable_swap:
  linux.storage.swap: # and so on
is, perhaps, not the way to go. Well, at least, maybe not for a beginner with limited knowledge.
Instead, add an item to top.sls:
- linux.storage.swap
This is not enough, however. Most formulas (or the state files within them, if you will) are highly parametrizable, i.e. they're full of placeholders with variable names, such as {{ swap.device }}. If there's nothing to fill this gap, the state file will not be able to do anything. These gaps are filled from pillars.
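Just to illustrate the templating mechanics (this is plain Jinja2, which Salt uses under the hood; the pillar dict here is a hand-rolled stand-in, and Salt wires this up for you internally):
from jinja2 import Template

# Stand-in for the pillar data that Salt would provide to the state file.
pillar = {"swap": {"device": "/swapfile", "size": 1024}}

# A state-file fragment with a placeholder, rendered the way Salt fills gaps.
print(Template("device: {{ swap.device }}").render(**pillar))
# -> device: /swapfile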
All that remains is to create a file like swap.sls in /srv/pillar/ that would contain something like (as per the examples of that formula)
linux:
  storage:
    enabled: true
    swap:
      file:
        enabled: true
        engine: file
        device: /swapfile
        size: 1024
and also /srv/pillar/top.sls with
base:
  '*':
    - swap
Perhaps /srv/pillar should also be included in pillar_roots in /etc/salt/master.
So now /srv/salt/top.sls runs /srv/formulas/salt-formula-linux/linux/storage/swap.sls, which, guided by /srv/pillar/top.sls, pulls its parameters from /srv/pillar/swap.sls and enables a swapfile.

How to delete an inherited property from a yaml config?

I have a yaml file like this:
local: &local
  image: xxx
  # *tons of config*

ci:
  <<: *local
  image: # delete
  build: .
I want ci to inherit all values from local, except the image.
Is there a way to "delete" this value?
No, there isn't a way to mark a key for deletion in a YAML file. You can only overwrite existing values.
And the latter is what you do: you associate the empty scalar as the value for the key image, as if you had written:
image: null # delete
There are two things you can do: post-process or make a base mapping in your YAML file.
If you want to post-process, you associate a special unique value with image, or a specially tagged object, and after loading you recursively walk over the tree to remove key-value pairs carrying this special value. Whether you can already do this during parsing, using hooks or by overwriting some of the parser's methods, depends on the parser.
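As a minimal sketch of that post-processing (plain PyYAML here; the sentinel string is an arbitrary convention, not a YAML feature, and the file name is hypothetical):
import yaml  # PyYAML; any loader producing plain dicts/lists will do

DELETE = "__DELETE__"  # arbitrary sentinel to put in the YAML file

def prune(node):
    # Recursively drop mapping entries and sequence items marked for deletion.
    if isinstance(node, dict):
        return {k: prune(v) for k, v in node.items() if v != DELETE}
    if isinstance(node, list):
        return [prune(v) for v in node if v != DELETE]
    return node

with open("config.yaml") as f:  # hypothetical file name
    cfg = prune(yaml.safe_load(f))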
Using a base mapping requires less work, but is more intrusive to the YAML file:
localbase: &lb
  # *tons of config*

local: &local
  <<: *lb
  image: xxx

ci:
  <<: *lb
  build: .
If you do the former, you should note that if you use a parser that preserves the "merge-hierarchy" on round-tripping (like my ruamel.yaml parser can do), it is not enough to delete the key-value pair; in that case the original from local would come back. Other parsers that simply resolve the merge at load time don't have this issue.
For properties that accept a list of values, you can pass [] as the value.
For example, in docker-compose, when you don't want to inherit ports:
service_1: &service_1
  # some other properties...
  ports:
    - "49281:22"
    - "8876:8000"
  # some other properties
  image: some_image:latest

service_2:
  <<: *service_1
  ports: []   # removes the ports values
  image: null # removes the image value

Easiest way to specify alternate transmogrifier _path?

I'm doing a content migration with collective.transmogrifier and I'm reading files off the file system with transmogrify.filesystem. Instead of importing the files "as is", I'd like to import them to a subdirectory in Plone. What is the easiest way to modify the _path?
For example, if the following exists:
/var/www/html/bar/index.html
I'd like to import to:
/Plone/foo/bar/index.html
In other words, import the contents of "bar" into a subdirectory "foo". I see two options:
Use some blueprint in collective.transmogrifier to mangle _path.
Write some blueprint to mangle _path.
Am I missing anything easier?
Use the standard inserter blueprint to generate the paths; it accepts python expressions and can replace keys in-place:
[manglepath]
blueprint = collective.transmogrifier.sections.inserter
key = string:_path
value = python:item['_path'].replace('/var/www/html', '/Plone/foo')
This takes the output of the value Python expression (which uses the item's existing _path) and stores it back under the same key.
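For completeness, option 2 (writing your own blueprint) is not much code either. A sketch of the usual transmogrifier section shape, with a hypothetical class name and hard-coded paths, and the ZCML registration omitted:
from zope.interface import classProvides, implements
from collective.transmogrifier.interfaces import ISection, ISectionBlueprint

class PathMangler(object):  # hypothetical name
    classProvides(ISectionBlueprint)
    implements(ISection)

    def __init__(self, transmogrifier, name, options, previous):
        self.previous = previous

    def __iter__(self):
        for item in self.previous:
            path = item.get('_path')
            if path:
                item['_path'] = path.replace('/var/www/html', '/Plone/foo')
            yield item  # always pass items down the pipeline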
