Passing multiple config groups

Passing multiple config groups - fb-hydra

In my config.yaml, how to pass two datasets for e.g. cifar and cinic at once? Can I pass a multiple config groups to the defaults list?
This is for the case when I want to train my model on a mix of datasets, but I do not want to create a config group for every possible combination.
├── config.yaml
└── dataset
├── cifar.yaml
└── imagenet.yaml
└── cinic.yaml
What I tried is as follows:
dataset:
- cifar
- cinic
which resulted in following error:
Could not load train_dataset/['cifar', 'cinic']. Available options 'cifar ... '

Currently config groups are mutually exclusive.
Support for this is planned for Hydra 1.1. See issue 499.
One possible workaround is to put everything in the config and to use interpolation:
all_datasets:
imagenet:
name: imagenet
cifar10:
name: cifar10
datasets:
- ${all_datasets.imagenet}
- ${all_datasets.cifar10}
This way you can override dataset to a different list of datasets from the command line (with the interpolation).
If you want to simplify the usage at the expense of some additional code, you can do something like:
all_datasets:
...
datasets_list:
- imagenet
- cifar10
datasets: []
#hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
for ds in cfg.datasets_list:
cfg.datasets.append(cfg.all_datasets[ds])
if __name__ == "__main__":
my_app()
I didn't test this but I hope you get the idea.

Related

How to use a config group multiple times, while overriding each instance

Here is my current config structure
hydra/
pipeline/
common/
feature.yaml
stage/
train.yaml
with the following files:
train.yaml
# #package _global_
defaults:
- _self_
- ../pipeline/common#train: feature
- ../pipeline/common#val: feature
train:
conf:
split: train
val:
conf:
split: val
pipeline:
- ${oc.dict.values: train.steps}
- ${oc.dict.values: val.steps}
feature.yaml
conf:
split: train
steps:
tabular:
name: "${conf.split}-tabular
class: FeatureGeneration
dataset:
datasources: [ "${conf.split}_split" ]
What I've accomplished:
I've been able to figure out how to use the config group multiple times utilizing the defaults in train.yaml.
What I'm stuck on:
I'm getting an error: InterpolationKeyError 'conf.split' not found
I do realize that imports are absolute. If I put #package common.feature at the beginning of feature.yaml I can import conf.split via common.feature.conf.split, but is there not a cleaner way? I tried relative imports but got the same error.
I can't seem to override conf.split from train.yaml. You can see where I set train.conf.split and val.conf.split but these do not get propagated. What I need to be able to do is have each instance of the config group utilize a different conf.split value. This is the biggest issue I'm facing.
What I've referenced so far:
The following resources have gotten me to where I am so far, but am still having trouble with what's listed above.
Hydra : how to assign config files from same group to two different fields
https://hydra.cc/docs/advanced/overriding_packages/
https://hydra.cc/docs/patterns/extending_configs/

Interpolation is not import and it's evaluated at when you access the config node. At that point your config is already composed so it should be straight forward to use either absolute interpolation (the default) or relative based on the structure of your final config.
Hard to be 100% sure, but I suspect this problem is because your defaults list has _self_ at the beginning. This means that the content of the config with containing the defaults list is overridden by what comes after in the defaults list.
Try to move _self_ to the end:
# #package _global_
defaults:
- ../pipeline/common#train: feature
- ../pipeline/common#val: feature
- _self_
#...

Django unittest: required mock patch dotted path varies depending on how one calls/runs the tests

It took me hours to figure out how to patch the below code. The path to it was very much unexpected.
Depending on how I run the tests, and which dir I am in, I find the dotted path to the module to patch changes. This is really bad for unittesting. Which makes me think I am doing it wrong.
The file structure related to the code is:
loaders.py <-- Has a load_palette() func required to be patched
typers.py <-- Has `from . loaders import load_palette`, and calls load_palette()
render.py <-- Has a func that calls the typers func
tests/test_render.py <-- Tests for render which calls a func in render, which calls a func in typers, which calls load_palette()
In the code below __package__.replace('.tests', '.typers.load_palette') takes the current path to the current package which could be:
bar.tests or
foo.bar.tests
or something else
and builds the dotted path relatively so that is is correct. This seems very hackish. How is one supposed to safe guard against these kind of issues?
Ideally the dotted path would be ..typers.load_palette but it did not accept the relative dotted path.
Heres the actual code:
# file: test_render.py
# Depending where one runs the test from, the path is different, so generate it dynamically
#mock.patch(__package__.replace('.tests', '.typers.load_palette'), return_value=mocks.palette)
class render_rule_Tests(SimpleTestCase):
def test_render_preset_rule(self, _): # _ = mocked_load_palette
...

files layout as following:
$ tree issue
issue
├── __init__.py
├── loaders.py
├── renders.py
├── tests
│   ├── __init__.py
│   └── test_render.py
├── run_tests.sh
└── typers.py
1 directory, 7 files
the root package is issue, you should always import modules from issue, and patch issue.xxx.yyy.
then run pytest (or some other unittest tools) from the same path as tests resident.
for example, run_tests.sh is a shell script to run all test cases under tests.
and test_render may be like this
# file: test_render.py
# Depending where one runs the test from, the path is different, so generate it dynamically
#mock.patch('issue.typers.load_palette', return_value=mocks.palette)
class render_rule_Tests(SimpleTestCase):
def test_render_preset_rule(self, _): # _ = mocked_load_palette
...

You can add the path of the "tests" directory using sys.path.insert.
In the top of "tests/test_render.py" add:
import sys
sys.path.insert(0, "<path/to/the/folder/tests/>")
# Depending where one runs the test from, the path is different, so generate it dynamically
#mock.patch(__package__.replace('.tests', '.typers.load_palette'), return_value=mocks.palette)
class render_rule_Tests(SimpleTestCase):
def test_render_preset_rule(self, _): # _ = mocked_load_palette
...
This will add the path in system paths where python interpreter. From there, the python interpreter can locate the relative imports.
Note: The safest option would be to add the absolute path to the tests folder. However, if it's not possible, add the shortest relative path possible.

define SAMPLE for different dir name and sample name in snakemake code

I have written a snakemake code to run bwa_map. Fastq files are with different folder name and different sample name (paired end). It shows error as 'SAMPLES' is not defined. Please help.
Error:
$snakemake --snakefile rnaseq.smk mapped_reads/EZ-123-B_IGO_08138_J_2_S101_R2_001.bam -np
*NameError in line 2 of /Users/singhh5/Desktop/tutorial/rnaseq.smk:
name 'SAMPLES' is not defined
File "/Users/singhh5/Desktop/tutorial/rnaseq.smk", line 2, in *
#SAMPLE DIRECTORY
fastq
Sample_EZ-123-B_IGO_08138_J_2
EZ-123-B_IGO_08138_J_2_S101_R1_001.fastq.gz
EZ-123-B_IGO_08138_J_2_S101_R2_001.fastq.gz
Sample_EZ-123-B_IGO_08138_J_4
EZ-124-B_IGO_08138_J_4_S29_R1_001.fastq.gz
EZ-124-B_IGO_08138_J_4_S29_R2_001.fastq.gz
#My Code
expand("~/Desktop/{sample}/{rep}.fastq.gz", sample=SAMPLES)
rule bwa_map:
input:
"data/genome.fa",
"fastq/{sample}/{rep}.fastq"
conda:
"env.yaml"
output:
"mapped_reads/{rep}.bam"
threads: 8
shell:
"bwa mem {input} | samtools view -Sb -> {output}"

The specific error you are seeing is because the variable SAMPLES isn't set to anything before you use it in expand.
Some other issues you may run into:
Output file is missing the {sample} wildcard.
The value of threads isn't passed into bwa or samtools
You should place your expand into the input directive of the first rule in your snakefile, typically called all to properly request the files from bwa_map.
You aren't pairing your reads (R1 and R2) in bwa.
You should look around stackoverflow or some github projects for similar rules to give you inspiration on how to do this mapping.

How to find a common variable in a large number of databases using Stata

So I have a large number of databases (82) in Stata, that each contain around 1300 variables and several thousand observations. Some of these databases contain variables that give the mean or standard deviation of certain concepts. For example, a variable in such a dataset could be called "leverage_mean". Now, I want to know which datasets contain variables called concept_mean or concept_sd, without having to go through every dataset by hand.
I was thinking that maybe there is a way to loop through the databases looking for variables containing "mean" or "sd", unfortunately I have idea how to do this. I'm using R and Stata datafiles.

Yes, you can do this with a loop in stata as well as R. First, you should check out the stata command ds and the package findname, which will do many of the things described here and much more. But to show you what is happening "under the hood", I'll show the Stata code that can achieve this below:
/*Set your current directory to the location of your databases*/
cd "[your cd here]"
Save the names of the 82 databases to a list called "filelist" using stata's dir function for macros. NOTE: you don't specify what kind of file your database files are, so I'm assuming .xls. This command saves all files with extension ".xls" into the list. What type of file you save into the list and how you import your database will depend on what type of files you are reading in.
local filelist : dir . files "*.xls"
Then loop over all files to show which ones contain variables that end with "_sd" or "_mean".
foreach file of local filelist {
/*import the data*/
import excel "`file'", firstrow clear case(lower)
/*produce a list of the variables that end with "_sd" and "_mean"*/
cap quietly describe *_sd *_mean, varlist
if length("r(varlist)") > 0 {
/*If the database contains variables of interest, display the database file name and variables on screen*/
display "Database `file' contains variables: " r(varlist)
}
}
Final note, this loop will only display the database name and variables of interest contained within it. If you want to perform actions on the data, or do anything else, those actions need to be included in the position of the final "display" command (which you may or may not ultimately actually need).

You can use filelist, (from SSC) to create a dataset of files. To install filelist, type in Stata's Command window:
ssc install filelist
With a list of datasets in memory, you can then loop over each file and use describe to get a list of variables for each file. You can store this list of variables in a single string variable. For example, the following will collect the names of all Stata datasets shipped with Stata and then store for each the variables they contain:
findfile "auto.dta"
local base_dir = subinstr("`r(fn)'", "/a/auto.dta", "", 1)
dis "`base_dir'"
filelist, dir("`base_dir'") pattern("*.dta")
gen variables = ""
local nmatch = _N
qui forvalues i = 1/`nmatch' {
local f = dirname[`i'] + "/" + filename[`i']
describe using "`f'", varlist
replace variables = " `r(varlist)' " in `i'
}
leftalign // also from SSC, to install: ssc install leftalign
Once you have all this information in the data in memory, you can easily search for specific variables. For example:
. list filename if strpos(variables, " rep78 ")
+-----------+
| filename |
|-----------|
13. | auto.dta |
14. | auto2.dta |
+-----------+

The lookfor_all package (SSC) is there for that purpose:
cd "pathtodirectory"
lookfor_all leverage_mean
Just make sure the file extensions are in lowercase(.dta) and not upper.

Is it possible to specify which tests to choose from?

We have a vast amount of tests. We would like infinitest only to choose between tests that have been included in an .xml-file (i.e. a TestNG suite).
We do not want to put the annotation groups = { "shouldbetested" } in every testcase but rather feed the info from our .xml file into infinitest.
Is this possible?
Is it another tool that could do that for us?

you can use a regular expresstion to "not" skip a certain test:
(?!.*YourTest)

Infinitest can filter out the tests you don't want to run using regular expressions in the infinitest.filters file.
The infinitest.filters contains regular expressions (one per line) that match the test classes you want to filter. Put this file in the root of your project (a.k.a. the working directory), and Infinitest will filter those tests out.
Note that the class names include package names, so use .* in front to match any package.