Dynamically generate multiple tasks based on the output dictionary from a task in Airflow

I have a task whose output is a dictionary with a list as the value of each key:
@task(task_id="gen_dict")
def generate_dict():
    ...
    return output_dict  # output looks like this: {"A": ["aa", "bb", "cc"], "B": ["dd", "ee", "ff"]}

# my dag (not showing the part that creates the DAG and its properties)
start = DummyOperator(task_id="st")
end = DummyOperator(task_id="ed")

output = generate_dict()

for keys, values in output.items():
    for v in values:
        dm = DummyOperator(task_id=f"dm_{keys}_{v}")
        dm >> end

start >> output
For the sample output above, it should create 6 dummy tasks: dm_A_aa, dm_A_bb, dm_A_cc, dm_B_dd, dm_B_ee, dm_B_ff.
But right now I'm facing this import error:
AttributeError: 'XComArg' object has no attribute 'items'
Is it possible to do what I'm aiming for? If not, is it possible to do it using a flat list like ["aa", "bb", "cc", "dd", "ee", "ff"] instead?

The code in the question won't work as written because the loop shown runs when the DAG is parsed (which happens when the scheduler starts up and periodically thereafter), but the data it would loop over isn't known until the task that generates it is actually run.
There are ways to do something similar though.
AIP-42 added the ability to map list data into task kwargs in Airflow 2.3:
@task
def generate_lists():
    # presumably the data below would come from a query executed at runtime
    return [["aa", "bb", "cc"], ["dd", "ee", "ff"]]

@task
def use_list(the_list):
    for item in the_list:
        print(item)

with DAG(...) as dag:
    use_list.expand(the_list=generate_lists())
The code above will create two tasks with output:
aa
bb
cc
dd
ee
ff
In 2.4 the expand_kwargs function was added. It's an alternative to expand (shown above) that operates on dicts instead.
It takes an XComArg referencing a list of dicts whose keys are the names of the arguments you're mapping the data into. So the following code...
@task
def generate_dicts():
    # presumably the data below would come from a query made at runtime
    return [{"foo": 6, "bar": 7}, {"foo": 8, "bar": 9}]

@task
def two_things(foo, bar):
    print(foo, bar)

with DAG(...) as dag:
    two_things.expand_kwargs(generate_dicts())
... gives two tasks with output:
6 7
...and...
8 9
expand only lets you create tasks from the Cartesian product of the input lists; expand_kwargs lets you associate the data with specific kwargs at runtime.
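Applied to the data in the question, a minimal sketch (assuming Airflow 2.4+; the dag_id, the flatten helper, and the handle task are hypothetical names) would flatten the dict into a list of kwargs dicts inside a task, then map over it:

import pendulum
from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="mapped_from_dict",  # hypothetical dag_id
    start_date=pendulum.datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:

    @task
    def generate_dict():
        # stand-in for the real query
        return {"A": ["aa", "bb", "cc"], "B": ["dd", "ee", "ff"]}

    @task
    def flatten(d):
        # turn {"A": ["aa", ...]} into [{"key": "A", "value": "aa"}, ...]
        return [{"key": k, "value": v} for k, values in d.items() for v in values]

    @task
    def handle(key, value):
        print(f"dm_{key}_{value}")

    handle.expand_kwargs(flatten(generate_dict()))

Note that this gives six mapped instances of a single handle task rather than six separately named tasks; separate task_ids can't be generated from data that only exists at runtime.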

Related

snakemake error: 'Wildcards' object has no attribute 'batch'

I don't understand how to redefine my Snakemake rule to fix the Wildcards issue below.
Ignore the logic of batches; it makes sense internally in the Python script. In short, I want the rule to be run once for each batch 1-20. I use the BATCHES list for {batch} in output, and in the shell command I use {wildcards.batch}:
OUTDIR="my_dir/"
nBATCHES = 20
BATCHES = list(range(1,21)) # [1,2,3 ..20] list

[...]

rule step5:
    input:
        ids = expand('{IDLIST}', IDLIST=IDLIST)
    output:
        type1 = expand('{OUTDIR}/resources/{batch}_output_type1.csv.gz', OUTDIR=OUTDIR, batch=BATCHES),
        type2 = expand('{OUTDIR}/resources/{batch}_output_type2.csv.gz', OUTDIR=OUTDIR, batch=BATCHES),
        type3 = expand('{OUTDIR}/resources/{batch}_output_type3.csv.gz', OUTDIR=OUTDIR, batch=BATCHES)
    shell:
        "./some_script.py --outdir {OUTDIR} --idlist {input.ids} --total_batches {nBATCHES} --current_batch {wildcards.batch}"
Error:
RuleException in rule step5 in line 241 of Snakefile:
AttributeError: 'Wildcards' object has no attribute 'batch', when formatting the following:
./somescript.py --outdir {OUTDIR} --idlist {input.idlist} --totalbatches {nBATCHES} --current_batch {wildcards.batch}
Executing the script manually for a single batch looks like this (and works; total_batches is a constant, current_batch is supposed to iterate):
./somescript.py --outdir my_dir/ --idlist ids.csv --total_batches 20 --current_batch 1
You seem to want to run rule step5 once for each batch in BATCHES, so you need to structure your Snakefile to do exactly that.
In the following Snakefile, running the rule all runs your rule step5 for all combinations of OUTDIR and BATCHES:
OUTDIR = "my_dir"
nBATCHES = 20
BATCHES = list(range(1, 21))  # [1,2,3 ..20] list
IDLIST = ["a", "b"]  # dummy data, I don't have the original

rule all:
    input:
        type1=expand(
            "{OUTDIR}/resources/{batch}_output_type1.csv.gz",
            OUTDIR=OUTDIR,
            batch=BATCHES,
        ),

rule step5:
    input:
        ids=expand("{IDLIST}", IDLIST=IDLIST),
    output:
        type1="{OUTDIR}/resources/{batch}_output_type1.csv.gz",
        type2="{OUTDIR}/resources/{batch}_output_type2.csv.gz",
        type3="{OUTDIR}/resources/{batch}_output_type3.csv.gz",
    shell:
        "./some_script.py --outdir {OUTDIR} --idlist {input.ids} --total_batches {nBATCHES} --current_batch {wildcards.batch}"
In your earlier version, {batch} was just an expand placeholder, not a wildcard, so the rule was only called once.
Instead of the rule all, this could be a subsequent rule that uses one or more of the outputs generated by step5.

Loop many times on many airflow tasks on one dag

I am creating one DAG that will have the following structure of tasks. The DAG is scheduled to run every day at 1:00 AM UTC.
Get rows from database ---- loop over the rows to run many tasks that require each row's data.
For example, I have a method in my DAG that calls a MySQL database and returns many rows. I have to pass each row's data to 4 tasks as a parameter. I have followed some docs found via Google search, but it is not running correctly.
return_db_result is the method that gets the result from Cloud SQL in GCP:
def return_result():
    db_engine_connection = create_cloud_sql_connection()
    session = get_db_session(db_engine_connection)
    result = session.query(Scheduled).filter(Scheduled.job_status == "Scheduled").all()
    session.commit()
    return result
I tried using a for loop, something like the following:
for row in return_result():
    op1 = operator({ param=row.id})
    op2 = operator({ param=row.id})
    op3 = operator({ param=row.id})
    op4 = operator({ param=row.id})
    op1 >> op2 >> op3 >> op4
But these tasks do not show up in the Airflow UI.
Based on your comments, and assuming your operator is:
class MyOperator(BaseOperator):

    @apply_defaults
    def __init__(self,
                 input_id,
                 input_date,
                 input_status,
                 *args, **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
        self.input_id = input_id
        self.input_date = input_date
        self.input_status = input_status

    def execute(self, context):
        pass
You can use it as follows:
start_op = DummyOperator(task_id='start_op')

for row in return_db_result():
    op1 = MyOperator(task_id=f"op1_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op2 = MyOperator(task_id=f"op2_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op3 = MyOperator(task_id=f"op3_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    op4 = MyOperator(task_id=f"op4_{row.id}", input_id=row.id, input_date=row.date, input_status=row.status)
    start_op >> op1 >> op2 >> op3 >> op4
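If the tasks still don't show up, one common cause is building them outside a DAG context, so they never get attached to the DAG. A minimal sketch (assuming the MyOperator above and a return_db_result() helper that can be called while the file is parsed; the dag_id and schedule are placeholders) keeps the loop inside the DAG definition:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="rows_fanout",            # hypothetical dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 1 * * *",   # every day at 1:00 AM UTC
    catchup=False,
) as dag:
    start_op = DummyOperator(task_id="start_op")

    # this query runs every time the scheduler parses the file
    for row in return_db_result():
        prev = start_op
        for step in range(1, 5):
            op = MyOperator(
                task_id=f"op{step}_{row.id}",
                input_id=row.id,
                input_date=row.date,
                input_status=row.status,
            )
            prev >> op
            prev = op

Because the query runs at parse time, the set of tasks can change between scheduler loops; on Airflow 2.3+ dynamic task mapping (expand, as in the first answer above) is usually the cleaner option.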

Will a nested loop help in parsing results

I am trying to pull information from two different dictionaries (excuse me, I am literally hacking my way through this to understand it).
I have a for loop that gives me the vmname. I have another for loop that gives me the other information, like 'replicationid'.
I could be making a very big assumption here, but hey, I'll start there. What I want to do is interleave for loop 1 and for loop 2 so that the results look like the following. Is that even possible?
initial output of for loop 1, which I can get:
vma
vmb
vmc
initial output of for loop 2, which I can get:
replication job 1
replication job 2
replication job 3
desired result is:
vma
replication job 1
vmb
replication job 2
vmc
replication job 3
def get_replication_job_status():
    sms = boto3.client('sms')
    resp = sms.get_replication_jobs()
    # print(resp)
    things = [(cl['replicationJobId'], cl['serverId']) for cl in resp['replicationJobList']]
    thangs = [cl['vmServer'] for cl in resp['replicationJobList']]
    for i in thangs:
        print()
        print("this is vm " + (i['vmName']))
        print("this is the vm location " + (i['vmPath']))
        print("this is the vm address, " + (str(i['vmServerAddress'])))
        for j in things:
            print("The Replication ID is : " + (str(j[0])))
again I want:
vma
replication job 1
vmb
replication job 2
vmc
replication job 3
I am getting:
vma
replication job 1
replication job 2
replication job 3
vmb
replication job 1
replication job 2
replication job 3
..
..
..
If you are sure that both of your lists have the same length, then what you need is Python's built-in zip function:
for thing, thang in zip(things, thangs):
    print()
    print(thing)
    print(thang)
But if one of the lists is longer than the other, then zip will crop both lists to the length of the shorter one, for example:
>>> for i, j in zip(range(3), range(5)):
...     print(i, j)
...
(0, 0)
(1, 1)
(2, 2)
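If cropping is not what you want, itertools.zip_longest pads the shorter list instead; a small sketch (the fillvalue below is just an example):

from itertools import zip_longest

for i, j in zip_longest(range(3), range(5), fillvalue="-"):
    print(i, j)
# 0 0
# 1 1
# 2 2
# - 3
# - 4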
UPD:
You can also unpack your tuples right in the for loop definition, so each item (they are 2-tuples) in the things list gets saved to two variables:
for (replicationJobId, serverId), thang in zip(things, thangs):
    print()
    print(replicationJobId)
    print(serverId)
    print(thang)
UPD 2:
Why do you split resp into two lists at all?
def get_replication_job_status():
    sms = boto3.client('sms')
    resp = sms.get_replication_jobs()
    # print(resp)
    for replication_job in resp['replicationJobList']:
        vm_server = replication_job['vmServer']
        print()
        print("this is vm:", vm_server['vmName'])
        print("this is the vm location:", vm_server['vmPath'])
        print("this is the vm address:", vm_server['vmServerAddress'])
        print("The Replication ID is :", replication_job['replicationJobId'])

Return multiple nested dictionaries from Tcl

I have a Tcl proc that creates two dictionaries from a large file. It looks something like this:
...
...
proc makeCircuitData {spiceNetlist} {
    # read the spiceNetlist file line by line
    # create a dict with multilevel nesting called elementMap that will have the following structure:
    #   elementMap key1 key2 value12
    #   elementMap keyA keyB valueAB
    # and so on
    # ... some other code here ...
    # create another dict with multilevel nesting called cktElementAttr that will have the following structure:
    #   cktElementAttr resistor leftVoltageNode1 rightVoltageNode1 resValue11
    #   cktElementAttr resistor leftVoltageNode2 rightVoltageNode2 resValue12
    #   cktElementAttr inductor leftVoltageNode2 rightVoltageNode2 indValue11
    #   cktElementAttr inductor leftVoltageNode2 rightVoltageNode2 indValue12
    #   cktElementAttr capacitor leftVoltageNode2 rightVoltageNode2 capValue11
    # ... so on...
}
I want to return these two nested dictionaries, cktElementAttr and elementMap, from this type of procedure, as both dictionaries are used by other parts of my program.
What is the recommended way to return two dictionaries from a Tcl proc?
Thanks.
This should work:
return [list $cktElementAttr $elementMap]
Then, at the caller, you can assign the return value to a list:
set theDictionaries [makeCircuitData ...]
or assign them to different variables:
lassign [makeCircuitData ...] cEltAttr elmMap
In Tcl 8.4 or older (which are obsolete!), you can (ab)use foreach to do the job of lassign:
foreach {cEltAttr elmMap} [makeCircuitData ...] break
Documentation:
break,
foreach,
lassign,
list,
return,
set

Restore vgg16 network in tensorflow

This one has been giving me a headache for quite some time now, even though it seems to be very basic.
I have the vgg16 network downloaded as a .ckpt
(from https://github.com/tensorflow/models/blob/master/slim/README.md#Pretrained)
Now what I want to do is load, for example, the tensor of the first convolution layer of this network as an array in R.
I tried
restorer = tf$train$Saver()
sess = tf$Session()
restorer$restore(sess, "/home/beheerder/R/vgg_16.ckpt")
But then I do not see any variables appearing in my environment.
I'm working in R, but an answer in Python is OK as well, as I can probably translate it to R.
Saver takes the variables to restore in its constructor. In other words, you have to create the variables before you can restore them. Here is the example from Saver's docs:
v1 = tf.Variable(..., name='v1')
v2 = tf.Variable(..., name='v2')
# Pass the variables as a dict:
saver = tf.train.Saver({'v1': v1, 'v2': v2})
# Or pass them as a list.
saver = tf.train.Saver([v1, v2])
If you were to run the first line of your code in Python, you would get:
In [1]: import tensorflow as tf
In [2]: saver = tf.train.Saver()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-18da33d742f9> in <module>()
----> 1 saver = tf.train.Saver()
/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.pyc in __init__(self, var_list, reshape, sharded, max_to_keep, keep_checkpoint_every_n_hours, name, restore_sequentially, saver_def, builder, defer_build, allow_empty, write_version, pad_step_number)
1054 self._pad_step_number = pad_step_number
1055 if not defer_build:
-> 1056 self.build()
1057 if self.saver_def:
1058 self._check_saver_def()
/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.pyc in build(self)
1075 return
1076 else:
-> 1077 raise ValueError("No variables to save")
1078 self._is_empty = False
1079 self.saver_def = self._builder.build(
ValueError: No variables to save
You can see how model variables are created before being restored in the 20 lines starting from https://github.com/tensorflow/models/blob/master/slim/train_image_classifier.py#L338
This code gets executed if you make a call to train_image_classifier.py similar to the flower example in https://github.com/tensorflow/models/blob/master/slim/README.md#fine-tuning-a-model-from-an-existing-checkpoint
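In Python, a minimal sketch (TF 1.x with the slim vgg model; the placeholder shape and the exact variable name are assumptions based on the standard slim vgg_16 scope) would first build the graph so the variables exist, then restore the checkpoint and read the first conv kernel as an array:

import tensorflow as tf
from tensorflow.contrib.slim.nets import vgg

# build the vgg_16 graph so its variables are created before restoring
inputs = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
vgg.vgg_16(inputs, num_classes=1000, is_training=False)

saver = tf.train.Saver()  # now there are variables to restore

with tf.Session() as sess:
    saver.restore(sess, "/home/beheerder/R/vgg_16.ckpt")
    # fetch the first conv layer's kernel as a numpy array
    kernel = tf.get_default_graph().get_tensor_by_name("vgg_16/conv1/conv1_1/weights:0")
    first_conv = sess.run(kernel)
    print(first_conv.shape)  # expected (3, 3, 3, 64) for vgg_16

The same steps should translate to R's tensorflow package (tf$placeholder, tf$train$Saver, and so on), since the key point is just that the graph has to be built before the Saver is constructed.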
