I have a custom DAG (meant to be subclassed), let's name it MyDAG. In the __enter__ method I want to add (or not) an operator based on the subclassing DAG. I'm not interested in using the BranchPythonOperator.
class MyDAG(DAG):
    def __enter__(self, context):
        start = DummyOperator(task_id='start')
        end = DummyOperator(task_id='end')
        op = self.get_additional_operator()
        if op:
            start >> op
        else:
            start >> end
        return self

    def get_additional_operator(self):
        # None if the subclass doesn't add any operator. A reference to another operator otherwise.
        return None
If get_additional_operator returns a reference, I obtain this shape (two branches):
* start --> op
* end
otherwise, if it's returning None, I'm obtaining this (one branch):
* start --> end
What I want is to not have end at all in the subclass inheriting from MyDAG when get_additional_operator doesn't return None, something like this:
* start --> op
Instead of the two branches I'm obtaining above.
Airflow seems to parse every operator declared in the __enter__ method of a subclass of MyDAG. Based on that assumption, to avoid having an operator it suffices to declare the operator in the right place. Code below:
class MyDAG(DAG):
    def __enter__(self, context):
        start = DummyOperator(task_id='start')
        op = self.get_additional_operator()
        if op:
            start >> op
        else:
            end = DummyOperator(task_id='end')
            start >> end
        return self

    def get_additional_operator(self):
        # None if the subclass doesn't add any operator. A reference to another operator otherwise.
        return None
The end operator is now declared inside the else branch, so it is only created when that branch actually executes; when get_additional_operator returns an operator, end is never declared at all.
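For illustration, a minimal sketch of two hypothetical subclasses (class names and task ids are made up, and the extra operator is attached explicitly with dag=self) showing how the override decides which tasks are created:

class DagWithExtraOp(MyDAG):
    def get_additional_operator(self):
        # Returning an operator makes __enter__ wire start >> op;
        # the end task is never declared.
        return DummyOperator(task_id='op', dag=self)

class PlainDag(MyDAG):
    def get_additional_operator(self):
        # Returning None makes __enter__ declare end and wire start >> end.
        return None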
Related
Is there a built-in facility or some operator that will run a sensor and negate its status? I am writing a workflow that needs to detect that an object does not exist in order to proceed to eventual success. I have a sensor, but it detects when the object does exist.
For instance, I would like my workflow to detect that an object does not exist. I need almost exactly S3KeySensor, except that I need to negate its status.
The use case you are describing is checking a key in S3: if it exists, wait; otherwise, continue the workflow. As you mentioned, this is a Sensor use case. The S3Hook has a check_for_key function that checks whether a key exists, so all that's needed is to wrap it in a Sensor's poke function.
A simple basic implementation would be:
from typing import TYPE_CHECKING, Optional, Sequence, Union

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.sensors.base import BaseSensorOperator

if TYPE_CHECKING:
    from airflow.utils.context import Context


class S3KeyNotPresentSensor(BaseSensorOperator):
    """Waits for a key to not be present in S3."""

    template_fields: Sequence[str] = ('bucket_key', 'bucket_name')

    def __init__(
        self,
        *,
        bucket_key: str,
        bucket_name: Optional[str] = None,
        aws_conn_id: str = 'aws_default',
        verify: Optional[Union[str, bool]] = None,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.bucket_name = bucket_name
        self.bucket_key = bucket_key  # single key; check_for_key expects a plain string
        self.aws_conn_id = aws_conn_id
        self.verify = verify
        self.hook: Optional[S3Hook] = None

    def poke(self, context: 'Context'):
        # Inverted check: succeed only when the key is NOT in the bucket.
        return not self.get_hook().check_for_key(self.bucket_key, self.bucket_name)

    def get_hook(self) -> S3Hook:
        """Create and return an S3Hook."""
        if self.hook:
            return self.hook
        self.hook = S3Hook(aws_conn_id=self.aws_conn_id, verify=self.verify)
        return self.hook
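For example, a hypothetical usage of the sensor inside a DAG (task id, bucket, key and timings are made up):

check_key_absent = S3KeyNotPresentSensor(
    task_id='check_key_absent',
    bucket_name='my-bucket',           # illustrative bucket
    bucket_key='path/to/object.csv',   # illustrative key
    poke_interval=60,
    timeout=60 * 60,
    mode='reschedule',
)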
I ended up going another way: I can use the trigger_rule argument of (any) task. By setting it to one_failed or all_failed on the downstream task, I can invert the sensor's outcome.
For example,
file_exists = FileSensor(task_id='exists', timeout=3, poke_interval=1, filepath='/tmp/error', mode='reschedule')
sing = SmoothOperator(task_id='sing', trigger_rule='all_failed')
file_exists >> sing
It requires no added code or operator, but has the possible disadvantage of being somewhat surprising.
Replying to myself in the hope that this may be useful to someone else. Thanks!
I'm new to Elixir, and I'm trying to find something similar to Python's context managers.
Problem:
I have a bunch of functions and I want to add a latency metric around each of them.
Now we have:
def method_1 do
  ...
end

def method_2 do
  ...
end
... more methods
I'd like to have:
def method_1 do
  start = System.monotonic_time()
  ...
  stop = System.monotonic_time()
  emit_metric(stop - start)
end

def method_2 do
  start = System.monotonic_time()
  ...
  stop = System.monotonic_time()
  emit_metric(stop - start)
end
... more methods
Now code duplication is a problem, since every method repeats:
start = System.monotonic_time()
...
stop = System.monotonic_time()
emit_metric(stop - start)
So what is a better way to avoid code duplication in this case? I like the context manager idea in Python, but I'm not sure how I can achieve something similar in Elixir. Thanks for the help in advance!
In Erlang/Elixir this is done through higher-order functions; take a look at BEAM telemetry. It is an Erlang and Elixir library/standard for collecting metrics and instrumenting your code, widely adopted by Phoenix, Ecto, cowboy and other libraries. Specifically, you'd be interested in the :telemetry.span/3 function, as it emits start time and duration measurements by default:
def some_function(args) do
  :telemetry.span([:my_app, :my_function], %{metadata: "Some data"}, fn ->
    result = do_some_work(args)
    {result, %{more_metadata: "Some data here"}}
  end)
end

def do_some_work(args) do
  # actual work goes here
end
And then, in some other area of your code, you listen to those events and log them or send them to an APM:
:telemetry.attach_many(
  "test-telemetry",
  [
    [:my_app, :my_function, :start],
    [:my_app, :my_function, :stop],
    [:my_app, :my_function, :exception]
  ],
  fn event, measurements, metadata, config ->
    # Handle the actual event.
  end,
  nil
)
I think the closest thing to a Python context manager would be to use higher-order functions, i.e. functions taking a function as an argument.
So you could have something like:
def measure(fun) do
  start = System.monotonic_time()
  result = fun.()
  stop = System.monotonic_time()
  emit_metric(stop - start)
  result
end
And you could use it like:
measure(fn ->
  do_stuff()
  ...
end)
Note: there are other similar instances where you would use a context manager in Python that would be done in a similar way; off the top of my head, Django has a context manager for transactions, but Ecto uses a higher-order function for the same thing.
PS: to measure elapsed time, you probably want to use :timer.tc/1 though:
def measure(fun) do
  {elapsed, result} = :timer.tc(fun)
  emit_metric(elapsed)
  result
end
There is actually a really nifty library called Decorator in which macros can be used to "wrap" your functions to do all sorts of things.
In your case, you could write a decorator module (thanks to @maciej-szlosarczyk for the telemetry example):
defmodule MyApp.Measurements do
  use Decorator.Define, measure: 0

  def measure(body, context) do
    meta = Map.take(context, [:name, :module, :arity])

    quote do
      # Pass the metadata information about module/name/arity as metadata to be accessed later
      :telemetry.span([:my_app, :measurements, :function_call], unquote(meta), fn ->
        {unquote(body), %{}}
      end)
    end
  end
end
You can set up a telemetry listener in your Application.start definition:
:telemetry.attach_many(
  "my-app-measurements",
  [
    [:my_app, :measurements, :function_call, :start],
    [:my_app, :measurements, :function_call, :stop],
    [:my_app, :measurements, :function_call, :exception]
  ],
  &MyApp.MeasurementHandler.handle_telemetry/4,
  nil
)
Then in any module with a function call you'd like to measure, you can "decorate" the functions like so:
defmodule MyApp.Domain.DoCoolStuff do
  use MyApp.Measurements

  @decorate measure()
  def awesome_function(a, b, c) do
    # regular function logic
  end
end
Although this example uses telemetry, you could just as easily print out the time difference within your decorator definition.
Just for learning purposes, I tried to set a dictionary as a global variable via an accumulator. The add function works well, but when I ran the code and updated the dictionary inside the map function, it always returned empty.
But similar code that sets a list as a global variable works fine.
class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)


if __name__ == "__main__":
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input')
    # print(rdd.take(5))
    dict1 = sc.accumulator({}, DictParam())

    def file_read(line):
        global dict1
        ls = re.split(',', line)
        dict1 += {ls[0]: ls[1]}
        return line

    rdd = rdd.map(lambda x: file_read(x)).cache()
    print(dict1)
For anyone who arrives at this thread looking for a Dict accumulator for pyspark: the accepted solution does not solve the posed problem.
The issue is actually in the DictParam defined: it does not update the original dictionary. This works:
class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1
The original code was missing the return value.
I believe that print(dict1) simply gets executed before the rdd.map() does.
In Spark, there are 2 types of operations:
* transformations, which describe the future computation
* actions, which actually trigger the execution
Accumulators are updated only when some action is executed:
Accumulators do not change the lazy evaluation model of Spark. If they
are being updated within an operation on an RDD, their value is only
updated once that RDD is computed as part of an action.
If you check out the end of this section of the docs, there is an example exactly like yours:
accum = sc.accumulator(0)

def g(x):
    accum.add(x)
    return f(x)

data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
So you would need to add some action, for instance:
rdd = rdd.map(lambda x: file_read(x)).cache() # transformation
foo = rdd.count() # action
print(dict1)
Please make sure to check on the details of various RDD functions and accumulator peculiarities because this might affect the correctness of your result. (For instance, rdd.take(n) will by default only scan one partition, not the entire dataset.)
For accumulator updates performed inside actions only, their value is
only updated once that RDD is computed as part of an action
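Putting the two fixes together (the return in addInPlace plus an action before printing), here is a minimal self-contained sketch; the input path and the comma-separated line format are only assumptions carried over from the question:

import re

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam


class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)
        return acc1  # returning the merged dict is the crucial part


sc = SparkContext(appName="dict_accumulator_demo")
dict1 = sc.accumulator({}, DictParam())


def file_read(line):
    ls = re.split(',', line)
    dict1.add({ls[0]: ls[1]})  # .add avoids needing a global statement
    return line


rdd = sc.textFile('input').map(file_read).cache()
rdd.count()          # an action forces the map to run, so the accumulator is populated
print(dict1.value)   # the merged dict, e.g. {'key1': 'val1', ...}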
I'm using pytest.mark to give my tests kwargs. However, if I use the same mark on both the class and a test within the class, the class's mark overrides the mark on the function when the same kwargs are used for both.
import pytest

animal = pytest.mark.animal

@animal(species='croc')  # Mark the class with a kwarg
class TestClass(object):
    @animal(species='hippo')  # Mark the function with new kwarg
    def test_function(self):
        pass


@pytest.fixture(autouse=True)  # Use a fixture to inspect my function
def animal_inspector(request):
    print(request.function.animal.kwargs)  # Show how the function object got marked
    # prints {'species': 'croc'} but the function was marked with 'hippo'
Where'd my hippo go and how can I get him back?
There are unfortunately various pytest bugs related to this; I'm guessing you're running into one of them. The ones I found are related to subclassing, which you don't do here, though.
So I've been digging around in the pytest code and figured out why this is happening. Marks on functions are applied to the function at import time, but class- and module-level marks don't get applied at the function level until test collection. Function marks happen first and add their kwargs to the function; then class marks overwrite any kwargs with the same name, and module marks further overwrite any matching kwargs.
My solution was to simply create my own modified MarkDecorator that filters kwargs before they are added to the marks. Basically, whatever kwarg values get set first (which seems to always be by a function decorator) will always be the value on the mark. Ideally I think this functionality should be added in the MarkInfo class but since my code wasn't creating instances of that I went with what I was creating instances of: MarkDecorator. Note that I only change two lines from the source code (the bits about keys_to_add).
from _pytest.mark import istestfunc, MarkInfo
import inspect


class TestMarker(object):  # Modified MarkDecorator class
    def __init__(self, name, args=None, kwargs=None):
        self.name = name
        self.args = args or ()
        self.kwargs = kwargs or {}

    @property
    def markname(self):
        return self.name  # for backward-compat (2.4.1 had this attr)

    def __repr__(self):
        d = self.__dict__.copy()
        name = d.pop('name')
        return "<MarkDecorator %r %r>" % (name, d)

    def __call__(self, *args, **kwargs):
        """ if passed a single callable argument: decorate it with mark info.
        otherwise add *args/**kwargs in-place to mark information. """
        if args and not kwargs:
            func = args[0]
            is_class = inspect.isclass(func)
            if len(args) == 1 and (istestfunc(func) or is_class):
                if is_class:
                    if hasattr(func, 'pytestmark'):
                        mark_list = func.pytestmark
                        if not isinstance(mark_list, list):
                            mark_list = [mark_list]
                        mark_list = mark_list + [self]
                        func.pytestmark = mark_list
                    else:
                        func.pytestmark = [self]
                else:
                    holder = getattr(func, self.name, None)
                    if holder is None:
                        holder = MarkInfo(
                            self.name, self.args, self.kwargs
                        )
                        setattr(func, self.name, holder)
                    else:
                        # Don't set kwargs that already exist on the mark
                        keys_to_add = {key: value for key, value in self.kwargs.items() if key not in holder.kwargs}
                        holder.add(self.args, keys_to_add)
                return func
        kw = self.kwargs.copy()
        kw.update(kwargs)
        args = self.args + args
        return self.__class__(self.name, args=args, kwargs=kw)


# Create my Mark instance. Note my modified mark class must be imported to be used
animal = TestMarker(name='animal')

# Apply it to class and function
@animal(species='croc')  # Mark the class with a kwarg
class TestClass(object):
    @animal(species='hippo')  # Mark the function with new kwarg
    def test_function(self):
        pass

# Now prints {'species': 'hippo'} Yay!
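As a side note beyond the original answer: on newer pytest versions (assumption: pytest 3.6 or later, where the mark internals were reworked), a fixture can read the function-level kwargs directly with get_closest_marker, because the mark applied nearest to the test item (the function's own mark) is returned first. A minimal sketch:

import pytest


@pytest.fixture(autouse=True)
def animal_inspector(request):
    # get_closest_marker returns the mark applied closest to the test item,
    # so the function-level @animal(species='hippo') wins over the class-level one.
    marker = request.node.get_closest_marker('animal')
    if marker is not None:
        print(marker.kwargs)  # expected: {'species': 'hippo'}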
class Account
  def initialize(starting_balance = 0)
    @balance = starting_balance
  end

  def balance          # instance getter method
    @balance           # instance variable visible only to this object
  end

  def balance=(new_amount)
    @balance = new_amount
  end

  def deposit(amount)
    @balance += amount
  end

  @@bank_name = "MyBank.com"  # class (static) variable

  # A class method
  def self.bank_name
    @@bank_name
  end
  # or: def SavingsAccount.bank_name ; @@bank_name ; end
end
I want to understand the code snippets in bold (the balance= setter and the @@bank_name class variable and class method). What do they do? What is the difference between a setter and the initialize method?
If I had an object test = Account.new(), why is test(30) giving an error? Isn't that supposed to call the setter method with parameter 30 and set the balance?
initialize is the method that is called on the newly created object when you do Account.new or Account.new(my_starting_balance). In the first case initialize would be called with the default value 0 for starting_balance and in the second with my_starting_balance.
The setter method balance= is called when you do my_account.balance = some_value, where my_account is an instance of the class Account. So if you have the following code, initialize will be called on line 1 (with 0 as its argument) and balance= on line 2 (with 23 as its argument):
my_account = Account.new
my_account.balance = 23
Of course in this case I could just as well write the following and not use the setter method at all:
my_account = Account.new(23)
However that doesn't always work because some times you might want to change the value of balance after the object has already been created.
If I had an object test=Account.new() and why is test(30) giving an error.
Because test(30) means "call the method test with the argument 30" and there is no method called test in your code. To set the balance you would write test.balance = 30 (or test.deposit(30) to add to it).
Regarding the second bolded part of your code: As the comments indicate, it sets a class variable named @@bank_name and defines a class method that returns that variable's value.