Executing a solid when at least one of the required inputs is given - dagster

As an input, I would like to retrieve data based on user input, or "randomly" from a DB if no user input is given. All other downstream tasks of the pipeline would be the same.
Therefore, I would like to create a pipeline starting with solids A and B, and a downstream solid C executed based on input from solid A OR solid B.
However, when using conditional outputs on solids A and B, solid C is not executed, as one input is not generated by upstream solids.
Is there a simple way of doing this that I am missing?
Thanks for your help.

"fan-in" dependencies will not skip unless all the fanned in outputs were skipped, so that is one way to accomplish this.
@pipeline
def example():
    maybe_a = A()
    maybe_b = B()
    C(items=[maybe_a, maybe_b])
https://docs.dagster.io/examples/fan_in_pipeline#main

Working off of @Alex's suggestion, here's an executable pipeline:
from dagster import pipeline, solid

@solid()
def A(_):
    return

@solid()
def B(_):
    return

@solid()
def C(context, results):
    context.log.info("a returned {}".format(results[0]))
    context.log.info("b returned {}".format(results[1]))
    context.log.info("compute c")

@pipeline
def my_pipeline():
    a = A()
    b = B()
    C([a, b])
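Since the original question involves conditional outputs, here is a rough sketch of what A might look like so that its output, and therefore its entry in the fan-in list, is skipped when no user input is provided. The config field user_input is made up for illustration, and B would mirror the same pattern for the DB branch:

from dagster import Output, OutputDefinition, solid

@solid(
    config_schema={"user_input": str},
    output_defs=[OutputDefinition(is_required=False)],
)
def A(context):
    # Yield the output only when user input was supplied; when nothing is
    # yielded, the fan-in list passed to C simply omits this entry.
    if context.solid_config["user_input"]:
        yield Output(context.solid_config["user_input"])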

Related

How to Get the Task Status and Result of a Celery Task Type When Multiple Tasks Are Defined?

Typically I send an asynchronous task with the .apply_async method of the task defined, and then I use the task id with the AsyncResult method of the same object to get the task status and, eventually, the result.
But this requires me to know the exact type of task when more than one task is defined in the same deployment. Is there any way to circumvent this, so that I can get the task status and result (if available) without knowing the exact task?
For example, take this example celery master node code.
#!/usr/bin/env python3
# encoding:utf-8
"""Define the tasks in this file."""
from celery import Celery

redis_host: str = 'redis://localhost:6379/0'

celery = Celery(main='test', broker=redis_host, backend=redis_host)
celery.conf.CELERY_TASK_SERIALIZER = 'pickle'
celery.conf.CELERY_RESULT_SERIALIZER = 'pickle'
celery.conf.CELERY_ACCEPT_CONTENT = {'json', 'pickle'}

# pylint: disable=unused-argument

@celery.task(bind=True)
def add(self, x: float, y: float) -> float:
    """Add two numbers."""
    return x + y

@celery.task(bind=True)
def multiply(self, x: float, y: float) -> float:
    """Multiply two numbers."""
    return x * y
When I call something like this in a different module
task1 = add.apply_async(args=[2, 3]).id
task2 = multiply.apply_async(args=[2, 3]).id
I get two UUIDs for the tasks. But when checking back on the task status, I need to know which method (add or multiply) is associated with each task id, since I have to call AsyncResult on the corresponding task object, like this:
status: str = add.AsyncResult(task_id=task1).state
My question is: how can I fetch the state and result armed only with the task id, without knowing whether the task belongs to add, multiply, or any other category defined?
id and state are just properties of AsyncResult objects. If you look at the documentation for the AsyncResult class, you will find the name property, which is exactly what you are asking for.
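A minimal sketch of that idea, assuming the same celery app object from the question; note that AsyncResult.name is only populated when the result backend stores extended task metadata (e.g. with result_extended = True):

from celery.result import AsyncResult

# Any task id can be looked up this way; only the app (and its result
# backend) is needed, not the task object that produced the id.
res = AsyncResult(task1, app=celery)
print(res.state)   # e.g. 'PENDING', 'STARTED', 'SUCCESS'
print(res.result)  # the return value once the task has finished
print(res.name)    # the task name, e.g. 'test.add', if extended results are enabled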

What is a good way for writing a function to measure another function in Elixir

I'm new to Elixir, and I'm trying to find something similar to Python's context managers.
Problem:
I have a bunch of functions and I want to add latency metric around them.
Now we have:
def method_1 do
  ...
end

def method_2 do
  ...
end

... more methods
I'd like to have:
def method_1 do
  start = System.monotonic_time()
  ...
  stop = System.monotonic_time()
  emit_metric(stop - start)
end

def method_2 do
  start = System.monotonic_time()
  ...
  stop = System.monotonic_time()
  emit_metric(stop - start)
end

... more methods
Now code duplication is a problem:
start = System.monotonic_time()
...
stop = System.monotonic_time()
emit_metric(stop - start)
So what is a better way to avoid code duplication in this case? I like the context manager idea in Python, but I'm not sure how to achieve something similar in Elixir. Thanks for the help in advance!
In Erlang/Elixir this is done through higher-order functions. Take a look at BEAM telemetry: it is an Erlang/Elixir library/standard for collecting metrics and instrumenting your code, widely adopted by Phoenix, Ecto, cowboy and other libraries. Specifically, you'd be interested in the :telemetry.span/3 function, as it emits start time and duration measurements by default:
def some_function(args) do
  :telemetry.span([:my_app, :my_function], %{metadata: "Some data"}, fn ->
    result = do_some_work(args)
    {result, %{more_metadata: "Some data here"}}
  end)
end

def do_some_work(args) # actual work goes here
And then, in some other area of your code, you listen to those events and log them or send them to your APM:
:telemetry.attach_many(
  "test-telemetry",
  [
    [:my_app, :my_function, :start],
    [:my_app, :my_function, :stop],
    [:my_app, :my_function, :exception]
  ],
  fn event, measurements, metadata, config ->
    # Handle the actual event.
  end,
  nil
)
I think the closest thing to a Python context manager would be to use higher-order functions, i.e. functions taking a function as an argument.
So you could have something like:
def measure(fun) do
  start = System.monotonic_time()
  result = fun.()
  stop = System.monotonic_time()
  emit_metric(stop - start)
  result
end
And you could use it like:
measure(fn ->
  do_stuff()
  ...
end)
Note: there are other cases where you would use a context manager in Python that would be handled in a similar way in Elixir. Off the top of my head: Django has a context manager for transactions, while Ecto uses a higher-order function for the same thing.
PS: to measure elapsed time, you probably want to use :timer.tc/1 though:
def measure(fun) do
  {elapsed, result} = :timer.tc(fun)
  emit_metric(elapsed)
  result
end
There is actually a really nifty library called Decorator in which macros can be used to "wrap" your functions to do all sorts of things.
In your case, you could write a decorator module (thanks to @maciej-szlosarczyk for the telemetry example):
defmodule MyApp.Measurements do
  use Decorator.Define, measure: 0

  def measure(body, context) do
    meta = Map.take(context, [:name, :module, :arity])

    quote do
      # Pass the module/name/arity information as metadata to be accessed later
      :telemetry.span([:my_app, :measurements, :function_call], unquote(meta), fn ->
        {unquote(body), %{}}
      end)
    end
  end
end
You can set up a telemetry listener in your Application.start definition:
:telemetry.attach_many(
  "my-app-measurements",
  [
    [:my_app, :measurements, :function_call, :start],
    [:my_app, :measurements, :function_call, :stop],
    [:my_app, :measurements, :function_call, :exception]
  ],
  &MyApp.MeasurementHandler.handle_telemetry/4,
  nil
)
Then in any module with a function call you'd like to measure, you can "decorate" the functions like so:
defmodule MyApp.Domain.DoCoolStuff do
  use MyApp.Measurements

  @decorate measure()
  def awesome_function(a, b, c) do
    # regular function logic
  end
end
Although this example uses telemetry, you could just as easily print out the time difference within your decorator definition.

accumulator in pyspark with dict as global variable

Just for learning purposes, I tried to set a dictionary as a global variable via an accumulator. The add function works well, but when I run the code and update the dictionary inside the map function, it always comes back empty.
(Similar code that sets a list as the global variable works.)
import re

from pyspark.accumulators import AccumulatorParam


class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)


if __name__ == "__main__":
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input')
    # print(rdd.take(5))
    dict1 = sc.accumulator({}, DictParam())

    def file_read(line):
        global dict1
        ls = re.split(',', line)
        dict1 += {ls[0]: ls[1]}
        return line

    rdd = rdd.map(lambda x: file_read(x)).cache()
    print(dict1)
For anyone who arrives at this thread looking for a dict accumulator for PySpark: the accepted solution does not solve the posed problem.
The issue is actually in the DictParam defined: it does not update the original dictionary. This works:
class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1
The original code was missing the return value.
I believe that print(dict1) simply gets executed before the rdd.map() does.
In Spark, there are two types of operations:
transformations, which describe a future computation,
and actions, which actually trigger the execution.
Accumulators are updated only when some action is executed:
Accumulators do not change the lazy evaluation model of Spark. If they
are being updated within an operation on an RDD, their value is only
updated once that RDD is computed as part of an action.
If you check out the end of this section of the docs, there is an example exactly like yours:
accum = sc.accumulator(0)

def g(x):
    accum.add(x)
    return f(x)

data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
So you would need to add some action, for instance:
rdd = rdd.map(lambda x: file_read(x)).cache()  # transformation
foo = rdd.count()                              # action
print(dict1)
Please make sure to check on the details of various RDD functions and accumulator peculiarities because this might affect the correctness of your result. (For instance, rdd.take(n) will by default only scan one partition, not the entire dataset.)
For accumulator updates performed inside actions only, their value is only updated once that RDD is computed as part of an action
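Putting both fixes together, here is a minimal self-contained sketch; the local SparkContext and the in-memory sample lines are assumptions made purely for illustration:

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value=None):
        return {}

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)
        return acc1  # the return that was missing in the original DictParam

sc = SparkContext("local[2]", "dict-accumulator-demo")
dict1 = sc.accumulator({}, DictParam())

def file_read(line):
    k, v = line.split(',')
    dict1.add({k: v})
    return line

rdd = sc.parallelize(["a,1", "b,2", "c,3"]).map(file_read).cache()
rdd.count()         # an action, so the map actually runs and the accumulator is updated
print(dict1.value)  # {'a': '1', 'b': '2', 'c': '3'}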

Unable to iterate over a Map using Groovy within a Jenkins Pipeline

We are trying to iterate over a Map, but without any success. We reduced our issue to this minimal example:
def map = [
    'monday': 'mon',
    'tuesday': 'tue',
]
If we try to iterate with:
map.each{ k, v -> println "${k}:${v}" }
Only the first entry is output: monday:mon
The alternatives we know of are not even able to enter the loop:
for (e in map)
{
    println "key = ${e.key}, value = ${e.value}"
}
or
for (Map.Entry<String, String> e : map.entrySet())
{
    println "key = ${e.key}, value = ${e.value}"
}
Both fail, showing only the exception java.io.NotSerializableException: java.util.LinkedHashMap$Entry (which could itself be an exception raised while reporting the 'real' exception, preventing us from knowing what happened).
We are using the latest stable Jenkins (2.19.1) with all plugins up-to-date as of today (2016/10/20).
Is there a solution to iterate over elements in a Map within a Jenkins pipeline Groovy script?
It's been some time since I played with this, but the best way to iterate through maps (and other containers) was with "classical" for loops or "for in". See Bug: Mishandling of binary methods accepting Closure
On your specific problem: most (all?) pipeline DSL commands add a sequence point; by that I mean it is possible to save the state of the pipeline and resume it at a later time. Think of waiting for user input, for example: you want to keep this state even through a restart.
The result is that every live instance has to be serialized, but the standard Map iterator is unfortunately not serializable. Original Thread
The best solution I can come up with is defining a function to convert a Map into a list of serializable MapEntries. The function does not use any pipeline steps, so nothing has to be serializable within it.
@NonCPS
def mapToList(depmap) {
    def dlist = []
    for (def entry2 in depmap) {
        dlist.add(new java.util.AbstractMap.SimpleImmutableEntry(entry2.key, entry2.value))
    }
    dlist
}
This obviously has to be called for each map you want to iterate over, but the upside is that the body of the loop stays the same.
for (def e in mapToList(map))
{
    println "key = ${e.key}, value = ${e.value}"
}
You will have to approve the SimpleImmutableEntry constructor the first time, or quite possibly you could work around that by placing the mapToList function in the workflow library.
Or, much simpler:
for (def key in map.keySet()) {
    println "key = ${key}, value = ${map[key]}"
}

Parallelize function on dictionary in IPython

Up till now, I have parallelized functions by mapping them onto lists that are distributed out to the various clusters using the function map_sync(function, list).
Now, I need to run a function on each entry of a dictionary.
map_sync does not seem to work on dictionaries. I have also tried to scatter the dictionary and use decorators to run the function in parallel. However, dictionaries don't seem to lend themselves to scattering either. Is there some other way to parallelize functions on dictionaries without having to convert them to lists?
These are my attempts thus far:
from IPython.parallel import Client

rc = Client()
dview = rc[:]

test_dict = {'43': "lion", '34': "tiger", '343': "duck"}
dview.scatter("test", test_dict)
dview["test"]
# this yields [['343'], ['43'], ['34'], []] on 4 clusters,
# which suggests that a dictionary can't be scattered?
Needless to say, when I run the function itself, I get an error:
@dview.parallel(block=True)
def run():
    for d, v in test.iteritems():
        print d, v

run()

AttributeError                            Traceback (most recent call last)
   in ()
   in run(dict)
AttributeError: 'str' object has no attribute 'iteritems'
I don't know if it's relevant, but I'm using an IPython Notebook connected to Amazon AWS clusters.
You can scatter a dict with:
def scatter_dict(view, name, d):
    """Partition a dictionary across the engines of a view."""
    ntargets = len(view)
    keys = d.keys()  # list(d.keys()) in Python 3
    for i, target in enumerate(view.targets):
        subd = {}
        for key in keys[i::ntargets]:
            subd[key] = d[key]
        view.client[target][name] = subd

scatter_dict(dview, 'test', test_dict)
and then operate on it remotely, as you normally would.
You can also gather the remote dicts into one local one again with:
def gather_dict(view, name):
    """Gather dictionaries from a DirectView."""
    merged = {}
    for d in view.pull(name):
        merged.update(d)
    return merged

gather_dict(dview, 'test')
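For instance, a hypothetical end-to-end sketch combining the two helpers above (the upper-casing transformation and the remote variable name result are made up for illustration):

scatter_dict(dview, 'test', test_dict)

# Each engine now holds its own slice of the dictionary in `test`,
# so remote code can loop over just the local items.
dview.execute("result = dict((k, v.upper()) for k, v in test.items())", block=True)

# Merge the per-engine result dicts back into a single local dict.
merged = gather_dict(dview, 'result')
print(merged)  # {'43': 'LION', '34': 'TIGER', '343': 'DUCK'}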
An example notebook
