Dagster - Execute an @op only when all parallel executions are finished (DynamicOutput)

I have a problem that I am not able to solve in Dagster.
I have the following configuration:
Step 1 gets the data from an endpoint.
Step 2 gets a list of customers dynamically.
Step 3 updates the database with the response from step 1, for each customer from step 2, in parallel.
Before calling step 3, I have a function named "parallelize_clients" that creates a DynamicOutput for each client from step 2, so that when it is invoked it parallelizes the update processes of step 3. Finally, I have a graph to join the operations.
from dagster import DynamicOut, DynamicOutput, graph, op

@op()
def step_1_get_response():
    return {'exemple': 'data'}

@op()
def step_2_get_client_list():
    return ['client_1', 'client_2', 'client_3']  # the number of customers is dynamic

@op(out=DynamicOut())
def parallelize_clients(context, client_list):
    for client in client_list:
        yield DynamicOutput(client, mapping_key=str(client))

@op()
def step_3_update_database_cliente(response, client):
    ...  # UPDATE operation on the client database

@graph()
def job_exemple_graph():
    response = step_1_get_response()
    clients_list = step_2_get_client_list()
    clients = parallelize_clients(clients_list)
    # run the functions in parallel
    clients.map(lambda client: step_3_update_database_cliente(response, client))
According to the documentation, an @op starts as soon as its dependencies are fulfilled, and ops that have no dependency are executed right away, without an exact order of execution. Example: my step 1 and step 2 have no dependencies, so both run in parallel automatically. After the clients are returned, the parallelize_clients() function is executed, and finally the map in the graph dynamically creates several executions according to the number of clients (DynamicOutput).
So far this works, and everything is fine. Here is the problem: I need to execute a specific function only when step 3 is completely finished. Since step 3 is created dynamically, several executions are generated in parallel, and I am not able to make a function run only after all of these parallel executions have finished.
In the graph I tried to put a call to an op "exemplolaststep() step_4" at the end, but step 4 is executed together with step 1 and step 2, and I really want step 4 to execute only after step 3; I cannot get this to work. Could someone help me?
I tried to create a fake dependency with
@op(ins={"start": In(Nothing)})
def step_4():
    pass
and in the graph, when calling the operations, I tried to pass the map call into the step_4() call. Example:
@graph()
def job_exemple_graph():
    response = step_1_get_response()
    clients_list = step_2_get_client_list()
    clients = parallelize_clients(clients_list)
    # run the functions in parallel
    step_4(start=clients.map(lambda client: step_3_update_database_cliente(response, client)))
I have tried other approaches as well, but to no avail.

You just need to add a .collect() call on the mapped function in your graph, to indicate that all the parallel operations should join before moving on. Something like
@graph()
def job_exemple_graph():
    response = step_1_get_response()
    clients_list = step_2_get_client_list()
    clients = parallelize_clients(clients_list)
    # run the functions in parallel
    step_4(
        start=clients.map(
            lambda client: step_3_update_database_cliente(response, client)
        ).collect()
    )
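For completeness, a small sketch of my own (not part of the original answer) showing how the graph could be materialized as a job and executed; step_4 only starts once the collected step_3 outputs are all available:

# hypothetical usage: turn the graph into a runnable job
job_exemple = job_exemple_graph.to_job()

if __name__ == "__main__":
    # execute_in_process() is convenient for local testing; in a deployment
    # the multiprocess executor runs the mapped step_3 copies in parallel
    result = job_exemple.execute_in_process()
    assert result.success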

Related

how to create a chain of dynamic tasks?

I am trying to create a DAG with a chain of dynamic tasks.
First of all, I started with the expand function. The problem is that the program waits until all of the add_one tasks are finished and only then starts the mul_two tasks, while I need each mul_two to run immediately after its add_one. Then I came up with the following code for building the graph:
from airflow import DAG
from airflow.decorators import task
from airflow.models import Variable
from datetime import datetime
from time import sleep

with DAG(dag_id="simple_maping", schedule='* * * * *', start_date=datetime(2022, 12, 22)) as dag:

    @task
    def read_conf():
        conf = Variable.get('tables', deserialize_json=True)
        return conf

    @task
    def add_one(x: str):
        sleep(5)
        return x + '1'

    @task
    def mul_two(x: str):
        return x * 2

    for i in read_conf():
        mul_two(add_one(i))
But now there is an error: 'xcomarg' object is not iterable. I can fix it by simply removing the task decorator from the read_conf method, but I am not sure that is the best decision, because in my case the list of configuration names could contain more than 1000 elements. Without the decorator, the method has to read the configuration every time the scheduler parses the graph.
Maybe the load without the decorator will not be critical? Or is there a way to make the object iterable? What is the right way to do this?
EDIT: This solution has a bug in 2.5.0 which was solved for 2.5.1 (not released yet).
Yes, when you chain dynamically mapped tasks, the latter (mul_two) will by default wait until all mapped instances of the first task (add_one) are done, because the default trigger rule is all_success. While you can change the trigger rule, for example to one_done, this will not solve your issue, because the second task decides only once, when it first starts running, how many mapped task instances it creates (with one_done it creates only one mapped task instance, so that is not helpful for your use case).
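For context, here is a minimal sketch of my own (not code from the question or answer) of the chained .expand() mapping this paragraph describes, assuming Airflow 2.3+ dynamic task mapping; with the default all_success trigger rule, every mul_two instance waits for all add_one instances:

from airflow import DAG
from airflow.decorators import task
from datetime import datetime

with DAG(
    dag_id="chained_mapping_sketch",   # hypothetical DAG id
    schedule=None,
    start_date=datetime(2022, 12, 22),
    catchup=False,
) as dag:

    @task
    def add_one(x: str):
        return x + "1"

    @task
    def mul_two(x: str):
        return x * 2

    # mul_two maps over the output of the mapped add_one task; it is only
    # scheduled once every add_one instance has finished (all_success)
    mul_two.expand(x=add_one.expand(x=["a", "b", "c"]))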
The issue with the for-loop (and why Airflow won't allow you to iterate over an XComArg) is that for-loops are evaluated when the DAG code is parsed, which happens outside of runtime, when Airflow does not yet know how many results read_conf() will return. If the number of configurations changes only rarely, then having a for-loop like that iterate over a list in a separate file is an option, but at scale this can cause performance issues.
The best solution in my opinion is to use dynamic task group mapping which was added in Airflow 2.5.0:
All mapped task groups will run in parallel, one for every input from read_conf(), so for every add_one its mul_two will run immediately. I put the code for this below.
One note: you will not be able to see the mapped task groups in the Airflow UI or access their logs just yet; the feature is still quite new and the UI extension should come in 2.5.1. That is why I added a task downstream of the mapped task groups that prints out the list of results of the mul_two tasks, so you can check whether it is working.
from airflow import DAG
from airflow.decorators import task, task_group
from datetime import datetime
from time import sleep

with DAG(
    dag_id="simple_mapping",
    schedule=None,
    start_date=datetime(2022, 12, 22),
    catchup=False
) as dag:

    @task
    def read_conf():
        return [10, 20, 30]

    @task_group
    def calculations(x):

        @task
        def add_one(x: int):
            sleep(x)
            return x + 1

        @task()
        def mul_two(x: int):
            return x * 2

        mul_two(add_one(x))

    @task
    def pull_xcom(**context):
        pulled_xcom = context["ti"].xcom_pull(
            task_ids=['calculations.mul_two'],
            key="return_value"
        )
        print(pulled_xcom)

    calculations.expand(x=read_conf()) >> pull_xcom()
Hope this helps! :)
PS: you might want to set catchup=False unless you want to backfill a few weeks of tasks.

reusable traversal components not always working with gremlin

I'm trying to create reusable components for my traversals using gremlin python by putting traversal components into functions and I'm running into a problem where some of the traversal components aren't working correctly.
As setup I'm running gremlin server using the docker container with the configuration file loading in the modern graph from the github repo
docker run -p 8182:8182 tinkerpop/gremlin-server:3.4.6 conf/gremlin-server-modern.yaml
My test python code looks like the following:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

def connect_gremlin(endpoint='ws://localhost:8182/gremlin'):
    return traversal().withRemote(DriverRemoteConnection(endpoint, 'g'))

def n():
    return __.values('name')

def r():
    return __.range(2, 4)

g = connect_gremlin()

# works as expected
g.V().map(n()).toList()

# returns an empty list
g.V().map(n()).filter(r()).toList()

# but using range step directly works as expected
g.V().map(n()).range(2, 4).toList()
I can successfully move the values step into a function but when I try to do the same thing with the range step it returns an empty list rather than the 2nd through 4th items. Anyone know what I'm doing wrong?
The map step is intended to map the state of each traverser to a new state. In the context of a single traverser a range starting anywhere but zero is not going to do what you expect.
Here are some examples using Python:
>>> g.V().map(__.range(0,1)).limit(5).toList()
[v[1400], v[1401], v[1402], v[1403], v[1404]]
>>> g.V().map(__.range(0,2)).limit(5).toList()
[v[1400], v[1401], v[1402], v[1403], v[1404]]
>>> g.V().map(__.range(1,2)).limit(5).toList()
[]
This is why the values step works inside a map step and range does not.
Rather than inject code using a map step, why not just incrementally add to the traversal and then iterate it when complete?
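Following that suggestion, here is a minimal sketch of my own (not from the original answer; it reuses the connect_gremlin helper defined in the question) of composing reusable pieces by passing the traversal along and appending steps to it, instead of injecting sub-traversals through map():

# each helper receives the traversal built so far and returns it with more
# steps appended, so range() sees the whole stream of traversers
def names(t):
    return t.values('name')

def middle(t, low=2, high=4):
    return t.range(low, high)

g = connect_gremlin()
print(middle(names(g.V())).toList())  # same as g.V().values('name').range(2, 4).toList()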

R - doRedis - Overwrite getTask to control the order of execution in parallel foreach loops

Problem: I need to control the order in which tasks are processed in parallel by a foreach loop. Unfortunately, this is not supported by foreach.
Solution in mind: use doRedis so that the database holds all the tasks that are executed in the foreach loop. To control the order, I want to override getTask via setGetTask so that tasks are fetched in a pre-specified order. However, I could not find much documentation on how to do this.
Additional Information:
There is a small paragraph on setGetTask with an example in the doRedis documentation.
getTask <- function ( queue , job_id , ...)
{
key <- sprintf("
redisEval("local x=redis.call('hkeys',KEYS[1])[1];
if x==nil then return nil end;
local ans=redis.call('hget',KEYS[1],x);
redis.call('hdel',KEYS[1],x);i
return ans",key)
}
setGetTask(getTask)
I do think the code in the documentation is syntactically incorrect (missing, imho, a quotation mark " and a closing bracket ")"). I thought this was not possible on CRAN, as the code in the documentation is executed on submission.
Changing the getTask function does not change anything with regard to how the workers get tasks (even when introducing obvious nonsense into the redisEval call, like changing it to redisEval("dddddddddd(((")).
I only had access to the setGetTask function after installing the package from source (downloaded from the official CRAN package page, version 1.1.1), which imho should make no difference compared to installing it directly from CRAN.
Data: the data frame of tasks to execute looks like the following:
taskName;taskQueuePosition;parameter1;parameterN
taskT;1;val1;10
taskK;2;val2;8
taskP;3;val3;7
taskA;4;val4;7
I want to use 'taskQueuePosition' to control the order, tasks with lower numbers should be executed first.
Questions:
Does anybody know any sources where I can get more information on doing this with doRedis or on setGetTask?
Does anybody know how I need to change getTask to achieve the above described?
Any other smart ideas to control the order of execution in a foreach loop? Preferably so that at some point I can use doRedis as parallel back end (changing this would mean a major change in the processing due to complicated technical infrastructure reasons).
Code (for easy reproduction):
The following assumes that the redis-server is started on the local machine.
Redis DB Filling:
library(doRedis)
library(foreach)
options('redis:num'=TRUE) # needed for proper execution
REDIS_JOB_QUEUE = "jobs"
registerDoRedis(REDIS_JOB_QUEUE)
# filling up the data frame
taskDF = data.frame(taskName=c("taskT","taskK","taskP","taskA"),
                    taskQueuePosition=c(1,2,3,4),
                    parameter1=c("val1","val2","val3","val4"),
                    parameterN=c(10,8,7,7))
foreach(currTask=iter(taskDF, by='row'),
        .verbose = T
) %dopar% {
    print(paste("Executing task: ", currTask$taskName))
    Sys.sleep(currTask$parameterN)
}
removeQueue(REDIS_JOB_QUEUE)
Worker:
library(doRedis)
REDIS_JOB_QUEUE = "jobs"
startLocalWorkers(n=1, queue=REDIS_JOB_QUEUE)
I was able to solve the problem and can now control the order of task execution.
Additional information:
1) There seems to be a typo in the documentation that renders the getTask example non-working. Judging from the form of the default_getTask function in the file task.R in the package, it should probably look something like:
getTaskDefault <- function ( queue , job_id , ...)
{
    key <- sprintf("%s:%s", queue, job_id)
    return(redisEval("local x=redis.call('hkeys',KEYS[1])[1];
                      if x==nil then return nil end;
                      local ans=redis.call('hget',KEYS[1],x);
                      redis.call('set', KEYS[1] .. '.start.' .. x, x);
                      redis.call('hdel',KEYS[1],x);
                      return ans", key))
}
It seems that the characters after the first percent sign in the first line of the function body got lost. This would explain the uneven number of brackets and quotes.
2) setGetTask still does not have any effect for me. However, when I set the getTask function through .options.redis while the DB is filled (as described in the vignette of the package), it is successfully called.
3) The information in 2) means that I do not need the setGetTask function, so I can use the package from CRAN.
----- Answers to the questions -----
1) The doRedis vignette describes how a custom getTask can be successfully set.
2 and 3) When the Lua script in the getTask function is modified as below, the tasks are drawn from the database in the order in which they were submitted. This is not exactly what I was asking for, but due to time constraints, and the fact that I have (or rather had) no idea about Lua scripting, it is imho a satisfactory solution: the order of execution can be controlled via the order of submission, i.e. by the taskQueuePosition column.
getTaskInOrder <- function ( queue , job_id , ...)
{
    key <- sprintf("%s:%s", queue, job_id)
    return(redisEval("
        local tasks=redis.call('hkeys',KEYS[1]); -- get all tasks
        local x=tasks[1];                        -- get the first available task
        if x==nil then                           -- if there are no tasks left, stop processing
            return nil
        end;
        local xMin = 65535; -- if we have more than 65535 tasks, getting the
                            -- task with the lowest taskID is not guaranteed to be the first one
        local i = 1;
        -- local iMinFound = -1;
        while (x ~= nil) do -- search the array until there are no tasks left
            -- print('x: ',x)
            local xNum = tonumber(x);
            if(xNum<xMin) then
                xMin = xNum;
                -- iMinFound = i;
            end
            i=i+1;
            -- print('i is now: ',i);
            x=tasks[i];
        end
        -- print('Minimum is task number',xMin,' found at i ', iMinFound)
        x=tostring(xMin) -- convert it back to a string (maybe it would
                         -- be better to keep the original string somewhere,
                         -- in case we lose some information whilst converting to number)
        -- print('x is now:',x);
        -- print(KEYS[1] .. '.start.' .. x, x);
        -- print('');
        local ans=redis.call('hget',KEYS[1],x);
        redis.call('set', KEYS[1] .. '.start.' .. x, x);
        redis.call('hdel',KEYS[1],x);
        return ans", key))
}
Important note: I noticed that if a task is aborted, the order gets screwed up and the resubmitted task (even though the task number remains the same) will be executed after the originally submitted tasks. This is okay for me.
------ Code (for easy reproduction):------
This leads to the following code example (with 12 entries in the task data frame instead of the original 4):
Redis DB Filling:
library(doRedis)
library(foreach)
options('redis:num'=TRUE) # needed for proper execution
REDIS_JOB_QUEUE = "jobs"

getTaskInOrder <- function ( queue , job_id , ...)
{
    ...like above
}

registerDoRedis(REDIS_JOB_QUEUE)
# filling up the data frame already in order of tasks to be executed,
# otherwise the data frame has to be sorted by taskQueuePosition
taskDF = data.frame(taskName=c("taskA","taskB","taskC","taskD","taskE","taskF","taskG","taskH","taskI","taskJ","taskK","taskL"),
                    taskQueuePosition=c(1,2,3,4,5,6,7,8,9,10,11,12),
                    parameter1=c("val1","val2","val3","val4","val1","val2","val3","val4","val1","val2","val3","val4"),
                    parameterN=c(5,5,5,4,4,4,4,3,3,3,2,2))
foreach(currTask=iter(taskDF, by='row'),
        .verbose = T,
        .options.redis = list(getTask = getTaskInOrder)
) %dopar% {
    print(paste("Executing task: ", currTask$taskName))
    Sys.sleep(currTask$parameterN)
}
removeQueue(REDIS_JOB_QUEUE)
Worker:
library(doRedis)
REDIS_JOB_QUEUE = "jobs"
startLocalWorkers(n=1, queue=REDIS_JOB_QUEUE)
Another note: just in case you are processing long jobs, as I do, please note a bug in doRedis 1.1.1 (the current version on CRAN) which leads to tasks being resubmitted (due to a timeout) even though the workers are still working on them.

Multiprocessing with worker.run() works in series instead of in parallel?

I'm trying to create a program which in its essence works like this:
import multiprocessing
import time

def worker(numbers):
    print(numbers)
    time.sleep(2)
    return

if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    p1 = multiprocessing.Process(target=worker, args=([0,1,2,3,4],))
    p2 = multiprocessing.Process(target=worker, args=([5,6,7,8],))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    while(1):
        p1.run()
        p2.run()
        p1.join()
        p2.join()
        print('Done!')
The first time the processes are called via p#.start(), they are executed in parallel. The second time they are called via the p#.run() method, they are executed in series.
How can I make sure the subsequent method calls are also performed in parallel?
Edit: It is important that the processes start together. It cannot happen that process 1 gets executed twice while process 2 only gets executed once.
Edit: I should also note that this code is running on a raspberry pi v3 model B.
As far as I know, a process (just like a thread) can only be started once. After that, calling the run method simply executes the target as a plain function in the current process, one call after the other. That's why it isn't run in parallel.
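One way to keep every round parallel, sketched below as a suggestion of my own (not from the answer above): since a Process object can only be started once, create fresh Process objects on each iteration and join them both before the next round, which also keeps the two workers in lockstep:

import multiprocessing
import time

def worker(numbers):
    print(numbers)
    time.sleep(2)

if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    while True:
        # fresh Process objects every round: start() may only be called once
        p1 = multiprocessing.Process(target=worker, args=([0, 1, 2, 3, 4],))
        p2 = multiprocessing.Process(target=worker, args=([5, 6, 7, 8],))
        p1.start()
        p2.start()
        # joining both before the next round guarantees neither worker
        # runs twice while the other runs only once
        p1.join()
        p2.join()
        print('Round done!')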

Callback from "multiprocessing" with CFFI segfaults after ~100 iterations

A PyPy callback that works perfectly (in an infinite loop) when implemented (straightforwardly) as a method of a Python object segfaults after approximately 100 iterations when I move the Python object into a separate multiprocessing process.
In the main code I have:
import multiprocessing as mp

# 'ffi' and 'lib' are assumed to be created elsewhere via cffi
# (FFI(), cdef with the Data struct, dlopen of the C library)

class Task(object):
    def __init__(self, com, lib):
        self.com = com  # communication queue
        self.lib = lib  # ffi library
        self.proc = mp.Process(target=self.spawn, args=(self.com,))
        self.register_callback()

    def spawn(self, com):
        print('%s spawned.' % self.name)
        # loop (keeping 'self' alive) until BREAK:
        while True:
            cmd = com.get()
            if cmd == self.BREAK:
                break
        print("%s stopped." % self.name)

    @ffi.callback("int(void*, Data*)")  # old cffi (ABI mode)
    def callback(self, data):
        # <work on data>
        return 1

    def register_callback(self):
        s = ffi.new_handle(self)
        self.lib.register_callback(s, self.callback)  # C-call
The idea is that multiple tasks should serve an equal number of callbacks concurrently. I have no clue what may cause the segfault, especially since it runs fine for the first ~100 iterations or so. Help much appreciated!
Solution
The handle 's' is garbage collected when returning from 'register_callback()'. Making the handle an attribute of 'self' and passing that attribute keeps it alive.
Standard CPython (cffi 1.6.0) segfaulted at the first iteration (i.e. garbage collection was immediate) and provided me with a crucial, informative error message. PyPy, on the other hand, segfaulted after approximately 100 iterations without providing a message... Both run fine now.
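Based on that description, the fix could look roughly like the sketch below (the attribute name _handle is my own choice, not from the original code):

def register_callback(self):
    # keep a reference on self so the handle outlives register_callback()
    # and is not garbage collected while the C side still holds the pointer
    self._handle = ffi.new_handle(self)
    self.lib.register_callback(self._handle, self.callback)  # C-call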
