Task error: 'multiprocessing.Queue' cannot be JSON serialized - airflow

I have a task decorator that returns an instance of multiprocessing.Queue. Upon running the DAG I get an error message stating that type multiprocessing.Queue cannot be JSON serialized. I think it cannot be pushed to XCom. Is the only way around this to dump the queue's contents into a list and have the task return that?
I am using the billiard package fork of the multiprocessing python package (i.e. import billiard as multiprocessing). This was suggested in this chat: https://github.com/apache/airflow/issues/14896
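Draining the queue into a plain list (or any JSON-serializable structure) before returning is a straightforward workaround, since XCom only stores what its backend can serialize. A minimal sketch assuming Airflow 2's TaskFlow API; the task body and names are illustrative:
import billiard as multiprocessing  # as in the question
from airflow.decorators import task

@task
def collect_results():
    queue = multiprocessing.Queue()
    # ... start worker processes that put() results on the queue, then join() them ...
    queue.put("example")  # stands in for real worker output

    # drain the queue into a plain list; assumes all producers have finished
    results = []
    while not queue.empty():
        results.append(queue.get())
    return results  # a list is JSON-serializable, so it can be stored in XCom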

Related

xcom_pull is not working in dynamic task generation function

I have a function which dynamically creates a sub-task, and in it I read a value with xcom_pull, but there I get the error:
File "/home/airflow/gcs/recon_nik_v6.py", line 168, in create_audit_task
my_dict=kwargs["ti"].xcom_pull(task_ids='accept_input_from_cli', key='my_ip_dict')
KeyError: 'ti'
If I use the same my_dict=kwargs["ti"].xcom_pull(task_ids='accept_input_from_cli', key='my_ip_dict') code in another function it works, but in this dynamic part it does not.
Similarly to your other questions (and as explained in Slack several times): this is not how Airflow works.
XCom pulls and task instances are only available while a DAG Run is being executed. When you create your DAG structure (i.e. dynamically generate DAGs) you cannot use them.
Only task instances can access XCom, while their tasks are executing, and that happens long after the DAGs have been parsed and the DAG structure established.
So what you are trying to do is simply impossible.
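One way to restructure it, sketched under the assumption of Airflow 2's PythonOperator: do the xcom_pull inside the python_callable, which only runs at execution time when the task instance exists. The task ids are taken from the question; the helper name is illustrative:
from airflow.operators.python import PythonOperator

def create_audit_task(dag):
    def _audit(**kwargs):
        # runs at execution time, so the task instance (ti) is available here
        my_dict = kwargs["ti"].xcom_pull(task_ids="accept_input_from_cli",
                                         key="my_ip_dict")
        return my_dict

    # only the structure (task id, dependencies) is decided at parse time
    return PythonOperator(task_id="create_audit_task",
                          python_callable=_audit,
                          dag=dag)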

Multithreaded Flask application causes stack error in rpy2 R process

Essentially the same error as here, but those solutions do not provide enough information to replicate a working example: Rpy2 in a Flask App: Fatal error: unable to initialize the JIT
Within my Flask app, using the rpy2.rinterface module, whenever I initialize R I receive the same stack usage error:
import rpy2.rinterface as rinterface
from rpy2.rinterface_lib import openrlib
with openrlib.rlock:
    rinterface.initr()
Error: C stack usage 664510795892 is too close to the limit
Fatal error: unable to initialize the JIT
rinterface is the low-level R hook in rpy2, but the higher-level robjects module gives the same error. I've tried wrapping the context lock and R initialization in a Process from the multiprocessing module, but I get the same issue. The docs say that a multithreaded environment will cause problems for R: https://rpy2.github.io/doc/v3.3.x/html/rinterface.html#multithreading
But the context manager doesn't seem to prevent the issue when interfacing with R.
rlock is an instance of Python's threading.RLock. It should take care of multithreading issues.
However, multiprocessing can cause a similar issue if the embedded R is shared across child processes. This demo script showing parallel processing with R and Python processes illustrates it: https://github.com/rpy2/rpy2/blob/master/doc/_static/demos/multiproc_lab.py
I think that the way around this is to configure Flask, or more likely your WSGI layer, to create isolated child processes, or to have all of your Flask processes delegate R calculations to a secondary process (created on the fly, or in a pool of processes waiting for tasks to perform).
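A minimal sketch of that "delegate to a secondary process" idea, assuming rpy2 3.x: a single dedicated child process owns the embedded R, and everything else only talks to it through queues. The worker and variable names are illustrative:
import multiprocessing as mp

def r_worker(task_queue, result_queue):
    # R is initialized exactly once, inside this dedicated child process
    import rpy2.robjects as robjects
    for expr in iter(task_queue.get, None):  # None is the shutdown sentinel
        result_queue.put(list(robjects.r(expr)))

if __name__ == "__main__":
    task_queue, result_queue = mp.Queue(), mp.Queue()
    mp.Process(target=r_worker, args=(task_queue, result_queue),
               daemon=True).start()

    # a Flask view would only ever touch the queues, never R itself
    task_queue.put("1 + 1")
    print(result_queue.get())  # [2.0]
    task_queue.put(None)       # let the worker exit cleanly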
As other answers for similar questions have implied, Flask users will need to initialize and run rpy2 outside of the WSGI context to prevent the embedded R process from crashing. I accomplished this with Celery, where workers provide an environment separate from Flask to handle requests made in R.
I used the low-level rinterface library, as mentioned in the question, and wrote class-based Celery tasks:
import rpy2.rinterface as rinterface
from celery import Celery, Task

celery = Celery('tasks', backend='redis://', broker='redis://')

class Rpy2Task(Task):

    def __init__(self):
        self.name = "rpy2"

    def run(self, args):
        # R is initialized here, inside the worker process, outside the WSGI context
        rinterface.initr()
        r_func = rinterface.baseenv['source']('your_R_script.R')
        r_func[0](args)

Rpy2Task = celery.register_task(Rpy2Task())
async_result = Rpy2Task.delay(args)  # called from the client side with the real argument
Calling rinterface.initr() anywhere but in the body of the task run by the worker results in the aforementioned crash. Celery is usually paired with Redis, and I found this a useful way to exchange information between R and Python, but of course rpy2 also provides flexible ways of doing this.
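For completeness, a sketch of how a Flask view might hand work off to that Celery task so the WSGI worker never touches R. The module name, route and argument are illustrative and assume the task above lives in tasks.py:
from flask import Flask, jsonify
from tasks import Rpy2Task  # the registered task from the snippet above (assumed module name)

app = Flask(__name__)

@app.route("/analyze/<script_arg>")
def analyze(script_arg):
    # enqueue the R work; the Flask/WSGI process itself never initializes R
    async_result = Rpy2Task.delay(script_arg)
    return jsonify({"task_id": async_result.id})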

How to run an Airflow DAG that is defined in a test?

I am trying to write a test case where I:
instantiate a collection of (Python)Operators (patching some of their dependencies with unittest.mock.patch)
arrange those Operators in a DAG
run that DAG
make assertions about the calls to various mocked downstreams
I see from here that running a DAG is not as simple as calling dag.run() - I should instantiate a local_client and call trigger_dag on that. However, the resulting code constructs its own DagBag, and does not accept any parameter that allows me to pass in my manually-constructed DAG - so I cannot see how to run this DAG with local_client.
I see a couple of options here:
I could declare my testing DAG in a separate folder, as specified by DagModel.get_current(dag_id).fileloc, so that my DAG will get picked up by trigger_dag and so run - but this seems pretty indirect, and also I doubt that I'd be able to cleanly reference the injected mocks from my test code.
I could directly call api.common.experimental.trigger_dag._trigger_dag, which has a dag_bag argument. Both the experimental in the name and the underscore-prefixed method name suggest that this would be A Bad Idea.
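For what it's worth, newer Airflow releases (2.5+, which may not match the version in the question) add a third option: DAG.test() runs a manually-constructed DAG in-process, without going through a DagBag or local_client, so patched dependencies stay visible to the test. A minimal sketch, assuming an initialized metadata database; the task and helper names are illustrative:
import datetime
from unittest import mock

from airflow import DAG
from airflow.operators.python import PythonOperator

def send_downstream(payload):
    # stand-in for the real downstream dependency you would patch
    raise RuntimeError("should be mocked in the test")

with DAG(dag_id="under_test",
         start_date=datetime.datetime(2021, 1, 1),
         schedule=None) as dag:
    PythonOperator(task_id="notify",
                   python_callable=lambda: send_downstream("payload"))

def test_dag_calls_downstream():
    with mock.patch(__name__ + ".send_downstream") as mocked:
        dag.test()  # runs every task in-process, respecting dependencies
        mocked.assert_called_once_with("payload")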

Preserve detailed Gremlin error message when running Gremlin query with eval()

In my script I do the following:
eval("query")
and get:
unexpected EOF while parsing (<string>, line 1)
In Jupyter I do:
query
and get:
GremlinServerError: 499: {"requestId":"2602387d-f9a1-4478-a90d-3612d1943b71","code":"ConstraintViolationException","detailedMessage":"Vertex with id already exists: ba48297665fc3da684627c0fcb3bb1fd6738e7ad8eb8768528123904b240aaa7b21f66624de1fea84c87e5e2707995fe52435f1fb5fc4c2f9eaf85a605c6877a"}
Is there a way to preserve the detailed error message whilst doing Gremlin queries with the eval("querystring") approach?
I need to concatenate many strings into one query, which is why I use this approach.
Also, the detailed error message allows me to catch errors like ConstraintViolationException.
Details:
I am interacting with Neptune with Python.
I have this at the beginning of my script:
from gremlin_python import statics
statics.load_statics(globals())
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
which is from the official documentation on how to connect with Python.
There is insufficient info in the question to provide a good answer for this one. There should be no difference in the error message you see between a client program and a Jupyter notebook, as long as you're using exactly the same code. From your messages, I suspect that there is a difference in either the serializer or the protocol (websocket vs HTTP) between your experiments. The response formats (and possibly the error formats too) differ between serializers and protocols, so that's probably where you should start looking.
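That said, if the script submits queries through the gremlinpython websocket Client rather than Python's eval over an HTTP call, the server's detailed message is preserved on the raised GremlinServerError and can be matched. A sketch under that assumption; the Neptune endpoint and query are placeholders:
from gremlin_python.driver import client
from gremlin_python.driver.protocol import GremlinServerError

# placeholder endpoint; make sure the notebook and the script use the same
# protocol (websocket vs HTTP) and serializer, per the answer above
gremlin_client = client.Client("wss://your-neptune-endpoint:8182/gremlin", "g")

query = "g.addV('person').property(id, 'some-id')"  # built by concatenation elsewhere

try:
    results = gremlin_client.submit(query).all().result()
except GremlinServerError as err:
    # str(err) keeps the server's code and detailedMessage, so specific
    # failures such as ConstraintViolationException can still be matched
    if "ConstraintViolationException" in str(err):
        print("vertex already exists:", err)
    else:
        raise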

Airflow: how to access running task from task instance?

This is my situation:
I am trying to access the instance of my custom operator running in another DAG. I am able to get the correct DagRun and TaskInstance objects by doing the following.
dag_bag = DagBag(settings.DAGS_FOLDER)
target_dag = dag_bag.get_dag('target_dag_id')
dr = target_dag.get_dagrun(target_dag.latest_execution_date)
ti = dr.get_task_instance('target_task_id')
I have printed the TaskInstance object acquired by the above lines and it is correct (it is running/has the correct task_id etc.), however I am unable to access the task object, which would allow me to interface with the running operator. I should be able to do the following:
running_custom_operator = ti.task  # AttributeError: TaskInstance has no attribute 'task'
Any help would be much appreciated, either following my approach or if someone knows how to access the task object of a running task instance.
Thank you
You can simply grab the task object from the DAG object: target_dag.task_dict["target_task_id"]
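Putting that together with the snippet from the question (the dag and task ids are the question's placeholders):
from airflow import settings
from airflow.models import DagBag

dag_bag = DagBag(settings.DAGS_FOLDER)
target_dag = dag_bag.get_dag("target_dag_id")

dr = target_dag.get_dagrun(target_dag.latest_execution_date)
ti = dr.get_task_instance("target_task_id")

# the operator definition lives on the DAG, not on the TaskInstance
running_custom_operator = target_dag.task_dict["target_task_id"]

# optionally re-attach it so ti.task and ti-level helpers work afterwards
ti.task = running_custom_operator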

Resources