Apache Airflow connection hook is instantiating multiple connections - airflow

Background: Apache Airflow documentation reads:
Hooks Hooks act as an interface to communicate with the external
shared resources in a DAG. For example, multiple tasks in a DAG can
require access to a MySQL database. Instead of creating a connection
per task, you can retrieve a connection from the hook and utilize it.
I have tried spawning 10 tasks using different databases: MySQL, Postgres, MongoDB. Please note that I am using one database (e.g. MySQL) in one DAG (consisting of 10 tasks).
But all tasks are instantiating a new connection.
Example of my task:
from airflow.providers.postgres.hooks.postgres import PostgresHook  # airflow.hooks.postgres_hook in Airflow 1.x

conn_id = kwargs.get('conn_id')              # the Airflow connection id, not a connection string
pg = PostgresHook(postgres_conn_id=conn_id)
pg_query = "...."
records = pg.get_records(pg_query)
Why is Airflow instantiating a new connection for every task when the documentation itself reads "... multiple tasks in a DAG can require access to a MySQL database. Instead of creating a connection per task, you can retrieve a connection from the hook and utilize it ..."?
What is being missed here?

I believe what they mean to say with that part of the documentation is that hooks prevent you from redefining the same credentials over and over again. By "connection" they are referring to an Airflow connection you define in the web interface, not an actual network connection to a host.
If you think about it this way:
A task can be scheduled on any of your Airflow worker nodes (say there are 3 of them).
Your 10 tasks are divided between these 3.
How would they be able to share the same network connection if they run on different hosts? It would be very hard to maintain those connections across workers.
But don't worry, it also took ages for me to understand what they meant there.
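To make the distinction concrete, here is a minimal sketch of what a hook actually retrieves: the stored credential record, not a live database connection. The conn_id "my_postgres" is a placeholder for a connection defined under Admin -> Connections.

from airflow.hooks.base import BaseHook  # airflow.hooks.base_hook.BaseHook in Airflow 1.x

# "my_postgres" is a hypothetical conn_id defined in the Airflow UI or via env vars.
conn = BaseHook.get_connection("my_postgres")
# The credentials are reused across tasks; the actual TCP connection is opened per task.
print(conn.host, conn.login, conn.port)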

Related

Airflow can not connect to Mysql Server in multiple tasks

So, I am using Airflow and I made a pretty large DAG, with multiple different tasks all using cursors and hooks to connect to and interact with the database. Let's say, for example, the first 3 tasks will work successfully, but then the 4th one will say it can't connect to the MySQL server, even though they use the same connections, which I defined as environment variables in the Airflow interface. However, sometimes if I just re-run it without changing anything, it will connect and work. Any ideas?

Airflow - How to configure that all DAG's tasks run in 1 worker

I have a DAG with 2 tasks:
download_file_from_ftp >> transform_file
My concern is that tasks can be performed on different workers. The file will be downloaded on the first worker and will be transformed on another worker. An error will occur because the file is missing on the second worker. Is it possible to configure the DAG so that all tasks are performed on one worker?
It's a bad practice. Even if you find a workaround, it will be very unreliable.
In general, if your executor allows it, you can configure tasks to execute on a specific worker type. For example, with the CeleryExecutor you can assign tasks to a specific queue (see the sketch after this answer). Assuming there is only 1 worker consuming from that queue, your tasks will be executed on the same worker, BUT the fact that it's 1 worker doesn't mean it will be the same machine. It depends heavily on the infrastructure you use. For example: when you restart your machines, do you get the exact same machine back or is a new one spawned?
I highly advise you: don't go down this road.
To solve your issue, either download the file to shared storage like S3, Google Cloud Storage, etc., so that all workers can read it from the cloud, or combine the download and transform into a single operator so that both actions are executed together.
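For completeness, pinning both tasks to a dedicated Celery queue looks roughly like the sketch below. The DAG id, queue name and bash commands are placeholders, and this still only guarantees "same worker" if exactly one worker consumes that queue (started with something like `airflow celery worker --queues ftp_worker`; the CLI differs slightly in Airflow 1.x).

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator  # airflow.operators.bash_operator in Airflow 1.x

with DAG("ftp_pipeline", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    # Both tasks are routed to the hypothetical "ftp_worker" Celery queue.
    download_file_from_ftp = BashOperator(
        task_id="download_file_from_ftp",
        bash_command="echo 'download'",
        queue="ftp_worker",
    )
    transform_file = BashOperator(
        task_id="transform_file",
        bash_command="echo 'transform'",
        queue="ftp_worker",
    )
    download_file_from_ftp >> transform_file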

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure-managed PostgreSQL (8 CPU). We have a DAG that has about 30 tasks; each task uses a KubernetesPodOperator (using apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is configured with the official Airflow Helm chart. The executor is Celery.
Issue
Usually the first 5 or so tasks execute successfully (taking 1 or 2 minutes each) and get marked as done (and colored green) in the Airflow UI. The tasks after that are also successfully executed on AKS, but are not marked as completed in Airflow. In the end this leads to the following error message, and the already finished task is marked as failed:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866] {base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y, where the link to Stack Overflow in the post no longer works.
The metadata database (Azure-managed PostgreSQL) is not overloaded, and the AKS node pool we are using does not show any sign of stress either. It seems like the scheduler cannot pick up / detect a finished task after a couple of tasks have run.
We also looked at several configuration options as stated here.
We have been trying to get this solved for a number of days now, but unfortunately without success.
Does anyone have any ideas what the cause could be? Any help is appreciated!

Create and use Connections in Airflow operator at runtime [duplicate]

This question already has answers here:
Is there a way to create/modify connections through Airflow API (5 answers)
Closed 4 years ago.
Note: This is NOT a duplicate of
Export environment variables at runtime with airflow
Set Airflow Env Vars at Runtime
I have to trigger certain tasks on remote systems from my Airflow DAG. The straightforward way to achieve this is SSHHook.
The problem is that the remote system is an EMR cluster which is itself created at runtime (by an upstream task) using EmrCreateJobFlowOperator. So while I can get hold of the job_flow_id of the launched EMR cluster (using XCOM), what I need is an ssh_conn_id to be passed to each downstream task.
Looking at the docs and code, it is evident that Airflow will try to look up this connection (using conn_id) in the db and in environment variables, so now the problem boils down to being able to set either of these two properties at runtime (from within an operator).
This seems like a rather common problem, because if it isn't achievable then the utility of EmrCreateJobFlowOperator would be severely hampered; but I haven't come across any example demonstrating it.
Is it possible to create (and also destroy) either of these from within an Airflow operator?
Connection (persisted in Airflow's db)
Environment Variable (should be accessible to all downstream tasks and not just the current task, as told here)
If not, what are my options?
I'm on
Airflow v1.10
Python 3.6.6
emr-5.15 (can upgrade if required)
Connections come from the ORM
Yes, you can create connections at runtime, even at DAG creation time if you're careful enough. Airflow is completely transparent on its internal models, so you can interact with the underlying SqlAlchemy directly. As exemplified originally in this answer, it's as easy as:
from airflow.models import Connection
from airflow import settings

def create_conn(username, password, host=None):
    new_conn = Connection(conn_id=f'{username}_connection',
                          login=username,
                          host=host if host else None)
    new_conn.set_password(password)

    session = settings.Session()
    session.add(new_conn)
    session.commit()
Where you can, of course, interact with any other extra Connection properties you may require for the EMR connection.
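For example, a downstream callable could register the connection once the EMR master address is known; the values below are placeholders, and for an SSH connection you would likely also want to pass conn_type='ssh' and a port:

# Hypothetical usage; Connection also accepts conn_type, port, schema and extra.
create_conn(username="hadoop", password="dummy-password",
            host="ec2-1-2-3-4.compute.amazonaws.com")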
Environments are process-bound
This is not a limitation of Airflow or Python, but (AFAIK for every major OS) environments are bound to the lifetime of a process. When you export a variable in bash, for example, you're simply stating that when you spawn child processes, you want to copy that variable to the child's environment. This means that the parent process can't change the child's environment after its creation, and the child can't change the parent's environment.
In short, only the process itself can change its environment after it's created. And considering that worker processes are Airflow subprocesses, it's hard to control the creation of their environments as well. What you can do is write the environment variables into a file and intentionally update the current environment with overrides from that file on each task start (a sketch follows).
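A minimal sketch of that file-based approach, assuming an upstream task has written the variables as a JSON object of strings to a path every worker can reach; the path and helper name are made up:

import json
import os

def load_env_overrides(path="/tmp/emr_env.json"):
    # Merge variables written by an upstream task into this process's environment.
    # Values must already be strings, since os.environ only accepts str.
    if os.path.exists(path):
        with open(path) as f:
            os.environ.update(json.load(f))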
One way to do this is to create an Airflow task after EmrCreateJobFlowOperator that uses a BashOperator, probably calling the AWS CLI to retrieve the IP address of the machine where you want to run the task, and in the same task run the Airflow CLI to create an SSH connection using that IP address.
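A sketch of that approach, assuming the cluster is created by a task with task_id 'create_emr_cluster' (which pushes the job flow id to XCom as its return value); the conn_id, login and task ids are placeholders, and the exact `airflow connections` flags differ between Airflow 1.10 and 2.x:

from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in Airflow 2.x

# Add this task to the same DAG, downstream of the EmrCreateJobFlowOperator.
register_emr_ssh_conn = BashOperator(
    task_id="register_emr_ssh_conn",
    bash_command=(
        "MASTER_DNS=$(aws emr describe-cluster "
        "--cluster-id {{ task_instance.xcom_pull(task_ids='create_emr_cluster') }} "
        "--query Cluster.MasterPublicDnsName --output text) && "
        "airflow connections --add --conn_id emr_ssh --conn_type ssh "
        "--conn_host $MASTER_DNS --conn_login hadoop"
    ),
)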

Sharing large intermediate state between Airflow tasks

We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?
To clarify something: no matter how you set up Airflow, there will only be one executor running.
The executor runs on the same machine as the scheduler.
Currently (Airflow 1.9.0 at the time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
The Local executor executes tasks on the same machine as the scheduler.
The Celery executor just puts tasks in a queue to be worked on by the Celery workers.
However, the question you are asking does apply to Celery workers. If you use the Celery executor you will probably have multiple Celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage and write the output file name to XCom. Then, when you need the output from a previous task, you would read the file name from that task's XCom and process that file (a sketch follows).
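A minimal sketch of that pattern, assuming a shared mount visible to all Celery workers; the path and task ids are placeholders, and in Airflow 1.x the callables would be wired into PythonOperator tasks with provide_context=True:

import os

SHARED_DIR = "/mnt/shared"  # hypothetical network mount available on every worker

def produce(**context):
    # Runs in a PythonOperator; the returned path (not the data) is pushed to XCom.
    path = os.path.join(SHARED_DIR, "intermediate.csv")
    with open(path, "w") as f:
        f.write("some,large,result\n")
    return path

def consume(**context):
    # Runs in a downstream PythonOperator; pull the file name, then read the shared file.
    path = context["ti"].xcom_pull(task_ids="produce")
    with open(path) as f:
        data = f.read()
    return len(data)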
Change the datatype of the value column in the xcom table of the Airflow metastore.
The default datatype of that column is BLOB.
Change it to LONGBLOB. That lets you store up to 4 GB of intermediate data between tasks.
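A sketch of that change, assuming a MySQL metastore (LONGBLOB is MySQL-specific) and the default xcom table layout; you could equally run the ALTER statement directly against the database, but here it is driven through Airflow's own SQLAlchemy engine:

from airflow import settings
from sqlalchemy import text

# Widen the BLOB column of the xcom table so larger payloads fit (MySQL metastore only).
with settings.engine.connect() as conn:
    conn.execute(text("ALTER TABLE xcom MODIFY `value` LONGBLOB"))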
