Note: This is NOT a duplicate of
Export environment variables at runtime with airflow
Set Airflow Env Vars at Runtime
I have to trigger certain tasks on remote systems from my Airflow DAG. The straightforward way to achieve this is SSHHook.
The problem is that the remote system is an EMR cluster which is itself created at runtime (by an upstream task) using EmrCreateJobFlowOperator. So while I can get hold of the job_flow_id of the launched EMR cluster (using XCOM), what I need is an ssh_conn_id to pass to each downstream task.
Looking at the docs and code, it is evident that Airflow will try to look up this connection (by conn_id) in the db and in environment variables, so the problem boils down to being able to set either of these two at runtime (from within an operator).
This seems like a rather common problem, because if it isn't achievable then the utility of EmrCreateJobFlowOperator would be severely hampered; yet I haven't come across any example demonstrating it.
Is it possible to create (and also destroy) either of these from within an Airflow operator?
Connection (persisted in Airflow's db)
Environment Variable (should be accessible to all downstream tasks and not just current task as told here)
If not, what are my options?
I'm on
Airflow v1.10
Python 3.6.6
emr-5.15 (can upgrade if required)
Connections come from the ORM
Yes, you can create connections at runtime, even at DAG creation time if you're careful enough. Airflow is completely transparent about its internal models, so you can interact with the underlying SQLAlchemy directly. As exemplified originally in this answer, it's as easy as:
from airflow.models import Connection
from airflow import settings

def create_conn(username, password, host=None):
    # Build the Connection object just as the web UI would
    new_conn = Connection(conn_id=f'{username}_connection',
                          login=username,
                          host=host if host else None)
    new_conn.set_password(password)

    # Persist it through a metadata-db session
    session = settings.Session()
    session.add(new_conn)
    session.commit()
Here you can, of course, set any other Connection properties you may require for the EMR connection.
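For instance, here is a minimal sketch (the conn_id, login, key file path and upstream task are assumptions, not anything from the original post) of a PythonOperator callable that persists an SSH connection pointing at the EMR master:

from airflow import settings
from airflow.models import Connection

def register_emr_ssh_conn(remote_host, **context):
    # remote_host would be the EMR master's DNS/IP resolved by an upstream task
    conn = Connection(conn_id='emr_ssh',
                      conn_type='ssh',
                      host=remote_host,
                      login='hadoop',
                      extra='{"key_file": "/path/to/key.pem"}')
    session = settings.Session()
    # Guard against duplicates when the task is retried
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
        session.commit()
    session.close()

Downstream SSHOperator/SSHHook tasks could then simply reference ssh_conn_id='emr_ssh'.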
Environments are process-bound
This is not a limitation of Airflow or Python, but (AFAIK for every major OS) environments are bound to the lifetime of a process. When you export a variable in bash, for example, you're simply stating that when you spawn child processes, you want to copy that variable into the child's environment. This means that the parent process can't change the child's environment after its creation, and the child can't change the parent's environment.
In short, only the process itself can change its environment after it's created. And considering that worker processes are Airflow subprocesses, it's hard to control the creation of their environments as well. What you can do is to write the environment variables into a file and intentionally update the current environment with overrides from that file on each task start, along the lines of the sketch below.
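A minimal sketch of that workaround (the file path and helper names are assumptions; it also presumes the producing and consuming tasks land on the same worker or share the path):

import os

ENV_FILE = '/tmp/airflow_dynamic_env'   # assumed location visible to the tasks

def write_env(overrides):
    # Called by the upstream task that knows the values (e.g. the EMR master address)
    with open(ENV_FILE, 'w') as f:
        for key, value in overrides.items():
            f.write(f'{key}={value}\n')

def load_env():
    # Called at the start of every downstream task callable
    if not os.path.exists(ENV_FILE):
        return
    with open(ENV_FILE) as f:
        for line in f:
            key, _, value = line.strip().partition('=')
            if key:
                os.environ[key] = value   # only affects the current task's process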
Another approach is to add an Airflow task after EmrCreateJobFlowOperator that uses a BashOperator to fetch (for example via the AWS CLI) the IP address of the machine where you want to run the task, and in the same task runs the Airflow CLI to create an SSH connection using that IP address.
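A hedged sketch of that idea (the task ids, connection name and login are assumptions; EmrCreateJobFlowOperator pushes the job flow id as its return_value XCom):

from airflow.operators.bash_operator import BashOperator

register_emr_ssh = BashOperator(
    task_id='register_emr_ssh',
    bash_command=(
        # Resolve the master node's DNS name from the job flow id created upstream
        "MASTER_DNS=$(aws emr describe-cluster "
        "--cluster-id {{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }} "
        "--query 'Cluster.MasterPublicDnsName' --output text) && "
        # Register an SSH connection through the Airflow 1.10-style CLI
        "airflow connections --add --conn_id emr_ssh --conn_type ssh "
        "--conn_host $MASTER_DNS --conn_login hadoop"
    ),
    dag=dag,   # `dag` is assumed to be the DAG object defined earlier in the file
)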
Related
So, I am using Airflow and I made a pretty large DAG, with multiple different tasks all using cursors and hooks to connect to and interact with the database. Let's say, for example, the first 3 tasks work successfully, but then the 4th one says it can't connect to the MySQL server, even though they all use the same connection, which I defined as an environment variable in the Airflow interface. However, sometimes if I just re-run it without changing anything, it connects and works. Any ideas?
Background: Apache Airflow documentation reads:
Hooks: Hooks act as an interface to communicate with the external shared resources in a DAG. For example, multiple tasks in a DAG can require access to a MySQL database. Instead of creating a connection per task, you can retrieve a connection from the hook and utilize it.
I have tried spawning 10 tasks using different DBs: MySQL, Postgres, MongoDB. Please note that I am using one DB (e.g. MySQL) in one DAG (consisting of 10 tasks).
But all tasks are instantiating a new connection.
Example of my task:
from airflow.hooks.postgres_hook import PostgresHook

conn_string = kwargs.get('conn_id')
pg = PostgresHook(conn_string)
pg_query = "...."
records = pg.get_records(pg_query)
Why is Airflow instantiating a new connection when the Airflow documentation itself reads "... multiple tasks in a DAG can require access to a MySQL database. Instead of creating a connection per task, you can retrieve a connection from the hook and utilize it ..."?
What am I missing here?
I believe what they mean to say with that part of the documentation is that hooks prevent you from redefining the same credentials over and over again. By "connection" they are referring to an Airflow connection you define in the web interface, not an actual network connection to a host.
If you think about it this way:
A task can be scheduled on any of the 3 Airflow worker nodes.
Your 10 tasks are divided between these 3.
How would they be able to share the same network connection if they run on different hosts? It would be very, very hard to maintain those connections across workers.
But don't worry, it also took ages for me to understand what they meant there.
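A small illustration of that reading (the conn_id, table and callables are made up): the credentials live once in the Airflow connection 'analytics_db', and each task builds its own hook, and therefore its own short-lived database connection, from that single definition.

from airflow.hooks.postgres_hook import PostgresHook

def task_one(**context):
    hook = PostgresHook(postgres_conn_id='analytics_db')
    return hook.get_records('SELECT count(*) FROM events')

def task_two(**context):
    # Same Airflow connection definition, but a fresh database connection in this worker process
    hook = PostgresHook(postgres_conn_id='analytics_db')
    return hook.get_records('SELECT max(created_at) FROM events')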
We are setting up Airflow for scheduling/orchestration. Currently we have Spark Python loads and non-Spark loads on different servers, and push files to GCP from yet another server. Is there an option to decide which worker nodes the Airflow tasks are submitted to? Currently we are using SSH connections to run all workloads, and our processing is mostly on-prem.
We use the Celery executor model. How do we make sure that a specific task is run on its appropriate node?
Task 1 runs on a non-Spark server (no Spark binaries available).
Task 2 executes a PySpark submit (this server has the Spark binaries).
Task 3 pushes the files created by task 2 from another server/node (only this one has the GCP utilities installed to push the files, due to security reasons).
If I create a DAG, is it possible to specify that a task should execute on a particular set of worker nodes?
Currently we have a wrapper shell script for each task and make 3 SSH runs to complete the process. We would like to avoid such wrapper shell scripts and instead use the built-in PythonOperator, SparkSubmitOperator, SparkJdbcOperator and SFTPToGCSOperator, while making sure each specific task runs on a specific server or set of worker nodes.
In short, can we have 3 worker node groups and make each task execute on a group of nodes based on the operation?
We can assign a queue to each worker node.
Start each Airflow worker with its queue specified:
airflow worker -q sparkload
airflow worker -q non-sparkload
airflow worker -q gcpload
Then define each task with the corresponding queue, as in the sketch below. A similar thread was found as well:
How can Airflow be used to run distinct tasks of one workflow in separate machines?
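A sketch of routing tasks to those worker groups from the DAG side (the DAG, callables and application path are illustrative assumptions; queue is the relevant BaseOperator argument):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

def run_non_spark_load():
    pass   # placeholder for the non-Spark processing

def push_files_to_gcs():
    pass   # placeholder for the GCP push available only on the gcpload nodes

dag = DAG('queue_routing_example', start_date=datetime(2019, 1, 1), schedule_interval=None)

non_spark_task = PythonOperator(
    task_id='non_spark_load',
    python_callable=run_non_spark_load,
    queue='non-sparkload',   # picked up only by workers started with `airflow worker -q non-sparkload`
    dag=dag,
)

spark_task = SparkSubmitOperator(
    task_id='pyspark_submit',
    application='/path/to/job.py',   # assumed PySpark job available on the sparkload nodes
    queue='sparkload',
    dag=dag,
)

gcp_task = PythonOperator(
    task_id='push_to_gcs',
    python_callable=push_files_to_gcs,
    queue='gcpload',
    dag=dag,
)

non_spark_task >> spark_task >> gcp_task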
We have a kubernetes pod operator that will spit out a python dictionary that will define which further downstream kubernetes pod operators to run along with their dependencies and the environment variables to pass into each operator.
How do I get this python dictionary object back into the executor's context (or is it worker's context?) so that airflow can spawn the downstream kubernetes operators?
I've looked at BranchOperator and TriggerDagRunOperator and XCOM push/pull and Variable.get and Variable.set, but nothing seems to quite work.
We have a kubernetes pod operator that will spit out a python dictionary that will define which further downstream kubernetes pod operators to run
This is possible, albeit not in the way you are trying. You'll have to have all possible KubernetesPodOperators already in your workflow and then skip those that need not be run.
An elegant way to do this would be to attach a ShortCircuitOperator before each KubernetesPodOperator that reads the XCom (dictionary) published by the upstream KubernetesPodOperator and determines whether or not to continue with the downstream task.
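A minimal sketch of that gating pattern (the task ids, XCom source, DAG object and the gated pod operator are assumptions):

from airflow.operators.python_operator import ShortCircuitOperator

def should_run(task_name, **context):
    # Pull the dictionary published by the upstream KubernetesPodOperator
    plan = context['ti'].xcom_pull(task_ids='generate_plan') or {}
    # Returning False short-circuits: the downstream pod operator is skipped
    return task_name in plan

gate_process_images = ShortCircuitOperator(
    task_id='gate_process_images',
    python_callable=should_run,
    op_kwargs={'task_name': 'process_images'},
    provide_context=True,   # Airflow 1.x: pass the context into the callable
    dag=dag,                # `dag` assumed to be defined earlier in the file
)

gate_process_images >> process_images_pod   # process_images_pod: the KubernetesPodOperator being gated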
EDIT-1
Actually, a cleaner way would be to just raise an AirflowSkipException within the task that you want to skip (rather than using a separate ShortCircuitOperator to do this).
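A rough sketch of that variant (the subclass name and the upstream task id are assumptions): check the upstream plan inside execute and skip before launching the pod.

from airflow.exceptions import AirflowSkipException
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

class SkippableKubernetesPodOperator(KubernetesPodOperator):
    def execute(self, context):
        plan = context['ti'].xcom_pull(task_ids='generate_plan') or {}
        if self.task_id not in plan:
            # Marks this task instance as SKIPPED instead of failed
            raise AirflowSkipException(f'{self.task_id} not requested by the upstream plan')
        return super().execute(context)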
How do I get this python dictionary ... so that airflow can spawn the downstream kubernetes operators..
No. You can't dynamically spawn new tasks based on the output of an upstream task.
Think of it this way: for the scheduler it is imperative to know all the tasks (their task_ids, trigger_rules, priority_weight etc.) ahead of time so as to be able to execute them when the right time comes. If tasks were to just keep coming up dynamically, then Airflow's scheduler would have to become akin to an operating-system scheduler (!). For more details read the EDIT-1 part of this answer.
We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?
To clarify something: no matter how you set up Airflow, there will only be one executor running.
The executor runs on the same machine as the scheduler.
Currently (Airflow 1.9.0 at the time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
The Local executor executes the task on the same machine as the scheduler.
The Celery executor just puts tasks in a queue to be worked on by the Celery workers.
However, the question you are asking does apply to Celery workers. If you use the Celery executor you will probably have multiple Celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage, and write the output file name to xcom. Then when you need to input the output from a previous task, you would read the file name from that task's Xcom and process that file.
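A small sketch of that pattern (the mount point, task ids and payload are illustrative):

import os

SHARED_DIR = '/mnt/airflow-shared'   # network mount visible to every Celery worker

def produce_report(**context):
    out_path = os.path.join(SHARED_DIR, 'report_{}.csv'.format(context['ds_nodash']))
    with open(out_path, 'w') as f:
        f.write('some,intermediate,data\n')
    # Publish only the file name through XCom; the payload itself stays on shared disk
    context['ti'].xcom_push(key='report_path', value=out_path)

def consume_report(**context):
    in_path = context['ti'].xcom_pull(task_ids='produce_report', key='report_path')
    with open(in_path) as f:
        print(f.read())   # stand-in for the real downstream processing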
Change the datatype of the value column in the xcom table of the Airflow metastore.
The default datatype of value is BLOB.
Change it to LONGBLOB (on a MySQL metastore); that will let you store up to 4 GB of intermediate data between tasks.