Use DB to generate airflow tasks dynamically - airflow

I want to run an airflow dag like so ->
I have 2 airflow workers W1 and W2.
In W1 I have scheduled a single task (W1-1) but in W2, I want to create X number of tasks (W2-1, W2-2 ... W2-X).
The number X and the bash command for each task will be derived from a DB call.
All tasks for worker W2 should run in parallel after W1 completes.
This is my code
dag = DAG('deploy_single', catchup=False, default_args=default_args, schedule_interval='16 15 * * *')
t1 = BashOperator(
task_id='dummy_task',
bash_command='echo hi > /tmp/hi',
queue='W1_queue',
dag=dag)
get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"
db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()
i = 1
for record in records:
t = BashOperator(
task_id='script_test_'+str(i),
bash_command="{full_command} ".format(full_command=str(record[0])),
queue=str(record[1]),
dag=dag)
t.set_upstream(t1)
i += 1
cursor.close()
connection.close()
However, when I run this, the task on W1 completed successfully but all tasks on W2 failed. In the airflow UI, I can see that it can resolve the correct number of tasks (10 in this case) but each of these 10 failed.
Looking at the logs, I saw that on W2 (which is on a different machine), airflow could not find the db_creds.json file.
I do not want to provide the DB creds file to W2.
My question is how can an airflow task be created dynamically in this case?
Basically i want to run a DB query on the airflow server and assign tasks to one or more workers based on the results of that query. The DB will contain updated info about which engines are active etc I want the DAG to reflect this. From logs, it looks like each worker runs the DB query. Providing access to DB to each worker is not an option.

Thank you #viraj-parekh and #cwurtz.
After much trial and error, found the correct way to use airflow variables for this case.
Step 1) We create another script called gen_var.pyand place it in the dag folder. This way, the scheduler will pick up and generate the variables. If the code for generating variables is within the deploy_single dag then we run into the same dependency issue as the worker will try and process the dag too.
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
import json
import psycopg2
from airflow.models import Variable
from psycopg2.extensions import AsIs
get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"
db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()
hosts = {}
i = 1
for record in records:
comm_dict = {}
comm_dict['full_command'] = str(record[0])
comm_dict['queue_name'] = str(record[1])
hosts[i] = comm_dict
i += 1
cursor.close()
connection.close()
Variable.set("hosts",hosts,serialize_json=True)
Note the call to serialize_json. Airflow will try to store the variable as a string. If you want it to be stored as a dict, then use serialize_json=True. Airflow will still store it as string via json.dumps
Step 2) Simplify the dag and call this "hosts" variable (now deserialize to get back the dict) like so -
hoztz = Variable.get("hosts",deserialize_json=True)
for key in hoztz:
host = hoztz.get(key)
t = BashOperator(
task_id='script_test_'+str(key),
bash_command="{full_command} ".format(full_command=str(host.get('full_command'))),
queue=str(host.get('queue_name')),
dag=dag)
t.set_upstream(t1)
Hope it helps someone else.

One way to do this would be to store the information in an Airflow Variable.
You can fetch the information needed to dynamically generate the DAG (and necessary configs) in a Variable and have W2 access it from there.
Variables are an airflow model that can be used to store static information (information that does not have an associated timestamp) that all tasks can access.

Related

How to find out the delayed jobs in airflow

Some of my DAG are waiting to get scheduled, and some are waiting in the queue. I suspect there are reasons for this delay but not sure how I can start to debug this problem. Majority of the pipelines are running Spark jobs.
Can someone help to give me some directions in terms of where to look at to 1) anaylse which DAGs were delayed (did not start at the scheduled time) 2) where are the places I should look at to find out if the resources are enough. I'm quite new to scheduling in Airflow. Many thanks. Please let me know if I can describe the question better.
If you are looking for code that takes advantage of Airflows' wider capabilities.
There are three modules within airflow.models which can be harnessed.
To programmatically retrieve all DAGs which your Airflow is away of, we import DagBag. From the docs "A dagbag is a collection of dags, parsed out of a folder tree and has high"
We utilise DagModel and the method get_current, to initialise each dag_id present in our bag
We check if any DAG is active using the DagModel property is_paused
We retrieve the latest DAG run using the DagRun.find
Sort the individual dag runs by latest to earliest
Here you could just subset [0] to get 1, however, for your debugging purposes I just loop through them all
DagRun returns a lot of information for us to use. In my loop I have output print(i, run.state, run.execution_date, run.start_date). So you can see what is going on under the hood.
id
state
dag_id
queued_at
execution_date
start_date
end_date
run_id
data_interval_start
data_interval_end
last_scheduling_decision
I have commented out an if check for any queued Dags for you to uncomment. Additionally you can do some arithmetic on dates if you desire, to add further conditional functionality.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import DagBag, DagModel, DagRun
from airflow.operators.python import PythonOperator
# make a function that returns if a DAG is set to active or paused
def check_dag_active():
bag = DagBag()
for dag_id in bag.dags:
in_bag = DagModel.get_current(dag_id)
if not in_bag.is_paused:
latest = DagRun.find(dag_id=dag_id)
latest.sort(key=lambda x: x.execution_date, reverse=True)
for i, run in enumerate(latest):
print(i, run.state, run.execution_date, run.start_date)
# if run.state == 'queued':
# return [run.dag_id, run.execution_date, run.start_date]
with DAG(
'stack_overflow_ans_3',
tags = ['SO'],
start_date = datetime(2022, 1, 1),
schedule_interval = None,
catchup = False,
is_paused_upon_creation = False
) as dag:
t1 = PythonOperator(
task_id = 'task_that_will_fail',
python_callable = check_dag_active
)
Depending on your version of Airflow and your setup, you should be able to query the Airflow DB directly to get this information.
If you're using Airflow 1.x, there should be an "Ad Hoc Query" executor in the Data Profiling tab in the UI. This was disabled in 2.x though, so if you're running 2.x you'll need to connect directly to your Airflow DB using psql or something similar (this differs from Google to AWS to Docker).
Once you're in, check out this link for some queries on DAG runtime.

Can an Airflow task dynamically generate a DAG at runtime?

I have an upload folder that gets irregular uploads. For each uploaded file, I want to spawn a DAG that is specific to that file.
My first thought was to do this with a FileSensor that monitors the upload folder and, conditional on presence of new files, triggers a task that creates the separate DAGs. Conceptually:
Sensor_DAG (FileSensor -> CreateDAGTask)
|-> File1_DAG (Task1 -> Task2 -> ...)
|-> File2_DAG (Task1 -> Task2 -> ...)
In my initial implementation, CreateDAGTask was a PythonOperator that created DAG globals, by placing them in the global namespace (see this SO answer), like so:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime, timedelta
from pathlib import Path
UPLOAD_LOCATION = "/opt/files/uploaded"
# Dynamic DAG generation task code, for the Sensor_DAG below
def generate_dags_for_files(location=UPLOAD_LOCATION, **kwargs):
dags = []
for filepath in Path(location).glob('*'):
dag_name = f"process_{filepath.name}"
dag = DAG(dag_name, schedule_interval="#once", default_args={
"depends_on_past": True,
"start_date": datetime(2020, 7, 15),
"retries": 1,
"retry_delay": timedelta(hours=12)
}, catchup=False)
dag_task = DummyOperator(dag=dag, task_id=f"start_{dag_name}")
dags.append(dag)
# Try to place the DAG into globals(), which doesn't work
globals()[dag_name] = dag
return dags
The main DAG then invokes this logic via a PythonOperator:
# File-sensing DAG
default_args = {
"depends_on_past" : False,
"start_date" : datetime(2020, 7, 16),
"retries" : 1,
"retry_delay" : timedelta(hours=5),
}
with DAG("Sensor_DAG", default_args=default_args,
schedule_interval= "50 * * * *", catchup=False, ) as sensor_dag:
start_task = DummyOperator(task_id="start")
stop_task = DummyOperator(task_id="stop")
sensor_task = FileSensor(task_id="my_file_sensor_task",
poke_interval=60,
filepath=UPLOAD_LOCATION)
process_creator_task = PythonOperator(
task_id="process_creator",
python_callable=generate_dags_for_files,
)
start_task >> sensor_task >> process_creator_task >> stop_task
But that doesn't work, because by the time process_creator_task runs, the globals have already been parsed by Airflow. New globals after parse time are irrelevant.
Interim solution
Per Airflow dynamic DAG and task Ids, I can achieve what I'm trying to do by omitting the FileSensor task altogether and just letting Airflow generate the per-file task at each scheduler heartbeat, replacing the Sensor_DAG with just executing generate_dags_for_files: Update: Nevermind -- while this does create a DAG in the dashboard, actual execution runs into the "DAG seems to be missing" issue:
generate_dags_for_files()
This does mean that I can no longer regulate the frequency of folder polling with the poke_interval parameter of FileSensor; instead, Airflow will poll the folder every time it collects DAGs.
Is that the best pattern here?
Other related StackOverflow threads
Run Airflow DAG for each file and Airflow: Proper way to run DAG for each file: identical use case, but the accepted answer uses two static DAGs, presumably with different parameters.
Proper way to create dynamic workflows in Airflow - accepted answer dynamically creates tasks, not DAGs, via a complicated XCom setup.
In short: if the task writes where the DagBag reads from, yes, but it's best to avoid a pattern that requires this. Any DAG you're tempted to custom-create in a task should probably instead be a static, heavily parametrized, conditionally-triggered DAG. y2k-shubham provides an excellent example of such a setup, and I'm grateful for his guidance in the comments on this question.
That said, here are the approaches that would accomplish what the question is asking, no matter how bad of an idea it is, in the increasing degree of ham-handedness:
If you dynamically generate DAGs from a Variable (like so), modify the Variable.
If you dynamically generate DAGs from a list of config files, add a new config file to wherever you're pulling config files from, so that a new DAG gets generated on the next DAG collection.
Use something like Jinja templating to write a new Python file in the dags/ folder.
To retain access to the task after it runs, you'd have to keep the new DAG definition stable and accessible on future dashboard updates / DagBag collection. Otherwise, the Airflow dashboard won't be able to render much about it.
Airflow is suited for building DAGs dynamically; as pointed it out by its creator:
https://youtu.be/Fvu2oFyFCT0?t=411 p.s. thanks to #Yiannis for the video reference
Here is an example of how this could be accomplished:
https://docs.astronomer.io/learn/dynamically-generating-dags

The conn_id isn't defined

I'm learning Airflow and I'm trying to understand how connections works.
I have a first dag with the following code:
c = Connection(
conn_id='aws_credentials',
conn_type='Amazon Web Services',
login='xxxxxxxx',
password='xxxxxxxxx'
)
def list_keys():
hook = S3Hook(aws_conn_id=c.conn_id)
logging.info(f"Listing Keys from {bucket}/{prefix}")
keys = hook.list_keys(bucket, prefix=prefix)
for key in keys:
logging.info(f"- s3://{bucket}/{key}")
In this case It's working fine. The connection is well passed to the S3Hook.
Then I have a second dag:
redshift_connection = Connection(
conn_id='redshift',
conn_type='postgres',
login='duser',
password='xxxxxxxxxx',
host='xxxxxxxx.us-west-2.redshift.amazonaws.com',
port=5439,
schema='db'
)
aws_connection = Connection(
conn_id='aws_credentials',
conn_type='Amazon Web Services',
login='xxxxxxxxx',
password='xxxxxxxx'
)
def load_data_to_redshift(*args, **kwargs):
aws_hook = AwsHook(aws_connection.conn_id)
credentials = aws_hook.get_credentials()
redshift_hook = PostgresHook(redshift_connection.conn_id)
sql_stmnt = sql_statements.COPY_STATIONS_SQL.format(aws_connection.login, aws_connection.password)
redshift_hook.run(sql_stmnt)
dag = DAG(
's3_to_Redshift',
start_date=datetime.datetime.now()
)
create_table = PostgresOperator(
task_id='create_table',
postgres_conn_id=redshift_connection.conn_id,
sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
dag=dag
)
This dag return me the following error: The conn_idredshiftisn't defined
Why is that? What are the differences between my first and second dag? Why the connection does seems to work in the first example and not in the second situation?
Thanks.
Connections are usually created using the UI or CLI as described here and stored by Airflow in the database backend. The operators and the respective hooks then take a connection ID as an argument and use it to retrieve the usernames, passwords, etc. for those connections.
In your case, I suspect you created a connection with the ID aws_credentials using the UI or CLI. So, when you pass its ID to S3Hook it successfully retrieves the credentials (from the Airflow's database backend, not from the Connection object that you created).
But, you did not create a connection with the ID redshift, therefore, AwsHook complains that it is not defined. You have to create the connection as described in the documentation first.
Note: The reason for not defining connections in the DAG code is that the DAG code is usually stored in a version control system (e.g., Git). And it would be a security risk to store credentials there.

Can I create an adhoc parameterized DAG from scheduler job DAG running once a minute

I'm researching Airflow to see if it is a viable fit for my use case and not clear from the documentation if it fits this scenario. I'd like to schedule a job workflow per customer based on some very dynamic criteria which doesn't fall into the standard "CRON" loop of running every X minutes etc. (since there is some impact of running together)
Customer DB
Customer_id, "CRON" desired interval (best case)
1 , 30 minutes
2 , 2 hours
...
... <<<<<<< thousands of these potential jobs
Every minute I'd like to query the state of the current work in the system as well as real world "sensor" data which changes often (such as load on some DBs, or quotas to other resources, or adhoc priorities, boosting etc.)
When decided, I'd need to create a "DAG" (pipeline) of work per customer which had been deemed worthy of running at this time (since perhaps we want to delay work for the "CRON" given some complicated analysis).
For instance :
Every minute run this test:
for customer in DB:
if (shouldRunDAGForCustomer(customer)):
Create a DAG with states ..... and run it
"Sleep for a minute"
def shouldRunDagForCustomer(...):
currentStateOfJobs = ....
situationalAwarenessOfWorld = .... // check all sort of interesting metrics
if some criteria is met for this customer: return true // run the dag for this customer
else : return false
From the material I've read, it seems that the Dags are given a specifed schedule and are static in their structure. Also seems that DAGs run on all their inputs, not generated per input.
Also wasn't clear on how the scheduling works, if the given DAG hadn't completed but the schedule time had arrived. Would I have potentially multiple runs of the same pipeline for the same input (Bad). As I have pipelines whose time to complete varies depending on customer, dynamic load of system etc. I'd like to manage the scheduling aspect myself and generation of "DAG"
This is possible through a "controller" DAG that is scheduled every minute, which then triggers runs for a "target" DAG when desired conditions are met. Airflow has a good example of this, see example_trigger_controller_dag and example_trigger_target_dag. The controller uses the TriggerDagRunOperator() which is an operator that kicks off a run for any desired DAG.
trigger = TriggerDagRunOperator(
task_id="test_trigger_dagrun",
trigger_dag_id="example_trigger_target_dag", # Ensure this equals the dag_id of the DAG to trigger
conf={"message": "Hello World"},
dag=dag,
)
Then the target DAG doesn't need to do anything special except should have schedule_interval=None. Note that on trigger, you can populate a conf dictionary that the target can later consume, in case you want to customize each triggered run.
bash_task = BashOperator(
task_id="bash_task",
bash_command='echo "Here is the message: \'{{ dag_run.conf["message"] if dag_run else "" }}\'"',
dag=dag,
)
Back to your case, your scenario is similar, but where you differ from the example is you won't kick off a DAG every time and you have multiple target DAGs that you could kick off. This is where the ShortCircuitOperator comes into play, which basically is a task that runs a python method you specify, which just needs to return true or false. If it returns true, then it continues onto the next downstream task as usual, otherwise it "short circuits" and stops skips the downstream task. Worth giving example_short_circuit_operator a run if you want to see this demonstrated. With that and dynamic creation of tasks with a for-loop, I think you'll get something like this in your controller DAG:
dag = DAG(
dag_id='controller_customer_pipeline',
default_args=args,
schedule_interval='* * * * *',
)
def shouldRunDagForCustomer(customer, ...):
currentStateOfJobs = ....
situationalAwarenessOfWorld = .... // check all sort of interesting metrics
if some criteria is met for this customer: return true // run the dag for this customer
else : return false
for customer in DB:
check_run_conditions = ShortCircuitOperator(
task_id='check_run_conditions_' + customer,
python_callable=shouldRunDagForCustomer,
op_args=[customer],
op_kwargs={...}, # extra params if needed
dag=dag,
)
trigger_run = TriggerDagRunOperator(
task_id='trigger_run_' + customer,
trigger_dag_id='target_customer_pipeline_' + customer, # standardize on DAG ids for per customer DAGs
conf={"foo": "bar"}, # pass on extra info to triggered DAG if needed
dag=dag,
)
check_run_conditions >> trigger_run
Then your target DAG is just the per customer work.
This is probably not the only way you could implement something like this, but basically yes I think it's viable to implement in Airflow.

Reference filename via xcom in Airflow

I'm trying to understand how to pass values via airflow xcom functionality. The specific usecase I am trying to build is to write a file, then move it, then run another command. The idea is that I pass the file name from one operator to the next.
Here is what I have:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime as dt
DAG = DAG(
dag_id='xcom_test_dag',
start_date=dt.datetime.now(),
schedule_interval='#once'
)
def push_function(**context):
file_name = 'test_file_{date}'.format(date=dt.datetime.now())
return context['task_instance'].xcom_push(key='filename', value=file_name)
def pull_function(**context):
dir(context['task_instance'].xcom_pull())
push_task = PythonOperator(
task_id='push_task',
python_callable=push_function,
provide_context=True,
dag=DAG)
pull_task = PythonOperator(
task_id='pull_task',
python_callable=pull_function,
provide_context=True,
dag=DAG)
push_task >> pull_task
If I want to reference the file name in the pull_task so I could perform read the file - how should I call that? Trying to access context['task_instance'] does not contain a value. Further - is it best practices to try and reference a file name like this from task to task/operator to operator?
When pulling data from XCOM, you want to provide the task ID of the task where you push the data. In your example, the task_id of your push task is push_task, so you'd want to do something like:
value = context['task_instance'].xcom_pull(task_ids='push_task')
However, from the airflow documentation, note that:
By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If you're pushing data to XCOM manually with specific keys, you may need to include that key when calling xcom_pull. In your example, you push a key called filename in your push task, so you'd likely need to do something like this in your pull task:
value = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
This information is outlined in further detail in the Airflow documentation: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#concepts-xcom
As for your question regarding "best practices" - for communicating between Airflow Tasks/Operators, XCOM is the best way to go. However, if you're wanting to read a file from disk across multiple operators, you would need to ensure that all your workers have access to where the file is stored. If that isn't possible, an alternative could be to have the push task store that file remotely (e.g. in AWS S3) and push the S3 URL to XCOM. The pull task could then read the S3 URL from XCOM, and download the file from S3.

Resources