How to write a DAG that connects to Amazon Redshift? - airflow

Say I want to write a DAG to show all tables in a specific schema of Redshift.
The SQL query is Show Tables;
How do I create the DAG for it?
I assume this should be something like:
dag = airflow.DAG(
    'process_dimensions',
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=60),
    default_args=args,
    max_active_runs=1)

process_product_dim = SQLOperator(
    task_id='process_product_dim',
    conn_id='??????',
    sql='Show Tables',
    dag=dag)
Does anyone know how to write it correctly?

Because you want to return the result of that query and not just execute it, you'll want to use the PostgresHook, specifically the get_records method.
from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.postgres_hook import PostgresHook


def process_product_dim_py(**kwargs):
    conn_id = kwargs.get('conn_id')
    pg_hook = PostgresHook(postgres_conn_id=conn_id)
    sql = "Show Tables;"
    records = pg_hook.get_records(sql)
    return records


dag = DAG(
    'process_dimensions',
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=60),
    default_args=args,  # the same default_args dict you already define in your DAG file
    max_active_runs=1)

process_product_dim = PythonOperator(
    task_id='process_product_dim',
    op_kwargs={'conn_id': 'my_redshift_connection'},
    python_callable=process_product_dim_py,
    dag=dag)
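Note that whatever the callable returns is pushed to XCom by the PythonOperator, so a downstream task can pick the records up. A minimal sketch (the downstream task and its task_id here are hypothetical, just to illustrate the pull):

def print_tables_py(**kwargs):
    # Pull the return value of process_product_dim from XCom
    records = kwargs['ti'].xcom_pull(task_ids='process_product_dim')
    for row in records:
        print(row)

print_tables = PythonOperator(
    task_id='print_tables',
    python_callable=print_tables_py,
    provide_context=True,  # Airflow 1.x: makes 'ti' available in kwargs
    dag=dag)

process_product_dim >> print_tables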

Related

airflow - reuse a task

I'm exporting some tables from PostgreSQL to GCS. To keep it simple, I have created a DAG that looks like the one below.
The export DAG is this:
from airflow.models import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import PostgresToGoogleCloudStorageOperator


def sub_dag_export(parent_dag_name, child_dag_name, args, export_suffix):
    dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        start_date=args['start_date'],
        max_active_runs=1,
    )

    export_tbl1 = PostgresToGoogleCloudStorageOperator(
        task_id='export_tbl1',
        postgres_conn_id='cloudsqlpg',
        google_cloud_storage_conn_id='gcsconn',
        sql='SELECT * FROM tbl1',
        export_format='csv',
        field_delimiter='|',
        bucket='dsrestoretest',
        filename='file/export_tbl1/tbl1_{}.csv',
        schema_filename='file/schema/tbl1.json',
        dag=dag)

    export_tbl2 = PostgresToGoogleCloudStorageOperator(
        task_id='export_tbl2',
        postgres_conn_id='cloudsqlpg',
        google_cloud_storage_conn_id='gcsconn',
        sql='SELECT * FROM tbl2',
        export_format='csv',
        field_delimiter='|',
        bucket='dsrestoretest',
        filename='file/export_tbl1/tbl2_{}.csv',
        schema_filename='file/schema/tbl2.json',
        dag=dag)
Tasks 1 and 2 do the same work, so I want to reuse one export task for all the tables. But it should not change the flow (start --> export table1 --> table2 --> table3 --> end), because if a task fails I need to be able to re-run from where it failed. So even if I use a single task, the DAG diagram should stay the same.
I saw there is a way (from this link), but I'm still not able to understand it fully.
You can simply extract the common code into a function and have it create operator instances for you.
from functools import reduce


def pg_table_to_gcs(table_name: str, dag: DAG) -> PostgresToGoogleCloudStorageOperator:
    return PostgresToGoogleCloudStorageOperator(
        task_id=f"export_{table_name}",
        postgres_conn_id="cloudsqlpg",
        google_cloud_storage_conn_id="gcsconn",
        sql=f"SELECT * FROM {table_name}",
        export_format="csv",
        field_delimiter="|",
        bucket="dsrestoretest",
        filename=f"file/export_{table_name}/{table_name}.csv",
        schema_filename=f"file/schema/{table_name}.json",
        dag=dag)


table_names = ["table0", "table1", "table2"]

with DAG(dag_id="kube_example", default_args=default_args) as dag:
    # Chain the export tasks one after another: table0 >> table1 >> table2
    reduce(lambda t0, t1: t0 >> t1, [pg_table_to_gcs(table_name, dag) for table_name in table_names])
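If the reduce reads awkwardly, an equivalent wiring (a sketch, assuming Airflow 1.10 where chain lives in airflow.utils.helpers) keeps the same linear flow:

from airflow.utils.helpers import chain

with DAG(dag_id="kube_example", default_args=default_args) as dag:
    # chain(a, b, c) sets a >> b >> c, preserving start --> table0 --> table1 --> table2
    chain(*[pg_table_to_gcs(table_name, dag) for table_name in table_names])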

Airflow - TriggerDagRunOperator Cross Check

I am trying to trigger one dag from another. I am using TriggerDagRunOperator for the same.
I have the following two dags.
Dag 1:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator


def print_hello():
    return 'Hello world!'


dag = DAG('dag_one', description='Simple tutorial DAG',
          schedule_interval='0/15 * * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)

hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

trigger = TriggerDagRunOperator(
    task_id="test_trigger_dagrun",
    trigger_dag_id="dag_two",  # Ensure this equals the dag_id of the DAG to trigger
    dag=dag,
)

dummy_operator >> hello_operator >> trigger
Dag 2:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def print_hello():
    return 'Hello XYZABC!'


dag = DAG('dag_two', description='Simple tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)

hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

dummy_operator >> hello_operator
Going through the webserver, everything seems fine and running (i.e. dag one is triggering dag two).
My question is how to make sure, or check, that Dag 2 was actually triggered by Dag 1 and not by its own schedule or some other manual action.
Basically, where can I find who triggered the DAG, or how the DAG was triggered?
In the Tree View of Dag 1, the Dag 2 runs launched by Dag 1 show up as tasks in that view.
In the Tree View of Dag 2, open View Log and you will find AIRFLOW_CTX_DAG_RUN_ID=trig__YYYY_MM_DD...
If the run was scheduled instead, it should say
AIRFLOW_CTX_DAG_RUN_ID=scheduled__YYYY_MM_DDT...
You can also compare the run time of dag_two with the run time of the trigger task in dag_one.
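If you prefer to check programmatically rather than in the UI, here is a minimal sketch (assuming Airflow 1.x, where each DagRun records a run_id prefix and an external_trigger flag):

from airflow.models import DagRun
from airflow.utils.db import provide_session

@provide_session
def show_dag_two_runs(session=None):
    # run_id starts with "trig__" for runs created by TriggerDagRunOperator
    # (or a manual trigger) and "scheduled__" for scheduler-created runs;
    # external_trigger is False only for scheduled runs.
    for run in session.query(DagRun).filter(DagRun.dag_id == 'dag_two'):
        print(run.run_id, run.external_trigger)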

DAG marked as "success" if one task fails, because of trigger rule ALL_DONE

I have the following DAG with 3 tasks:
start --> special_task --> end
The task in the middle can succeed or fail, but end must always be executed (imagine this is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
Using that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS.
How can I configure my DAG so that if one of the tasks failed, the whole DAG is marked as FAILED?
Example to reproduce
import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule

dag = DAG(
    dag_id='my_dag',
    start_date=datetime.datetime.today(),
    schedule_interval=None
)

start = BashOperator(
    task_id='start',
    bash_command='echo start',
    dag=dag
)

special_task = BashOperator(
    task_id='special_task',
    bash_command='exit 1',  # force failure
    dag=dag
)

end = BashOperator(
    task_id='end',
    bash_command='echo end',
    dag=dag
)

end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE

start.set_downstream(special_task)
special_task.set_downstream(end)
This post seems to be related, but the answer does not suit my needs, since the downstream task end must be executed (hence the mandatory trigger_rule).
I thought it was an interesting question and spent some time figuring out how to achieve it without an extra dummy task. It became a bit of a superfluous task, but here's the end result:
This is the full DAG:
import airflow
from airflow import AirflowException
from airflow.models import DAG, TaskInstance, BaseOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(3)}

dag = DAG(
    dag_id="finally_task_set_end_state",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    description="Answer for question https://stackoverflow.com/questions/51728441",
)

start = BashOperator(task_id="start", bash_command="echo start", dag=dag)
failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", dag=dag)


@provide_session
def _finally(task, execution_date, dag, session=None, **_):
    upstream_task_instances = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(task.upstream_task_ids),
        )
        .all()
    )
    upstream_states = [ti.state for ti in upstream_task_instances]
    fail_this_task = State.FAILED in upstream_states

    print("Do logic here...")

    if fail_this_task:
        raise AirflowException("Failing task because one or more upstream tasks failed.")


finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,
    dag=dag,
)

succesful_task = DummyOperator(task_id="succesful_task", dag=dag)

start >> [failing_task, succesful_task] >> finally_
Look at the _finally function, which is called by the PythonOperator. There are a few key points here:
Decorate it with @provide_session and add the argument session=None, so you can query the Airflow DB with session.
Query all upstream task instances for the current task:
upstream_task_instances = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == dag.dag_id,
        TaskInstance.execution_date == execution_date,
        TaskInstance.task_id.in_(task.upstream_task_ids),
    )
    .all()
)
From the returned task instances, get the states and check if State.FAILED is in there:
upstream_states = [ti.state for ti in upstream_task_instances]
fail_this_task = State.FAILED in upstream_states
Perform your own logic:
print("Do logic here...")
And finally, fail the task if fail_this_task=True:
if fail_this_task:
    raise AirflowException("Failing task because one or more upstream tasks failed.")
The end result: when any upstream task has failed, the finally task raises and is marked failed, so the DAG run itself ends up marked as failed.
As @JustinasMarozas explained in a comment, a solution is to create a dummy task like:

dummy = DummyOperator(
    task_id='test',
    dag=dag
)

and bind it downstream to special_task:

special_task.set_downstream(dummy)

Thus, the DAG is marked as failed, and the dummy task is marked as upstream_failed.
I hope there will be an out-of-the-box solution, but while waiting for that, this solution does the job.
To expand on Bas Harenslak's answer, a simpler _finally function which checks the state of all tasks (not only the upstream ones) can be (reusing State from the imports above):

def _finally(**kwargs):
    for task_instance in kwargs['dag_run'].get_task_instances():
        if task_instance.current_state() != State.SUCCESS and \
                task_instance.task_id != kwargs['task_instance'].task_id:
            raise Exception("Task {} failed. Failing this DAG run".format(task_instance.task_id))
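For completeness, a minimal sketch of wiring this variant in (reusing the dag, tasks and TriggerRule import from the full example above; on Airflow 1.x, provide_context=True is what makes dag_run and task_instance show up in kwargs):

finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,  # exposes dag_run and task_instance via kwargs
    dag=dag,
)

start >> [failing_task, succesful_task] >> finally_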

Airflow - Broken DAG - Timeout

I have a DAG that executes a function that connects to a Postgres DB, deletes the contents of a table and then inserts a new data set.
I am trying this locally and I see that when I try to run it, the web server takes a long time to connect and in most cases doesn't succeed. However, as part of the connection process it seems to be executing queries in the background. Since I have a delete function, I see the data getting deleted from the table (basically one of the functions gets executed) even though I have not scheduled the script or started it manually. Could someone advise as to what I am doing wrong here?
One error that pops up in the UI is:
Broken DAG: [/Users/user/airflow/dags/dwh_sample23.py] Timeout
I also see an "i" next to the dag id in the UI that says "This DAG isn't available in the web server's DAG object."
Given below is the code I am using:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io


# Following are defaults which can be overridden later on
default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 21),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('dwh_sample23', default_args=default_args)


#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return(dwh_connection)


def tbl1_del():
    ''' This function clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()


def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
    select id,name,price from tbl2;""")
    dwh_connection.commit()


db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()

##########################################

t1 = BashOperator(
    task_id='DB_Connect',
    python_callable=db_login(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)

t2 = BashOperator(
    task_id='del',
    python_callable=tbl1_del(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)

t3 = BashOperator(
    task_id='populate',
    python_callable=pop_tbl1(),
    bash_command='python3 ~/airflow/dags/dwh_sample23.py',
    dag=dag)

t1.set_downstream(t2)
t2.set_downstream(t3)
Could anyone assist? Thanks.
Instead of using BashOperator you can use PythonOperator and pass db_login, tbl1_del and pop_tbl1 as the python_callable of each task:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io


# Following are defaults which can be overridden later on
default_args = {
    'owner': 'admin',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 21),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('dwh_sample23', default_args=default_args)


#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return(dwh_connection)


def tbl1_del():
    ''' This function clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()


def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
    select id,name,price from tbl2;""")
    dwh_connection.commit()


db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()

##########################################

t1 = PythonOperator(
    task_id='DB_Connect',
    python_callable=db_login,  # pass the function itself, don't call it here
    dag=dag)

t2 = PythonOperator(
    task_id='del',
    python_callable=tbl1_del,
    dag=dag)

t3 = PythonOperator(
    task_id='populate',
    python_callable=pop_tbl1,
    dag=dag)

t1.set_downstream(t2)
t2.set_downstream(t3)
This is really old by now, but we got this error in prod and I found this question, and I think it's nice for it to have an answer.
Some of the code is getting executed during DAG load, i.e. you actually run
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
inside the webserver and scheduler loop, every time they load the DAG definition from the file.
I believe you didn't intend that to happen.
Everything should work just fine if you simply remove those four lines.
Generally, don't place function calls you want the executors to run at file/module level, because when the scheduler's or webserver's interpreter loads the file to get the DAG definition, it will invoke them.
Just try putting this in your dag file and check the webserver logs to see what happens:
from time import sleep


def do_some_printing():
    print(1111111)
    sleep(60)


do_some_printing()
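Applied to the original DAG, a minimal sketch of the fix (reusing the question's db_login, tbl1_del and pop_tbl1, with a hypothetical run_etl wrapper and PythonOperator as in the answer above) is to call the functions only inside a task:

def run_etl():
    # Runs only when the task executes, not when the scheduler parses the file
    conn = db_login()
    try:
        tbl1_del()
        pop_tbl1()
    finally:
        conn.close()

etl_task = PythonOperator(
    task_id='run_etl',
    python_callable=run_etl,  # pass the function itself, don't call it here
    dag=dag)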

How to add a task at run time if task 1 fails

I want to execute task 2 if task 1 succeeds; if task 1 fails I want to run task 3 and assign another flow if required.
Basically I want to run conditional tasks in Airflow without SSH operators.
from airflow import DAG
from airflow.operators import PythonOperator, BranchPythonOperator
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from airflow.models import Variable


def t2_error_task(context):
    instance = context['task_instance']
    if instance.task_id == "performExtract":
        print("Please implement something over this")
        task_3 = PythonOperator(
            task_id='performJoin1',
            python_callable=performJoin1,  # maybe main?
            dag=dag
        )
        dag.add_task(task_3)


with DAG(
    'manageWorkFlow',
    catchup=False,
    default_args={
        'owner': 'Mannu',
        'start_date': datetime(2018, 4, 13),
        'schedule_interval': None,
        'depends_on_past': False,
    },
) as dag:
    task_1 = PythonOperator(
        task_id='performExtract',
        python_callable=performExtract,
        on_failure_callback=t2_error_task,
        depends_on_past=True
    )

    task_2 = PythonOperator(
        task_id='printSchemas',
        depends_on_past=True,
        python_callable=printSchemaAll,  # maybe main?
    )

    task_2.set_upstream(task_1)
Adding tasks dynamically based on execution-time statuses is not something Airflow supports. In order to get the desired behaviour, you should add task_3 to your dag but change its trigger_rule to all_failed. In this case, the task will get marked as skipped when task_1 succeeds, but it will get executed when it fails.
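A minimal sketch of that suggestion (it borrows performJoin1 and the task names from the question, which are assumed to exist in your module):

from airflow.utils.trigger_rule import TriggerRule

task_3 = PythonOperator(
    task_id='performJoin1',
    python_callable=performJoin1,
    trigger_rule=TriggerRule.ALL_FAILED,  # run only when the upstream task fails
    dag=dag,
)

task_3.set_upstream(task_1)  # task_2 (all_success) and task_3 (all_failed) both hang off task_1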
