I'm exporting some tables from PostgreSQL to GCS. To keep it simple, I have created a DAG that looks like the one below.
The export DAG is this:
from airflow.models import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import PostgresToGoogleCloudStorageOperator
def sub_dag_export(parent_dag_name, child_dag_name, args, export_suffix):
    dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        start_date=args['start_date'],
        max_active_runs=1,
    )

    export_tbl1 = PostgresToGoogleCloudStorageOperator(
        task_id='export_tbl1',
        postgres_conn_id='cloudsqlpg',
        google_cloud_storage_conn_id='gcsconn',
        sql='SELECT * FROM tbl1',
        export_format='csv',
        field_delimiter='|',
        bucket='dsrestoretest',
        filename='file/export_tbl1/tbl1_{}.csv',
        schema_filename='file/schema/tbl1.json',
        dag=dag)

    export_tbl2 = PostgresToGoogleCloudStorageOperator(
        task_id='export_tbl2',
        postgres_conn_id='cloudsqlpg',
        google_cloud_storage_conn_id='gcsconn',
        sql='SELECT * FROM tbl2',
        export_format='csv',
        field_delimiter='|',
        bucket='dsrestoretest',
        filename='file/export_tbl1/tbl2_{}.csv',
        schema_filename='file/schema/tbl2.json',
        dag=dag)

    return dag
Both tasks 1 and 2 do the same work, so I want to reuse my export task definition for all the tables. But it should not change the flow (Start --> export table1 --> table2 --> table3 --> end), because if a task fails, I need to be able to re-run from the point where it failed. So even though I'm going to use a single task definition, the DAG diagram should stay the same.
I saw there is a way (from this link), but I'm still not able to understand it fully.
You could simply extract the common code into a function and have it create the operator instances for you.
from functools import reduce

def pg_table_to_gcs(table_name: str, dag: DAG) -> PostgresToGoogleCloudStorageOperator:
    return PostgresToGoogleCloudStorageOperator(
        task_id=f"export_{table_name}",
        postgres_conn_id="cloudsqlpg",
        google_cloud_storage_conn_id="gcsconn",
        sql=f"SELECT * FROM {table_name}",
        export_format="csv",
        field_delimiter="|",
        bucket="dsrestoretest",
        filename=f"file/export_{table_name}/{table_name}.csv",
        schema_filename=f"file/schema/{table_name}.json",
        dag=dag)

table_names = ["table0", "table1", "table2"]

with DAG(dag_id="kube_example", default_args=default_args) as dag:
    reduce(lambda t0, t1: t0 >> t1, [pg_table_to_gcs(table_name, dag) for table_name in table_names])
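If the reduce call looks opaque, the same linear chain (and therefore the same DAG diagram) can be built with a plain loop; this is only a sketch reusing the helper function and connection ids above:

previous_task = None
with DAG(dag_id="kube_example", default_args=default_args) as dag:
    for table_name in table_names:
        export_task = pg_table_to_gcs(table_name, dag)
        if previous_task is not None:
            # chain the exports one after another to keep the
            # Start --> table0 --> table1 --> table2 flow
            previous_task >> export_task
        previous_task = export_task

Each table still gets its own task_id, so if one export fails you can clear that task and re-run the DAG from the point of failure without touching the tables that were already exported.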
Related
I have defined an Airflow DAG with a certain number of different tasks, and all of these tasks depend on one parameter that is passed to the DAG-building function. When I pass different values for this parameter, I get different DAGs that share the same task structure.
However, I have run into a problem: how can I make some of these tasks not run when I pass a certain parameter to the DAG, while for other parameters all of the tasks run normally?
For example, the DAG code looks like the following:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from libs.config.dag_config import Config
from etl.funcs import function1, function2, function3

def build_dag(id, schedule_interval):
    config = Config()
    config.add_params({'id': id})

    with DAG(f'{id}_etl', default_args=config.default_args, schedule_interval=schedule_interval) as dag:
        start = DummyOperator(task_id='start')
        end = DummyOperator(task_id='end')
        task1 = PythonOperator(task_id='task1',
                               python_callable=function1,
                               provide_context=True)
        task2 = PythonOperator(task_id='task2',
                               python_callable=function2,
                               provide_context=True)
        task3 = PythonOperator(task_id='task3',
                               python_callable=function3,
                               provide_context=True)

        tasks = [task1, task2, task3]
        start >> tasks >> end

    return dag
dag1 = build_dag(10001, '0 * * * *')
dag2 = build_dag(10002, '0 * * * *')
dag3 = build_dag(10003, '0 * * * *')
For dag1 and dag2 I want all three tasks to run normally, but for dag3 I only want task1 and task2 to run, not task3. How can I achieve this?
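One possible approach (only a sketch, assuming the id value is known at DAG-build time; SKIP_TASK3_IDS below is a hypothetical configuration set): since the task structure is decided inside build_dag, the parameter can simply control whether task3 is created at all.

# IDs for which task3 should not be part of the DAG (hypothetical configuration)
SKIP_TASK3_IDS = {10003}

def build_dag(id, schedule_interval):
    config = Config()
    config.add_params({'id': id})

    with DAG(f'{id}_etl', default_args=config.default_args, schedule_interval=schedule_interval) as dag:
        start = DummyOperator(task_id='start')
        end = DummyOperator(task_id='end')
        tasks = [
            PythonOperator(task_id='task1', python_callable=function1, provide_context=True),
            PythonOperator(task_id='task2', python_callable=function2, provide_context=True),
        ]
        # only create task3 for DAGs that are not in the skip list
        if id not in SKIP_TASK3_IDS:
            tasks.append(PythonOperator(task_id='task3', python_callable=function3, provide_context=True))

        start >> tasks >> end

    return dag

If task3 should still appear in the graph but be skipped at run time, a ShortCircuitOperator or BranchPythonOperator keyed on the same id parameter would be an alternative.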
I am working on a simple Apache Airflow DAG. My goals are to:
1. calculate the date parameter based on the DAG run date - I try to achieve that with a PythonOperator.
2. pass the parameter calculated above as a BigQuery query parameter.
Any ideas are welcome.
My code is below - I have marked the two points I am struggling with using 'TODO' labels.
...
def set_date_param(dag_run_time):
    # a business logic applied here
    ....
    return "2020-05-28"  # example result

# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------

# Python operator
set_data_param = PythonOperator(
    task_id='set_data_param',
    python_callable=set_date_param,
    provide_context=True,
    op_kwargs={
        "dag_run_date": #TODO - how to pass the DAG running date as a function input parameter
    },
    dag=dag
)

# bq operator
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT ccustomer_id, sales
        FROM `my_project.dataset1.table1`
        WHERE date_key = {date_key_param}
        """.format(
        date_key_param=
    ),  #TODO - how to get the python operator results from the previous step
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    trigger_rule='all_success',
    dag=dag
)

set_data_param >> load_data_to_bq_table
For PythonOperator to pass the execution date to the python_callable, you only need to set provide_context=True (as has already been done in your example). This way, Airflow automatically passes a collection of keyword arguments to the python callable, such that the names and values of these arguments are equivalent to the template variables described here. That is, if you define the python callable as set_date_param(ds, **kwargs): ..., the ds parameter will automatically receive the execution date as a string in the format YYYY-MM-DD.
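A minimal callable along those lines might look like this (just a sketch; the real business logic is omitted):

def set_date_param(ds, **kwargs):
    # ds is the execution date as a 'YYYY-MM-DD' string, injected by Airflow
    # because provide_context=True; apply the business logic to it here
    return "2020-05-28"  # example result

With this signature the op_kwargs entry from the question should not be needed.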
XCOM allows task instances to exchange messages. To use the date returned by set_date_param() inside the sql query string of BigQueryOperator, you can combine XCOM with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
default_args = {'start_date': days_ago(1)}
def calculate_date(ds, execution_date, **kwargs):
return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'
def log_date(date_string):
logging.info(date_string)
with DAG(
'a_dag',
schedule_interval='*/5 * * * *',
default_args=default_args,
catchup=False,
) as dag:
get_date = PythonOperator(
task_id='get_date',
python_callable=calculate_date,
provide_context=True,
)
use_date = PythonOperator(
task_id='use_date',
python_callable=log_date,
op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
)
get_date >> use_date
I have a DAG with many sub-tasks in it. In the middle of the DAG there is a validation task, and based on the result/return code from that task I want to take two different paths: on success, one route (a sequence of tasks) should be followed, and on failure we would like to execute a different set of tasks. There are two problems with the current approach: first, the validation task executes many times (as per the retries configured) if the exit code is 1; second, there is no way to take different branches of execution.
To solve problem number 1, we can use the retry number that is available from the task instance via the macro {{ task_instance }}. I would appreciate it if someone could point us to a cleaner approach, and problem number 2 of taking different paths remains unsolved.
You can have retries at the task level.
run_this = BashOperator(
task_id='run_after_loop',
bash_command='echo 1',
retries=3,
dag=dag,
)
run_this_last = DummyOperator(
task_id='run_this_last',
retries=1,
dag=dag,
)
Regarding your 2nd problem, there is a concept of Branching.
The BranchPythonOperator is much like the PythonOperator except that it expects a python_callable that returns a task_id (or list of task_ids). The task_id returned is followed, and all of the other paths are skipped. The task_id returned by the Python function has to be referencing a task directly downstream from the BranchPythonOperator task.
Example DAG:
import random
import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
}
dag = DAG(
    dag_id='example_branch_operator',
    default_args=args,
    schedule_interval="@daily",
)
run_this_first = DummyOperator(
task_id='run_this_first',
dag=dag,
)
options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']
branching = BranchPythonOperator(
task_id='branching',
python_callable=lambda: random.choice(options),
dag=dag,
)
run_this_first >> branching
join = DummyOperator(
task_id='join',
trigger_rule='one_success',
dag=dag,
)
for option in options:
t = DummyOperator(
task_id=option,
dag=dag,
)
dummy_follow = DummyOperator(
task_id='follow_' + option,
dag=dag,
)
branching >> t >> dummy_follow >> join
Regarding your first problem, you can set task/operator-specific retry options quite easily. Reference: baseoperator.py#L77.
For problem two, you can branch within a DAG easily with BranchPythonOperator (example usage: example_branch_operator.py). You will want to nest your validation task/logic within the BranchPythonOperator (you can define and execute operators within operators).
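As a rough sketch of that idea (the run_validation() helper, the task ids, and the downstream operators success_path_start / failure_path_start here are hypothetical), the validation logic can live inside the branch callable and decide which path to follow:

from airflow.operators.python_operator import BranchPythonOperator

def validate_and_branch(**context):
    # run the validation logic here; run_validation() is a hypothetical helper
    validation_passed = run_validation()
    # return the task_id of the first task on the path that should run;
    # all other directly downstream paths are skipped
    return 'success_path_start' if validation_passed else 'failure_path_start'

check_validation = BranchPythonOperator(
    task_id='check_validation',
    python_callable=validate_and_branch,
    provide_context=True,
    retries=3,  # task-level retries, as shown above
    dag=dag,
)

check_validation >> [success_path_start, failure_path_start]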
Say I want to write a DAG to show all tables in a specific schema of Redshift.
The SQL query is Show Tables;
How do I create the DAG for it?
I assume this should be something like:
dag = airflow.DAG(
    'process_dimensions',
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=60),
    default_args=args,
    max_active_runs=1)
process_product_dim = SQLOperator(
task_id='process_product_dim',
conn_id='??????',
sql='Show Tables',
dag=dag)
Does anyone know how to write it correctly?
Because you want to return the result of that query and not just execute it, you'll want to use the PostgresHook, specifically the get_records method.
from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.postgres_hook import PostgresHook
def process_product_dim_py(**kwargs):
conn_id = kwargs.get('conn_id')
pg_hook = PostgresHook(conn_id)
sql = "Show Tables;"
records = pg_hook.get_records(sql)
return records
dag = DAG(
    'process_dimensions',
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=60),
    default_args=args,
    max_active_runs=1)
process_product_dim = PythonOperator(
    task_id='process_product_dim',
    op_kwargs={'conn_id': 'my_redshift_connection'},
    python_callable=process_product_dim_py,
    dag=dag)
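Because a PythonOperator pushes its callable's return value to XCom, a downstream task could then pick up the table list. A small sketch (the print_tables task and its callable are hypothetical additions, not part of the original answer):

def print_tables_py(**kwargs):
    # pull the records returned by process_product_dim_py from XCom
    records = kwargs['ti'].xcom_pull(task_ids='process_product_dim')
    for row in records:
        print(row)

print_tables = PythonOperator(
    task_id='print_tables',
    python_callable=print_tables_py,
    provide_context=True,
    dag=dag)

process_product_dim >> print_tables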
I have created an on_failure_callback function (referring to the Airflow default on_failure_callback) to handle task failures.
It works well when there is only one task in a DAG; however, if there are 2 or more tasks, a task randomly fails because the operator is null. It can be resumed later manually. In airflow-scheduler.out the log is:
[2018-05-08 14:24:21,237] {models.py:1595} ERROR - Executor reports
task instance %s finished (%s) although the task says its %s. Was the
task killed externally? NoneType [2018-05-08 14:24:21,238]
{jobs.py:1435} ERROR - Cannot load the dag bag to handle failure for
. Setting task to FAILED without
callbacks or retries. Do you have enough resources?
The DAG code is:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
import airflow
from devops.util import WechatUtil
from devops.util import JiraUtil
def on_failure_callback(context):
    ti = context['task_instance']
    log_url = ti.log_url
    owner = ti.task.owner
    ti_str = str(context['task_instance'])
    wechat_msg = "%s - Owner:%s" % (ti_str, owner)
    WechatUtil.notify_team(wechat_msg)

    jira_desc = "Please check log from url %s" % (log_url)
    JiraUtil.create_incident("DW", ti_str, jira_desc, owner)
args = {
'queue': 'default',
'start_date': airflow.utils.dates.days_ago(1),
'retry_delay': timedelta(minutes=1),
'on_failure_callback': on_failure_callback,
'owner': 'user1',
}
dag = DAG(dag_id='test_dependence1',default_args=args,schedule_interval='10 16 * * *')
load_crm_goods = BashOperator(
task_id='crm_goods_job',
bash_command='date',
dag=dag)
load_crm_memeber = BashOperator(
task_id='crm_member_job',
bash_command='date',
dag=dag)
load_crm_order = BashOperator(
task_id='crm_order_job',
bash_command='date',
dag=dag)
load_crm_eur_invt = BashOperator(
task_id='crm_eur_invt_job',
bash_command='date',
dag=dag)
crm_member_cohort_analysis = BashOperator(
task_id='crm_member_cohort_analysis_job',
bash_command='date',
dag=dag)
crm_member_cohort_analysis.set_upstream(load_crm_goods)
crm_member_cohort_analysis.set_upstream(load_crm_memeber)
crm_member_cohort_analysis.set_upstream(load_crm_order)
crm_member_cohort_analysis.set_upstream(load_crm_eur_invt)
crm_member_kpi_daily = BashOperator(
task_id='crm_member_kpi_daily_job',
bash_command='date',
dag=dag)
crm_member_kpi_daily.set_upstream(crm_member_cohort_analysis)
I tried updating airflow.cfg by increasing the default memory from 512 all the way to 4096, but no luck. Would anyone have any advice?
I also tried updating my JiraUtil and WechatUtil as follows, encountering the same error.
WechatUtil:
import requests
class WechatUtil:
    @staticmethod
    def notify_trendy_user(user_ldap_id, message):
        return None

    @staticmethod
    def notify_bigdata_team(message):
        return None
JiraUtil:
import json
import requests
class JiraUtil:
    @staticmethod
    def execute_jql(jql):
        return None

    @staticmethod
    def create_incident(projectKey, summary, desc, assignee=None):
        return None
(I'm shooting tracer bullets a bit here, so bear with me if this answer doesn't get it right on the first try.)
The null operator issue with multiple task instances is weird... it would help in troubleshooting this if you could boil the current code down to an MCVE, e.g., 1-2 operators, excluding the JiraUtil and WechatUtil parts if they're not related to the callback failure.
Here are 2 ideas:
1. Can you try changing the line that fetches the task instance out of the context to see if this makes a difference?
Before:
def on_failure_callback(context):
ti = context['task_instance']
...
After:
def on_failure_callback(context):
ti = context['ti']
...
I saw this usage in the Airflow repo (https://github.com/apache/incubator-airflow/blob/c1d583f91a0b4185f760a64acbeae86739479cdb/airflow/contrib/hooks/qubole_check_hook.py#L88). It's possible it can be accessed both ways.
2. Can you try adding provide_context=True on the operators either as a kwarg or in default_args?
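A sketch of what the second idea would look like if applied through default_args (untested, in the same tracer-bullet spirit as the rest of this answer; whether your operators accept it cleanly is part of what the test would show):

args = {
    'queue': 'default',
    'start_date': airflow.utils.dates.days_ago(1),
    'retry_delay': timedelta(minutes=1),
    'on_failure_callback': on_failure_callback,
    'owner': 'user1',
    # idea 2: pass context to every operator by default
    'provide_context': True,
}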