How can we use Deferred DAG assignment in Airflow?

I am new to Apache Airflow and working with DAGs. My code is given below.
In the input JSON I have a parameter named 'sports_category'. If its value is 'football', the football_players task needs to run; if its value is 'cricket', the cricket_players task runs.
import airflow
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from datetime import datetime  # needed for start_date below

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 23)
}

dag = DAG('PLAYERS_DETAILS', default_args=default_args, schedule_interval=None, max_active_runs=5)

football_players = DatabricksSubmitRunOperator(
    task_id='football_players',
    databricks_conn_id='football_players_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[
        {
            'jar': '{{ jar path }}'  # placeholder for the actual jar path
        }
    ],
    databricks_retry_limit=3,
    spark_jar_task={
        'main_class_name': 'football class name1',
        'parameters': [
            'json ={{ dag_run.conf.json }}'
        ]
    },
    dag=dag,  # attach the task to the DAG
)

cricket_players = DatabricksSubmitRunOperator(
    task_id='cricket_players',
    databricks_conn_id='cricket_players_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[
        {
            'jar': '{{ jar path }}'  # placeholder for the actual jar path
        }
    ],
    databricks_retry_limit=3,
    spark_jar_task={
        'main_class_name': 'cricket class name2',
        'parameters': [
            'json ={{ dag_run.conf.json }}'
        ]
    },
    dag=dag,  # attach the task to the DAG
)

I would recommend using BranchPythonOperator, which takes a Python callable as an argument and returns the task_id of the task that should run next, based on the logic written inside the function.
Refer here for the documentation and here for an example DAG.
Let me know your response!
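A rough sketch of how that could look with your DAG (assuming Airflow 2.x, where BranchPythonOperator lives in airflow.operators.python, and reusing the task ids and dag object from the code above):

from airflow.operators.python import BranchPythonOperator

def choose_sport(**context):
    # Read 'sports_category' from the JSON passed when triggering the DAG with config.
    sport = context['dag_run'].conf.get('sports_category')
    if sport == 'football':
        return 'football_players'
    return 'cricket_players'

choose_sport_branch = BranchPythonOperator(
    task_id='choose_sport',
    python_callable=choose_sport,
    dag=dag,
)

# Only the task whose task_id is returned by choose_sport runs; the other branch is skipped.
choose_sport_branch >> [football_players, cricket_players]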

Related

TypeError: PostgresOperator.partial() got an unexpected keyword argument 'schedule_interval'

So I am trying to use the partial() method of PostgresOperator, but I am getting this error, apparently because I unintentionally pass schedule_interval to it. I looked it up in the Airflow repo and there is no such parameter for the partial() method of BaseOperator, which I assume is the parent of all the operator classes.
So I am confused: I have to pass this parameter to the DAG, yet there's no such parameter for .partial(), so how am I supposed to create this DAG and its tasks? I haven't found any information on how to pull it off.
from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 9, 7),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@daily'
}

with DAG(
    'name',
    default_args=default_args
) as dag:

    @task
    def generate_sql_queries(src_list: list) -> list:
        queries = []
        for i in src_list:
            query = f'SELECT sql_epic_function()'
            queries.append(query)
        return queries

    queries = generate_sql_queries([4, 8])

    task = PostgresOperator.partial(
        task_id='name',
        postgres_conn_id='postgres_default_id_connection'  # don't forget to change
    ).expand(sql=queries)

    task
The values in the default_args dict are passed to each operator, and you have 'schedule_interval': '@daily' in it. Also, schedule_interval is not an operator argument but a DAG argument. So, besides removing it from the default_args dict, you have to add it to the DAG definition, like:

with DAG(
    'name',
    schedule_interval='@daily',
    default_args=default_args
) as dag:
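Putting both changes together, a corrected version of the DAG could look roughly like this (same task id and connection id as in the question; adjust them as needed):

from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 9, 7),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # schedule_interval removed from here: it is a DAG argument, not an operator argument
}

with DAG(
    'name',
    schedule_interval='@daily',
    default_args=default_args
) as dag:

    @task
    def generate_sql_queries(src_list: list) -> list:
        # One query per source id; the mapped operator below runs one task per query.
        return ['SELECT sql_epic_function()' for _ in src_list]

    PostgresOperator.partial(
        task_id='name',
        postgres_conn_id='postgres_default_id_connection'
    ).expand(sql=generate_sql_queries([4, 8]))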

Airflow DAG status is Success, but task states Dag has yet to run

I am using Airflow 2.3.4 and I am triggering with config. When I hardcode the config values, this DAG runs successfully, but when I trigger it after passing config, my tasks never start, although the status turns green (success).
Please help me understand what's going wrong!
from datetime import datetime, timedelta
from airflow import DAG
from pprint import pprint
from airflow.operators.python import PythonOperator
from operators.jvm import JVMOperator

args = {
    'owner': 'satyam',
    'depends_on_past': False,
    'start_date': datetime.utcnow(),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag_params = {
    'dag_id': 'synthea_etl_end_to_end_with_config',
    'start_date': datetime.utcnow(),
    'end_date': datetime(2025, 2, 5),
    'default_args': args,
    'schedule_interval': timedelta(hours=4)
}

dag = DAG(**dag_params)

# [START howto_operator_python]
def print_context(ds, **kwargs):
    """Print the Airflow context and ds variable from the context."""
    pprint(kwargs)
    pprint(ds)
    return 'Whatever you return gets printed in the logs'

jvm_task = JVMOperator(
    task_id='jvm_task',
    correlation_id='123456',
    jar='/home/i1136/Synthea/synthea-with-dependencies.jar',
    options={
        'java_args': [''],
        'jar_args': ["-p {{ dag_run.conf['population_count'] }} --exporter.fhir.export {{ dag_run.conf['fhir'] }} --exporter.ccda.export {{ dag_run.conf['ccda'] }} --exporter.csv.export {{ dag_run.conf['csv'] }} --exporter.csv.append_mode {{ dag_run.conf['csv'] }} --exporter.baseDirectory /home/i1136/Synthea/output_dag_config"]
    })

print_context_task = PythonOperator(
    task_id='print_context_task',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

jvm_task.set_downstream(print_context_task)
The problem is with 'start_date': datetime.utcnow(), which is always >= the dag_run's start date; in this case Airflow will mark the run as succeeded without running it. For this variable it's better to choose the minimum date of your runs. If you don't have one, you can use yesterday's date, but then the next day you will not be able to re-run the tasks that failed on the previous day:
import pendulum

dag_params = {
    ...,
    'start_date': pendulum.yesterday(),
    ...,
}
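If you do know the earliest date you want runs for, a fixed start_date avoids the moving-target problem entirely. A sketch reusing the dag_params from the question (the date here is only an example):

import pendulum
from datetime import datetime, timedelta
from airflow import DAG

dag_params = {
    'dag_id': 'synthea_etl_end_to_end_with_config',
    # Example value; use the earliest logical date you actually need.
    'start_date': pendulum.datetime(2022, 1, 1, tz="UTC"),
    'end_date': datetime(2025, 2, 5),
    'default_args': args,
    'schedule_interval': timedelta(hours=4)
}

dag = DAG(**dag_params)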
In my case it was a small bug in the Python script that was not detected by Airflow after refreshing.

DatabricksRunOperator Execution date

opr_run_now = DatabricksRunNowOperator(
    task_id='run_now',
    databricks_conn_id='databricks_default',
    job_id=754377,
    notebook_params=meta_data,
    dag=dag
)
Is there a way to pass the execution date using the Databricks run operator?
What do you want to pass the execution_date to? What are you trying to achieve in the end? The following doc was helpful for me:
https://www.astronomer.io/guides/airflow-databricks
And here is an example where I am passing execution_date to be used in a python file run in Databricks. I'm capturing the execution_date using sys.argv.
from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)
from datetime import datetime, timedelta

spark_python_task = {
    "python_file": "dbfs:/FileStore/sandbox/databricks_test_python_task.py"
}

# Define params for Run Now Operator
python_params = [
    "{{ execution_date }}",
    "{{ execution_date.subtract(hours=1) }}",
]

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="databricks_dag",
    start_date=datetime(2022, 3, 11),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
    max_active_runs=1,
) as dag:
    opr_run_now = DatabricksRunNowOperator(
        task_id="run_now",
        databricks_conn_id="Databricks",
        job_id=2060,
        python_params=python_params,
    )
    opr_run_now
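On the Databricks side, the Python file referenced in spark_python_task above can pick those values up from sys.argv. The contents below are only a sketch of what such a script might look like (the real file isn't shown in the answer):

# dbfs:/FileStore/sandbox/databricks_test_python_task.py (hypothetical contents)
import sys

# python_params are passed to the script as positional arguments, in order:
# argv[1] -> "{{ execution_date }}", argv[2] -> "{{ execution_date.subtract(hours=1) }}"
execution_date = sys.argv[1]
window_start = sys.argv[2]

print(f"execution_date={execution_date}")
print(f"window_start={window_start}")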
There are two ways to configure DatabricksSubmitRunOperator. One is with named arguments (as you did), which doesn't support templating. The second way is to use the JSON payload that you typically use to call api/2.0/jobs/run-now; this also gives you the ability to pass execution_date, as the json parameter is templated.

notebook_task_params = {
    'new_cluster': new_cluster,
    'notebook_task': {
        'notebook_path': '/test-{{ ds }}',
    },
}

DatabricksSubmitRunOperator(task_id='notebook_task', json=notebook_task_params)
For more information see the operator docs.
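For example, a sketch of passing the execution date into a notebook through the templated json payload (the notebook path and parameter name here are placeholders, and new_cluster is assumed to be defined as above):

notebook_task_params = {
    'new_cluster': new_cluster,
    'notebook_task': {
        'notebook_path': '/test-notebook',  # placeholder path
        'base_parameters': {
            # Rendered by Airflow before the run is submitted
            'execution_date': '{{ ds }}',
        },
    },
}

notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    json=notebook_task_params,
)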

Airflow tasks set to `no_status` when catchup is True

I'm attempting to configure a series of Airflow tasks to backfill some data (catchup=True). Once the DAG is deployed and unpaused, the first job runs successfully, but all subsequent runs have their tasks set to no_status and they never start.
I've tried variations on renaming the DAG, restarting the Airflow server and scheduler, clearing out old logs, but I'm not making any progress here.
Thoughts?
DAG code:
default_args = {
    "owner": "me",
    "retries": 2,
    "retry_delay": timedelta(minutes=2),
    "sla": timedelta(hours=1),
    "start_date": "2021-01-01T00:00",
}

dag = DAG(
    catchup=True,
    dag_id="ingest_dag_testing_6",
    dagrun_timeout=timedelta(hours=1),
    default_args=default_args,
    max_active_runs=1,
    schedule_interval="30 * * * *",
)

DATA_SOURCE_TYPES = [
    {
        "target_name": "task_a",
        "children": [
            {
                "target_name": "subtask_a1",
            },
            {
                "target_name": "subtask_a2",
            },
        ],
    }
]

with dag:
    for dst in DATA_SOURCE_TYPES:
        sub_ingest_tasks = []
        ingest_task = PythonOperator(
            task_id=f"ingest_{dst.get('target_name')}",
            python_callable=run_ingestion,
            op_args=[logger, exe_date, dst],
        )
        if dst.get("children"):
            for sdst in dst.get("children"):
                sub_ingest_tasks.append(
                    PythonOperator(
                        task_id=f"ingest_{sdst.get('target_name')}",
                        python_callable=run_ingestion,
                        op_args=[logger, exe_date, sdst],
                    )
                )
        ingest_task >> sub_ingest_tasks
Your code executes just fine.
I created a runnable example from your code (since it lacks imports/callables):
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import timedelta

def run_ingestion(**context):
    print("Hello World")

default_args = {
    "owner": "me",
    "retries": 2,
    "retry_delay": timedelta(minutes=2),
    "sla": timedelta(hours=1),
    "start_date": "2021-01-01T00:00",
}

dag = DAG(
    catchup=True,
    dag_id="ingest_dag_testing_6",
    dagrun_timeout=timedelta(hours=1),
    default_args=default_args,
    max_active_runs=1,
    schedule_interval="30 * * * *",
)

DATA_SOURCE_TYPES = [
    {
        "target_name": "task_a",
        "children": [
            {
                "target_name": "subtask_a1",
            },
            {
                "target_name": "subtask_a2",
            },
        ],
    }
]

with dag:
    for dst in DATA_SOURCE_TYPES:
        sub_ingest_tasks = []
        ingest_task = PythonOperator(
            task_id=f"ingest_{dst.get('target_name')}",
            python_callable=run_ingestion,
            # op_args=[logger, exe_date, dst],
        )
        if dst.get("children"):
            for sdst in dst.get("children"):
                sub_ingest_tasks.append(
                    PythonOperator(
                        task_id=f"ingest_{sdst.get('target_name')}",
                        python_callable=run_ingestion,
                        # op_args=[logger, exe_date, sdst],
                    )
                )
        ingest_task >> sub_ingest_tasks
You can see that it runs fine.
If you are running an old Airflow version, it's possible that changing the dag_id will fix the problem: there may be old traces of DB records related to this dag_id that were not cleaned up properly, and the scheduler was refactored significantly in later versions.
If that does not help, probably the only solution is to upgrade to the latest Airflow version, as this is likely a bug in older versions that was fixed along the way (the code you shared doesn't reproduce the problem you describe on the latest Airflow version).

How to pass parameters to a hql run via airflow

I would like to know how I can pass a parameter to a Hive query script run via Airflow. If I want to add a parameter only for this script, say target_db = mydatabase, how can I do that? Do I need to add it to the default_args and then call it in the op_kwargs of the script?
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2017, 11, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(dag_name, default_args=default_args, schedule_interval="@daily")

t_add_step = PythonOperator(
    task_id='add__step',
    provide_context=True,
    python_callable=add_emr_step,
    op_kwargs={
        'aws_conn_id': dag_params['aws_conn_id'],
        'create_job_flow_task': 'create_emr_flow',
        'get_step_task': 'get_email_step'
    },
    dag=dag
)
Assuming you are invoking Hive using BashOperator, it would look something like this:
...
set_hive_db = BashOperator(
    task_id="set_hive_db",
    bash_command="""
        hive --database {{ params.database }} -f {{ params.hql_file }}
    """,
    params={
        "database": "testingdb",
        "hql_file": "myhql.hql"
    },
    dag=dag
)
...
Another approach would be to put USE <database> inside your HQL file and just call hive -f hqlfile.hql in your BashOperator.
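A sketch of that alternative, assuming the same myhql.hql file name as above (the USE statement shown in the comment is what you would add at the top of the file):

# myhql.hql would start with:
#   USE testingdb;
#   -- ... rest of the queries ...

set_hive_db = BashOperator(
    task_id="set_hive_db",
    bash_command="hive -f {{ params.hql_file }}",
    params={"hql_file": "myhql.hql"},
    dag=dag,
)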
