Airflow DAG status is Success, but the task state says "Dag has yet to run"

I am using Airflow 2.3.4 and triggering the DAG with config. When I hardcode the config values, the DAG runs successfully.
But when I trigger it after passing a config, my tasks never start, yet the run status turns green (Success).
Please help me understand what's going wrong!
from datetime import datetime, timedelta
from airflow import DAG
from pprint import pprint
from airflow.operators.python import PythonOperator
from operators.jvm import JVMOperator

args = {
    'owner': 'satyam',
    'depends_on_past': False,
    'start_date': datetime.utcnow(),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag_params = {
    'dag_id': 'synthea_etl_end_to_end_with_config',
    'start_date': datetime.utcnow(),
    'end_date': datetime(2025, 2, 5),
    'default_args': args,
    'schedule_interval': timedelta(hours=4)
}

dag = DAG(**dag_params)


# [START howto_operator_python]
def print_context(ds, **kwargs):
    """Print the Airflow context and ds variable from the context."""
    pprint(kwargs)
    pprint(ds)
    return 'Whatever you return gets printed in the logs'


jvm_task = JVMOperator(
    task_id='jvm_task',
    correlation_id='123456',
    jar='/home/i1136/Synthea/synthea-with-dependencies.jar',
    options={
        'java_args': [''],
        'jar_args': ["-p {{ dag_run.conf['population_count'] }} --exporter.fhir.export {{ dag_run.conf['fhir'] }} --exporter.ccda.export {{ dag_run.conf['ccda'] }} --exporter.csv.export {{ dag_run.conf['csv'] }} --exporter.csv.append_mode {{ dag_run.conf['csv'] }} --exporter.baseDirectory /home/i1136/Synthea/output_dag_config"]
    })

print_context_task = PythonOperator(
    task_id='print_context_task',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

jvm_task.set_downstream(print_context_task)

The problem is with 'start_date': datetime.utcnow(), which is always >= the dag_run's start date, so Airflow marks the run as succeeded without running it. For this variable it's better to choose the minimum date of your runs; if you don't have one, you can use yesterday's date, but then the next day you will not be able to re-run tasks that failed the day before:

import pendulum

dag_params = {
    ...,
    'start_date': pendulum.yesterday(),
    ...,
}
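
As an example, here is a minimal sketch of the corrected dag_params with a fixed start_date (the date below is just a placeholder; pick the earliest run date you actually need, and keep the rest of the DAG unchanged):

import pendulum
from datetime import datetime, timedelta

args = {
    'owner': 'satyam',
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag_params = {
    'dag_id': 'synthea_etl_end_to_end_with_config',
    # a fixed start date in the past instead of datetime.utcnow()
    'start_date': pendulum.datetime(2022, 1, 1, tz='UTC'),
    'end_date': datetime(2025, 2, 5),
    'default_args': args,
    'schedule_interval': timedelta(hours=4),
}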

In my case it was a small bug in the Python script that Airflow did not detect after refreshing.

Related

TypeError: PostgresOperator.partial() got an unexpected keyword argument 'schedule_interval'

I am trying to use the partial() method of PostgresOperator, but I am getting this error, apparently because I unintentionally pass schedule_interval to it. I looked it up in the Airflow repository and there is no such parameter for the partial() method of BaseOperator, which I assume is the parent of all operator classes.
So I am confused: I have to pass this parameter to the DAG, yet there is no such parameter for .partial(), so how am I supposed to create this DAG and its tasks? I haven't found any information on how to pull it off.
from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 9, 7),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@daily'
}

with DAG(
    'name',
    default_args=default_args
) as dag:

    @task
    def generate_sql_queries(src_list: list) -> list:
        queries = []
        for i in src_list:
            query = f'SELECT sql_epic_function()'
            queries.append(query)
        return queries

    queries = generate_sql_queries([4, 8])

    task = PostgresOperator.partial(
        task_id='name',
        postgres_conn_id='postgres_default_id_connection'  # don't forget to change
    ).expand(sql=queries)

    task
The values in the default_args dict are passed to each operator, and you have 'schedule_interval': '@daily' in it. Also, schedule_interval is not an operator argument but a DAG object argument. So, besides removing it from the default_args dict, you have to add it to the DAG definition, like:
with DAG(
    'name',
    schedule_interval='@daily',
    default_args=default_args
) as dag:
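
Putting both changes together, a sketch of the corrected DAG (reusing the task id and connection id from the question; the '@daily' schedule is just an example value) might look like this:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 9, 7),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # no schedule_interval here: it is not an operator argument
}

with DAG(
    'name',
    schedule_interval='@daily',  # DAG-level argument
    default_args=default_args,
) as dag:

    @task
    def generate_sql_queries(src_list: list) -> list:
        # one query per source id
        return ['SELECT sql_epic_function()' for _ in src_list]

    queries = generate_sql_queries([4, 8])

    PostgresOperator.partial(
        task_id='name',
        postgres_conn_id='postgres_default_id_connection',
    ).expand(sql=queries)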

Apache Airflow - Task is in the 'None' state when running DAG

I just started with Airflow and wanted to run a simple DAG with a BashOperator that outputs 'Hello' to the console.
I noticed that the status is stuck indefinitely in 'Running'.
When I go to the task details, I get this:
Task is in the 'None' state which is not a valid state for execution. The task must be cleared in order to be run.
Any suggestions or hints are much appreciated.
Dag:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'dude_whose_doors_open_like_this_-W-',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['yessure@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'Test',
    default_args=default_args,
    description='Test',
    schedule_interval=timedelta(days=1)
)

t1 = BashOperator(
    task_id='ECHO',
    bash_command='echo "Hello"',
    dag=dag
)

t1
I managed to solve it by adding 'start_date': dt(1970, 1, 1) to the default args object and adding schedule_interval=None to my DAG object.
Could you remove the last line, the bare t1? It isn't necessary. Also, start_date shouldn't be set dynamically; this can lead to problems with the scheduling.
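
For reference, a minimal sketch of the DAG with those changes applied (static start_date, explicit schedule_interval=None, trailing bare t1 removed; the email settings are omitted for brevity):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'dude_whose_doors_open_like_this_-W-',
    'depends_on_past': False,
    'start_date': datetime(1970, 1, 1),  # static, not days_ago(2)
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'Test',
    default_args=default_args,
    description='Test',
    schedule_interval=None,  # trigger manually while testing
)

t1 = BashOperator(
    task_id='ECHO',
    bash_command='echo "Hello"',
    dag=dag,
)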

DAG is not visible on Airflow UI

This is my DAG file in the dags folder.
"""
Code that goes along with the Airflow located at:
http://airflow.readthedocs.org/en/latest/tutorial.html
"""
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from work_file import Test


class Main(Test):
    def __init__(self):
        super(Test, self).__init__()

    def create_dag(self):
        default_args = {
            "owner": "airflow",
            "depends_on_past": False,
            "start_date": datetime(2015, 6, 1),
            "email": ["airflow@airflow.com"],
            "email_on_failure": False,
            "email_on_retry": False,
            "retries": 1,
            "retry_delay": timedelta(minutes=5),
            # 'queue': 'bash_queue',
            # 'pool': 'backfill',
            # 'priority_weight': 10,
            # 'end_date': datetime(2016, 1, 1),
        }
        dag = DAG("python_dag", default_args=default_args, schedule_interval='0 * * * *')
        dummy_task = DummyOperator(task_id='dummy_task', retries=3)
        python_task = PythonOperator(task_id='python_task', python_callable=self.my_func)
        dummy_task >> python_task


if __name__ == "__main__":
    a = Main()
    a.create_dag()
This is my other file, work_file.py, which is in the same dags folder.
class Test:
    def __init__(self):
        pass

    def my_func(self):
        return "Hello"
Aim: The aim is to call my_func from my DAG file.
Problem: There seems to be no error on the UI, but my DAG python_dag is not visible.
My webserver and scheduler are also running; I've tried restarting them but nothing happened.
I have imported the file as well (from work_file import Test).
Thanks in advance!
There are multiple problems with the DAG:
The operators are not assigned to any DAG. Add dag=dag to the constructors, e.g. DummyOperator(..., dag=dag).
create_dag() does not return the DAG. Add return dag.
The DAG script is not executed as top-level code; that is, the module's __name__ is never '__main__'. Remove if __name__ == "__main__":.
The DAG objects must be in the global namespace of the module. Assign the return value of create_dag() to a variable: dag = a.create_dag().
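
Applying those four points, a sketch of the restructured DAG file (keeping the class, task ids and imports from the question) could look like this:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

from work_file import Test


class Main(Test):
    def create_dag(self):
        default_args = {
            "owner": "airflow",
            "depends_on_past": False,
            "start_date": datetime(2015, 6, 1),
            "retries": 1,
            "retry_delay": timedelta(minutes=5),
        }
        dag = DAG("python_dag", default_args=default_args,
                  schedule_interval='0 * * * *')
        dummy_task = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
        python_task = PythonOperator(task_id='python_task',
                                     python_callable=self.my_func, dag=dag)
        dummy_task >> python_task
        return dag  # the DAG must be returned ...


# ... and assigned to a module-level (global) variable, without the
# __main__ guard, so the scheduler can discover it when parsing the file
dag = Main().create_dag()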

Airflow does not run task

I've written a Python task for Airflow. When I load my DAG into Airflow, it shows everything just fine. Once I trigger a run for my DAG, it creates a run and switches to succeeded without ever running my task. The webserver and scheduler are both running and there's nothing in the log.
The task doesn't even get a status while running (not even skipped).
If I run my task directly using airflow test update_dags work 2019-01-01, it runs just fine.
This is my DAG:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import os

default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime.today(),
    'email': ['***redacted***'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'params': {
        'git_repository': '***redacted***',
        'git_ref': 'origin/master',
        'git_folder': '/opt/dag-repository'
    }
}


def command(cmd: str, *args):
    templated_command = cmd.format(*args)
    print('Running command: {}'.format(templated_command))
    os.system(templated_command)


def do_work(**kwargs):
    params = kwargs['params']
    dag_directory = kwargs['conf'].get('core', 'dags_folder')
    git_repository = params['git_repository']
    git_ref = params['git_ref']
    git_folder = params['git_folder']
    command('if [ ! -d {0} ]; then git clone {1} {0}; fi', git_folder, git_repository)
    command('cd {0}; git fetch -apt', git_folder)
    command('cd {0}; git reset --hard {1}', git_folder, git_ref)
    command('ln -sf {0} {1}', '{}/src'.format(git_folder), dag_directory)


with DAG('update_dags', default_args=default_args, schedule_interval=timedelta(minutes=10), max_active_runs=1) as dag:
    work_stage = PythonOperator(
        task_id="work",
        python_callable=do_work,
        provide_context=True
    )
I also had the same problem. I solved it by turning os.system(cmd) into subprocess.run(cmd, shell=True, check=True), but I'm not quite sure why. Hope this helps.
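
For illustration, a sketch of the question's command helper rewritten that way (check=True makes the call raise on a non-zero exit code, so the task fails loudly instead of silently succeeding):

import subprocess


def command(cmd: str, *args):
    templated_command = cmd.format(*args)
    print('Running command: {}'.format(templated_command))
    # raises CalledProcessError (and fails the task) if the command fails
    subprocess.run(templated_command, shell=True, check=True)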

PythonOperator with python_callable set gets executed constantly

import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from workflow.task import some_task

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['jimin.park1@aig.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'start_date': airflow.utils.dates.days_ago(0)
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('JiminTest', default_args=default_args, schedule_interval='*/1 * * * *', catchup=False)

t1 = PythonOperator(
    task_id='Task1',
    provide_context=True,
    python_callable=some_task,
    dag=dag
)
The actual some_task itself simply appends a timestamp to a file. As you can see in the DAG definition, the task is configured to run every minute.
def some_task(ds, **kwargs):
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')
I simply tail -f the output file and started up the webserver without the scheduler running. The function was being called and lines were being appended to the file as soon as the webserver started. When I start the scheduler, the file gets appended to on each execution loop.
What I want is for the function to be executed every minute as intended, not on every execution loop.
The scheduler runs each DAG file on every scheduler loop, including all import statements.
Is there any code running at import time in the file from which you are importing the function?
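
If the imported module does run code at import time, one way to guard it (a sketch, assuming workflow/task.py is your own module) is to keep only definitions at module level:

# workflow/task.py
import datetime


def some_task(ds, **kwargs):
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')


# avoid calling some_task(...) at module level: the scheduler imports this
# file on every parsing loop and would execute the call each time
if __name__ == "__main__":
    some_task(None)  # safe place for ad-hoc manual testing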
Try checking the scheduler_heartbeat_sec config parameter in your config file. For your case it should be smaller than 60 seconds.
If you don't want the scheduler to catch up on previous runs, set catchup_by_default to False (I am not sure if this is relevant to your question though).
Please indicate which Apache Airflow version you are using.
