How to find the number of upstream tasks failed in Airflow? - airflow

I am having a tough time figuring out how to find the failed tasks for the same DAG when it runs twice on the same day (same execution date).
Consider an example where a DAG with dag_id=1 failed on its first run (for any reason, let's say a connection timeout) and a task ended up in the failed state. The TaskInstance table will contain the entry for the failed task when we query it. Great!
But if I re-run the same DAG (note that dag_id is still 1), then in the last task (which has the ALL_DONE trigger rule, so it executes regardless of whether upstream tasks failed or succeeded) I want to count the number of tasks that failed in the current dag_run, ignoring previous dag_runs. I came across the dag_run id, which could be useful if I could relate it to TaskInstance, but I could not. Any suggestions/help is appreciated.

In Airflow 1.10.x the same result can be achieved with much simpler code that avoids touching the ORM directly:
from airflow.utils.state import State

def your_python_operator_callable(**context):
    tis_dagrun = context['ti'].get_dagrun().get_task_instances()
    failed_count = sum(1 for ti in tis_dagrun if ti.state == State.FAILED)
    print(f"There are {failed_count} failed tasks in this execution")
The one unfortunate problem is that context['ti'].get_dagrun() does not return a DagRun instance when a single task is tested from the CLI. As a result, manually testing that single task will fail, but a standard run will work as expected.
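As a small defensive sketch (my own addition, not part of the answer above, and assuming get_dagrun() returns None in that CLI-test case), the callable can guard against the missing DagRun:
from airflow.utils.state import State

def your_python_operator_callable(**context):
    dagrun = context['ti'].get_dagrun()
    if dagrun is None:
        # Happens when a single task is exercised via `airflow test`: there is no real DagRun.
        print("No DagRun available; skipping the failed-task count")
        return
    failed_count = sum(1 for ti in dagrun.get_task_instances() if ti.state == State.FAILED)
    print(f"There are {failed_count} failed tasks in this execution")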

You could create a PythonOperator task which queries the Airflow database to find the information you're looking for. This has the added benefit of passing along the information you need to query for the data you want:
from contextlib import closing

from airflow import models, settings
from airflow.utils.state import State

def your_python_operator_callable(**context):
    with closing(settings.Session()) as session:
        print("There are {} failed tasks in this execution".format(
            session.query(models.TaskInstance).filter(
                models.TaskInstance.dag_id == context["dag"].dag_id,
                models.TaskInstance.execution_date == context["execution_date"],
                models.TaskInstance.state == State.FAILED,
            ).count()
        ))
Then add the task to your DAG with a PythonOperator.
(I have not tested the above, but hopefully will send you on the right path)
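For illustration, wiring the callable into the DAG might look roughly like this (a sketch assuming Airflow 1.10.x-style imports; the task_id and the surrounding dag object are placeholders, and trigger_rule=ALL_DONE mirrors the last-task setup from the question):
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

count_failures = PythonOperator(
    task_id="count_failed_tasks",        # placeholder name
    python_callable=your_python_operator_callable,
    provide_context=True,                # Airflow 1.x: pass the template context as **kwargs
    trigger_rule=TriggerRule.ALL_DONE,   # run even if upstream tasks failed
    dag=dag,                             # `dag` is assumed to be defined elsewhere in the file
)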

Related

Apache Airflow problem - "a task with task_id create_tag_template_field_result is already in the DAG"

So, I have a problem even with a blank Airflow installation.
As soon as I try to run
airflow test tutorial print_date 2015-06-01
I get a raised exception which says
PendingDeprecationWarning: The requested task could not be added to the DAG because a task with task_id create_tag_template_field_result is already in the DAG. Starting in Airflow 2.0, trying to overwrite a task will raise an exception.
What is the reason for this (as I made literally no changes to the installation whatsoever)?
I also got that when, in a previous installation, I tried to run my own dag... but the "create_tag_template_field_result" was nowhere to be found in my code.
you can set the config arg load_examples = False to solve it.
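For reference, that would look like the following in airflow.cfg (assuming a default install; exporting the environment variable AIRFLOW__CORE__LOAD_EXAMPLES=False before starting Airflow is an equivalent alternative):
[core]
load_examples = False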
The airflow test command calls the get_dag function, which constructs a DagBag object; the DagBag constructor in turn calls the collect_dags function.
When the config arg LOAD_EXAMPLES is True (the default), collect_dags collects all the DAGs in the examples path, and that is where the create_tag_template_field_result task comes from.
collect_dags then calls add_task for every example task, and that is where the create_tag_template_field_result task gets added again.
And maybe it was the quickstart that added this task the first time without you realizing it.
you can set the config arg load_examples = False to solve it
This warning is occurring in
/usr/local/lib/python3.7/dist-packages/airflow/example_dags/example_complex.py
so I removed or renamed that file (for example, to a name that won't be loaded, such as *.py.back).
I had the same error with a fresh install.
Then I don't know if this helps, but I downgraded Airflow to version 1.10.10 (with python3.7) and the error was gone.

Dag Seems to be missing

I have a dag which checks for new workflows to be generated (Dynamic DAG) at a regular interval and if found, creates them. (Ref: Dynamic dags not getting added by scheduler )
The above DAG is working and the dynamic DAGs are getting created and listed in the web server. Two issues here:
When clicking on a DAG in the web UI, it says "DAG seems to be missing".
The listed DAGs are not shown by the "airflow list_dags" command.
Error:
DAG "app01_user" seems to be missing.
The same is for all other dynamically generated DAGs. I have compiled the Python script and found no errors.
Edit1:
I tried clearing all data and running "airflow run". It ran successfully, but no dynamically generated DAGs were added to "airflow list_dags". However, when running the "airflow list_dags" command, it loaded and executed the base DAG (which generates the dynamic DAGs), and the dynamic DAGs were listed as below:
[root@cmnode dags]# airflow list_dags
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
[2019-08-13 00:34:31,692] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=15, pool_recycle=1800, pid=25386
[2019-08-13 00:34:31,877] {__init__.py:51} INFO - Using executor LocalExecutor
[2019-08-13 00:34:32,113] {__init__.py:305} INFO - Filling up the DagBag from /root/airflow/dags
/usr/lib/python2.7/site-packages/airflow/operators/bash_operator.py:70: PendingDeprecationWarning: Invalid arguments were passed to BashOperator (task_id: tst_dyn_dag). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'provide_context': True}
super(BashOperator, self).__init__(*args, **kwargs)
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
app01_user
app02_user
app03_user
app04_user
testDynDags
Upon running it again, all 4 of the dynamically generated DAGs above disappeared and only the base DAG, "testDynDags", is displayed.
When I was getting this error, there was an exception showing up in the webserver logs. Once I resolved that error and restarted the webserver, it went through normally.
From what I can see, this is the error that is thrown when the webserver tries to parse a DAG file and hits an error. In my case it was an error importing a new operator I had added to a plugin.
Usually I check the Airflow UI first, since the reason for a broken DAG sometimes appears there. If it is not there, I run the .py file of my DAG directly, and the error (the reason the DAG can't be parsed) will appear.
I never got to work on dynamic DAG generation, but I did face this issue when the DAG was not present on all nodes (scheduler, worker and webserver). If you have an Airflow cluster, please make sure the DAG is present on all Airflow nodes.
Same error here; the reason was that I had renamed my dag_id to uppercase, something like "import_myclientname" into "import_MYCLIENTNAME".
I am a little late to the party, but I faced the error today:
In short: try executing airflow dags report and/or airflow dags reserialize
Check out my comment here:
https://stackoverflow.com/a/73880927/4437153
I found that airflow fails to recognize a dag defined in a file that does not have from airflow import DAG in it, even if DAG is not explicitly used in that file.
For example, suppose you have two files, a.py and b.py:
# a.py
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def makedag(dag_id="a"):
    with DAG(dag_id=dag_id) as dag:
        DummyOperator(task_id="nada")
    return dag  # return the DAG so module-level `dag = makedag()` actually exposes it

dag = makedag()
and
# b.py
from a import makedag
dag = makedag(dag_id="b")
Then airflow will only look at a.py. It won't even look at b.py at all, even to notice if there's a syntax error in it! But if you add from airflow import DAG to it and don't change anything else, it will show up.
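So a fixed b.py along these lines (my own illustration of the workaround described above, not code from the answer) would get picked up:
# b.py
from airflow import DAG  # not used directly, but makes the DAG-file scan include this file
from a import makedag

dag = makedag(dag_id="b")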

Airflow: Simple DAG with one task never finishes

I have made a very simple DAG that looks like this:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
cleanup_command = "/home/ubuntu/airflow/dags/scripts/log_cleanup/log_cleanup.sh "
dag = DAG(
    'log_cleanup',
    description='DAG for deleting old logs',
    schedule_interval='10 13 * * *',
    start_date=datetime(2018, 3, 30),
    catchup=False,
)
t1 = BashOperator(task_id='cleanup_task', bash_command=cleanup_command, dag=dag)
The task finishes successfully, but despite this the DAG remains in "running" status. Any idea what could cause this? The screenshot below showed the issue, with the DAG remaining in the running state. The earlier runs only finished because I manually marked their status as success.
Are you sure your scheduler is running? You can start it with $ airflow scheduler, and check the scheduler CLI command docs. You shouldn't have to manually set tasks to running.
Your code here seems fine. One thing you might try is restarting your scheduler.
In the Airflow metadata database, DAG run end state is disconnected from task run end state. I've seen this happen before, but usually it resolves itself on the scheduler's next loop when it realizes all of the tasks in the DAG run have reached a final state (success, failed, or skipped).
Are you running the LocalExecutor, SequentialExecutor, or something else here?

Airflow returns "Backfill done" without running tasks

I'm running Airflow and attempting to iterate on some tasks we're building from the command line.
When running the airflow webserver, everything works as expected. But when I run airflow backfill dag task '2017-08-12', airflow returns:
[2017-08-15 02:52:55,639] {__init__.py:57} INFO - Using executor LocalExecutor
[2017-08-15 02:52:56,144] {models.py:168} INFO - Filling up the DagBag from /usr/local/airflow/dags
2017-08-15 02:52:59,055 - airflow.jobs.BackfillJob - INFO - Backfill done. Exiting
...and doesn't actually run the dag.
When using airflow test or airflow run (i.e. commands that run a single task rather than a whole dag), it works as expected.
Am I making a basic mistake? What can I do to debug from here?
Thanks
Have you already run those DAGs on that date range? You will need to clear the DAG first and then backfill. This is based on what Maxime mentioned here: https://groups.google.com/forum/#!topic/airbnb_airflow/gMY-sc0QVh0
If a task has an @monthly schedule and you try to run it with a start_date mid-month, it will merely state "Backfill done. Exiting.". If a task has a schedule of '30 5 * * *', this also prevents backfill from the command line.
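For example, something along these lines (the dag id and dates are placeholders, and Airflow 1.x CLI syntax is assumed):
# clear any previous runs for that date, then backfill it again
airflow clear my_dag -s 2017-08-12 -e 2017-08-12
airflow backfill my_dag -s 2017-08-12 -e 2017-08-12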
(Updated to reflect better information, and this discussion)
Two possible reasons:
The execution date specified via the -e option is outside of the DAG's [start_date, end_date) range.
Even if the execution date is between those dates, keep in mind that if your DAG has schedule_interval=None then it won't backfill iteratively: it will only run for a single date (specified as --start_date, or --end_date if the former is omitted).

What to do when a py.test hangs silently?

While using py.test, I have some tests that run fine with SQLite but hang silently when I switch to Postgresql. How would I go about debugging something like that? Is there a "verbose" mode I can run my tests in, or can I set a breakpoint? More generally, what is the standard plan of attack when pytest stalls silently? I've tried using pytest-timeout and ran the tests with $ py.test --timeout=300, but they still hang with no activity on the screen whatsoever.
I ran into the same SQLite/Postgres problem with Flask and SQLAlchemy, similar to Gordon Fierce. However, my solution was different. Postgres is strict about table locks and connections, so explicitly closing the session connection on teardown solved the problem for me.
My working code:
@pytest.yield_fixture(scope='function')
def db(app):
    # app is an instance of a flask app, _db a SQLAlchemy DB
    _db.app = app
    with app.app_context():
        _db.create_all()

    yield _db

    # Explicitly close DB connection
    _db.session.close()
    _db.drop_all()
Reference: SQLAlchemy
To answer the question "How would I go about debugging something like that?"
Run with py.test -m trace --trace to get trace of python calls.
One option (useful for any stuck unix binary) is to attach to process using strace -p <PID>. See what system call it might be stuck on or loop of system calls. e.g. stuck calling gettimeofday
For more verbose py.test output, install pytest-sugar (pip install pytest-sugar) and run the tests with py.test --verbose . . .
https://pypi.python.org/pypi/pytest-sugar
I had a similar problem with pytest and Postgresql while testing a Flask app that used SQLAlchemy. It seems pytest has a hard time running a teardown using its request.addfinalizer method with Postgresql.
Previously I had:
@pytest.fixture
def db(app, request):
    def teardown():
        _db.drop_all()

    _db.app = app
    _db.create_all()
    request.addfinalizer(teardown)
    return _db
( _db is an instance of SQLAlchemy I import from extensions.py )
But if I drop the database every time the database fixture is called:
@pytest.fixture
def db(app, request):
    _db.app = app
    _db.drop_all()
    _db.create_all()
    return _db
Then pytest won't hang after your first test.
Not knowing what is breaking in the code, the best approach is to isolate the test that is failing and set a breakpoint in it to have a look. Note: I use pudb instead of pdb, because it's really the best way to debug Python if you are not using an IDE.
For example, you can add the following to your test file:
import pudb

...

def test_create_product(session):
    pudb.set_trace()
    # Create the Product instance
    # Create a Price instance
    # Add the Product instance to the session.
    ...
Then run it with
py.test -s --capture=no test_my_stuff.py
Now you'll be able to see exactly where the script locks up, and examine the stack and the database at this particular moment of execution. Otherwise it's like looking for a needle in a haystack.
I ran into this problem myself and spent quite some time on it (though I wasn't using SQLite). The test suite ran fine locally, but failed in CircleCI (Docker).
My problem was ultimately that:
An object's underlying implementation used threading
The object's __del__ normally would end the threads
My test suite wasn't calling __del__ as it should have
I figured I'd add how I tracked this down. Other answers suggest these approaches:
I found that pytest-timeout didn't help; the test hung after completion
Invoked via pytest --timeout 5
Versions: pytest==6.2.2, pytest-timeout==1.4.2
Running pytest -m trace --trace or pytest --verbose yielded no useful information either
I ended up having to comment out literally everything, including all conftest.py code and test code, then slowly uncommented/re-commented regions until I identified the root cause.
Ultimate solution: using a factory fixture to add a finalizer that calls __del__ (a sketch follows below).
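A minimal sketch of that pattern (the ThreadedWorker class here is a made-up stand-in for the threaded object, not code from the original suite):
import threading
import pytest

class ThreadedWorker:
    # Hypothetical stand-in for an object whose implementation starts a background thread.
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._stop.wait)
        self._thread.start()

    def __del__(self):
        # Signal the thread to exit and wait for it; without this, pytest can hang
        # at exit waiting on the still-running non-daemon thread.
        self._stop.set()
        self._thread.join()

@pytest.fixture
def make_worker(request):
    # Factory fixture: every worker it creates gets a finalizer that calls __del__
    # explicitly, so the thread is torn down even if the garbage collector never runs it.
    def _make():
        worker = ThreadedWorker()
        request.addfinalizer(worker.__del__)
        return worker
    return _make

def test_worker_thread_is_running(make_worker):
    worker = make_worker()
    assert worker._thread.is_alive()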
In my case the Flask application did not check if __name__ == '__main__': so it executed app.start() when that was not my intention.
You can read many more details here.
In my case, computing the diff for a failed assert comparing 4 MB of data was very slow.
with open(path, 'rb') as f:
    assert f.read() == data
Fixed by:
with open(path, 'rb') as f:
    eq = f.read() == data
    assert eq
