I want to test a DAG in Airflow end to end. I have some fixed input data and some expected results.
I tried googling around; unfortunately I can only find information related to unit tests.
Are there some good practices to follow?
I know that I can trigger a DAG run via the terminal, but what would be the recommended way to check whether the result is satisfactory when the duration of the DAG run is not fixed?
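One way to frame this (a sketch of a possible approach, not an established best practice): trigger the run, poll the run's state in the metadata database until it reaches a terminal state or a timeout expires, and only then compare the produced output with the expected results. The sketch below assumes Airflow 2.x, a placeholder DAG id my_etl_dag, and that the script runs on a machine with access to the Airflow installation and its metadata database:

```python
import subprocess
import time

from airflow.models import DagRun
from airflow.utils.state import State

DAG_ID = "my_etl_dag"      # placeholder DAG id
RUN_ID = "e2e_test_run"    # placeholder run id
TIMEOUT = 30 * 60          # give up after 30 minutes
POLL_INTERVAL = 30         # seconds between state checks

# Trigger the run via the CLI (Airflow 2.x syntax; 1.x used `airflow trigger_dag`).
subprocess.run(["airflow", "dags", "trigger", DAG_ID, "--run-id", RUN_ID], check=True)

# Poll the metadata database until the run reaches a terminal state.
deadline = time.monotonic() + TIMEOUT
while time.monotonic() < deadline:
    runs = DagRun.find(dag_id=DAG_ID, run_id=RUN_ID)
    if runs and runs[0].state in (State.SUCCESS, State.FAILED):
        break
    time.sleep(POLL_INTERVAL)
else:
    raise TimeoutError(f"{DAG_ID} did not finish within {TIMEOUT} seconds")

assert runs[0].state == State.SUCCESS, f"DAG run ended in state {runs[0].state}"

# Only now compare the produced output (files, tables, ...) with the expected results.
```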
What is the difference between max_retries and status_retries when using the Airflow BatchOperator? I need to ensure that if a batch job fails, Airflow will mark the task (that triggered the batch job) as a failure. Currently, my batch job fails, but the Airflow BatchOperator task is marked as a success. I believe that using one or both of these parameters will solve my problem, but I'm not sure what the difference really is between them. Thanks!!
I have an Airflow pipeline that starts with a FileSensor that may perform a number of retries (which makes sense because the producing process sometimes takes longer, and sometimes simply fails).
However, when I restart the pipeline, as it runs in catchup mode, the retries in the FileSensor become spurious: if the file isn't there for a previous day, it won't materialize anymore.
Therefore my question: is it possible to make the behavior of a DAG run contingent on whether it is currently executing as a catch-up run or as a regularly scheduled run?
My apologies if this is a duplicate question: it seems a rather basic problem, but I couldn't find previous questions or documentation.
The solution is rather simple.
Set a LatestOnlyOperator upstream from the FileSensor.
Set an operator of any type you may need downstream from the FileSensor, with its trigger rule set to TriggerRule.ALL_DONE.
Both skipped and success count as "done" states (strictly speaking, so does failed, which is why TriggerRule.NONE_FAILED is the closer fit if a failing sensor must block the downstream task). Hence, in a non-catch-up run the downstream task starts once the FileSensor has finished, while in a catch-up run the LatestOnlyOperator skips the FileSensor and the downstream task starts right away.
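A minimal sketch of that layout (Airflow 2.x import paths; the dag_id, task ids, file path, and the BashOperator are just placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.latest_only import LatestOnlyOperator
from airflow.sensors.filesystem import FileSensor
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="file_pipeline",                   # placeholder dag_id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    # Skips everything downstream of it for any run that is not the latest one.
    latest_only = LatestOnlyOperator(task_id="latest_only")

    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/input.csv",  # placeholder path
        poke_interval=60,
        retries=3,
    )

    # ALL_DONE: starts once the sensor has finished, whether it succeeded or was
    # skipped by the LatestOnlyOperator (use NONE_FAILED if a failed sensor should
    # still block this task).
    process = BashOperator(
        task_id="process",
        bash_command="echo processing",       # placeholder for the real work
        trigger_rule=TriggerRule.ALL_DONE,
    )

    latest_only >> wait_for_file >> process
```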
Any idea why this would occur, and how would I go about troubleshooting it?
--
Update:
It's "none status" not "queued" as I originally interpreted
The DAG run occurred on 3/8 and last relevant commit was on 3/1. But I'm having trouble finding the same DAG run....will keep investigating
It's not Queued status. It's None status.
This can happen in one of the following cases:
The task drop_staging_table_if_exists was added after create_staging_table started to run.
The task drop_staging_table_if_exists used to have a different task_id in the past.
The task drop_staging_table_if_exists was somewhere else in the workflow and you changed the dependencies after the DAG run started.
Note that Airflow currently doesn't support DAG versioning (it will be supported in future versions once AIP-36 DAG Versioning is completed). This means that Airflow constantly reloads the DAG structure, so changes that you make will also be reflected in past runs. This is by design, and it's very useful for cases where you want to backfill past runs.
Either way, if you start a new run or clear this specific run, the issue you are facing will be resolved.
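If it helps, a rough sketch of clearing that specific run programmatically (the same can be done from the UI); the dag_id and dates below are placeholders:

```python
from datetime import datetime

from airflow.models import DagBag

dagbag = DagBag()                        # parses the configured DAGs folder
dag = dagbag.get_dag("my_staging_dag")   # placeholder dag_id

# Clear all task instances of the affected run; the scheduler then picks the run
# up again against the current DAG structure.
dag.clear(
    start_date=datetime(2021, 3, 8),     # execution date of the stuck run
    end_date=datetime(2021, 3, 8),
)
```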
I have created a DAG, and that DAG is available in the Airflow UI, so I turned it on to run it. After running the DAG, the status showed it was up for retry. After that I went to the server and ran the command "airflow scheduler", and after that the DAG ran successfully.
The scheduler was up and running before I ran the DAG, and I am not sure why this happened.
Do we need to run the Airflow scheduler whenever we create a new DAG?
I want to know how the scheduler works.
Thanks
You can look at the Airflow scheduler as an infinite loop that checks tasks' states on each iteration and triggers tasks whose dependencies have been met.
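Purely as an illustration of that loop (a toy model, not Airflow's actual scheduler code):

```python
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    task_id: str
    upstream: List["Task"] = field(default_factory=list)
    state: str = "none"                # none -> queued -> success / failed

    def dependencies_met(self) -> bool:
        return all(t.state == "success" for t in self.upstream)


def scheduler_loop(tasks: List[Task], poll_interval: float = 1.0) -> None:
    """Endlessly scan task states and queue every task whose upstream has succeeded."""
    while True:
        for task in tasks:
            if task.state == "none" and task.dependencies_met():
                task.state = "queued"  # a real scheduler hands the task to an executor
        time.sleep(poll_interval)
```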
The entire process generates a bunch of data that piles up more and more on each round and, at some point, it might end up rendering the scheduler useless as its performance degrades over time. This depends on your Airflow version; it seems to be solved in the newest version (2.0), but for older ones (< 2.0) the recommendation was to restart the scheduler every run_duration seconds, with some people recommending setting it to once an hour or once a day. So, unless you're working on Airflow 2.0, I think this is what you're experiencing.
You can find references to this scheduler-restarting issue in posts made by Astronomer here and here.
I am trying to use airflow trigger_dag dag_id to trigger my DAG, but it just shows the running state and doesn't do anything more.
I have searched through many questions, but people just say the DAG id is paused. The problem is that my DAG is unpaused, but it still keeps the running state.
Note: I can use one DAG to trigger another one in the web UI, but it doesn't work from the command line.
I have had the same issue many times. The state of the task is not running, and it is not queued either; it's stuck after we 'clear'. Sometimes I found the task going into the Shutdown state before getting stuck, and after a long time the instance gets marked as failed while the task status is still shown in white. I have solved it in several ways; I can't say the exact reason or a single solution, but try one of these:
Try the trigger dag command again with the same execution date and time instead of the clear option.
Try backfill; it will run only the unsuccessful instances.
Or try with a different time within the same interval; it will create another instance which is fresh and does not have the issue.