How to remove unwanted broken DAGs in Airflow

I wrote something wrong in my sql_test.py. When I run python sql_test.py, the error is 'no module named xxx', and in the web UI it shows a red 'Broken DAG' error.
When I then run airflow list_dags, the same error occurs again. This is strange and I don't know what's happening.
I tried to run airflow delete_dags sql_test but there is no such id.
How can I :
remove the warning in the web UI
get sql_test out of list_dags

There's some syntactical mistake in your dag-definition file, resulting in a failure to parse the DAG. When Airflow fails to parse a DAG, several functionalities break (like list_dags in your case).
Of course deleting the problematic dag-definition file would fix it, but that's not a real solution. So here's how you can understand what's wrong and fix it:
From linux shell, go to Airflow's logs folder
cd $AIRFLOW_HOME/logs/scheduler/latest/
Run tree command to see directory structure
tree -I "__init__.py|__pycache__|*.pyc"
View the last few lines of the log file of your corresponding broken dag
tail -n 25 /path/to/my/broken-dag.py.log
This will give you the stack-trace that Airflow threw while trying to parse your broken dag file. That would hopefully help you diagnose the problem and patch it.
Once your dag-definition file is fixed
the broken dag message would disappear from UI
DAG would appear in the UI (refresh it a few times)
list_dags command would also start working

If you don't want to repair your DAG and would rather ignore it, you can remove the unwanted DAG by listing the DAG's underlying file in an .airflowignore file.
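For example, assuming the broken file is sql_test.py sitting directly in your dags folder (an assumption based on the question), a minimal .airflowignore placed in that same folder could contain just that name; entries are treated as regular-expression patterns by default, and newer Airflow releases can also be configured to use glob syntax.
$AIRFLOW_HOME/dags/.airflowignore:
sql_test.py
The DAG processor then skips the file entirely, so the Broken DAG banner disappears and the file no longer shows up in list_dags.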

Related

How to avoid DAG generation during task run

We have an Airflow Python script which reads configuration files and then generates > 100 DAGs dynamically. When running the script in Airflow 2.4.1, we notice from the task run log that Airflow is trying to parse our Python script for every task run.
https://github.com/apache/airflow/blob/2.4.1/airflow/task/task_runner/standard_task_runner.py#L91-L97
Is there any way to make Airflow deserialize DAGs from the DB instead?
Just found out that it is expected behavior:
https://medium.com/apache-airflow/airflows-magic-loop-ec424b05b629
https://medium.com/apache-airflow/magic-loop-in-airflow-reloaded-3e1bd8fb6671
However, the Python script may use the parsing context to load only the respective DAG (see the sketch after the link below):
https://github.com/apache/airflow/pull/25161
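A minimal sketch of that pattern, assuming Airflow 2.4+ where the helper lives in airflow.utils.dag_parsing_context (the exact import path may differ in later releases); the tuple of names below stands in for whatever your configuration files produce:

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.dag_parsing_context import get_parsing_context

# dag_id is set only when a single task is being executed; it is None during normal scheduler parsing
current_dag_id = get_parsing_context().dag_id

for name in ("alpha", "beta"):  # stand-in for the config-driven list of DAGs
    dag_id = f"generated_dag_{name}"
    if current_dag_id is not None and current_dag_id != dag_id:
        continue  # during a task run, skip building every DAG except the one being executed
    with DAG(
        dag_id=dag_id,
        start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
        schedule=None,
    ) as dag:
        EmptyOperator(task_id="noop")
    globals()[dag_id] = dag

With this guard in place, each task run only pays the parsing cost of its own DAG instead of all 100+ generated ones.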

(Dagster) Schedule my_hourly_schedule was started from a location that can no longer be found

I'm getting the following Warning message when trying to start the dagster-daemon:
Schedule my_hourly_schedule was started from a location Scheduler that can no longer be found in the workspace, or has metadata that has changed since the schedule was started. You can turn off this schedule in the Dagit UI from the Status tab.
I'm trying to automate some pipelines with dagster and created a new project using dagster new-project Scheduler where "Scheduler" is my project.
This command, as expected, created a directory with some hello_world files. Inside it I put the dagster.yaml file with the configuration for a Postgres DB to which I want to write the logs.
However, whenever I run dagster-daemon run from the directory where the workspace.yaml file is located, I get the message above. I tried running the daemon from other folders, but it then complains that it can't find any workspace.yaml file.
I guess I'm running into a "beginner mistake", but could anyone help me with this?
I appreciate any counsel.
One thing to note is that the dagster.yaml file will not do anything unless you've set your DAGSTER_HOME environment variable to point at the directory that this file lives in.
That being said, I think what's going on here is that you don't have the Scheduler package installed into the python environment that you're running your dagster-daemon in.
To fix this, you can run pip install -e . in the Scheduler directory, although the README.md inside that directory has more specific instructions for working with virtualenvs.
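A minimal sketch of both fixes together, with placeholder paths for the project directory created by dagster new-project Scheduler:

export DAGSTER_HOME=/absolute/path/to/Scheduler   # assumed location of dagster.yaml
cd /absolute/path/to/Scheduler
pip install -e .              # install the Scheduler package into the daemon's environment
dagster-daemon run            # run from the directory containing workspace.yaml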

Clear Failed Airflow DAG But Don't Restart

I'm running a custom backfill script that backfills a DAG serially. (If I run the backfill concurrently, I either run into a deadlock problem or a serializable-isolation problem.) As part of the process, I sometimes have failed DAGs mixed in with non-existent dates. To backfill a failed date, I need to clear it first. The problem is that a cleared DAG automatically restarts and conflicts with the first date running in the backfill.
airflow clear dag_id -f -s 01-01-20 -e 01-12-20 -f
Because of the way it is built, I'd have to rewrite it from scratch and clear each backfill individually. If I could prevent a cleared DAG from rerunning, it would save me some time. Is there a way to do this in the CLI?
You could try setting the max_active_runs argument to 1 when creating the DAG object. This will ensure that no more than one execution is active at a time; that way you can clear as many as you'd like and let Airflow handle the rest.
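A minimal sketch of that argument in a DAG definition; the DAG id, dates, and task are placeholders, and DummyOperator is used since the question appears to be about an older Airflow version:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    dag_id="my_backfill_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,  # at most one DAG run may be active at any time
) as dag:
    DummyOperator(task_id="noop")

With max_active_runs=1, clearing a batch of runs only lets Airflow restart them one at a time, in order.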

Airflow scheduler does not appear to be running after executing a task

When there is a task running, Airflow pops up a notice saying the scheduler does not appear to be running, and it keeps showing until the task finishes:
The scheduler does not appear to be running. Last heartbeat was received 5 minutes ago.
The DAGs list may not update, and new tasks will not be scheduled.
Actually, the scheduler process is running, as I have checked the process. After the task finishes, the notice disappears and everything goes back to normal.
My task is kind of heavy and may run for a couple of hours.
I think this is expected for the Sequential Executor. The Sequential Executor runs one thing at a time, so it cannot run the heartbeat and a task at the same time.
Why do you need to use the Sequential Executor / sqlite? The advice to switch to another DB/executor makes perfect sense.
You have started the airflow webserver but you haven't started your airflow scheduler.
Run airflow scheduler in the background:
airflow scheduler > /console/scheduler_log.log &
I had the same issue.
I switched to postgresql by updating the airflow.cfg file: sql_alchemy_conn = postgresql+psycopg2://airflow@localhost:5432/airflow
and executor = LocalExecutor
This link may help with setting this up locally:
https://medium.com/@taufiq_ibrahim/apache-airflow-installation-on-ubuntu-ddc087482c14
A quick fix could be to run the airflow scheduler separately. Perhaps not the best solution but it did work for me. To do so, run this command in the terminal:
airflow scheduler
I had a similar issue and have been trying to troubleshoot this for a while now.
I managed to fix it by setting this value in airflow.cfg:
scheduler_health_check_threshold = 240
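For reference, a minimal sketch of where this lives in airflow.cfg (240 seconds is just the value used above):

[scheduler]
scheduler_health_check_threshold = 240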
PS: Based on a recent conversation in Airflow Slack Community, it could happen due to contention at the Database side. So, another workaround suggested was to scale up the database. In my case, this was not a viable solution.
EDIT:
This was last tested with Airflow Version 2.3.3
I solved this issue by deleting the airflow-scheduler.pid file and then running
airflow scheduler -D
Check the airflow-scheduler.err and airflow-scheduler.log files.
I got an error like this:
Traceback (most recent call last):
File "/home/myVM/venv/py_env/lib/python3.8/site-packages/lockfile/pidlockfile.py", ine 77, in acquire
write_pid_to_pidfile(self.path)
File "/home/myVM/venv/py_env/lib/python3.8/site-packages/lockfile/pidlockfile.py", line 161, in write_pid_to_pidfile
pidfile_fd = os.open(pidfile_path, open_flags, open_mode)
FileExistsError: [Errno 17] File exists: '/home/myVM/venv/py_env/airflow-scheduler.pid'
I removed the existing airflow-scheduler.pid file and started the scheduler again by airflow scheduler -D. It was working fine then.
Our problem was that the file logs/scheduler.log was too large (1 TB). After cleaning up this file, everything was fine.
I had the same issue while using sqlite. There was a special message in the Airflow logs: ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1. If you use only 1 thread, the scheduler will be unavailable while executing a dag.
So if you use sqlite, try switching to another database. If you don't, check the max_threads value in your airflow.cfg.
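A minimal sketch of that setting, assuming an older (1.x) Airflow where the option is still called max_threads; in Airflow 2.x it was renamed to parsing_processes:

[scheduler]
# effectively forced to 1 when the metadata DB is sqlite; can be raised with a real database
max_threads = 2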
On the Composer page, click on your environment name to open the Environment details, then go to the PyPI Packages tab.
Click the Edit button and increase any package's version.
For example:
I increased the version of the pymsql package, and this restarted the Airflow environment; it took a while to update. Once it was done, I no longer had this error.
You can also add a Python package; it will restart the Airflow environment as well.
I had the same issue after changing the Airflow timezone. I then restarted the airflow-scheduler and it worked. You can also check whether the airflow-scheduler and airflow-worker are on different servers.
In simple words, using LocalExecutor and postgresql could fix this error.
I am running Airflow locally, following the instructions at https://airflow.apache.org/docs/apache-airflow/stable/start/local.html.
It has the default config:
executor = SequentialExecutor
sql_alchemy_conn = sqlite:////Users/yourusername/airflow/airflow.db
It will use SequentialExecutor and sqlite by default, and it will have this "The scheduler does not appear to be running." error.
To fix it, I followed Jarek Potiuk's advice. I changed the following config:
executor = LocalExecutor
sql_alchemy_conn = postgresql://postgres:masterpasswordforyourlocalpostgresql@localhost:5432
Then I reran airflow db init:
airflow db init
airflow users create \
--username admin \
--firstname Peter \
--lastname Parker \
--role Admin \
--email spiderman@superhero.org
After the DB is initialized, run
airflow webserver --port 8080
airflow scheduler
This fixed the airflow scheduler error.
This happened to me when AIRFLOW_HOME was not set.
By setting AIRFLOW_HOME to the correct path, the intended executor will be selected.
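A minimal sketch, with a placeholder path pointing at the directory that contains your airflow.cfg:

export AIRFLOW_HOME=/home/youruser/airflow
airflow scheduler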
If it matters: somehow, the -D flag causes a lot of problems for me. The airflow webserver -D immediately crashes after starting, and airflow scheduler -D somehow does next to nothing for me.
Weirdly enough, it works without the detach flag. This means I can just run the program normally, and make it run in the background, with e.g. nohup airflow scheduler &.
After changing the executor from SequentialExecutor to LocalExecutor, it works!
in airflow.cfg:
executor = LocalExecutor

How to see logs of running robot script?

The robot scripts, when run in RIDE, generate output.xml, report.html, etc. once the run is over.
Is there any way to view the logs while the script is still running? (When I use pause on failure.)
Also, sometimes I have to stop/abort the run in the middle, and no logs are generated in such cases.
Kindly help,
Thanks in advance
As for the first part: RIDE runs tests with its own listener added, providing more verbose output and the pause/resume functionality. The easiest thing is to run the tests not from RIDE but from the console using the robot/pybot script. In that case far fewer logs are written to the output (though it doesn't provide pause/resume functionality).
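A minimal sketch of such a console run, with placeholder suite path and output directory:

robot --outputdir results --loglevel DEBUG path/to/my_suite.robot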
As for the second part: robot (RIDE starts the robot script; you can see it in the execution log: command: pybot.bat...) generates the output.xml file not after but during execution, so the generated output.xml is not valid until the test run is finished. After a normal execution, the rebot tool generates log.html automatically. So generally it's possible to take the following steps:
"Fix" your incomplete output.xml file after execution stop with fixml. output.xml location for RIDE execution can be found in the very same execution log of yours (e.g. ...\appdata\local\temp\RIDEv_0yrp.d\ in my case)
Run rebot stand-alone: rebot output.xml --log log.html --report report.html. Rebot options description you can check using rebot --help (as usual)
Please also note that directory where RIDE output files are stored is temporary — exists only when RIDE is started. You will lose your output on exiting RIDE
I'm using RIDE 1.5, so maybe my answer is not valid for other versions.
In RIDE, under the Run tab, while you are running the scripts, there is an option, "show message log", which shows the runtime log.
Try this out.
