How to watch task logs via the CLI in Airflow?

I am having a problem: no logs are displayed in the Airflow UI I am currently working with. I don't know the reason, but I've already informed my colleagues about it and they're looking for a solution.
Meanwhile I need to watch the logs of certain tasks in my DAG. Is there any way to do this via the Airflow CLI?
I am using the airflow tasks run command, but it only seems to run the task and doesn't show anything in the command line.

By default, Airflow stores your logs under $AIRFLOW_HOME/logs/, so you may find them there if they are still being generated.
In the meantime you can use airflow tasks test <dag_id> <task_id>; this runs a single task instance and prints its logs straight to your terminal.
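For example, with a hypothetical DAG my_dag containing a task extract, something like this prints the task's log output straight to the terminal (depending on your Airflow version you may also need to pass an execution date, as shown here):
airflow tasks test my_dag extract 2021-01-01
You can also read the log files on disk; the exact layout varies by version, but it is typically nested by DAG, task and run, e.g. something like $AIRFLOW_HOME/logs/my_dag/extract/2021-01-01T00:00:00+00:00/1.log on older versions.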

Related

Unable to create DAG with Apache Airflow

I am running Airflow 2.0, setting up an Airflow DAG for the first time, and following the quick start tutorials.
After creating and running the .py file, I don't see the DAG created; it does not show up when I list the DAGs.
Setting:
airflow.cfg: dags_folder = /Users/vik/src/airflow/dags
My Python file is in this folder. There are no errors here.
I am able to see the example DAGs in example-dags.
I ran:
airflow db init
airflow webserver
airflow scheduler
and then tried to list the DAGs.
I think I am missing something.
I don't know exactly how you installed everything, but I highly recommend the Astronomer CLI for simplicity and a quick setup. With that you'll be able to set up a first DAG pretty quickly. Here is also the video tutorial that helps you understand how to install and set up everything.
A few things to try:
Make sure the scheduler is running (run airflow scheduler), or try restarting it.
Using the Airflow CLI, run airflow config list and make sure that the loaded config is in fact what you are expecting; check the value of dags_folder.
Try running airflow dags list from the CLI, and check the file path and whether your DAG shows up in the results.
If there was an error parsing your DAG, and it therefore could not be loaded by the scheduler, you can find the logs in ${AIRFLOW_HOME}/logs/scheduler/${date}/your_dag_id.log
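For example, on Airflow 2 you can verify the effective dags_folder and see which DAGs were actually parsed (the paths and DAG names are illustrative):
airflow config get-value core dags_folder
airflow dags list
If your DAG is missing from that list, check the scheduler's parsing logs for your file under ${AIRFLOW_HOME}/logs/scheduler/ (the exact file naming varies a bit between versions).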

Cloud composer import custom plugin to all existing dags

I am using Cloud Composer to schedule multiple DAGs. These DAGs are built dynamically using this method and they use custom plugins.
I would like to know how to proceed when adding or modifying a plugin that concerns all DAGs (let's say it adds a new task to each DAG)?
Do we need to pause all the running DAGs when doing so?
What I have done so far when adding/modifying a plugin is:
Upload the plugins into the plugins bucket of the Composer environment (using the gcloud composer command)
Do a dummy update of the Airflow config, i.e. add a dummy value to airflow.cfg (using gcloud composer commands)
I did that to force the DAGs to pause; once the update is finished the DAGs resume, but with the new plugins and hence the new tasks (or, if not in this DAG run, then in the next one). Is this useless?
Thanks if you can help.
As explained in the architecture diagram, the Airflow webserver where you view your DAG and plugin code runs in a Google-managed tenant project, whereas the Airflow workers which actually run your DAG and plugin code are directly in your project.
When a DAG/plugin is placed in the Composer bucket, the Airflow webserver (which runs in the tenant project) validates the code and updates any new scheduling changes in the Airflow database.
At the same time, the Airflow scheduler (in your project) asks the Airflow database for the next DAG to run and notifies the Airflow workers to perform the scheduled work. The Airflow workers (in your project) then grab the DAG/plugin code from the Composer bucket and compile it in order to run that specific task.
Thus, any updates made to your DAG/plugin code are read separately by the Airflow webserver and the Airflow workers, at different times.
If you do not see your new code in the Airflow webserver, it should still be picked up by the Workers when they grab the code fresh on the new task run.
Therefore you shouldn't have to restart Composer for the workers to pick up the changes.
You cannot force a worker to grab and re-compile the new code mid task execution.
There are two ways to refresh the Airflow Webserver to see the Plugin code changes if it is not updating:
Set the reload_on_plugin_change property to True in the [webserver] section via the ‘AIRFLOW CONFIGURATIONS OVERRIDE’ tab in the Console.
OR, you can specifically add/remove/update a PyPI package via the ‘PYPI PACKAGES’ Console tab. Non-PyPI-package changes will not trigger a web server restart. Note that this will also initiate an entire Composer environment restart, which may take ~20 minutes.
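If you prefer the CLI over the Console, the same override can also be applied with a gcloud command along these lines (environment name and location are placeholders):
gcloud composer environments update ENVIRONMENT_NAME --location LOCATION --update-airflow-configs=webserver-reload_on_plugin_change=True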

Scheduler not updating package files

I'm developing a DAG on Cloud Composer; my code is separated into a main Python file and one package with subfolders. It looks like this:
my_dag1.py
package1/__init__.py
package1/functions.py
package1/package2/__init__.py
package1/package2/more_functions.py
I updated one of the functions in package1/functions.py to take an additional argument (and updated the reference in my_dag1.py). The code runs correctly in my local environment and I was not getting any errors when running
gcloud beta composer environments run my-airflow-environment list_dags --location europe-west1
But the web UI raised a Python error:
TypeError: my_function() got an unexpected keyword argument 'new_argument'
I have tried to rename the function and the error changed to
NameError: name 'my_function' is not defined
I tried changing the name of the DAG and uploading the files to the DAG folder both zipped and unzipped, but nothing worked.
The error disappeared only after I renamed the package folder.
I suspect the issue is related to the scheduler picking up my_dag1.py but not package1/functions.py. The error appeared out of nowhere, as I had made similar updates in the previous weeks.
Any idea on how to fix this issue without refactoring the whole code structure?
EDIT-1
Here's the link to related discussion on Google Groups
I've run into a similar issue: the "Broken DAG" error wouldn't clear in the web UI. I suspect this is a caching bug in the Airflow web server.
Background:
I created a custom operator using the Airflow plugins feature.
After I imported the custom operator, the Airflow web UI kept showing a Broken DAG error saying that it couldn't find the custom operator.
Why do I think it's a bug in the Airflow web server?
I can manually run the DAG with the airflow test command, so the import should be correct.
Even if I remove the related DAG file from Airflow's /dags/ folder, the error is still there.
Here is what I did to resolve the issue:
Restart the Airflow webserver (sometimes this alone resolves the issue).
Make sure no DAG is running, then restart the Airflow scheduler.
Make sure no DAG is running, then restart the Airflow workers.
Hopefully this helps someone with the same issue.
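How exactly you restart these services depends on your deployment. On a self-managed install where Airflow runs under systemd, for example, it might look something like this (the unit names are assumptions and differ between setups):
sudo systemctl restart airflow-webserver
sudo systemctl restart airflow-scheduler
sudo systemctl restart airflow-worker
On Cloud Composer you cannot restart the services this way; use the gcloud command from the next answer instead.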
Try restarting the webserver with:
gcloud beta composer environments restart-web-server ENVIRONMENT_NAME --location=LOCATION

Airflow DAG "seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence"

I use Airflow for a workflow of Spark jobs. After installation, I copied the DAG files into the DAGs folder set in airflow.cfg. I can backfill the DAG to run the BashOperators successfully, but there is always a warning like the one mentioned in the title. I didn't verify whether the scheduling is fine, but I doubt that scheduling can work, since the warning says the master scheduler doesn't know of my DAG's existence. How can I eliminate this warning and get scheduling to work? Has anybody run into the same issue and can help me out?
This is usually connected to the scheduler not running or the refresh interval being too wide. There are no log entries in your post, so we cannot analyze from there. Also, the very cause might unfortunately have been overlooked, because this is usually the root of the problem:
I didn't verify if the scheduling is fine.
So first you should check if both of the following services are running:
airflow webserver
and
airflow scheduler
If that doesn't help, see this post for more reference: Airflow 1.9.0 is queuing but not launching tasks
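On a self-managed install, a quick way to confirm both processes are actually alive is something like:
ps aux | grep -E 'airflow (webserver|scheduler)'
On Airflow 2.1+ there is also airflow jobs check --job-type SchedulerJob, which fails if the scheduler has not sent a heartbeat recently.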

Apache Airflow Best Practice: (Python)Operators or BashOperators

These days I'm working on a new ETL project and I wanted to give Airflow a try as a job manager.
My colleague and I are both working with Airflow for the first time, and we are following two different approaches: I decided to write Python functions (operators like the ones included in the apache-airflow project), while my colleague uses Airflow to call external Python scripts through the BashOperator.
I'd like to know if there is something like a "good practice" here, whether the two approaches are equally good, or whether I should consider one over the other.
To me, the main differences are:
- with the BashOperator you can call a Python script using a specific Python environment with specific packages
- with the BashOperator the tasks are more independent and can be launched manually if Airflow goes mad
- with the BashOperator task-to-task communication is a bit harder to manage
- with the BashOperator task errors and failures are harder to manage (how can a bash task know whether the task before it failed or succeeded?).
What do you think?
My personal preference in these cases would be to use a PythonOperator over BashOperator. Here's what I do and why:
Single repo that contains all my DAGs. This repo also has a setup.py that includes airflow as a dependency, along with anything else my DAGs require. Airflow services are run from a virtualenv that installs these dependencies. This handles the python environment you mentioned regarding BashOperator.
I try to put all Python logic unrelated to Airflow in its own externally packaged python library. That code should have its own unit tests and also has its own main so it can be called on the command line independent of Airflow. This addresses your point about when Airflow goes mad!
If the logic is small enough that it doesn't make sense to separate into its own library, I drop it in a utils folder in my DAG repo, with unit tests still of course.
Then I call this logic in Airflow with the PythonOperator. The Python callable can easily be unit tested, unlike a BashOperator template script. This also means you can access things like starting an Airflow DB session, pushing multiple values to XCom, etc.
Like you mentioned, error handling is a bit easier in Python. You can catch exceptions and check return values easily, and you can choose to mark a task as skipped by raising AirflowSkipException.
FYI for BashOperator, if the script exits with an error code, Airflow will mark the task as failed.
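To make this concrete, here is a minimal sketch of that layout, assuming Airflow 2.x; the commented-out my_etl_lib import stands in for your externally packaged logic and is purely hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator

# Hypothetical externally packaged logic, installed into the same
# virtualenv that runs Airflow (e.g. via the repo's setup.py).
# from my_etl_lib import extract_rows


def extract(**context):
    # rows = extract_rows(context["ds"])
    rows = []  # placeholder for illustration
    if not rows:
        # Mark the task as skipped rather than failed when there is nothing to do.
        raise AirflowSkipException("No data available for this run")
    # Push a value to XCom for downstream tasks.
    context["ti"].xcom_push(key="row_count", value=len(rows))


with DAG(
    dag_id="example_python_operator_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract", python_callable=extract)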
In my case, TaskA checks data availability at the source and TaskB processes it:
TaskA >> TaskB
Both tasks use the BashOperator to call Python scripts. I used to call sys.exit(1) (when there is no data at the source) from the script triggered by TaskA, as a way to signal that TaskA failed because there is no data and there is no need to run TaskB.
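A minimal sketch of that pattern, assuming Airflow 2.x (the script paths are hypothetical): when script1.py exits non-zero, the BashOperator marks task_a as failed, and with the default trigger_rule of all_success, task_b is then not run.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_bash_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # script1.py calls sys.exit(1) when no data is available at the source.
    task_a = BashOperator(task_id="task_a", bash_command="python /opt/scripts/script1.py")
    task_b = BashOperator(task_id="task_b", bash_command="python /opt/scripts/script2.py")
    task_a >> task_b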
