We are using Cloud Composer (managed Airflow on GCP) to orchestrate our tasks, and we are moving all our logs to Sumo Logic (a standard process in our org). Our requirement is to track the entire log of a single execution of a DAG; as of now there seems to be no way to do this.
Currently, the first task in the DAG generates a unique ID and passes it to the other tasks via XCom. The problem is that we are not able to inject this unique ID into the logs written by Airflow operators (like BigQueryOperator).
Is there any other way to inject a custom unique ID into Airflow operator logs?
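For reference, here is a minimal sketch of what we do today, assuming Airflow 2.x; the DAG and task names are hypothetical:

import uuid
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def generate_run_uid(**context):
    # First task: create a run-level unique ID and share it via XCom.
    context["ti"].xcom_push(key="run_uid", value=str(uuid.uuid4()))

def use_run_uid(**context):
    # Downstream task: pull the ID. It appears in this task's own log,
    # but not in the logs that built-in operators write themselves.
    run_uid = context["ti"].xcom_pull(task_ids="generate_run_uid", key="run_uid")
    print(f"run_uid={run_uid}")

with DAG(
    dag_id="tracked_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start = PythonOperator(task_id="generate_run_uid", python_callable=generate_run_uid)
    downstream = PythonOperator(task_id="use_run_uid", python_callable=use_run_uid)
    start >> downstream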
Composer integrates with Stackdriver Logging, and you can filter per-DAG-run logs by "workflow:{your-dag-name}" and "execution-date:{your-dag-run-date}". For example, you could read log entries with the following filters:
resource.type="cloud_composer_environment"
resource.labels.location="your-location"
resource.labels.environment_name="your-environment-name"
logName="projects/cloud-airflow-dev/logs/airflow-worker"
labels."execution-date"="your-dag-run-date"
labels.workflow="your-dag-id"
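If it helps, here is a minimal sketch that reads those entries with the google-cloud-logging Python client (project, location, environment, and DAG values are placeholders; the same filter also works in the Logs Explorer or with gcloud logging read):

from google.cloud import logging

client = logging.Client(project="your-project-id")

# Newline-separated expressions are ANDed together by the Logging filter language.
log_filter = '''
resource.type="cloud_composer_environment"
resource.labels.location="your-location"
resource.labels.environment_name="your-environment-name"
logName="projects/your-project-id/logs/airflow-worker"
labels."execution-date"="your-dag-run-date"
labels.workflow="your-dag-id"
'''

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)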
I am using the Airflow EcsOperator to run AWS ECS tasks. As part of this, I am using a custom fluentbit container that is set up to ship the container logs to CloudWatch and AWS OpenSearch. The logging to both destinations works fine. However, I noticed that the CloudWatch log streams are generated in the format {awslogs_stream_prefix}-{ecs_task_id}. Braces are added just to show the two parts separately; the actual stream name is of the form "ecsprime-generator-container-firelens-977be157d3be4614a84537cd081152d7", where the string starting with 977 is the task ID. Unfortunately, the Airflow code that reads CloudWatch logs expects the log stream name to be in the format {awslogs_stream_prefix}/{ecs_task_id}. Because of this, I am not able to have the Airflow EcsOperator display the corresponding CloudWatch logs.
Are there any workarounds to address this?
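For context, the kind of direct CloudWatch read I can fall back to looks roughly like this, building the hyphenated stream name myself instead of relying on the operator (group, prefix, region, and task-id values are placeholders):

import boto3

logs = boto3.client("logs", region_name="us-east-1")

awslogs_group = "/my/ecs/log-group"
awslogs_stream_prefix = "ecsprime-generator-container-firelens"
ecs_task_id = "977be157d3be4614a84537cd081152d7"

# The stream that actually exists uses a hyphen, not the slash Airflow expects.
stream_name = f"{awslogs_stream_prefix}-{ecs_task_id}"

resp = logs.get_log_events(
    logGroupName=awslogs_group,
    logStreamName=stream_name,
    startFromHead=True,
)
for event in resp["events"]:
    print(event["message"])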
I have a requirement to get information about the current run from within a running DAG instance.
For example, if I have created a DAG run [run_id] via the Airflow API, is there a way to get the global variables of that run, and to define a method that is aware of the variables of each DAG run so that I can get the parameters I want?
If you need cross-communication between tasks, you can use XCom.
Note that XComs are meant for sharing small pieces of metadata and are limited in size.
Airflow also offers Variables as a key/value store.
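A minimal sketch of both, assuming the functions are used as PythonOperator callables (task ids, keys, and values are hypothetical):

from airflow.models import Variable

def push_params(**context):
    # XCom: small, per-DAG-run metadata shared between tasks of the same run.
    context["ti"].xcom_push(key="my_param", value={"run_flag": True})

def pull_params(**context):
    my_param = context["ti"].xcom_pull(task_ids="push_params", key="my_param")
    # Variables: a global key/value store, not scoped to a single DAG run.
    threshold = Variable.get("my_threshold", default_var="10")
    print(my_param, threshold)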
I was looking through the different API endpoints that Airflow offers, but I could not find one that would suit my needs. Essentially, I want to monitor the state of each task within the DAG without having to specify each task I am trying to monitor. Ideally, I would be able to ping the DAG and the response would tell me the state of its tasks and which tasks are running, retrying, etc.
You can use the Airflow REST API that ships with it: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
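For example, here is a sketch that lists every task instance (and its state) for one DAG run via the stable REST API; the base URL, credentials, dag_id, and run_id below are placeholders:

import requests

base_url = "http://localhost:8080/api/v1"
dag_id = "my_dag"
dag_run_id = "manual__2023-01-01T00:00:00+00:00"

resp = requests.get(
    f"{base_url}/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances",
    auth=("airflow", "airflow"),
)
resp.raise_for_status()
for ti in resp.json()["task_instances"]:
    print(ti["task_id"], ti["state"])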
I am getting started with Apache Airflow and trying to set up an event-driven DAG. My event is a file landing in a Linux directory, and this file can land multiple times throughout the day. I am using the FileSensor operator for file monitoring.
My requirement is that every time the file lands (with the same name) in the directory, the DAG should kick off.
I was reading the official scheduling documentation, and my understanding is that with the option None I can have my DAG triggered externally based on an event, and that it can be triggered multiple times throughout the day by that external event.
Is my understanding correct? The official documentation doesn't go into much detail on this.
https://airflow.apache.org/scheduler.html?highlight=scheduling
That is correct. Setting the schedule_interval to None means Airflow will never automatically schedule a run of the DAG.
You can trigger dag_runs externally in a few different ways (see the example after the list):
through the Airflow CLI
using a local client from within a Python script
through the Airflow REST API
manually via the trigger button in the Web UI
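For example, a sketch of the REST API option (URL, credentials, dag_id, and conf are placeholders; the CLI equivalent is airflow dags trigger my_dag):

import requests

resp = requests.post(
    "http://localhost:8080/api/v1/dags/my_dag/dagRuns",
    auth=("airflow", "airflow"),
    json={"conf": {"source_file": "/data/incoming/file.csv"}},
)
resp.raise_for_status()
print(resp.json()["dag_run_id"], resp.json()["state"])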
We are currently using Apache Mesos with Marathon and Chronos to schedule long-running and batch processes.
It would be great if we could create more complex workflows, as with Oozie, for example kicking off a job when a file appears in a location, or when a certain application completes or calls an API.
While it seems we could do this with Marathon/Chronos or Singularity, there seems to be no readily available interface for this.
You can use Chronos' /scheduler/dependency endpoint to specify "all jobs which must run at least once before this job will run." Do this on each of your Chronos jobs, and you can build arbitrarily complex workflow DAGs.
https://airbnb.github.io/chronos/#Adding%20a%20Dependent%20Job
Chronos currently only schedules jobs based on time or dependency triggers. Other events like file update, git push, or email/tweet could be modeled as a wait-for-X job that your target job would then depend on.
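A rough sketch of registering the target job as dependent on such a wait-for-X job (host, job names, owner, and command are placeholders; check your Chronos version's job schema for the full set of required fields):

import requests

job = {
    "name": "process-file",
    "command": "/usr/local/bin/process_file.sh",
    "owner": "team@example.com",
    # The job runs only after all of its parent jobs have run at least once.
    "parents": ["wait-for-file"],
}

resp = requests.post("http://chronos.example.com:4400/scheduler/dependency", json=job)
resp.raise_for_status()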