I have a DAG in Airflow whose runs are not scheduled but triggered by an event. I would like to send an alert when the DAG has not run in the last 24 hours. My problem is that I am not sure which tool is best for the task.
I tried to solve it with the Logs Explorer. I was able to write a fairly good query filtering by textPayload, but that tool seems designed to send an alert when a specific log entry is present, not when it is missing. (Maybe I missed something?)
I also checked Monitoring, where I could set up an alert on missing logs; however, there I was not able to write a query that filters logs by textPayload.
Thank you in advance if you can help me!
You could set up a separate alert DAG that notifies you if other DAGs haven't run in a specified amount of time. To get the last run time of a DAG, use something like this:
from airflow.models import DagRun
dag_runs = DagRun.find(dag_id=dag_id)
dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
Then you can use dag_runs[0] and compare with the current server time. If the date difference is greater than 24h, raise an alert.
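For example, here is a minimal sketch of that check, meant to run inside a task of the alert DAG; MONITORED_DAG_ID, the 24-hour window, and raising AirflowException as the "alert" are placeholders you would adapt to your setup:

from datetime import timedelta

from airflow.exceptions import AirflowException
from airflow.models import DagRun
from airflow.utils import timezone

MONITORED_DAG_ID = "my_event_driven_dag"  # hypothetical DAG id

def check_last_run():
    # Fetch all runs of the monitored DAG and sort them, newest first
    dag_runs = DagRun.find(dag_id=MONITORED_DAG_ID)
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if not dag_runs or timezone.utcnow() - dag_runs[0].execution_date > timedelta(hours=24):
        # Replace with your preferred notification (email, Slack, ...)
        raise AirflowException(f"{MONITORED_DAG_ID} has not run in the last 24 hours")

Wrap check_last_run in a PythonOperator inside the alert DAG and schedule that DAG periodically, e.g. hourly.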
I was able to do it in Monitoring. I did not need the filtering query I used in the Logs Explorer. I created an Alerting Policy filtered by workflow_name, task_name and location. In the "Configure trigger" section I was able to choose "Metric absence" with a one-day absence window, which replaced my old query.
Of course, it could also be solved by setting up a new DAG, but creating an Alerting Policy seems easier.
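If you prefer to script the policy instead of clicking through the UI, here is a rough sketch using the google-cloud-monitoring client. The composer.googleapis.com/workflow/run_count metric, the resource labels, and the project id are assumptions based on a Cloud Composer setup, and notification channels are omitted:

from google.cloud import monitoring_v3

project_id = "my-gcp-project"  # hypothetical project
client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="DAG my_dag did not run in the last 24h",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="workflow run_count absent for 1 day",
            condition_absent=monitoring_v3.AlertPolicy.Condition.MetricAbsence(
                filter=(
                    'metric.type="composer.googleapis.com/workflow/run_count" '
                    'AND resource.type="cloud_composer_workflow" '
                    'AND resource.label."workflow_name"="my_dag"'
                ),
                duration={"seconds": 86400},  # alert after 24h without data points
            ),
        )
    ],
)

created = client.create_alert_policy(name=f"projects/{project_id}", alert_policy=policy)
print(created.name)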
I've recently joined a team as a DAG developer. I can see that we currently use Python's requests directly instead of HttpHooks in our code. We create a requests.Session object to share it across different requests, but since min_file_process_interval is set to 30 seconds by default, this session is recreated every 30 seconds, which doesn't make much sense.
Will using HttpHook help in this case? Are hooks somehow left out of this DAG refreshing process? They also create a requests.Session object underneath.
Also, the APIs we are calling require an access token that expires after some time. Currently we fetch a new access token each time we make an API call, but it would be best to fetch a token only if the previous one has expired. But again, DAGs are refreshed every 30 seconds, so how do I prevent the token from being cleared when the DAGs are refreshed?
Both the token retrieval and the requests.Session object creation are done in a utils.py module used as a plugin in Airflow DAGs.
You can use HttpHook or you can use requests directly. Both are fine and it's up to you. In general, using HttpHook should make your life easier (you can also subclass it and enhance it; see the sketch after the examples below).
In any case you should run the code inside a PythonOperator and not as top-level code, so min_file_process_interval is not relevant.
To explain with an example, it's OK to do:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.hooks.http import HttpHook

def func():
    HttpHook(...).run(...)  # or requests.get(...); runs only when the task executes

with DAG('my_dag', default_args=default_args, catchup=False, schedule=None):
    PythonOperator(
        task_id='places',
        python_callable=func,
    )
In this example the HttpHook (or requests.get) will be invoked only when the operator is running.
Never do:
with DAG('my_dag', default_args=default_args, catchup=False, schedule=None):
    HttpHook(...).run(...)  # or requests.get(...)
In this example HttpHook (or requests.get) is called every time the DAG file is parsed (every min_file_process_interval), which means the endpoint is hit every 30 seconds. Big no for that.
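As for the expiring access token, here is a hedged sketch of the kind of HttpHook subclass mentioned above, which fetches the token lazily and refreshes it only when it is about to expire. The class name, the /oauth/token endpoint and the access_token/expires_in fields are assumptions for illustration:

import time

from airflow.providers.http.hooks.http import HttpHook

class TokenCachingHttpHook(HttpHook):
    # Class-level cache; it lives as long as the worker process does
    _token = None
    _token_expires_at = 0.0

    def _get_token(self):
        cls = TokenCachingHttpHook
        if cls._token is None or time.time() >= cls._token_expires_at:
            # Hypothetical token endpoint; adapt to your auth provider
            response = super().run(endpoint="/oauth/token", data={"grant_type": "client_credentials"})
            payload = response.json()
            cls._token = payload["access_token"]
            cls._token_expires_at = time.time() + payload["expires_in"] - 60  # refresh a bit early
        return cls._token

    def run(self, endpoint=None, data=None, headers=None, extra_options=None, **request_kwargs):
        headers = dict(headers or {})
        headers["Authorization"] = f"Bearer {self._get_token()}"
        return super().run(endpoint, data=data, headers=headers, extra_options=extra_options, **request_kwargs)

Note that the cache only lives inside a single worker process, so it mainly helps when one task makes many API calls; since the hook is instantiated only inside the callable, DAG file parsing (min_file_process_interval) never touches it.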
I was looking through the different API endpoints that Airflow offers, but I could not find one that suits my needs. Essentially I want to monitor the state of each task within the DAG without having to specify each task I am trying to monitor. Ideally, I would be able to ping the DAG and the response would tell me the state of each task and which tasks are running/retrying, etc.
You can use the stable REST API that ships with Airflow 2 - https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html. In particular, the "List task instances" endpoint returns the state of every task in a DAG run without you having to name each task.
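A minimal sketch of querying that endpoint; the webserver URL and credentials are placeholders, and using basic auth assumes the basic_auth auth backend is enabled on the webserver:

import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"  # hypothetical webserver address
AUTH = ("admin", "admin")                      # hypothetical credentials

dag_id = "my_dag"
# "~" acts as a wildcard for the dag_run_id, i.e. task instances of all runs
resp = requests.get(f"{AIRFLOW_URL}/dags/{dag_id}/dagRuns/~/taskInstances", auth=AUTH)
resp.raise_for_status()
for ti in resp.json()["task_instances"]:
    print(ti["task_id"], ti["state"])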
In Apache Airflow (2.x), each Operator Instance has a state as defined here (airflow source repo).
I have two use cases that don't seem to clearly fall into the pre-defined states:
Warn, but don't fail - This seems like it should be a very standard use case and I am surprised to not see it in the out-of-the-box airflow source code. Basically, I'd like to color-code a node with something eye-catching - say orange - corresponding to a non-fatal warning, but continue execution as normal otherwise. Obviously you can print warnings to the log, but finding them takes more work than just looking at the colorful circles on the DAGs page.
"Sensor N/A" or "Data not ready" - This would be a status that gets assigned when a sensor notices that data in the source system is not yet ready, and that downstream operators can be skipped until the next execution of the DAG, but that nothing in the data pipeline is broken. Basically an expected end-of-branch.
Is there a good way of achieving either of these use cases with the out-of-the-box Airflow node states? If not, is there a way to define custom operator states? Since I am running Airflow on a managed service (MWAA), I don't think changing the source code of our deployment is an option.
Thanks,
The task states are tightly integrated with Airflow. There's no way to configure which logging levels lead to which state. I'd say the easiest way is to grep log files for "WARNING", or to set up a log aggregation service such as Elasticsearch to make log files searchable.
For #2, a sensor has no notion of why it timed out. Once timeout or execution_timeout is reached, it simply raises an exception. You can deal with exceptions using trigger_rules on downstream tasks, but these still don't take the reason for the exception into account.
If you want more control over this, I would implement a custom sensor that takes an argument, e.g. data_not_ready_timeout (smaller than timeout and execution_timeout). In the poke() method, check whether data_not_ready_timeout has been reached and raise an AirflowSkipException if so; this will skip the downstream tasks. Once timeout or execution_timeout is reached, the task fails as usual. Look at BaseSensorOperator.execute() for inspiration on how to get the initial start date of a sensor. A sketch follows below.
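A rough sketch of such a sensor; data_not_ready_timeout and the _data_is_ready() check are hypothetical names used for illustration, and the start-time bookkeeping assumes the default poke mode (in reschedule mode the sensor is re-created between pokes, so you would derive the start time from the task instance, as BaseSensorOperator.execute() does):

import time

from airflow.exceptions import AirflowSkipException
from airflow.sensors.base import BaseSensorOperator

class DataReadySensor(BaseSensorOperator):
    def __init__(self, data_not_ready_timeout: float, **kwargs):
        super().__init__(**kwargs)
        self.data_not_ready_timeout = data_not_ready_timeout  # seconds, smaller than timeout
        self._first_poke = None

    def poke(self, context):
        if self._first_poke is None:
            self._first_poke = time.monotonic()
        if self._data_is_ready(context):
            return True
        if time.monotonic() - self._first_poke > self.data_not_ready_timeout:
            # Marks this task as skipped, which also skips downstream tasks
            raise AirflowSkipException("Source data not ready; skipping downstream tasks")
        return False

    def _data_is_ready(self, context) -> bool:
        # Placeholder for the real readiness check against the source system
        return False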
I am trying to use airflow trigger_dag dag_id to trigger my DAG, but it just shows the running state and does nothing more.
I have searched many questions, but people just say the DAG is paused. The problem is that my DAG is unpaused, yet it stays in the running state.
Note: I can use one DAG to trigger another one in the Web UI, but it doesn't work from the command line.
Please see the snapshot below.
I have had the same issue many times. The state of the task is not running and not queued either; it is stuck after we 'clear'. Sometimes the task goes into the shutdown state before getting stuck, and after a long time the instance fails while the task status stays white. I have solved it in several ways; I can't say the exact reason or solution, but try one of these:
Trigger the DAG again with the same execution date and time instead of using the clear option.
Try backfill; it will run only the unsuccessful instances.
Or try a different time within the same interval; it will create a fresh instance that does not have the issue.
Is it possible on Firebase or Parse to set up something like a cron job?
Is there a way to set up some sort of timed operation that runs over the stored user data?
For example, I'm writing a program that allows people to RSVP for lunch every day. If you have RSVPed by noon, then you get paired up with somebody else who has also RSVPed. Using JavaScript, the user can submit their RSVP in the browser.
The question is, can Firebase/Parse execute the code to match everyone at 12:00pm every day?
Yes, this can be done with Parse. You'll need to write your matching function as a background job in Cloud Code, and then schedule the job in the dashboard. In terms of scheduling flexibility, it's not as flexible as cron, but you can definitely run a job at the same time every day, or every x minutes/hours.
Jobs can take at most 15 minutes to execute before they're killed, so depending on the size of your database or the complexity of your task, you may need to break it up into multiple jobs or make it resumable.
Just to confirm about Firebase:
As @rickerbh said, it can be done with Parse, but currently there is no way to run your own code on Firebase's servers. There are two options for you to solve this:
You could use Firebase Queue and run your code in Node.js
You could use a different service such as Microsoft Azure (I haven't tried this yet; I'm not sure whether it provides job scheduling for Android)
However, Firebase is working on something called Firebase Trigger, which should solve this problem; it has not been released yet and there is no confirmed release date.