Before executing the DAG, I want to check whether a particular connection id is present in the connection list or not. I don't have any mechanism for retaining a connection: even if I create a connection through the GUI, all the connections get removed when the server reboots.
The following is the task I thought I should add, but then I got an ASCII error when I ran it, maybe because the command returns a table that might not be adequately parsed by the logger.
import logging

from airflow import settings
from airflow.models import Connection
from airflow.operators.bash_operator import BashOperator


def create_connection(**kwargs):
    print(kwargs.get('ds'))
    list_conn = BashOperator(
        task_id='list_connections',
        bash_command='airflow connections --l',
        xcom_push=True)
    conns = list_conn.execute(context=kwargs)
    logging.info(conns)
    if not conns:
        new_conn = Connection(conn_id='xyz', conn_type='s3',
                              host='https://api.example.com')
        session = settings.Session()
        session.add(new_conn)
        session.commit()
        logging.info('Connection is created')
Question: Is there any way, within the Airflow DAG itself, to find out whether the connection has already been added? If it is already there, I would not create a new one.
session.query(Connection) should do the trick.
from airflow import settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator


def list_connections(**context):
    session = settings.Session()
    return session.query(Connection)


list_conn = PythonOperator(
    task_id='list_connections',
    python_callable=list_connections,
    provide_context=True,
)
Please make sure all the code is contained within tasks. Or, to phrase it correctly, it should execute at run time instead of at load time. Adding the code directly to the DAG file causes it to run at load time, which is not recommended.
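Putting the two together, a minimal sketch of a task that creates the connection only when it is missing could look like this (conn_id, conn_type and host are taken from the question; the task name and return messages are illustrative):

from airflow import settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator


def create_connection_if_missing(**context):
    session = settings.Session()
    # Look for an existing connection with the same conn_id.
    exists = session.query(Connection).filter(Connection.conn_id == 'xyz').first()
    if exists:
        return 'connection xyz already exists'
    session.add(Connection(conn_id='xyz', conn_type='s3',
                           host='https://api.example.com'))
    session.commit()
    return 'connection xyz created'


create_conn = PythonOperator(
    task_id='create_connection_if_missing',
    python_callable=create_connection_if_missing,
    provide_context=True,
)

Because both the query and the insert happen inside the callable, nothing touches the metadata database at load time.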
The accepted answers work perfectly. I had a scenario where I needed to get a connection by connection id to create the DAG, so I had to fetch it outside the task, in the DAG creation itself.
The following code worked for me:
from airflow.hooks.base_hook import BaseHook

conn = BaseHook.get_connection(connection)  # 'connection' holds the conn_id string
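For example (the conn_id 'my_api' is just illustrative), the returned Connection object exposes the stored details:

conn = BaseHook.get_connection('my_api')  # hypothetical conn_id
base_url = conn.host
credentials = (conn.login, conn.password)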
Hope this might help someone! :)
I've recently joined a team as a DAG developer. I can see that we currently use Python's requests directly instead of using HttpHooks in our code. We create a requests.Session object to share across different requests, but since min_file_process_interval is set to 30 seconds by default, this session is recreated every 30 seconds, which doesn't make much sense.
Will using HttpHook help in this case? Are hooks somehow left out of this DAG refreshing process? They also create a requests.Session object underneath.
Also, the APIs we are calling require an access token which expires after some time. Currently we fetch a new access token each time we make an API call, but it would be best to fetch the token only if the previous one has expired. But again, DAGs are refreshed every 30 seconds, so how can I prevent the token from being cleared when the DAGs are refreshed?
Both the token retrieval and the requests.Session object creation are done in a utils.py module used as a plugin in Airflow DAGs.
You can use HttpHook, and you can also use requests directly. Both are fine and it's up to you. In general, using HttpHook should make your life easier (you can also subclass it and enhance it).
In any case you should put the code inside a PythonOperator and not at the top level, so min_file_process_interval is not relevant.
To explain with an example:
It's OK to do:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.hooks.http import HttpHook


def func():
    HttpHook(...).run(...)  # or requests.get(...)


with DAG('my_dag', default_args=default_args, catchup=False, schedule=None):
    PythonOperator(
        task_id='places',
        python_callable=func,
    )
In this example the HttpHook (or requests.get) will be invoked only when the operator is running.
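For reference, a concrete body for func might look roughly like this (the http_conn_id 'my_api' and the /places endpoint are hypothetical; the import path is the one from the HTTP provider package):

from airflow.providers.http.hooks.http import HttpHook


def func():
    # 'my_api' would be an HTTP connection configured in the Airflow UI
    hook = HttpHook(method='GET', http_conn_id='my_api')
    response = hook.run(endpoint='/places')
    return response.json()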
Never do:
with DAG('my_dag', default_args=default_args, catchup=False, schedule=None):
    HttpHook(...).run(...)  # or requests.get(...)
In this example the HttpHook (or requests.get) is called every time the DAG file is parsed (every min_file_process_interval), which means the endpoint is hit every 30 seconds. That's a big no.
I have an Airflow DAG that executes 10 tasks (exporting different data from the same source) in parallel, every 15 minutes. I've also enabled email_on_failure to get notified of failures.
Once every month or so, the tasks start failing for a couple of hours because the data source is not available, causing Airflow to generate hundreds of emails (10 emails every 15 minutes) until the raw data source is available again.
Is there a better way to avoid being spammed with emails once consecutive runs fail to succeed?
For example, is it possible to only send an email on the first run that starts failing (i.e. when the previous run was successful)?
To customise the logic, you can use on_failure_callback and define a Python function to call on failure/success. In this function you can access the task instance.
A property on this task instance is try_number, which you can check before sending an alert. An example could be:
from airflow.operators.bash import BashOperator


def task_fail_email_alert(context):
    try_number = context["ti"].try_number
    if try_number == 1:
        # send alert
        pass
    else:
        # do nothing
        pass


some_operator = BashOperator(
    task_id="some_operator",
    bash_command="""
    echo "something"
    """,
    on_failure_callback=task_fail_email_alert,
    dag=dag,
)
You can then implement the code to send an email in this function, rather than use the built-in email_on_failure. The EmailOperator is available via from airflow.operators.email import EmailOperator.
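A rough sketch of such a callback (the recipient, subject and body are placeholders), instantiating an EmailOperator inline and executing it from the callback, could look like:

from airflow.operators.email import EmailOperator


def task_fail_email_alert(context):
    ti = context["ti"]
    # Only alert on the first try; stay silent on retries.
    if ti.try_number != 1:
        return
    EmailOperator(
        task_id="failure_email",            # never added to a DAG, executed inline
        to="alerts@example.com",            # placeholder recipient
        subject=f"Airflow task failed: {ti.task_id}",
        html_content=f"Run {context['run_id']} failed on task {ti.task_id}.",
    ).execute(context=context)

Calling airflow.utils.email.send_email directly inside the callback is an equally valid option.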
Given that your tasks run concurrently and one or multiple failures could occur, I would suggest treating the dispatch of failure messages as you would a shared resource.
You need to implement a lock that is "DagRun-aware", one that knows about the current DagRun.
You can back this lock with an in-memory database like Redis, an object store like S3, the file system, or a database. How you choose to implement it is up to you.
In your on_failure_callback implementation, you must acquire that lock. If acquisition succeeds, carry on and dispatch the email; otherwise, pass.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class OnlyOnceLock:
    def __init__(self, run_id):
        self.run_id = run_id

    def acquire(self):
        # Returns False if run_id already exists in the backing store.
        # S3 example
        hook = S3Hook()
        key = self.run_id
        bucket_name = 'coordinated-email-alerts'
        try:
            hook.head_object(key, bucket_name)
            return False
        except Exception:
            # This is the first time the lock is acquired
            hook.load_string('fakie', key, bucket_name)
            return True

    def __enter__(self):
        return self.acquire()

    def __exit__(self, exc_type, exc_val, exc_tb):
        pass


def on_failure_callback(context):
    error = context['exception']
    task = context['task']
    run_id = context['run_id']
    ti = context['ti']

    with OnlyOnceLock(run_id) as lock:
        if lock:
            ti.email_alert(error, task)
I have a DAG in Airflow whose runs are not scheduled but triggered by an event. I would like to send an alert when the DAG has not run in the last 24 hours. My problem is that I am not really sure which tool is best for the task.
I tried to solve it with the Logs Explorer, I was able to write a quite good query filtering by the textPayload, but it seems that tool is designed to send the alert when a specific log is there, not when it is missing. (Maybe I missed something?)
I also checked Monitoring where I could set up an Alert when logs are missing, however in this case I was not able to write any query where I can filter logs by textPayload.
Thank you in advance if you can help me!
You could set up a separate alert DAG that notifies you if other DAGs haven't run in a specified amount of time. To get the last run of a DAG, use something like this:
from airflow.models import DagRun
dag_runs = DagRun.find(dag_id=dag_id)
dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
Then you can use dag_runs[0] and compare with the current server time. If the date difference is greater than 24h, raise an alert.
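Run from a PythonOperator in the alert DAG, such a check might look roughly like this (the helper name, the 24-hour threshold and the use of ValueError to fail the task are illustrative):

from datetime import timedelta

from airflow.models import DagRun
from airflow.utils import timezone


def check_dag_ran_recently(dag_id, max_age_hours=24):
    # Fail the alert task if the monitored DAG has no recent run.
    dag_runs = DagRun.find(dag_id=dag_id)
    if not dag_runs:
        raise ValueError(f"No runs found for DAG {dag_id}")
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    age = timezone.utcnow() - dag_runs[0].execution_date
    if age > timedelta(hours=max_age_hours):
        raise ValueError(f"DAG {dag_id} has not run in the last {max_age_hours} hours")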
I was able to do it in Monitoring. I did not need the filtering query I used in the Logs Explorer; I created an Alerting Policy filtered by workflow_name, task_name and location. In the configure-trigger section I was able to choose "Metric absence" with a one-day absence time, so this resolved my original question.
Of course, it could also be solved by setting up a new DAG, but setting up an Alerting Policy seems easier.
I just started using Airflow to coordinate our ETL pipeline.
I encountered a broken pipe error when I ran a DAG.
I've seen a general stackoverflow discussion here.
My case is more on the Airflow side. According to the discussion in that post, the possible root cause is:
The broken pipe error usually occurs if your request is blocked or takes too long and, after the request-side timeout, it closes the connection; then, when the response side (server) tries to write to the socket, it throws a broken pipe error.
This might be the real cause in my case. I have a PythonOperator that starts another job outside of Airflow, and that job can be very lengthy (i.e. 10+ hours). I wonder what mechanism Airflow provides that I could leverage to prevent this error.
Can anyone help?
UPDATE 1 (2019-03-03):
Thanks to @y2k-shubham for the SSHOperator suggestion. I was able to use it to set up an SSH connection successfully and to run some simple commands on the remote side (indeed, the default SSH connection has to be set to localhost because the job is on localhost), and I can see the correct results of hostname and pwd.
However, when I attempted to run the actual job, I received the same error; again, the error comes from the pipeline job itself rather than the Airflow DAG/task.
UPDATE 2 (2019-03-03):
I had a successful run (with airflow test) with no error, followed by another failed run (via the scheduler) with the same error from the pipeline.
While I'd suggest you keep looking for a more graceful way of achieving what you want, I'm putting up an example usage as requested.
First you've got to create an SSHHook. This can be done in two ways:
The conventional way, where you supply all requisite settings like host, user, password (if needed) etc. from the client code where you are instantiating the hook. I'm citing an example from test_ssh_hook.py here, but you should thoroughly go through SSHHook as well as its tests to understand all possible usages:
from airflow.contrib.hooks.ssh_hook import SSHHook

ssh_hook = SSHHook(remote_host="remote_host",
                   port="port",
                   username="username",
                   timeout=10,
                   key_file="fake.file")
The Airflow way, where you put all connection details inside a Connection object that can be managed from the UI, and only pass its conn_id to instantiate your hook:
ssh_hook = SSHHook(ssh_conn_id="my_ssh_conn_id")
Of course, if you're relying on SSHOperator, then you can directly pass the ssh_conn_id to the operator:
ssh_operator = SSHOperator(ssh_conn_id="my_ssh_conn_id")
Now if you're planning to have a dedicated task for running a command over SSH, you can use SSHOperator. Again I'm citing an example from test_ssh_operator.py, but go through the sources for a better picture:
from airflow.contrib.operators.ssh_operator import SSHOperator

task = SSHOperator(task_id="test",
                   command="echo -n airflow",
                   dag=self.dag,
                   timeout=10,
                   ssh_conn_id="ssh_default")
But then you might want to run a command over SSH as part of your bigger task. In that case, you don't want an SSHOperator; you can still use just the SSHHook. The get_conn() method of SSHHook gives you an instance of paramiko's SSHClient, with which you can run a command using exec_command():
ssh_client = ssh_hook.get_conn()
my_command = "echo airflow"
stdin, stdout, stderr = ssh_client.exec_command(
    command=my_command,
    get_pty=my_command.startswith("sudo"),
    timeout=10)
output = stdout.read().decode()  # read the command's output
If you look at SSHOperator's execute() method, it is a rather complicated (but robust) piece of code trying to achieve a very simple thing. For my own usage, I had created some snippets that you might want to look at
For using SSHHook independently of SSHOperator, have a look at ssh_utils.py
For an operator that runs multiple commands over SSH (you can achieve the same thing by using bash's && operator), see MultiCmdSSHOperator
This is my situation:
I am trying to access the instance of my custom operator running in another DAG. I am able to get the correct DagRun and TaskInstance objects by doing the following.
from airflow import settings
from airflow.models import DagBag

dag_bag = DagBag(settings.DAGS_FOLDER)
target_dag = dag_bag.get_dag('target_dag_id')
dr = target_dag.get_dagrun(target_dag.latest_execution_date)
ti = dr.get_task_instance('target_task_id')
I have printed the TaskInstance object acquired by the above lines and it is correct (it is running and has the correct task_id, etc.); however, I am unable to access the task object, which would allow me to interface with the running operator. I expected to be able to do the following:
running_custom_operator = ti.task  # AttributeError: 'TaskInstance' object has no attribute 'task'
Any help would be much appreciated, either following my approach or if someone knows how to access the task object of a running task instance.
Thank you
You can simply grab the task object from the DAG object: target_dag.task_dict["target_task_id"]
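Continuing the snippet from the question (same placeholder ids), that looks like:

target_dag = dag_bag.get_dag('target_dag_id')
running_custom_operator = target_dag.task_dict['target_task_id']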