I'm trying to send an email to certain people defined in a variable in the Airflow UI. I see a similar question here but mine is slightly different. I suspect I can't do this when I'm using Jinja templating?
Apache airflow sends sla miss emails only to first person on the list
My variable:
{ "config1": "abc", "config2": 500000, "email_on_failure_list": ["jorge@gmail.com", "maria@gmail.com"] }
My DAG: it's bad practice to use Variable.get() in top-level code when referencing a variable created in the Airflow UI, so I'm using Jinja templating, but it doesn't seem to be working.
email = "{{ var.json.my_dag.email_on_failure_list}}"
default_args = {
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 1),
    'email_on_failure': True,
    'email': email
}
This is the error I receive:
ERROR - Failed to send email to: {{ var.json.my_dag.email_on_failure_list}}
[2021-02-09 09:13:16,339] {{taskinstance.py:1201}} ERROR - {'{{ var.json.my_dag.email_on_failure_list}}': (501, b'5.1.3 Invalid address')}
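Short answer: it's not possible with Jinja templates. The email argument is not a templated field, so the string is never rendered and is passed to the mailer as-is, which is what the error above shows.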
The alternative that was suggested on the Airflow Slack channel (thanks Kaxil!) is to create an environment variable. I can then reference it using Variable.get without any of the performance issues.
Environment variables aren't ideal for something non-sensitive, since they require more access to modify, but they're a decent way to have different configuration settings across Airflow environments (dev, qa, prod).
Follow the instructions here:
https://airflow.apache.org/docs/apache-airflow/2.0.1/howto/variable.html#storing-variables-in-environment-variables
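Steps:
1. Create an environment variable on the Airflow server (e.g. AIRFLOW_VAR_MY_EMAIL_LIST). It's important not to create the variable in the Airflow UI.
2. Use it in your DAG. Note that Variable.get takes the key without the AIRFLOW_VAR_ prefix:
from datetime import datetime
from airflow.models import Variable
# Airflow looks up the environment variable AIRFLOW_VAR_<KEY in upper case>,
# so AIRFLOW_VAR_MY_EMAIL_LIST is read with the key "my_email_list".
my_email_list = Variable.get("my_email_list")
default_args = {
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 1),
    'email_on_failure': True,
    'email': my_email_list
}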
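If the environment variable holds several addresses, it may help to normalise the raw string into a list before handing it to default_args. The following is only a sketch using the standard library: parse_email_list is a hypothetical helper (not an Airflow function), and the example value and its format (a JSON array or a comma-separated string) are assumptions.

```python
import json
import os


def parse_email_list(raw):
    """Turn a JSON array or a comma-separated string into a list of addresses."""
    raw = raw.strip()
    if raw.startswith("["):
        # Value looks like a JSON array, e.g. '["a@example.com", "b@example.com"]'
        return [str(addr).strip() for addr in json.loads(raw)]
    # Otherwise treat it as comma-separated and drop empty pieces
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


# Hypothetical value for illustration; in practice it is set on the server,
# not inside the DAG file.
os.environ.setdefault(
    "AIRFLOW_VAR_MY_EMAIL_LIST", "jorge@example.com, maria@example.com"
)

my_email_list = parse_email_list(os.environ["AIRFLOW_VAR_MY_EMAIL_LIST"])
print(my_email_list)
```

Reading os.environ directly like this involves no metadata database call, which is the same property that makes the environment-variable approach cheap in top-level DAG code.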
Related
Issue Summary
A few of my Airflow tasks are still queued even after the next run has executed fine.
args
import datetime
default_args = {
    'owner': 'my-owner',
    'depends_on_past': False,
    'email': ['email@org.com'],  # trailing comma was missing here
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 2,
    'retry_exponential_backoff': True,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': datetime.datetime(2020, 4, 1)
}
Before I enabled retry_exponential_backoff, the DAG failed with "Exceeded rate limits: too many table update operations for this table".
After setting retry_exponential_backoff to True, there are no more rate-limit failures, but for one run I can see that a few tasks have still not been triggered, even after the next run executed.
What could be the issue? Appreciate your support.
Usually, you can click on the queued (grey) task and select the first option, "Instance Details". The Task Instance Details section at the top then describes the state of the task.
As for your use case, we still need more details: Airflow version, your schedule_interval, more logs, which executor you are using, num_runs, concurrency, etc.
Please check the official Airflow Documentation here
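Since the question's default_args enable retry_exponential_backoff, it can also help to see how quickly the wait between attempts grows. The sketch below uses a plain doubling model with an optional cap. It is an assumption-laden simplification, not Airflow's own calculation (which differs in its details), and backoff_delays is a hypothetical helper.

```python
from datetime import timedelta


def backoff_delays(retry_delay, retries, max_delay=None):
    """Return the approximate wait before each retry under plain doubling."""
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)  # 1x, 2x, 4x, ... the base delay
        if max_delay is not None and delay > max_delay:
            delay = max_delay  # never wait longer than the cap
        delays.append(delay)
    return delays


# Values taken from the question: retry_delay of 5 minutes, 2 retries.
for number, delay in enumerate(backoff_delays(timedelta(minutes=5), 2), start=1):
    print("retry {}: wait about {}".format(number, delay))
```

With more retries the waits grow quickly, so a task can sit in a waiting state for much longer than the configured retry_delay suggests.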
Problem:
several tasks/jobs that need to be executed for a client
a lot of clients (hundreds)
tasks/jobs are nearly identical, only config changes
Are there any best practices in Airflow to keep things simple? I'm thinking about (in no specific order):
specific client configs as possible (override defaults when needed)
ui overview: job per client would make it very difficult to get an overview
modularity: re-use of code as much as possible
performance: clients should not hinder one another (too much)
scaling: it should be easy to increase performance (preferably horizontally)
fault-tolerance: when things fail for one client they don't hinder others + clear ui indication
re-execution: easy to manually re-execute for 1 client
setup: setting up a new client should be easy (and using code/config, no ui)
governance: easy to evaluate code changes and enforce rules
clean-up: easy to remove a client
etc.
I cannot find a lot of material on this particular use case.
Ideally we have one "template" that is re-used per client. It is unclear whether one job or multiple jobs are the best solution. Or maybe there is another way that better suits this usage?
Airflow has extensive support for the Google Cloud Platform. Note, however, that most Hooks and Operators are in the contrib section, which means they have beta status and can have breaking changes between minor releases.
Number of client aspects:
There can be as many DAGs as needed, and each one of them can contain multiple tasks. It is recommended to keep one logical workflow in one DAG file and to try to keep it very light (e.g. a configuration file). This lets the Airflow scheduler spend less time and fewer resources processing the files at each heartbeat.
It is possible to create DAGs (with the same base code) dynamically based on any number of configuration parameters, which is a really helpful and time-saving option when you have a lot of clients.
To create new DAGs, define a DAG template within a create_dag function. The code can be wrapped in a method that allows custom parameters to be passed in. Moreover, the input parameters don't have to exist in the DAG file itself. Another common way of generating DAGs is by setting values in a Variable object. Please refer here for further information.
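As a rough illustration of the configuration side of this, the sketch below merges per-client overrides into shared defaults and produces one config per DAG. The names DEFAULTS, CLIENTS, build_client_config and build_all_configs are hypothetical and not part of Airflow. In a real DAG file, each resulting config would be handed to your own create_dag function and the returned DAG assigned to globals()[dag_id] so that the scheduler can discover it.

```python
from copy import deepcopy

# Hypothetical shared defaults and per-client overrides
# (these could just as well be loaded from a YAML or JSON file).
DEFAULTS = {"schedule_interval": "@daily", "retries": 1, "pool": "default_pool"}
CLIENTS = {
    "acme": {"retries": 3},
    "globex": {"schedule_interval": "@hourly"},
    "initech": {},  # no overrides: uses the defaults as they are
}


def build_client_config(client_id, overrides):
    """Start from the defaults, apply the client's overrides, add a dag_id."""
    config = deepcopy(DEFAULTS)
    config.update(overrides)
    config["dag_id"] = "client_{}".format(client_id)
    return config


def build_all_configs(clients):
    """Return a mapping of dag_id to the merged config, one entry per client."""
    configs = {}
    for client_id in sorted(clients):
        config = build_client_config(client_id, clients[client_id])
        configs[config["dag_id"]] = config
    return configs


if __name__ == "__main__":
    for dag_id, config in build_all_configs(CLIENTS).items():
        print(dag_id, config)
```

Adding or removing a client then becomes a one-line change to the configuration, which fits the "setup" and "clean-up" wishes in the question.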
Specific client configs:
Macros are used to pass dynamic information into task instances at runtime. A list of default variables accessible in all templates can be found here.
Airflow’s built-in support for Jinja templating enables users to pass arguments that can be used in templated fields.
UI overview
If your DAG takes a long time to load, you could reduce the value of the default_dag_run_display_number setting in airflow.cfg. This setting controls the number of DAG runs shown in the UI; the default value is 25.
Modularity
If a dictionary of default_args is passed to a DAG, it will apply them to any of its operators. This makes it easy to apply a common parameter to many operators without having to type it many times.
For example:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG('my_dag', default_args=default_args)
# The operator inherits 'owner' (and the rest) from default_args
op = DummyOperator(task_id='dummy', dag=dag)
print(op.owner)  # Airflow
For more information about the BaseOperator’s parameters and what they do, refer to the airflow.models.BaseOperator documentation.
Performance
Several settings, some of which can be set in airflow.cfg, let you control Airflow DAG performance:
parallelism: controls the number of task instances that run simultaneously across the whole Airflow cluster.
concurrency: The Airflow scheduler will run no more than concurrency task instances for your DAG at any given time. Concurrency is defined in your Airflow DAG. If you do not set the concurrency on your DAG, the scheduler will use the default value from the dag_concurrency entry in your airflow.cfg.
task_concurrency: This variable controls the number of concurrent running task instances across dag_runs per task.
max_active_runs: the Airflow scheduler will run no more than max_active_runs DagRuns of your DAG at a given time.
pool: This variable controls the number of concurrent running task instances assigned to the pool.
You can see the Airflow config in the Composer instance bucket at gs://composer_instance_bucket/airflow.cfg. You can tune this configuration as you wish, but keep in mind that Cloud Composer has some configurations blocked.
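To check which limits are currently in effect, one option is to read them out of a copy of airflow.cfg. The sketch below parses an inline sample with the standard library. The numbers in SAMPLE_CFG are illustrative only, read_limits is a hypothetical helper, and the assumption is that these keys live in the [core] section, as they do in the Airflow 1.10 line.

```python
import configparser

# Illustrative fragment; in practice, read the text of your own airflow.cfg.
SAMPLE_CFG = """
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
"""


def read_limits(cfg_text):
    """Extract the main concurrency limits from airflow.cfg-style text."""
    # interpolation=None so '%' characters elsewhere in a real file are left alone
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    core = parser["core"]
    return {
        "parallelism": core.getint("parallelism"),
        "dag_concurrency": core.getint("dag_concurrency"),
        "max_active_runs_per_dag": core.getint("max_active_runs_per_dag"),
    }


print(read_limits(SAMPLE_CFG))
```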
Scaling
Please keep in mind that the recommended number of nodes is at least 3; going below 3 could cause issues. If you want to increase the number of nodes, you can use the gcloud command to specify this value. Also note that some Airflow configurations related to autoscaling are blocked and can't be overridden.
Some Airflow configurations are preconfigured for Cloud Composer, and you cannot change them.
Fault-tolerance
Please refer to the following documentation.
Re-execution
Just as an object is an instance of a class, an Airflow task is an instance of an Operator (BaseOperator). So write a "re-usable" operator and use it hundreds of times across your pipelines simply by passing different params.
Latency
It is possible to reduce airflow DAG scheduling latency in production by using:
max_threads: Scheduler will spawn multiple threads in parallel to schedule dags. This is controlled by max_threads with default value of 2.
scheduler_heartbeat_sec: Consider increasing scheduler_heartbeat_sec to a higher value (e.g. 60 secs). It controls how frequently the Airflow scheduler gets the heartbeat and updates the job's entry in the database.
Please refer to the following articles about best practices:
Airflow documentation
Medium
I hope this helps in some way.
I ran the backfill command and all preceding tasks completed except for the last 4. Their status is scheduled and nothing happens at all. I ran airflow test before and it succeeded, but now I don't know what to do.
I specified a non-existent pool, but I think this won't matter because the preceding tasks ran smoothly.
Uhm no;
AFAIK, a non-existent pool results in tasks getting stuck in a limbo state.
But I've found that even while DAG is running, adding / removing pools or modifying the count of slots in pool works; so try creating the pool with adequate number of slots
I'm glad to join the conversation. I had already created the pool with an adequate number of slots, but the problem was still there.
Then I also set the priority_weight of the DAG. It works! The tasks started running. The change looks like this:
default_args = {
    'owner': 'Airflow',
    'retries': 2,
    'queue': 'queue_name',
    'pool': 'pool_name',
    'priority_weight': 10,  # the larger the number, the higher the weight
    'retry_delay': timedelta(seconds=10)
}
I've been setting up an airflow cluster on our system and previously it has been working. I'm not sure what I may have done to change this.
I have a DAG which I want to run on a schedule. To make sure it's working I'd also like to trigger it manually. Neither of these seem to be working at the moment and no logs seem to be being written for the task instances. The only logs available are the airflow scheduler logs which generally look healthy.
I am just constantly met with this message:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-12-12T11:34:46.978355+00:00 and task will be retried at 2018-12-12T11:35:08.093313+00:00.
However, if I wait a little the exact same message is presented again except the times have moved forward a little. Therefore, it seems the task is never actually being retried.
I am using a LocalExecutor and the task is an SSHOperator. Simplified code is below. All it does is ssh onto another machine and start a bunch of applications with a pre-determined directory structure:
import json
import re
from datetime import datetime, timedelta
from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.utils import timezone
# ConfigDbClient is the asker's own helper; its import is not shown in the original.

DB_INFO_FILE = 'info.json'
START_SCRIPT = '/bin/start.sh'
TIME_IN_PAST = timezone.convert_to_utc(datetime.today() -
                                       timedelta(days=1))
DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': TIME_IN_PAST,
    'email': ['some_email@blah.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

def _extract_instance_id(instance_string):
    return re.findall(r'\d+', instance_string)[0]

def _read_file_as_json(file_name):
    with open(file_name) as open_file:
        return json.load(open_file)

DB_INFO = _read_file_as_json(DB_INFO_FILE)
CONFIG_CLIENT = ConfigDbClient(**DB_INFO)
APP_DIRS = CONFIG_CLIENT.get_values('%my-app-info%')
INSTANCE_START_SCRIPT_PATHS = {
    _extract_instance_id(instance_string): directory+START_SCRIPT
    for instance_string, directory in APP_DIRS.items()
}
# Create an ssh hook which refers to pre-existing connection information
# setup and stored by airflow
SSH_HOOK = SSHHook(ssh_conn_id='my-conn-id')
# Create a DAG object to add tasks to
DAG = DAG('my-dag-id',
          default_args=DEFAULT_ARGS)
# Create a task for each app instance.
for instance_id, start_script in INSTANCE_START_SCRIPT_PATHS.items():
    task = SSHOperator(
        task_id='run-script-{0}'.format(instance_id),
        command='bash {0}'.format(start_script),
        ssh_hook=SSH_HOOK,
        dag=DAG)
It works when I run the tasks individually via the command line, but not via the UI. It seems I can run tasks, but I simply cannot trigger a DAG to run. I've tried many combinations of start_dates and schedule intervals just as a sanity check, too.
Any ideas?
And yes, I am aware this question has been asked before. I have looked at all of those questions, but none of the solutions have helped me.
Oh. Your start_date is changing at the same rate or faster than the schedule interval period.
Here's what the scheduler is seeing every couple of seconds:
start_date: 2018-12-11T12:12:12.000Z # E.G. IFF now is 2018-12-12T12:12:12.000Z, a day ago
schedule_interval: timedelta(days=1) # The default
Here's what the scheduler needs for a DAG to run: the last run must have occurred more than one schedule interval ago. If no scheduled run has occurred yet, the first scheduled run can start once one full schedule interval has passed since the start_date, as that is the earliest allowable execution_date. In that case a dag_run is created with its execution_date set to the start of that interval period. Then task_instances can be created for any tasks in the DAG whose dependencies are met, as long as the task_instance execution_date is after the start_date of the DAG (this is not stored on the dag_run object but recomputed by loading the DAG file whenever the DAG's state is inspected).
So it's not getting scheduled automatically because the start date keeps moving just as the interval is satisfied. However, if it were -2d, at least one run would get scheduled, and any further runs would then have to wait until 1d after that to be scheduled. It's easier, though, to just set a fixed datetime as your start_date.
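To make the moving-target effect concrete, here is a deliberately simplified model in plain Python. It is not the scheduler's code: rolling_start_date and can_run are hypothetical helpers, and the only rule modelled is that a task instance may run when its execution_date is not before the start_date recomputed at the latest parse of the DAG file.

```python
from datetime import datetime, timedelta


def rolling_start_date(now):
    """start_date as computed in the question: always one day before 'now'."""
    return now - timedelta(days=1)


def can_run(execution_date, start_date):
    """Simplified rule: the execution_date must not be before the start_date."""
    return execution_date >= start_date


# First parse of the DAG file: a run is created for the start of the interval.
first_parse = datetime(2018, 12, 12, 12, 12, 12)
execution_date = rolling_start_date(first_parse)

# The file is parsed again 30 seconds later and start_date has moved forward.
later_parse = first_parse + timedelta(seconds=30)
rolling_ok = can_run(execution_date, rolling_start_date(later_parse))

# With a fixed start_date the same run stays valid on every parse.
fixed_ok = can_run(execution_date, datetime(2018, 12, 1))

print("rolling start_date allows the run:", rolling_ok)
print("fixed start_date allows the run:", fixed_ok)
```

In this model the rolling start_date always ends up slightly later than the execution_date of the run that was just created, which is why a fixed datetime is the simpler choice.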
But what about those odd retries on your manual runs…
You did start a manual run or two. These runs took the current time as the execution_date unless you specified something else. This should be after the start_date, at least until tomorrow, which should clear them to run. But then it seems from your logs that they're failing and being marked for retry, and also not decrementing your retries. I'm not sure why that would be, but could it be that something isn't right with the SSHOperator?
Did you install airflow with the [ssh] extra so that SSHOperator's dependencies are met (specifically paramiko and sshtunnel) on both the webserver and scheduler? One of them is working because I assume it's parsing and showing up in the UI based on being added to the DB.
What do you get if you execute:
airflow test my-dag-id run-script-an_instance_id 2018-12-12T12:12:12
You know that the scheduler and webserver are looping over refilling their DAG bag and so rerunning this DAG file a few 1000 times a day, reloading that json (it's local access, so similar to importing a module), and recreating that SSHHook with a DB lookup. I don't see anything fancy done setting up this hook, why not just remove ssh_hook from the SSHOperator and replace it with ssh_conn_id='my-conn-id' so it can be created once at execution time?
I doubt that's the issue that's causing the retries that just roll forward though.
I had a task stuck in up_for_retry for almost 24 hours before I noticed it, and it had nothing to do with the start_date, end_date, or any other classic beginner's problem.
I resorted to reading the source code, and found that Airflow treats up_for_retry tasks differently if they are part of a backfill DAG run.
So I connected to the metadata DB and changed backfill_ to scheduled__ in the dag_run row corresponding to the stuck task. Airflow started running the stuck task immediately.
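For illustration only, the kind of change described above can be sketched against an in-memory SQLite stand-in. The table here has just three columns and is not the real metadata schema, the run_id format is an assumption based on the prefixes mentioned above, and relabel_backfill_run is a hypothetical helper. Editing the real metadata database is risky, so treat this as a picture of the idea rather than a procedure.

```python
import sqlite3

# Stand-in for the metadata DB: only the columns needed for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, run_id TEXT, state TEXT)")
conn.execute(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    ("my-dag-id", "backfill_2018-12-12T00:00:00+00:00", "running"),
)


def relabel_backfill_run(connection, dag_id, old_run_id):
    """Swap the 'backfill_' prefix of one run_id for 'scheduled__'."""
    prefix = "backfill_"
    if not old_run_id.startswith(prefix):
        raise ValueError("not a backfill run_id: {}".format(old_run_id))
    new_run_id = "scheduled__" + old_run_id[len(prefix):]
    connection.execute(
        "UPDATE dag_run SET run_id = ? WHERE dag_id = ? AND run_id = ?",
        (new_run_id, dag_id, old_run_id),
    )
    connection.commit()
    return new_run_id


new_id = relabel_backfill_run(conn, "my-dag-id", "backfill_2018-12-12T00:00:00+00:00")
print(new_id)
```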
I have been trying to look at how to use the User role. It says here, that it is for users with DAG ownership. So I created a couple of users with usernames ABC and XYZ and assigned them with User role.
Here's my DAG:
DEFAULT_ARGS = {
'owner': 'ABC',
...,
...
}
dag = DAG(
'test_dag',
default_args=DEFAULT_ARGS,
...,
...
)
When I logged in as XYZ, I expected the DAG test_dag to be hidden, or at least to be in an inactive state, since test_dag belongs to ABC. But as XYZ, I'm able to operate test_dag.
Am I missing anything out here?
Make sure you are using the new RBAC UI. Verify that you have the following in your airflow.cfg file:
[webserver]
rbac = True
authenticate = True
filter_by_owner = True
Are you using password authentication? If so, this is probably a bug, that is still not fixed: JIRA. It was also discussed here: How to allow airflow dags for concrete user(s) only
You can try to use LDAP or OAuth as your authentication method. This might resolve your problem.