How to use multiple cron expressions within a single @Scheduled annotation in Spring Boot
I have to run the same job every Monday and on the last day of every month, and I want to achieve this using the @Scheduled annotation with two cron expressions. Is it possible, and if so, how can I implement it?
Related
I would like to manually trigger an Airflow DAG and pass parameters in this call. How can I create DAG tasks based on the passed parameters?
Unfortunately, you can't dynamically create tasks in the sense you are asking about. That is, you cannot add or delete tasks for a single run based on the parameters you give it. Any change to the task set applies to the entire DAG, and therefore to all subsequent DagRuns as well.
You have two options that might still get you to a satisfactory solution though.
Mapped tasks, as explained here: https://airflow.apache.org/docs/apache-airflow/2.3.0/concepts/dynamic-task-mapping.html#simple-mapping
Generate a DAG based on some external resource. For example, we have a library on which we call my_config.get_dag(), which then creates the DAG based on a JSON file.
Let's say you have 2 scripts: Daily_summary.py and Weekly_summary.py.
You could create 2 separate DAGs with daily and weekly schedules, but is it possible to solve this with 1 DAG?
I've tried a daily schedule, and simply putting this at the bottom (simplified):
if datetime.today().strftime('%A') == 'Sunday':
    SSHOperator(run weekly_summary.py)
But the problem is that if the task is still running at midnight on Sunday, Airflow will terminate it, since the operator no longer exists on Monday.
If I could somehow get the execution day's day of the week, that would solve it, but with Jinja templating '{{ds}}' is not actually a 'yyyy-mm-dd' string while the DAG file is being parsed, so I cannot convert it to a date with the datetime package. It only becomes a date string AFTER the Airflow script gets executed.
You should dynamically generate two DAGs, but you can reuse the same code for that. This is the power of Airflow: it is Python code, so you can easily use the same code to generate the same DAG "structure" with two different DAG ids and two different schedules.
See this nice article from Astronomer with some examples: https://www.astronomer.io/guides/dynamically-generating-dags
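On the templating point raised in the question: `{{ ds }}` is only rendered into a `YYYY-MM-DD` string when the task runs, which is why it cannot be parsed while the DAG file is being loaded. If a single DAG is still preferred, the day-of-week check can be moved into a callable that receives the rendered string at run time. A minimal sketch, assuming the value is passed in through a templated argument of a Python-callable task; the function name `is_weekly_run` is hypothetical:

```python
from datetime import datetime


def is_weekly_run(ds: str) -> bool:
    """Return True when the rendered date string falls on a Sunday.

    `ds` is assumed to be the already-rendered value of the `{{ ds }}`
    template, i.e. a plain 'YYYY-MM-DD' string available at task run time.
    """
    logical_date = datetime.strptime(ds, "%Y-%m-%d")
    # weekday(): Monday is 0, Sunday is 6
    return logical_date.weekday() == 6


if __name__ == "__main__":
    print(is_weekly_run("2022-05-15"))
    print(is_weekly_run("2022-05-16"))
```

Because the check depends on the run's own date and not on `datetime.today()`, the task set stays identical on every parse, so nothing disappears from the DAG at midnight.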
We are using Airflow 2.1.4 and running on Kubernetes.
We have separate pods for the web server and the scheduler, and we are using the Kubernetes executor.
We are using a variety of operators, such as PythonOperator, KubernetesPodOperator, etc.
Our setup handles ~2K customers (businesses), and each one of them has its own DAG.
Our code looks something like:
def get_customers():
    logger.info("querying database to get all customers")
    return sql_connection.query("SELECT id, name, offset FROM customers_table")

customers = get_customers()

for id, name, offset in customers:
    dag = DAG(
        dag_id=f"{id}-{name}",
        schedule_interval=offset,
    )
    with dag:
        first = PythonOperator(..)
        second = KubernetesPodOperator(..)
        third = SimpleHttpOperator(..)
        first >> second >> third
    globals()[id] = dag
The snippet above is a simplified version of what we've got; we have a few dozen operators in each DAG (not just three).
The problem is that for each one of the operators in each one of the DAGs we see the "querying database to get all customers" log line, which means that we query the database far more often than we want to.
The database isn't updated frequently, and it would be enough to update the DAGs only once or twice a day.
I know that the DAGs are being saved in the metadata database or something..
Is there a way to build those DAGs only one time / via scheduler and not to do that per operator?
Should we change the design to support our multi-tenancy requirement? Is there a better option than that?
In our case, ~60 operators X ~2,000 customers = ~120,000 queries to the database.
Yes, this is entirely expected. Airflow parses DAG files regularly (every 30 seconds by default), so any top-level code (code that runs while the file is being parsed, as opposed to code inside the "execute" methods of operators) is executed on every parse. The file is also parsed again when each task instance starts, which is why you see the log line once per operator.
The simple answer (and best practice) is "do not use any heavy operations in the top-level code of your DAGs". Specifically, do not run DB queries there. If you want more specific answers and possible ways to handle it, there are dedicated chapters on this in the best-practices section of the Airflow documentation:
This explains why top-level code should be "light": https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
This one covers strategies you can use to avoid "heavy" operations in top-level code when you do dynamic DAG generation, as in your case: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#dynamic-dag-generation
In short there are three proposed ways:
1) using env variables
2) generating a configuration file (for example .json) from your DB automatically and periodically with an external script, putting it next to your DAG, and having the DAG read the JSON file instead of running an SQL query.
3) generating many DAG Python files dynamically (for example using Jinja), also automatically and periodically with an external script.
I believe you could use either 2) or 3) to achieve your goal.
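The second approach (a generated configuration file) could be sketched roughly as follows. This is a minimal illustration, not the documented Airflow recipe: the file name `customers.json`, the helper names `export_customers` and `load_customers`, and the row layout are all assumptions. The export would be run by an external job once or twice a day, and only the cheap `load_customers` call would sit in the DAG file's top-level code.

```python
import json
import tempfile
from pathlib import Path


def export_customers(rows, path):
    """Write customer rows to a JSON file (run outside Airflow, e.g. by cron).

    `rows` is assumed to be an iterable of (id, name, offset) tuples, such as
    the result of the SQL query in the question.
    """
    payload = [{"id": i, "name": n, "offset": o} for i, n, o in rows]
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(payload))
    # Replace the old file in one step so a parse never sees a half-written file
    tmp.replace(path)


def load_customers(path):
    """Read the customer list back; cheap enough for top-level DAG code."""
    return json.loads(Path(path).read_text())


if __name__ == "__main__":
    config_path = Path(tempfile.mkdtemp()) / "customers.json"
    export_customers([(1, "acme", "0 3 * * *")], config_path)
    for customer in load_customers(config_path):
        # In the DAG file, this is where DAG(dag_id=..., schedule_interval=...) would go
        print(f"{customer['id']}-{customer['name']}", customer["offset"])
```

With this split, the database is queried only when the export job runs, no matter how many times the scheduler or the workers parse the DAG file.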
Someone please tell me whether a DAG in airflow is just a graph (like a placeholder) without any actual data (like arguments) associated with it OR a DAG is like an instance (for a fixed argument)?
I want a system where the set of operations to perform (given a set of arguments) is fixed, but the input is different every time the set of operations is run. In simple terms, the pipeline is the same but the arguments to the pipeline are different every time it is run.
How do I configure this in Airflow? Should I create a new DAG for every new set of arguments, or is there another method?
In my case, the graph is the same, but I want to run it on different data (from different users) as it comes in. So, should I create a new DAG every time for new data?
Yes, you are correct: a DAG is basically a kind of one-way graph. You create a DAG once by chaining multiple operators together to form your "structure".
Each operator can then take multiple arguments, which you can pass from the DAG definition file itself (if needed).
Or you can pass in a configuration object to the DAG, and access custom data from there using the context.
I would recommend reading the Airflow Docs for more examples: https://airflow.apache.org/concepts.html#tasks
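As a rough sketch of the "configuration object" route: when a run is triggered with a conf payload, Airflow exposes it on the DagRun as `dag_run.conf`, which a task callable can read from its context. The callable name `process_user_data`, the keys `user_id` and `input_path`, the default values, and the stand-in context in the demo are assumptions for illustration only.

```python
from types import SimpleNamespace


def process_user_data(**context):
    """Task callable that reads per-run arguments from the triggering conf."""
    # dag_run.conf may be None or empty when the run was started without a payload
    conf = context["dag_run"].conf or {}
    user_id = conf.get("user_id", "unknown")
    input_path = conf.get("input_path", "/tmp/default.csv")
    return f"processing {input_path} for user {user_id}"


if __name__ == "__main__":
    # Stand-in for the context Airflow passes to a Python task callable
    fake_context = {
        "dag_run": SimpleNamespace(conf={"user_id": 42, "input_path": "/data/42.csv"})
    }
    print(process_user_data(**fake_context))
```

The DAG structure stays fixed; only the conf attached to each run changes.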
You can think of an Airflow DAG as a program made of other programs, with the restriction that it can't contain loops (it is acyclic). Will you change your program every time the input data changes? Of course, it all depends on how you write your program, but usually you'd like your program to generalise, right? You don't want two different programs to do 2+2 and 3+3. But you'll have different programs to show Facebook pages and to play Pokemon Go. If you want to do the same thing to similar data, then you want to write your DAG once and maybe only change environment arguments (DB connection, date, etc.) - Airflow is perfectly suitable for that.
You do not need to create a new DAG every time, if the structure of the graph is the same.
Airflow DAGs are created via code, so you are free to create a code structure that allows you to pass in arguments each time. How you do that will require some creative thinking.
You could, for example, create a web form that accepts the arguments, stores them in a DB, and then triggers the DAG through the Airflow REST API. The DAG code would then need to be written to retrieve the params from the database.
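A minimal sketch of the trigger step, assuming Airflow 2's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`) with basic authentication enabled. The host, credentials, DAG id, and conf keys are placeholders, and this variant passes the arguments directly in `conf` instead of going through a database. The request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, conf, user, password):
    """Build (but do not send) a request that triggers one DAG run with `conf`."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    req = build_trigger_request(
        "http://localhost:8080", "user_pipeline", {"user_id": 42}, "admin", "admin"
    )
    print(req.get_method(), req.full_url)
    # To actually trigger the run: urllib.request.urlopen(req)
```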
There are several other ways to accomplish what you are asking, they all just depend on your use case.
One caveat: the Airflow scheduler does not perform well if you change the start date of the DAG. For your idea above, you will need to set the start date earlier than your first DAG run and then set the schedule interval to None (off). This way you have a start date that doesn’t change and dynamically triggered DAG runs.
My goal is to schedule jobs with EmrCreateJobFlowOperator and EmrAddStepsOperator. Namely, I want to create a cluster and add steps for each scheduled day (or hour) starting from a specified date. Basically, I want EmrAddStepsOperator to be back-filled, but not EmrCreateJobFlowOperator. To achieve this, I thought that I could use the sub-DAG concept, where the parent DAG has catch-up disabled and the child DAG has it enabled. I don't want to create an EMR cluster for each step.
Is this possible? Are there any other options?