Can an Airflow DAG schedule be altered in the Airflow UI?

I'm new to Apache Airflow. Each time I want to change the execution time, I have been modifying the schedule_interval and re-uploading the Python script.
Can I change the DAG schedule without uploading a new Python script?
Thanks

There is an Airflow plugin which allows for (visual) DAG generation and modification, see here. It seems to be outdated and not very actively developed, though.
The general idea behind Airflow is, roughly speaking, ETL-as-code, with benefits like code versioning, so you need to be aware of the problems that arise from redefining something as central as the schedule from the UI. For example, if you could edit the schedule in the UI (without that altering the code itself), what would the state of your DAG be? That said, it's certainly not impossible, and Airflow's design allows for such modifications.
tl;dr: You could of course customize the UI (see above, e.g. using Airflow plugins), and your requirement is very understandable, especially to accommodate non-technical users who can't upload or modify code.
Another, probably easier option is to use Airflow Variables, i.e. pull the cron-like schedule string (1 * * * *, @daily, etc.) from an Airflow Variable; such Variables can be altered in the UI, so this might work out for you.
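For illustration, a minimal sketch of that Variable-based idea, assuming an Airflow 2.x environment (the Variable name, dag_id, and default value below are made up, not from the thread):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable

# Pull the schedule from an Airflow Variable so it can be edited in the UI
# (Admin -> Variables). "my_dag_schedule" is a hypothetical Variable name; the
# default_var fallback keeps the DAG parseable if the Variable does not exist yet.
schedule = Variable.get("my_dag_schedule", default_var="@daily")

with DAG(
    dag_id="variable_scheduled_dag",   # illustrative dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval=schedule,
    catchup=False,
) as dag:
    ...  # tasks go here

Note that Variable.get at module level is read on every DAG parse, and the stored value must be a valid cron expression or a preset such as @daily (not daily); otherwise the scheduler reports a broken DAG like the one quoted in the comment below.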

The Variable option doesn't work, at least in Airflow 2.0.1.
The message:
Broken DAG: [/opt/airflow/dags/ADB-365/big-kahuna-365-adb-incremental.py] Invalid Cron expression: Exactly 5 or 6 columns has to be specified for iteratorexpression.
appears in the main view.

Related

Importing hardcoded code logic into several Airflow DAG scripts

There is hardcoded SQL logic (a combination of joins across multiple tables and filter conditions) which I want to import into several Airflow DAG scripts, so that if any change needs to be made to that SQL logic, I can make it in a single location. This SQL will be used as a parameter value in an operator that is used in several DAG scripts. What is the best approach for a viable solution?
Thanks in advance!
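One common way to handle this is to keep the SQL in a small shared module next to the DAGs and import it wherever it is needed. A minimal sketch, with hypothetical module, connection, table, and operator choices (not from the question):

# dags/common/sql_logic.py -- hypothetical shared module, the single place to edit the SQL
INCREMENTAL_LOAD_SQL = """
SELECT o.order_id, o.amount, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.updated_at >= '{{ ds }}'  -- Jinja-templated by the operator at run time
"""

# dags/example_dag.py -- every DAG that needs the logic just imports it
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

from common.sql_logic import INCREMENTAL_LOAD_SQL

with DAG(dag_id="example_incremental_load", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    PostgresOperator(
        task_id="incremental_load",
        postgres_conn_id="warehouse",    # assumed connection id
        sql=INCREMENTAL_LOAD_SQL,        # the shared SQL passed as the operator parameter
    )

With this layout, changing the joins or filters in sql_logic.py is picked up by every DAG that imports it.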

Limitation of tags in an Airflow DAG

I'm looking for best practices for tags in Airflow.
Is there any limitation on the length of a tag name?
How many tags are reasonable for an Airflow DAG?
What makes a good tag, versus encoding the same information in the DAG's name? For example, which is better: tagging all of the Ads team's DAGs with "Ads", or naming them ads_XXX_XXX?
Thanks
The max length of a tag is 100 chars.
As far as a tagging strategy goes, it's entirely up to you; not using tags is also a valid option. There is no right or wrong when it comes to naming conventions and tags. Use whatever makes the most sense to you.
Tags are used to help filter DAGs so you don't have to rely on the name alone. You could filter by team, by purpose (a "downloader" DAG vs. a "loading data" DAG, etc.), or any other grouping you think would be useful.
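For illustration, a minimal sketch of attaching tags to a DAG (the dag_id, tag values, and operator are placeholders, not from the question):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # EmptyOperator in newer Airflow versions

# Hypothetical Ads-team DAG: the tags make it filterable in the UI without
# encoding the team name into the dag_id. Each tag can be at most 100 characters.
with DAG(
    dag_id="ads_conversion_report",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["Ads", "reporting"],
) as dag:
    DummyOperator(task_id="noop")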

Version control of big data tables (Iceberg)

I'm building Iceberg tables on top of a data lake. These tables are used by reporting tools. I'm trying to figure out the best way to version-control and deploy changes to these tables in a CI/CD process. For example, I would like to add a column to an Iceberg table. To do that I have to write an ALTER TABLE statement, save it to the git repository, and deploy it via the CI/CD pipeline. The tables are accessible via the AWS Glue Catalog.
I couldn't find much info about this on Google, so if anyone could share some knowledge, it would be much appreciated.
Cheers.
Agree with @Fokko Driesprong. This is only a supplement.
Sometimes table changes are treated as part of a task's version change; that is, the ALTER TABLE statements are bound to the task upgrade.
Tasks are often deployed automatically, so the pipeline typically executes the table change statement first and then deploys the new task. If the change is disruptive, we need to stop the old task first and then deploy the new one.
Alongside the upgrade we also keep a rollback script, which of course contains the corresponding reverse table change statement.
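As a rough illustration of that upgrade/rollback idea, the migration step could be a small Spark job that the CI/CD pipeline runs before deploying the new task. The catalog, database, table, and column names below are made up, and the setup assumes an Iceberg catalog backed by the AWS Glue Catalog with the Iceberg SQL extensions enabled:

from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.glue is configured as an Iceberg SparkCatalog that
# points at the AWS Glue Catalog, and that the Iceberg SQL extensions are enabled.
spark = SparkSession.builder.appName("iceberg_schema_migration").getOrCreate()

# Upgrade: additive schema change, safe for existing readers and writers.
spark.sql("ALTER TABLE glue.reporting.orders ADD COLUMN discount_pct double")

# The matching rollback script would hold the reverse statement, e.g.:
# spark.sql("ALTER TABLE glue.reporting.orders DROP COLUMN discount_pct")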
Thanks for asking this question. I don't think there is a definitive way of doing this. In practice I see most people bundling this as part of the job that writes to the Iceberg table. This way you can make sure that new columns are populated right away with the new version of the job. If you don't make any breaking changes (such as deleting a column), then the downstream jobs won't break. Hope this helps!

Is there any way to execute repeatable flyway scripts first?

We have been using Flyway for years to maintain our DB scripts, and it does a wonderful job.
However, there is one situation where I am not really happy; possibly someone out there has a solution:
In order to reduce the number of scripts required (and also to keep an overview of "where" our procedures are defined), I'd like to implement our functions/procedures in one script. Every time a procedure changes (or a new one is developed) this script shall be updated; repeatable scripts sound perfect for this purpose, but unfortunately they are not.
The drawback is that a new procedure cannot be accessed by non-repeatable scripts, because repeatable scripts are executed last, so the procedure does not yet exist when the non-repeatable script executes.
I hoped I could control this by specifying different locations (e.g. loc_first containing the repeatables I want executed first, loc_normal for the standard scripts and the repeatables to be executed last).
Unfortunately the order of locations has no impact on execution order ;-(
What's the proper way to deal with this situation? Right now I need to specify the corresponding procedures in non-repeatable scripts, but that's exactly what I'd like to avoid.
I found a workaround on my own: I'm using Flyway directly with Maven (the same would work if you use the API, of course). Each stage of my Maven build has its own profile (specifying the URL etc.).
Now I create two profiles for every stage, so I have e.g. dev and devProcs.
The difference between these two Maven profiles is that the "[stage]Procs" profile operates on a different location (where only the repeatable scripts maintaining procedures are kept). Then I need to execute Flyway twice: first with [stage]Procs, then with [stage].
To me this looks a bit messy, but at least I can maintain my procedures in a repeatable script this way.
According to the Flyway docs, repeatable migrations always execute after versioned migrations.
But, I guess, you can use Flyway callbacks. It looks like the beforeMigrate.sql callback is exactly what you need.

In Airflow, is there a good way to call another DAG's task?

I've got dag_prime and dag_tertiary.
dag_prime: scans through a directory and intends to call dag_tertiary on each entry. Currently a PythonOperator.
dag_tertiary: scans through the directory passed to it and does (possibly time-intensive) calculations on its contents.
I can call the second DAG via a system call from the PythonOperator, but I feel like there's got to be a better way. I'd also like to consider queuing the dag_tertiary calls, if there's a simple way to do that. Is there a better way than using system calls?
Thanks!
Use the TriggerDagRunOperator from airflow.operators.trigger_dagrun for calling one DAG from another.
The details can be found in the TriggerDagRunOperator section of the Airflow documentation.
The following post gives a good example of using this operator:
https://www.linkedin.com/pulse/airflow-lesson-1-triggerdagrunoperator-siddharth-anand
Use TriggerDagRunOperator from airflow.operators.dagrun_operator and pass the other DAG's id to the trigger_dag_id parameter.
See the updated dagrun_operator page in the Airflow documentation.
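A minimal sketch of the TriggerDagRunOperator approach for dag_prime (import path shown for Airflow 2.x; the directory path, dag ids, and conf payload are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator  # Airflow 2.x path

with DAG(dag_id="dag_prime", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag_prime:
    TriggerDagRunOperator(
        task_id="trigger_tertiary",
        trigger_dag_id="dag_tertiary",             # the DAG to start
        conf={"directory": "/data/incoming"},      # hypothetical payload; dag_tertiary
    )                                              # reads it via dag_run.conf["directory"]

Because the triggered runs are ordinary DAG runs, dag_tertiary's own max_active_runs setting effectively queues them if many are triggered at once.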
