I am struggling with something that looks very simple, but I don't know how to do it. We have 2 different schemas in our PRE env. How can I switch between the 2 schemas in Airflow? Imagine that I have only one DAG. The equivalent in bash is to manually run a .sh script with the env param (either mstr_new or mstr_pre):
./script.sh env
Now, to implement the same thing in Airflow, how do we specify which schema to run against?
There is a good answer to that in Airflow Best Practices:
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#staging-environment
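One way to mirror `./script.sh env` is to pass the environment as run-time configuration and map it to a schema inside the task. This is a minimal sketch, assuming the value arrives as a dict such as `dag_run.conf` when the DAG is triggered with `airflow dags trigger --conf '{"env": "mstr_new"}'`. The function name `resolve_schema` and the default value are hypothetical.

```python
ALLOWED_SCHEMAS = {"mstr_new", "mstr_pre"}  # the two schemas named in the question


def resolve_schema(conf, default="mstr_pre"):
    """Return the schema to use for this run, validating the 'env' key.

    `conf` stands in for the run configuration dict (e.g. dag_run.conf);
    it may be None or empty when the run was not triggered with parameters.
    The default of "mstr_pre" is an assumption for illustration.
    """
    env = (conf or {}).get("env", default)
    if env not in ALLOWED_SCHEMAS:
        raise ValueError(
            f"Unknown env {env!r}; expected one of {sorted(ALLOWED_SCHEMAS)}"
        )
    return env
```

Validating against a fixed set means a typo in the trigger parameters fails the run early instead of silently querying the wrong schema.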
At the moment we schedule our Databricks notebooks using Airflow. Due to dependencies between projects, there are dependencies between DAGs. Some DAGs wait until a task in a previous DAG is finished before starting (by using sensors).
We are now looking to use Databricks DBX. It is still new for us, but it seems that DBX' main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX. My question is: is it possible to add dependencies between Databricks jobs? Can we create 2 different jobs using DBX and make the second job wait until the first one is completed?
I am aware that I can have dependencies between tasks in one job, but in our case it is not possible to have only one job with all the tasks.
I was thinking about adding a notebook/Python script before the wheel with the ETL logic. This notebook would then check whether the previous job has finished. Once it has, the task with the wheel would be executed. Does this make sense, or are there better ways? Is something like Airflow's ExternalTaskSensor available within Databricks workflows?
Or is there a good way to use DBX without DB workflows?
author of dbx here.
TL;DR - dbx is not opinionated in terms of the orchestrator choice.
It is still new for us, but it seems that DBX' main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX.
The short answer is yes, but it's done on the tasks level (read more here on the difference between workflow and task).
Another approach would be the following - if you still need (or want) to use Airflow, you can do it in the following way:
Deploy and update your jobs from your CI/CD pipeline with dbx deploy commands.
In Airflow, use the Databricks Operator to launch the job (either by name or by id).
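For the polling idea raised in the question (a step that waits until the previous job has finished), a minimal sketch could look like the following. The `get_state` callable and the state strings are hypothetical placeholders; how the upstream job's status is actually queried is left open here.

```python
import time


def wait_until_finished(get_state, timeout_s=3600.0, poll_s=30.0, sleep=time.sleep):
    """Poll `get_state()` until it reports success, failure, or a timeout.

    `get_state` is a hypothetical callable returning a string such as
    "RUNNING", "SUCCESS" or "FAILED". `sleep` is injectable so the loop
    can be tested without real waiting.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state == "SUCCESS":
            return True
        if state == "FAILED":
            raise RuntimeError("upstream job failed")
        if waited >= timeout_s:
            raise TimeoutError(f"upstream job still {state} after {waited:.0f}s")
        sleep(poll_s)
        waited += poll_s
```

A polling step like this occupies compute while it waits, which is one reason an external orchestrator with sensors may be the better fit.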
I'm learning airflow and I'm trying to figure out how to run the simplest possible example where a dag is defined and run end-to-end from Python.
I was originally following the doc tutorial
https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html
But I'm noticing that all of these tutorials define DAGs in a global scope and then use an external bash command to run them. This seems far more complicated than it needs to be (in the simplest case where everything is local).
What I want is something that is self-contained. I want to import airflow, define the DAG, and execute it immediately. I'm OK if I have to start some worker daemon / scheduler on my local machine before I run my code, but what seems not OK is needing to write a custom file to define that DAG and then having to execute it from somewhere outside of that file.
I'm not sure if this is possible, perhaps it runs contrary to the design philosophy of airflow, but it feels like it should be possible and that I'm just making a simple mistake, or perhaps I didn't configure airflow correctly before running this.
The example I would like to get working is as follows:
from airflow import DAG
from datetime import timezone
from datetime import datetime as datetime_cls
from airflow.operators.bash import BashOperator
now = datetime_cls.utcnow().replace(tzinfo=timezone.utc)
dag = DAG(
    'mycustomdag',
    start_date=now,
    catchup=False,
    tags=['example'],
)
t1 = BashOperator(task_id='task1', bash_command='date', dag=dag)
dag.run(verbose=True, local=True)
I'm running this directly in IPython, but the code might exist in some library function and get executed on the fly. The important feature is that no file with the sole purpose of defining this DAG will exist on disk a priori. I want to dynamically define the DAG and run it.
This seems like it should be fine. But this results in an exception:
BackfillUnfinished: Some task instances failed:
DAG ID       Task ID    Run ID                                        Try number
-----------  ---------  ------------------------------------------  ------------
mycustomdag  task1      backfill__2022-06-20T19:51:38.686880+00:00             1
I'm OK if I need to pre-configure the runner. But I don't want to interact with it at all. I don't want to look at any UI. I just want to define and run the DAG in Python. Is that possible, or is that just not how Airflow works?
EDIT
I'm thinking this might not be possible (which is absolutely crazy to me from a design perspective). In these docs (https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#scope) it seems to state that dags must be defined in a global scope.
Is there any way around this? Can you locally define a dag and then export it to a file?
I want to modify the schedule of a task I created in a dags/ folder through the Airflow UI. I can't find a way to modify the schedule through the UI. Can it be done, or can it only be done by modifying the Python script?
The only way to change it is through the code. Because the schedule is part of the DAG definition (like tasks and dependencies), it cannot readily be changed through the web interface.
We are evaluating Airflow for scheduling and data pipeline design. However, we have not been able to find out how to achieve the following two tasks:
(1) How to change the DAG schedule through the GUI?
(2) How to achieve incremental updates when the data source is Oracle or MySQL?
This is what we have tried:
(1) We tried changing the schedule of the DAG in the GUI, but it looks like that only changes the schedule of that particular instance.
(2) We tried to handle the incremental update programmatically by storing the last column value. Is there a better way of doing incremental updates?
1) You can't change the DAG schedule in the GUI; you have to do this in Python code when you write the DAG.
2) How you do incremental updates is entirely up to you; however, I would use a combination of Airflow macros https://airflow.apache.org/code.html#macros and SQL files with Jinja templates https://airflow.apache.org/concepts.html#jinja-templating
Might be worth having a look through the Airflow documentation as it sounds like you're not entirely familiar with its concepts.
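The "store the last column value" approach from the question is often called a high-watermark load. Below is a minimal sketch that uses an in-memory SQLite database as a stand-in for Oracle or MySQL. The table `source_rows`, its columns, and the `fetch_increment` helper are hypothetical. In an Airflow task the watermark would come from a templated value or persisted state rather than a local variable.

```python
import sqlite3


def fetch_increment(conn, last_id):
    """Return rows with id greater than `last_id`, plus the new watermark.

    When no new rows exist, the watermark is returned unchanged.
    """
    rows = conn.execute(
        "SELECT id, payload FROM source_rows WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    new_watermark = rows[-1][0] if rows else last_id
    return rows, new_watermark


# Hypothetical source table, populated for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_rows (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source_rows VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)

# First run starts from watermark 0 and picks up every existing row.
first_batch, watermark = fetch_increment(conn, last_id=0)

# A later run only sees rows added since the stored watermark.
conn.execute("INSERT INTO source_rows VALUES (4, 'd')")
second_batch, watermark = fetch_increment(conn, watermark)
```

This relies on a column that only ever increases (an id or a modification timestamp); rows updated in place without touching that column would be missed.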
These days I'm working on a new ETL project and I wanted to give Airflow a try as job manager.
My colleague and I are both working with Airflow for the first time and we are following two different approaches: I decided to write Python functions (operators like the ones included in the apache-airflow project), while my colleague uses Airflow to call external Python scripts through BashOperator.
I'd like to know if there is something like a "good practice": are the two approaches equally good, or should I prefer one over the other?
To me, the main differences are:
- with BashOperator you can call a python script using a specific python environment with specific packages
- with BashOperator the tasks are more independent and can be launched manually if airflow goes mad
- with BashOperator task to task communication is a bit harder to manage
- with BashOperator task errors and failures are harder to manage (how can a bash task know if the task before it failed or succeeded?).
What do you think?
My personal preference in these cases would be to use a PythonOperator over BashOperator. Here's what I do and why:
Single repo that contains all my DAGs. This repo also has a setup.py that includes airflow as a dependency, along with anything else my DAGs require. Airflow services are run from a virtualenv that installs these dependencies. This handles the python environment you mentioned regarding BashOperator.
I try to put all Python logic unrelated to Airflow in its own externally packaged python library. That code should have its own unit tests and also has its own main so it can be called on the command line independent of Airflow. This addresses your point about when Airflow goes mad!
If the logic is small enough that it doesn't make sense to separate into its own library, I drop it in a utils folder in my DAG repo, with unit tests still of course.
Then I call this logic in Airflow with the PythonOperator. The Python callable can easily be unit tested, unlike a BashOperator template script. This also means you can do things like start an Airflow DB session, push multiple values to XCom, etc.
Like you mentioned, error handling is a bit easier with Python. You can catch exceptions and check return values easily. You can choose to mark the task as skipped with raise AirflowSkipException.
FYI for BashOperator, if the script exits with an error code, Airflow will mark the task as failed.
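The layout described above (logic in a plain function that can be called both from the command line and from Airflow) can be sketched as follows. The function names and the row-count check are hypothetical. The same function could be passed as `python_callable` to a PythonOperator, where a raised exception fails the task, while under BashOperator the non-zero exit code does.

```python
import sys


def check_data_available(row_count):
    """Plain Python logic with no Airflow import, so it is easy to unit test.

    Raises ValueError when there is nothing to process.
    """
    if row_count <= 0:
        raise ValueError("no data available at source")
    return row_count


def main(argv=None):
    """Command-line entry point: return 0 on success, 1 on failure."""
    args = sys.argv[1:] if argv is None else argv
    try:
        count = check_data_available(int(args[0]))
    except (IndexError, ValueError) as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(f"{count} rows available")
    return 0


# In a standalone script, finish with: sys.exit(main())
# so that the return value becomes the process exit code.
```

Keeping the exit-code handling in `main` and the real work in `check_data_available` means unit tests can target the function directly without spawning a process.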
TaskA checks data availability at the source. TaskB processes it.
TaskA >> TaskB
Both tasks use BashOperator to call Python scripts. I used to call sys.exit(1) from script1 (triggered by TaskA) when there is no data at the source, as a way to signal that TaskA failed and that there is no need to run TaskB.