At the moment we schedule our Databricks notebooks using Airflow. Due to dependencies between projects, there are dependencies between DAGs. Some DAGs wait until a task in a previous DAG is finished before starting (by using sensors).
We are now looking to use Databricks DBX. It is still new for us, but it seems that DBX's main added value comes when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX. My question now is: is it possible to add dependencies between Databricks jobs? Can we create 2 different jobs using DBX, and make the second job wait until the first one is completed?
I am aware that I can have dependencies between tasks in one job, but in our case it is not possible to have only one job with all the tasks.
I was thinking about adding a notebook/Python script before the wheel with the ETL logic. This notebook would then check whether the previous job is finished. Once it is, the task with the wheel would be executed. Does this make sense, or are there better ways? Is something like Airflow's ExternalTaskSensor available within Databricks workflows?
Or is there a good way to use DBX without Databricks workflows?
author of dbx here.
TL;DR - dbx is not opinionated in terms of the orchestrator choice.
Regarding "It is still new for us, but it seems that DBX's main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX.":
The short answer is yes, but it's done at the task level (read more here on the difference between a workflow and a task).
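To illustrate, here is a rough sketch of a two-task job where the second task waits for the first, expressed as a Databricks Jobs 2.1 payload (a dbx deployment file expresses the same structure). The job name, task keys, package name and entry points are made up, and cluster settings are omitted for brevity:

```python
# Rough sketch of a multi-task job: "transform" depends on "ingest".
# All names below are placeholders; cluster settings are omitted.
job_settings = {
    "name": "etl-pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "python_wheel_task": {"package_name": "my_etl", "entry_point": "ingest"},
        },
        {
            "task_key": "transform",
            # Runs only after the "ingest" task has completed successfully.
            "depends_on": [{"task_key": "ingest"}],
            "python_wheel_task": {"package_name": "my_etl", "entry_point": "transform"},
        },
    ],
}
```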
Another approach, if you still need (or want) to use Airflow, would be the following:
Deploy and update your jobs from your CI/CD pipeline with dbx deploy commands.
In Airflow, use the Databricks Operator to launch the job (either by name or by id).
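As a rough sketch of step 2, assuming a recent Airflow 2.x with the apache-airflow-providers-databricks package installed (the connection id and job id are placeholders):

```python
# Minimal sketch: trigger a job that was deployed with `dbx deploy` from an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG("trigger_databricks_job", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    DatabricksRunNowOperator(
        task_id="run_etl_job",
        databricks_conn_id="databricks_default",
        job_id=12345,  # placeholder; newer provider versions also accept job_name=...
    )
```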
I'm very new to Airflow, and while I have read the docs and some answers about Airflow's configuration regarding parallelism, I have not yet found an answer about controlling the threads used within a task.
My current case is that I have 5 tasks (each in the form of a Python script) that only make API calls (but to different API services) and transform the data. Each task can make 1000+ calls, so I try to use multithreading in the script. Unfortunately, when I run the multithreaded script in Airflow, it doesn't use the multithreading mechanism in the script. I feel like this is because of an Airflow configuration that overrides the child script, or am I wrong? Any help or answer is appreciated, thank you.
Run your script with a KubernetesPodOperator.
You can use a Python base image and run your script as is. This should closely mimic how you are executing the script locally, except that now it runs in a Kubernetes pod.
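A minimal sketch of what that could look like, assuming the cncf.kubernetes provider is installed (the import path varies slightly across provider versions, and the image, namespace and script path are placeholders):

```python
# Run a containerized script as an Airflow task inside a Kubernetes pod.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG("api_calls", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    KubernetesPodOperator(
        task_id="fetch_service_a",
        name="fetch-service-a",
        namespace="airflow",
        image="python:3.11-slim",                     # or your own image with the script baked in
        cmds=["python", "/app/fetch_service_a.py"],   # placeholder path to your script
        get_logs=True,
    )
```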
I want to modify the schedule of a task I created in the dags/ folder through the Airflow UI. I can't find a way to modify the schedule through the UI. Can it be done, or can it only be done by modifying the Python script?
The only way to change it is through the code. As it's part of the DAG definition (like tasks and dependencies), it isn't something you can change through the web interface.
We are evaluating Airflow for scheduling and data pipeline design. However, we have not been able to work out how to achieve the following two tasks:
(1) How to change the DAG schedule through the GUI?
(2) How to achieve an incremental update when the data source is Oracle or MySQL.
This is what we have tried:
(1) We tried changing the schedule of the DAG in the GUI, but it looks like that only changes the schedule of that particular instance.
(2) We tried to handle the incremental update programmatically by storing the last column value. Is there any other, better way of doing incremental updates?
1) You can't change the DAG schedule in the GUI; you have to do this in Python code when you write the DAG.
2) How you do incremental updates is entirely up to you; however, I would use a combination of Airflow macros https://airflow.apache.org/code.html#macros and SQL files with Jinja templates https://airflow.apache.org/concepts.html#jinja-templating.
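As a rough sketch of that approach, assuming a recent Airflow with the common SQL provider (the table, column and connection id are placeholders, and older Airflow versions expose {{ ds }} / {{ next_ds }} instead of the data-interval macros):

```python
# Pull only the rows that fall inside the scheduled interval, using Airflow macros
# rendered into a Jinja-templated query.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

INCREMENTAL_SQL = """
    SELECT *
    FROM orders
    WHERE updated_at >= '{{ data_interval_start }}'
      AND updated_at <  '{{ data_interval_end }}'
"""

with DAG("incremental_load", start_date=datetime(2023, 1, 1), schedule="@daily") as dag:
    SQLExecuteQueryOperator(
        task_id="extract_increment",
        conn_id="mysql_default",
        sql=INCREMENTAL_SQL,
    )
```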
Might be worth having a look through the Airflow documentation as it sounds like you're not entirely familiar with its concepts.
These days I'm working on a new ETL project, and I wanted to give Airflow a try as a job manager.
My colleague and I are both working with Airflow for the first time, and we are following two different approaches: I decided to write Python functions (operators like the ones included in the apache-airflow project), while my colleague uses Airflow to call external Python scripts through the BashOperator.
I'd like to know if there is something like a "good practice" here, whether the two approaches are equally good, or whether I should consider one over the other.
To me, the main differences are:
- with BashOperator you can call a Python script using a specific Python environment with specific packages
- with BashOperator the tasks are more independent and can be launched manually if Airflow goes mad
- with BashOperator task-to-task communication is a bit harder to manage
- with BashOperator task errors and failures are harder to manage (how can a bash task know whether the task before it failed or succeeded?).
What do you think?
My personal preference in these cases would be to use a PythonOperator over BashOperator. Here's what I do and why:
Single repo that contains all my DAGs. This repo also has a setup.py that includes airflow as a dependency, along with anything else my DAGs require. Airflow services are run from a virtualenv that installs these dependencies. This handles the python environment you mentioned regarding BashOperator.
I try to put all Python logic unrelated to Airflow in its own externally packaged python library. That code should have its own unit tests and also has its own main so it can be called on the command line independent of Airflow. This addresses your point about when Airflow goes mad!
If the logic is small enough that it doesn't make sense to separate into its own library, I drop it in a utils folder in my DAG repo, with unit tests still of course.
Then I call this logic in Airflow with the PythonOperator. The Python callable can easily be unit tested, unlike a BashOperator template script. This also means you can access things like starting an Airflow DB session, pushing multiple values to XCom, etc.
Like you mentioned, error handling is a bit easier with Python. You can catch exceptions and check return values easily. You can choose to mark the task as skipped with raise AirflowSkipException.
FYI for BashOperator, if the script exits with an error code, Airflow will mark the task as failed.
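A minimal sketch of that pattern, assuming a recent Airflow 2.x (the DAG name and the check_for_data helper are made-up stand-ins for your own logic):

```python
# Call externally packaged logic from a PythonOperator and skip the run when there
# is nothing to do, instead of failing it.
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator


def check_for_data():
    # Stand-in for your own library call; replace with real logic.
    return []


def extract():
    records = check_for_data()
    if not records:
        # Marks this task as skipped rather than failed.
        raise AirflowSkipException("No new data for this run")
    return len(records)  # small return values are pushed to XCom automatically


with DAG("example_python_operator", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    PythonOperator(task_id="extract", python_callable=extract)
```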
TaskA checks data availability at the source. TaskB processes it.
TaskA >> TaskB
Both tasks use the BashOperator to call Python scripts. I used to call sys.exit(1) (when there is no data at the source) from script1, triggered by TaskA, as a way to communicate that TaskA failed because there is no data and there is no need to run TaskB.
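A minimal sketch of that setup, assuming Airflow 2.x (the DAG name and script paths are placeholders):

```python
# Two BashOperator tasks: if script1.py exits with a non-zero code (no data at the
# source), task_a is marked failed and task_b is not run.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("source_to_process", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    task_a = BashOperator(task_id="task_a", bash_command="python /opt/scripts/script1.py")
    task_b = BashOperator(task_id="task_b", bash_command="python /opt/scripts/script2.py")
    task_a >> task_b
```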
I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if there were some way I could have this script run automatically at a pre-defined time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what is the specific feature / aspect of AWS that is able to do this, this way I can look more into it?
Thanks!
Two options come to mind thinking about this:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: Use this AMI.
To execute the script, R offers the Rscript command-line interface. See e.g. here for how to set this up.
Go serverless: AWS Lambda is a hosted microservice. Currently R is not natively supported, but on the official AWS Blog here they offer a step-by-step guide on how to run R. Basically, you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (essentially a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your script outputs CSV files. To persist them you will need to put them in file storage such as AWS S3. You can do this in R via the aws.s3 package. Another option would be to use the AWS SDK for Python (boto3), which is preinstalled in the Lambda runtime. You could, for example, write the CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's upload_file function.
IMHO the first option is easier to set up, but the second one is more robust.
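For the /tmp-to-S3 step in the second option, a minimal sketch of the Lambda handler might look like this (the bucket name, key and file paths are placeholders, and the rpy2 call is only indicated by a comment):

```python
# Upload the CSV produced by the R script from Lambda's /tmp directory to S3.
import boto3

s3 = boto3.client("s3")


def handler(event, context):
    # ... run the R script via rpy2 here; assume it writes /tmp/output.csv ...
    s3.upload_file("/tmp/output.csv", "my-scraper-bucket", "daily/output.csv")
    return {"status": "uploaded"}
```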
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run things periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a frequency for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
If you are running Windows, one of the easier solutions is to write a .BAT script to run your R script and then use Windows Task Scheduler to run it as desired.
To call your R-script from your batch file use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and your R code are correct.
Dockerize your script (write a Dockerfile, build an image)
Push the image to AWS ECR
Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it's spun up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
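As a rough sketch of the last step with boto3 (the console or CLI work just as well; every name, ARN, subnet and the cron expression below are placeholders):

```python
# Create a scheduled EventBridge rule and point it at the ECS task definition.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="daily-scraper",
    ScheduleExpression="cron(0 6 * * ? *)",  # every day at 06:00 UTC
)

events.put_targets(
    Rule="daily-scraper",
    Targets=[
        {
            "Id": "run-scraper-task",
            "Arn": "arn:aws:ecs:eu-west-1:123456789012:cluster/scraper-cluster",
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:eu-west-1:123456789012:task-definition/scraper:1",
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
        }
    ],
)
```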
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud