I want to modify the schedule of a task I created in a dags/ folder through the Airflow UI, but I can't find a way to do so. Can it be done through the UI, or only by modifying the Python script?
The only way to change it is through the code. The schedule is part of the DAG definition (like tasks and dependencies), so it cannot be changed through the web interface.
At the moment we schedule our Databricks notebooks using Airflow. Due to dependencies between projects, there are dependencies between DAGs. Some DAGs wait until a task in a previous DAG is finished before starting (by using sensors).
We are now looking to use Databricks DBX. It is still new for us, but it seems that DBX' main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX. My question is: is it possible to add dependencies between Databricks jobs? Can we create two different jobs using DBX and make the second job wait until the first one is completed?
I am aware that I can have dependencies between tasks in one job, but in our case it is not possible to have only one job with all the tasks.
I was thinking about adding a notebook/Python script before the wheel with the ETL logic. This notebook would then check whether the previous job is finished. Once that is the case, the task with the wheel would be executed. Does this make sense, or are there better ways? Is something like the ExternalTaskSensor in Airflow available within Databricks workflows?
Or is there a good way to use DBX without DB workflows?
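The polling idea described in the question could be sketched roughly as follows. This is only an illustration under stated assumptions: `wait_for_upstream`, `get_state`, and the state names are hypothetical, and how the state of the upstream job is actually looked up (for example through the Databricks Jobs API) is left to the caller.

```python
import time
from typing import Callable

# States treated as "finished". These names are hypothetical placeholders,
# not the values returned by any particular API.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELED"}


def wait_for_upstream(
    get_state: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"upstream job still in state {state!r}")
        sleep(poll_seconds)
```

The first task of the downstream job would call this with a `get_state` function that reports the latest run of the upstream job, and would fail itself unless the returned state indicates success.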
author of dbx here.
TL;DR - dbx is not opinionated in terms of the orchestrator choice.
It is still new for us, but it seems that DBX' main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX.
The short answer is yes, but it's done at the task level (read more here on the difference between workflow and task).
Alternatively, if you still need (or want) to use Airflow, you can do the following:
Deploy and update your jobs from your CI/CD pipeline with dbx deploy commands.
In Airflow, use the Databricks Operator to launch the job (either by name or by id).
I placed a DAG file in the dags folder based on a tutorial with slight modifications, but it doesn't show up in the GUI or when I run airflow dags list.
Answering my own question: check the Python file for exceptions by running it directly. It turned out that an exception in the DAG's Python script, caused by a missing import, kept the DAG from showing up in the list. I note this in case another new user comes across it. To me the moral of the story is that DAG files should be checked by running them with python directly whenever they are modified, because otherwise no obvious error shows up; they may just disappear from the list.
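The same check can be scripted. Below is a minimal sketch using only the standard library; `check_dag_file` is a hypothetical helper name, and it assumes the DAG file's own imports (including Airflow) are installed in the interpreter that runs the check.

```python
import runpy
import traceback


def check_dag_file(path: str) -> bool:
    """Execute a DAG file as a plain script and report whether it raised."""
    try:
        # Running the file surfaces import errors and syntax errors
        # that would otherwise make the DAG silently go missing.
        runpy.run_path(path, run_name="__dag_check__")
    except Exception:
        traceback.print_exc()
        return False
    return True
```

Calling `check_dag_file` on each file in the dags folder after a change prints the traceback of any file that fails to load.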
I have 2 .net core web projects.
One of them is called ScheduledJobs and it uses Hangfire with the dashboard to both schedule and process jobs.
The other is called ClientWebsite and it only schedules the jobs, but I don't want them executing here!
ScheduledJobs works fine, if I schedule anything from there it picks them up and processes them.
But since I need to be able to schedule jobs from clientWebsite too, I have to have the following settings in startup:
services.AddHangfire(x => x.UseSqlServerStorage(Configuration.GetConnectionString("DefaultConnection")));
services.AddHangfireServer();
If I don't call services.AddHangfireServer it won't even let me schedule them.
But if I add it, then it processes them too, which I don't want!
Please help! Thanks
You shouldn't need to register the hangfire service at all in the second project in this way.
If you purely want to queue jobs from it, you can use GlobalConfiguration to set which database it should point at, similar to:
GlobalConfiguration.Configuration.UseSqlServerStorage(Configuration.GetConnectionString("DefaultConnection"));
Once you have done this you can register a BackgroundJobClient similar to this (this is taken from an Autofac example, so depending on your DI container it won't be exactly the same as the line below):
builder.RegisterType<BackgroundJobClient>().As<IBackgroundJobClient>();
What this then allows you to do is resolve and enqueue jobs using the IBackgroundJobClient in your application without setting up a hangfire server at all.
In the classes where you want to enqueue jobs, you can then simply resolve an instance of IBackgroundJobClient and use its Enqueue method, such as:
_myClient.Enqueue<MyJobClass>(x => x.APublicMethodOnMyJobClass());
Details on BackgroundJobClient can be found here - BackgroundJobClient
We are evaluating Airflow for scheduling and data pipeline design. However, we have not been able to find out how to achieve the following two tasks:
(1) How to change the DAG schedule through the GUI?
(2) How to achieve incremental updates when the data source is Oracle or MySQL?
This is what we have tried:
(1) We tried changing the schedule of the DAG in the GUI, but it looks like that only changes the schedule of that particular instance.
(2) We tried to handle the incremental update programmatically by storing the last column value. Is there a better way of doing incremental updates?
1) You can't change the DAG schedule in the GUI; you have to do this in the Python code when you write the DAG.
2) How you do incremental updates is entirely up to you. However, I would use a combination of Airflow macros https://airflow.apache.org/code.html#macros and SQL files with Jinja templates https://airflow.apache.org/concepts.html#jinja-templating
Might be worth having a look through the Airflow documentation as it sounds like you're not entirely familiar with its concepts.
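For reference, the "store the last column value" approach from the question can be sketched as below. This is an illustration only: sqlite3 stands in for Oracle or MySQL, and `source_table`, `id`, `payload`, and `fetch_new_rows` are hypothetical names. With the templating approach from the answer, the filter would instead compare a date column against a templated value such as `{{ ds }}`.

```python
import sqlite3


def fetch_new_rows(conn, last_seen_id):
    """Return rows whose id exceeds the stored watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload FROM source_table WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    # Keep the old watermark when nothing new arrived.
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark


# Demo against an in-memory database (stand-in for the real source).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)

# First run: everything is new because the watermark starts at 0.
first_batch, watermark = fetch_new_rows(conn, last_seen_id=0)

# A row arrives between runs; the second run picks up only that row.
conn.execute("INSERT INTO source_table VALUES (4, 'd')")
second_batch, watermark = fetch_new_rows(conn, watermark)
```

In a real pipeline the watermark would have to be persisted between runs, for example in a small state table, so that a rerun starts from the last value that was loaded successfully.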
These days I'm working on a new ETL project and I wanted to give Airflow a try as job manager.
My colleague and I are both working with Airflow for the first time, and we are following two different approaches: I decided to write Python functions (operators like the ones included in the apache-airflow project), while my colleague uses Airflow to call external Python scripts through BashOperator.
I'd like to know whether there is a "good practice" here: are the two approaches equally good, or should I prefer one over the other?
To me, the main differences are:
- with BashOperator you can call a python script using a specific python environment with specific packages
- with BashOperator the tasks are more independent and can be launched manually if airflow goes mad
- with BashOperator task to task communication is a bit harder to manage
- with BashOperator task errors and failures are harder to manage (how can a bash task know if the task before it failed or succeeded?).
What do you think?
My personal preference in these cases would be to use a PythonOperator over BashOperator. Here's what I do and why:
Single repo that contains all my DAGs. This repo also has a setup.py that includes airflow as a dependency, along with anything else my DAGs require. Airflow services are run from a virtualenv that installs these dependencies. This handles the python environment you mentioned regarding BashOperator.
I try to put all Python logic unrelated to Airflow in its own externally packaged python library. That code should have its own unit tests and also has its own main so it can be called on the command line independent of Airflow. This addresses your point about when Airflow goes mad!
If the logic is small enough that it doesn't make sense to separate into its own library, I drop it in a utils folder in my DAG repo, with unit tests still of course.
Then I call this logic in Airflow with the PythonOperator. The Python callable can easily be unit tested, unlike a BashOperator template script. It also means you can do things like start an Airflow DB session, push multiple values to XCom, etc.
Like you mentioned, error handling is a bit easier with Python. You can catch exceptions and check return values easily. You can choose to mark the task as skipped with raise AirflowSkipException.
FYI for BashOperator, if the script exits with an error code, Airflow will mark the task as failed.
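The exit-code behaviour mentioned above can be seen outside Airflow with a small sketch. The script body, `data_available`, and `main` are hypothetical; the point is only that a non-zero exit status is the failure signal a caller such as BashOperator receives.

```python
import subprocess
import sys

# Hypothetical script body: exit with status 1 when no data is available.
SCRIPT = """
import sys

def data_available():
    return False  # placeholder for a real check against the source

def main():
    return 0 if data_available() else 1

if __name__ == "__main__":
    sys.exit(main())
"""

# Run the script in a child interpreter, much as a bash command would.
result = subprocess.run([sys.executable, "-c", SCRIPT])
exit_code = result.returncode

# A non-zero exit status is what marks the calling task as failed.
task_failed = exit_code != 0
```

Keeping the logic in a `main()` that returns a status also leaves the function importable, so the same code can be called from a PythonOperator or a unit test.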
TaskA checks data availability at the source. TaskB processes it.
TaskA >> TaskB
Both tasks use BashOperator to call Python scripts. I used to call sys.exit(1) in script1 (triggered by TaskA) when there was no data at the source, as a way to signal that TaskA failed because there is no data and there is no need to run TaskB.