Schedule tasks at a fixed time in multiple timezones - Airflow

I'm getting started with Airflow and I'm not sure how to approach this problem:
I'm building a data export system that should run at a fixed time daily for different locations. My issue is that the locations span several timezones, and the definition of day start/end changes depending on the timezone.
I saw in the documentation that I can make a DAG timezone aware, but I'm not sure creating hundreds of DAGs with different timezones is the right approach. I also have some common tasks, so multiple DAGs create more complexity or duplication in the tasks performed.
Is there an Airflow-idiomatic way to achieve this? I think it would be a fairly common use case to build reports that are timezone dependent, but I didn't find any information about it.

Dynamic DAGs is a hot topic in Airflow, but from my point of view, Airflow itself doesn't provide a straightforward way to implement that. You'll need to balance the pros and cons depending on your use case.
As a good starting point, you can check the Astronomer guide for dynamically generating DAGs. There you have all the options available and some ideas of the pros and cons. Make sure you check out the scalability considerations to see the implications in terms of performance.
For your use case, I think the best approach will be either the Create_DAG method (under Single-File Methods) or the DAG Factory. I personally prefer the first one because it works like a factory (in terms of programming patterns), but you have the flexibility to create everything you need for each DAG. With the second, you won't have much control over what you create, and it requires an external dependency.
Another interesting article about dynamically generating DAGs is "How to build a DAG Factory on Airflow".
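To make the single-file "create_dag" idea concrete, here is a minimal sketch. The LOCATIONS mapping, the export_data callable, the 06:00 schedule, and the dag_id naming are all assumptions for illustration, not anything from your setup; because the start_date is timezone aware, the cron schedule fires at local time for each location.

```python
# Minimal sketch of the single-file create_dag approach for per-timezone schedules.
# LOCATIONS, export_data, and the schedule are hypothetical placeholders.
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical mapping of location -> timezone; replace with your own source.
LOCATIONS = {
    "new_york": "America/New_York",
    "london": "Europe/London",
    "tokyo": "Asia/Tokyo",
}


def export_data(location, **context):
    # Placeholder for the actual export logic.
    print(f"Exporting data for {location} on {context['ds']}")


def create_dag(location, tz_name):
    tz = pendulum.timezone(tz_name)
    with DAG(
        dag_id=f"daily_export_{location}",
        schedule_interval="0 6 * * *",  # 06:00 local time, because start_date is tz-aware
        start_date=pendulum.datetime(2023, 1, 1, tz=tz),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="export",
            python_callable=export_data,
            op_kwargs={"location": location},
        )
    return dag


# Register one DAG per location in the module namespace so the scheduler picks them up.
for loc, tz_name in LOCATIONS.items():
    globals()[f"daily_export_{loc}"] = create_dag(loc, tz_name)
```

The common tasks you mentioned can live inside create_dag (or in shared helper functions or a custom operator), so they are defined once even though many DAGs are generated.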

Related

Cyclic Workflow in Data Comparison Process

I am searching for a solution to automate an iterative data comparison process until all data packages are consistent. My general idea is to use something like Apache Airflow, but the iterative nature seems to call for a cyclic graph, and Apache Airflow only allows DAGs (directed acyclic graphs). Since I don't have much knowledge of Airflow yet, I am a bit lost and would appreciate some expert knowledge here.
Current status: I am in a position where I regularly need to compare data packages for consistency and communicate errors to and between the two different parties manually.
On the one hand there is a design dataset, and on the other hand there are measured datasets. Both datasets involve many manual steps from different parties. So if an inconsistency occurs, I contact one or the other party and the error is removed manually. There are also regular changes to both datasets that can introduce new errors to already checked datasets.
I guess this process has not been automated yet because the datasets are not directly comparable; some transformations need to be done in between. I automated this transformation process over the last few weeks, so all that needs to be done now from my side is to run the script and communicate the errors.
What I would need now is a tool that orchestrates my script against the correct datasets and contacts the relevant people as long as errors exist. If something changes or is added, the script needs to be run again.
My first guess was that I would need to create a workflow in Apache Airflow, but this iterative process seems to me like a cyclic graph, which is not allowed in Airflow. Do you have any suggestions, or is this a common scenario for which solutions with Airflow exist?
I think one way to solve your problem could be to have a DAG workflow for the main task of comparing the datasets and sending notifications, and then run a periodic task in Cron, Quartz, etc., that triggers that DAG workflow. You are correct that Airflow does not allow cyclic workflows.
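As a rough illustration of that idea, here is a sketch where the DAG itself stays acyclic and the "loop" comes from simply re-running it on a schedule until no inconsistencies remain. The find_inconsistencies and notify_parties callables are hypothetical stand-ins for your comparison script and notification logic.

```python
# Minimal sketch: a scheduled DAG that compares datasets and notifies only when errors exist.
# The comparison and notification logic are placeholders, not a real implementation.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def find_inconsistencies(**context):
    # Placeholder: run the existing comparison script and collect errors.
    errors = []  # e.g. compare design dataset against measured datasets
    context["ti"].xcom_push(key="errors", value=errors)
    return bool(errors)  # if falsy, ShortCircuitOperator skips the downstream tasks


def notify_parties(**context):
    errors = context["ti"].xcom_pull(task_ids="find_inconsistencies", key="errors")
    # Placeholder: contact the responsible party for each inconsistency.
    print(f"Notifying about {len(errors)} inconsistencies")


with DAG(
    dag_id="data_consistency_check",
    schedule_interval="@hourly",  # the periodic trigger mentioned above
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    check = ShortCircuitOperator(
        task_id="find_inconsistencies",
        python_callable=find_inconsistencies,
    )
    notify = PythonOperator(
        task_id="notify_parties",
        python_callable=notify_parties,
    )
    check >> notify
```

Each run is acyclic; the iteration happens across runs, which stops mattering once a run finds no errors and short-circuits the notification step.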
I worked on Cylc, a cyclic graph workflow tool. Cyclic workflows (or workflows with loops) are very common in areas such as Numerical Weather Prediction (NWP, the reason Cylc was created), and also in other fields such as optimization.
In NWP workflows, some steps may be waiting for datasets, and the workflow may stall and send notifications if the data is not as expected (e.g. some satellite imaging data is missing, and the tides model output file is missing too).
Also, in production, NWP models run multiple times a day, either because you have new observation data, or new input data, or maybe because you want to run ensemble models, etc. So you end up with multiple runs of the workflow in parallel, where the workflow manager is responsible for managing dependencies, optimizing the use of resources, sending notifications, and more.
Cyclic workflows are complicated, that's probably why most implementations opt to support only DAGs.
If you'd like to try Cylc, the team has been trying to make it more generic so that it's not specific to NWP only. It has a new GUI, and the input format and documentation were improved with ease of use in mind.
There are other tools that support cyclic workflows too, such as StackStorm, Prefect, and I am currently checking if Autosubmit supports it too. Take a look at these tools if you'd like.
If you are in the life sciences, or are interested in reproducible workflows, the CWL standard also has some ongoing discussion about adding support for loops, which could allow you to achieve something akin to what you described, I reckon.

Why use CustomOperator over PythonOperator in Apache Airflow?

As I'm using Apache Airflow, I can't seem to find why someone would create a CustomOperator instead of using a PythonOperator. Wouldn't it lead to the same results if I used a Python function inside a PythonOperator instead of a CustomOperator?
If someone knows the different use cases and best practices, that would be great!
Thanks a lot for your help.
Both operators, while similar, are really at different abstraction levels, and depending on your use case, one may be a better fit than the other.
Code defined in a CustomOperator can be easily reused by multiple DAGs. If you have a lot of DAGs that need to perform the same task, it may make more sense to expose this code to the DAGs via a CustomOperator.
PythonOperator is very general and is a better fit for one-off, DAG-specific tasks.
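To illustrate the reuse angle, here is a hedged sketch of a custom operator for a made-up "export a table" task that several DAGs might share; the operator name, fields, and export logic are invented for the example, not from any provider package.

```python
# Minimal sketch of a reusable custom operator; the export logic is a placeholder.
from airflow.models.baseoperator import BaseOperator


class ExportTableOperator(BaseOperator):
    """Reusable operator: export a table, parameterized per DAG via the constructor."""

    template_fields = ("table_name",)  # allow Jinja templating, e.g. date-suffixed tables

    def __init__(self, table_name, target_bucket, **kwargs):
        super().__init__(**kwargs)
        self.table_name = table_name
        self.target_bucket = target_bucket

    def execute(self, context):
        # The shared, tested export logic lives here once, instead of being
        # copy-pasted into a python_callable in every DAG that needs it.
        self.log.info("Exporting %s to %s", self.table_name, self.target_bucket)
        # ... actual export code ...


# In any DAG file that needs it:
# export = ExportTableOperator(
#     task_id="export_orders",
#     table_name="orders",
#     target_bucket="s3://my-bucket/exports",
# )
```

With a PythonOperator you would instead write a python_callable inline, which is perfectly fine for logic that only one DAG will ever use.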
At the end of the day, the default set of operators provided in Airflow is just a set of tools. Which tool you end up using (default operators), or whether it makes sense to create your own custom tool (custom operators), is a choice determined by a number of factors:
The type of task you are trying to accomplish.
Code organization requirements necessitated by policy or by the number of people maintaining the pipeline.
Simple personal taste.

Data sharing between two tasks in an Airflow DAG

I want to run a Hive query using the HiveOperator, and the output of that query should be passed to a Python script run via a PythonOperator. Is this possible, and how?
One approach to this kind of problem in general is to use Airflow's XComs - see the documentation.
However, I would use this sparingly and only where strictly necessary. Otherwise, you risk ending up with your operators being quite tangled and interdependent.
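Here is a minimal sketch of the XCom mechanism itself. The Hive query is stubbed out as a plain Python callable, since whether the Hive operator or hook in your environment pushes its result to XCom depends on the operator and version you use; the point is only how data moves between tasks.

```python
# Minimal sketch of passing a small result between tasks via XCom.
# run_query is a placeholder for fetching the Hive query output.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_query(**context):
    # Placeholder for the Hive query; whatever this function returns is pushed
    # to XCom automatically under the key "return_value".
    return [("2023-01-01", 42), ("2023-01-02", 17)]


def process_result(**context):
    rows = context["ti"].xcom_pull(task_ids="run_query")
    print(f"Received {len(rows)} rows from the upstream task")


with DAG(
    dag_id="hive_to_python_example",
    schedule_interval=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    query = PythonOperator(task_id="run_query", python_callable=run_query)
    process = PythonOperator(task_id="process_result", python_callable=process_result)
    query >> process
```

Keep in mind that XComs are stored in Airflow's metadata database, which is another reason to use them only for small results, as noted above, rather than for bulk query output.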

How to set up/monitor huge amounts of equivalent DAGs

I am new to Airflow and am still learning the concepts.
I am trying to monitor a huge number of webpages (>1000) once a day.
At the moment I dynamically create a single DAG for each webpage (data acquisition and processing). This works from a functional point of view. However, looking at the user interface, I find the number of DAGs overwhelming, and my question is:
Is this the right way to do it? (a single DAG for each webpage)
Is there any way to get a better overview of how the monitoring of all webpages is doing?
Since all DAGs are equivalent and only deal with a different url, it made me think that grouping these DAGs together or having a common overview might be possible or at least a good idea.
E.g. if the acquisition or processing of a certain webpage is failing I would like to see this easily in the UI without having to scroll many pages to find a certain DAG.
You should just have one DAG with multiple tasks. Based on the information you provided, the only thing that seems to change is the URL, so it's better to have one DAG and many tasks.
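A minimal sketch of that idea follows, assuming a hypothetical check_page function and a made-up URL list; each webpage becomes one task inside a single DAG, so a failing page shows up as a single red task in one DAG run instead of a separate DAG you have to scroll to.

```python
# Minimal sketch: one DAG, one task per monitored URL.
# URLS and check_page are placeholders for your own list and logic.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

URLS = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    # ... up to 1000+ entries, e.g. loaded from a file or an Airflow Variable
]


def check_page(url, **context):
    # Placeholder for the acquisition and processing logic for one page.
    print(f"Checking {url}")


with DAG(
    dag_id="webpage_monitoring",
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    for i, url in enumerate(URLS):
        PythonOperator(
            task_id=f"check_page_{i}",
            python_callable=check_page,
            op_kwargs={"url": url},
        )
```

The DAG's grid/graph view then gives you the single overview you asked for, and you can still group related pages with task groups or fan the tasks out in batches if 1000+ tasks per run becomes a scheduling concern.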

How to force tasks in Microsoft Project to be "scheduled" based on priority and resource assignment?

I needed to do some higher-level project planning that doesn't really fit into the workflow of our day-to-day task management tools (FogBugz and whiteboards), so I figured I'd give MS Project a whirl (it being free through MSDN).
I've hit a pretty solid wall, though. What I have is about 120 tasks, a set of people (referring to them as "resources" is amazingly harsh to me, but I digress), and a rough prioritization of those tasks. Some tasks have a person assigned to them, some don't (simply because we don't know who's going to do what yet).
Fine so far. The problem is that, except in those relatively rare instances where tasks are linked (most of the work involved can be done in any order), all of the tasks are scheduled to run concurrently. What I'd like to do is have Project figure out some scheduling scenario based upon:
the defined tasks
their relative priority
any links/dependencies, if defined
the availability of the people that I've defined, while respecting the explicit "resource" assignments I've already made
Is this possible? I've fiddled with the resource leveling dialog and read more MS Project documentation than I'd care to admit, so any suggestions are welcome.
FWIW, I noticed in my searches this question on Yahoo Answers; the person there seems to be after roughly the same thing, but I figured asking here might be more fruitful.
After some further experimentation, I've found a partial solution for my own question. If you:
assign a person to each task
select all your tasks, click the Task Info button, and specify in the Advanced tab of the Task Information panel that all tasks should:
use a calendar (called "standard" in my project file)
not ignore resource calendars when scheduling
have a constraint of As Soon As Possible (which is the default, I believe)
Choose Level Resources from the Tools menu, and specify:
Look for overallocations on an Hour by Hour basis
a leveling order of "Priority, Standard" (which rolls in the relative Priority values for each task you've defined when setting the schedule)
Click "Level Now" in that leveling resources dialog, and all of the tasks should be rescheduled so that they're not running concurrently, and that no one is "overscheduled".
You can ostensibly have Project automatically reschedule things as tasks are added, edited, etc., but I suspect that would result in chaos, as there's nothing about the resource leveling process that makes me think it's "stable" (i.e. that two levelings performed back-to-back would yield the same schedule).
It would be nice if Project would "fully allocate" whatever people you have configured, so that you don't have to assign people to tasks just to have those tasks scheduled in a way that is consistent, if not correct. Any thoughts on that front would be most welcome.
That seems (and feels!) like a lot of work, but I think the result is relatively decent -- a super-high-level view of a project that allows for a high degree of day to day flexibility, but still affords one a way to reasonably make plans around "interdisciplinary" activities (e.g. once this is done, we need to buy those four servers, make sure our legal stuff is taken care of, and pull the trigger on that marketing push one week later, etc).
