How to set up/monitor a huge number of equivalent DAGs - Airflow

I am new to Airflow and am still learning the concepts.
I am trying to monitor a huge number of webpages (>1000) once a day.
At the moment I dynamically create a single DAG for each webpage (data acquisition and processing). This works from a functional point of view. However, looking at the user interface I find the number of DAGs overwhelming, and my question is:
Is this the right way to do it? (a single DAG for each webpage)
Is there any way to get a better overview of how the monitoring of all webpages is doing?
Since all the DAGs are equivalent and differ only in the URL they handle, it made me think that grouping them together or having a common overview might be possible, or at least a good idea.
E.g. if the acquisition or processing of a certain webpage is failing I would like to see this easily in the UI without having to scroll many pages to find a certain DAG.

You should just have one DAG with multiple tasks. Based on the information you provided, the only thing that seems to change is the URL, so it is better to have one DAG containing many tasks.
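To make that concrete, here is a minimal sketch of the idea: one DAG that loops over the URL list and creates one task per page. The fetch_and_process callable and the urls list are placeholders I made up for illustration, not something from the question.

```python
# Minimal sketch: one DAG, one task per URL (names are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical list of >1000 URLs

def fetch_and_process(url, **context):
    # acquisition and processing for a single webpage would go here
    print(f"processing {url}")

with DAG(
    dag_id="webpage_monitoring",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for i, url in enumerate(urls):
        PythonOperator(
            task_id=f"process_page_{i}",
            python_callable=fetch_and_process,
            op_kwargs={"url": url},
        )
```

With this layout the UI shows a single DAG, and a failed acquisition or processing step for one page shows up as a single red task instance in that DAG's grid view.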

Related

Cyclic Workflow in Data Comparison Process

I am searching for a solution to automate an iterative data comparison process until all data packages are consistent. My general guess is to use something like Apache Airflow, but the iterative nature seems to imply a cyclic graph, and Apache Airflow only allows DAGs (directed acyclic graphs). Since I don't have much knowledge of Airflow, I am a bit lost and would appreciate some expert knowledge here.
Current status: I am in a position where I regularly need to compare data packages for consistency and manually communicate errors to and between the two different parties.
On the one hand there is a design data set and on the other hand there are measured data sets. Both datasets involve many manual steps from different parties. So if an inconsistency occurs, I contact one or the other party and the error is removed manually. There are also regular changes to both data sets that can introduce new errors to already checked datasets.
I guess this process has not been automated yet because the datasets are not directly comparable; some transformations need to be done in between. I automated this transformation process over the last few weeks, so all that needs to be done now from my side is to run the script and communicate the errors.
What I need now is a tool that orchestrates my script against the correct datasets and contacts the appropriate people as long as errors exist. If something changes or is added, the script needs to be run again.
My first guess was to create a workflow in Apache Airflow, but this iterative process looks to me like a cyclic graph, which is not allowed in Airflow. Do you have any suggestions, or is this a common scenario for which Airflow-based solutions also exist?
I think one way to solve your problem could be to have a DAG workflow for the main task of comparing the datasets and sending notifications, and then run a periodic task in Cron, Quartz, etc. that triggers that DAG workflow. You are correct that Airflow does not allow cyclic workflows.
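As a rough sketch of that idea, the "loop" becomes a periodic re-run of a linear DAG: compare, then notify if the comparison failed. Everything here is a placeholder (the compare_datasets callable, the schedule, and the email address), just to show the shape of such a workflow.

```python
# Hedged sketch: a linear DAG that compares datasets and notifies on failure,
# re-run on a schedule instead of looping within one run.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

def compare_datasets(**context):
    # call your existing transformation/comparison script here;
    # raise an exception when inconsistencies are found
    ...

with DAG(
    dag_id="dataset_consistency_check",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",  # the iteration happens across runs, not within one
    catchup=False,
) as dag:
    compare = PythonOperator(task_id="compare", python_callable=compare_datasets)
    notify = EmailOperator(
        task_id="notify_on_failure",
        to="responsible.party@example.com",  # placeholder address
        subject="Dataset inconsistency found",
        html_content="The comparison script reported errors.",
        trigger_rule="one_failed",  # only runs when the compare task fails
    )
    compare >> notify
```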
I worked on Cylc, a cyclic graph workflow tool. Cyclic workflows (or workflows with loops) are very common in areas such as Numerical Weather Prediction (NWP), which is the reason Cylc was created, and also in other fields such as optimization.
In NWP workflows, some steps may be waiting for datasets, and the workflow may stall and send notifications if the data is not as expected (e.g. some satellite imaging data is missing, and the tides model output file is missing too).
Also, in production, NWP models run multiple times a day, either because you have new observation data, or new input data, or maybe because you want to run ensemble models, etc. So you end up with multiple runs of the workflow in parallel, where the workflow manager is responsible for managing dependencies, optimizing the use of resources, sending notifications, and more.
Cyclic workflows are complicated, that's probably why most implementations opt to support only DAGs.
If you'd like to try Cylc, the team has been trying to make it more generic so that it's not specific to NWP only. It has a new GUI, and the input format and documentation were improved with ease of use in mind.
There are other tools that support cyclic workflows too, such as StackStorm and Prefect, and I am currently checking whether Autosubmit supports them as well. Take a look at these tools if you'd like.
If you are in the life sciences, or are interested in reproducible workflows, the CWL standard also has some ongoing discussion about adding support for loops, which I reckon could allow you to achieve something akin to what you described.

Schedule tasks at fixed time in multiple timezones

I'm getting started with Airflow and I'm not sure how to approach this problem:
I'm building a data export system that should run at a fixed time daily for different locations. My issue is that the locations span several timezones, and the definition of the day's start/end changes depending on the timezone.
I saw in the documentation that I can make a DAG timezone-aware, but I'm not sure creating hundreds of DAGs with different timezones is the right approach. I also have some common tasks, so multiple DAGs create more complexity or duplication in the tasks performed.
Is there an Airflow-idiomatic way to achieve this? I think it would be a fairly common use case to build reports that are timezone-dependent, but I didn't find any information about it.
Dynamic DAG generation is a hot topic in Airflow, but from my point of view Airflow itself doesn't provide a straightforward way to implement it. You'll need to balance the pros and cons depending on your use case.
As a good starting point, you can check the Astronomer guide for dynamically generating DAGs. There you have all the available options and some ideas about the pros and cons. Make sure you check the scalability considerations to see the implications in terms of performance.
For your use case, I think the best approach will be either the Create_DAG Method (under Single-File Methods) or the DAG Factory. I personally prefer the first one because it works like a factory (in terms of programming patterns), but you keep the flexibility to create everything you need for each DAG. With the second you won't have much control over what you create, and it requires an external dependency.
Another interesting article about dynamically generating DAGs is "How to build a DAG Factory on Airflow".
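To make the single-file factory idea concrete, here is a rough sketch of a create_dag function that generates one timezone-aware DAG per location. The locations list, the export_data callable, and the 06:00 schedule are placeholders I invented for illustration, not something prescribed by the guide.

```python
# Hedged sketch of a single-file "create_dag" factory: one DAG per location,
# each scheduled at a fixed local time in that location's timezone.
from datetime import datetime

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

# placeholder configuration; in practice this could come from a file or database
locations = [
    {"name": "berlin", "tz": "Europe/Berlin"},
    {"name": "tokyo", "tz": "Asia/Tokyo"},
]

def export_data(location, **context):
    print(f"exporting data for {location}")

def create_dag(name, tz):
    local_tz = pendulum.timezone(tz)
    with DAG(
        dag_id=f"daily_export_{name}",
        start_date=datetime(2023, 1, 1, tzinfo=local_tz),
        schedule_interval="0 6 * * *",  # 06:00 local time for each location
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="export",
            python_callable=export_data,
            op_kwargs={"location": name},
        )
    return dag

# register the generated DAGs in the module's globals so the scheduler can find them
for loc in locations:
    globals()[f"daily_export_{loc['name']}"] = create_dag(loc["name"], loc["tz"])
```

Common tasks can live in shared helper functions imported by the factory, so the per-location DAGs stay thin and duplication is limited to configuration.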

How should multi-part Codecept.js scenarios be organized?

What is the preferred (or just a good) pattern for a multi-part Codecept.js scenario, such as this:
Select file to upload.
Clear selection.
Select file to upload after having cleared selection.
I can do this in a single scenario and use I.say to delineate the parts, but it feels like I should be able to write these as independent scenarios and use .only on part 2, for example, and have part 1 run prior to part 2, because it depends on it.
I would also want to skip parts 2 and 3 if part 1 fails when running the whole suite.
I like thinking about behaviour in terms of capabilities. I can see that you have a couple here:
Uploading files
Correcting mistakes while uploading files.
So I would expect these to be in two scenarios:
One where you actually upload the file
One where you correct mistakes you make.
A lot of people say there should only be one "When" in scenarios, but that doesn't take into account interactions with people (including your past mistaken self) or time. In this situation, it's the whole process of correcting the file upload that provides the value. I can't see any value in the intermediate steps, so I'd leave them all in one scenario.
If there's any different behaviour associated with different contexts (eg: you've already got too many files uploaded) or outcomes (eg: your file system doesn't have room) or rules (eg: your status means you qualify for super-fast upload) then I would expect those to be new scenarios. If you start getting to the point where there are a lot of scenarios associated with file uploads and different things that happen to them, that might be a good time to separate this scenario out. Right now I can't see any reason to do that.
Re failing the first part: if you're doing BDD right, you'll be talking through not just the behaviour of your system, but the behaviour of individual bits of code too. That should help produce a good design which minimizes the chances of having bugs. Really good BDD teams produce scenarios that hardly ever catch bugs!
The scenarios act as living documentation, rather than regression tests; helping future devs understand the value of the code and get it right, rather than nailing it down to stop them getting it wrong.
So I wouldn't worry about it failing. If it's doing that a lot, you've got a different problem. Code it as if it's going to be passing most of the time, and make sure it's readable and comprehensible. As long as you can see when it fails and work it out, even if that takes a little bit of time, it'll be fine.
Having said that, I'd be surprised if Codecept doesn't have at least an option to stop on failure. Most BDD tools don't continue a scenario after a failed step; it would be an odd thing to do.
As far as I know, you're not able to set an execution order or priority in CodeceptJS. It's better to make it one test; that will also be more flexible if you need to add or delete some part later.

Getting large number of entities from datastore

Following this question, I am able to store a large number (>50k) of entities in the datastore. Now I want to access all of them in my application and perform mathematical operations on them, but it always times out. One way is to use the TaskQueue again, but that would be an asynchronous job. I need a way to access these 50k+ entities in my application and process them without timing out.
Part of the accepted answer to your original question may still apply, for example a manually scaled instance with a 24h deadline, or a VM instance. For a price, of course.
Some speedup may be achieved by using memcache.
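For example, a computed aggregate could be cached so repeated requests don't re-read all the entities. This is only a sketch under the old GAE Python runtime; the cache key and the compute callable are made-up names.

```python
# Hedged sketch: cache an expensive aggregate in memcache for an hour.
from google.appengine.api import memcache

def get_cached_aggregate(compute):
    result = memcache.get("pagestats:aggregate")  # placeholder key
    if result is None:
        result = compute()  # your expensive computation over the entities
        memcache.set("pagestats:aggregate", result, time=3600)
    return result
```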
Side note: depending on the size of your entities you may need to keep an eye on the instance memory usage as well.
Another possibility would be to switch to a faster instance class (and with more memory as well, but also with extra costs).
But all such improvements might still not be enough. The best approach would still be to give your entity data processing algorithm a deeper thought - to make it scalable.
I'm having a hard time imagining a computation so monolithic that it can't be broken into smaller pieces that don't need all the data at once. I'm almost certain there has to be some way of using partial computations, maybe storing some partial results, so that you can split the problem and handle it in smaller pieces across multiple requests.
As an extreme (academic) example think about CPUs doing pretty much any super-complex computation fundamentally with just sequences of simple, short operations on a small set of registers - it's all about how to orchestrate them.
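One common way to express that split, if you're on the old GAE Python runtime with NDB, is to walk the entities in batches with query cursors; the cursor can be persisted between requests or deferred tasks so no single request touches all 50k+ entities. The PageStat model and process_batch function below are hypothetical stand-ins for your data and computation.

```python
# Rough sketch: batch processing with NDB query cursors (hypothetical model/names).
from google.appengine.ext import ndb

class PageStat(ndb.Model):
    value = ndb.FloatProperty()

def process_batch(entities):
    # partial computation over one batch; in practice, persist the partial
    # result (or the cursor) so later requests/tasks can continue the work
    return sum(e.value for e in entities)

def process_all(batch_size=500):
    total = 0.0
    cursor = None
    more = True
    while more:
        entities, cursor, more = PageStat.query().fetch_page(
            batch_size, start_cursor=cursor)
        total += process_batch(entities)
    return total
```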
Here's a nice article describing a drastic reduction in the overall duration of a computation (no clue if it's anything like yours) using a clever approach (also interesting because it uses the GAE Pipeline API).
If you post your code you might get some more specific advice.

How to force tasks in Microsoft Project to be "scheduled" based on priority and resource assignment?

I needed to do some higher-level project planning that doesn't really fit into the workflow of our day-to-day task management tools (FogBugz and whiteboards), so I figured I'd give MS Project a whirl (it being free through MSDN).
I've hit a pretty solid wall, though. What I have is about 120 tasks, a set of people (referring to them as "resources" is amazingly harsh to me, but I digress), and a rough prioritization of those tasks. Some tasks have a person assigned to them, some don't (simply because we don't know who's going to do what yet).
Fine so far. The problem is that, except in those relatively rare instances where tasks are linked (most of the work involved can be done in any order), all of the tasks are scheduled to run concurrently. What I'd like to do is have Project figure out some scheduling scenario based upon:
the defined tasks
their relative priority
any links/dependencies, if defined
the availability of the people that I've defined, while respecting the explicit "resource" assignments I've already made
Is this possible? I've fiddled with the resource leveling dialog and read more MS Project documentation than I'd care to admit, so any suggestions are welcome.
FWIW, I noticed in my searches this question on Yahoo Answers; the person there seems to be after roughly the same thing, but I figured asking here might be more fruitful.
After some further experimentation, I've found a partial solution for my own question. If you:
assign a person to each task
in the Advanced tab of the Task Information panel, specify that all tasks should (select all your tasks and click the Task Info button to update these properties for all of them at once):
use a calendar (called "standard" in my project file)
not ignore resource calendars when scheduling
have a constraint of As Soon As Possible (which is the default, I believe)
Choose Level Resources from the Tools menu, and specify:
Look for overallocations on an Hour by Hour basis
a leveling order of "Priority, Standard" (which rolls in the relative Priority values for each task you've defined when setting the schedule)
Click "Level Now" in that leveling resources dialog, and all of the tasks should be rescheduled so that they're not running concurrently, and that no one is "overscheduled".
You can ostensibly have Project automatically reschedule things as tasks are added, edited, etc., but I suspect that would result in chaos, as there's nothing about the resource leveling process that makes me think it's "stable" (i.e. that two levelings performed back-to-back would yield the same schedule).
It would be nice if Project would "fully allocate" whatever people you have configured, so that you don't have to assign people to tasks just to have those tasks scheduled in a way that is consistent, if not correct. Any thoughts on that front would be most welcome.
That seems (and feels!) like a lot of work, but I think the result is relatively decent -- a super-high-level view of a project that allows for a high degree of day to day flexibility, but still affords one a way to reasonably make plans around "interdisciplinary" activities (e.g. once this is done, we need to buy those four servers, make sure our legal stuff is taken care of, and pull the trigger on that marketing push one week later, etc).
