Cyclic Workflow in Data Comparison Process - airflow

I am searching for a solution to automate an iterative data comparison process until all data packages are consistent. My general guess is to use something like Apache Airflow, but the iterative nature seems to call for a cyclic graph, and Apache Airflow only allows DAGs (directed acyclic graphs). Since I don't have much knowledge of Airflow, I am a bit lost and would appreciate some expert knowledge here.
Current status: I am in a position where I regularly need to compare data packages for consistency and manually communicate errors to and between two different parties.
On the one hand there is a design data set, and on the other hand there are measured data sets. Both data sets involve many manual steps by different parties. So if an inconsistency occurs, I contact one or the other party and the error is removed manually. There are also regular changes to both data sets that can introduce new errors into already checked data sets.
I guess this process has not been automated yet because the data sets are not directly comparable; some transformations need to be done in between. I automated this transformation process over the last few weeks, so all that needs to be done from my side now is to run the script and communicate the errors.
What I need now is a tool that orchestrates my script against the correct data sets and contacts the appropriate people as long as errors exist. If something changes or is added, the script needs to be run again.
My first guess was that I would need to create a workflow in Apache Airflow, but this iterative process seems to me like a cyclic graph, which is not allowed in Airflow. Do you have any suggestions, or is this a common occurrence for which Airflow-based solutions also exist?

I think one way to solve your problem could be to have a DAG workflow for the main task of comparing the datasets and sending notifications, and then run a periodic task in Cron, Quartz, etc., that triggers that DAG workflow. You are correct that Airflow does not allow cyclic workflows.
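As a minimal sketch of that idea (assuming Airflow 2.x; the schedule, the body of run_comparison, and the e-mail addresses are hypothetical placeholders, not anything from your setup), a scheduled DAG could run the comparison and only notify the parties when errors are found:

    # Sketch: one DAG that runs the comparison script and notifies the
    # responsible parties only when inconsistencies are found.
    # run_comparison and the addresses are hypothetical placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator, ShortCircuitOperator
    from airflow.operators.email import EmailOperator


    def run_comparison(**context):
        # Call your existing transformation/comparison script here and
        # push the list of errors to XCom for the downstream tasks.
        errors = []  # e.g. errors = my_comparison_module.compare(design, measured)
        context["ti"].xcom_push(key="errors", value=errors)
        return errors


    def has_errors(**context):
        # Only continue to the notification task if errors were found.
        return bool(context["ti"].xcom_pull(task_ids="compare", key="errors"))


    with DAG(
        dag_id="dataset_consistency_check",
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 6 * * *",  # run every morning; adjust as needed
        catchup=False,
    ) as dag:
        compare = PythonOperator(task_id="compare", python_callable=run_comparison)
        check = ShortCircuitOperator(task_id="check_for_errors", python_callable=has_errors)
        notify = EmailOperator(
            task_id="notify_parties",
            to=["design-team@example.com", "measurement-team@example.com"],
            subject="Dataset inconsistencies found",
            html_content="{{ ti.xcom_pull(task_ids='compare', key='errors') }}",
        )
        compare >> check >> notify

If the comparison also needs to run whenever the data changes, the same DAG can additionally be triggered on demand with airflow dags trigger dataset_consistency_check, which would cover the "something changes or is added" case from the question.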
I worked on Cylc, a cyclic graph workflow tool. Cyclic workflows (or workflows with loops) are very common in areas such as Numerical Weather Prediction (NWP, the reason Cylc was created) and also in other fields such as optimization.
In NWP workflows, some steps may be waiting for datasets, and the workflow may stall and send notifications if the data is not as expected (e.g. some satellite imaging data is missing, and the tides model output file is missing too).
Also, in production, NWP models run multiple times a day, either because you have new observation data, or new input data, or maybe because you want to run ensemble models, etc. So you end up with multiple runs of the workflow in parallel, where the workflow manager is responsible for managing dependencies, optimizing the use of resources, sending notifications, and more.
Cyclic workflows are complicated, that's probably why most implementations opt to support only DAGs.
If you'd like to try Cylc, the team has been trying to make it more generic so that it's not specific to NWP only. It has a new GUI, and the input format and documentation were improved with ease of use in mind.
There are other tools that support cyclic workflows too, such as StackStorm and Prefect, and I am currently checking whether Autosubmit supports them as well. Take a look at these tools if you'd like.
If you are in the life sciences, or are interested in reproducible workflows, the CWL standard also has some ongoing discussion about adding support for loops, which could allow you to achieve something akin to what you described, I reckon.

Related

Schedule tasks at fixed time in multiple timezones

I'm getting started with Airflow and I'm not sure how to approach this problem:
I'm building a data export system that should run at a fixed time daily for different locations. My issue is that the locations span several timezones, and the definition of day start/end changes depending on the timezone.
I saw in the documentation that I can make a DAG timezone aware, but I'm not sure creating hundreds of DAGs with different timezones is the right approach. I also have some common tasks, so multiple DAGs create more complexity or duplication in the tasks performed.
Is there an Airflow-idiomatic way to achieve this? I think building reports that are timezone dependent would be a fairly common use case, but I didn't find any information about it.
Dynamic DAGs are a hot topic in Airflow, but from my point of view, Airflow itself doesn't provide a straightforward way to implement them. You'll need to balance the pros and cons depending on your use case.
As a good starting point, you can check the Astronomer guide for dynamically generating DAGs. There you have all the available options and some ideas of the pros and cons. Make sure you check the scalability considerations to see the implications in terms of performance.
From your use case, I think the best approach will be either the Create_DAG method (under Single-File Methods) or the DAG Factory. I personally prefer the first one because it works like a factory (in terms of programming patterns), but you keep the flexibility to create everything you need for each DAG. With the second you won't have as much control over what you create, and an external dependency is needed.
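To make the Create_DAG idea concrete, here is a rough sketch for the timezone case (my own hypothetical example, not taken from the Astronomer guide; the location list and the task body are placeholders). One file defines a create_dag function and registers one timezone-aware DAG per location, so the common tasks are written once:

    # Sketch of the single-file "create_dag" approach for per-timezone DAGs.
    from datetime import datetime

    import pendulum
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    LOCATIONS = {
        "new_york": "America/New_York",
        "paris": "Europe/Paris",
        "tokyo": "Asia/Tokyo",
    }


    def create_dag(location: str, tz_name: str) -> DAG:
        tz = pendulum.timezone(tz_name)
        dag = DAG(
            dag_id=f"daily_export_{location}",
            start_date=datetime(2023, 1, 1, tzinfo=tz),
            schedule_interval="0 6 * * *",  # interpreted as 06:00 local time
            catchup=False,
        )
        with dag:
            PythonOperator(
                task_id="export",
                python_callable=lambda: print(f"exporting data for {location}"),
            )
        return dag


    # Register one DAG per location in the module's global namespace,
    # which is how Airflow discovers dynamically generated DAGs.
    for loc, tz_name in LOCATIONS.items():
        globals()[f"daily_export_{loc}"] = create_dag(loc, tz_name)

Because each DAG gets a timezone-aware start_date, the cron expression is evaluated in that DAG's timezone, so every location runs at its own local 06:00.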
Another interesting article about dynamically generating DAGs is "How to build a DAG Factory on Airflow".

Is it possible to run two websocket connections asynchronously or use the same variables in two separate sessions in R?

I'm setting up an algorithm in R which involves performing tasks on data that is streamed from a websocket. I have successfully implemented a connection to a websocket (using the package https://github.com/rstudio/websocket), but my algorithm does not perform optimally due to the linear processing of data received from the websocket (some important tasks get delayed because less important ones get triggered before them). As the tasks could easily be divided, I am wondering:
1) whether it would be possible to run two websocket connections simultaneously, provided that there is a single data frame (as a global variable) that gets updated in one instance and is accessible in the other?
2) is it possible to check the queue from the websocket and prioritize certain tables?
3) I am also considering a solution with two separate R sessions, but I am not sure whether there is a way to access data that gets updated in real time in another R session. Is there a workaround that does not involve saving a table in one session and loading it in the other?
I have already tried this with async (https://github.com/r-lib/async) without much success, and I have also tried the 'Jobs' panel in newer versions of RStudio.
I would be happy to provide some code, but at this point I think it might be irrelevant, as my question is more about extending the code than fixing it. I am also aware that Python probably offers an easier solution, but I would still like to exhaust every option that R offers.
Any help would be greatly appreciated!
In the end I was able to accomplish this by running two separate R instances and connecting them using R sockets.

How to set up/monitor huge amounts of equivalent DAGs

I am new to Airflow and am still learning the concepts.
I am trying to monitor a huge number of webpages (>1000) once a day.
At the moment I dynamically create a single DAG for each webpage (data acquisition and processing). This works from a functional point of view. However, looking at the user interface I find the number of DAGs overwhelming, and my questions are:
Is this the right way to do it? (a single DAG for each webpage)
Is there any way to get a better overview of how the monitoring of all webpages is doing?
Since all DAGs are equivalent and only deal with a different URL, it made me think that grouping these DAGs together, or having a common overview, might be possible or at least a good idea.
E.g. if the acquisition or processing of a certain webpage is failing, I would like to see this easily in the UI without having to scroll through many pages to find a certain DAG.
You should just have one DAG with multiple tasks. Based on the information you provided, the only thing that seems to change is the URL, so it is better to have one DAG with many tasks.
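As a minimal sketch of that suggestion (the URL list, task names, and check logic are placeholders), a single DAG can generate one task per webpage in a loop, so a failure for any page shows up as one red task inside a single DAG view:

    # Sketch: one DAG with one acquisition/processing task per monitored URL.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    URLS = [
        "https://example.com/page-a",
        "https://example.com/page-b",
        # ... in practice, load the full list from a file or an Airflow Variable
    ]


    def check_page(url: str):
        # Put the actual acquisition and processing logic here.
        print(f"checking {url}")


    with DAG(
        dag_id="webpage_monitoring",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        for i, url in enumerate(URLS):
            PythonOperator(
                task_id=f"check_page_{i}",
                python_callable=check_page,
                op_kwargs={"url": url},
            )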

How to approach writing developer tests (unit tests, integration tests, etc) for a system?

I have a WCF service which runs and interacts with a database, the file system, and a few external web services, then creates the result, XML-serializes it, and finally returns it.
I'd like to write tests for this solution and I'm thinking about how to do it (it all uses dependency injection and design by contract).
There are 3 main approaches I can take.
1) I can pick the smallest units of code/methods and write tests for them: pick one class and isolate it from its dependencies (other classes, etc.). Although this guarantees quality, it takes a lot of time to write the tests, and that's slow.
2) Only make the interaction with external systems mockable and write some tests that cover the main scenarios from when the request is made until the response is serialized and returned. This will test all the interactions between my classes but mock all external resource access.
3) I can set up a test environment where the interaction with external web services does happen, file access happens, database access happens, etc., and then write the tests end to end. This requires environment setup and depends on all other systems being up and running.
About #1, I see no point in investing the time/money/energy in writing tests for every single method or piece of code that I have. I mean, it's a waste of time.
About #3, since it depends on external resources/systems, it's hard to set up and keep running.
#2 sounds like the best option to me, since it will test what should be tested: only my system and all its classes, while mocking all external systems.
So basically, my conclusion after some years of experience with unit tests is that writing unit tests is a waste to be avoided, and that isolated system tests are instead the best return on investment.
Even if I were going to write the tests first (TDD) and then the production code, I still think #2 would be best.
What's your view on this? Would you write small unit tests for your application? Would you consider it a good practice and the best use of time/budget/energy?
If you want to talk about quality, you should have all 3:
Unit tests to ensure your code does what you think it does, to expose any edge cases, and to help with regression. You (the developer) should write such tests.
Integration tests to verify the correctness of the entire process, whether components talk to each other correctly, and so on. Again, you as a developer write such tests.
System-wide tests in a production-like environment (with some limitations, naturally: you might not have access to the client database, but you should have an exact copy of it on your local machines). Those tests are usually written by dedicated testers (often in programming languages different from the application code), but of course they can also be written by you.
The second and third types of tests (integration and system) take far too much effort to cover the edge cases of smaller components; that is what you usually want unit tests for. You need integration tests because something might still fail when hooking up tested, verified, and correct modules. And system tests are, of course, what you either run daily during development or have assigned people (manual testers) perform.
Going for only selected types of tests from the list might work up to a point, but it is far from a complete solution for quality software.
All 3 are important and target different test types: a matrix of unit/integration/system categories with positive and negative testing in each category.
For code coverage, unit testing will yield the highest percentage, followed by integration, then system.
You also need to consider whether the purpose of a test is validation (will it meet the final user/customer requirements, i.e. value) or verification (is it written to specification, i.e. correct).
In summary, the answer is 'it depends', and I would recommend following the SEI CMMI model for verification and validation (i.e. testing), which begins with the goals (value) of each activity and then subjects that activity to measures that ultimately allow the whole process to undergo continuous improvement. In this way you have separated the What and Why from the How, and you will be able to answer time and value questions for your given environment (which could be a life-support system or a 'Tweet of the day to your favorite aunt' app).
Summary: #2 (integration testing) seems most logical, but you shouldn't hesitate to use a variety of tests to achieve the best coverage for the pieces of your codebase that need it most. Shooting for tests for "everything" is not a worthy goal.
Long version
There is a school of thought out there where devs are convinced that adopting unit/integration/system tests means striving to have every single chunk of code tested. It's either no test coverage at all, or a commitment to testing "everything". This binary thinking always makes adopting any kind of testing strategy seem very expensive.
The truth is, forcing every single line of code/function/module to be tested is about as sound as writing all your code to be as fast as possible. It takes too much time and effort, and most of it nets very little return. Another truth is that you can never achieve true 100% coverage in a non-trivial project.
Testing is not a goal unto itself. It's a means to achieve other things: final product quality, maintainability, interoperability, and so on, all while expending the least amount of effort possible.
With that in mind, step back and evaluate your particular circumstances. Why do you want to "write tests for this solution"? Are you unhappy with the overall quality of the project today? Have you experienced high regression rates? Are you perhaps unsure about how some module works (and more importantly, what bugs it might have)? Regardless of what your exact goal is, you should be able to select pieces that pose particular challenges and focus your attention on them. Depending on what those pieces are, an appropriate testing approach can be selected.
If you have a particularly tricky function or class, consider unit testing it. If you're faced with a complicated architecture with multiple, hard-to-understand interactions, consider writing integration tests to establish a clean baseline for your trickiest scenarios and to better understand where the problems are coming from (you'll probably flush out some bugs along the way). System testing can help if your concerns are not addressed by more localized tests.
Based on the information you provided for your particular scenario, external-facing unit/integration testing (#2) looks most promising. It seems like you have a lot of external dependencies, so I'd guess that is where most of the complexity hides. Comprehensive unit testing (#1) is a superset of #2, with all the extra internal coverage carrying questionable value. #3 (full system testing) will probably not let you exercise external edge cases/error conditions as well as you would like.
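For illustration, here is a language-neutral sketch of the #2 approach in Python (the question concerns a WCF/.NET service; the class, method, and dependency names below are hypothetical stand-ins, not from the question). The service is exercised through its public entry point while only the external boundaries are replaced with test doubles:

    # Sketch of approach #2: real internal logic, mocked external boundaries.
    from unittest.mock import Mock


    class OrderReportService:
        """Hypothetical stand-in for the real service: externals are injected."""

        def __init__(self, repository, rate_api):
            self.repository = repository
            self.rate_api = rate_api

        def handle_request(self, customer_id):
            orders = self.repository.load_orders(customer_id)   # database boundary
            rate = self.rate_api.get_rate("EUR")                 # web-service boundary
            items = "".join(
                f'<order id="{o["id"]}" total="{o["total"] * rate:.2f}"/>'
                for o in orders
            )
            return f"<report>{items}</report>"


    def test_request_is_processed_and_serialized():
        # Fake only the external dependencies; everything else runs for real.
        repository = Mock()
        repository.load_orders.return_value = [{"id": 1, "total": 10.0}]
        rate_api = Mock()
        rate_api.get_rate.return_value = 1.2

        service = OrderReportService(repository, rate_api)
        xml = service.handle_request(customer_id=42)

        # The whole request-to-serialized-response path is covered in one test.
        assert '<order id="1" total="12.00"/>' in xml
        repository.load_orders.assert_called_once_with(42)

The same shape carries over to .NET with a mocking framework such as Moq: inject fakes at the database/file/web-service boundaries and drive the service through its contract.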

'make'-like dependency-tracking library?

There are many nice things to like about Makefiles, and many pains in the butt.
In the course of doing various projects (I'm a research scientist, "data scientist", or whatever) I often find myself starting out with a few data objects on disk, generating various artifacts from those, generating artifacts from those artifacts, and so on.
It would be nice if I could just say "this object depends on these other objects", and "this object is created in the following manner from these objects", and then ask a Make-like framework to handle the details of actually building them, figuring out which objects need to be updated, farming out work to multiple processors (like Make's -j option), and so on. Makefiles can do all this - but the huge problem is that all the actions have to be written as shell commands. This is not convenient if I'm working in R or Perl or another similar environment. Furthermore, a strong assumption in Make is that all targets are files - there are some exceptions and workarounds, but if my targets are e.g. rows in a database, that would be pretty painful.
To be clear, I'm not after a software-build system. I'm interested in something that (more generally?) deals with dependency webs of artifacts.
Anyone know of a framework for these kinds of dependency webs? Seems like it could be a nice tool for doing data science, & visually showing how results were generated, etc.
One extremely interesting example I saw recently was IncPy, but it looks like it hasn't been touched in quite a while, and it's very closely coupled with Python. It's probably also much more ambitious than I'm hoping for, which is why it has to be so closely coupled with Python.
Sorry for the vague question, let me know if some clarification would be helpful.
A new system called "Drake" was announced today that targets this exact situation: http://blog.factual.com/introducing-drake-a-kind-of-make-for-data . Looks very promising, though I haven't actually tried it yet.
This question is several years old, but I thought adding a link to remake here would be relevant.
From the GitHub repository:
The idea here is to re-imagine a set of ideas from make but built for R. Rather than having a series of calls to different instances of R (as happens if you run make on R scripts), the idea is to define pieces of a pipeline within an R session. Rather than being language agnostic (like make must be), remake is unapologetically R focussed.
It is not on CRAN yet, and I haven't tried it, but it looks very interesting.
I would give Bazel a try for this. It is primarily a software build system, but with its genrule type of artifacts it can perform pretty arbitrary file generation, too.
Bazel is very extensible, using its Python-like Starlark language, which should be far easier to use for complicated tasks than make. You can start by writing simple genrule steps by hand, then refactor common patterns into macros, and, if things become more complicated, even write your own rules. So you should be able to express your individual transformations at a high level that models how you think about them, then turn that representation into lower-level constructs using something that feels like a proper programming language.
Where make depends on timestamps, Bazel checks fingerprints. So if any one step produces the same output even though one of its inputs changed, subsequent steps won't need to be re-computed. If some of your data processing steps project or filter data, there might be a high probability of this happening.
I see your question is tagged R, even though it doesn't mention it much. Under the hood, R computations in Bazel would still boil down to R CMD invocations on the shell, but you could have complicated multi-line commands assembled in complicated ways to read your inputs, process them, and store the outputs. If the cost of initializing the R binary is a concern, Rserve might help, although using it would make the setup depend on a locally accessible Rserve instance, I believe. Even with that, I see nothing that would avoid the cost of storing the data to a file and loading it back. If you want something that avoids that cost by keeping things in memory between steps, then you'd be looking at a very R-specific tool, not a generic tool like you requested.
In terms of “visually showing how results were generated”, bazel query --output graph can be used to generate a graphviz dot file of the dependency graph.
Disclaimer: I'm currently working at Google, which internally uses a variant of Bazel called Blaze. Actually Bazel is the open-source released version of Blaze. I'm very familiar with using Blaze, but not with setting up Bazel from scratch.
Red-R has a concept of data flow programming. I have not tried it yet.
