Kubeflow vs. generic workflow orchestrator? - airflow

I am struggling to understand the functional role of Kubeflow (KF) compared with other (generic) workflow orchestrators.
I know KF is oriented to ML tasks and is built on top of Argo.
Three questions:
1. Can KF be used at a higher level as a workflow orchestrator to perform more generic tasks (e.g. ETL) whose outcome might be useful in the following ML tasks?
2. Can I use all functionalities of Argo within KF?
3. What can a generic workflow orchestrator (such as Airflow, Argo, etc.) do that KF cannot?

Yes, you can create a Python function component or a general container with code baked in that executes whatever task you like.
Pre-defined components - https://www.kubeflow.org/docs/components/pipelines/sdk-v2/component-development/
Python function components - https://www.kubeflow.org/docs/components/pipelines/sdk-v2/python-function-components/
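For a concrete picture, here is a minimal sketch assuming the KFP v2 SDK (kfp 1.8+); the component names, base image, and URI are illustrative, not taken from the docs above. A generic ETL step runs first and its output feeds a following ML step:

    # Minimal sketch, assuming the KFP v2 SDK. All names below are illustrative.
    from kfp.v2 import dsl, compiler
    from kfp.v2.dsl import component

    @component(base_image="python:3.9")
    def extract_rows(source_uri: str) -> int:
        # Any generic task can run here: pull data, transform it, load it.
        print(f"Extracting from {source_uri}")
        return 42  # e.g. number of rows processed

    @component(base_image="python:3.9")
    def train_model(row_count: int):
        # A following ML task that consumes the ETL step's output.
        print(f"Training on {row_count} rows")

    @dsl.pipeline(name="etl-then-train")
    def etl_then_train(source_uri: str = "gs://my-bucket/data.csv"):
        etl = extract_rows(source_uri=source_uri)
        train_model(row_count=etl.output)

    if __name__ == "__main__":
        # Compiles to a pipeline spec that the KFP backend executes.
        compiler.Compiler().compile(
            pipeline_func=etl_then_train, package_path="pipeline.json"
        )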
KFP is an abstraction on top of Argo Workflows.
It gives you the ability to create workflows using Python instead of writing YAML files. Check out this article: https://towardsdatascience.com/build-your-data-pipeline-on-kubernetes-using-kubeflow-pipelines-sdk-and-argo-eef69a80237c
Since Argo Workflows development is advancing independently from KFP, it's safe to assume there will be missing features in KFP (which the community will add according to demand).
That's a big question.
In general, Airflow has sensors, an SLA feature, a huge store of operators/sensors/reports/plugins, and a bigger community, since it's not ML-oriented.
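To make two of those features concrete, here is a hedged sketch of an Airflow 2.x DAG using a FileSensor and a task-level SLA; the file path, schedule, and task bodies are made-up placeholders:

    # Hedged sketch, assuming Airflow 2.x. Paths and names are illustrative.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.filesystem import FileSensor

    with DAG(
        dag_id="sensor_and_sla_example",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Sensor: wait until an upstream file lands before continuing.
        wait_for_file = FileSensor(
            task_id="wait_for_file",
            filepath="/data/incoming/export.csv",  # illustrative path
            poke_interval=60,
        )

        # SLA: flag the run if this task hasn't finished within 2 hours of
        # the scheduled time (requires alerting to be configured).
        process = PythonOperator(
            task_id="process",
            python_callable=lambda: print("processing"),
            sla=timedelta(hours=2),
        )

        wait_for_file >> process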

Related

Cyclic Workflow in Data Comparison Process

I am searching for a solution to automate an iterative data comparison process until all data packages are consistent. My general guess is to use something like Apache Airflow, but the iterative nature seems to form a cyclic graph, and Apache Airflow only allows DAGs (directed acyclic graphs). Since I don't have much knowledge of Airflow, I am a bit lost and would appreciate some expert knowledge here.
Current status: I am in a position where I regularly need to compare data packages for consistency and manually communicate errors to and between two different parties.
On the one hand there is a design dataset, and on the other hand there are measured datasets. Both datasets involve many manual steps from different parties. So if an inconsistency occurs, I contact one party or the other and the error is removed manually. There are also regular changes to both datasets that can introduce new errors into already-checked datasets.
I guess this process has not been automated yet because the datasets are not directly comparable; some transformations need to happen in between. I automated this transformation process over the last few weeks, so all that needs to be done now from my side is to run the script and communicate the errors.
What I need now is a tool that orchestrates my script against the correct datasets and contacts the relevant people as long as errors exist. In case something changes or is added, the script needs to run again.
My first guess was that I would need to create a workflow in Apache Airflow, but this iterative process seems to me to be a cyclic graph, which is not allowed in Airflow. Do you have any suggestions, or is this a common scenario for which Airflow solutions also exist?
I think one way to solve your problem could be to have a DAG workflow for the main task of comparing the datasets and sending notifications, then run a periodic task in Cron, Quartz, etc., that triggers that DAG workflow. You are correct that Airflow does not allow cyclic workflows.
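As a hedged sketch of a closely related pattern (assuming Airflow 2.x; the helper functions are hypothetical stand-ins for your script), the DAG can re-trigger itself until the comparison passes, so each run is one iteration of the "cycle" while the graph itself stays acyclic:

    # Sketch of the loop-by-re-triggering pattern, assuming Airflow 2.x.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import ShortCircuitOperator
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator

    def run_comparison_script():
        # Placeholder: call the transformation/comparison script here and
        # return the list of inconsistencies found (empty = consistent).
        return []

    def notify_parties(errors):
        # Placeholder: contact the responsible party for each error.
        print(f"Notifying about {len(errors)} errors")

    def compare_and_notify() -> bool:
        errors = run_comparison_script()
        if errors:
            notify_parties(errors)
        return bool(errors)  # True keeps the loop going

    with DAG(
        dag_id="data_comparison_loop",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # ShortCircuitOperator skips downstream tasks when it returns False,
        # i.e. once the data is consistent and the loop should stop.
        still_inconsistent = ShortCircuitOperator(
            task_id="compare_and_notify",
            python_callable=compare_and_notify,
        )

        # Re-trigger this same DAG for the next iteration.
        next_iteration = TriggerDagRunOperator(
            task_id="trigger_next_iteration",
            trigger_dag_id="data_comparison_loop",
        )

        still_inconsistent >> next_iteration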
I worked on Cylc, a cyclic graph workflow tool. Cyclic workflows (or workflows with loops) are very common in areas such as Numerical Weather Prediction NWP (reason why Cylc was created), and also in other fields such as optimization.
In NWP workflows, some steps may be waiting for datasets, and the workflow may stall and send notifications if the data is not as expected (e.g. some satellite imaging data is missing, or the tide model output file is missing).
Also, in production, NWP models run multiple times a day, either because you have new observation data, or new input data, or maybe because you want to run ensemble models, etc. So you end up with multiple runs of the workflow in parallel, where the workflow manager is responsible for managing dependencies, optimizing the use of resources, sending notifications, and more.
Cyclic workflows are complicated, that's probably why most implementations opt to support only DAGs.
If you'd like to try Cylc, the team has been trying to make it more generic so that it's not specific to NWP only. It has a new GUI, and the input format and documentation were improved with ease of use in mind.
There are other tools that support cyclic workflows too, such as StackStorm, Prefect, and I am currently checking if Autosubmit supports it too. Take a look at these tools if you'd like.
If you are in life sciences, or are interested in reproducible workflows, the CWL standard also has some ongoing discussion about adding support for loops, which could allow you to achieve something akin to what you described, I reckon.

Schedule tasks at a fixed time in multiple timezones

I'm getting started with Airflow and I'm not sure how to approach this problem:
I'm building a data export system that should run at a fixed time daily for different locations. My issue is that the locations span several timezones, and the definition of day start/end changes depending on the timezone.
I saw in the documentation that I can make a DAG timezone-aware, but I'm not sure creating hundreds of DAGs with different timezones is the right approach. I also have some common tasks, so multiple DAGs create more complexity or duplication in the tasks performed.
Is there an Airflow-idiomatic way to achieve this? I think it would be a fairly common use case to build reports that are timezone-dependent, but I didn't find any information about it.
Dynamic DAGs are a hot topic in Airflow, but from my point of view, Airflow itself doesn't provide a straightforward way to implement them. You'll need to balance the pros and cons depending on your use case.
As a good starting point, you can check the Astronomer guide for dynamically generating DAGs. There you have all the available options and some ideas of the pros and cons. Make sure you check out the scalability considerations to see the implications in terms of performance.
From your use case, I think the best approach will be either the Create_DAG method (under Single-File Methods) or the DAG Factory. I personally prefer the first one because it's like a factory (in terms of programming patterns), but you have the flexibility to create all the files you need for each DAG. With the second you won't have much control over what you create, and an external dependency is needed.
Another interesting article about dynamically generating DAGs is "How to build a DAG Factory on Airflow".
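As a hedged sketch of that single-file factory approach (assuming Airflow 2.x with pendulum; the locations, timezones, and schedule below are illustrative), each location gets its own timezone-aware DAG while sharing the common task logic:

    # Sketch of the single-file Create_DAG approach, assuming Airflow 2.x.
    from datetime import datetime
    import pendulum
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    LOCATIONS = {
        "paris": "Europe/Paris",
        "new_york": "America/New_York",
        "tokyo": "Asia/Tokyo",
    }

    def create_dag(location: str, tz_name: str) -> DAG:
        tz = pendulum.timezone(tz_name)
        dag = DAG(
            dag_id=f"daily_export_{location}",
            # The timezone-aware start_date makes the cron schedule fire
            # at 06:00 local time for this location.
            start_date=datetime(2023, 1, 1, tzinfo=tz),
            schedule_interval="0 6 * * *",
            catchup=False,
        )
        with dag:
            # Common task logic is defined once and reused by every DAG.
            PythonOperator(
                task_id="export",
                python_callable=lambda loc=location: print(f"exporting {loc}"),
            )
        return dag

    # Register each generated DAG in globals() so the scheduler finds it.
    for loc, tz_name in LOCATIONS.items():
        globals()[f"daily_export_{loc}"] = create_dag(loc, tz_name)

This way the shared logic lives in one file while the day-boundary differences stay per-location, instead of hand-writing hundreds of near-identical DAG files.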

Create instance specification code in Rhapsody

I am working on a Rhapsody SysML project for work and we need to be able to model different configurations of our system. To give a concrete example, if our system is a vehicle, we want to be able to simulate that vehicle with different configurations of engines, wheels, etc.
This is my first time using SysML, but the book A Practical Guide to SysML discusses, in chapter 7, the concept of Instance Specifications. These sound like exactly what we need, and Rhapsody appears to have support for them. So we created an Instance Specification in Rhapsody, giving it specific values for the engine and wheels. But once we created the instance specification, we could not find any way to actually create an instance from that specification. We noticed that Rhapsody doesn't even generate any code for the instance specification.
So my questions are the following: can Instance Specifications be used to create different configurations of a system, and if so, how? If not, what is the best method for modeling different configurations of a system?
Thanks for any help you can provide.

Using Rule Flow in InRule for Workflow

I see that Rule Flow supports actions, so it may be possible to build some types of workflow on top of it. In my situation I have a case management application with tasks for different roles, all working on a "document" that flows through different states; depending on the state, a different role will see it in their queue to work on.
I'm not sure what your question is, but InRule comes with direct support for Windows Workflow Foundation, so executing any InRule RuleApplication, including those with RuleFlow definitions, is certainly possible.
If you'd like assistance setting up this integration, I would suggest utilizing the support knowledge base and forums at http://support.inrule.com
Full disclosure: I am an InRule Technology employee.
For case management scenarios, you can use decisions specifically to model a process. Create a custom table or flags in your cases that depict the transition points in your process (steps). As you transition between steps, call a decision that determines whether the data state is good enough to make the transition. If it is, set the flag for the new state. Some folks allow multiple states at the same time. InRule is a stateless platform; however, when used with CRM it provides 95% of the process logic and relies on CRM to do the persistence. I have written about this pattern in a white paper:
https://info.inrule.com/rs/inruletechnology/images/White_Paper_InRule_Salesforce_Integration.pdf
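As a language-agnostic sketch of that pattern (written in Python purely for illustration; all names are hypothetical, and the decision function merely stands in for an InRule decision call):

    # Hypothetical sketch of the stateless-decision pattern described above.
    from dataclasses import dataclass, field

    @dataclass
    class Case:
        doc_id: str
        state: str = "draft"
        data: dict = field(default_factory=dict)

    def transition_decision(case: Case, target_state: str) -> bool:
        # Stand-in for invoking a decision: is the data state good enough
        # to make the transition to the target step?
        required = {"review": ["author"], "approved": ["author", "reviewer"]}
        return all(k in case.data for k in required.get(target_state, []))

    def transition(case: Case, target_state: str, store: dict) -> bool:
        # The rules side stays stateless; the store (standing in for CRM)
        # persists the new state flag only when the decision passes.
        if transition_decision(case, target_state):
            case.state = target_state
            store[case.doc_id] = case.state
            return True
        return False

    store = {}
    case = Case("doc-1", data={"author": "alice"})
    print(transition(case, "review", store))    # True: author present
    print(transition(case, "approved", store))  # False: reviewer missing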

Multi-Branching in Microsoft Workflow Foundation 4

I'm working on a BPMS project with WF4. For implementing human activities I used a custom native activity that executes several functions. A bookmark is created with it; the WF instance is persisted and the workflow is unloaded until the next call.
My problem is exactly fork-join in Workflow Foundation 4; I don't know how I can do it.
I found that the Parallel activity executes each of its child activities, and only when all of them are finished can the workflow continue. I also know about PickBranch and its functionality. But in my project there is another kind of activity that is like a combined parallel and branching activity.
I want to have multiple branches of sequences that can run alongside each other and reach the end without any dependency on the other sequences. I think it's like multiple-instance workflows. I also need to join branches in some situations, which is fork-join. Maybe one of the branches reaches the end of the workflow while another is still in the middle of its sequence.
Does WF4 support multi-branching? Can I do it?
WF4 doesn't support fork-joins. You need to model this using a Parallel activity and/or custom activities with bookmarks.
