How is Flyte tailored to "Data and Machine Learning"? - airflow

https://flyte.org/ says that it is
The Workflow Automation Platform for Complex, Mission-Critical Data and Machine Learning Processes at Scale
I went through quite a bit of documentation and I fail to see why it is "Data and Machine Learning". It seems to me that it is a workflow manager on top of a container orchestration layer (here Kubernetes), where "workflow manager" means that I can define a Directed Acyclic Graph (DAG), the DAG nodes are deployed as containers, and the DAG is run.
Of course this is useful and important for "Data and Machine Learning", but I might as well use it for any other microservice DAG. Except for features/details, how is this different from https://airflow.apache.org or other workflow managers (of which there are many)? There are even more specialized workflow managers for "Data and Machine Learning", e.g., https://spark.apache.org.
What should I keep in mind as a Software Architect?

That is a great question. You are right about one thing: at its core it is a serverless workflow orchestrator (serverless, because it brings up the infrastructure to run the code). And yes, it can be used in many other situations, though it may not be the best tool for some of them, such as microservice orchestration.
But what really makes it good for ML & data orchestration is a combination of:
Features (list below)
Integrations (list below)
Community of folks using it
Roadmap
Features
Long-running tasks: it is designed for extremely long-running tasks, tasks that can run for days or weeks. Even if the control plane goes down, you will not lose the work, and you can keep deploying without impacting existing work.
Versioning: allows multiple users to work independently on the same workflow, using different libraries, models, inputs, etc.
Memoization: take a pipeline with 10 steps; you can memoize the first 9, and if the 10th fails, or you modify the 10th, the results from the previous 9 are reused. This leads to drastically faster iteration (a minimal code sketch follows this Features section).
Strong typing and ML-specific type support
Flyte understands dataframes and is able to translate between spark.DataFrame -> pandas.DataFrame -> Modin -> Polars, etc., without the user having to think about how to do it efficiently. It also supports things like tensors (correctly serialized), NumPy arrays, etc. Models can be saved and retrieved from past executions, so it is in fact the model truth store.
Native support for intra-task checkpointing: this can help in recovering model training between node failures, and even across executions, with new support being added for checkpointing callbacks.
Flyte Decks: a way to visualize metrics like ROC curves, or automatic visualization of the distribution of the data input to a task.
Extensible programming interface that can orchestrate distributed jobs or run them locally, e.g., Spark, MPI, SageMaker
Reference tasks for library isolation
Scheduler independent of user code
Understanding of resources like GPUs: automatically schedules on GPUs and/or spot machines, with smart handling of spot machines (the first n-1 retries run on spot; the last one is automatically moved to an on-demand machine for better guarantees)
Map tasks and dynamic tasks: map over a list (e.g., a list of regions); dynamic tasks create new static graphs based on inputs at runtime
Multiple launch plans: schedule two runs of a workflow with slightly different hyperparameters or model values, etc.
For Admins
For really long-running tasks, admins can redeploy the management layer without killing the tasks
Support for spot/ARM/GPU machines (with different versions, etc.)
Quotas and throttles per project/domain
Upgrade infra without upgrading user libraries
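To make the memoization and typing points above concrete, here is a minimal flytekit sketch, assuming a recent flytekit release; the task and workflow names, the CSV URL parameter, and the "metric" returned by train are purely illustrative placeholders, not Flyte's own examples.

```python
# Minimal sketch of Flyte memoization and typed dataframe passing.
# Assumes a recent flytekit release; names and logic are illustrative.
import pandas as pd
from flytekit import task, workflow


@task(cache=True, cache_version="1.0")
def load_data(url: str) -> pd.DataFrame:
    # Cached: re-running with the same input and cache_version reuses
    # this output instead of recomputing it (the "memoization" above).
    return pd.read_csv(url)


@task
def train(df: pd.DataFrame) -> float:
    # Placeholder "training" step; returns a dummy metric. If this step
    # fails or is modified, load_data's cached result is reused on rerun.
    return float(len(df))


@workflow
def pipeline(url: str) -> float:
    # The DataFrame is passed between tasks as a strongly typed value;
    # Flyte handles serialization between steps.
    df = load_data(url=url)
    return train(df=df)
```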
Integrations
Native support for pandas DataFrames
Spark
MPI jobs (gang-scheduled)
Pandera / Great Expectations for data quality
SageMaker
Easy deployment of models for serving
Polars / Modin / Spark dataframes
Tensors / checkpointing, etc.
and many others on the roadmap
Community
Focused on ML-specific features
Roadmap
CD4ML, with human-in-the-loop and external-signal-based workflows. This will allow users to automate deployment of models or perform human-in-the-loop labeling, etc.
Support for Ray/Spark/Dask cluster re-use across tasks
Integration with WhyLogs and other tools for monitoring
Integration with MLFlow etc
More native Flyte Decks renderers
Hopefully this answers your questions. Please also join the Slack community, help spread this information, and ask more questions.

Related

airflow create sub process based on number of files

A newbie question in Airflow:
I have a list of 100+ servers in a text file. Currently, a Python script is used to log in to each server, read a file, and write the output, and it takes a long time to get the output. If this job is converted to an Airflow DAG, is it possible to split the servers into multiple groups and have a new task initiated for each group using some operator? Or can this only be achieved by modifying the Python script (e.g., using async) and executing it with the PythonOperator? Seeking advice/best practices. I tried searching for examples but was not able to find one. Thanks!
Airflow is not really a "map-reduce" type of framework (which is what you seem to be trying to implement). Airflow tasks are not (at least currently) designed to split the work between them. It is very atypical for Airflow to have N tasks that each do the same thing on a subset of the data. Airflow is more for orchestrating the logic: each task in Airflow conceptually does a different thing, and there are rarely cases where N parallel tasks do the same thing (on different subsets of data). More often than not, Airflow tasks do not "do" the job themselves; they rather tell other systems what to do and wait until that gets done.
Typically, Airflow is used to orchestrate services that excel at this kind of processing; for example, you could have a Hadoop job that handles such "parallel", map-reduce-style work using other tools. You could also, as you mentioned, write an async, multi-threaded or even multi-processing PythonOperator, but at some scale I think dedicated tools are usually much easier to use and better at getting the most value (with efficient utilization of parallelism, for example).
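As a rough illustration of the "multi-threaded PythonOperator" option mentioned above, here is a minimal sketch assuming Airflow 2.x; the servers.txt path and the fetch_output helper are hypothetical stand-ins for the existing script's logic.

```python
# Sketch only: one Airflow task that fans I/O-bound work out over threads.
# Assumes Airflow 2.x; paths and helper names are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_output(server: str) -> str:
    # Placeholder for "log in to the server and read a file".
    return f"collected from {server}"


def collect_all():
    with open("/opt/data/servers.txt") as f:  # hypothetical path
        servers = [line.strip() for line in f if line.strip()]
    # I/O-bound work parallelizes well with threads inside a single task.
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(fetch_output, servers))
    return len(results)


with DAG(
    dag_id="collect_server_outputs",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # triggered manually; assumes the Airflow 2.4+ `schedule` argument
    catchup=False,
) as dag:
    PythonOperator(task_id="collect_all", python_callable=collect_all)
```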

What are some popular frameworks available for implementing CQRS, Event Sourcing and Saga for data consistency and distributed transactions?

I would like to know some popular frameworks that are available for implementing CQRS, ES, Saga in the application.
As a part of my research, I have to compare these frameworks and evaluate them based on various -ilities.
I have to compare these [event-sourcing] frameworks and evaluate them based on various -ilities.
The premise of the question is that you need a framework to implement event sourcing but, in fact, you do not.
Greg Young, one of the most influential proponents of event sourcing, frequently expresses his misgivings about frameworks. See, for instance, his QCon London 2013 keynote, especially around the 9-minute mark.
Event sourcing is conceptually simple and doesn't need the kind of magic that frameworks typically bring with them. For instance, rebuilding the state from a stream of events simply consists in a left fold over the stream in question. Moreover, you don't necessarily need a specialised database; I know people who have successfully implemented event sourcing by simply appending events to a file.
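As a minimal sketch of that "left fold over the stream" idea (the bank-account event shapes below are made up purely for illustration, not tied to any framework):

```python
# State rebuilt as a left fold over an event stream; event shapes are
# illustrative only.
from functools import reduce


def apply_event(balance: int, event: dict) -> int:
    # Each event type is a pure state transition.
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance


events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
]

# Current state = left fold of the transition function over the events.
current_balance = reduce(apply_event, events, 0)
print(current_balance)  # 70
```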
If your research aims at comparing event-sourcing frameworks, I would argue that you should consider the case where no framework is used at all.
Axon is a popular framework/server for building CQRS/ES applications.
EventStoreDB is a popular EventStore database for the EventSourcing part.
A simple starting point if you want to write your own framework/library is to check out some of the code I co-authored at https://www.cqrs.nu/
If you are looking for a managed solution, you can also check out what we at Serialized provide.
In addition to Axon, on the JVM there's also the Akka ecosystem (the cluster sharding, persistence, sharded daemon process, and projection modules are the most relevant to CQRS/ES/DDD). One benefit of Akka Persistence is the ability to choose from a variety of datastores to use as an event store (JDBC SQL databases and Cassandra are the most common, but there are many more supported). My experience with it has been that it is capable of exceptionally high availability and since it allows a stateful event-sourced application to be deployed as if it's stateless (e.g. in Kubernetes without needing an operator) there's a lot of deployment flexibility. Note that because it's built on the actor model, a lot of JVM observability tooling doesn't work particularly well with it (often assuming a stronger mapping of threads to tasks), so certain commercially-licensed observability tooling is recommended.
Additionally, Kalix also provides a polyglot (all you need is to express domain logic in a language which supports grpc) event-sourcing implementation.
Disclaimer: almost a year after writing this answer, I became employed by Lightbend, the maintainers of Akka and the provider of Kalix.

Best .Net Actor/Process framework for coordinating Local and Networked clusters

We have a process that involves loading a large block of data, applying some transformations to it, and then outputting what has changed. We currently run a web app where multiple instances of these large blocks of data are processed in the same CLR instance, and this leads to garbage collection thrashing and OOM errors.
We have proven that hosting some tracked state in a longer running process works perfectly to solve our main problem. The issue we now face is, as a stateful system, we need to host it and manage coordination with other parts of the system (also change tracking instances).
I'm evaluating Actors in Service Fabric and Akka at the moment. There are a number of other options, but before I proceed, I would like people's thoughts on this approach, given the following considerations:
We have a natural partition point in our system (Authority) which means we can divide our top level data set easily. Each partition will be represented by a top level instance that needs to organise a few sub-actors in its own local cluster, but we would expect a single host machine to be able to run multiple clusters.
Each Authority Cluster of actors would ideally be hosted together on a single machine to benefit from local communication and some use of shared local resources to get around limits on message size.
The actors themselves should be separate processes on the same box (Akka seems to run local actors in the same CLR instance, which would crash everything on OOM - is this true?). This will enable me to spin up a process, run the transformation through it, emit the results, and tear it down without impacting the other instances' memory / GC. I appreciate that hardware resource contention would still be a problem, but I expect this to be more memory- than CPU-intensive, so expect a RAM-heavy box.
Because the data model is quite large, and the messages can contain either model fragments or changes to model fragments, it's difficult to work with immutability. We do not want to clone every message payload into internal state and apply it to the model, so ideally any actor solution used would enable us to work with the original message payload. This may cause problems with restoring an actor state as it wants to save and replay these on wakeup, but as we have state tracking internally, we can just store the resulting output of this on sleep.
We need a coordinator that can spin up instances of an Authority Cluster. There needs to be some elasticity in terms of the number of VM's/Machines and the number of Authority Clusters hosted on them, and something needs to handle creation and destruction of these.
We have a lot of .NET code; all our models, transformations and validation are defined in it and will need to be heavily re-used. Whatever the solution, it will need to support .NET.
My questions then are:
While this feels like a good fit for Actors, I have reservations and wonder if there is something more appropriate? Everything I have tried has come back to a hosted process of some kind.
If actors are the right way to go, which tech stack would put me closest to what I am trying to achieve with the above concerns taken into account?
IMO (coming at this from a JVM Akka perspective, which is why I changed the akka tag to akka.net; I don't have great knowledge of the CLR side of things), there seems to be a mismatch between
We do not want to clone every message payload into internal state and apply it to the model, so ideally any actor solution used would enable us to work with the original message payload.
and
The actors themselves should be separate processes on the same box (Akka seems to run local Actors in the same CLR instance, which would crash everything on OOM - is this true?)
Assuming that you're talking about the same OS process, those are almost certainly mutually incompatible: exchanging messages strongly suggests serialization and is thus isomorphic to a copy operation. It's possible that something using shared memory between OS processes could work, but you may well have to make a choice about which is more important.
Likewise, the parent/child relationship in the "traditional" (Erlang/Akka) style actor model trivially gives you the local cluster of actors (which, since they're running in the same OS process allows the Akka optimization of not copying messages until you cross an OS process boundary), while "virtual actor" implementations as found in Service Fabric or Orleans (or, I'd argue Cloudstate or Lagom) basically assume distribution.
Semantically, the virtual actor models implicitly assume that actors are eternal (though their eternal essence may not always be incarnate). For your use-case, this doesn't necessarily seem to be the case.
I think a cluster of Akka.Net instances with sharded Authority actors spawning shorter-lived child actors best fits, assuming that you're getting OOM issues from trying to process multiple large blocks of data simultaneously. You would have to implement the instance scale-up/down logic yourself.
I have not worked with Akka.net so I can't speak to that at all, but I'd be happy to speak to what you're talking about in a Service Fabric context.
Service Fabric has no issue with the concept of running multiple clusters. In its terminology, the whole of your system would be called an Application and would have a version when deployed to the SF cluster. If you wanted to create multiple instances of it, all you'd need to do is select what you wanted to call the deployed app instance and it'll stand up provisioning for you.
SF has a notion of placement constraints, metric balancing and custom rules that you can utilize if you think you can better balance the various resources than its automatic balancing (or you need to for network DMZ purposes). While I've never personally grouped things down to a single machine, I frequently limit access of services to single VM scale sets (we host in Azure).
To the last point though, you'll still have message size limits, but you can also override them to some degree. In your project containing service interfaces, just set the following attribute above your namespace:
[assembly:FabricTransportRemotingSettings(MaxMessageSize=<(long)new size in bytes>)] and you're good to go.
Services can be configured to run using a Shared or Exclusive process model.
Regarding your state requirement, it's not necessarily clear to me what you're trying to do, but I think you're saying that it's not critical that your actors store any state, since they can work from some centrally-provided model.
You might then look at volatile state persistence, as it means state is kept for the actors in memory; should you lose the replicas, nothing is written to disk, so it's all lost. Or, if you don't care and are OK just sending the model to the actors for any work, you can configure them to be stateless.
On the other hand, if you're still looking to retain state in the actors and are simply concerned about immutability, rest assured that actor state isn't immutable and can be updated trivially. There are simply order-of-operations concerns you need to keep in mind (e.g., if you retrieve the state, make a change, and save it: 1) you must commit the transaction for it to take effect, and 2) if you modify the state but don't save it, it obviously won't persist; pull a fresh copy in a new transaction for any modifications). There's a whole pile of guidelines here.
Assuming your coordinator is intended to save some sort of state, might I recommend a singleton stateful service. Presumably it's not receiving an inordinate amount of use so a single instance is sufficient and it can easily save state (without the annoyance of identifying which state is on which partition). As for spinning up services, I covered this in the first bullet, but use the ApplicationManager on the built-in FabricClient to set up new applications and the ServiceManager to create instances of necessary services within each.
Service Fabric supports .NET Core 3.1 through .NET 5 as of the latest 8.0 release, though note there is a minor serialization issue, with an easy workaround, on .NET 5.
If you have an Azure support subscription, I'd encourage you to write to the team under Development questions and share your concerns. Alternatively, on the third Thursday of each month at 10 AM PST, they also have a community call on Teams that you're welcome to join and you can find past calls here.
Again, I can't speak to whether this is a better fit than Akka.NET, but our stack is built atop Service Fabric. While it has some shortcomings (what framework doesn't?) it's an excellent platform for distributed software development.

How to build a predictive dialer?

I need to build a reliable predictive dialer based on Asterisk. Currently the system we use includes Wombat and Asterisk, and we do not find this solution usable as Wombat provides a poor API and it's impossible to use it without regular manual operations.
The system we want:
Can be used solely via API or direct database queries (adding lists to campaigns, updating lists, starting campaigns, stopping campaigns etc.) so that it can be completely integrated into an existing product
Is free, or paid for annually independent of the usage rate
Is considered stable
Should be able to handle tens of thousands of calls per day, if it matters
Use vicidial.org, or hire a freelancer to build a new core with the API you need.
You can also check OSdial for this; it is also developed using Asterisk.
We have been working with a preview of the next version of Wombat through the Early Access program. Wombat now has a complete configuration and reporting JSON API, and you can deploy it "headless" in order to scale up to thousands of parallel lines. If you ask Loway, they can likely get you access to the Early Access program.
BTW, Vicidial is great for agent-based outbound, but imposes quite a large penalty on the number of agents per server - you cannot reasonably use it to do telecasting at the scale we are looking at, as it would require too many servers. Wombat is leaner and can drive over a thousand channels per server. YMMV.
This question would be better placed on a "hire-a-freelancer" site like oDesk ... if you need custom programming done, those are the sorts of places to go to get manpower.
Your specifications are well within what is possible with Asterisk. I'd strongly recommend looking at Vicidial and OSdial as others have suggested; out of the box, they are pretty good.
The hard part of any auto-dialer is not the dialer, oddly enough. It's the prediction algorithms, the answering machine detection algorithms and the agent UI. Those are what makes or breaks an auto-dialer application for a company.

Using AWS for parallel processing with R

I want to take a shot at the Kaggle Dunnhumby challenge by building a model for each customer. I want to split the data into ten groups and use Amazon web-services (AWS) to build models using R on the ten groups in parallel. Some relevant links I have come across are:
The segue package;
A presentation on parallel web-services using Amazon.
What I don't understand is:
How do I get the data into the ten nodes?
How do I send and execute the R functions on the nodes?
I would be very grateful if you could share suggestions and hints to point me in the right direction.
PS I am using the free usage account on AWS but it was very difficult to install R from source on the Amazon Linux AMIs (lots of errors due to missing headers, libraries and other dependencies).
You can build everything up manually at AWS. You have to build your own Amazon compute cluster with several instances. There is a nice tutorial video available on the Amazon website: http://www.youtube.com/watch?v=YfCgK1bmCjw
But it will take you several hours to get everything running:
start 11 EC2 instances (one instance for each group + one head instance)
install R and MPI on all machines (check for preinstalled images)
configure MPI correctly (probably add a security layer)
ideally, set up a file server that is mounted on all nodes (to share data)
with this infrastructure, the best solution is to use the snow or foreach package (with Rmpi)
The segue package is nice but you will definitely get data communication problems!
The simplest solution is cloudnumbers.com (http://www.cloudnumbers.com). This platform provides you with easy access to compute clusters in the cloud. You can test it for free for 5 hours with a small compute cluster in the cloud! Check the slides from the useR conference: http://cloudnumbers.com/hpc-news-from-the-user2011-conference
I'm not sure I can answer the question about which method to use, but I can explain how I would think about the question. I'm the author of Segue so keep that bias in mind :)
A few questions I would answer BEFORE I started trying to figure out how to get AWS (or any other system) running:
How many customers are in the training data?
How big is the training data (what you will send to AWS)?
What's the expected average run time to fit a model to one customer... all runs?
When you fit your model to one customer, how much data is generated (what you will return from AWS)?
Just glancing at the training data, it doesn't look that big (~280 MB). So this isn't really a "big data" problem. If your models take a long time to create, it might be a "big CPU" problem, which Segue may, or may not, be a good tool to help you solve.
In answer to your specific question about how to get the data onto AWS: Segue does this by serializing the list object you provide to the emrlapply() command, uploading the serialized object to S3, then using the Elastic MapReduce service to stream the object through Hadoop. But as a user of Segue you don't need to know that. You just need to call emrlapply() and pass it your list data (probably a list where each element is a matrix or data frame of a single shopper's data) and a function (one you write to fit the model you choose), and Segue takes care of the rest. Keep in mind, though, that the very first thing Segue does when you call emrlapply() is to serialize (sometimes slowly) and upload your data to S3. So depending on the size of the data and your internet connection's upload speed, this can be slow. I take issue with Markus' assertion that you will "definitely get data communication problems". That's clearly FUD. I use Segue on stochastic simulations that send/receive 300 MB/1 GB with some regularity. But I tend to run these simulations from an AWS instance, so I am sending and receiving from one AWS rack to another, which makes everything much faster.
If you want to do some analysis on AWS and get your feet wet with R in the cloud, I recommend Drew Conway's AMI for Scientific Computing. Using his AMI will save you from having to install/build much. To upload data to your running machine, once you set up your SSH certificates, you can use scp to upload files to your instance.
I like running RStudio on my Amazon instances. This will require setting up password access to your instance. There are a lot of resources around for helping with this.
