How to configure Google Cloud Composer cost-effectively - airflow

After some research and testing, we have decided to start using Google Cloud Composer. Since our current DAGs and tasks are relatively small, and don't require the server to run continuously, I am looking how to manage costs.
Two questions:
The option to use preemptible VMs seems logical. This saves costs considerably, and I'm thinking to go for 3x n1-standard-4. I expect each task to be quite short, so don't think this will have significant impact for our workloads. Is it possible to use preemptible VMs with Composer?
Schedule to turn the Composer environment on/off, as asked in this post. I can't find how to do this in the documentation, either by turning the whole enviroment down, or to shutdown the workers as proposed in the answer.
Help, anyone?

This is an interesting question.
One roadblock you may encounter is the nature of Airflow itself. Generally, Airflow is not intended for use ephemerally. Instead, I'd suspect that the vast majority of Airflow use, Cloud Composer or otherwise, is persistent. Ephemerality brings cost benefits but also risks with Airflow architecture. For example, what happens if the scheduler to restart your Airflow resources fails?
To answer your questions:
Preemptibles are not supported in Composer. While PVMs have a ton of awesome benefits, they could leave tasks in a very weird state, especially if you got preempted several times.
There is not formal documentation for this process because it's generally informal and not recommended if you must depend on your environment. The basic approach, though, would be to:
Create a very small GCE VM
Setup the Cloud SDK (gcloud) to connect to your project
Create a crontab that either does a fresh create/delete of an environment when you need it /or/ pauses the VMs in the Composer worker pool
In the long-term, I think Composer will better support ephemeral use of worker resources. In the short term, another option is to run a lightweight Airflow environment on a small(ish) GCE VM and then suspend/resume that VM when you need to use Airflow. You don't get Composer that way, but you do benefit from the team's work improving and expanding GCP support in core Airflow.

Related

Metaflow from Netflix vs Apache Airflow

I have a questions regarding differences between Apache Airflow and Metaflow(https://docs.metaflow.org/). As far as I understood Apache airflow is just a job scheduler, that runs tasks. Metaflow from Netflix is as a dataflow library, which creates machine learning pipeline(dataflow is available) in forms of DAGs. Basically it means, that Metaflow can be executed on the Apache Airflow?
is my understanding correct?
If yes, is it possible to convert Metaflow DAG into Apache Airflow DAG?
Honestly, I haven't worked with Metaflow and thank you for introducing it to me! There is a nice introduction video you can find on Youtube.
Airflow is a framework for creating scheduled pipelines. A pipeline is a set of tasks, linked between each other that represent an Directed Acyclic Graph. Pipeline can be scheduled, you can tell how often or when it should run, you can tell when it should've ran in the past and what time period it should backfill. You can run the whole Airflow as one single docker container or you can have multi-node cluster, it has bunch of already existing operators to integrate with 3rd party services. I recommend to look into Airflow Architecture and concepts.
Metaflow looks like something similar, but created specifically for data-scientists. I can be wrong here, but looking at the Metaflow Basics it looks like I can the same way create a scheduled pipeline similar to Airflow.
I would look in specific tools you want to integrate with and which one of both integrates better. As mentioned, Airflow has lots of already made connectors and operators, as well as, powerful scheduler with backfill and Jinja template language to design your DB queries for enter link description here.
Hope that is somewhat helpful.
Here is also some nice article with feature comparison.

Microservices orchestration choices

I am exploring the possible solutions for orchestrating my flows across multiple services via some infrastructure. Searching shows me a few options such as Conductor, Camunda, Airflow etc.
I am wondering what would fit my use case better
One of my service is in Java, the other is in Python
I need to pass info to the Java service, then take the output and pass it to the Python service
Final output is then published to another queue
It feels like Conductor is a good choice, but would love to hear your inputs!
All options can fulfill the requirement stated. Think about further / future requirements. Is it only a data pipe? Is it about orchestrating a larger end-to end business process? Do you need support for long-running processes? Is end-to-end transparency in a graphical form a benefit? Is graphical process modelling in BPMN2 standard going to be a benefit? Are there going to be audit or reporting requirements? Or is it going to be a simple, isolated, technical solution?
This article gives a great overview of tools in the market and what their primary use cases are: https://blog.bernd-ruecker.com/understanding-the-process-automation-landscape-9406fe019d93
All listed tools might technically be able to execute your workflow (I have no experience working with Conductor & Camunda). A few characteristics on which a decision is usually made are:
open vs closed source
how do you define workflows? (e.g. Python code in Airflow. Others use e.g. JSON/XML/something custom)
does it come with a UI?
can it scale out in case my workloads start growing?
is it agnostic to any technology or limited to running certain technologies? (e.g. Oozie is built for scheduling jobs on Hadoop)
other requirements could be e.g. security, logging, monitoring, etc.
There are many orchestration-tool-comparisons on the internet, e.g. 1 or 2.
Introduction to Container Orchestration
The practice of automating the administration of container-based microservice applications across different clusters is known as container orchestration. Within corporations, this notion is gaining popularity. In addition, a variety of Container Orchestration technologies have become indispensable in the deployment of microservice-based applications.
Software development in the modern era is no longer monolithic. Instead, it generates component-based apps that run across many containers. These adaptable and scalable containers work together to accomplish a specified purpose or microservice.
Depending on the complexity of the application and other requirements like load balancing, they may span many clusters.
Containers encapsulate application code as well as its dependencies. To function efficiently, they receive the resources they require from physical or virtual hosts. When complicated systems are built as containers, clustering them for deployment requires adequate management and priority.
How to Choose a Container Orchestration Tool?
We've looked at a number of Orchestration Tools that you may examine when selecting which is ideal for your business. To do so, make sure to understand your company's requirements and operations. Then you'll be able to more readily weigh the benefits and drawbacks of each option.
Kubernetes
Kubernetes has a lot of features and is ideally suited for container and cluster management at the corporate level. Kubernetes is managed by a number of platforms, including Google, AWS, Azure, Pivotal, and Docker. As the containerized workload grows, you have a lot of options.
The biggest disadvantage is that it does not work with Docker Swarm and Compose CLI manifests. It might also be difficult to understand and set up. Despite these flaws, it is one of the most used systems for cluster deployment and management.
Docker Swarm
For individuals who are already familiar with Docker Compose, Docker Swarm is a better option. It's easy to use and doesn't require any additional software. Unlike Kubernetes and Amazon ECS, however, Docker Swarm lacks sophisticated features such as built-in logging and monitoring. As a result, it is better suited to small-scale businesses that are just starting started with containers.
Amazon ECS
If you're already familiar with Amazon Web Services, Amazon ECS is a great way to install and configure clusters. It's a quick and easy method to get started, and it scales to match demand. It also connects with a number of other AWS services. It's also excellent for small teams with limited resources for container maintenance.
One of its disadvantages is that it is incompatible with nonstandard deployments. It also contains ECS-specific configuration files, which complicates debugging.

In what way is Ruby on Rails NOT multithreaded?

Disclaimer: I'm a c# ASP.NET developer learning "RoR". Sorry if this question doesn't "get" RoR, any corrections greatly appreciated!
What is multithreading
My understanding of "multithread" ability in web apps is twofold:
Every time a web/app server receives a request it can assign a thread to the new request, thus multiple requests can run concurrently.
The app runtime + language allows for multiple threads to be used WITHIN a single request (in ASP.NET via "Async" methods and keywords for example).
In this way, IIS7 + ASP.NET can do points 1 AND 2.
I'm confused about RoR
I've read these two articles and they have left me confused:
Clearing up some things about LinkedIn mobile’s move from Rails to
node.js
How to deploy a multi-threaded Rails app
question one.
I think I understand that RoR doesn't lend itself very well to point number 2 above, that is, having multiple threads within the same request, have I got that right?
question two.
Just to be crystal clear, RoR app/web servers can also do point number 1 above right (that is, multiple requests can run concurrently)? is that not always the case with RoR?
Question 1:
You can spawn more Ruby threads in one request if you want, although that seems to be outside the typical use case for Rails. There are uses for it for certain long-running IO or external operations.
Question 2:
The limiting factor for Ruby concurrency in general, not just with Rails, is the Global Interpreter Lock. This feature of Ruby prevents more than 1 thread of Ruby from executing at any given time per process. The lock is released whenever there is non-Ruby code executing, such as waiting for disk IO or SQL responses. You can get around this by using a different implementation of Ruby than the default, such as JRuby, but not all.
Phusion Passenger uses process based concurrency to handle a few requests concurrently, so, strictly speaking, is not "multithreaded," but is still concurrent.
This talk from Ruby MidWest 2011 has some good thoughts on getting multithreaded Ruby on Rails going.
Since this is about "from ASP.NET to RoR" there is another small but important detail to remember: In *nix environments it's common to achieve concurrency of a service application through multi-processing rather than multi-threading. This is an architecture that goes way back and is related to the relatively cheap cost of multi-processing on *nix systems using fork and Copy-on-Write. Each process serves one request at a time in a single thread and the main process controls spawning and killing worker child processes. Multiple requests are served concurrently by different child processes.
Modern service applications, for example Apache, have multi-process, multi-threaded, and even combined modes (where the service forks several processes, each running several threads).
In cases where the application was built with portability at mind (examples again: Apache, MySQL, etc) it is customary to run it in multi-process or combined mode on *nix systems, and in multi-threaded mode on Windows servers.
However, admittedly Rails is somewhat lacking on the Windows front. It's not that you can't run it on Windows, it's just that not a lot of effort went into making sure it runs well and smoothly for production use on Windows servers. It's not a common production platform among the RoR community.
As a result, Eventhough Rails itself is thread-safe since version 2.2, there isn't yet a good multi-threaded server for it on Windows servers. And you get the best results by running it on *nix servers using multi-process/single-threaded concurrency model.
Rails as a framework is thread-safe. So, the answer is yes!
The second link that you posted here names many of the Rails servers that don't do well with multi-threading. He later mentions that nginx is the way to go (it is definitely the most popular one and highly recommended). But he doesn't mention what made him come up to the conclusions.
Ruby 1.9.3 came out recently and has some new threading goodness built in which didn't exist before.
Use of multi-threading generally depends on the use case.
Personally I have tried it once an year ago and it had worked but I haven't used it in any production code because I haven't come across a use case where using multi-threading made more sense over pushing the long running task to a background job.
I would love to explore this more. So, if you can describe what you are trying to achieve then maybe we can do a POC.

distributed scheduling system for R scripts

I would like to schedule and distribute on several machines - Windows or Ubuntu - (one task is only on one machine) the execution of R scripts (using RServe for instance).
I don't want to reinvent the wheel and would like to use a system that already exists to distribute these tasks in an optimal manner and ideally have a GUI to control the proper execution of the scripts.
1/ Is there a R package or a library that can be used for that?
2/ One library that seems to be quite widely used is mapReduce with Apache Hadoop.
I have no experience with this framework. What installation/plugin/setup would you advise for my purpose?
Edit: Here are more details about my setup:
I have indeed an office full of machines (small servers or workstations) that are sometimes also used for other purpose. I want to use the computing power of all these machines and distribute my R scripts on them.
I also need a scheduler eg. a tool to schedule the scripts at a fix time or regularly.
I am using both Windows and Ubuntu but a good solution on one of the system would be sufficient for now.
Finally, I don't need the server to get back the result of scripts. Scripts do stuff like accessing a database, saving files, etc, but do not return anything. I just would like to get back the errors/warnings if there are some.
If what you are wanting to do is distribute jobs for parallel execution on machines you have physical access to, I HIGHLY recommend the doRedis backend for foreach. You can read the vignette PDF to get more details. The gist is as follows:
Why write a doRedis package? After all, the foreach package already
has available many parallel back end packages, including doMC, doSNOW
and doMPI. The doRedis package allows for dynamic pools of workers.
New workers may be added at any time, even in the middle of running
computations. This feature is relevant, for example, to modern cloud
computing environments. Users can make an economic decision to \turn
on" more computing resources at any time in order to accelerate
running computations. Similarly, modernThe doRedis Package cluster
resource allocation systems can dynamically schedule R workers as
cluster resources become available
Hadoop works best if the machines running Hadoop are dedicated to the cluster, and not borrowed. There's also considerable overhead to setting up Hadoop which can be worth the effort if you need the map/reduce algo and distributed storage provided by Hadoop.
So what, exactly is your configuration? Do you have an office full of machines you're wanting to distribute R jobs on? Do you have a dedicated cluster? Is this going to be EC2 or other "cloud" based?
The devil is in the details, so you can get better answers if the details are explicit.
If you want the workers to do jobs and have the results of the jobs reconfigured back in one master node, you'll be much better off using a dedicated R solution and not a system like TakTuk or dsh which are more general parallelization tools.
Look into TakTuk and dsh as starting points. You could perhaps roll your own mechanism with pssh or clusterssh, though these may be more effort.

What's the best way to simulate a complex production web development environment?

I want to create a modestly scalable development environment for an in-development web service.
Ideally, there would be an nginx web server with haproxy and a few database servers, websockets, the works.
I'd be going with Amazon cloud services for all of this hosting... but I'd rather not pay for CPU cycles when I'm just developing... much less develop on a remote, cloud environment.
What's the best way to go about modeling a somewhat complex development environment locally that could - hopefully, at the press of a button - sync with a similarly architected Amazon cloud environment?
All I have is my Macbook Pro. I also have a fully built 1Ghz tower computer in the closet I could leverage, if needed, and wouldn't be opposed to buying more. But, ultimately, I'd like to have the ability to sync to production with minimal steps and reconfiguration.
Thanks!
Check out vagrant and virtualbox. That will get you local environments running nicely on your macbook. Syncing to EC2 is going to be tougher. At the system level you'll want to use something like puppet or chef (which are both nicely supported by vagrant). Add to that a solid automated application deployment mechanism and you should be close. Be prepared to put some time into this, it's not likely to be a trivial undertaking.

Resources