We started to implement Airflow for task scheduling about a year ago, and we are slowly migrating more and more tasks to it. At some point we noticed that the server was filling up with logs, even after we implemented remote logging to S3. I'm trying to understand the best way to handle logs, and I've found a lot of conflicting advice, such as in this Stack Overflow question from 4 years ago, where the suggestions include:
Implementing maintenance dags to clean out logs (airflow-maintenance-dags)
Implementing our own FileTaskHandler
Using the logrotate linux utility
When we implemented remote logging, we expected the local logs to be removed once they were shipped to S3, but this is not the case: local logs remain on the server. I thought this might be a problem with our configuration, but I haven't found any way to fix it. Also, remote logging only applies to task logs, while process logs (specifically scheduler logs) are always local, and those take up the most space.
We tried to implement maintenance DAGs, but our workers run in a different location from the rest of Airflow (particularly the scheduler), so only task logs were getting cleaned. We could get around this by creating a new worker that shares logs with the scheduler, but we would prefer not to create extra workers.
We haven't tried either of the other two suggestions yet. That is why I want to understand: how are other people solving this, and what is the recommended way?
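For reference, the maintenance-DAG approach we tried boils down to something like the sketch below (the paths, retention window and dag_id are placeholders, and it assumes Airflow 2 imports). The catch, as mentioned, is that it only cleans logs on whichever machine actually executes the task:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumed log location and retention; adjust to match airflow.cfg.
BASE_LOG_FOLDER = "/opt/airflow/logs"
MAX_LOG_AGE_DAYS = 30

with DAG(
    dag_id="airflow_log_cleanup",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Delete log files older than the retention window, then prune empty directories.
    # Note: this only runs on the worker that picks up the task, not on the scheduler host.
    BashOperator(
        task_id="delete_old_logs",
        bash_command=(
            f"find {BASE_LOG_FOLDER} -type f -mtime +{MAX_LOG_AGE_DAYS} -delete && "
            f"find {BASE_LOG_FOLDER} -type d -empty -delete"
        ),
    )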
Related
I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but I have not dealt with the scaling factor of 10K parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging, but dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Another option would be to have one task that kicks off the 10K containers and monitors them from there.
I have no experience with Step Functions, but I have heard it's AWS's Airflow. There seem to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insight into failing jobs / ability to retry tasks as you do with Airflow?
I have worked on both Apache Airflow and AWS Step Functions and here are some insights:
Step Functions is maintenance-free out of the box: it provides the high availability and scalability required for your use case, whereas with Airflow you would have to achieve that yourself with auto-scaling/load balancing on servers or containers (Kubernetes).*
Both Airflow and Step Functions have user-friendly UIs. Airflow supports multiple representations of a workflow, while Step Functions only displays the state machine as a DAG.
As of version 2.0, Airflow's REST API is stable. AWS Step Functions is likewise supported by a range of production-grade CLIs and SDKs.
Airflow has server costs, while Step Functions gives you 4,000 free state transitions per month (free tier) and costs $0.000025 per state transition after that. E.g. if you run 10K steps for AWS Batch once daily, you will pay about $0.25 per day (roughly $7.50 per month). The price for an Airflow server (t2.large EC2, 1-year reserved instance) is $41.98 per month. You would have to use AWS Batch in either case.**
AWS Batch integrates with both Airflow and Step Functions.
You can clear and rerun a failed task in Apache Airflow, but in Step Functions you will have to build a custom implementation to handle that. You can, however, define automated retries with back-off in the Step Functions state machine definition.
For a failed task in Step Functions you get a visual representation of the failed state and a detailed message when you click it. You can also use the AWS CLI or SDK to get the details.
Step Functions uses an easy-to-write JSON state machine definition, while Airflow workflows are defined as Python scripts (see the sketch below).
Step Functions supports async callbacks, i.e. the state machine pauses until an external source notifies it to resume; Airflow has yet to add this feature.
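To make the JSON-vs-Python and retry points concrete, here is a rough sketch of a Step Functions definition with an automated retry policy, registered via boto3 (the ARNs, names and retry numbers are placeholders, not a production setup):

import json

import boto3

# Amazon States Language definition: a single Task state that submits an AWS Batch
# job (via the built-in batch:submitJob.sync integration) and retries it with
# exponential back-off on any error.
definition = {
    "StartAt": "ProcessChunk",
    "States": {
        "ProcessChunk": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "process-chunk",
                "JobQueue": "arn:aws:batch:us-east-1:123456789012:job-queue/my-queue",
                "JobDefinition": "arn:aws:batch:us-east-1:123456789012:job-definition/my-job:1",
            },
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="batch-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsBatchRole",  # placeholder role
)

The rough Airflow equivalent of the Retry block is simply setting retries and retry_exponential_backoff on the operator.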
Overall, I see more advantages to using AWS Step Functions. You will have to weigh the maintenance cost and development cost of both services for your use case.
UPDATES (AWS Managed Workflows for Apache Airflow Service):
*With the AWS Managed Workflows for Apache Airflow service, you can offload deployment, maintenance, autoscaling/load balancing and security of your Airflow service to AWS. But please consider the version number you're willing to settle for, as AWS managed services are usually behind the latest release (e.g. as of March 08, 2021, the latest version of open-source Airflow is 2.0.1, while MWAA only offers 1.10.12).
**MWAA charges for the environment, instances and storage. More details here.
I have used both Airflow and Step Functions in my personal and work projects.
In general I liked Step Functions, but the fact that you need to schedule the execution with EventBridge is super annoying. Actually, I think Airflow could just act as a trigger for the step functions here.
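A minimal sketch of that idea, assuming a plain boto3 call from a PythonOperator (the state machine ARN and dag_id are made up; the Amazon provider also ships a dedicated Step Functions operator if you prefer not to call boto3 directly):

import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def start_state_machine():
    # Airflow owns the schedule; this task just kicks off one Step Functions execution.
    sfn = boto3.client("stepfunctions")
    sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:my-pipeline",
        input=json.dumps({"triggered_by": "airflow"}),
    )


with DAG(
    dag_id="trigger_step_function",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="start_execution", python_callable=start_state_machine)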
If Airflow were cheaper to manage, I would always opt for it, because I find managing JSON-based pipelines a hassle whenever you need to detour from the main use case, which somehow always happens for me. It becomes an even more complex issue when you need to have source control.
This one is a more subjective assessment, but I find the monitoring capabilities of Airflow far better than those of Step Functions.
Also, some information about the usage of Airflow vs Step Functions:
AWS currently has managed Airflow (MWAA), which is priced per hour, so you don't need a dedicated EC2 instance. On the other hand, Step Functions pipelines are typically built from AWS Lambdas, which have an execution time limit of 15 minutes, and that makes them not the best candidate for long-running pipelines.
I am maintaining an API server for my company which runs a Python Flask app under uWSGI on top of nginx.
...

@app.route('/getquick', methods=["GET"])
def GET_GET_IP_DATA():
    sp_final = "CALL sp_quick()"
    cursor.execute(sp_final)

@app.route('/get_massive_log', methods=["POST"])
def get_massive_log():
    sp_final = "CALL sp_slow()"
    cursor.execute(sp_final)

...
While the first request, /getquick, gets processed very quickly, /get_massive_log can take up to five seconds due to a rather long and complex MySQL query. The server can handle a few of these queries, but it starts throwing broken pipe errors when it is called too often.
The problem is, the other /getquick requests get blocked by these long I/O requests.
My manager suggested that I use gevent to somehow free up the server to process other requests while waiting for the MySQL queries, but I am not sure if I am looking in the right direction.
I am using pymysql to run the queries, which Google suggests should work with gevent on top of uWSGI, but I have not been able to produce better results with it.
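For reference, the setup I have been testing looks roughly like the sketch below (simplified; the connection details are placeholders):

# gevent has to monkey-patch the standard library *before* pymysql is imported,
# otherwise the MySQL socket calls stay blocking and nothing is gained.
from gevent import monkey
monkey.patch_all()

import pymysql
from flask import Flask, jsonify

app = Flask(__name__)


def run_proc(proc_call):
    # One connection per request keeps the sketch simple; a pool would be better.
    conn = pymysql.connect(host="localhost", user="api", password="secret", db="logs")
    try:
        with conn.cursor() as cursor:
            cursor.execute(proc_call)
            return list(cursor.fetchall())
    finally:
        conn.close()


@app.route('/getquick', methods=["GET"])
def get_quick():
    return jsonify(run_proc("CALL sp_quick()"))


@app.route('/get_massive_log', methods=["POST"])
def get_massive_log():
    return jsonify(run_proc("CALL sp_slow()"))

I then run uWSGI in gevent mode with something like uwsgi --http :8080 --gevent 100 --module app:app (the exact flags depend on the config), the idea being that a worker can switch to another request while one is waiting on MySQL.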
I have googled for days now, and while I am trying to understand threads, concurrency, and asynchronous requests, I don't know where to start digging for a solution. Is it even possible? Any suggestions, or even pointers on where to start researching, would be greatly appreciated.
EDIT: Perhaps my question wasn't clear, so I'll try to restate it:
What's the best way to free up workers for processing other requests while waiting for long database queries with uwsgi?
You need to learn about uWSGI offloading.
Offloading is a way to optimize tiny tasks, delegating them to one or more threads. These threads run such tasks in a non-blocking/evented way allowing for a huge amount of concurrency.
You can read about the offloading subsystem in the docs.
I have a web app that will run forever (at least for a few days) on my local machine using the technique (hack?) described in Jeff Atwood's post: https://blog.stackoverflow.com/2008/07/easy-background-tasks-in-aspnet/
However, when I run it on AppHarbor my app doesn't run for more than an hour or so (I'm not sure exactly when it dies). As long as I keep hitting the site it stays up, so I'm assuming it is being killed after an idle period, but I'm not sure why.
My app doesn't save any state or persist anything. It makes web service calls and survives errors in any calls.
I turned on a ping service to keep my app alive but I'm curious why this works on my local machine but not on App Harbor?
The guys behind AppHarbor pay for EC2 instances for all running apps, so they naturally want to limit CPU usage as much as possible. One way to achieve this is to shut down unused applications very quickly and only restart them when someone actually tries to access them. Paid hosting should not be limited in this way.
(As far as I have been informed, they are able to host around 100k sites on fewer than twenty medium instances, which is certainly impressive and calls for a very economical use of resources.)
To overcome the limitation you would need a cron job to ping your AppHarbor site. But this is of course a somewhat recursive problem, since you need AppHarbor to act as the cron job ;)
AppHarbor recycles the Application Pool frequently to keep sleeping websites from using idle CPU time. This is simply the price you pay for using a shared website hosting plan.
If you really want to run a background job then you should be using AppHarbor's background workers, since this is exactly the type of task they were built to run.
http://support.appharbor.com/kb/getting-started/background-workers
Simply build a new console application that runs your logic and include it in your solution. When you push the code, the workers will be started automatically. If you happen to already have other exes in your solution, make sure to edit the app.config and set the 'deploy background worker' value to false.
Does anyone know how to run a number of processes in the background, either through a job queue or parallel processing?
I have a number of maintenance updates that take time to run, and I want to run them in the background.
I would recommend Gearman server; it has proved quite stable. It sits completely outside of Symfony2, and you have to have the server up and running (I don't know what your hosting options are), but it distributes jobs perfectly. In its most minimal setup it just keeps all jobs in memory, but you can configure it to use an sqlite database as backup, so if for any reason the server reboots or the gearman daemon breaks, you can just start it again and your jobs will be preserved. I know it has been tested with very heavy loads (adding up to 1k jobs per second) and it stood its ground. It's probably even more stable nowadays; I'm speaking from experience two years ago, when we offloaded some long-running tasks in a ZF application to background processing via Gearman. The client / job server / worker flow is quite self-explanatory.
Check out RabbitMQ. It's the most popular option according to knpbundles.com.
Take a look at http://github.com/mmoreram/rsqueue-bundle
It uses Redis as the queue core and will be maintained.
Take a look at the enqueue library. There are a lot of transports (AMQP, STOMP, AmazonSQS, Redis, Filesystem, Doctrine DBAL and more) to choose from, and it is easy to use and feature-rich. That would be enough for a simple job queue, though if you need something more sophisticated, look at enqueue/job-queue. It can run an exclusive job (only one job running at a given time), a job with sub-jobs, or a job with something to do after it has been done.
Of course, there is a bundle for it.
Scenario:
A BizTalk application is deployed with a receive port, an orchestration and a send port. Messages flow correctly.
At some point, a bug is found in the orchestration, causing messages to suspend. The orchestration must be fixed and redeployed.
Question:
Because you can't redeploy an orchestration with suspended instances, how would you go about retaining those messages, terminating the instances, redeploying, and then resending those messages through the fixed orchestration? Is there a process or tool for this?
If the bug doesn't require major modification, i.e. no new orchestrations, no new schemas, no new promoted fields etc., then a short-term 'hack' is possible: simply reinstall the fixed MSIs (and GAC the assemblies) on your servers and restart the host instances (using NLB if applicable), i.e. without importing the MSIs into BizTalk.
You should then be able to resume any suspended (resumable) orchs. Then schedule some downtime at a less busy time, put your app into a partially stopped state to prevent new orchs from starting, wait for all running orchs to complete, and then import the fixed MSI (consider bumping the buggy orch assembly version for the hotfix).
Building a custom tool with the ability to audit all messages going in and out of BizTalk is useful, so you can replay them. This will allow you to terminate the orchs, reinstall, and then replay the messages.
You can also fix the orchestration and, while building it, increment the version of the assembly. This way you can have parallel deployments of the orchestration. You can unenlist the existing one after deploying the newer-versioned orchestration.