How to know when to scale your FastAPI app? - fastapi

I'm currently following the process of https://fastapi.tiangolo.com/deployment/docker/ to get a Docker FastAPI container up and running, and it works like a charm.
We're currently on the nginx+uwsgi+flask stack and we want to get onboard with FastAPI.
The big question we have is:
What metric do we use to tell us we need to scale up or out?
In our current stack, uwsgi exposes the "Max Workers" and "Idle Workers" metrics, so it's super easy to answer that question. If the number of Idle Workers metrics starts creeping down to zero, we either scale UP (increase the number of workers), or scale OUT (spin up more uwsgi containers behind our load balancer). Easy peasy!
With FastAPI, how do we know ahead of time when we need to scale?
Does FastAPI have any metrics we can query? Or is the answer, "scale up when your cpu/memory is maxing out?"
thanks,
A FastAPI Newbie

Related

How Uvicorn workers works, and how many do I need for a slim machine?

The application I deploy is FastAPI with Uvicorn under K8s.
While trying to understand how I want to Dockerize the application I understood I want to implement Uvicorn without Gunicorn and to add a system of scale up/down by the load of the requests the application is getting.
I did a lot of load testing and discovered that with the default of 1 Uvicorn worker I'm getting 3.5 RPS, while changing the workers to 8 I can get easly 22 RPS (didn't check for more since its great results for me).
Now what I was expecting regarding the resources is that the CPU that I will have to provide will be with a limit of 8 (I assume every worker works on one process and thread), but I saw only increase in the memory usage, but barley in the CPU. maybe its because the app don't use that much CPU but indeed its possible for it to use more than 1 CPU? so far it didn't used more than one CPU.
How are the Uvicorn workers works? how should I calculate the workers number I need for the app? I didn't find useful information.
Again, my goal is to keep a slim machine of 1 cpu, with Autoscaling system.

How to send 50.000 HTTP requests in a few seconds?

I want to create a load test for a feature of my app. It’s using a Google App Engine and a VM. The user sends HTTP requests to the App Engine. It’s realistic that this Engine gets thousands of requests in a few seconds. So I want to create a load test, where I send 20.000 - 50.000 in a timeframe of 1-10 seconds.
How would you solve this problem?
I started to try using Google Cloud Task, because it seems perfect for this. You schedule HTTP requests for a specific timepoint. The docs say that there is a limit of 500 tasks per second per queue. If you need more tasks per second, you can split this tasks into multiple queues. I did this, but Google Cloud Tasks does not execute all the scheduled task at the given timepoint. One queue needs 2-5 minutes to execute 500 requests, which are all scheduled for the same second :thinking_face:
I also tried a TypeScript script running asynchronous node-fetch requests, but I need for 5.000 requests 77 seconds on my macbook.
I don't think you can get 50.000 HTTP requests "in a few seconds" from "your macbook", it's better to consider going for a special load testing tool (which can be deployed onto GCP virtual machine in order to minimize network latency and traffic costs)
The tool choice is up to you, either you need to have powerful enough machine type so it would be able to conduct 50k requests "in a few seconds" from a single virtual machine or the tool needs to have the feature of running in clustered mode so you could kick off several machines and they would send the requests together at the same moment of time.
Given you mention TypeScript you might want to try out k6 tool (it doesn't scale though) or check out Open Source Load Testing Tools: Which One Should You Use? to see what are other options, none of them provides JavaScript API however several don't require programming languages knowledge at all
A tool you could consider using is siege.
This is Linux based and to prevent any additional cost by testing from an outside system out of GCP.
You could deploy siege on a relatively large machine or a few machines inside GCP.
It is fairly simple to set up, but since you mention that you need 20-50k in a span of a few seconds, siege by default only allows 255 requests per second. You can make this larger, though, so it can fit your needs.
You would need to play around on how many connections a machine can establish, since each machine will have a certain limit based on CPU, Memory and number of network sockets. You could just increase the -c number, until the machine gives an "Error: system resources exhausted" error or something similar. Experiment with what your virtual machine on GCP can handle.

How to loadbalance multiple application servers against multiple Vtgates

We are doing a POC to prove that Vitess can scale massively and meet our requirements. We are using around 40 application servers, 15 VTGates and 30 shards (each shard contains master, replica and rdonly). However we were able to scale up to a point and above that point getting a flat line.
Main dark point for us is connecting application server and multiple VTGates. We have tried loadbalancer (AWS nlb) in between them and seen increased QPS but much lower TPS (~15000 QPS, ~ 1500-2000 TPS). Then we tired each application use JDBC connection pooling to connect VTGate without loadbalancer. We got similar results. Then we tried without connection pooling. Then we were able to increase TPS, however saw massive dips in QPS which affect the TPS.
As you can see we have hit a certain roadblock and need few brilliant ideas to overcome this. Really appreciate valuable inputs 

Is it "okay" to host a small wordpress blog on one AWS EC2 Instance without load balancers/beanstalk?

This is a very simple question for those with the knowledge, but I'm a newbie.
In essence, I just need to know if it would be considered okay to run a small, approx. 700 visitors/day bitnami wordpress blog on just one t2.medium EC2 instance (without any auto-scaling, beanstalk).
Am at risk of it crashing? What stats should I monitor or be aware of to be aware of potential dangers? Sorry for the basic nature of these questions, but this is new.
tl;dr: It might be "okay", but it's not ideal.
If your question is because of:
Initial setup time - Load-balancing and auto-scaling will be less expensive (more time-efficient) over time.
Cost - Auto-scaling spins down instances that aren't being used to reduce cost.
Minimal setup for a great user experience - The goal of a great AWS setup is to ensure that capacity matches demand
Am at risk of it crashing?
Possibly, yes. If you average 700 visitors, then the risk is traffic spikes if all visitors hit at the same. It also depends on what your maximum visitors are, which could vary widely from the average (or not)
What stats should I monitor or be aware of to be aware of potential dangers?
Monitor the usage on high traffic days (ie. public holiday sales)
Setup billing alerts
Setup the right metrics:
See John Rotenstein's SO answer:
CPU Utilization is not always the right measure to use -- your
application might only be able to handle a limited number of
connections, it might be squeezed on RAM and the types of requests
might vary too.
You can use normal monitoring tools, or you can write something that
pushes metrics to Amazon CloudWatch, so that you go beyond the basic
CPU and Network metrics that CloudWatch normally provides. You could
even use the Load Balancer's Latency metric to trigger scaling when
the application slows down (custom code required).
I'd start with:
Two or more instances - to deal with instance redundancy (an instance going down)
Several t2.small rather than one t2.medium can work out to be more cost-efficient, and more cost efficient than EC in some use cases.
Add auto-scaling - automatically spin up or down instances based on minimum and maximum counts
Load balancing - to re-route users from unhealthy to healthy instances. And also to keep all of the spun up instances all working as evenly as possible (rather than a single instance handling 80% of the workload while the others bludge).
You can always reduce your instances after time with monitoring.
In my opinion, with 700 visitors a day, the safer option would be to run a load balanced/auto-scaling environment on Elastic Beanstalk with at least 2 instances. The problem with running just one instance is that yes you are at a great risk of crashing in case you get an increase in traffic or when the instance goes down and with just one running you will not have a fallback. You can easily set up CloudWatch monitoring on NetworkIn, NetworkOut to get a sense of the number of requests your site is receiving and serving, and setup CPU Usage monitoring as well. The trade-off with running a load balanced environment over a single instance environment is that the cost might significantly increase as you introduce other things into your environment such as a load balancer. Also if you introduce a load balancer consider reducing the instance size to maybe a t2.small, could aid in reducing the cost.
It actually depends. This question range is wide. You have multiple options here.
You can use only ec2 instance for that much amount of visitors or even more if your application allows. You can also consider caching if your app need it.
You may add instance in an autoscaling group. So that if by any chance you need more resources you can increase them horizontally.
You can add load balancers lateron also. You just need to add user data in your launch configuration attached to autoscaling group. So when your instance get up it should automatically register itself in your load balancer.
For monitoring, you can check for the request metrics in cloudwarch for ELB. You have to keep an eye on your CPU and trigger the scale out policy once it reaches a particular threshold.

How to handle scaling when request per minute go from 500 to 5000 instanly

I have an application that spikes from 500 rpm to 5000 and stays there for 20-30min. I know that's not a ton of requests but its the magnitude of the jump that is killing me. AWS-EC2 takes 5 min to scale up so that's not helpful when things move so fast. Maybe multiple DB's that handle different pieces of the application.
How would you go about analyzing this and thinking about infrastructure if you will always go from 500 to 5000RPM or higher in one minute?
This is the graph from my AWS logs:
If you can predict that demand will increase at some point you can automate provisioning of new instances. If you can't determine this then you need to do proper capacity planning. For instance, how many servers/containers do you need running to sustain the load with an acceptable user experience? This will be key to determine.
You also should look at implement asynchronous messaging patterns that offload the spike although this may come with some performance degradation.
One additional consideration would be moving to a serverless architecture like AWS Lambda. This likely wouldn't fully solve the problem but would provide you more ability to quickly provision on demand infrastructure.

Resources