High memory usage in OpenCPU

R requires CPU more than anything else, so it is recommended to pick one of the newer generation compute-optimized instance types, preferably with an SSD.
I've recently run into a problem with high memory usage (quickly rising to 100%) during load testing. To reproduce: there is an R package whose processing time is up to 0.2 s under no-stress conditions. If I query one of its endpoints with curl for 1,000 JSON requests from 3 machines in parallel, all of the memory is suddenly used up, which results in 'cannot fork' or:
cannot popen '/usr/bin/which 'uname' 2>/dev/null', probable reason 'Cannot allocate memory' In call: system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE)
The setup is 2x AWS 8 GB CPU-optimized servers plus a load balancer, all in a private network. HTTPS is enabled, and my main usage is online processing of requests, so I'm mostly querying the /json endpoints.
Do you happen to have any suggestions on how to approach this issue? The plan is to have more packages installed (more online processes requesting results from various functions), and I don't want to end up needing 32 GB of RAM per box.
All of the packages are deployed with the following options:
LazyData: false
LazyLoad: false
They are also added to the preload section of serverconf.yml.j2.
RData files are loaded within an .onLoad function by calling utils::data.
Also, keeping in mind that I'm using OpenCPU without GitHub and with only one-way communication (from the backend to the ocpu box), which options do you suggest turning on or optimizing? It's not clearly stated in the docs yet.

It mostly depends on which packages you are using and what you are doing. Can you run the same functionality that you are invoking through opencpu locally (in the command line) without running out of memory?
Apache2 prefork creates worker processes to handle concurrent requests. Each of these workers contains an R process with all preloaded packages. So if one request takes 500 MB, the total memory consumption on the server is n * 500 MB, where n is the number of workers that are loaded.
Depending on how many concurrent requests you expect, you could try lowering StartServers or MaxRequestWorkers in your apache2 config.
Also try raising (or lowering) the option rlimit.as in the file /etc/opencpu/server.conf, which limits the amount of memory (address space) a single process is allowed to consume.
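As a rough sketch (Ubuntu/Debian paths; the numbers are illustrative starting points for an 8 GB box, not recommendations, and "yourpackage" is a placeholder), the two files look roughly like this:

/etc/apache2/mods-available/mpm_prefork.conf
<IfModule mpm_prefork_module>
    StartServers            2
    MinSpareServers         2
    MaxSpareServers         4
    MaxRequestWorkers      12
    MaxConnectionsPerChild 100
</IfModule>

/etc/opencpu/server.conf (JSON; rlimit.as is the address-space limit in bytes)
{
  "rlimit.as": 2e9,
  "rlimit.fsize": 1e9,
  "rlimit.nproc": 100,
  "preload": ["yourpackage"]
}

Keep in mind that the totals multiply: with MaxRequestWorkers at 12 and roughly 500 MB per worker you are already near 6 GB, so tune the worker count and the per-process limit together.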

Related

Request to TensorFlow Serving server got timeout error for models using the gRPC Python API

Synopsis
10 computer vision models were deployed to a TensorFlow Serving server (TSS) running on Ubuntu 22.04.
TSS is installed as a system service on a dedicated server with an MSI RTX 3060 12 GB on board. The system configuration and TSS service file are below.
Problem
Requests sent via the tensorflow-serving gRPC API randomly get status code DEADLINE_EXCEEDED or UNAVAILABLE, sometimes on the first request but more often after some number (1-4) of successful requests or after a period of inactivity (1 hour or more).
No OOM or service dump happened. GPU memory occupation is near 6 GB. The service logs seem to have no problem indication, and no warnings either (debug level 3).
Some experiments and results are detailed below.
System
[OS] Ubuntu 22.04.1 LTS
[CPU] 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
[GPU] MSI GeForce RTX 3060 12 GB
[RAM] 16 GB
[SSD] NVMe 1 TB
[Tensorflow] Version 2.9.1
[CUDA] Version 11.7
[CUDNN] Version 8.6.0.163-1+cuda11.8
[TensorRT] Not used while building tensorflow-serving
TSS service file
[Unit]
Description=Tensorflow Serving Service
After=network-online.target
[Service]
User=root
Environment="PATH=/usr/local/bin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/cuda/bin:/usr/bin/model_servers"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="TF_FORCE_GPU_ALLOW_GROWTH=true"
ExecStart=/usr/bin/model_servers/tensorflow_model_server --port=8504 --model_config_file=/mnt/data/models/export/frman.conf
[Install]
WantedBy=multi-user.target
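For reference, the model config file passed via --model_config_file (frman.conf above) uses the standard TensorFlow Serving model_config_list text format; a minimal sketch with placeholder model names and paths would be:

model_config_list {
  config {
    name: "u2net"
    base_path: "/mnt/data/models/export/u2net"
    model_platform: "tensorflow"
  }
  config {
    name: "classifier"
    base_path: "/mnt/data/models/export/classifier"
    model_platform: "tensorflow"
  }
}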
Hints
The TensorFlow Serving service initializes after the network on the host is available.
The service is configured to allocate GPU memory only when needed (environment variable TF_FORCE_GPU_ALLOW_GROWTH=true).
Hypotheses and actions
The problem is lost packets on the network
Requests to TSS were monitored with Wireshark on the client side and with the GNOME System Monitor on the server side. No problem was detected.
The timeout value for a single request on the client side of the tensorflow_serving.apis.prediction_service_pb2_grpc.PredictionServiceStub object was increased:
stub.Predict(request, timeout * len(images))
The gRPC channel is checked for readiness before data transmission begins.
Interceptors for gRPC requests were added. The procedure retries the request with exponential backoff, but it nevertheless returned status code DEADLINE_EXCEEDED or UNAVAILABLE randomly:
# RetryOnRpcErrorClientInterceptor, ExponentialBackoff and CompressionLevel
# are helpers defined elsewhere in the client code.
options_grpc = [
    ('grpc.max_send_message_length', 100 * 1024 * 1024),
    ('grpc.max_receive_message_length', 100 * 1024 * 1024),
    ('grpc.default_compression_algorithm', grpc.Compression.Gzip),
    ('grpc.default_compression_level', CompressionLevel.high),
]
interceptors = (
    RetryOnRpcErrorClientInterceptor(
        max_attempts=5,
        sleeping_policy=ExponentialBackoff(init_backoff_ms=1000, max_backoff_ms=32000, multiplier=2),
        status_for_retry=(grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNKNOWN, grpc.StatusCode.UNAVAILABLE),
    ),
)
channel = grpc.insecure_channel(f"{host}:{port}", options=options_grpc)
grpc.channel_ready_future(channel).result(timeout=5)
channel = grpc.intercept_channel(channel, *interceptors)
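The RetryOnRpcErrorClientInterceptor and ExponentialBackoff helpers referenced above are user code that is not shown here; one possible sketch of that retry pattern (not necessarily identical to the implementation actually in use) is:

import random
import time

import grpc


class ExponentialBackoff:
    # Sleeping policy: exponential backoff, times given in milliseconds.
    def __init__(self, init_backoff_ms, max_backoff_ms, multiplier):
        self.init_backoff = random.randint(0, init_backoff_ms)
        self.max_backoff = max_backoff_ms
        self.multiplier = multiplier

    def sleep(self, attempt):
        sleep_ms = min(self.init_backoff * self.multiplier ** attempt, self.max_backoff)
        time.sleep(sleep_ms / 1000)


class RetryOnRpcErrorClientInterceptor(grpc.UnaryUnaryClientInterceptor):
    # Retries a unary-unary RPC when it fails with one of the given status codes.
    def __init__(self, max_attempts, sleeping_policy, status_for_retry=None):
        self.max_attempts = max_attempts
        self.sleeping_policy = sleeping_policy
        self.status_for_retry = status_for_retry

    def intercept_unary_unary(self, continuation, client_call_details, request):
        response = None
        for attempt in range(self.max_attempts):
            response = continuation(client_call_details, request)
            # A failed call comes back as an object that is also a grpc.RpcError.
            if not isinstance(response, grpc.RpcError):
                return response
            if self.status_for_retry and response.code() not in self.status_for_retry:
                return response
            self.sleeping_policy.sleep(attempt)
        return response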
The problem is in the models themselves
It was noticed that the problems mostly arise with models based on the U2Net architecture. U2Net uses some custom operations, and it was assumed that the first request times out because loading these custom ops takes too long.
This was found in the TSS service log file. To resolve it we tried:
Adding a warm-up for this kind of model at service startup, so that all custom network operations are loaded into memory before inference (a sketch of such a warm-up file is shown below).
Eliminating the custom operations in U2Net by converting the models to ONNX format and then back to the TensorFlow SavedModel format, so that TSS no longer needs to load custom network ops at startup. Model warm-up was added as well.
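A minimal sketch of how such a warm-up file can be generated, using the standard TensorFlow Serving SavedModel warm-up mechanism (the model name, signature, input tensor name, shape and path below are placeholders and have to match the real model): the server replays the requests recorded in assets.extra/tf_serving_warmup_requests when it loads the model version.

# Sketch: write a SavedModel warm-up file for TensorFlow Serving.
# "u2net", "serving_default", "input" and the input shape are placeholders.
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

warmup_file = "/mnt/data/models/export/u2net/1/assets.extra/tf_serving_warmup_requests"

request = predict_pb2.PredictRequest(
    model_spec=model_pb2.ModelSpec(name="u2net", signature_name="serving_default"),
    inputs={"input": tf.make_tensor_proto(np.zeros((1, 320, 320, 3), dtype=np.float32))},
)
log = prediction_log_pb2.PredictionLog(
    predict_log=prediction_log_pb2.PredictLog(request=request))

with tf.io.TFRecordWriter(warmup_file) as writer:
    writer.write(log.SerializeToString())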
The problem is a lack of memory - another alarming message was noticed in the TSS service log:
tensorflow_model_server[62029]: 2022-12-20 14:10:13.900599: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:360] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
It looks like there is not enough memory for inference on the GPU. To solve this we tried limiting the image batch size to one image for TSS and setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true.
Memory consumption did not increase after that, but the random timeout (DEADLINE_EXCEEDED) errors did not disappear.
Conclusion - Issue NOT Resolved
Thus, the problem is still present, especially when TSS runs inference on segmentation models (like U2Net).
The root of the problem was not found.
The error is difficult to reproduce because of its random nature.
What else would be worth checking or configuring to resolve the issue?

How to send 50,000 HTTP requests in a few seconds?

I want to create a load test for a feature of my app. It uses Google App Engine and a VM. The user sends HTTP requests to the App Engine, and it is realistic that the Engine gets thousands of requests in a few seconds. So I want to create a load test where I send 20,000-50,000 requests in a timeframe of 1-10 seconds.
How would you solve this problem?
I started by trying Google Cloud Tasks, because it seems perfect for this: you schedule HTTP requests for a specific point in time. The docs say that there is a limit of 500 tasks per second per queue; if you need more tasks per second, you can split the tasks across multiple queues. I did this, but Google Cloud Tasks does not execute all of the scheduled tasks at the given time. One queue needs 2-5 minutes to execute 500 requests that are all scheduled for the same second.
I also tried a TypeScript script running asynchronous node-fetch requests, but 5,000 requests take 77 seconds on my MacBook.
I don't think you can get 50,000 HTTP requests "in a few seconds" from "your MacBook"; it's better to consider a dedicated load testing tool (which can be deployed onto a GCP virtual machine in order to minimize network latency and traffic costs).
The tool choice is up to you: either you need a machine type powerful enough to send 50k requests "in a few seconds" from a single virtual machine, or the tool needs to support running in clustered mode so you can kick off several machines that send the requests together at the same moment in time.
Given that you mention TypeScript, you might want to try the k6 tool (it doesn't scale out, though) or check out Open Source Load Testing Tools: Which One Should You Use? to see what the other options are; none of them provides a JavaScript API, but several don't require programming knowledge at all.
A tool you could consider using is siege.
It is Linux-based, and to avoid the additional cost of testing from a system outside GCP, you could deploy siege on a relatively large machine, or a few machines, inside GCP.
It is fairly simple to set up, but since you mention that you need 20-50k requests in a span of a few seconds: siege by default only allows 255 concurrent users. You can raise this limit, though, so it can fit your needs.
You would need to experiment with how many connections a machine can establish, since each machine will have a certain limit based on CPU, memory and the number of network sockets. You could keep increasing the -c number until the machine gives an "Error: system resources exhausted" or something similar. Experiment with what your virtual machine on GCP can handle.
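As a rough sketch (the URL is a placeholder and the numbers are only starting points), you would raise the user limit in the siege configuration and then run it in benchmark mode from the GCP machine:

# ~/.siege/siege.conf (or ~/.siegerc on older versions): raise the default cap of 255 users
limit = 1000

# 1000 concurrent users, benchmark mode (no delay between requests), for 10 seconds
siege -b -c 1000 -t 10S https://your-app.example.com/endpoint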

Rserve sub-processes memory issue

I have a Java server from which I am calling R functions using c.eval.
I use Rserve to do this and preload all libraries using the following call:
R CMD Rserve --RS-conf Rserve.conf
The file Rserve.conf preloads all my libraries. The total memory requirement for this is around 127 MB.
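For context, such an Rserve.conf typically pulls the libraries in via a source directive, roughly like this (a sketch; the path is a placeholder and preload.R would contain the library() calls):

# Rserve.conf (sketch)
port 6311
remote disable
source /opt/rserve/preload.R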
The challenge I have is that every time I call a function from my Java server, a new process is spawned, and it looks like each process requires the full 127 MB of memory. So with 32 GB of RAM, roughly 240 concurrent calls are enough to max out memory usage and bring the server crashing down.
I found this link: Rserve share library code, but it describes exactly what I have already been doing. Any help in understanding how to make Rserve work without loading all libraries for every call would be much appreciated.

IIS holding requests in a queue instead of processing them

I'm executing a load test against an application hosted in Azure. It's a cloud service with 3 instances behind an internal load balancer (hash-based load balancing mode).
When I execute the load test, IIS queues requests even though the req/sec and total current requests to IIS are quite low. I'm not sure what the problem could be.
Any suggestions?
Here are a few screenshots of performance counters which might help you reach a decision.
Edit 1: Per the request from Rohit Rajan,
the cloud service has 2 instances (meaning 2 VMs), each with 14 GB of RAM and 8 cores.
I'm executing a step load pattern, starting with 100 users and adding 100-150 users every 5 minutes over 4-5 hours, until the load reaches 10,000 VUs.
Calls to external systems are written asynchronously; database calls are synchronous.
There is no straightforward answer to your question; one possible way forward is to explore additional investigation options.
Based on your explanation, there seems to be a bottleneck within the application which is causing the requests to queue up.
In order to investigate this, collect a memory dump when you see the requests queuing up and then use DebugDiag to run a hang analysis on it.
There are several ways to gather the memory dump:
Task Manager
Procdump.exe
Debug Diagnostics
Process Explorer
Once you have the memory dump, you can install DebugDiag and run the analysis on it. It will generate a report which can help you get started.
Debug Diagnostics download: https://www.microsoft.com/en-us/download/details.aspx?id=49924
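For example, with ProcDump a full user-mode dump of an IIS worker process can be captured roughly like this (the PID and output path are placeholders; with several application pools, pick the w3wp.exe PID belonging to the affected pool):

procdump.exe -accepteula -ma <w3wp PID> C:\dumps\w3wp_full.dmp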

How to set up cluster slave nodes (on Windows)

I need to run thousands* of models on 15 machines (each with 4 cores), all running Windows. I started to learn the parallel, snow and snowfall packages and read a bunch of intros, but they mainly focus on setting up the master; there is only a little information on how to set up the worker (slave) nodes on Windows, and it is often contradictory: some say that a SOCK cluster is practically the easiest way to go, others claim that SOCK cluster setup is complicated on Windows (sshd setup) and that the best way to go is MPI.
So, what is the easiest way to set up slave nodes on Windows: MPI, PVM, SOCK or NWS? My possibly naive ideas were (listed by priority):
To use all 4 cores on the slave nodes (required).
Ideally, I need only R with some packages and a slave R script or R function that would listen on some port and wait for tasks from master.
Ideally, nodes can be added/removed dynamically from the cluster.
Ideally, the slaves would connect to the master, so I wouldn't have to list all the slaves' IPs in the configuration of the master.
Only 1 is 100% required; 2-4 would be nice to have. Is it too naive a request?
I am sorry, but I have not been able to figure this out from the available docs and tutorials. I would be grateful if you could point me to the right source.
* Note that each of those thousands of models will take at least 7 minutes, so there won't be a big communication overhead.
It's a shame how complex all these APIs (like parallel/snow/snowfall) are to work with: lots of docs, but not what you need... I have found an API which is very simple and goes straight to the ideas I sketched: Redis and the doRedis R package (as recommended here). Finally, a very simple tutorial is available! I just modified it a bit and got this:
The workers need only R, the doRedis package and this script:
require(doRedis)
redisWorker('jobs', '10.0.0.7') # IP of the server
The master needs a running Redis server (I installed the experimental Windows binaries), and this R code:
require(doRedis)
registerDoRedis('jobs')
foreach(j = 1:10, .combine = sum, .multicombine = TRUE) %dopar% {
  ... # whatever you need to run
}
removeQueue('jobs')
Adding/removing workers is fully dynamic, there is no need to specify IPs at the master, you get automatic "load balancing", it is simple, and there is no need for tons of docs! This solution fulfills all the requirements and even more, as stated in ?registerDoRedis:
The doRedis parallel back end tolerates faults among the worker processes and automatically resubmits failed tasks.
I don't know how complex this would be using parallel/snow/snowfall with SOCK/MPI/PVM/NWS, or whether it would be possible at all, but I guess very complex...
The only disadvantages of using redis I found:
It is a database server. I wonder if this API exists somewhere without the need to install the database server, which I don't need at all. I guess it must exist!
There is a bug in the current doRedis package ("object '.doRedisGlobals' not found") with no solution yet, and I am not able to install the old working doRedis 1.0.5 package into R 3.0.1.

Resources