Requests to TensorFlow Serving server get timeout errors for models using the gRPC Python API - grpc

Synopsis
Ten computer-vision models were deployed to a TensorFlow Serving server (TSS) running on Ubuntu 22.04.
The TSS is installed as a system service on a dedicated server with an MSI RTX 3060 12 GB on board. The system configuration and TSS service file are shown below.
Problem
Requests sent via the tensorflow-serving gRPC API randomly get status code DEADLINE_EXCEEDED or UNAVAILABLE, sometimes on the first request but more often after some number (1...4) of successful requests or after some period of inactivity (1 hour or more).
No OOM or service dump happened. GPU memory occupation is near 6 GB. The service logs seem to have no problem indication and no warnings either (debug level 3).
Some experiments and their results are described in detail below.
System
[OS] Ubuntu 22.04.1 LTS
[CPU] 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
[GPU] MSI GeForce RTX 3060 12G
[RAM] 16Gb
[SSD] NVME 1Tb
[Tensorflow] Version 2.9.1
[CUDA] Version 11.7
[CUDNN] Version 8.6.0.163-1+cuda11.8
[TensorRT] Not used while building tensorflow-serving
TSS service file
[Unit]
Description=Tensorflow Serving Service
After=network-online.target
[Service]
User=root
Environment="PATH=/usr/local/bin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/cuda/bin:/usr/bin/model_servers"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="TF_FORCE_GPU_ALLOW_GROWTH=true"
ExecStart=/usr/bin/model_servers/tensorflow_model_server --port=8504 --model_config_file=/mnt/data/models/export/frman.conf
[Install]
WantedBy=multi-user.target
Hints
The TensorFlow Serving service initializes after the network on the host is available.
The service is configured to allocate GPU memory only when needed (environment variable TF_FORCE_GPU_ALLOW_GROWTH=true).
Hypotheses and actions
The problem is lost packets on the network
Requests to the TSS were monitored with Wireshark on the client side and with the GNOME System Monitor on the server side. No problem was detected.
The timeout value for a single request on the client side of the tensorflow_serving.apis.prediction_service_pb2_grpc.PredictionServiceStub object was increased:
stub.Predict(request, timeout * len(images))
The gRPC channel is checked for readiness before data transmission begins.
Interceptors for gRPC requests were added. The procedure retries the request with exponential backoff, but it nevertheless randomly returned status code DEADLINE_EXCEEDED or UNAVAILABLE.
import grpc

# RetryOnRpcErrorClientInterceptor, ExponentialBackoff and CompressionLevel are
# user-defined helpers from the client code (a sketch of the retry helpers follows below).
options_grpc = [
    ('grpc.max_send_message_length', 100 * 1024 * 1024),
    ('grpc.max_receive_message_length', 100 * 1024 * 1024),
    ('grpc.default_compression_algorithm', grpc.Compression.Gzip),
    ('grpc.default_compression_level', CompressionLevel.high),
]
interceptors = (
    RetryOnRpcErrorClientInterceptor(
        max_attempts=5,
        sleeping_policy=ExponentialBackoff(init_backoff_ms=1000, max_backoff_ms=32000, multiplier=2),
        status_for_retry=(grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNKNOWN, grpc.StatusCode.UNAVAILABLE),
    ),
)
channel = grpc.insecure_channel(f"{host}:{port}", options=options_grpc)
grpc.channel_ready_future(channel).result(timeout=5)
channel = grpc.intercept_channel(channel, *interceptors)
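The RetryOnRpcErrorClientInterceptor and ExponentialBackoff helpers used above are not shown in the post; a minimal sketch of how such helpers are commonly written (an illustration, not the exact code used here):

import random
import time

import grpc


class ExponentialBackoff:
    # Sleeping policy: exponential backoff with a small random jitter.
    def __init__(self, *, init_backoff_ms, max_backoff_ms, multiplier):
        self.init_backoff_ms = init_backoff_ms
        self.max_backoff_ms = max_backoff_ms
        self.multiplier = multiplier

    def sleep(self, attempt):
        backoff_ms = min(self.init_backoff_ms * self.multiplier ** attempt, self.max_backoff_ms)
        time.sleep((backoff_ms + random.uniform(0, self.init_backoff_ms)) / 1000.0)


class RetryOnRpcErrorClientInterceptor(grpc.UnaryUnaryClientInterceptor):
    # Retries a unary-unary RPC when it fails with one of the given status codes.
    def __init__(self, *, max_attempts, sleeping_policy, status_for_retry):
        self.max_attempts = max_attempts
        self.sleeping_policy = sleeping_policy
        self.status_for_retry = status_for_retry

    def intercept_unary_unary(self, continuation, client_call_details, request):
        response = None
        for attempt in range(self.max_attempts):
            response = continuation(client_call_details, request)
            if not isinstance(response, grpc.RpcError):
                return response                      # success
            if response.code() not in self.status_for_retry:
                return response                      # non-retryable error
            self.sleeping_policy.sleep(attempt)      # retryable: back off and try again
        return response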
The problem is in the models themselves
It was noticed that the problems mostly arise with models based on the U2Net architecture. U2Net uses some custom operations, and it was assumed that the first request times out because loading these custom ops takes too long.
This was found in the TSS service log file. To resolve it we tried the following:
Add a warm-up for such models at service startup, so that all custom network operations are loaded into memory before inference.
Eliminate the custom operations in U2Net by converting the models to ONNX format and then back to the TensorFlow SavedModel format, so that the TSS no longer needs to load custom network ops at startup. Model warm-up was added as well.
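The post does not show how the warm-up was added. TensorFlow Serving's documented mechanism is a tf_serving_warmup_requests TFRecord file of PredictionLog records placed under assets.extra in the model version directory. A minimal sketch of writing such a file, assuming a model named "u2net", version 1, and an input tensor "input_1" of shape [1, 320, 320, 3] (all of these names and shapes are assumptions):

import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

# Path inside the exported model version directory (model name and version are assumptions).
warmup_path = "/mnt/data/models/export/u2net/1/assets.extra/tf_serving_warmup_requests"

request = predict_pb2.PredictRequest(
    model_spec=model_pb2.ModelSpec(name="u2net", signature_name="serving_default"),
    inputs={"input_1": tf.make_tensor_proto(              # assumed input tensor name
        np.zeros((1, 320, 320, 3), dtype=np.float32))},   # dummy image batch of one
)

with tf.io.TFRecordWriter(warmup_path) as writer:
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))
    writer.write(log.SerializeToString())

With this file in place, the model server replays the recorded request at model load time, so the first real request no longer pays the warm-up cost.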
The problem is a lack of GPU memory. Another alarming message was noticed in the TSS service log:
tensorflow_model_server[62029]: 2022-12-20 14:10:13.900599: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:360] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
It looks like there is not enough memory for inference on the GPU. To address this we tried limiting the image batch size to one image per request to the TSS and setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true.
Memory consumption did not increase after that, but the random timeout (DEADLINE_EXCEEDED) errors did not disappear.
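For reference, limiting the batch size to one on the client side simply means sending each image in its own PredictRequest. A minimal sketch, reusing the intercepted channel from the snippet above and assuming hypothetical model/tensor names:

import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

def predict_one_by_one(images, timeout_s=30.0):
    # Send each image as its own single-item batch instead of one large batched request.
    responses = []
    for image in images:  # image: HxWxC float32 numpy array
        request = predict_pb2.PredictRequest()
        request.model_spec.name = "u2net"                   # assumed model name
        request.model_spec.signature_name = "serving_default"
        request.inputs["input_1"].CopyFrom(                 # assumed input tensor name
            tf.make_tensor_proto(image[np.newaxis, ...]))   # add a batch dimension of 1
        responses.append(stub.Predict(request, timeout=timeout_s))
    return responses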
Conclusion - Issue NOT Resolved
Thus, the problem is still in place, especially when the TSS runs inference with segmentation models (like U2Net).
The root cause of the problem was not found.
The error is difficult to reproduce because of its random nature.
What else would be worth checking or configuring to resolve the issue?

Related

Corda Node getting JVM OutOfMemory Exception when trying to load data

Background:
We are trying to load data into our custom CorDapp (Corda 3.1) using JMeter.
Our CorDapp is distributed across six nodes (three parties, two notaries and one oracle).
The flow being executed to load data has very minimal business logic, has three participants and requires two parties to sign the transaction.
Below is the environment, configuration and test details:
Server: Ubuntu 16.04
Ram: 8GB
Memory allocation to Corda.jar: 4GB
Memory allocation to Corda-webserver.jar : 1GB
JMeter Configuration- Threads(Users): 20 (1 transaction per second per thread)
Result:
Node B crashed after approximately 21000 successful transactions (in approximately 3 hours and 30 minutes) with "java.lang.OutOfMemoryError: Java heap space". After some time, the other nodes crashed due to continuous "handshake error" with Node B.
We analyzed heap dump using Eclipse MAT and found out that more than 21000 instances of hibernate SessionFactoryImpl were created which occupied more than 85% of the memory on Node B.
We need to understand why Corda network is creating so many objects and persisting them in memory.
We are continuing our investigation, as we are not 100% sure whether this is entirely a Corda bug.
Solution to the problem is critical in our pursuit to continue further tests.
Note - We have more details about our investigation but we are unable to attach them here but can send over email.
If you're developing in Java, it is likely that the issue you're encountering has already been fixed by https://r3-cev.atlassian.net/browse/CORDA-1411
The fix is not available in Corda 3.1 yet, but the JIRA ticket provides a workaround. You need to override equals and hashCode on any subclasses of MappedSchema that you've defined. This should fix the issue you're observing.

High memory usage in OpenCPU

R requires CPU more than anything else, so it is recommended to pick one of the newer generation compute-optimized instance types, preferably with an SSD disk.
I've recently run into a problem with high memory usage (quickly rising to 100%) during load testing. To reproduce: there is an R package whose processing time is up to 0.2 s under no-stress conditions. If I query one of the endpoints using curl with 1000 JSONs on 3 machines in parallel, all of the memory is suddenly used, which results in 'cannot fork' or:
cannot popen '/usr/bin/which 'uname' 2>/dev/null', probable reason 'Cannot allocate memory' In call: system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE)
The setup is 2x AWS 8 GB CPU-optimized servers plus a load balancer, all in a private network. HTTPS is enabled, and my main usage is online processing of requests, so I'm mostly querying the /json endpoints.
Do you happen to have any suggestions on how to approach this issue? The plan is to have more packages installed (more online processes requesting results from various functions), and I don't want to end up needing 32 GB of RAM per box.
All of the packages are deployed with such options:
LazyData: false
LazyLoad: false
They are also added into serverconf.yml.j2 - preload section.
RData files are loaded within an onLoad function by calling utils::data.
Also, keeping in mind that I'm using OpenCPU without GitHub and only one-way communication (from the backend to the ocpu box), which options do you suggest turning on/optimizing? It's not clearly stated in the docs yet.
It mostly depends on which packages you are using and what you are doing. Can you run the same functionality that you are invoking through OpenCPU locally (on the command line) without running out of memory?
Apache2 prefork creates worker processes to handle concurrent requests. Each of these workers contains an R process with all preloaded packages. So if one request takes 500 MB, the total memory consumption on the server is n * 500, where n is the number of workers that are loaded.
Depending on how many concurrent requests you expect, you could try lowering StartServers or MaxRequestWorkers in your apache2 config.
Also try raising (or lowering) the option rlimit.as in the file /etc/opencpu/server.conf which limits the amount of memory (address space) a single process is allowed to consume.

Increase RAM usage for IIS server

I am running a large-scale ERP system on the following server configuration. The application is developed using AngularJS and ASP.NET 4.5.
Hardware: Dell PowerEdge R730 (quad-core 2.7 GHz, 32 GB RAM, 5 x 500 GB hard disks, RAID 5 configured). Software: the host OS is VMware ESXi 6.0, with two VMs running on it. One is Windows Server 2012 R2 with 16 GB of memory allocated; this contains the IIS 8 server with my application code. The other VM is also Windows Server 2012 R2, with SQL Server 2012 and 16 GB of memory allocated; this just contains my application database.
You see, I separated the application server and database server for load balancing purposes.
My application contains a registration module where the load is expected to be very high (around 10,000 visitors over 10 minutes).
To support this volume of requests, I have done the following on my IIS server: increased the request queue length of each application pool to 5000; enabled output caching for aspx files; enabled static and dynamic compression in the IIS server; set the virtual memory limit and private memory limit of each application pool to 0; and increased the maximum worker processes of each application pool to 6.
I then used gatling to run load testing on my application. I injected 500 users at once into my registration module.
However, I see that only 40-45% of my RAM is being used. Each worker process is using only about 130 MB at most.
And Gatling is reporting that around 20% of my requests are getting a 403 error, and more than 60% of all HTTP requests have a response time greater than 20 seconds.
A single user makes 380 HTTP requests over a span of around 3 minutes. The total data transfer of a single user is 1.5 MB. I have simulated 500 users like this.
Is there anything missing in my server tuning? I have already tuned my application code to minimize memory leaks, increase timeouts, and so on.
There is a known issue with the newest generation of PowerEdge servers that use the Broadcom network chipset. Apparently, the "VM" feature for the network is broken, which results in horrible network latency on VMs.
Head to Dell and get the most recent firmware and Windows drivers for the Broadcom.
Head to VMWare Downloads and get the latest Broadcom Driver
As for the worker process settings, for maximum performance, you should consider running the same number of worker processes as there are NUMA nodes, so that there is 1:1 affinity between the worker processes and NUMA nodes. This can be done by setting "Maximum Worker Processes" AppPool setting to 0. In this setting, IIS determines how many NUMA nodes are available on the hardware and starts the same number of worker processes.
I guess the one caveat to the answer you received would be that if your server isn't NUMA-aware / uses symmetric processing, you won't see those IIS options under CPU, but the above poster seems to know a good bit more than I do about the machine. Sorry I don't have enough street cred to add this as a comment. As far as IIS goes, you may also want to make sure your app pool doesn't use the default recycle conditions and pick a time like midnight for recycling. If you have root-level settings applied, the default app pool recycling at 29 hours may also trigger garbage collection against your child pool, causing delays even in concurrent GC, where it sounds like you may benefit a bit from gcServer=true. Pretty tough to assess that, though.
Has your SQL Server been optimized for that type of workload? If your data isn't paramount, you could squeeze faster execution times with delayed durability, then assess queries that are returning too much info for async I/O wait types. In general, there's not enough here to really assess SQL optimizations, but if it is not configured right (size/growth options), you could be hitting a lot of timeouts due to growth, VLF fragmentation, etc.

Machine <IP_address> has been started with not enough memory

I am using Cloudify 2.7 with OpenStack Icehouse.
I developed a Tomcat recipe and deployed it. In the orchestrator log of the Cloudify console, I read the following WARNING:
2015-06-04 11:05:01,706 ESM INFO [org.openspaces.grid.gsm.strategy.ScaleStrategyProgressEventState] - [tommy.tomcat] machines SLA enforcement is in progress.; Caused by: org.openspaces.grid.gsm.machines.exceptions.ExpectedMachineWithMoreMemoryException: Machines SLA Enforcement is in progress: Expected machine with more memory. Machine <Public_IP>/<Public_IP> has been started with not enough memory. Actual total memory is 995MB. Which is less than (reserved + container) = (0MB+3800MB) = 3800MB
The Flavor of the VM is: 4GB RAM, 2vCPU, 20GB Disk
Into the cloud driver I commented the following line:
//reservedMemoryCapacityPerMachineInMB 1024
and configured the compute section related to the flavor as follows:
computeTemplate
{
imageId <imageID>
machineMemoryMB 3900
hardwareId <hardwareId>
...
}
Can someone help me pinpoint the error?
Thanks.
The error message states that the actual available memory is only 995 MB, which is considerably less than the expected 4 GB. To clarify:
Do you run multiple services on the same machine?
Maybe the VM really has less memory than expected. Please run 'cat /proc/meminfo' on the started VM to verify the exact amount of memory it has.
In principle, you should not comment out any setting of reserved memory, because Cloudify must take that into account - this setting is supposed to represent the memory used by the OS and other processes. Additionally, the orchestrator (ESM) takes into account ~100 MB for Cloudify to run freely.
So, please update machineMemoryMB to the value calculated as follows:
(the number returned by 'cat /proc/meminfo') - 1024 - 100
For example, if /proc/meminfo reports roughly 3950 MB of total memory, machineMemoryMB should be set to about 3950 - 1024 - 100 = 2826.

LoadRunner - Monitoring linux counters gives RPC error

The Linux distribution is Red Hat. I'm monitoring Linux counters with the LoadRunner Controller's System Resources Graphs - Unix Resources. Monitoring works properly and graphs are plotted in real time, but after a few minutes errors appear:
Monitor name :UNIX Resources. Internal rpc error (error code:2).
Machine: 31.2.2.63. Hint: Check that RPC on this machine is up and running.
Check that rstat daemon on this machine is up and running
(use rpcinfo utility for this verification).
Details: RPC: RPC call failed.
RPC-TCP: recv()/recvfrom() failed.
RPC-TCP: Timeout reached. (entry point: Factory::CollectData).
[MsgId: MMSG-47197]
I logged on to the Linux server and found that rstatd is still running. After clearing the measurements in the Controller's Unix Resources and adding them again, monitoring started working again, but after a few minutes the same error occurred.
What might cause this error? Is it due to network traffic?
Consider using SiteScope, which has been the preferred monitoring foundation for collecting UNIX/Linux status since version 8.0 of LoadRunner. Every LoadRunner license since version 8 has come with a 500-point SiteScope license in the box for this purpose. More points are available upon request for exclusive use of the instance for testing.
