Machine <IP_address> has been started with not enough memory - openstack

I am using Cloudify 2.7 with OpenStack Icehouse.
I developed a tomcat recipe and deployed it. In the orchestrator log of the cloudify console, I read the following WARNING:
2015-06-04 11:05:01,706 ESM INFO [org.openspaces.grid.gsm.strategy.ScaleStrategyProgressEventState] - [tommy.tomcat] machines SLA enforcement is in progress.; Caused by: org.openspaces.grid.gsm.machines.exceptions.ExpectedMachineWithMoreMemoryException: Machines SLA Enforcement is in progress: Expected machine with more memory. Machine <Public_IP>/<Public_IP> has been started with not enough memory. Actual total memory is 995MB. Which is less than (reserved + container) = (0MB+3800MB) = 3800MB
The Flavor of the VM is: 4GB RAM, 2vCPU, 20GB Disk
In the cloud driver I commented out the following line:
//reservedMemoryCapacityPerMachineInMB 1024
and configured the compute template for that flavor as follows:
computeTemplate {
    imageId <imageID>
    machineMemoryMB 3900
    hardwareId <hardwareId>
    ...
}
Can someone help me pinpoint the error?
Thanks.

The error message states that the actual available memory is only 995MB, which is considerably less than the expected 4GB. To clarify:
Do you run multiple services on the same machine?
Maybe the VM really has less memory than expected. Please run 'cat /proc/meminfo' on the started VM to verify exactly how much memory it has.
In principle, you should not comment out the reserved-memory setting, because Cloudify must take it into account: it is meant to represent the memory used by the OS and other processes. Additionally, the orchestrator (ESM) reserves roughly 100 MB for Cloudify itself to run freely.
So please set machineMemoryMB to the value calculated this way:
(the total memory returned by 'cat /proc/meminfo') - 1024 - 100
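For example, something along these lines run on the started VM gives that value; this is only a minimal sketch of the calculation above (the function name is ours, not part of any Cloudify recipe), assuming the standard /proc/meminfo format:

def recommended_machine_memory_mb(meminfo_path="/proc/meminfo",
                                  reserved_mb=1024, cloudify_overhead_mb=100):
    # MemTotal minus the 1024 MB reserved for the OS minus ~100 MB kept by the ESM for Cloudify.
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                total_kb = int(line.split()[1])  # /proc/meminfo reports values in kB
                break
        else:
            raise RuntimeError("MemTotal not found in " + meminfo_path)
    return total_kb // 1024 - reserved_mb - cloudify_overhead_mb

if __name__ == "__main__":
    print(recommended_machine_memory_mb())  # use this value for machineMemoryMB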

Related

Request to Tensorflow Serving Server got timeout error for models using grpc python API

Synopsis
Ten computer-vision models were deployed to a TensorFlow Serving server (TSS) running on Ubuntu 22.04.
TSS is installed as a system service on a dedicated server with an MSI RTX 3060 12 GB on board. The system configuration and TSS service file are below.
Problem
Requests sent via the tensorflow-serving gRPC API randomly get status code DEADLINE_EXCEEDED or UNAVAILABLE, sometimes on the first request but more often after some number (1...4) of successful requests or after a period of inactivity (1 hour or more).
No OOM or service crash dump occurred. GPU memory usage is around 6 GB. The service logs show no indication of a problem and no warnings either (debug level 3).
Experiments and results are detailed below.
System
[OS] Ubuntu 22.04.1 LTS
[CPU] 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
[GPU] MSI GeForce RTX 3060 12G
[RAM] 16Gb
[SSD] NVME 1Tb
[Tensorflow] Version 2.9.1
[CUDA] Version 11.7
[CUDNN] Version 8.6.0.163-1+cuda11.8
[TensorRT] Not used while building tensorflow-serving
TSS service file
[Unit]
Description=Tensorflow Serving Service
After=network-online.target
[Service]
User=root
Environment="PATH=/usr/local/bin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/cuda/bin:/usr/bin/model_servers"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="TF_FORCE_GPU_ALLOW_GROWTH=true"
ExecStart=/usr/bin/model_servers/tensorflow_model_server --port=8504 --model_config_file=/mnt/data/models/export/frman.conf
[Install]
WantedBy=multi-user.target
Hints
The TensorFlow Serving service initializes after the network on the host is available.
The service is configured to allocate GPU memory only as needed (environment variable TF_FORCE_GPU_ALLOW_GROWTH=true).
Hypotheses and actions
The problem is lost packets on the network
Requests to TSS were monitored with Wireshark on the client side and with GNOME System Monitor on the server side. No problem was detected.
The timeout for a single request on the client side (on the tensorflow_serving.apis.prediction_service_pb2_grpc.PredictionServiceStub object) was increased:
stub.Predict(request, timeout * len(images))
The gRPC channel is checked for readiness before data transmission begins.
Interceptors for gRPC requests were added. They retry a failed request with exponential backoff, but the calls still randomly return status code DEADLINE_EXCEEDED or UNAVAILABLE.
options_grpc = [
    ('grpc.max_send_message_length', 100 * 1024 * 1024),
    ('grpc.max_receive_message_length', 100 * 1024 * 1024),
    ('grpc.default_compression_algorithm', grpc.Compression.Gzip),
    ('grpc.default_compression_level', CompressionLevel.high),
]
interceptors = (
    # RetryOnRpcErrorClientInterceptor and ExponentialBackoff are our own helper classes.
    RetryOnRpcErrorClientInterceptor(
        max_attempts=5,
        sleeping_policy=ExponentialBackoff(init_backoff_ms=1000, max_backoff_ms=32000, multiplier=2),
        status_for_retry=(grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNKNOWN, grpc.StatusCode.UNAVAILABLE),
    ),
)
channel = grpc.insecure_channel(f"{host}:{port}", options=options_grpc)
grpc.channel_ready_future(channel).result(timeout=5)      # wait until the channel is ready
channel = grpc.intercept_channel(channel, *interceptors)
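For completeness, this is roughly how the stub built on the intercepted channel is then called with an explicit per-call deadline. It is only a sketch: the model name, input key, and tensor shape are placeholders, and wait_for_ready is our own addition rather than part of the original setup.

import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# 'channel' is the intercepted channel created above.
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = "u2net"                  # hypothetical name; must match an entry in frman.conf
request.inputs["input_image"].CopyFrom(            # hypothetical input key and shape
    tf.make_tensor_proto(tf.zeros([1, 320, 320, 3], tf.float32)))
response = stub.Predict(
    request,
    timeout=30.0,          # explicit per-call deadline in seconds
    wait_for_ready=True,   # queue the RPC until the channel is connected instead of failing fast
)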
The problem is in the models themselves
It was noticed that the problems mostly arise with models based on the U2-Net architecture. U2-Net uses some custom operations, and it was assumed that the first request times out because loading these custom ops takes too long.
This was seen in the TSS service log file. To resolve it we tried:
Adding a warm-up for these models at service startup, so all custom network operations are loaded into memory before inference (see the warm-up sketch after this list).
Eliminating the custom operations in U2-Net by converting the models to ONNX format and then back to TensorFlow SavedModel format, so TSS no longer needs to load custom ops at startup. Model warm-up was added here as well.
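By "warm-up" we mean TensorFlow Serving's SavedModel warm-up file: a TFRecord of PredictionLog entries placed in the model version's assets.extra directory, which the server replays when loading the model. A minimal sketch of writing one (the path, model name, input key, and shape are placeholders):

import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

# Write assets.extra/tf_serving_warmup_requests next to the SavedModel version.
warmup_path = "/mnt/data/models/export/u2net/1/assets.extra/tf_serving_warmup_requests"  # placeholder path
request = predict_pb2.PredictRequest()
request.model_spec.name = "u2net"                  # placeholder model name
request.inputs["input_image"].CopyFrom(            # placeholder input key and shape
    tf.make_tensor_proto(tf.zeros([1, 320, 320, 3], tf.float32)))
with tf.io.TFRecordWriter(warmup_path) as writer:
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))
    writer.write(log.SerializeToString())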
The problem is a lack of GPU memory - another alarming message was noticed in the TSS service log:
tensorflow_model_server[62029]: 2022-12-20 14:10:13.900599: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:360] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
It looks like there is not enough memory for inference on the GPU. To address this we tried limiting the batch size to one image per request for TSS and setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true.
Memory consumption did not increase after that, but the random timeout (DEADLINE_EXCEEDED) errors did not disappear.
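For clarity, "limiting the batch size" was done on the client side by sending one image per PredictRequest instead of one batched request, roughly like this (build_request is a hypothetical helper wrapping the PredictRequest construction shown above; stub and images come from the surrounding client code):

# Send one image per request instead of a whole batch, to reduce GPU memory pressure.
responses = []
for image in images:
    request = build_request(image)                        # hypothetical helper, one image per request
    responses.append(stub.Predict(request, timeout=30.0))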
Conclusion - Issue NOT Resolved
Thus, the problem is still present, especially when TSS runs inference on segmentation models (like U2-Net).
The root cause of the problem has not been found.
The error is difficult to reproduce because of its random nature.
What else would be worth checking or configuring to resolve the issue?

Debugging poor I/O performance on OpenStack block device (OpenStack kolla:queen)

I have an OpenStack VM that is getting really poor performance on its root disk - less than 50MB/s writes. My setup is 10 GbE, OpenStack deployed using kolla, the Queen release, with storage on Ceph. I'm trying to follow the path through the infrastructure to identify where the performance bottleneck is, but getting lost along the way:
nova show lets me see which hypervisor (an Ubuntu 16.04 machine) the VM is running on but once I'm on the hypervisor I don't know what to look at. Where else can I look?
Thank you!
My advice is to check the performance between the host (hypervisor) and Ceph first. If you are able to create a Ceph block device, you can map it with the rbd command, create a filesystem, and mount it; then you can measure the device I/O performance with sysstat, iostat, iotop, dstat, vmstat, or even sar.
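As a crude alternative to those tools, a simple sequential-write test like the sketch below, run first against the RBD filesystem mounted on the hypervisor and then inside the VM, makes the comparison easy (the mount point is a placeholder):

import os
import time

PATH = "/mnt/rbdtest/bench.tmp"       # placeholder: the mounted RBD filesystem
BLOCK = 4 * 1024 * 1024               # 4 MiB per write
TOTAL = 1024 * 1024 * 1024            # 1 GiB in total

buf = os.urandom(BLOCK)
start = time.time()
with open(PATH, "wb") as f:
    written = 0
    while written < TOTAL:
        f.write(buf)
        written += BLOCK
    f.flush()
    os.fsync(f.fileno())              # force the data out to Ceph, not just the page cache
elapsed = time.time() - start
print(f"sequential write: {TOTAL / elapsed / 1024 / 1024:.1f} MB/s")
os.remove(PATH)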

"Cannot allocate memory" when starting new Flink job

We are running Flink on a 3-VM cluster. Each VM has about 40 GB of RAM. Each day we stop some jobs and start new ones. After some days, starting a new job is rejected with a "Cannot allocate memory" error:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000340000000, 12884901888, 0) failed; error='Cannot allocate memory' (errno=12)
Investigations show that the task manager's RAM keeps growing, to the point that it exceeds the allowed 40 GB, even though the jobs are cancelled.
I don't have access (yet) to the cluster so I tried some tests on a standalone cluster on my laptop and monitored the task manager RAM:
With jvisualvm I can see everything working as intended: I load up the job's memory, then clean it and wait (a few minutes) for the GC to kick in. The heap is released.
Whereas with top, memory is, and stays, high.
At the moment we are restarting the cluster every morning to account for this memory issue, but we can't afford it anymore as we'll need jobs running 24/7.
I'm pretty sure it's not a Flink issue but can someone point me in the right direction about what we're doing wrong here?
In standalone mode, Flink may not release resources as you would wish.
For example, resources held by static members of an instance.
It is highly recommended to use YARN or Kubernetes as the runtime environment.

NuoDB memory and CPU usage reaching very high levels

While accessing a NuoDB database from a Java application, the Task Manager tool shows CPU and memory usage reaching almost 99%. I tried NuoDB versions 2.4, 2.5, and 2.6, but I get the same issue with all of them.
My current Windows server hardware configuration is below:
RAM: 12 GB (3 processors)
Hard disk: 100 GB
Please give any suggestions on how to overcome this issue.
Thanks in advance
I see from Task Manager that MANY "NuoDB Server" processes are running
(the picture shows 7 NuoDB processes running on that server).
The problem may be that too many TEs are running, or that too much memory is configured for NuoDB on that single server, or that NuoDB is set up incorrectly.
The following link can help you understand how to check your system settings.
http://doc.nuodb.com/Latest/Default.htm#Mgr-Show-Domain.htm?Highlight=--memory

PHP memory exhausted with the limit at 4GB but not 2GB?

I'm a developer in a large company that has some legacy code that requires a very large amount of memory in its export functions. To address this, ini_set('memory_limit', '4G'); is used.
The problem is that the script crashes with memory exhaustion. If I set the limit to 2G, the script runs to the end; it doesn't even reach 1GB of peak memory usage.
Since the code is versioned and shared with the rest of the company I can't change the limit and changing it on my local install is cumbersome.
My question is: what can make a script crash with a 4GB limit but not a 2GB one?
PS: my setup is a VirtualBox machine running Debian with nginx and php-fpm. The VM has 4GB of RAM (although changing this doesn't seem to make any difference).
[update]
Created a new virtual machine with a 64-bit operating system; if I set the VM memory to 2GB it works (if I use 4GB it doesn't).
Since I'm OK with 2GB, I'll close this issue.
It is a natural limitation: the 2 or even 4 GB of address space is also used for file mappings, which take up some memory pages.
The ultimate solution would be to use a 64-bit PHP interpreter (i.e., switch to a 64-bit system, if possible).
Maybe you are on a 32-bit system?
Well if your VM only has 4GB, then you probably should give it more memory.
On a 32-bit system, 4GB is the limit of the memory address space. I guess there can be memory violations when PHP tries to get 4GB of memory.
