Corda Node getting JVM OutOfMemory Exception when trying to load data

Background:
We are trying to load data into our custom CorDapp (Corda 3.1) using JMeter.
Our CorDapp is distributed across six nodes (three parties, two notaries and one oracle).
The flow executed to load the data has very minimal business logic, three participants, and requires two parties to sign the transaction.
Below is the environment, configuration and test details:
Server: Ubuntu 16.04
Ram: 8GB
Memory allocation to Corda.jar: 4GB
Memory allocation to Corda-webserver.jar : 1GB
JMeter Configuration- Threads(Users): 20 (1 transaction per second per thread)
Result:
Node B crashed after approximately 21,000 successful transactions (in roughly 3 hours and 30 minutes) with "java.lang.OutOfMemoryError: Java heap space". Some time later, the other nodes crashed due to continuous handshake errors with Node B.
We analyzed the heap dump using Eclipse MAT and found that more than 21,000 instances of Hibernate's SessionFactoryImpl had been created, occupying more than 85% of the memory on Node B.
We need to understand why the Corda node is creating so many of these objects and keeping them in memory.
We are continuing our investigation, as we are not 100% sure whether this is entirely a Corda bug.
A solution to this problem is critical for us to continue further testing.
Note: we have more details about our investigation that we cannot attach here, but we can send them by email.

If you're developing in Java, it is likely that the issue you're encountering is the one addressed by https://r3-cev.atlassian.net/browse/CORDA-1411
The fix is not yet available in Corda 3.1, but the JIRA ticket describes a workaround: override equals and hashCode on any MappedSchema subclasses you have defined. This should fix the issue you're observing.

Related

Requests to a TensorFlow Serving server get timeout errors for models when using the gRPC Python API

Synopsis
Ten computer-vision models were deployed to a TensorFlow Serving server (TSS) running on Ubuntu 22.04.
The TSS is installed as a system service on a dedicated server with an MSI RTX 3060 12 GB on board. The system configuration and TSS service file are below.
Problem
Requests sent via the tensorflow-serving gRPC API randomly get status code DEADLINE_EXCEEDED or UNAVAILABLE, sometimes on the first request but more often after some number (1...4) of successful requests or after a period of inactivity (1 hour or more).
No OOM or service crash dump occurred. GPU memory occupation is near 6 GB. The service logs show no problem indications and no warnings (debug level 3).
Experiments and results are detailed below.
System
[OS] Ubuntu 22.04.1 LTS
[CPU] 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
[GPU] MSI GeForce RTX 3060 12G
[RAM] 16 GB
[SSD] NVMe 1 TB
[Tensorflow] Version 2.9.1
[CUDA] Version 11.7
[CUDNN] Version 8.6.0.163-1+cuda11.8
[TensorRT] Not used while building tensorflow-serving
TSS service file
[Unit]
Description=Tensorflow Serving Service
After=network-online.target
[Service]
User=root
Environment="PATH=/usr/local/bin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/cuda/bin:/usr/bin/model_servers"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="TF_FORCE_GPU_ALLOW_GROWTH=true"
ExecStart=/usr/bin/model_servers/tensorflow_model_server --port=8504 --model_config_file=/mnt/data/models/export/frman.conf
[Install]
WantedBy=multi-user.target
Hints
The TensorFlow Serving service initializes only after the network on the host is available.
The service is configured to allocate GPU memory as needed (environment variable TF_FORCE_GPU_ALLOW_GROWTH=true).
Hypotheses and actions
The problem is lost packets on the network
Requests to the TSS were monitored with Wireshark on the client side and with GNOME System Monitor on the server side. No problems were detected.
The timeout value for a single request on the client-side tensorflow_serving.apis.prediction_service_pb2_grpc.PredictionServiceStub object was increased:
stub.Predict(request, timeout * len(images))
The gRPC channel is checked for readiness before data transmission begins.
Interceptors for gRPC requests were added. They retry the request with exponential backoff, but the call still randomly returns status code DEADLINE_EXCEEDED or UNAVAILABLE.
import grpc

# RetryOnRpcErrorClientInterceptor, ExponentialBackoff and CompressionLevel are
# custom client-side helpers defined elsewhere in our code (not part of grpc itself).
options_grpc = [
    ('grpc.max_send_message_length', 100 * 1024 * 1024),
    ('grpc.max_receive_message_length', 100 * 1024 * 1024),
    ('grpc.default_compression_algorithm', grpc.Compression.Gzip),
    ('grpc.default_compression_level', CompressionLevel.high),
]

interceptors = (
    RetryOnRpcErrorClientInterceptor(
        max_attempts=5,
        sleeping_policy=ExponentialBackoff(init_backoff_ms=1000, max_backoff_ms=32000, multiplier=2),
        status_for_retry=(grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNKNOWN, grpc.StatusCode.UNAVAILABLE),
    ),
)

channel = grpc.insecure_channel(f"{host}:{port}", options=options_grpc)
# Wait up to 5 seconds for the channel to become ready before sending any data.
grpc.channel_ready_future(channel).result(timeout=5)
channel = grpc.intercept_channel(channel, *interceptors)
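For reference, a minimal usage sketch of how the intercepted channel above can be combined with an explicit per-call deadline; the model name "u2net", signature "serving_default", input tensor key "input", and the per-image deadline budget are assumptions, not values from our setup.

import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)  # channel from the snippet above

images = np.zeros((1, 320, 320, 3), dtype=np.float32)  # placeholder batch; real images go here
per_image_timeout = 10.0                               # assumed per-image deadline budget, in seconds

request = predict_pb2.PredictRequest()
request.model_spec.name = "u2net"                       # assumed model name from frman.conf
request.model_spec.signature_name = "serving_default"   # assumed signature
request.inputs["input"].CopyFrom(tf.make_tensor_proto(images))  # assumed input tensor key

# Deadline scales with batch size, mirroring stub.Predict(request, timeout * len(images)) above.
response = stub.Predict(request, per_image_timeout * len(images))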
The problem is in the models themselves
It was noticed that problems mostly arise with models based on the U2Net architecture. U2Net uses some custom operations, and it was assumed that the first request times out because loading these custom ops takes too long.
That was found in the TSS service log file. To resolve this we tried:
Adding a warm-up for these models at service startup, so all custom network operations are loaded into memory before inference (see the warm-up sketch after this list).
Eliminating the custom operations in U2Net by converting the models to ONNX format and then back to TensorFlow SavedModel format, so the TSS no longer needs to load custom ops at startup (see the conversion sketch below). Model warm-up was added as well.
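A minimal warm-up sketch, assuming TF Serving's SavedModel warm-up mechanism: the server replays the PredictionLog records stored in <model_base_path>/<version>/assets.extra/tf_serving_warmup_requests when it loads the model, so the one-off initialization cost is paid at startup rather than on the first client request. The model name "u2net", signature "serving_default", input key "input", input shape, and the path under /mnt/data/models/export are assumptions.

import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

# Assumed location: <model_base_path>/<version>/assets.extra/tf_serving_warmup_requests
warmup_path = "/mnt/data/models/export/u2net/1/assets.extra/tf_serving_warmup_requests"

with tf.io.TFRecordWriter(warmup_path) as writer:
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "u2net"                      # assumed model name from frman.conf
    request.model_spec.signature_name = "serving_default"  # assumed signature
    dummy = np.zeros((1, 320, 320, 3), dtype=np.float32)   # one representative (dummy) input
    request.inputs["input"].CopyFrom(tf.make_tensor_proto(dummy))  # assumed input tensor key
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))
    writer.write(log.SerializeToString())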
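A round-trip conversion sketch, assuming the tf2onnx and onnx-tf packages and a placeholder export directory named u2net_saved_model: the model is exported to ONNX and then rebuilt as a SavedModel composed of standard TensorFlow ops, so TF Serving has no custom op libraries to load.

import subprocess
import sys

import onnx
from onnx_tf.backend import prepare

# Step 1: SavedModel -> ONNX via the tf2onnx command-line converter.
subprocess.run(
    [sys.executable, "-m", "tf2onnx.convert",
     "--saved-model", "u2net_saved_model",   # placeholder input directory
     "--output", "u2net.onnx",
     "--opset", "13"],
    check=True,
)

# Step 2: ONNX -> SavedModel built from standard TF ops (no custom ops to load at startup).
onnx_model = onnx.load("u2net.onnx")
prepare(onnx_model).export_graph("u2net_tf_standard")  # directory that TF Serving can serve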
The problem is a lack of GPU memory - another alarming message was noticed in the TSS service log:
tensorflow_model_server[62029]: 2022-12-20 14:10:13.900599: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:360] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
It looks like there is not enough GPU memory for inference. To address this we tried limiting the image batch size to one image per request to the TSS and setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true.
Memory consumption no longer increased after that, but the random timeout (DEADLINE_EXCEEDED) errors did not disappear.
Conclusion - Issue NOT Resolved
Thus, the problem is still present, especially when the TSS runs inference with segmentation models (like U2Net).
The root cause of the problem has not been found.
The error is difficult to reproduce because of its random nature.
What else would be worth checking or configuring to resolve the issue?

Unable to schedule training job to Azure Machine Learning AKS cluster

I am using the preview feature of configuring an AKS cluster via Arc in Azure Machine Learning Studio and attempting to submit a training job; however, it gets stuck in the Queued state with the following message:
Queue Information : Job is waiting for available resources, that required for 1 instance with 1.00 vCPU(s), 4.00 GB memory and 0 GPU(s). The best-fit compute can only provide 1.90 vCPU(s), 4.46 GB memory and 0 GPU(s). Please continue to wait or change to a smaller instance type
I am not sure exactly what this is telling me, because (aside from the grammar) the job requirements are LESS than what is available, so why is it blocking? It also tells me to change to a smaller instance type, which I did, and it still gave me the same message.
Anyone come across this or know how to get past it?
See the limitations of creating and attaching an Azure Kubernetes Service cluster: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-kubernetes?tabs=python#limitations

NiFi memory management

I just want to understand how we should plan the capacity of a NiFi instance.
We have a NiFi instance with around 500 flows, so the total number of processors enabled on the NiFi canvas is around 4,000. We run 2-5 flows simultaneously, each taking no more than half an hour, i.e. we process data in MBs.
It was working fine until now, but we are seeing OutOfMemory errors very often. We increased the Xms and Xmx parameters from 4g to 8g, which has resolved the problem for now, but going forward we will have more flows and may face OutOfMemory issues again.
So, can anyone help with a capacity-planning matrix or any suggestions to avoid such issues before they happen? E.g., if we have 3,000 processors enabled, with or without active processing, then X GB of memory is required.
Any input on NiFi capacity planning would be appreciated.
Thanks in Advance.
OOM errors can occur due to specific memory-consuming processors. For example, SplitXML loads your whole record into memory, so it could load a 1 GiB file, for instance.
Each processor documents what resource considerations should be taken into account. All of the Apache processors (as far as I can tell) are documented in that regard, so you can rely on them.
In our example, by the way, SplitXML can be replaced with SplitRecord, which doesn't load the whole record into memory.
So even if you use 1,000 processors simultaneously, they might not consume as much memory as one processor that loads your whole FlowFile's content into memory.
Check which processors you are using and make sure you avoid ones like that (there are more that load the whole document into memory).

"Cannot allocate memory" when starting new Flink job

We are running Flink on a 3-VM cluster. Each VM has about 40 GB of RAM. Each day we stop some jobs and start new ones. After a few days, starting a new job is rejected with a "Cannot allocate memory" error:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000340000000, 12884901888, 0) failed; error='Cannot allocate memory' (errno=12)
Investigation shows that the task manager RAM keeps growing, to the point where it exceeds the allowed 40 GB, even though the jobs are cancelled.
I don't have access (yet) to the cluster, so I ran some tests on a standalone cluster on my laptop and monitored the task manager RAM:
With jvisualvm I can see everything working as intended: I load the job's memory, then clean it and wait (a few minutes) for the GC to fire up. The heap is released.
Whereas with top, memory usage is high and stays high.
At the moment we are restarting the cluster every morning to work around this memory issue, but we can't afford that anymore as we'll need jobs running 24/7.
I'm pretty sure it's not a Flink issue but can someone point me in the right direction about what we're doing wrong here?
In standalone mode, Flink may not release resources as you expect.
For example, resources held by static members of a class.
It is highly recommended to use YARN or Kubernetes as the runtime environment.

Very slow Riak writes and this error: {shutdown,max_concurrency}

On a 5-node Riak cluster, we have observed very slow writes - about 2 docs per second. Upon investigation, I noticed that some of the nodes were low on disk space. After making more space available and restarting the nodes, we now see this error (or something similar) on all of the nodes in console.log:
2015-02-20 16:16:29.694 [info] <0.161.0>#riak_core_handoff_manager:handle_info:282 An outbound handoff of partition riak_kv_vnode 182687704666362864775460604089535377456991567872 was terminated for reason: {shutdown,max_concurrency}
Currently, the cluster is not being written to or read from.
I would appreciate any help in getting the cluster back to good health.
I will add that we are writing documents to an index that is tied to a Solr index.
This is not critical production data, and I could technically wipe everything and start fresh, but I would like to properly diagnose and fix the issue so that I am prepared to handle it if it should happen in a production environment in the future.
Thanks!
