"Cannot allocate memory" when starting new Flink job - out-of-memory

We are running Flink on a 3 VM cluster. Each VM has about 40 Go of RAM. Each day we stop some jobs and start new ones. After some days, starting a new job is rejected with a "Cannot allocate memory" error :
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000340000000, 12884901888, 0) failed; error='Cannot allocate memory' (errno=12)
Investigations show that the task manager RAM is ever growing, to the point it exceeds the allowed 40 Go, although the jobs are canceled.
I don't have access (yet) to the cluster so I tried some tests on a standalone cluster on my laptop and monitored the task manager RAM:
With jvisualvm I can see everything working as intended. I load the job memory, then clean it and wait (a few minutes) for the GB to fire up. The heap is released.
Whereas with top, memory is - and stay - high.
At the moment we are restarting the cluster every morning to account for this memory issue, but we can't afford it anymore as we'll need jobs running 24/7.
I'm pretty sure it's not a Flink issue but can someone point me in the right direction about what we're doing wrong here?

On standalone mode, Flink may not release resources as you wished.
For example, resources holden by static member in an instance.
It is highly recommended using YARN or K8s as runtime env.

Related

MariaDB has stopped responding - [ERROR] mysqld got signal 6

MariaDB service was stopped responding all of a sudden. It was running for more than 5 months continuously without any issues. When we check the MariaDB service status at the time of the incident, it showed as active (running) ( service mariadb status ). But we could not log into the MariaDB server, each logging attempt was just hanged without any response. All our web applications were also failed to communicate with the MariaDB service. Also, we checked the max_used_connections, and it was below the maximum value.
When we going through the logs, we saw the below error (this had been triggered at the time of the incident).
210623 2:00:19 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.34-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=139
max_threads=752
thread_count=72
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1621655 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4c008501e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4c458a7d30 thread_stack 0x49000
2021-06-23 2:04:20 139966788486912 [Warning] InnoDB: A long semaphore wait:
--Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x55e1838d5ab0 created in file btr0sea.cc line 191
a writer (thread id 139966610978560) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.cc line 1145
Last time write locked in file btr0sea.cc line 1218
We could not even stop the MariaDB service using general stopping commands ( service MariaDB stop). But we were able to forcefully kill the MariaDB process and then we could get the MariaDB service back online.
What could be the reason for this failure. If you have already faced similar issues please share your experience, what actions you got to prevent such failures (in the future). Your feedback is much much appreciated.
Our Environment Details are as follows
Operating system: Red Hat Enterprise Linux 7
Mariadb version: 10.2.34-MariaDB-log MariaDB Server
I also face this issue on an aws instance (c5a.4xlarge) hosting my database.
Server version: 10.5.11-MariaDB-1:10.5.11+maria~focal
It happened already 3 times occasionnaly. Like you, no possibility to stop the service but reboot the machine to get it working again.
Logs at restart suggest some tables crashed and should be repaired.

Corda Node getting JVM OutOfMemory Exception when trying to load data

Background:
We are trying to load data in our Custom CorDapp(Corda-3.1) using jmeter.
Our CorDapp is distributed across Six nodes(Three Parties, Two Notaries and One Oracle).
The Flow being executed in order to load data is having very minimal business logic, has three participants and requires two parties to sign the transaction.
Below is the environment, configuration and test details:
Server: Ubuntu 16.04
Ram: 8GB
Memory allocation to Corda.jar: 4GB
Memory allocation to Corda-webserver.jar : 1GB
JMeter Configuration- Threads(Users): 20 (1 transaction per second per thread)
Result:
Node B crashed after approx 21000 successful transactions(in approximately 3 hours and 30 mins) with "java.lang.OutOfMemoryError: Java heap space". After some time other nodes crashed due to continuous "handshake error" with Node B.
We analyzed heap dump using Eclipse MAT and found out that more than 21000 instances of hibernate SessionFactoryImpl were created which occupied more than 85% of the memory on Node B.
We need to understand why Corda network is creating so many objects and persisting them in memory.
We are continuing our investigation as we are not 100% sure if this is entirely Corda bug.
Solution to the problem is critical in our pursuit to continue further tests.
Note - We have more details about our investigation but we are unable to attach them here but can send over email.
If you're developing in Java, it is likely that the issue you're encountering has already been fixed by https://r3-cev.atlassian.net/browse/CORDA-1411
The fix is not available in Corda 3.1 yet, but the JIRA ticket provides a workaround. You need to override equals and hashCode on any subclasses of MappedSchema that you've defined. This should fix the issue you're observing.

ECS will not launch instance, "unable to place a task because the resources could not be found."

I cannot figure out why my ecs service will not launch, and keep being given the error "service unable to place a task because the resources could not be found".
In my task definition, I have 500 cpu units dedicated and 250 memory, for just a very small sample node app that's just serving up my static assets.
I am launching my service with 1 task and no ELB.
my guess is that your cpu units is too high. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_environment
it's a harder metric to guess if you haven't really measured it much on your app.
anyway, i'm hitting a similar issue, so i'm right there in the same boat, but i'd try erasing the cpu since it is optional and see if that resolves it.

Need to control AOS memory consumption

Here is my problem. I am working on a batch process that spawns multiple tasks. Each task is basically doing some journal postings. The tasks are run in parallel. Now the journal is a counting journal with close to 10k lines. This process runs for hours as there are around hundred journals to be posted. The process runs fine on physical dev boxes, AOS and SQL on same box. But on a virtual server, its behavior is different. AOS starts consuming all the memory while the lines are getting added and at one point, memory hits 100% and AOS throws out of memory exception and dies, other times process just hangs and waits for memory to be released, which takes a long time. The journal posting is standard AX process and is not customised. The AX environment is 2012 R1 and latest kernel hotfixes are applied (KB2962510). I explored this property called MaxMemLoad that allows you to restrict the memory an AOS can consume on a server, but did not help at all.
The AX environment is composed of three AOSs in a cluster.
How can i restrict this crazy memory consumption?
EDIT:
Thanks to Matej i made some progress. SQL Server version was 2008 R2 SP1 and I applied the latest SP3. Interestingly out of three AOSs in cluster two now have much better memory graph, less than 45%. But the third one is still having weird memory usage. All three AOSs are same versions of AX, similar system configs (windows 2008 r2, 24 GB RAM, 4 cores). I had also applied the latest kernel hotfix on all AOSs. At the moment I am doing a full CIL on this particular server and run the batch again if that helps. I am attaching three graphs, generated using performance monitor, for CPU and memory, as you can see the memory on Server 01 is very erratic, not releasing memory on time, the other two are more stable. Any ideas?

Machine <IP_address> has been started with not enough memory

I am using Cloudify 2.7 with OpenStack Icehouse.
I developed a tomcat recipe and deployed it. In the orchestrator log of the cloudify console, I read the following WARNING:
2015-06-04 11:05:01,706 ESM INFO [org.openspaces.grid.gsm.strategy.ScaleStrategyProgressEventState] - [tommy.tomcat] machines SLA enforcement is in progress.; Caused by: org.openspaces.grid.gsm.machines.exceptions.ExpectedMachineWithMoreMemoryException: Machines SLA Enforcement is in progress: Expected machine with more memory. Machine <Public_IP>/<Public_IP> has been started with not enough memory. Actual total memory is 995MB. Which is less than (reserved + container) = (0MB+3800MB) = 3800MB
The Flavor of the VM is: 4GB RAM, 2vCPU, 20GB Disk
Into the cloud driver I commented the following line:
//reservedMemoryCapacityPerMachineInMB 1024
and configured the compute section related to the flavor as following:
computeTemplate
{
imageId <imageID>
machineMemoryMB 3900
hardwareId <hardwareId>
...
}
Can someone help me to pointing out the error?
Thanks.
The error message states that the actual available memory is only 995MB, which is considerably less than the expected 4GB. To clarify that:
do you run multiple services on the same machine?
maybe the VM really has less memory than expected. please run 'cat /proc/meminfo' on the started VM to verify the exact memory it has
In principle, you should not comment out any setting of reserved memory because Cloudify must take that into account - this setting is supposed to represent the memory used by the OS and other processes. additionally, the orchestrator (ESM) takes into account ~100 MB for cloudify to run freely.
So, please update machineMemoryMB to the value calculated this way:
(the number returned by 'cat /proc/meminfo') - 1024 - 100

Resources