Unable to schedule training job to Azure Machine Learning AKS cluster - azure-machine-learning-studio

I am using the preview feature of configuring an AKS cluster via Arc in Azure Machine Learning Studio. When I submit a training job, it gets stuck in the Queued state with the following message:
Queue Information : Job is waiting for available resources, that required for 1 instance with 1.00 vCPU(s), 4.00 GB memory and 0 GPU(s). The best-fit compute can only provide 1.90 vCPU(s), 4.46 GB memory and 0 GPU(s). Please continue to wait or change to a smaller instance type
I am not sure exactly what this is telling me, because (grammar aside) the job's requirements are LESS than what is available, so why is it blocked? It also tells me to change to a smaller instance type, which I did, and I still got the same message.
Anyone come across this or know how to get past it?

Create and attach an Azure Kubernetes Service cluster limitations: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-kubernetes?tabs=python#limitations

Related

"Cannot allocate memory" when starting new Flink job

We are running Flink on a 3-VM cluster. Each VM has about 40 GB of RAM. Each day we stop some jobs and start new ones. After some days, starting a new job is rejected with a "Cannot allocate memory" error:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000340000000, 12884901888, 0) failed; error='Cannot allocate memory' (errno=12)
Investigation shows that the TaskManager's RAM keeps growing, to the point where it exceeds the available 40 GB, even though the jobs are cancelled.
I don't have access (yet) to the cluster, so I ran some tests on a standalone cluster on my laptop and monitored the TaskManager's RAM:
With jvisualvm I can see everything working as intended: I load the job's memory, then clean it and wait (a few minutes) for the GC to fire. The heap is released.
With top, however, memory is, and stays, high.
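The gap between the heap view and top can be reproduced in isolation: the JVM rarely returns committed heap pages to the OS, so top keeps reporting the high-water mark even after the heap has been collected. A minimal sketch (behaviour depends on the collector and JVM flags, and `System.gc()` is only advisory):

```java
public class HeapVsRss {
    // Allocate ~64 MB, drop the reference, request a GC, and compare
    // used vs committed heap. "Committed" is roughly what the OS (and
    // top) sees as part of the process RSS.
    public static long[] measure() {
        Runtime rt = Runtime.getRuntime();
        byte[][] chunks = new byte[64][];
        for (int i = 0; i < 64; i++) {
            chunks[i] = new byte[1 << 20]; // 1 MB each
        }
        long committedAtPeak = rt.totalMemory();
        chunks = null;   // make the arrays collectable
        System.gc();     // advisory: ask for a collection
        long usedAfterGc = rt.totalMemory() - rt.freeMemory();
        long committedAfterGc = rt.totalMemory();
        return new long[] { committedAtPeak, usedAfterGc, committedAfterGc };
    }

    public static void main(String[] args) {
        long[] m = measure();
        System.out.printf("committed at peak:  %d MB%n", m[0] >> 20);
        System.out.printf("used after GC:      %d MB%n", m[1] >> 20);
        System.out.printf("committed after GC: %d MB%n", m[2] >> 20);
    }
}
```

With default settings the committed heap usually stays near its peak even though used heap drops, which is exactly the jvisualvm-vs-top discrepancy described above.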
At the moment we are restarting the cluster every morning to work around this memory issue, but we can't afford that any more, as we'll need jobs running 24/7.
I'm pretty sure it's not a Flink issue, but can someone point me in the right direction as to what we're doing wrong here?
In standalone mode, Flink may not release resources as you would expect.
For example, resources held by a static member of a class will survive job cancellation, because the TaskManager JVM is shared across jobs.
It is highly recommended to use YARN or K8s as the runtime environment.
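The static-member leak mentioned above can be sketched as follows; `LeakyTask` is a hypothetical stand-in for user code, not an actual Flink API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for user code running inside a Flink job. In
// standalone mode the TaskManager JVM is shared across jobs, so anything
// reachable from a static field survives job cancellation.
public class LeakyTask {
    // Static cache: filled while the job runs, never cleared on cancel.
    static final List<byte[]> CACHE = new ArrayList<>();

    public static void process(byte[] record) {
        CACHE.add(record); // keeps growing across job submissions
    }
}
```

On YARN or Kubernetes each job typically gets fresh containers with their own JVMs, so state like this is reclaimed when the job's processes exit; in standalone mode it accumulates until the TaskManager is restarted.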

Corda Node getting JVM OutOfMemory Exception when trying to load data

Background:
We are trying to load data into our custom CorDapp (Corda 3.1) using JMeter.
Our CorDapp is distributed across six nodes (three parties, two notaries and one oracle).
The flow executed to load data has very minimal business logic, has three participants, and requires two parties to sign the transaction.
Below is the environment, configuration and test details:
Server: Ubuntu 16.04
RAM: 8 GB
Memory allocation to Corda.jar: 4 GB
Memory allocation to Corda-webserver.jar: 1 GB
JMeter configuration: Threads (users): 20 (1 transaction per second per thread)
Result:
Node B crashed after approximately 21,000 successful transactions (in roughly 3 hours 30 minutes) with "java.lang.OutOfMemoryError: Java heap space". Some time later, the other nodes crashed due to continuous handshake errors with Node B.
We analyzed a heap dump using Eclipse MAT and found that more than 21,000 instances of Hibernate's SessionFactoryImpl had been created, occupying more than 85% of the memory on Node B.
We need to understand why the Corda network is creating so many of these objects and keeping them in memory.
We are continuing our investigation, as we are not 100% sure this is entirely a Corda bug.
A solution to this problem is critical for us to be able to continue further tests.
Note - We have more details about our investigation but we are unable to attach them here but can send over email.
If you're developing in Java, it is likely that the issue you're encountering has already been fixed by https://r3-cev.atlassian.net/browse/CORDA-1411
The fix is not available in Corda 3.1 yet, but the JIRA ticket describes a workaround: override equals and hashCode on any subclasses of MappedSchema that you have defined. This should fix the issue you're observing.
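The workaround can be sketched as follows. `MySchemaV1` and the stub base class are hypothetical; in a real CorDapp you would extend Corda's `net.corda.core.schemas.MappedSchema` instead, but the equals/hashCode pattern is the same: make any two instances of the same schema class compare equal, so Hibernate keys one SessionFactory per schema rather than one per schema instance:

```java
// Minimal stand-in for net.corda.core.schemas.MappedSchema, just enough
// to show the override; a real CorDapp extends the actual Corda class.
abstract class MappedSchemaStub {
    private final int version;
    protected MappedSchemaStub(int version) { this.version = version; }
    public int getVersion() { return version; }
}

public class MySchemaV1 extends MappedSchemaStub {
    public MySchemaV1() { super(1); }

    // Two MySchemaV1 instances describe the same schema, so treat them
    // as equal; otherwise each distinct (unequal) schema instance can
    // key a fresh SessionFactoryImpl in the cache.
    @Override
    public boolean equals(Object o) {
        return o != null && getClass() == o.getClass();
    }

    @Override
    public int hashCode() {
        return getClass().hashCode();
    }
}
```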

IIS holding up requests in queue instead of processing those

I'm executing a load test against an application hosted in Azure. It's a cloud service with 3 instances behind an internal load balancer (hash-based load-balancing mode).
When I run the load test, IIS queues requests even though the requests/sec and total current requests counters are quite low. I'm not sure what the problem could be.
Any suggestions?
Adding a few screenshots of performance counters that might help.
Edit 1: per request from Rohit Rajan:
The cloud service has 2 instances (i.e. 2 VMs), each with 14 GB of RAM and 8 cores.
I'm executing a step load pattern, starting at 100 users and adding 100-150 users every 5 minutes over 4-5 hours, until the load reaches 10,000 VUs.
Calls to external systems are written asynchronously; database calls are synchronous.
There is no straightforward answer to your question; one way forward is additional investigation.
Based on your description, there seems to be a bottleneck within the application that is causing the requests to queue up.
To investigate this, collect a memory dump when you see the requests queuing up, then use DebugDiag to run a hang analysis on it.
There are several ways to gather the memory dump.
Task Manager
Procdump.exe
Debug Diagnostics
Process Explorer
Once you have the memory dump, install DebugDiag and run the analysis on it. It will generate a report that can help you get started.
Debug Diagnostics download: https://www.microsoft.com/en-us/download/details.aspx?id=49924

Warning: Google Cloud Compute Engine instance over-utilised

I recently installed a Bitnami WordPress Network stack on Google Cloud Compute Engine.
I keep getting a warning saying that the instance is over-utilised; however, when I view the CPU and disk usage statistics, I cannot see how this is possible. Both are usually very low, spiking only when I am administering websites (i.e. importing large files, backups, etc.).
Is this just a marketing ploy to get me to upgrade my instance?
And what happens when we over-utilise anyway? What are the symptoms? My WordPress network appears to be functioning flawlessly.
Please see images of my disk and CPU usage over the last 7 days.
[CPU utilisation statistics 7 days][1]
[disk operations 7 days][2]
[Network Packets statistics 7 days][3]
[1]: https://i.stack.imgur.com/iZa0L.png
[2]: https://i.stack.imgur.com/lUOno.png
[3]: https://i.stack.imgur.com/SnbHq.jpg
You need to install the Monitoring Agent in order to get accurate recommendations.
If the monitoring agent is installed and running on a VM instance, the CPU and memory metrics collected by the agent are automatically used to compute sizing recommendations. The agent metrics provided by the monitoring agent give better insights into resource utilization of the instance than the default Compute Engine metrics. This allows the recommendation engine to estimate resource requirements better and make more precise recommendations.
Read: https://cloud.google.com/compute/docs/instances/apply-sizing-recommendations-for-instances?hl=en_GB#using_the_monitoring_agent_for_more_precise_recommendations
How to install the Monitoring Agent to get accurate sizing recommendations:
https://cloud.google.com/monitoring/agent/install-agent

Very slow Riak writes and this error: {shutdown,max_concurrency}

On a 5-node Riak cluster, we have observed very slow writes: about 2 docs per second. Upon investigation, I noticed that some of the nodes were low on disk space. After making more space available and restarting the nodes, we are seeing this error (or something similar) on all of the nodes in console.log:
2015-02-20 16:16:29.694 [info] <0.161.0>#riak_core_handoff_manager:handle_info:282 An outbound handoff of partition riak_kv_vnode 182687704666362864775460604089535377456991567872 was terminated for reason: {shutdown,max_concurrency}
Currently, the cluster is not being written to or read from.
I would appreciate any help in getting the cluster back to good health.
I will add that we are writing documents to an index that is tied to a Solr index.
This is not critical production data, and I could technically wipe everything and start fresh, but I would like to diagnose and fix the issue properly, so that I am prepared to handle it if it happens in a production environment in the future.
Thanks!
