The Linux distribution is Red Hat. I'm monitoring Linux counters with the LoadRunner Controller's System Resources Graphs - Unix Resources. Monitoring works properly and graphs are plotted in real time, but after a few minutes the following errors appear:
Monitor name :UNIX Resources. Internal rpc error (error code:2).
Machine: 31.2.2.63. Hint: Check that RPC on this machine is up and running.
Check that rstat daemon on this machine is up and running
(use rpcinfo utility for this verification).
Details: RPC: RPC call failed.
RPC-TCP: recv()/recvfrom() failed.
RPC-TCP: Timeout reached. (entry point: Factory::CollectData).
[MsgId: MMSG-47197]
I logged on to the Linux server and found that rstatd was still running. After clearing the measurements in the Controller's Unix Resources graph and adding them again, monitoring started working again, but after a few minutes the same error occurred.
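For reference, the check suggested in the hint can be scripted; a minimal sketch, assuming the rpcinfo utility is installed on the machine running the check:
import subprocess

host = "31.2.2.63"  # the monitored Linux server
registered = subprocess.run(["rpcinfo", "-p", host], capture_output=True, text=True).stdout
print("rstatd registered with the portmapper" if "rstatd" in registered else "rstatd NOT registered")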
What might cause this error? Is it due to network traffic?
Consider using SiteScope, which has been the preferred monitoring foundation for collecting UNIX/Linux status since version 8.0 of LoadRunner. Every LoadRunner license since version 8 has come with a 500-point SiteScope license in the box for this purpose. More points are available upon request for test-exclusive use of the instance.
Synopsis
Ten computer vision models were deployed to a TensorFlow Serving server (TSS) running on Ubuntu 22.04.
The TSS is installed as a system service on a dedicated server with an MSI RTX 3060 12 GB GPU on board. The system configuration and the TSS service file are shown below.
Problem
Requests sent via the TensorFlow Serving gRPC API randomly get status code DEADLINE_EXCEEDED or UNAVAILABLE, sometimes on the first request but more often after a number (1...4) of successful requests or after a period of inactivity (1 hour or more).
No OOM or service crash dump occurs. GPU memory occupation is near 6 GB. The service logs show no indication of a problem and no warnings either (debug level 3).
Some experiments and results are described in detail below.
System
[OS] Ubuntu 22.04.1 LTS
[CPU] 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
[GPU] MSI GeForce RTX 3060 12G
[RAM] 16 GB
[SSD] NVMe 1 TB
[Tensorflow] Version 2.9.1
[CUDA] Version 11.7
[CUDNN] Version 8.6.0.163-1+cuda11.8
[TensorRT] Not used while building tensorflow-serving
TSS service file
[Unit]
Description=Tensorflow Serving Service
After=network-online.target
[Service]
User=root
Environment="PATH=/usr/local/bin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/cuda/bin:/usr/bin/model_servers"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="TF_FORCE_GPU_ALLOW_GROWTH=true"
ExecStart=/usr/bin/model_servers/tensorflow_model_server --port=8504 --model_config_file=/mnt/data/models/export/frman.conf
[Install]
WantedBy=multi-user.target
Hints
The TensorFlow Serving service initializes after the network on the host is available.
The service is configured to allocate GPU memory only when it is needed (environment variable TF_FORCE_GPU_ALLOW_GROWTH=true).
Hypotheses and actions
The problem is lost packets on the network
Requests to the TSS were monitored with Wireshark on the client side and with the GNOME System Monitor on the server side. No problem was detected.
The timeout value for a single request on the client side (a tensorflow_serving.apis.prediction_service_pb2_grpc.PredictionServiceStub object) was increased:
stub.Predict(request, timeout * len(images))
The gRPC channel is checked for readiness before data transmission begins.
Interceptors for gRPC requests were added. They retry the request with exponential backoff, but it still randomly returns status code DEADLINE_EXCEEDED or UNAVAILABLE:
import grpc

# RetryOnRpcErrorClientInterceptor, ExponentialBackoff and CompressionLevel are helper
# classes from our client code (not shown here); host and port are set elsewhere.
options_grpc = [
    ('grpc.max_send_message_length', 100 * 1024 * 1024),
    ('grpc.max_receive_message_length', 100 * 1024 * 1024),
    ('grpc.default_compression_algorithm', grpc.Compression.Gzip),
    ('grpc.default_compression_level', CompressionLevel.high),
]
interceptors = (
    RetryOnRpcErrorClientInterceptor(
        max_attempts=5,
        sleeping_policy=ExponentialBackoff(init_backoff_ms=1000, max_backoff_ms=32000, multiplier=2),
        status_for_retry=(grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNKNOWN, grpc.StatusCode.UNAVAILABLE),
    ),
)
channel = grpc.insecure_channel(f"{host}:{port}", options=options_grpc)
grpc.channel_ready_future(channel).result(timeout=5)  # wait up to 5 s for the channel to become ready
channel = grpc.intercept_channel(channel, *interceptors)
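For context, the intercepted channel is then wrapped in the prediction stub and called with a per-request deadline; a minimal sketch reusing the names already shown above (request, timeout, images):
from tensorflow_serving.apis import prediction_service_pb2_grpc

stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
# request is a PredictRequest built elsewhere; the second argument is the per-call deadline in seconds
response = stub.Predict(request, timeout * len(images))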
The problem is in the models themselves
It was noticed that the problems mostly arise with models based on the U2Net architecture. U2Net uses some custom operations, and it was assumed that the first request times out because loading these custom ops takes too long.
This was found in the TSS service log file. To resolve it we tried the following:
Add a warm-up for this kind of model at service startup, so that all custom network operations are loaded into memory before inference (see the sketch after this list).
Eliminate the custom operations in U2Net by converting the models to ONNX format and then to TensorFlow SavedModel format, so that TSS no longer needs to load custom ops at startup. Model warm-up was added as well.
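For illustration, a minimal sketch of generating such a warm-up record with TF Serving's SavedModel warmup mechanism; the model name, signature, input name, shape, and path below are placeholders, not the actual models from frman.conf:
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

# Placeholder path/names for illustration only; adjust to the real export layout.
warmup_path = "/mnt/data/models/export/u2net/1/assets.extra/tf_serving_warmup_requests"

request = predict_pb2.PredictRequest()
request.model_spec.name = "u2net"
request.model_spec.signature_name = "serving_default"
request.inputs["input_1"].CopyFrom(
    tf.make_tensor_proto(np.zeros((1, 320, 320, 3), dtype=np.float32))
)

# TF Serving replays every PredictionLog record in this file when the model version is loaded.
with tf.io.TFRecordWriter(warmup_path) as writer:
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request)
    )
    writer.write(log.SerializeToString())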
The problem is a lack of memory - another alarming message was noticed in the TSS service log:
tensorflow_model_server[62029]: 2022-12-20 14:10:13.900599: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:360] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
It looks like there is not enough memory for inference on the GPU. To address this we tried limiting the image batch size to 1 (one) image for the TSS and setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true.
Memory consumption did not increase after that, but the random timeout (DEADLINE_EXCEEDED) error did not disappear.
Conclusion - Issue NOT Resolved
Thus, the problem is still in place, especially when the TSS runs inference on segmentation models (such as U2Net).
The root cause of the problem has not been found.
The error is difficult to reproduce because of its random nature.
What else would be worth checking or configuring to resolve the issue?
Last night one of the websites (.NET 4.0 forms) hosted on my Windows Server 2008 R2 (IIS 7.5) server started to time out, throwing the following error for all connected users.
TYPE System.Web.HttpException
MESSAGE Request timed out.
DETAIL System.Web.HttpException (0x80004005): Request timed out.
The outage was confined to just one website within IIS, the others continued to work fine.
Unfortunately I was unable to identify why the website was timing out. Here are the steps I took:
First thing I did was look at the task manager which revealed normal CPU and memory usage. Network activity was also moderate.
I then opened IIS to look at the live connections under 'Worker Processes'. There were about 60 live connections, so it didn't look like anything DDoS related.
Checked database connectivity (hosted on a separate server), all fine!
I then reset the website in IIS. That didn't work.
I then tried a complete iisreset... still no luck :(
In the end (and under some duress) the only thing I could think to do to resolve this was to restart the server.
Restarting the server worked, but I am nervous not knowing why this happened in the first place. Can anyone recommend any checks that I failed to carry out? Is there an official checklist for working through these sorts of IIS problems? I have reviewed the IIS logs but don't see anything unusual in the run-up to the outage.
Any pointers or links to useful resources to help me understand and mitigate against this in future will be much appreciated.
EDIT
The only time I logged into the server that day was to add an additional web handler component (for remote deploy) to IIS Web Deploy. I'm doubtful this caused the outage, as the server worked for 6 hours afterwards.
Because iisreset didn't help and you had to restart the whole machine, I would suspect a global resource shortage, with the most-used (or most resource-consuming) website being impacted. It could be due to unavailable RAM, or to network connection congestion caused by malfunctioning calls (for example, a lot of CLOSE_WAIT sockets exhausting the connection pool; we've seen that in production because of a malfunctioning external service). It could also be a problem with one specific client that was disconnected by the machine restart, so the problem eventually disappeared.
I would start from:
Historical analysis
review the Event Viewer for any errors/warnings from that period of time,
although you have already looked at the IIS logs, I would do it once again with the help of Log Parser Lizard to produce statistics such as the number of requests per client, network bandwidth per client, average response time per client, and so on (see the sketch below for a plain-Python alternative).
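If Log Parser is not at hand, a rough minimal sketch in plain Python that computes per-client request counts and average time-taken from an IIS W3C log (the path is a placeholder and the field names assume the default W3C logging configuration):
from collections import defaultdict

counts = defaultdict(int)
total_ms = defaultdict(int)
fields = []
with open(r"C:\inetpub\logs\LogFiles\W3SVC1\u_ex_sample.log") as log_file:  # placeholder log file
    for line in log_file:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]                  # column names declared by IIS
        elif not line.startswith("#") and fields:
            row = dict(zip(fields, line.split()))
            client = row.get("c-ip", "unknown")
            counts[client] += 1
            total_ms[client] += int(row.get("time-taken", 0))

for client, hits in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
    print(client, hits, f"{total_ms[client] / hits:.0f} ms avg")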
Monitoring
continuously monitor Performance Counters:
\Processor(_Total)\% Processor Time,
\.NET CLR Exceptions(_Global_)\# of Exceps Thrown / sec,
\Memory\Available MBytes,
\Web Service(Default Web Site)\Current Connections (per each your site name),
\ASP.NET v4.0.30319\Request Wait Time,
\ASP.NET v4.0.30319\Requests Current,
\ASP.NET v4.0.30319\Requests Queued,
\Process(XXX)\Working Set,
\Process(XXX)\% Processor Time (XXX per each w3wp process),
\Network Interface(XXX)\Bytes total / sec
run the Performance Analysis of Logs (PAL) tool over the time of failure to make a very detailed analysis of the performance counter data,
run netstat -ano to analyze network traffic (or, even better, the TCPView tool); a quick way to tally connection states is sketched below
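For example, a minimal sketch (plain Python on Windows) that tallies TCP connection states from the netstat output, which makes a pile of CLOSE_WAIT sockets easy to spot:
import subprocess
from collections import Counter

output = subprocess.run(["netstat", "-ano"], capture_output=True, text=True).stdout
states = Counter(
    line.split()[3]                      # state column: ESTABLISHED, CLOSE_WAIT, TIME_WAIT, ...
    for line in output.splitlines()
    if line.strip().startswith("TCP") and len(line.split()) >= 5
)
print(states.most_common())              # many CLOSE_WAIT entries point at a leaking client or service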
If none of this leads you to a conclusion, create a Debug Diagnostics rule to produce a memory dump of the process for long-running requests and analyze it with WinDbg and the PSSCor extension for .NET debugging.
When I try to start an instance from a template, I get the following error messages:
2013-11-10 19:44:28,716 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-5:job-19 = [ d070b5ba-f342-4252-9137-4d2c1b19eca6 ]) No suitable hosts found under this Cluster: 2
2013-11-10 19:44:28,718 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-5:job-19 = [ d070b5ba-f342-4252-9137-4d2c1b19eca6 ]) Could not find suitable Deployment Destination for this VM under any clusters, returning.
2013-11-10 19:44:28,718 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-5:job-19 = [ d070b5ba-f342-4252-9137-4d2c1b19eca6 ]) Searching all possible resources under this Zone: 1
2013-11-10 19:44:28,718 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-5:job-19 = [ d070b5ba-f342-4252-9137-4d2c1b19eca6 ]) Listing clusters in order of aggregate capacity, that have (atleast one host with) enough CPU and RAM capacity under this Zone: 1
I feel confused because I already have a host in cluster 2.
Can anyone give me some suggestions? Any reply will be appreciated!
You need to have a closer look at the log file to understand why CloudStack is unable to place a VM on the host. The information will appear above the entries you have provided. There are many issues that can cause this problem.
E.g. this blog entry walks through a configuration problem with XenServer
Another common issue arises when using local storage: you need to create a new compute offering that uses local storage disks, because the default compute offerings do not support local storage (a sketch of creating such an offering via the API is shown below).
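As an illustration only, a minimal sketch of creating such a local-storage compute offering through the CloudStack API using the third-party cs Python client; the endpoint, keys, and sizing values are placeholders:
from cs import CloudStack  # third-party CloudStack API client (pip install cs)

api = CloudStack(
    endpoint="http://management-server:8080/client/api",  # placeholder management server URL
    key="ADMIN_API_KEY",                                   # requires root-admin credentials
    secret="ADMIN_SECRET_KEY",
)

# storagetype="local" is the part the default offerings are missing.
api.createServiceOffering(
    name="Local-Small",
    displaytext="1 vCPU, 512 MB RAM, local storage",
    cpunumber=1,
    cpuspeed=500,
    memory=512,
    storagetype="local",
)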
Updated: changed answer to take into account that your sample is from the log file.
Most probably you are running out of capacity, or CloudStack has added your host to the "avoid list" because of some other error. It does this when the management server encounters an error while deploying an instance; from then on, until the problem is resolved, the host and the cluster remain on the avoid list and are skipped in subsequent deployments.
You need to find out the exact reason by monitoring the management server logs. Log in to your management server and go to the folder /var/log/cloudstack/management/.
Now run the command tail -f management-server.log
This gives you a continuous output of the management server log, so you know exactly what is happening at the moment.
Now perform the operation in the UI (e.g. try to add an instance) and quickly watch the running log.
Abort the command when you find an exception in the log and examine the log statements just above the exception.
Also, as a standard practice, get into the habit of monitoring the management server logs and the agent logs (on the host: /var/log/cloudstack/agent/agent.log).
I have a w3wp.exe that is restarting on my IIS server (see specs below). Memory gradually climbs to ~3 GB, then the process randomly restarts itself about every 1-2 minutes.
Memory Usage:
The odd thing is that once this memory drop happens (it looks like a restart, though the app pool does not get recycled/restarted), GET requests are queued and then serviced as soon as the service warms up/starts up again, causing a delay in responses to our clients, who were initially reporting occasionally delayed response times.
I followed this link to capture a stack dump once the .exe restarts (private bytes go to ~0), but nothing gets logged (no .dmp file) by DebugDiag when the service restarts.
I see tons of warnings in my web server (IIS) log, but that's it:
A process serving application pool 'MyApplication' suffered a fatal
communication error with the Windows Process Activation Service. The
process id was '1732'. The data field contains the error number.
ASK: I'm not sure if this is a memory limitation, if caching is not playing well with my threads/tasks, if the cache is blowing up, if there is a watchdog service restarting my application, etc. Has anybody run across something similar with w3wp.exe restarting? It's hard to tell because DebugDiag is not giving me a dump when it restarts.
SPECS:
MVC4 Web API servicing GET requests (code is debug build with debug=true)
Uses MemoryCache with Model and Business Objects, with cache eviction set to 2 hrs; uses a Task (TPS) for each new request.
Database: SQL Server 2008R2
Web Servers: Windows Server 2008R2 Enterprise SP1 (64bit, 64G RAM)
IIS 7.5
One application pool...no other LOB applications running on this server
Your first step is to reproduce the problem in your test environment. Set up some kind of load-generation app (you can write one yourself pretty easily; see the sketch below) and get the same problem happening. Then turn off debug in web.config and see if that fixes the issue. Then change it to a release build and test again.
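For example, a minimal load-generation sketch in plain Python; the URL, thread count, and request count are placeholders to adapt to your API:
import concurrent.futures
import time
import urllib.request

URL = "http://your-test-server/api/values"  # placeholder GET endpoint

def hit(_):
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=30) as resp:
            return resp.status, time.time() - start
    except Exception as exc:                # report timeouts/errors instead of crashing
        return type(exc).__name__, time.time() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    for status, elapsed in pool.map(hit, range(1000)):
        print(status, f"{elapsed:.2f}s")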
I've never used MemoryCache; try reducing the cache eviction time, or just turn it off, and see if that fixes the issue. Good luck :)
I'm a newbie at network monitoring. I'm using the Pandora FMS 4.0.2 free version. I added about 1,167 agents and 5,831 remote monitors, and the unknown-agent and unknown-monitor levels are high. The number of unknown monitors/agents increases and decreases, but it never reaches 0. I checked a few unknown monitors randomly and pinged their IP addresses from a terminal; the results show they are alive, but Pandora FMS reports them in an unknown state. I checked them again after about 6 hours, but the network lag is still high. I need help. (I use Ubuntu Server.)
The behavior you describe could be due to two reasons:
A lack of resources on the server. Normally a Pandora server can monitor 2,000 devices, but it depends on the resources of the server that hosts Pandora. You can check the minimum requirements here.
It could be a bug :-); in the 4.x version a bug related to network monitoring was detected. This bug causes some random failures when monitoring using ping.
I would update your installation to the new 5.0 version, and if the unknown modules persist, you can check for a lack of resources by disabling some agents. You can also check some tips for configuring Pandora for large environments here.
Hope it helps.