Troubleshooting regional availability failures in Application Insights

Today we experienced some failures from SG, BR and IE, all of which appear to be timeouts to our API, which is hosted in the North Europe data centre.
There were no application failures on our side, so we assume it was a transient network issue between these regions and our server (North Europe).
Are there any resources that can help in confirming/troubleshooting such issues?
EDIT:
When drilling down into the failed request (as suggested by yonisha):
So it must have been either server/network (Azure) or application-related?

Yes, you can see the failures of the availability tests:
Click on the failed test instance, above the chart you've provided (make sure to change the 'Time range' to at least 1 hour).
Click on the failed request, which will open another blade with the exception details:
Depending on the exception details, you can then determine whether it was a server/network issue (server was unavailable etc.), an application-level issue (you may have insufficient logging to capture those failures) or a client-side issue (client closed the connection etc.).

Related

Some requests on IIS hang for minutes and end in a lost connection

I have an awkward issue with IIS 10.0 on Windows Server 2016 with ASP.NET 4.5.2 and MVC 5.2.7.
At times, certain requests do not receive a response and run for minutes, maybe 10 or so, before ending in a lost connection (PR_CONNECT_RESET_ERROR in Firefox on Windows, NSURLDomainError in Firefox on iOS). These are mostly POST requests. While the issue occurs, GET requests still receive a swift response and a correct result. Normally, POST requests do not take long to process, typically less than 3 seconds.
Recycling the associated worker process will make the issue go away, for hours or days.
When I inspected the web server today while the issue was occurring, I saw little CPU usage (less than 10%), memory at 56%, and the worker process at a modest 615 MB. These requests appeared neither in the W3C log nor in my custom application logs.
I added the Web-Request-Monitor feature as described in "How do I see currently executing web request on IIS 8", but in doing so the worker process probably got recycled, as the issue is not currently occurring.
There are a reverse proxy and an access manager between the internet and my web server. I suppose they can have something to do with this issue, but it certainly is related to IIS, as recycling helps.
All of this is happening on an acceptance web server running a newer version of my application. I am not aware of any big changes to the application's architecture that could be involved. Also, there is very little traffic from other clients, if any at all.
What could be next steps to investigate this issue further?

Update
This issue was definitely caused by log4net. However, it was not related to the log4net.Internal.Debug setting. It was caused by two application domains accessing the same log file. This occasionally resulted in concurrency issues with accessing the log file. It appeared that log4net could not properly handle this and got stuck while writing to the log file.
This log file was configured with the RollingFileAppender option. Since we also used AdoNetAppender, we decided to remove file logging altogether.
Original
I have found a probable cause. I'll report the steps I took to investigate the issue.
I activated the Worker Processes feature in IIS.
When, after a couple of days of waiting, the issue started again, I found long running requests. They all had State ExecuteRequestHandler and Module Name ManagedPipelineHandler. They had Time Elapsed of hundreds of seconds.
I also activated Failed Request Tracing with a rule for long running requests with a Time Taken of 1 minute.
After a couple of days, I started to receive failed request reports. The failed requests all have a GENERAL_SET_RESPONSE_HEADER event as their last event.
I added additional debug logging events for each request. When debugging in my development environment, at one point, I started to see the hanging behaviour there, on one of the new logging statements(!). The application uses log4net.
I captured a stack trace:
log4net.dll!log4net.Appender.AppenderSkeleton.DoAppend(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Util.AppenderAttachedImpl.AppendLoopOnAppenders(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Repository.Hierarchy.Logger.CallAppenders(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Repository.Hierarchy.Logger.Log(System.Type callerStackBoundaryDeclaringType, log4net.Core.Level level, object message, System.Exception exception)
log4net.dll!log4net.Core.LogImpl.DebugFormat(string format, object arg0)
The DoAppend method uses lock(this), which may very well cause hangs.
I also found out that the config setting log4net.Internal.Debug was set to true, which I do not want under normal circumstances, and this may be related. I did not attempt to understand the log4net code, but I remember that logging initially did not work in the acceptance environment, so the setting may very well have been set to true then, causing the issue to start.
Another indication that log4net is involved: when the issue last occurred, I realized that standard-level logging only occurs in some POST requests. I found a POST request that does not log, and requests to it were handled normally while the other POST requests still hung.
For now, I have set log4net.Internal.Debug to false and will wait to see what happens.

The fact that an IIS recycle fixes the issue does not mean it is an IIS issue, because every ASP.NET application runs in the .NET runtime. It is only an IIS issue if the request is shown to hang in an IIS module.
So you may need to wait for the issue to happen again, then create a Failed Request Tracing rule on time taken. That will tell you whether the issue occurs in an IIS pipeline module or in the .NET runtime.
If all requests hang in the .NET runtime, you may have to capture a hang dump and do a deep analysis with WinDbg and the MEX extension. That will tell you what is happening there.

Why would a Publish transaction change from "Transport service unable to transport" to "Success"?

Sometimes when mass publishing there will temporarily be failed items with the error
Transport service unable to transport
Refreshing the Publish Queue a few times will then result in a success being returned. Why does this happen?

I'd assume that once a publish transaction reaches an error state it will be "beyond repair". But clearly that is not happening in your case. The fact that you reload the transactions in the Publish Queue window has no effect on what the publisher does, so it is most likely that the publisher and transport service are still working on the transaction.
My advice would be to switch the log level of the transport service to DEBUG and see if anything related to these transactions shows up in there.

As an addition to Frank's answer, check whether the transaction folder is empty or whether transport packages are still sitting in it.
It seems the transport packages cannot be unzipped.

Bogus URL access causing server to hang

When an unavailable URL is accessed, we internally raise an exception and email it to the support team. We do this to identify whether there is a hidden error in our web application. A couple of days ago there was suddenly a huge number of requests to unavailable URLs, which added load to the server and caused SMTP to queue a large number of exception emails. This brought IIS down completely, and none of the applications were accessible.
How can we prevent this? Is there another option, such as a firewall, to disallow continuous requests from the same IP? I have seen this behavior in Google. How can we achieve that?

I'd suggest caching notifications that have already been sent. Before your application sends an email, it can check whether this error has already been reported.
You can set the cache validity to, say, 1 minute, so you get at most one identical email per minute.
It is quite easy to implement in ASP.NET.
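To illustrate the idea, here is a minimal, language-neutral sketch in Python (not ASP.NET code). The class name NotificationThrottle, the should_send method and the error key are hypothetical names made up for this example. The same logic could be built on whatever cache your platform provides.

```python
import time


class NotificationThrottle:
    """Suppress repeat notifications for the same error within a time window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self._clock = clock      # injectable clock, so the window can be tested
        self._last_sent = {}     # error key -> time the last email was sent

    def should_send(self, error_key):
        """Return True if an email for error_key should be sent now."""
        now = self._clock()
        # Drop expired entries so the cache cannot grow without bound.
        expired = [key for key, sent_at in self._last_sent.items()
                   if now - sent_at >= self.window_seconds]
        for key in expired:
            del self._last_sent[key]
        if error_key in self._last_sent:
            return False         # already reported within the window
        self._last_sent[error_key] = now
        return True


if __name__ == "__main__":
    throttle = NotificationThrottle(window_seconds=60.0)
    for _ in range(3):
        # Only the first call within the window reports True.
        print(throttle.should_send("404 Not Found"))
```

Note that if the key is the full URL, a flood of distinct bogus URLs still produces one email each. Keying on the error type (for example the status code) is one way to keep the one-per-minute limit in that case.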

System.Net.WebException: The request was aborted

First off, let me clarify the platforms we are using. We have an ASP.NET 2.0 app calling a web service which was created and is hosted on webMethods (now SoftwareAG) Integration Server 7.1.2.
The issue we are experiencing appears to occur every 10-20 minutes under a moderate volume of attempts. The .NET app tries to call the web service and gets the "System.Net.WebException: The request was aborted: The request was canceled" error message. There are no errors logged on the Integration Server when this problem occurs.
Any help/suggestions would be much appreciated!

This seems like a nasty one... and little information.
I think you will have to analyze with other tools...
Can it be that the request is stopped somewhere along the way?
Maybe you can try and follow the request with wireshark?

Which logs have you checked on the Integration Server, and which log levels have you applied?
You could e.g. check if a HTTP connection could be established.

Service Unavailable - IIS

My problem is that sometimes the CPU usage on the web server goes to 100% (caused by w3wp.exe).
At that moment the website will become "service unavailable"
Question: Where can I check from the IIS/HTTPERR logs where the website became "service unavailable"?
Can I use Log Parser to identify at which time this is happening? If so, is there a query for it?
Thank You

You could create a user dump file for the process and use the debug diagnostics tool to analyze what happened. The tool is part of the IIS Diagnostics Toolkit (download and description here). It is located in the folder C:\Program Files\IIS Resources\DebugDiag.
This support article explains in detail how to do that:
How to use the Debug Diagnostics Tool to troubleshoot high CPU usage by a process in IIS

Dunno if this is any food for thought, but this is what we do:
When our page rendering goes over a certain acceptable threshold we mark the server as "busy" and all new sessions are denied with "Server busy" - that lets people with open sessions finish up, lightens the load, and frees up resources so that creation of new sessions can resume.
We do this by recording average task duration each minute, and checking if the average over the last five minutes exceeds the threshold - then set the Busy flag. The flag will be cleared on the next recalculation (which is a task scheduled for one-minute interval) when the 5-minute moving-average falls below threshold again.
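A rough sketch of that scheme in Python, for illustration only (it is not our production code). The names BusyMonitor, record and recalculate are hypothetical, and treating a minute with no tasks as an average of 0 is an assumption of this sketch.

```python
from collections import deque


class BusyMonitor:
    """Set a busy flag when the moving average of per-minute task durations is too high."""

    def __init__(self, threshold_seconds, window_minutes=5):
        self.threshold_seconds = threshold_seconds
        # Keeps only the most recent per-minute averages.
        self._minute_averages = deque(maxlen=window_minutes)
        self._current = []   # task durations recorded in the current minute
        self.busy = False

    def record(self, duration_seconds):
        """Record the duration of one finished task."""
        self._current.append(duration_seconds)

    def recalculate(self):
        """Call once per minute from a scheduled task; returns the new busy flag."""
        if self._current:
            minute_average = sum(self._current) / len(self._current)
        else:
            minute_average = 0.0   # assumption: an idle minute counts as 0
        self._minute_averages.append(minute_average)
        self._current = []
        moving_average = sum(self._minute_averages) / len(self._minute_averages)
        self.busy = moving_average > self.threshold_seconds
        return self.busy


if __name__ == "__main__":
    monitor = BusyMonitor(threshold_seconds=2.0)
    monitor.record(10.0)
    print(monitor.recalculate())   # new sessions would be refused while this is True
```

New-session handling would check the busy flag before creating a session. Existing sessions are not affected.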
