From the Pro ASP.NET Core MVC 2 book (page 417):
The ASP.NET Debugging Levels
...
Critical - This level is used for messages that describe catastrophic failures.
Error - This level is used for messages that describe errors that interrupt the application....
What is the difference between catastrophic failures and interrupting?
The official Microsoft documentation explains it a little more clearly when discussing log levels:
Error = 4
For errors and exceptions that cannot be handled. These messages indicate a failure in the current activity or operation (such as the current HTTP request), not an application-wide failure. Example log message: Cannot insert record due to duplicate key violation.
Critical = 5
For failures that require immediate attention. Examples: data loss scenarios, out of disk space.
See https://learn.microsoft.com/en-us/aspnet/core/fundamentals/logging/?tabs=aspnetcore2x (under the "Log Level" section).
In other words, level 4 "error" is used for something that fails the application's current activity but probably doesn't prevent it from continuing to serve other requests or perform other operations. Most exceptions fall into this category.
On the other hand, a level 5 "critical" error is used for something that is likely to have a longer-term impact, potentially making the application entirely unusable until the problem is resolved.
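To make the distinction concrete, here is a minimal sketch in Python rather than ASP.NET Core. Python's standard logging module also has ERROR and CRITICAL levels, which are used here only as stand-ins for the two ASP.NET Core levels. The handle_request function, the DuplicateKeyError type and the choice of OSError to represent "out of disk space" are all assumptions made for the illustration.

```python
import logging

logger = logging.getLogger("orders")


class DuplicateKeyError(Exception):
    """Hypothetical error: one insert failed, the application itself is fine."""


def handle_request(insert_record):
    """Run one unit of work and log any failure at the level matching its scope.

    Returns the logging level that was used, or None on success.
    """
    try:
        insert_record()
        return None
    except DuplicateKeyError as exc:
        # Only the current operation failed; other requests can still be served.
        logger.error("Cannot insert record: %s", exc)
        return logging.ERROR
    except OSError as exc:
        # Assumed here to mean something like "out of disk space": every
        # later request is at risk until someone intervenes.
        logger.critical("Storage failure, immediate attention needed: %s", exc)
        return logging.CRITICAL
```

The point of the sketch is only the decision rule: the scope of the damage (one operation versus the whole application) picks the level, not the exception type as such.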
I have noticed that we often face a transient issue in which anything running on an ADX cluster abruptly fails with the following error:
ADX async command has completed with a 'Abandoned' state. Status: 'Admin node has changed'
This is mostly seen with commands that run for a long time, presumably because the probability that the admin node changes during the command's execution increases.
Is this standard behavior of an ADX cluster that we simply have to take into account? Is there any guidance on how often this happens, or any hint about which circumstances cause the admin node to change? And when the admin node does change, is there anything we can do to avoid the command failures it causes?
An admin node may change occasionally (e.g. about once a week), but it is not expected to occur frequently.
If it is happening frequently on your cluster, it may indicate something bad in the usage pattern that is overloading the admin node, poor choice of SKU (not enough CPU/RAM to handle the workload), or an issue with the service or underlying platform.
If you're unsure of which it is - you can consider opening a support ticket.
As a side note, commands that are long-running (e.g. ~5-10 mins, or longer) are discouraged. You should follow the notes in the documentation recommending splitting a single command into multiple (shorter/lighter) ones, each handling a subset of the work to be done.
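One way to split the work is by time window, assuming the data has a timestamp to partition on; that choice, and the function name split_time_range, are assumptions of this sketch and not something the documentation prescribes. The code only computes the windows and does not call any ADX API.

```python
from datetime import datetime, timedelta


def split_time_range(start, end, step):
    """Yield consecutive (window_start, window_end) pairs that cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


# Each window would be submitted as its own, shorter command.
windows = list(
    split_time_range(datetime(2024, 1, 1), datetime(2024, 1, 2), timedelta(hours=6))
)
```

If one of the smaller commands is abandoned because the admin node changed, only that window has to be retried rather than the whole job.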
I am using an Axon Event Tracking processor. Sometimes events take longer than 10 seconds to process.
This seems to cause the message to be processed again and this appears in the log "Releasing claim of token X/0 failed. It was owned by another node."
If I up the number of segments it does not log this BUT the event is still processed twice so I think this might be misleading. (I think I was mistaken about this)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval. None of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making an HTTP request to an external service.
I'm using Axon 4.1.2 with all default configuration when using with Spring auto configuration. I cannot see the Releasing claim on token and preparing for retry in [timeout]s log.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have, for example, 2 segments and 6 applications it still doesn't reappear; however, I'm not sure how this is different from my original scenario of 1 segment and 2 applications?
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idempotency check before the HTTP call?
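An idempotency check of the kind mentioned here could look roughly like the following. It is a language-neutral sketch written in Python, not Axon code: the class name IdempotentHandler, the send_request callback and the use of an event id as the deduplication key are all assumptions.

```python
class IdempotentHandler:
    """Wraps a side-effecting call so each event id triggers it at most once."""

    def __init__(self, send_request):
        self._send_request = send_request
        # In-memory only for the sketch; multiple instances would need a shared,
        # durable store (for example a table with a unique key on the event id).
        self._processed = set()

    def handle(self, event_id, payload):
        """Return True if the call was made, False if the event was a duplicate."""
        if event_id in self._processed:
            return False  # duplicate delivery, skip the external call
        self._send_request(payload)
        # Recorded only after success, so a failed call can be retried.
        self._processed.add(event_id)
        return True
```

Note that with two application instances the check and the record would have to be atomic in the shared store; the in-memory set above does not protect against two nodes handling the same event at the same moment.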
The Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node. message can only occur in three scenarios:
You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
The main event processing loop has caught an Exception, making it retry with an exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not options 1 and 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handle the same event. Hence why you see the event handler being invoked twice.
Obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have too little to go on. Let us figure this out #Dan!
Update
Thanks for updating your question #dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes too long. If it takes too long, the token will be claimed by another process. This other process in this case is the thread of the TEP of your other application instance. The "claim timeout" is defaulted to 10 seconds, which also corresponds with the long running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And, I think this would be required on your end, given the fact you have a long running operation.
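The claim stealing described above, and why a longer claim timeout helps, can be illustrated with a toy model. This is a deliberately simplified sketch in Python and not Axon's implementation: ToyTokenStore, try_claim and the use of plain numbers of seconds for time are all assumptions.

```python
class ToyTokenStore:
    """Toy model of a token claim with a timeout; not Axon's implementation."""

    def __init__(self, claim_timeout):
        self.claim_timeout = claim_timeout
        self.owner = None
        self.claimed_at = None

    def try_claim(self, node, now):
        """Claim or extend the token; return True if `node` owns it afterwards."""
        expired = self.claimed_at is None or now - self.claimed_at >= self.claim_timeout
        if self.owner == node or expired:
            self.owner = node
            self.claimed_at = now
            return True
        return False


store = ToyTokenStore(claim_timeout=10)
store.try_claim("A", now=0)            # A claims the token and starts a 12 s handler
stolen = store.try_claim("B", now=12)  # claim has expired, so B takes over

patient = ToyTokenStore(claim_timeout=30)
patient.try_claim("A", now=0)
blocked = not patient.try_claim("B", now=12)  # claim still valid, B has to wait
```

In the first case B starts from the token position A had not yet advanced, which is how the same event ends up handled twice; in the second case the timeout outlasts the slow handler.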
There are of course different ways of tackling this. Like, making sure the TEP is only run on a single instance (not really fault tolerant though), or offloading this long running operation to a scheduled task which is triggered by the event.
But, I think we've found the issue at least, so I'd suggest to tweak the claimTimeout and see if the problem persists.
Let us know if this resolves the problem on your end #dan!
Application Insights distinguishes multiple event types, some of which can potentially represent an error:
Traces have a verbosity level and those above warning or error represent error conditions
Requests have a status code and those equal to or above 400 represent error conditions
Exceptions are always errors.
Now you'd think I can easily filter the search view to show me only all those events representing an error condition, but I can't figure out how.
Do I really need to use the Analytics query stuff to do this?
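For reference, the three rules listed above amount to a single predicate. The sketch below works on plain dictionaries; the field names (type, status_code, severity) and the numeric severity scale, in which 2 is taken to mean Warning and higher numbers are more severe, are assumptions of this example and not the Application Insights schema.

```python
def is_error(item, min_trace_severity=2):
    """Return True if a telemetry item represents an error condition.

    `item` is a plain dict with a "type" key; this shape is an assumption of
    the sketch, not the Application Insights schema.
    """
    kind = item["type"]
    if kind == "exception":
        return True  # exceptions are always errors
    if kind == "request":
        return item["status_code"] >= 400
    if kind == "trace":
        return item["severity"] >= min_trace_severity
    return False
```

The threshold is a parameter because the question leaves open whether traces at warning level should already count.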
To look at errors within the Azure Portal, the best way to get the pre-filtered view you're looking for is to use the Failures blade rather than Search.
This view shows requests with status codes indicating failure, dependencies indicating failures, and exceptions, along with filtering tools curated for investigating error scenarios.
Traces aren't covered here, but will be available when you drill down from the Failures blade to a representative transaction.
When considering a service in NServiceBus at what point do you start questioning how many messages handled by a service is too much and start to break these into a new service?
Consider the following: I have a sales service which can currently be broken into a few distinct business components, these are sales order validation, sales order processing, purchase order validation and purchase order processing.
There are currently about 20 message handlers and 2 sagas used within this service. My concern is that during high volume traffic from my website this can cause an initial spike in the messages to jump into the hundreds. Considering that the messages need to be processed in the order they are taken off the queue, this can cause a delay for the last in the queue (depending on what processing each message does).
When separating concerns within a service into smaller business components I find this makes things a little easier. Sure, it's a logical separation, but it seems to provide a layer of clarity and understanding. To me it seems an easier option to do this than creating new services where in the end the more services I have the more maintenance I need to do.
Does anyone have any similar concerns to this?
I think you have actually answered your own question :)
As soon as the message volume reaches a point where the lag becomes an issue you could look to instance your endpoint. You do not necessarily need to reduce the number of handlers. You could simply install the service a number of times and have specific message types sent to the relevant endpoint by mapping.
So it becomes a matter of a simple instance installation and some config changes. So you can then either split messages on sending so that messages from a particular source end up on a particular endpoint (maybe priority) or on message type.
I happened to do the same thing on a previous project (not using NServiceBus though) where we needed document conversion messages coming from the UI to be processed ASAP. We simply installed the conversion service again with its own set of queues and changed the UI configuration to send the conversion messages to the new endpoint. The background conversion messages were still going to the previous endpoint. So here the source determined the separation.
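The two ways of splitting described above (by source or by message type) boil down to a lookup on the sending side. The following is a generic sketch in Python, not NServiceBus configuration; pick_endpoint and all queue, source and message names in the usage example are hypothetical.

```python
def pick_endpoint(message_type, source, type_routes, source_routes, default):
    """Choose a destination queue; a source-specific route wins over a type route."""
    if source in source_routes:
        return source_routes[source]
    return type_routes.get(message_type, default)


# Hypothetical mapping: UI traffic gets its own instance of the service.
source_routes = {"ui": "conversion.priority"}
type_routes = {"ValidateSalesOrder": "sales.validation"}
queue = pick_endpoint("ConvertDocument", "ui", type_routes, source_routes, "conversion")
```

Here a conversion message coming from the UI goes to the second installation, while background messages without a matching route fall through to the default endpoint.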
Having deployed a new build of an ASP.NET site in a production environment, I am logging dozens of data errors every second, almost always with the error "Cannot find table 0." We use datasets and frequently refer to Tables[0], and while I understand the defensive coding practice of checking the dataset for tables before accessing Tables[0], it's never been a problem in the past. A certain page will load fine one second, and then be missing one of its data-driven components the next. Just seeing if this rings a bell for anyone.
More detail: I used a different build server this time, and while I imagine the compiler settings are the same on both, I have a hard time thinking that there's a switch that makes 50% of my database calls come back with no tables. I also switched the project to VS 2008, but then reverted all of those changes when I switched back to VS 2005. I notice that the built assembly has a new MyLibrary.XmlSerializers.dll, where it didn't use to, but I also can't imagine that that's causing all the trouble. (It also doesn't fall down on calls to MyLibrary, or at least no more than any other time.)
Updated to add: I've discovered that the troublesome build is a "Release" build, where the working build was compiled as "Debug". Could that explain it?
Rolling back to the build before these changes fixed it. (Rebooting the SQL Server, the step we tried before that, did not.)
The trouble also seems to be load-based - this cruised through our integration and QA environments without a problem, and even our smoke test environment - the one that points to production data - is fine under light load.
Does this have the distinguishing characteristics of anything you might have seen in the past?
Bumping this old question because we have encountered the same issue and perhaps our solution would give more insight in what causes this.
Essentially this problem occurs in a production environment that is under very heavy load in a Windows service that uses multiple threads to process several jobs simultaneously (100 users use the same DB via ASP.NET web app and there are about 60 transactions/second on older hardware with SQL Server 2000).
No variables are shared: for each operation a connection is opened anew, a transaction is started, the operations are executed, the transaction is committed and the connection is closed.
Under heavy load sometimes one of the following exceptions occurs:
NullReferenceException: Object reference not set to an instance of an object.
at System.Data.SqlClient.SqlInternalConnectionTds.get_IsLockedForBulkCopy()
or
System.Data.SqlClient.SqlException: The server failed to resume the transaction. Desc:3400000178
or
New request is not allowed to start because it should come with valid transaction descriptor
or
This SqlTransaction has completed; it is no longer usable
It seems that a connection in the pool somehow becomes corrupted and remains associated with previously used transactions. Furthermore, if such a connection is retrieved from the pool, sqlAdapter.Fill(dataset) results in an empty dataset, causing "Cannot find table 0". Because our service would retry the operation (reading the job list) on failure and would always get the same corrupt connection from the pool, it would fail with this error until restarted.
We removed the issue by calling SqlConnection.ClearPool(connection) on exception to make sure this connection is discarded from the pool, and by restructuring the application so that fewer threads access the same resources simultaneously.
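The general idea (never hand a connection that has just failed back to the pool, so a retry gets a fresh one) can be sketched as follows. This is a toy model in Python, not ADO.NET: ToyPool and run_with_discard are hypothetical names, and unlike SqlConnection.ClearPool the sketch drops only the single failing connection.

```python
class ToyPool:
    """Toy connection pool; stands in for ADO.NET pooling, which it does not model."""

    def __init__(self, factory):
        self._factory = factory
        self._idle = []

    def acquire(self):
        """Reuse an idle connection if there is one, otherwise create a new one."""
        return self._idle.pop() if self._idle else self._factory()

    def release(self, conn):
        self._idle.append(conn)


def run_with_discard(pool, work, attempts=2):
    """Run work(conn); a connection that raised is never returned to the pool."""
    last_error = None
    for _ in range(attempts):
        conn = pool.acquire()
        try:
            result = work(conn)
        except Exception as exc:
            last_error = exc  # conn is dropped here instead of being released
            continue
        pool.release(conn)
        return result
    raise last_error
```

Without the discard step, a retry loop like the one described above keeps drawing the same bad connection, which matches the "fails until restarted" symptom.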
I have no clue what exactly caused this issue, so I am not sure we have really fixed it; maybe we just made it so rare that it has not occurred again yet.
I've fought precisely this error message before. The key is that an underlying data method is swallowing a timeout exception.
You're probably doing something like this:
var table = GetEmployeeDataSet().Tables[0];
GetEmployeeDataSet is swallowing an exception, probably a timeout exception, which is why it only happens sporadically - it happens under load. You need to do the following to fix it:
Modify the underlying code to not swallow the exception, but rather let it bubble up to the next level so you can identify it properly.
Identify the query(s) causing the problem, and then rewrite, reindex, denormalize or throw hardware at the problem. See this for more info: System.Data.SqlClient.SqlException: Timeout expired
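The difference between swallowing the exception and letting it bubble up can be shown in a few lines. This sketch uses Python rather than C#, with a list standing in for the DataSet's tables; QueryTimeout and both function names are hypothetical.

```python
class QueryTimeout(Exception):
    """Hypothetical stand-in for a database timeout."""


def get_tables_swallowing(run_query):
    """Anti-pattern: the timeout disappears and the caller gets an empty result."""
    try:
        return run_query()
    except QueryTimeout:
        return []


def get_tables(run_query):
    """No try/except here, so a timeout reaches the caller as a timeout."""
    return run_query()
```

With a query that times out, indexing get_tables_swallowing(query)[0] fails with an IndexError, the analogue of "Cannot find table 0", and the real cause is lost. get_tables(query) raises QueryTimeout instead, which points directly at the slow query.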
I've seen something similar. I believe our problem had to do with failed sessions being re-used (once the session object failed it went into a poor state and could not recover.) We fixed it by increasing the memory for the session pool and increasing the frequency of the web application recycling.
It also was "caused" by a new version that at first blush did not seem to have any change to cause such an effect. However, eventually it became clear that the logic of the program was opening and closing a lot more connections (maybe 20% more) than it used to. This small change pushed the limit of our prior configuration.
You might check the SQL Server logs for errors. Or, the Web server event log. It sounds like your connection pool could be out of open connections or your db could be out.
Which database calls changed between versions?
The error is obviously telling you one of your database calls isn't returning any data on occasion; I can't think of any cases where a code/assembly issue would cause it.
I have seen something like this when doing something with nHibernate Sessions in a non-thread-safe manner. That would explain why you only see it under load. Would need to see your code to guess at what isn't thread-safe though.