commands failures caused by Admin node change - azure-data-explorer

I have noticed that many times we face transient issue that anything that is running on an ADX cluster abruptly fails due to the following error:-
ADX async command has completed with a 'Abandoned' state. Status: 'Admin node has changed'
This is mostly noticed in case of commands that run for long time , obviously because the probability that admin node will change during the lifespan of the command execution increases.
Is this standard behavior of an ADX cluster and that we have to take into account the fact that this may happen now and then. Is there any guidance on frequency of this happening or any hint about which circumstances cause this admin node to be changed. As for when the admin node actually changes , is there anything we can do to avoid command failures caused by it ?

An admin node may change occasionally (e.g. once in a ~week), but it is not expected to occur frequently.
If it is happening frequently on your cluster, it may indicate something bad in the usage pattern that is overloading the admin node, poor choice of SKU (not enough CPU/RAM to handle the workload), or an issue with the service or underlying platform.
If you're unsure of which it is - you can consider opening a support ticket.
As a side note, commands that are long-running (e.g. ~5-10 mins, or longer) are discouraged. You should follow the notes in the documentation recommending splitting a single command into multiple (shorter/lighter) ones, each handling a subset of the work to be done.

Related

Axon Event Processing Timeout

I am using an Axon Event Tracking processor. Sometimes events take longer that 10 seconds to process.
This seems to cause the message to be processed again and this appears in the log "Releasing claim of token X/0 failed. It was owned by another node."
If I up the number of segments it does not log this BUT the event is still processed twice so I think this might be misleading. (I think I was mistaken about this)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval. None of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making a HTTP request to an external service.
I'm using axon 4.1.2 with all default configuration when using with Spring auto configuration. I cannot see the Releasing claim on token and preparing for retry in [timeout]s log.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have for example 2 segments and 6 applications it still doesn't reappear, however I'm not sure how this is different to my original scenario of 1 segment and 2 application?
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idem-potency check before the HTTP call?
The Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node. message can only occur in three scenarios:
You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
The main event processing loop has caught an Exception, making it retry with a exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not options 1 and 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handled the same event. Hence why you see the event handler being invoked twice.
Obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have to little to go on. Let us figure this out #Dan!
Update
Thanks for updating your question #dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes to long. If it takes to long, the token will be claimed by another process. This other process in this case is the thread of the TEP of your other application instance. The "claim timeout" is defaulted to 10 seconds, which also corresponds with the long running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And, I think this would be required on your end, giving the fact you have a long running operation.
There are of course different ways of tackling this. Like, making sure the TEP is only ran on a single instance (not really fault tolerant though), or offloading this long running operation to a schedule task which is triggered by the event.
But, I think we've found the issue at least, so I'd suggest to tweak the claimTimeout and see if the problem persists.
Let us know if this resolves the problem on your end #dan!

Known reasons why sqlite3_open_v2 can take over 60s on windows?

I will start with TL;DR version as this may be enough for some of you:
We are trying to investigate an issue that we see in diagnostic data of our C++ product.
The issue was pinpointed to be caused by timeout on sqlite3_open_v2 which supposedly takes over 60s to complete (we only give it 60s).
We tried multiple different configurations, but never were able to reproduce even 5s delay on this call.
So the question is if maybe there are some known scenarios in which sqlite3_open_v2 can take that long (on windows)?
Now to the details:
We are using version 3.10.2 of SQLite. We went through changelogs from this version till now and nothing we've found in the bugfixes section seems to suggest that there was some issue that was addressed in consecutive SQLite releases and may have caused our problem.
The issue we see affects around 0.1% unique user across all supported versions of windows (Win 7, Win 8, Win 10). There are no manual user complaints/reports about that - this can suggest that a problem happens in the context where something serious enough is happening with the user machine/system that he doesn't expect anything to work. So something that indicates system-wide failure is a valid possibility as long as it can possibly happen for 0.1% of random windows users.
There are no data indicating that the same issue ever occurred on Mac which is also supported platform with large enough sample of diagnostic data.
We are using Poco (https://github.com/pocoproject/poco, version: 1.7.2) as a tool for accessing our SQLite database, but we've analyzed the Poco code and it seems that failure on this code level can only (possibly) explain ~1% of all collected samples. This is how we've determined that the problem lies in sqlite3_open_v2 taking a long time.
This happens on both DELETE journal mode as well as on WAL.
It seems like after this problem happens the first time for a particular user each consecutive call to sqlite3_open_v2 takes that long until the user restarts whole application (possibly machine, no way to tell from our data).
We are using following flags setup for sqlite3_open_v2 (as in Poco):
sqlite3_open_v2(..., ..., SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_URI, NULL);
This usually doesn't happen on startup of the application so it's not likely to be caused by something happening while our application is not running. This includes power cuts offs causing data destruction (which tends to return SQLITE_CORRUPT anyway, as mentioned in https://www.sqlite.org/howtocorrupt.html).
We were never able to reproduce this issue locally even though we tried different things:
Multiple threads writing and reading from DB with synchronization required by a particular journaling system.
Keeping SQLite connection open for a long time and working on DB normally in a meanwhile.
Trying to hit HDD hard with other data (dumping /dev/rand (WSL) to multiple files from different processes while accessing DB normally).
Trying to force antivirus software to scan DB on every file access (tested with Avast with basically everything enabled including "scan on open" and "scan on write").
Breaking our internal synchronization required by particular journaling systems.
Calling WinAPI CreateFile with all possible combinations of file sharing options on DB file - this caused issues but sqlite3_open_v2 always returned fast - just with an error.
Calling WinAPI LockFile on random parts of DB file which is btw. nice way of reproducing SQLITE_IOERR, but no luck with reproducing the discussed issue.
Some additional attempts to actually stretch the Poco layer and double-check if our static analysis of codes is right.
We've tried to look for similar issues online but anything somewhat relevant we've found was here sqlite3-open-v2-performance-degrades-as-number-of-opens-increase. This doesn't seem to explain our case though, as the numbers of parallel connections are way beyond what we have as well as what would typical windows users have (unless there is some somewhat popular app exploiting SQLite which we don't know about).
It's very unlikely that this issue is caused by db being accessed through network share as we are putting the DB file inside %appdata% unless there is some pretty standard windows configuration which sets %appdata% to be a remote share.
Do you have any ideas what can cause that issue?
Maybe some hints on what else should we check or what additional diagnostic data that we can collect from users would be useful to pinpoint the real reason why that happens?
Thanks in advance

Debugging flows seems really painful

I'm running into serious productivity issues when debugging flows. I can only assume at this point is due to a lack of knowledge on my part; particularly effective debugging techniques of flows. The problems arise when I have one flow which needs to "wait" for a consumption of a specific state. What seems to happen is the waiting flow starts and waits for the consumption of the specified state, but despite implemented as a listening future with an associated call back (at this point I'm simply using getOrThrow on the future returned from 'WhenConsumed'), the flows just hang and I see hundreds of Artemis send/write messages in the console window. If I stop the debug session, delete the node build directory, redeploy the nodes and start again the flows restart and I can return to the point of failure. However if I simply stop and detach the debugger from the node and attempt to run the calling test (calling the flow via RPC), nothing seems to happen. It's almost as if the flow code (probably incorrect at this point) results in the StateMachine/messaging layer becoming stuck in some kind of stateful loop which is only resolved by wiping the node build directories and redeploying. Simply restarting the node results in the flow no longer executing at all. This is a real productivity killer, and so I'm writing this question in the hope and assumption I've missed an obvious trick in how to effectively test/debug flows in such a way which avoids repeatedly re-deploying the nodes.
It would be great if someone could explain how to effectively debug flows; especially flows which are dependent on vault updates and thus wait on a valut update event. I have considered using a subflow, but this would ultimately, (I believe?) not result in quite the functionality required; namely to have a flow triggered when an identified state is consumed by a node. Or maybe it would? Perhaps this issue is due to not using a subFlow??? I look forward to your thoughts anyway!!
Not sure about your specific use case. But in general,
I would do as much unit testing as possible before physically running the nodes and see if the flow works.
Corda provides three levels of unit testing: transaction/ledger DSL, mock network and driver DSL. So if done right, most if not all bugs in the flows should be resolved by the time of runnodes. Actual runnodes mostly just reveal configuration issues.

ORACLE/ASP.NET: ORA-2020 - Too many database links... what's causing this?

Here's the scenario...
We have an internal website that is running the latest version of the ODAC (Oracle Client). It opens database connections, runs a stored procedure or packaged method, then disconnects. Connection pooling is turned on, and we are currently under version 11g in both our development and test environments, but under 10gR2 in our production environment. This happens on Production.
A few days ago, a process began firing off a ORA-2020 error. The process is called from a webpage on our internal website. The user simply sets a date, hits a button, and a job is started on another system that is separate from the website. The call itself, however, uses a database link to run a function.
We've scoured the SQL to find that it only uses that one database link. And since these links are on a per session basis and the user isn't exceeding the default limit of 4, how is it possible that we are getting a ORA-2020 error.
We have ran a number of tests to try to push over the default limit of 4. ODAC, from what I recall, runs a commit after each connection, and I can't seem to run 4 DB links, then run a piece of SQL with 1 DB link directly after with any errors. The only way I can bring up this error is if I run a query with 4 DB links, then a function or piece of dynamic SQL with a database link within it. We don't have that problem as this issue is sporadic. It isn't ALWAYS happening.
Questions
Is it possible that connection pooling is allowing User B to use User A's connection after the initial process was run, thus adding to the open links number if User B runs a SQL statement with more database links?
Is this a scenario where we should up our limit past 4? What are the disadvantages of increasing the number?
Do I need to explicitly close open database links before disconnecting from the database? Oracle documentation seems to suggest it should automatically happen, but "on occasion"... doesn't.
Firstly, the simple solution: I'd double check that in the production database the number of default links is actually 4.
select *
from v$system_parameter
where name = 'OPEN_LINKS'
Assuming you're not going to get off that lightly:
Is it possible that connection pooling is allowing User B to use User
A's connection after the initial process was run, thus adding to the
open links number if User B runs a SQL statement with more database
links?
You say that you explicitly close the session, which, according to the documentation, should mean that all links associated with that session are closed. Other than that I confess complete ignorance on this point.
Is this a scenario where we should up our limit past 4? What are the
disadvantages of increasing the number?
There aren't any disadvantages that I can think of. Tom Kyte suggests, albeit a long time ago, that each open database link uses 500k of PGA memory. If you don't have any then this will obviously cause a problem but it should be more than fine for most situations.
There are, however, unintended consequences: Imagine that you up this number to 100. Somebody codes something that continually opens links and draws a lot of data through all them select * from my_massive_table or similar. Instead of 4 sessions doing this you have 100, which is attempting to transfer hundreds of gigabytes simultaneously. Your network dies under the strain...
There's probably more but you get the picture.
Do I need to explicitly close open database links before disconnecting
from the database? Oracle documentation seems to suggest it should
automatically happen, but "on occasion"... doesn't.
As you've noted the best answer is "probably not", which isn't much help. You don't mention exactly how you're terminating the session but if you're killing it rather than closing gracefully then definitely.
Using a database link spawns a child process on the remote server. Because your server is no longer in absolute charge of this process there's a myriad of things that could cause it to become orphaned or otherwise not close on termination of the parent process. By no means does this happen the whole time but it can and does.
I would do two things.
In your process, if an exception is encountered, e-mail the results of the following query to yourself.
select *
from v$dblink
As a minimum at least you will know what database links are open in the session and give you some way of tracing them.
Follow the documentations advice; specifically the following:
"You may have occasion to close the link manually. For example, close
links when:
The network connection established by a link is used infrequently in an application.
The user session must be terminated."
The first seems to exactly fit your situation. Unless your process is time-sensitive, which doesn't seem to be the case, then what have you got to lose? The syntax is:
alter session close database link <linkname>
We ended up increasing the link amount, but we never did find the root cause.

ASP.NET OLTP App - creating new version of records

My question is same as this question - but a little to add on to it. My problem is that the users of my web app are allowed to create new versions of a record. Every new version of a record results in creating corresponding "new versions" in 50 other related tables to record individual changes to that particular version. This creation of a new version with about 50 tables involved is running within a transaction (to rollback all changes in the event of an error). Many a times this procedure is slow, understandably due to a "lengthy transaction" with too many table "insertions" involved.
I am looking at a better solution/design to implement such a scenario.
Is there a better way to maintain "versions" of the same record, particularly when it creates too many duplicates in multiple tables
I don't feel the design in itself is good with too many records getting inserted for "every row version", but would at least like to address the immediate problem, the "lengthy transaction" - which causes delay at times. There may not be way way out, but wanted to still ask out - If I don't put the "versioning" inside a transaction, is there a better way to rollback in the event of an error (because transaction appears to block other OLTP queries - due to inserting new versions on all primary tables)
The versioning query runs for about 10 seconds right now, but gets worse at times. Any thoughts are appreciated
Do you have to return the "created" or "failed" message in real time? The following might also be overkill for your solution but it does create a scalable solution.
The web-server could post a message (to a queue of some sort) requesting the action. At this point the user could continue using the site to do other stuff. A windows service in the background can process the message (out of context of the website) and then notify the user (by a message inside the website, similar to stack overflow notifications) or by email that the task has either run or failed.
If you can get away with doing processing decoupled in near real time then you can modify your windows service to scale. You can have a thread pool to manage requests - so maybe you only have 5 threads running at any one time to limit load. If you run into more performance problems you can scale out and develop a system that can have 2 or more processors of the queue (that does add it's own problems / complexity).

Resources