Debugging flows seems really painful - corda

I'm running into serious productivity issues when debugging flows. I can only assume at this point that this is due to a lack of knowledge on my part, particularly of effective techniques for debugging flows. The problems arise when I have one flow which needs to "wait" for the consumption of a specific state. What seems to happen is that the waiting flow starts and waits for the consumption of the specified state, but despite this being implemented as a listening future with an associated callback (at this point I'm simply using getOrThrow on the future returned from 'WhenConsumed'), the flows just hang and I see hundreds of Artemis send/write messages in the console window.
If I stop the debug session, delete the node build directory, redeploy the nodes and start again, the flows restart and I can return to the point of failure. However, if I simply stop and detach the debugger from the node and attempt to run the calling test (calling the flow via RPC), nothing seems to happen. It's almost as if the flow code (probably incorrect at this point) leaves the StateMachine/messaging layer stuck in some kind of stateful loop which is only resolved by wiping the node build directories and redeploying. Simply restarting the node results in the flow no longer executing at all.
This is a real productivity killer, so I'm writing this question in the hope, and on the assumption, that I've missed an obvious trick for effectively testing/debugging flows in a way which avoids repeatedly redeploying the nodes.
It would be great if someone could explain how to effectively debug flows, especially flows which depend on vault updates and thus wait on a vault update event. I have considered using a subflow, but I believe this would ultimately not give quite the functionality required, namely to have a flow triggered when an identified state is consumed by a node. Or maybe it would? Perhaps this issue is due to not using a subFlow? I look forward to your thoughts anyway!

Not sure about your specific use case. But in general,
I would do as much unit testing as possible before physically running the nodes, and check that the flow works there first.
Corda provides three levels of unit testing: the transaction/ledger DSL, the mock network, and the driver DSL. If done right, most if not all bugs in the flows should be resolved by the time you get to runnodes; running the actual nodes mostly just reveals configuration issues.
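For illustration, a flow test at the mock-network level might look roughly like the sketch below. This is only a sketch against Corda 4's Java test API (MyFlow and the "com.example.flows" package are placeholders, and exact signatures vary between Corda versions); the useful part is that network.runNetwork() delivers messages deterministically, so you can put breakpoints inside the flow and step through it without deploying any nodes or wiping build directories.
import com.google.common.collect.ImmutableList;
import net.corda.core.concurrent.CordaFuture;
import net.corda.testing.node.MockNetwork;
import net.corda.testing.node.MockNetworkParameters;
import net.corda.testing.node.StartedMockNode;
import net.corda.testing.node.TestCordapp;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class MyFlowTests {
    private MockNetwork network;
    private StartedMockNode a;
    private StartedMockNode b;

    @Before
    public void setup() {
        // "com.example.flows" is a placeholder for your CorDapp package.
        network = new MockNetwork(new MockNetworkParameters().withCordappsForAllNodes(
                ImmutableList.of(TestCordapp.findCordapp("com.example.flows"))));
        a = network.createNode();
        b = network.createNode();
    }

    @After
    public void tearDown() {
        network.stopNodes();
    }

    @Test
    public void flowCompletes() throws Exception {
        // MyFlow is a placeholder for the flow under test.
        CordaFuture<?> future = a.startFlow(
                new MyFlow(b.getInfo().getLegalIdentities().get(0)));
        network.runNetwork(); // pump messages until the mock network is idle
        future.get();         // surfaces the flow's exception, if any
    }
}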

Related

Axon Event Processing Timeout

I am using an Axon Tracking Event Processor. Sometimes events take longer than 10 seconds to process.
This seems to cause the message to be processed again, and this appears in the log: "Releasing claim of token X/0 failed. It was owned by another node."
If I increase the number of segments it does not log this, but the event is still processed twice, so I think this might be misleading. (I think I was mistaken about this.)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval, none of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making a HTTP request to an external service.
I'm using Axon 4.1.2 with all default configuration under Spring auto-configuration. I cannot see the "Releasing claim on token and preparing for retry in [timeout]s" log.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have, for example, 2 segments and 6 applications it still doesn't reappear; however, I'm not sure how this is different from my original scenario of 1 segment and 2 applications.
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idempotency check before the HTTP call?
The "Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node." message can only occur in three scenarios:
You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
The main event processing loop has caught an Exception, making it retry with an exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not options 1 and 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, it is very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handle the same event, which is why you see the event handler being invoked twice.
This is obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have too little to go on. Let us figure this out @Dan!
Update
Thanks for updating your question @Dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes too long. If it takes too long, the token will be claimed by another process - in this case, the thread of the TEP in your other application instance. The "claim timeout" defaults to 10 seconds, which also corresponds with your long-running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And, I think this would be required on your end, giving the fact you have a long running operation.
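For example, with Spring auto-configuration you could override the TokenStore bean along the following lines. This is just a sketch, assuming the JpaTokenStore; the 60-second value is an arbitrary assumption and should simply exceed your slowest event handler.
import java.time.Duration;

import org.axonframework.common.jpa.EntityManagerProvider;
import org.axonframework.eventhandling.tokenstore.TokenStore;
import org.axonframework.eventhandling.tokenstore.jpa.JpaTokenStore;
import org.axonframework.serialization.Serializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TokenStoreConfig {

    @Bean
    public TokenStore tokenStore(EntityManagerProvider entityManagerProvider,
                                 Serializer serializer) {
        return JpaTokenStore.builder()
                .entityManagerProvider(entityManagerProvider)
                .serializer(serializer)
                // Default claim timeout is 10 seconds; raise it above the
                // longest expected event handling time.
                .claimTimeout(Duration.ofSeconds(60))
                .build();
    }
}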
There are of course different ways of tackling this, like making sure the TEP is only run on a single instance (not really fault tolerant though), or offloading this long-running operation to a scheduled task which is triggered by the event.
But, I think we've found the issue at least, so I'd suggest to tweak the claimTimeout and see if the problem persists.
Let us know if this resolves the problem on your end @Dan!

How to initiate a correct reset of a Tracking Event Processor

I am trying to use Axon's TrackingEventProcessor to replay our events.
While resetting the token, I am getting an UnableToClaimTokenException, since the service runs in a distributed setup.
Is there any way to solve this issue without using Axon Server?
Axon Server has nothing to do with the occurrence of the UnableToClaimTokenException, Axon Server just greatly simplifies the process of initiating a replay.
As the exception and javadoc of the TrackingEventProcessor state, you will need to stop all instances of a given TrackingEventProcessor prior to initiating a replay.
Thus, in a distributed set-up, you will have to stop each and every instance of the given TrackingEventProcessor before you are able to actually call resetTokens on one of them.
Without Axon Server, that means you will have to create your own endpoints or CLI within your application to stop the given processors. To simplify this, you would essentially want a centralized dashboard which shows all occurrences of a given TrackingEventProcessor.
This is exactly what Axon Server is, hence simplifying the process tremendously.
Regardless, it's definitely doable to create this yourself.
Thus, when it comes to triggering a replay, you will first shut down each instance of the TEP prior to a reset.
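To make that concrete, a rough sketch of what such endpoints could delegate to is shown below. The ReplayService class and the processor names are invented for illustration; it simply uses Axon's EventProcessingConfiguration to look up the processor registered on the local instance.
import org.axonframework.config.EventProcessingConfiguration;
import org.axonframework.eventhandling.TrackingEventProcessor;

public class ReplayService {

    private final EventProcessingConfiguration processingConfiguration;

    public ReplayService(EventProcessingConfiguration processingConfiguration) {
        this.processingConfiguration = processingConfiguration;
    }

    // Step 1: call this on EVERY instance (e.g. via your own endpoint or CLI),
    // so that no instance still holds a claim on the processor's tokens.
    public void stopProcessor(String processorName) {
        processingConfiguration
                .eventProcessor(processorName, TrackingEventProcessor.class)
                .ifPresent(TrackingEventProcessor::shutDown);
    }

    // Step 2: call this on ONE instance to rewind the tokens and start again;
    // the other instances can then be started back up as well.
    public void resetAndStart(String processorName) {
        processingConfiguration
                .eventProcessor(processorName, TrackingEventProcessor.class)
                .ifPresent(tep -> {
                    tep.resetTokens(); // throws if any instance still runs this TEP
                    tep.start();
                });
    }
}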

How to kill a thread, stop a promise execution in Raku

I am looking for a way to stop (send an exception to) a running promise on SIGINT. The examples given in the doc exit the whole process and not just one worker.
Does someone know how to "kill", "unschedule" or "stop" a running thread?
This is for a p6-jupyter-kernel issue or this REPL issue.
Current solution is restarting the repl but not killing the blocked thread
await Promise.anyof(
    start {
        ENTER $running = True;
        LEAVE $running = False;
        CATCH {
            say $_;
            reset;
        }
        $output :=
            self.repl-eval($code, :outer_ctx($!save_ctx), |%adverbs);
    },
    $ctrl-c
);
Short version: don't use threads for this, use processes. Killing the running process probably is the best thing that can be achieved in this situation in general.
Long answer: first, it's helpful to clear up a little confusion in the question.
First of all, there's no such thing as a "running Promise"; a Promise is a data structure for conveying a result of an asynchronous operation. A start block is really doing three things:
Creating a Promise (which it evaluates to)
Scheduling some code to run
Arranging that the outcome of running that code is reflected by keeping or breaking the Promise
That may sound a little academic, but really matters: a Promise has no awareness of what will ultimately end up keeping or breaking it.
Second, a start block is not - at least with the built-in scheduler - backed by a thread, but rather runs on the thread pool. Even if you could figure out a way to "take out" the thread, the thread pool scheduler is not going to be happy about having one of the threads it expects to eat from the work queue disappear. You could write your own scheduler that really does back work with a fresh thread each time, but that still isn't a complete solution: what if the piece of code the user has requested execution of schedules work of its own, and then awaits that? Then there is no one thread to kill to really bring things to a halt.
Let's assume, however, that we did manage to solve all of this, and we get ourselves a list of one or more threads that we really want to kill without their cooperation (cooperative situations are fairly easy; we use a Promise and have code poll that every so often and die if that cancellation Promise is ever kept/broken).
Any such mechanism that wants to be able to stop a thread blocked on anything (not just compute, but also I/O, locking, etc.) would need deep integration and cooperation from the underlying runtime (such as MoarVM). For example, trying to cancel a thread that is currently performing garbage collection will be a disaster (most likely deadlocking the VM as a whole). Other unfortunate cancellation times could lead to memory corruption if it was half way through an operation that is not safe to interrupt, deadlocks elsewhere if the killed thread was holding locks, and so forth. Thus one would need some kind of safe-pointing mechanism. (We already have things along those lines in MoarVM to know when it's safe to GC, however cancellation implies different demands. It probably cross-cuts numerous parts of the VM codebase.)
And that's not all: the same situation repeats at the Raku language level too. Lock::Async, for example, is not a kind of lock that the underlying runtime is aware of. Probably the best one can do is try to tear down the callstack and run all of the LEAVE phasers; that way there's some hope (if folks used the .protect method; if they just called lock and unlock explicitly, we're done for). But even if we manage not to leak resources (already a big ask), we still don't know - in general - if the code we killed has left the world in any kind of consistent state. In a REPL context this could lead to dubious outcomes in follow-up executions that access the same global state. That's probably annoying, but what really frightens me is folks using such a cancellation mechanism in a production system - which they will if we implement it.
So, effectively, implementing such a feature would entail doing a significant amount of difficult work on the runtime and Rakudo itself, and the result would be a huge footgun (I've not even enumerated all the things that could go wrong, just the first few that came to mind). By contrast, killing a process clears up all resources, and a process has its own memory space, so there's no consistency worries either.
There is currently no way to stop a thread if it doesn't want to be stopped.
A thread can check a flag every so often and decide to call it quits if that flag is set. It would be very nice if we had a way to throw an exception inside a thread from another thread, but we do not, at least not as far as I know.
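That cooperative pattern is language-agnostic; as a rough Java illustration of the idea (invented names, not a Raku solution), it amounts to the following.
import java.util.concurrent.atomic.AtomicBoolean;

public class CooperativeWorker implements Runnable {

    // Flag that another thread can set to request cancellation.
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    public void requestCancel() {
        cancelled.set(true);
    }

    @Override
    public void run() {
        while (!cancelled.get()) {
            // ... do one small unit of work, then loop and re-check the flag ...
        }
        // Clean up and exit voluntarily; nothing is forcibly killed.
    }
}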

MagicalRecord: Setting up core data stack on a background thread

One of the things Marcus Zarra recommends in his Core Data book when talking about setting up an app's core data stack is to put calls to addPersistentStoreWithType:configuration:URL:options:error: on a background thread, because it can take an indeterminate amount of time (e.g., to run migrations). Is there a simple way to tell MagicalRecord to do that? It looks like all of its setupCoreDataStack... methods perform everything on the calling (presumably main) thread.
I don't think it makes sense to just move the top-level setup calls onto a background thread, because it wouldn't be safe to start using MR from the main thread until at least the contexts had been created, right? Do I need to implement my own setupCoreDataStackWithAsyncMigration or somesuch thing?
There is the WWDC 2012 example code for setting up iCloud on a background thread (the Shared Core Data sample). You could refactor its CoreDataController to use MagicalRecord (and ignore anything iCloud-related). IIRC the locking mechanism, to stop other threads from accessing the store while the setup is in progress, is already present.
Before you go down that route, measure the time needed to start up on the device. If startup is fast enough for your needs then you might want to stick with doing the setup on the main thread.
Migrations can take some time, but a migration won't occur on every app launch. Migration time depends on data volume and the complexity of the changes between model versions. So again it is a judgment call whether to invest the time to move the migration to a background thread or to keep the user waiting.

How commonly do deadlock issues occur in programming?

I've programmed in a number of languages, but I am not aware of deadlocks in my code.
I took this to mean it doesn't happen.
Does this happen frequently (in programming, not in the databases) enough that I should be concerned about it?
Deadlocks could arise if two conditions are true: you have multiple threads, and they contend for more than one resource.
Do you write multi-threaded code? You might do this explicitly by starting your own threads, or you might work in a framework where the threads are created out of your sight, and so you're running in more than one thread without you seeing that in your code.
An example: the Java Servlet API. You write a servlet or JSP. You deploy to the app server. Several users hit your web site, and hence your servlet. The server will likely have a thread per user.
Now consider what happens if, in servicing the requests, you want to acquire some resources:
if (userIsImportant) {       // e.g. a flag derived from the request
    getResourceA();
}
getResourceB();
if (todayIsThursday) {
    getResourceA();
}
// some more code
releaseResourceA();
releaseResourceB();
In the contrived example above, think about what might happen on a Thursday when an important user's request arrives and, more or less simultaneously, an unimportant user's request arrives.
The important user's thread gets resource A and wants B. The less important user's thread gets resource B and wants A. Neither will let go of the resource it already owns ... deadlock.
This can actually happen quite easily if you are writing code that explicitly uses synchronization. Most commonly I see it happen when using databases, and fortunately databases usually have deadlock detection so we can find out what error we made.
Defenses against deadlock (both are sketched in the code below):
Acquire resources in a well-defined order. In the above example, if resource A were always obtained before resource B, no deadlock would occur.
If possible, use timeouts so that you don't wait indefinitely for a resource. This lets you detect contention and apply defense 1.
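A rough Java sketch of both defenses (the lock and method names are invented for illustration):
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ResourceGuard {

    // Illustrative locks standing in for resources A and B.
    private static final ReentrantLock lockA = new ReentrantLock();
    private static final ReentrantLock lockB = new ReentrantLock();

    // Defense 1: every thread acquires A before B, never the reverse.
    static void orderedWork() {
        lockA.lock();
        try {
            lockB.lock();
            try {
                // ... use both resources ...
            } finally {
                lockB.unlock();
            }
        } finally {
            lockA.unlock();
        }
    }

    // Defense 2: use a timeout so contention is detected instead of
    // waiting forever; the caller can back off and retry.
    static boolean timedWork() throws InterruptedException {
        if (!lockA.tryLock(1, TimeUnit.SECONDS)) {
            return false; // could not get A
        }
        try {
            if (!lockB.tryLock(1, TimeUnit.SECONDS)) {
                return false; // could not get B; A is released in the finally below
            }
            try {
                // ... use both resources ...
                return true;
            } finally {
                lockB.unlock();
            }
        } finally {
            lockA.unlock();
        }
    }
}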
It would be very hard to give an idea of how often it happens in reality (in production code? in development?) and that wouldn't really give a good idea of how much code is vulnerable to it anyway. (Quite often a deadlock will only occur in very specific situations.)
I've seen a few occurrences, although the most recent one I saw was in an Oracle driver (not in the database at all) due to a finalizer running at the same time as another thread trying to grab a connection. Fortunately I found another bug which let me avoid the finalizer running in the first place...
Basically deadlock is almost always due to trying to acquire one lock (B) whilst holding another one (A) while another thread does exactly the same thing the other way round. If one thread is waiting for B to be released, and the thread holding B is waiting for A to be released, neither is willing to let the other proceed.
Make sure you always acquire locks in the same order (and release them in the reverse order) and you should be able to avoid deadlock in most cases.
There are some odd cases where you don't directly have two locks, but it's the same basic principle. For example, in .NET you might use Control.Invoke from a worker thread in order to update the UI on the UI thread. Now Invoke waits until the update has been processed before continuing. Suppose your background thread holds a lock which the update requires... again, the worker thread is waiting for the UI thread, but the UI thread can't proceed because the worker thread holds the lock. Deadlock again.
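The same shape translated to Java Swing, purely for illustration (the original example is .NET's Control.Invoke; here SwingUtilities.invokeAndWait plays that role, and the lock and method names are made up):
import javax.swing.SwingUtilities;

public class UiDeadlockExample {

    private static final Object lock = new Object();

    // Runs on a worker thread.
    static void workerTask() throws Exception {
        synchronized (lock) {
            // Blocks until the event dispatch thread runs the runnable. If the
            // EDT is currently blocked in uiHandler() waiting for 'lock',
            // neither thread can proceed: deadlock.
            SwingUtilities.invokeAndWait(() -> {
                // update some UI component
            });
        }
    }

    // Runs on the UI (event dispatch) thread, e.g. from a button handler.
    static void uiHandler() {
        synchronized (lock) {
            // read state that the worker also protects with 'lock'
        }
    }
}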
This is the sort of pattern to watch out for. If you make sure you only lock where you need to, lock for as short a period as you can get away with, and document the thread safety and locking policies of all your code, you should be able to avoid deadlock. Like all threading topics, however, it's easier said than done.
If you get a chance, take a look at the first few chapters of Java Concurrency in Practice.
Deadlocks can occur in any concurrent programming situation, so it depends how much concurrency you deal with. Several examples of concurrent programming are: multiple processes, multiple threads, and libraries that introduce threads of their own. UI frameworks and event handling (such as timer events) may be implemented with threads. Web frameworks may spawn threads to handle multiple web requests simultaneously. With multicore CPUs you are likely to see more concurrent situations than before.
If A is waiting for B, and B is waiting for A, the circular wait causes the deadlock. So it also depends on the type of code you write. If you use distributed transactions, you can easily cause that type of scenario; without distributed transactions, you instead risk inconsistencies, such as money going missing between bank accounts.
It all depends on what you are coding. Traditional single-threaded applications that do not use locking? Not really.
Multi-threaded code with multiple locks is what causes deadlocks.
I just finished refactoring code that used seven different locks without proper exception handling. It had numerous deadlock issues.
A common cause of deadlocks is when you have different threads (or processes) acquire a set of resources in different order.
E.g. if you have some resource A and B, if thread 1 acquires A and then B, and thread 2 acquires B and then A, then this is a deadlock waiting to happen.
There's a simple solution to this problem: have all your threads always acquire resources in the same order. E.g. if all your threads acquire A and B in that order, you will avoid deadlock.
A deadlock is a situation where two processes are dependent on each other - one cannot finish before the other. Therefore, you will likely only have a deadlock in your code if you are running multiple code flows at any one time.
Developing a multi-threaded application means you need to consider deadlocks. A single-threaded application is unlikely to have deadlocks - but it's not impossible, the obvious example being that you may be using a DB which is subject to deadlocking.

Resources