what is the behaviour of an asynchronous transaction in infinispan cache? - asynchronous

I'm investigating asynchronous transaction in order to improve performances.
Could you please explain me what is the behaviour of a transaction in a replicated asynchronous cache?
If I have a transaction composed by operations, in which every operation depends by previous operation (i.e. order of execution of operations is important).
For instance, consider a transaction T that performs a read of data1 needed to build data2, data2 is then written on cache.
TRANSACTION T {
// 1° operation
data1 = get(key1);
// 2° operation
data2 = elaborate(data1);
// 3° operation
put(data2);
}
In other words, I need that "entire transactions" are asynchronously executed, but I need that operations performed inside a transaction remain synchronous.
Is it possible? If yes, how I have to configure infinispan?
Many Thanks
Francesco Sclano

I guess you've wrapped those operations into tm.begin(); try { ... tm.commit(); } catch (...) { tm.rollback(); }.
There are few other option, therefore, I assume you use the default - optimistic transactions and 2-phase commit. I'd also recommend enabling write skew check - this should be default in Infinispan 7 but in prior versions you had to enable it explicitly, and some operations behave weird without that.
As for the get, this is always synchronous - you have to wait for the response.
Before commit, the put does basically just another get (in order to return the value) and records that the transaction should do the write during commit.
Then, during commit, the PrepareCommand which locks all updated entries is sent asynchronously - therefore, you can't know whether it succeeded or failed (usually due to timeout, but also due to write skew check or changed value in conditional command).
Second phase, the CommitCommand which overwrites the entries and unlocks the locks is sent synchronously after that - so you wait until the response is received. However, if you have <transaction useSynchronization="true" /> (default), even if this command fails somewhere, the commit succeeds.
In order to send the CommitCommand (or RollbackCommand) asynchronously, you need to configure <transaction syncCommitPhase="false" syncRollbackPhase="false" />.
I hope I've interpreted the sources correctly - and don't ask me why is it done this way. Regrettably, I am not sure whether you can configure the semantics "commit or rollback this transaction reliably, but don't report the result and let me continue" out of the box.
EDIT: The commit in asynchronous mode should be 1-phase as you can't check the result anyway. Therefore, it's possible that concurrent writes get reordered on different nodes and you'll get the cluster inconsistent, but you are not waiting for the transaction to be completed.
Anyway, if you want to execute the whole block atomically and asynchronously, there's nothing easier than to wrap the code into Runnable and execute it in your own threadpool. And with synchronous cache.

Related

When/why can kafka partition unpause "by itself"?

I have this kafka consumer:
new ReactiveKafkaConsumerTemplate<>(createReceivingOptions())
it happily processes messages, I set
max-poll-records=1
so that things don't happen to fast for me. I can verify via logging breakpoint in poll method on
final Map<TopicPartition, List<ConsumerRecord<K, V>>> records = pollForFetches(timer);
how many records poll returned, and yes it's one. Then I asked it to pause all assigned partitions. In log is can see that it worked!
o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=Testing, groupId=Testing] Pausing partitions [TestTopic-0]
and from that point on I can see, that poll gets only 0 records and also this log:
Skipping fetching records for assigned partition TestTopic-0 because it is paused
OK great, it works! But wait, why is my whole topic getting processed then?
Then I found out, that at certain point there is also this log:
Consumer clientId=Testing, groupId=Testing] Resuming partitions [TestTopic-0]
what? Who is calling that? And then I also found out, that there are multiple requests for pausing all over place, not just the one I actually invoked.
Pausing is somehow used by reactive and cannot be used manually? Or does someone have explanation why …clients.consumer.KafkaConsumer does pause/resume topic on it's own all the time, and manual pause because of that gets unpaused?
After reviewing the ConsumerEventLoop code, the reactive client is using pause/resume internally, to handle back pressure - when the downstream can't receive any more data he pauses all assigned partitions and unconditionally resumes them when the back pressure is relieved.
It seems to me that it needs to keep track of whether the pause was done because of back-pressure and only resume in that case.
It looks like it used to do that before this commit.
Perhaps you could use back pressure instead to force the pause?

Is there a way to cancel all operations in the queue?

I'm writing an app using RxAndroidBle library (which is great) but due to a requirement of the Device vendor we must be able to cancel all operations after the device has reached a certain state. Is there a way of cancelling any read/write op that has been sent to the RxBleConnection?
Any ideas?
Thanks!
The library uses an internal ConnectionOperationQueue which tries to remove operations that have not yet reached execution state. A concrete implementation would vary case-to-case but using RxJava it should be fairly easy to achieve:
// First create a completable that will fulfil appropriately
Completable deviceReachedACertainState = ...;
// Then use it for every operation that you want to cancel (unsubscribe) when a certain state happens
Observable<byte[]> cancellableRead = rxBleConnection.readCharacteristic(UUID).takeUntil(deviceReachedACertainState);
Keep in mind that operations that were already taken off the queue for execution will not be cancelled.

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than the speed with which the dataProcessor may be able to process its input.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get piled under obsolete observations
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource. dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work but it may get complicated as dataProcessor may need to be blocked and re-activated; and communicate its status to dataSource, leading to a two way communication problem. This problem may boil down to a blocking queue in the consumer-producer problem but I am not sure..
The second idea is to have dataProcessor taking care of message sorting. In this architecture, dataSource will simply post updates in dataProcessor's queue. dataProcessor will use Scanto fetch the latest data available in his queue. This may be the way to go. However, I am not sure if in the current design of MailboxProcessorit is possible to clear a queue of messages, deleting the older obsolete ones. Furthermore, here, it is written that:
Unfortunately, the TryScan function in the current version of F# is
broken in two ways. Firstly, the whole point is to specify a timeout
but the implementation does not actually honor it. Specifically,
irrelevant messages reset the timer. Secondly, as with the other Scan
function, the message queue is examined under a lock that prevents any
other threads from posting for the duration of the scan, which can be
an arbitrarily long time. Consequently, the TryScan function itself
tends to lock-up concurrent systems and can even introduce deadlocks
because the caller's code is evaluated inside the lock (e.g. posting
from the function argument to Scan or TryScan can deadlock the agent
when the code under the lock blocks waiting to acquire the lock it is
already under).
Having the latest observation bounced back may be a problem.
The author of this post, #Jon Harrop, suggests that
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some inputs on how I could structure my solution.
Thank you.
Sounds like you might need a destructive scan version of the mailbox processor, I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace ConcurrentQueue by ConcurrentStack (plus add some bounded capacity logic) and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource and we need to process a message that is the latest at the moment when a worker is ready, then it all boils down to an agent with a stack instead of a queue (with some bounded capacity logic) plus a BlockingQueue of workers. Dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done the worker enqueues itself to the queue when becomes ready (e.g. before let! msg = inbox.Receive()). Dispatcher consumer thread then blocks until any worker is ready, while producer thread keeps the bounded stack updated. (bounded stack could be done with an array + offset + size inside a lock, below is too complex one)
Details
MailBoxProcessor is designed to have only one consumer. This is even commented in the source code of MBP here (search for the word 'DRAGONS' :) )
If you post your data to MBP then only one thread could take it from internal queue or stack.
In you particular use case I would use ConcurrentStack directly or better wrapped into BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has BoundedCapacity property that allows you to limit the size of a collection. It throws on Add, but you could catch it or use TryAdd. If A is a main stack and B is a standby, then TryAdd to A, on false Add to B and swap the two with Interlocked.Exchange, then process needed messages in A, clear it, make a new standby - or use three stacks if processing A could be longer than B could become full again; in this way you do not block and do not lose any messages, but could discard unneeded ones is a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on an arrays of BlockingCollections. This could help, e.g.:
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data. You may sacrifice one thread to be blocking and dispatching your messages indefinitely
each processing agent has its own BCCS or implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processorAgent, so you may store processing agents in a circular buffer to always dispatch a message to least recently used processor.
Something like this:
(data stream produces 'T)
|
[dispatcher's BCSC]
|
(a dispatcher thread consumes 'T and pushes to processors, manages capacity of BCCS and LRU queue)
| |
[processor1's BCCS/Actor/MBP] ... [processorN's BCCS/Actor/MBP]
| |
(process) (process)
Instead of ConcurrentStack, you may want to read about heap data structure. If you need your latest messages by some property of messages, e.g. timestamp, rather than by the order in which they arrive to the stack (e.g. if there could be delays in transit and arrival order <> creation order), you can get the latest message by using heap.
If you still need Agents semantics/API, you could read several sources in addition to Dave's links, and somehow adopt implementation to multiple concurrent consumers:
An interesting article by Zach Bray on efficient Actors implementation. There you do need to replace (under the comment // Might want to schedule this call on another thread.) the line execute true by a line async { execute true } |> Async.Start or similar, because otherwise producing thread will be consuming thread - not good for a single fast producer. However, for a dispatcher like described above this is exactly what needed.
FSharp.Actor (aka Fakka) development branch and FSharp MPB source code (first link above) here could be very useful for implementation details. FSharp.Actors library has been in a freeze for several months but there is some activity in dev branch.
Should not miss discussion about Fakka in Google Groups in this context.
I have a somewhat similar use case and for the last two days I have researched everything I could find on the F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, of which half were born during writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
async { let! newMsg = inbox.TryReceive 0
match newMsg with
| None -> oldMsg
| Some newMsg -> return! readLatestLoop newMsg }
let readLatest() =
async { let! msg = inbox.Receive()
return! readLatestLoop msg }
When faced with the same problem I architected a more sophisticated and efficient solution I called cancellable streaming and described in in an F# Journal article here. The idea is to start processing messages and then cancel that processing if they are superceded. This significantly improves concurrency if significant processing is being done.

Sqlite's Serialized mode

In the docs here:
http://www.sqlite.org/threadsafe.html
For serialized mode it says:
"In serialized mode, SQLite can be safely used by multiple threads with no restriction."
I want to make sure I understand the guarantee presented there. If a single database connection is opened using the "SQLITE_OPEN_FULLMUTEX" flag and two threads simultaneously try to call sqlite3_exec at the exact same instant, does Sqlite automatically serialize the calls?
The answer is yes. sqlite3_exec() will grab a mutex when the function is entered and releases the mutext once the function is left. Only one thread can own the mutex at any given time, so only one thread can execute sqlite3_exec() at any given time. If two threads try to execute sqlite3_exec() at the same time, one will wait for the other.

How to serialize data reliably

Good day, I receive data from a communication channel and display it. Parallel, I serialize it into a SQLite database (using normal SQL INSERT statements). After my application exit I do a .commit on the sqlite object.
What happens if my application is terminated brutally in the middle? Will the latest (reasonably - not say 100 microsec ago, but at least a sec ago) data be safely in the database even without a .commit is made? Or should I have periodic commit? What are best patterns for doing these things?
I tried autocommit on (sqlite's option) and this slows code a lot by a factor ~55 (autocommit vs. just one commit at end). Doing commit every 100 inserts brings performance within 20% of the optimal mode. So autocommit is very slow for me.
My application pumps lots data into DB - what can I do to make it work well?
You should be performing this within a transaction, and consequently performing a commit at appropriate points in the process. A transaction will guarantee that this operation is atomic - that is, it either works or doesn't work.
Atomicity states that database
modifications must follow an “all or
nothing” rule. Each transaction is
said to be “atomic” if when one part
of the transaction fails, the entire
transaction fails. It is critical that
the database management system
maintain the atomic nature of
transactions in spite of any DBMS,
operating system or hardware failure.
If you've not committed, then the inserts won't be visible (and be rolled back) when your process is terminated.
When do you perform these commits ? When your inserts represent something consistent and complete. e.g.. if you have to insert 2 pieces of information for each message, then commit after you've inserted both pieces of info. Don't commit after each one, since your info won't be consistent or complete.
The data is not permanent in the database without a commit. Use an occasional commit to balance the speed of performing many inserts in a transaction (the more frequent the commit, the slower) with the safety of having more frequent commits.
You should do a COMMIT every time you complete a logical change.
One reason for transaction is to prevent uncommitted data from a transaction to be visible from outside. That is important because sometimes a single logical change can translate into multiple INSERT or UPDATE statements. If one of the latter queries of the transaction fails, the transaction can be cancelled with ROLLBACK and no change at all is recorded.
Generally speaking, no change performed in a transaction is recorded in the database until COMMIT succeeds.
does not this slow down considerably my code? – zaharpopov
Frequent commits, might slow down your code, and as an optimization you could try grouping several logical changes in a single transaction. But this is a departure from the correct use of transactions and you should only do this after measuring that this significantly improves performance.

Resources