What happens when two nodes attempt to write at the same time in 2PC?

Does anyone know what happens when two nodes try to write data at the same time and both initiate the 2PC protocol? Does the request from one node get aborted while the other succeeds? Would the failed node then retry with some exponential backoff?
If not, what happens?

Usually resource managers do not allow two nodes to participate in the same transaction branch at the same time. The second node/binary/thread that tries to join the transaction will most likely get a timeout or some other error on the xa_start(..., TMJOIN) call.
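As a rough illustration of that failure mode, here is a minimal sketch in Java against the standard javax.transaction.xa API. The XADataSource wiring, the Xid values and the exact error code are assumptions; behaviour varies by resource manager (some block until a timeout instead of failing fast):

import javax.sql.XADataSource;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

public class ConcurrentBranchSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical helper: obtain the driver-specific XADataSource (e.g. from a connection pool).
        XADataSource xaDataSource = obtainXaDataSource();

        XAResource nodeA = xaDataSource.getXAConnection().getXAResource();
        XAResource nodeB = xaDataSource.getXAConnection().getXAResource();

        Xid xid = new SimpleXid(1, new byte[]{0x01}, new byte[]{0x01});

        // Node A starts work on a branch of the global transaction.
        nodeA.start(xid, XAResource.TMNOFLAGS);

        try {
            // Node B tries to join the same branch at the same time.
            // Most resource managers reject this or block until a timeout.
            nodeB.start(xid, XAResource.TMJOIN);
        } catch (XAException e) {
            System.out.println("Join rejected, XA error code: " + e.errorCode);
            // Typical reaction: back off and retry, or use a separate branch qualifier.
        }
    }

    private static XADataSource obtainXaDataSource() {
        throw new UnsupportedOperationException("wire in your driver's XADataSource here");
    }

    // Trivial Xid implementation, just enough for the sketch.
    static final class SimpleXid implements Xid {
        private final int formatId;
        private final byte[] gtrid;
        private final byte[] bqual;

        SimpleXid(int formatId, byte[] gtrid, byte[] bqual) {
            this.formatId = formatId;
            this.gtrid = gtrid;
            this.bqual = bqual;
        }

        public int getFormatId() { return formatId; }
        public byte[] getGlobalTransactionId() { return gtrid; }
        public byte[] getBranchQualifier() { return bqual; }
    }
}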

Related

Retry and Failure queue prioritization in NiFi

I have a queue in NiFi that contains items to be processed through an API call (InvokeHTTP). These items can be processed and return the data correctly (status 200), they can be not found (status 404), or they can fail (status 500).
However, in the case of status 404 and 500, false negatives can happen: if I query the same data that gave an error again, it can return status 200. But there are also cases where there really is a failure and it is not a false negative.
So I created a retry/failure queue so that these items enter InvokeHTTP again and query the API. I set an expiration time of 5 minutes so that data that is genuinely failing does not query the API forever.
However, I want to prioritize this failure/retry queue, so that as soon as an item reaches it, it is queried against the API again ahead of the standard processing queue, so as not to lose the data that gave false negatives.
Is it possible to do this with this self-relationship, or do I need a new flowfile?
Each queue can have a prioritizer configured on the queue's settings. Currently you have two separate queues for InvokeHttp, the failure/retry queue and the incoming queue from the matched relationship of EvaluateJsonPath. You need to put a funnel in front of InvokeHttp and send both of these queues to the funnel, then the funnel to InvokeHttp. This way you can create a single incoming queue to InvokeHttp and configure the prioritizer there.
In order to prioritize it correctly, you may want to use the PriorityAttributePrioritizer. You would use UpdateAttribute to add a "priority" attribute to each flow file: the ones from failure/retry get priority "A" and the others get priority "B" (or anything that sorts after "A").

BizTalk Parallel Convoy with separate TimeOutException handling not able to build, with error message "fatal error X1001: unknown system exception"

Consider the following basic structure of a Parallel Convoy pattern in BizTalk 2016: a Parallel Actions shape with two active Receive shapes, combined with a single correlation set that is initialized by both active receives.
Now my issue arose when I want to have separate exception handling, one for the left receive, and one for the right receive. So I've put a scope around the left receive (Scope_1) with a timeout. And I've wrapped that scope in another scope (Scope_3), to catch the timeout exception.
Now for some reason this isn't allowed and I get back "fatal error X1001: unknown system exception" at build time.
However, if I wrap Scope_3 around both active receives, it builds successfully.
What's the significant difference here for BizTalk to not allow separate timeout exception handling in this scenario?
By the way:
It doesn't matter what type of exception I'm trying to catch, or if all my scopes are a Long Running transaction or not, the occurrence of the error is the same.
If I make a separate correlation set for each receive, the error does not occur, but of course that's not what I want because it wouldn't make it a parallel convoy then.
Setting scopes to synchronized does not affect the behavior.
The significant difference is that the orchestration will start up when it receives the first message, which may not be the one in Scope_1. In that scenario the timer would not be started at all. And if the first message was the one in Scope_1, that scope won't time out because its message has been received, but nothing will be timing out for scope_2.
Having the timer around both does set the timeout in both scenarios.
What you could do is have the timeout scope as per your second example and set a flag to indicate which one was received, and use that in your exception block.
The other option is a first Receive shape that initializes the correlation set, followed by a second Receive that follows the correlation, with the timeout on that.
First, I am able to replicate your issue.
Although Visual Studio reports this as an unknown system exception, to me it looks like unreachable code, caused by the receive shape inside the scope (scope_3) that is trying to initialize your correlation. So there is a possibility that you won't be able to initialize the correlation the same way your left scope (scope_2) does if your main scope (scope_1) throws an exception.
The only way I can think of is to use different correlation sets; you can set your send port to follow these correlation sets.
Without using correlation sets, this should not give an error at build time. To me this looks like an MS bug: VS should point out the unreachable code rather than a fatal error.

Flink + Kafka: Why am I losing messages?

I have written a very simple Flink streaming job which takes data from Kafka using a FlinkKafkaConsumer082.
protected DataStream<String> getKafkaStream(StreamExecutionEnvironment env, String topic) {
    Properties result = new Properties();
    result.put("bootstrap.servers", getBrokerUrl());
    result.put("zookeeper.connect", getZookeeperUrl());
    result.put("group.id", getGroup());
    return env.addSource(
            new FlinkKafkaConsumer082<>(topic, new SimpleStringSchema(), result));
}
This works very well, and whenever I put something into the topic on Kafka, it is received by my Flink job and processed. Now I wanted to see what happens if my Flink job isn't online for some reason. So I shut down the Flink job and kept sending messages to Kafka. Then I started my Flink job again and expected it to process the messages that were sent in the meantime.
However, I got this message:
No prior offsets found for some partitions in topic collector.Customer. Fetched the following start offsets [FetchPartition {partition=0, offset=25}]
So it basically ignored all messages that arrived since the last shutdown of the Flink job and just started to read at the end of the queue. From the documentation of FlinkKafkaConsumer082 I gathered that it automatically takes care of synchronizing the processed offsets with the Kafka broker. However, that doesn't seem to be the case.
I am using a single-node Kafka installation (the one that comes with the Kafka distribution) with a single-node ZooKeeper installation (also the one that is bundled with the Kafka distribution).
I suspect it is some kind of misconfiguration or something like that, but I really don't know where to start looking. Has anyone else had this issue and maybe solved it?
I found the reason. You need to explicitly enable checkpointing in the StreamExecutionEnvironment to make the Kafka connector write the processed offsets to ZooKeeper. If you don't enable it, the Kafka connector will not write the last read offset and will therefore not be able to resume from there when the collecting job is restarted. So be sure to write:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(); // <-- this is the important part
Anatoly's suggestion for changing the initial offset is probably still a good idea, in case checkpointing fails for some reason.
https://kafka.apache.org/08/configuration.html
Set auto.offset.reset to smallest (by default it is largest).
auto.offset.reset: What to do when there is no initial offset in ZooKeeper or if an offset is out of range:
smallest: automatically reset the offset to the smallest offset
largest: automatically reset the offset to the largest offset
anything else: throw an exception to the consumer.
If this is set to largest, the consumer may lose some messages when the number of partitions, for the topics it subscribes to, changes on the broker. To prevent data loss during partition addition, set auto.offset.reset to smallest.
Also make sure getGroup() returns the same group id after a restart.
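Putting the two answers together, a minimal sketch of the consumer setup could look like the following; the broker/ZooKeeper addresses, checkpoint interval and group id are placeholder assumptions:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // commit processed offsets to ZooKeeper every 5 seconds

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("zookeeper.connect", "localhost:2181");
props.put("group.id", "collector");          // must stay stable across restarts
props.put("auto.offset.reset", "smallest");  // start from the earliest offset if none is stored

DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer082<>("collector.Customer", new SimpleStringSchema(), props));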

What could cause a message (from a polling receive location) to be ignored by subscribing orchestration?

I'll try provide as much information as possible:
No error message.
The instance stays in the "ready service instances".
The receive location has the same parameters (except URI, the three polling queries, user account/pw and receive pipeline) as another receive location that points to another database/table which works.
The pipeline is waiting for the correct schema.
The port surface and receive location are both waiting for the correct schema.
In my test example, there are only 10 lines being returned.
The message, which contains those 10 lines, validates against the schema.
I tried leaving the instance alone, to no avail: 30+ minutes and no change in its condition.
I also tried suspending and then resuming it, which places the instance in the "dehydrated orchestrations" list. Again, no error message.
I am able to get the message by looking at the body of the message that's in the "ready to run" service instance. (This is the message that validates against the schema I use in Visual Studio.)
How might something like this arise?
Stupid question, but I have to ask... Is the corresponding host instance running?

How to lock node for deletion process

Within Alfresco, I want to delete a node, but I don't want it to be used by any other users in a cluster environment.
I know that I can use LockService to lock a node (in a cluster environment), as in the following lines:
lockService.lock(deleteNode);
nodeService.deleteNode(deleteNode);
lockService.unlock(deleteNode);
The last line may cause an exception because the node has already been deleted, and indeed it throws:
A system error happened during the operation: Node does not exist: workspace://SpacesStore/cb6473ed-1f0c-4fa3-bfdf-8f0bc86f3a12
So how do I ensure concurrency in a cluster environment when deleting a node, to prevent two users from accessing the same node at the same time, where one of them wants to update it and the other wants to delete it?
Depending on your cluster environment (e.g. the same DB server used by all Alfresco instances), transactions will most likely be enough to ensure no stale content is used:
serverA(readNode)
serverB(deleteNode)
serverA(updateNode) <--- transaction failure
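A rough sketch of that approach using Alfresco's RetryingTransactionHelper (the Spring wiring of nodeService and transactionService is assumed):

import org.alfresco.repo.transaction.RetryingTransactionHelper;
import org.alfresco.repo.transaction.RetryingTransactionHelper.RetryingTransactionCallback;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.alfresco.service.transaction.TransactionService;

public class TransactionalNodeDeleter {

    // Assumed to be injected via Spring in a real repository extension.
    private NodeService nodeService;
    private TransactionService transactionService;

    public void setNodeService(NodeService nodeService) { this.nodeService = nodeService; }
    public void setTransactionService(TransactionService transactionService) { this.transactionService = transactionService; }

    public void deleteInTransaction(final NodeRef nodeToDelete) {
        RetryingTransactionHelper txHelper = transactionService.getRetryingTransactionHelper();
        txHelper.doInTransaction(new RetryingTransactionCallback<Void>() {
            public Void execute() throws Throwable {
                // A concurrent modification of the same node on another cluster member
                // conflicts at commit time and is retried or surfaced as an error,
                // instead of silently operating on stale content.
                nodeService.deleteNode(nodeToDelete);
                return null;
            }
        }, false, true); // readOnly = false, requiresNew = true
    }
}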
The JobLockService allows more control in case of more complex operations, which might involve multiple, dynamic nodes (or no nodes at all, e.g. sending emails or similar):
serverA(acquireLock)
serverB(acquireLock) <--- wait for the lock to be released
serverA(readNode1)
serverA(if something then updateNode2)
serverA(updateNode1)
serverA(releaseLock)
serverB(readNode2)
serverB(releaseLock)
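And a minimal sketch of the JobLockService approach; the lock QName, time-to-live and retry settings are illustrative assumptions:

import org.alfresco.repo.lock.JobLockService;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.alfresco.service.namespace.QName;

public class LockedNodeDeleter {

    // Illustrative lock name; any QName unique to this operation will do.
    private static final QName DELETE_LOCK =
            QName.createQName("http://www.example.com/model/locks/1.0", "nodeDeletion");

    // Assumed to be injected via Spring.
    private JobLockService jobLockService;
    private NodeService nodeService;

    public void setJobLockService(JobLockService jobLockService) { this.jobLockService = jobLockService; }
    public void setNodeService(NodeService nodeService) { this.nodeService = nodeService; }

    public void deleteWithClusterLock(NodeRef nodeToDelete) {
        // Acquire a cluster-wide lock: 30 s time-to-live, retrying every 500 ms, up to 10 times.
        String lockToken = jobLockService.getLock(DELETE_LOCK, 30000L, 500L, 10);
        try {
            if (nodeService.exists(nodeToDelete)) {
                nodeService.deleteNode(nodeToDelete);
            }
        } finally {
            // Always release the lock, even if the delete fails.
            jobLockService.releaseLock(lockToken, DELETE_LOCK);
        }
    }
}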
