Background: We have been getting ProducerFencedException in our producer-only transactions, and want to introduce uniqueness to our prefix to prevent this issue.
In this discussion, Gary mentions that in the case of read-process-write, the prefix must be the same in all instances and after each restart.
How to choose Kafka transaction id for several applications, hosted in Kubernetes?
While digging into this issue, I came to the realisation that we are sharing the same prefixId for both producer-only and read-process-write.
In our TopicPublisher class wrapping kafkaTemplate, we already have a publish() and publishInTransaction() methods for read-process-write and producer-only use cases respectively.
I am thinking to have 2 sets of kafkaTemplates/TransactionManagers/ProducerFactories, one with a fixed prefixId to be used by the publish() method and one with a unique prefix to be used in publishInTransaction().
My question is:
Does the prefix for producer-only need to be the same after a pod is restarted. Can we just append some uuid or k8s podId? Someone mentioned there may be delays with aborting transactions.
Is there a clean way to detect if the TopicPublisher is being called from a KafkaListener, so we can have just 1 publish method that uses the correct kafkaTemplate as needed?
Actually, there is no issue using the same transactionIdPrefix, at least with recent versions.
The factory gets a txIdPrefix.
For read-process-write, we create (and cache) a producer with transactionalId:
private String zombieFenceTxIdSuffix(String topic, int partition) {
return this.consumerGroupId + "." + topic + "." + partition;
}
which is suffixed onto the prefix.
For producer only-transactions, we create (and cache) a producer with the prefix and simple numeric suffix.
In the upcoming 2.3 release, there is also an option to assign a producer to a thread so the same thread always uses the same transactional.id.
I believe it needs to be the same, unless you don't mind waiting for transaction.timeout.ms (default 1 minute).
The maximum amount of time in ms that the transaction coordinator will wait for a transaction status update from the producer before proactively aborting the ongoing transaction.If this value is larger than the transaction.max.timeout.ms setting in the broker, the request will fail with a InvalidTransactionTimeout error.
This is what we do in spring-integration-kafka
if (this.transactional
&& TransactionSynchronizationManager.getResource(this.kafkaTemplate.getProducerFactory()) == null) {
sendFuture = this.kafkaTemplate.executeInTransaction(t -> {
return t.send(producerRecord);
});
}
else {
sendFuture = this.kafkaTemplate.send(producerRecord);
}
You can also use String suffix = TransactionSupport.getTransactionIdSuffix(); which is what the factory uses when it is asked for producer - if null, you are not running on a transactional consumer thread.
Related
GRPC END Contract=IPubSubPartitionManager Action=GetOrAddContextSelectorPartition ID=71eff709-3920-4139-a5c9-2b0bef6f5ba7 From=ipv4:10.0.0.56:35091 IsFault=True Duration(ms)=4932 Request { "contextSelector": "/debug/fabric--madari-madariservice-01-GrpcPublishSubscribeProber-8", "addIfNotExists": true, "clientCredential": { "credentialValue": "uswestcentral-prod.sdnpubsub.core.windows.net-client", "credentialRegex": { "pattern": "null string" }, "enablePropertyBasedAcls": false } } Response { "errorMsg": "Timed out waiting for Shared lock on key; id=49b61cd7-31ba-4c5d-b579-d068326e8a90#133028293161628388#urn:ContextSelectorMapping/dataStore#132077756302731635, timeout=4000ms, txn=133029757743026569, lockResourceNameHash=6572262935404555983; oldest txn with lock=133029757735370545 (mode Shared)\r\n" }
This problem mostly affects read operations that support Repeatable Read, the user may request an Update lock rather than a Shared lock. Update lock is an asymmetric lock is used to prevent but when a several transactions potentially updates at that time deadlock occurs.
Try to avoid TimeSpan.MaxValue for time-outs. It may detect deadlocks.
Don't create a transaction within another transaction's using statement. For example: the two transactions (T1 and T2) are attempting to read from and update K1, respectively. Due to the fact that they both end up with the Shared lock possible for them to deadlock. In this situation, one or both of the operations will time out.
To avoid a frequent type of deadlock that arises,
The default transaction timeout should be increased. Usually, it takes 4 second try to use in a different value
Make sure the transactions are short-lived; if they last any longer than necessary, you'll be blocking other tasks in the queue for a longer period of time than necessary.
For Reference: Azure Service Fabric
We appear to have a problem with MDriven generating the same ECO_ID for multiple objects. For the most part it seems to happen in conjunction with unexpected process shutdowns and/or server shutdowns, but it does also happen during normal activity.
Our system consists of one ASP.NET application and one WinForms application. The ASP.NET app is setup in IIS to use a single worker process. We have a mixture of WebForms and MVC, including ApiControllers. We're using a rather old version of the ECO packages: 7.0.0.10021. We're on VS 2017, target framework is 4.7.1.
We have it configured to use 64 bit integers for object id:s. Database is Firebird. SQL configuration is set to use ReadCommitted transaction isolation.
As far as I can tell we have configured EcoSpaceStrategyHandler with EcoSpaceStrategyHandler.SessionStateMode.Never, which should mean that EcoSpaces are not reused at all, right? (Why would I even use EcoSpaceStrategyHandler in this case, instead of just creating EcoSpace normally with the new keyword?)
We have created MasterController : Controller and MasterApiController : ApiController classes that we use for all our controllers. These have a EcoSpace property that simply does this:
if (ecoSpace == null)
{
if (ecoSpaceStrategyHandler == null)
ecoSpaceStrategyHandler = new EcoSpaceStrategyHandler(
EcoSpaceStrategyHandler.SessionStateMode.Never,
typeof(DiamondsEcoSpace),
null,
false
);
ecoSpace = (DiamondsEcoSpace)ecoSpaceStrategyHandler.GetEcoSpace();
}
return ecoSpace;
I.e. if no strategy handler has been created, create one specifying no pooling and no session state persisting of eco spaces. Then, if no ecospace has been fetched, fetch one from the strategy handler. Return the ecospace. Is this an acceptable approach? Why would it be better than simply doing this:
if (ecoSpace = null)
ecoSpace = new DiamondsEcoSpace();
return ecoSpace;
In aspx we have a master page that has an EcoSpaceManager. It has been configured to use a pool but SessionStateMode is Never. It has EnableViewState set to true. Is this acceptable? Does it mean that EcoSpaces will be pooled but inactivated between round trips?
It is possible that we receive multiple incoming API calls in tight succession, so that one API call hasn't been completed before the next one comes in. I assume that this means that multiple instances of MasterApiController can execute simultaneously but in separate threads. There may of course also be MasterController instances executing MVC requests and also the WinForms app may be running some batch job or other.
But as far as I understand id reservation is made at the beginning of any UpdateDatabase call, in this way:
update "ECO_ID" set "BOLD_ID" = "BOLD_ID" + :N;
select "BOLD_ID" from "ECO_ID";
If the returned value is K, this will reserve N new id:s ranging from K - N to K - 1. Using ReadCommitted transactions everywhere should ensure that the update locks the id data row, forcing any concurrent save operations to wait, then fetches the update result without interference from other transactions, then commits. At that point any other pending save operation can proceed with its own id reservation. I fail to see how this could result in the same ID being used for multiple objects.
I should note that it does seem like it sometimes produces id duplicates within one single UpdateDatabase, i.e. when saving a set of new related objects, some of them end up with the same id. I haven't really confirmed this though.
Any ideas what might be going on here? What should I look for?
The issue is most likely that you use ReadCommitted isolation.
This allows for 2 systems to simultaneously start a transaction, read the current value, increase the batch, and then save after each other.
You must use Serializable isolation for key generation; ie only read things not currently in a write operation.
MDriven use 2 settings for isolation level UpdateIsolationLevel and FetchIsolationLevel.
Set your UpdateIsolationLevel to Serializable
I’m writing functionality for receiving messages from Azure Service Bus Topic and delete the specified message from Topic. Before deleting that message, I need to send that message to other Topic.
static async Task ProcessMessagesAsync(Message message, CancellationToken token)
{
// Process the message.
Console.WriteLine($"Received message: WorkOrderNumber:{message.MessageId} SequenceNumber:{message.SystemProperties.SequenceNumber} Body:{Encoding.UTF8.GetString(message.Body)}");
Console.WriteLine("Enter the WorkOrder Number you want to delete:");
string WorkOrderNubmer = Console.ReadLine();
if (message.MessageId == WorkOrderNubmer)
{
//TODO:Post message into other topic(Priority) then delete from this current topic.
var status=await SendMessageToBus(message);
if (status == true)
{
await normalSubscriptionClient.CompleteAsync(message.SystemProperties.LockToken);
Console.WriteLine($"Successfully deleted your message from Topic:{NormalTopicName}-WorkOrderNumber:" + message.MessageId);
}
else
{
Console.WriteLine($"Failed to send message to PriorityTopic:{PriorityTopicName}-WorkOrderNumber:" + message.MessageId);
}
}
else
{
Console.WriteLine($"Failed to delete your message from Topic:{NormalTopicName}-WorkOrderNumber:" + WorkOrderNubmer);
// Complete the message so that it is not received again.
// This can be done only if the subscriptionClient is created in ReceiveMode.PeekLock mode (which is the default).
await normalSubscriptionClient.CompleteAsync(message.SystemProperties.LockToken);
// Note: Use the cancellationToken passed as necessary to determine if the subscriptionClient has already been closed.
// If subscriptionClient has already been closed, you can choose to not call CompleteAsync() or AbandonAsync() etc.
// to avoid unnecessary exceptions.
}
}
My issue with this approach is:
It’s not scalable; what if the message is the 50th in the collection? We’d have to iterate through 49 times and mark i.e deleted.
It’s a long-running process.
To avoid these problems, I want to get the specified message from the queue based on Index or sequence number then I can delete that from the topic.
So, can anyone suggest me how to resolve this problem?
So if I understand your questions and comments correctly you are trying to do something like this:
Incoming messages come into either a standard topic or priority
topic.
Some process checks messages in the standard topic and
"moves" them to the priority topic based on some criteria by
deleting them from the standard topic and adding them to the
priority topic.
Messages are processed as normal.
As Sean noted, step 2 simply won't work. Service Bus is a first=in-first-out-ish system where a consumer simply picks up the next available message. You can sort through a queue by pulling out all the messages and abandoning/completing them based on specific criteria, but scaling is a problem. In addition, you can think of each topic subscription as its own separate queue- removing a message form one subscription does not remove it from any of the other subscriptions.
What I would suggest instead of trying to pull out everything from the topics and then putting back the ones you want to keep, add a sorting queue in front of the two topics. If you don't need to sort the high priority messages you could put this sorting process in front of the standard priority topic only.
This is how the process would work:
Incoming messages are added to a sorting queue Note that this is a single queue, not a topic. At this point in the process we want to ensure there is only one copy of each message.
A sorting process moves messages from the sorting queue into either the standard or priority queue as is appropriate. Using something like Azure Functions you can scale this process fairly easily.
Messages are processed from the topics as normal.
In a clustered intershop environment, we see a lot of error messages. I'm suspecting the communication between the application servers is not reliable.
Caused by: com.intershop.beehive.orm.capi.common.ORMException:
Could not UPDATE object: com.intershop.beehive.bts.internal.orderprocess.basket.BasketPO
Is there safe way to for the local application server, to load the latest instance.
BasketPO basket = null;
try{
BasketPOFactory factory = (BasketPOFactory) NamingMgr.getInstance().lookupFactory(BasketPOFactory.FACTORY_NAME);
try(ORMObjectCollection<BasketPO>baskets = factory.getObjectsBySQLWhere("uuid=?", new Object[]{basketID},CacheMode.NO_CACHING);){
if(null != baskets && !baskets.isEmpty()){
basket = baskets.stream().findFirst().get();
}
}
}
catch(Throwable t){
Logger.error(this, t.getMessage(),t);
}
Does the ORMObject#refresh method help ?
try{
if(null != basket)
basket.refresh();
}
catch(Throwable t){
Logger.error(this, t.getMessage(),t);
}
You experience that error because an optimistic lock "fails". To understand the problem better I'll try to explain how the optimistic locking works in particular in the Intershop ORM layer.
There is a column named OCA in the PO tables (OCA == optimistic control attribute?). Imagine that two servers (or two different threads/transactions) try to update the same row in a table. For performance reasons there is no DB locking involved by default (e.g. by issuing select for update). Instead the first thread/server increments the OCA by one when it updates the row successfully within its transaction.
The second thread/server knows the value of the OCA from the time that it created its own state. It then tries to update the row by issuing a similar query:
UPDATE ... OCA = OCA + 1 ... WHERE UUID = <uuid> AND OCA = <old_oca>
Since the OCA is already incremented by the first thread/server this update fails (in reality - updates 0 rows) and the exception that you posted above is thrown when the ORM layer detects that no rows were updated.
Your problem is not the inter-server communication but rather the fact that either:
multiple servers/threads try to update the same object;
there are direct updates in the database that bypass the ORM layer (less likely);
To solve this you may:
Avoid that situation altogether (highly recommended by me :-) );
Use the ISH locking framework (very cumbersome imHo);
Use pesimistic locking supported by the ISH ORM layer and Oracle (beware of potential performance issues, deadlocks, bugs);
Use Java locking - but since the servers run in different JVM-s this is rarely an option;
OFFTOPIC remarks: I'm not sure why you use getObjectsBySQLWhere when you know the primary key (uuid). As far as I remember ORMObjectCollection-s should be closed if not iterated completely.
UPDATE: If the cluster is not configured correctly and the multicasts can't be received from the nodes you won't be able to resolve the problems programatically.
The "ORMObject.refresh()" marks the cached shared state as invalid. Next access to the object reloads the state from the database. This impacts the performance and increase the database server load.
BUT:
The "refresh()" method does not reload the PO instance state if it already assigned to the current transaction.
Would be best to investigate and fix the server communication issues.
Other possibility is that it isn't a communication problem (multicast between node in the cluster i assume), but that there are simply two request trying to update the basket at the same time. Example two ajax request to update something on the basket.
I would avoid trying to "fix" the orm, it would only cause more harm than good. Rather investigate further and post back more information.
I have a replicated cluster composed by several nodes (up to 30) on which there is a single JAVA process accessing to the coherence cache and I use the map.invoke(key, agent) method for both creation and update of agents. The creation and the update are performed setting the value in the process method.
Example (agent is instance of a ConcreteEntryProcessor implementing EntryProcessor interface):
map.invoke(key, agent);
Which invoke the following code of agent object:
public Object process(Entry entry) {
if (entry.isPresent())
{
//UPDATE
... some stuff which compute the new entry value...
entry.setValue(newValue, true);
return newValue
}
else
{
//CREATION
..other stuff to determine the value...
entry.setValue(value, true);
return value;
}
}
I noticed that if the update is made by the node that created the agent I have good performances, otherwise I have a performance decrease if the update is made from a different node. It seems that there is a kind of ownership of data.
Is there a way to force the execution of an agent on the local node or change the ownership of data?
It all depends on cache configuration. If you use distributed (partitioned) cache, then indeed there is some kind of data ownership. In that case entry processor is invoked on a node that owns given key.
And according to your performance issues, I see two possibilities:
Performance of map.invoke(key, agent) decreases, but performance of EntryProcessor.process(entry) is stable. In that case your performance issues are probably caused by serialization and network traffic needed to send back the result of processing to the node that called map.invoke(key, agent). If you don't need this value on that node, then simply return null from your entry processor.
Performance of EntryProcessor.process(entry) decreases. In that case mayby your create/update logic need some data from the node that called map.invoke(key, agent). So it is again serialization/network traffic issue, but without knowing the details of your particular logic it is hard to find a solution to your issue.