I found myself in a situation where I want to perform several operations on the database that should be handled in a single transaction. One of those operations inserts more than 500 documents, so it throws an error because it hits
maximum 500 writes allowed per request
In order to work around that, you could use batched writes, but I can't figure out how to do batched writes as part of a transaction. It seems like transaction.commit() is not a thing and in the docs transactions and batched writes appear to be two separate concepts.
Generally speaking, we use transactions to keep data consistent. The recommendation you got:
you could use batched writes
It is given for the exact same reason. Unfortunately, you cannot mix them; you have to choose one or the other. Realistically speaking, both batches and transactions are used for atomic updates.
A transaction is similar to a batch, and as the docs state:
All of the operations succeed, or none of them are applied.
The main difference between a batched write and a transaction is that a batch just writes, while a transaction reads first and then writes.
So the solution in your case is to use Firestore batched writes to perform up to 500 operations at a time.
As you have most probably read in the doc:
The Transaction object passed to a transaction's updateFunction
provides the methods to read and write data within the transaction
context.
and this object, in the Client SDKs, has only four methods: get(), set(), update() and delete(), which all take a single DocumentReference as a parameter.
With the Node.js Server SDK for Google Cloud Firestore, you will note that there is an additional method, getAll(), which "retrieves multiple documents from Firestore. Holds a pessimistic lock on all returned documents".
So, at the time of writing, there is no way to "mix" a Transaction and a Batched Write.
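For illustration, here is a hedged sketch of that workaround with the Firestore Node.js Server SDK: the writes are chunked into batched writes of at most 500 operations each. Keep in mind that each batch commits atomically on its own, but the batches together are not one transaction. The collection name and payload are made up.

```js
const {Firestore} = require('@google-cloud/firestore');

const db = new Firestore();

// Split an arbitrary number of writes into batches of at most 500 operations.
// Each batch.commit() is atomic by itself; the loop as a whole is not.
async function writeInChunks(documents) {
  const MAX_OPS_PER_BATCH = 500;
  for (let i = 0; i < documents.length; i += MAX_OPS_PER_BATCH) {
    const batch = db.batch();
    for (const data of documents.slice(i, i + MAX_OPS_PER_BATCH)) {
      batch.set(db.collection('items').doc(), data); // doc() generates an auto ID
    }
    await batch.commit(); // applies this chunk of up to 500 writes atomically
  }
}
```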
For example, if I were building an airline booking system, all of my seats would be individual documents in a Cosmos container with a PartitionKey of FlightNumber_DepartureDateTime, e.g. UAT123_20220605T1100Z, and an id of SeatNumber, e.g. 12A.
A request comes in to allocate a single seat (any seat without a preference).
I want to be able to query the Cosmos container for seats where allocated: false and allocate the first one to the request by setting allocated: true and allocatedTo: ticketReference. But I need to do this in a thread-safe way so that no two requests get the same seat.
Does Cosmos DB (SQL API) have a standard pattern to solve this problem?
The solution I thought of was to query a document and then update it while checking its ETag; if another thread got in first, the update would fail. If it fails, query another document and keep trying until I can successfully update one to claim the seat for this thread.
Is there a better way?
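For reference, here is a hedged sketch of the ETag-based approach described in the question, using the @azure/cosmos Node.js SDK. The container, field names and partition key layout are assumptions taken from the description above, not actual code from the system:

```js
const {CosmosClient} = require('@azure/cosmos');

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING);
const container = client.database('bookings').container('seats');

// Try to claim any free seat on a flight using optimistic concurrency:
// the replace only succeeds if the document's ETag is unchanged.
async function allocateAnySeat(flightKey, ticketReference) {
  const {resources: freeSeats} = await container.items
    .query('SELECT * FROM c WHERE c.allocated = false', {partitionKey: flightKey})
    .fetchAll();

  for (const seat of freeSeats) {
    try {
      await container.item(seat.id, flightKey).replace(
        {...seat, allocated: true, allocatedTo: ticketReference},
        {accessCondition: {type: 'IfMatch', condition: seat._etag}}
      );
      return seat.id; // claimed successfully
    } catch (err) {
      if (err.code === 412) continue; // another request got this seat first, try the next one
      throw err;
    }
  }
  return null; // no free seat left (or re-query and retry)
}
```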
You could achieve this by using transactions. Cosmos DB allows you to write stored procedures that are executed in an atomic transaction, basically serializing concurrent seat reservation operations for you within a logical partition.
Quote from "Benefits of using server-side programming" in the link above:
Atomic transactions: Azure Cosmos DB database operations that are
performed within a single stored procedure or a trigger are atomic.
This atomic functionality lets an application combine related
operations into a single batch, so that either all of the operations
succeed or none of them succeed.
Bear in mind though that transactions come with a cost: they limit the scalability of those operations. However, in your scenario, where you partition data per flight and those operations are very fast, this might be the preferable and most reliable option.
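To make that concrete, here is a hedged sketch of what such a stored procedure could look like. Cosmos DB stored procedures are plain JavaScript executed inside the database, scoped to one logical partition (here, one flight); the query, field names and error handling are illustrative assumptions:

```js
// Registered as a stored procedure and executed with the flight's partition key,
// so it only ever sees that flight's seats and runs as one atomic transaction.
function allocateSeat(ticketReference) {
  const collection = getContext().getCollection();
  const response = getContext().getResponse();

  const accepted = collection.queryDocuments(
    collection.getSelfLink(),
    'SELECT TOP 1 * FROM c WHERE c.allocated = false',
    function (err, seats) {
      if (err) throw err;
      if (!seats || seats.length === 0) {
        response.setBody(null); // flight is full
        return;
      }
      const seat = seats[0];
      seat.allocated = true;
      seat.allocatedTo = ticketReference;
      // The replace runs in the same transaction as the query; if anything
      // throws, the whole stored procedure is rolled back.
      const ok = collection.replaceDocument(seat._self, seat, function (replaceErr) {
        if (replaceErr) throw replaceErr;
        response.setBody(seat.id);
      });
      if (!ok) throw new Error('replaceDocument was not accepted');
    }
  );
  if (!accepted) throw new Error('queryDocuments was not accepted');
}
```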
I have done something similar with Service Bus queues: you queue the bookings to be saved, so you can run the availability logic before you save each booking, guaranteeing no overbookings.
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-queues-topics-subscriptions
From the following document: https://cloud.google.com/datastore/docs/concepts/transactions
What would happen if a transaction fails with no explicit rollback defined? For example, if we're performing a put() operation on value arguments.
The document states that a transaction should be idempotent; what does this mean with respect to a put() operation? It is not clear how idempotency applies in this context.
How do we detect failure if, according to the documentation, the error returned by a commit is not reliable?
We are seeing some symptoms where put() against a value argument sometimes partially saves the data. Note that we do not have an explicit rollback defined.
As you may already know, Datastore transactions are guaranteed to be atomic, which means they apply the all-or-nothing principle: either all operations succeed or they all fail. This ensures that the data in your database remains consistent over time.
Now, regardless of whether you execute put() or any other operation in your transaction, your code should always ensure that your transaction has either successfully committed or rolled back. This means that if you aren't fully sure whether the commit succeeded, you should explicitly issue a rollback.
However, there may be some exceptions where a commit might fail, and this doesn't necessarily mean that no data was written to your database. The documentation even points out that "you can receive errors in cases where transactions have been committed."
The simplest way to detect transaction failures is to add a try/catch block in your code for when an Exception (failed transactional operation) or DatastoreException (errors related to Datastore, such as a failed commit) is thrown. I believe you may already have an answer in this Stack Overflow post about this particular question.
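For instance, a hedged sketch with the Node.js @google-cloud/datastore client, wrapping the commit in a try/catch and issuing an explicit rollback when the outcome is uncertain (the entity kind and fields are made up):

```js
const {Datastore} = require('@google-cloud/datastore');

const datastore = new Datastore();

async function transferExample() {
  const transaction = datastore.transaction();
  try {
    await transaction.run();                       // start the transaction
    const key = datastore.key(['Account', 'alice']);
    const [account] = await transaction.get(key);  // read inside the transaction
    account.balance -= 20;
    transaction.save({key, data: account});        // queue the write
    await transaction.commit();                    // all-or-nothing apply
  } catch (err) {
    // We can't be sure whether the commit went through, so roll back explicitly.
    await transaction.rollback().catch(() => {});  // ignore errors from the rollback itself
    throw err;                                     // surface the failure to the caller / retry logic
  }
}
```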
A good practice is to make your transactions idempotent whenever possible. In other words, if you're executing a transaction that includes a write operation such as put() and the transaction fails and needs to be retried, the end result should remain the same.
A real-world example: you're trying to transfer some money to your friend; the transaction consists of withdrawing 20 USD from your bank account and depositing this same amount into your friend's bank account. If the transaction fails and has to be retried, the retry should still move exactly that same 20 USD rather than applying the transfer twice.
Keep in mind that the Datastore API doesn't retry transactions by default, but you can add your own retry logic to your code, as per the documentation.
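A minimal retry wrapper along those lines might look like the sketch below; it reuses the hypothetical transferExample() from the previous sketch and a simple exponential backoff, both of which are assumptions rather than anything prescribed by the documentation:

```js
// Retry an idempotent transactional function a few times with exponential backoff.
async function withRetries(fn, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // Wait 100 ms, 200 ms, 400 ms, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** (attempt - 1)));
    }
  }
}

// Usage: await withRetries(() => transferExample());
```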
In summary, if a transaction is interrupted and your logic doesn't handle the failure accordingly, you may eventually see inconsistencies in the data of your database.
In Google Cloud Dataflow (streaming pipeline), your data "bundles" can be re-executed because of failure or speculative execution. Is there any way of knowing that the current bundle/element is a re-execution?
This would be very useful to provide conditional behavior for side-effects (in our case: to help make a datastore update operation (read/write) idempotent).
I don't believe this is something that is offered through the Beam API, but you can avoid the need to know this information through the following mechanisms.
If writes to the external datastore are idempotent, simply introduce a fusion break by adding a Reshuffle transform before the write step. This will make sure that data to be written is not re-generated when there are failures.
If writes to the external datastore are not idempotent (for example, files or BigQuery), the usual mechanism is to combine (1) with writing to a temporary location first. When all (parallel) writes to the temporary location are finished, the results can be finalized in an idempotent and failure-safe way from a single worker.
Many Beam sinks utilize these mechanisms to write to external data stores in an idempotent manner. For streaming, usually, these operations are performed per window.
TL;DR
Is there any benefit to batched writes over transactions other than offline availability?
That is assuming that transactions are safer than batched writes.
Why transactions
I am thinking about migrating Cloud Functions code from batched writes to transactions. The initial idea was that it would be safer.
I was worried that the Cloud Function would not catch when a batched write failed, and it is crucial that all of the writes (changing many documents across different collections) are actually applied.
I read this answer and was glad to hear the following:
If you don't need or want to be able to write while offline just use a regular transaction: it will handle checking whether or not you're online and will fail if you're offline. You're not obligated to read anything in the transaction.
This basically says that the only point of batched writes is offline availability. This confirmed my initial thoughts; however, I also read the documentation.
Why batched writes
The "Transaction and batched writes" documentation points out that batched writes "have fewer failure cases than transactions".
Confusion
Now, I am slightly confused. The only valid concern regarding transaction failure for my case was the following:
The transaction read a document that was modified outside of the transaction. In this case, the transaction automatically runs again. The transaction is retried a finite number of times.
As the write operations used by my Cloud Functions are exclusively update operations, I am wondering if this even applies to my case, i.e. if the batched write would ever encounter such an issue. The way I imagine it, the batched write just updates, say, a thousand documents, and all of these updates only modify a single field.
When the transaction does the same, when would this failure ever occur?
And since I am using Cloud Functions, which I know is always online, would batched writes ever encounter any kind of issue?
+ Transactions: Safer, will not fail
+ Batched writes: Fewer failure cases, will not fail
Can you see what my confusion is from this?
Regarding simplicity of code: I think batched writes look more concise; however, it is actually more code in my case. Hence, it does not matter to me.
Transactions are not really comparable to batched writes; you should pick the tool that suits the job at hand. The only thing they have in common is that when the operation completes, all the documents are written at the exact same moment (they are atomic). They can also both fail due to a security rule violation, like any other operation coming from a client app.
Here's how you decide:
Batch writes
Work offline, will be synchronized later
Don't pay attention to any current document values, so they can't safely modify a field based on its existing value
Don't require any document reads to commit
Don't fail due to contention; each write simply clobbers prior writes on the same document
Transactions
Don't work at all offline
Pay attention to current document values, so they can be used to safely modify a document based on existing field values (see the sketch after these lists)
Require a document read in order to write updates
Can fail due to too much contention on documents in the transaction
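As a rough sketch of the difference in code (Firebase Admin SDK, as you would use it in a Cloud Function; collection and field names are made up):

```js
const admin = require('firebase-admin');
admin.initializeApp();
const db = admin.firestore();

// Batched write: updates up to 500 documents atomically without ever
// looking at their current contents.
async function resetCounters(docIds) {
  const batch = db.batch();
  for (const id of docIds) {
    batch.update(db.collection('counters').doc(id), {value: 0});
  }
  await batch.commit(); // all updates apply together, or none do
}

// Transaction: reads the current value first, then writes based on it.
async function incrementCounter(docId) {
  await db.runTransaction(async (tx) => {
    const ref = db.collection('counters').doc(docId);
    const snap = await tx.get(ref);               // read before any writes
    tx.update(ref, {value: snap.data().value + 1});
  }); // automatically retried if the document changed concurrently
}
```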
I'm curious about the behavior of the locks that are performed when doing server side transactions on Cloud Firestore as mentioned in this video: https://www.youtube.com/watch?time_continue=750&v=dOVSr0OsAoU
My transaction will be reading multiple documents and placing locks on them. My question is do these locks restrict all access to the documents - including concurrent reads from client code that isn't part of a transaction? Or do they only restrict writes?
If they do restrict reads is there any way around this - it could lead to severe slowdown in the app I'm working on.
Also in the case that a transaction tries to lock documents that are already locked - what is the retry pattern - how often does it retry, and is there an exponential backoff?
Thanks!
My transaction will be reading multiple documents and placing locks on them.
A transaction first reads the value of a property within a document in order to perform the write operation, so it requires round-trip communication with the server to ensure that the code inside the transaction completes successfully.
My question is do these locks restrict all access to the documents - including concurrent reads from client code that isn't part of a transaction?
The answer is no, concurrent users can read the content of the document even if you perform a write operation using a transaction.
Also in the case that a transaction tries to lock documents that are already locked - what is the retry pattern - how often does it retry, and is there an exponential backoff?
According to the official documentation regarding Firestore transactions, a transaction can fail only in the following cases:
The transaction contains read operations after write operations. Read operations must always come before any write operations.
The transaction read a document that was modified outside of the transaction. In this case, the transaction automatically runs again. The transaction is retried a finite number of times.
The transaction exceeded the maximum request size of 10 MiB.
Transaction size depends on the sizes of documents and index entries modified by the transaction. For a delete operation, this includes the size of the target document and the sizes of the index entries deleted in response to the operation.
A failed transaction returns an error and does not write anything to the database. You do not need to roll back the transaction; Cloud Firestore does this automatically.
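To illustrate the first failure case listed above, here is a hedged sketch of a well-formed server-side transaction with the Firestore Node.js Server SDK: all reads happen first (getAll takes the pessimistic locks), and writes only come afterwards. The document paths and fields are made up:

```js
const {Firestore} = require('@google-cloud/firestore');

const db = new Firestore();

async function transferPoints(fromId, toId, amount) {
  await db.runTransaction(async (tx) => {
    const fromRef = db.doc(`players/${fromId}`);
    const toRef = db.doc(`players/${toId}`);

    // Reads first: getAll locks both documents for this transaction.
    const [fromSnap, toSnap] = await tx.getAll(fromRef, toRef);

    // Writes only after all reads; putting a read after these writes
    // would make the transaction fail.
    tx.update(fromRef, {points: fromSnap.data().points - amount});
    tx.update(toRef, {points: toSnap.data().points + amount});
  });
}
```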