Index overhead during updates when the property value is unchanged - google-cloud-datastore

If the majority of the indexed properties are not changed during an update to an entity, will there be any difference in performance compared to an update where the indexed properties have changed? I am trying to understand what kind of hotspotting can happen in an app that has relatively few inserts but a lot of updates, where the updates don't change the majority of the built-in indexed properties.

You shouldn't have any performance issues from updates that don't affect the index.
Hotspotting may happen if you have high read/write rates to a narrow key range.
In an update-intensive application, be careful not to update a single entity more than once per second, because that introduces higher latency.

Related

Surge-like inserts in GCP Firestore in Datastore Mode with one property value equal across all entities

I'm building an app that will have to accept a surge of records in short bursts, like 100 000 in 5 minutes once every 24h. I've chosen Firestore in Datastore Mode as the DB. I've figured out how to make keys lexically different so the key space is wide and I won't bottleneck on that. I can pre-feed the appropriate kind with dummy entities to ensure it can accept high insert traffic on the first day. The only remaining problem is that all records from a single surge need to have a property, say event_id, and it will have the same value across all of them. I need to filter by that property later on (equality filter only), so it has to be indexed.
My concern is that this will cause hotspots in the index, but I'm not 100% sure. The docs mostly mention monotonically increasing values or narrow ranges to be a problem, not single values. Strictly speaking, however, mine is a case of a severely narrow range.
I was thinking of using a hierarchical structure, like Event/event-id/Records/my_entities_go_here, but I'm not sure whether creating a new Event and submitting entities to (an initially empty) Event/event-id/Records is the same as writing to an empty Kind, which is slow in the beginning.
Does anyone know a way around this?

DDD and uniqueness constraint

How would one validate a unique constraint using DDD? Let's say that an Entity has a property name that must be unique across the system and there is a specific EntityRepository method nameExists(name): bool... This is what I found people suggest doing, because the repository is the abstraction of the collection of all the Entities and should be able to perform this check.
So before creating/adding the new Entity the command / domain service could check for the existence of a newName against the repository, but I think that this will not always work because of concurrency.
In a concurrent scenario where two transactions are started simultaneously, the EntityRepository's nameExists method might return false for both transactions, and as a result of this two entries with the same name will be incorrectly inserted.
I am sure that I am missing something basic, but the answers I found all point to the repository exists method - TBH others say that a UNIQUE constraint should be put on the DB to catch the concurrency case, but what if one uses Event Sourcing or a persistence layer that does not have unique constraints?
Follow-up question:
What if the uniqueness constraint is to be applied in different levels of a hierarchy?
A Container's name must be unique in the system and then Child names must be unique inside a Container.
Let's say that a transactional DB takes care of the uniqueness at the lowest possible level, what about the domain?
Should I still express the uniqueness logic at the domain level, e.g. with a Domain Service for the system-level uniqueness and embedding Child entities inside the Container entity and having a business rule (and therefore making Container the aggregate root)?
Or should I not bother with "replicating" the uniqueness in the domain and (given there are no other rules to apply between the two) split Container and Child? Will the domain lack expressiveness then?
I am sure that I am missing something basic
Not something basic.
The term we normally use for enforcing a constraint, like uniqueness, across a set of entities is set validation. Greg Young calls your attention to a specific question:
What is the business impact of having a failure
Most set constraints fall into one of two categories
constraints that need to be true when the system reaches steady state, but may not hold while work is in progress. In business processes, these are often handled by detecting conflicts in the stored data, and then invoking various mitigation processes to resolve the conflict.
constraints that need to be true always.
The first category includes things like double booking a seat on an airplane; it's not necessarily a problem unless both people show up, and even then you can handle it by bumping someone to another seat, or another flight.
In these cases, you make a best effort - you look at a recent copy of the set, make sure there are no conflicts there, then hope for the best (accepting that some percentage of the time, you'll have missed a change).
See Memories, Guesses and Apologies (Pat Helland, 2007).
The second category is the hard one; to ensure the invariant holds, you have to lock the entire set so that races don't allow two different writers to insert conflicting information.
Relational databases tend to be really good at set validation - putting the entire set into a single database is going to be the right answer (note the assumption that the set is small enough to fit into a single database -- trying to lock two databases at the same time is hard).
Another possibility is to ensure that only one writer can update the set at any given time -- you don't have to worry about losing a race when you are the only one running in it.
Sometimes you can lock a smaller set -- imagine, for example, having a collection of locks with numbers, and the hash code for the name tells you which lock you have to grab.
The simplest version of this is when you can use the name as the aggregate identifier itself.
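The "collection of locks with numbers" idea can be sketched as follows. This is a minimal in-process illustration, not a distributed lock: NUM_STRIPES, register_name and the in-memory buckets are hypothetical names standing in for whatever store actually holds the set.

```python
import threading
import zlib

NUM_STRIPES = 16  # assumption: size of the lock collection

# One lock and one bucket of names per stripe; the bucket stands in for
# the persisted part of the set that this lock guards.
_locks = [threading.Lock() for _ in range(NUM_STRIPES)]
_buckets = [set() for _ in range(NUM_STRIPES)]


def stripe_of(name: str) -> int:
    """Map a name to the number of the lock that must be held for it."""
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    return zlib.crc32(name.encode("utf-8")) % NUM_STRIPES


def register_name(name: str) -> bool:
    """Add name to the set if absent; return False if it is already taken."""
    i = stripe_of(name)
    with _locks[i]:
        # Two writers racing for the same name always contend on the same
        # lock, so the check and the insert cannot interleave.
        if name in _buckets[i]:
            return False
        _buckets[i].add(name)
        return True
```

Writers working on names in different stripes do not block each other, which is the point of locking a smaller set.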
if one uses Event Sourcing or a persistence layer that does not have unique constraints?
Sometimes, you introduce a persistent store dedicated to the set, just to ensure that you can maintain the invariant. See "microservices".
But if you can't change the database, and you can't use a database with the locking guarantees that you need, and the business absolutely has to have the set valid at all times... then you single thread that part of the work.
Everybody that wants to change a name puts a request into a queue, and the one thread responsible for managing the invariant certifies each and every change.
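A minimal sketch of that single-writer arrangement, assuming everything runs in one process. NameRegistrar and claim are hypothetical names, and the in-memory set stands in for the real store.

```python
import queue
import threading


class NameRegistrar:
    """Funnels every name change through one worker thread."""

    def __init__(self):
        self._requests = queue.Queue()
        self._names = set()  # only ever touched by the worker thread
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        # The worker certifies requests one at a time, in arrival order,
        # so no two requests can race on the uniqueness check.
        while True:
            name, reply = self._requests.get()
            accepted = name not in self._names
            if accepted:
                self._names.add(name)
            reply.put(accepted)

    def claim(self, name: str) -> bool:
        """Enqueue a request and wait for the worker's verdict."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((name, reply))
        return reply.get()
```

Callers on any thread use registrar.claim("some-name"); the trade-off is that throughput is capped by what the one worker can process.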
There's no magic; just hard work and trade offs.

Monotonically increasing fields in composite indexes

Suppose I have a Datastore kind with two properties listed below and an extremely high insert rate overall (but low insertion rate for individual values of random_key):
random_key - a uniformly distributed large number
time - a monotonically increasing timestamp indicating the insertion time of an entity
I'm primarily concerned with queries on the composite index (random_key ASC, time DESC) and I don't care about queries on just the time field.
Problem: According to the Datastore documentation, creating this composite index requires that I not exclude the random_key and time fields from auto-indexing. According to the best practices, indexing on time will lead to the hotspotting issue, as it is monotonically increasing.
Other questions, such as "Google datastore - index a date created field without having a hotspot", recommend prepending a random value to the timestamp to shard the data. But I'd like to have a clean approach that uses a more meaningful value in the separate property random_key.
Question:
What are my options for maintaining the composite index on both fields without having any of the issues related to the auto-index on time alone?
Excluding/ignoring the hot-spotting issue on auto-indexing on time alone doesn't really change/improve things for the composite index: you still have the problem of updating an index (a composite one, but that doesn't really make a difference) with a monotonically increasing property value, which is still subject to the hot-spotting issue.
That's because the fundamental root cause of the hot-spotting issue, graphically illustrated in "App Engine datastore tip: monotonically increasing values are bad", is the number of worker threads that the index update workload can be distributed to:
with monotonically changing property values, consecutive index update requests tend to keep hitting the same worker thread, which can only perform them in a serialized manner - the hotspot
with random/uniformly distributed property values, consecutive index update requests can be statistically distributed to multiple workers to be executed in parallel. This is really what sharding is doing for monotonically changing properties as well.
The answer to the question you referenced applies equally well in the composite index case: you can use sharding for time if you have an update rate above the mentioned tipping point of 500 writes/sec.
But sharding complicates your app: you'd need multiple queries and client-side merging of the results. If your random_key is indeed more meaningful you might find it more attractive instead to:
keep time unindexed (thus avoiding hot-spotting altogether)
only query by random_key (which doesn't require a composite index) and simply handle the time filtering via client side processing (which might be less processing than combining results from sharded queries).
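To make the two options concrete, here is a rough sketch that uses plain Python lists in place of real query results; it makes no Datastore calls. NUM_SHARDS, shard_value, merge_shard_results and filter_by_time are hypothetical names, and entities are modelled as dicts with an integer "time" field.

```python
import heapq

NUM_SHARDS = 4  # assumption: number of shard prefixes in use


def shard_value(shard, timestamp):
    """Indexed value for a sharded timestamp: shard prefix, then padded time."""
    return f"{shard:02d}-{timestamp:012d}"


def merge_shard_results(per_shard):
    """Sharding: one newest-first result list per shard, merged client-side."""
    # heapq.merge expects every input to already be sorted newest-first.
    return list(heapq.merge(*per_shard, key=lambda e: e["time"], reverse=True))


def filter_by_time(entities, since):
    """Alternative: query by random_key only, then filter and sort on time."""
    recent = [e for e in entities if e["time"] >= since]
    return sorted(recent, key=lambda e: e["time"], reverse=True)
```

With sharding, every read costs NUM_SHARDS queries plus the merge; with the unindexed time property, a read is one query by random_key plus an in-memory filter over that key's entities.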

Better performance to Query the DB or Cache small result sets?

Say I need to populate 4 or 5 dropdowns w/ items from a database. Each drop down will have < 15 items in it. These items almost never change.
Now I could query the DB each time the page is accessed or I could grab the values from a custom class that would check to see if they already exist in ASP.Net's cache and only if they don't query the DB to update the cache.
It's trivial for me to write, but I'm unsure whether the performance would be better or not. I think it would be (although not likely by anything huge).
What do you think?
When dealing with performance issues you should always:
Do things the simplest way first (avoid premature optimisation)
Performance test your code with set performance goals (e.g. 200ms response time under a load of N concurrent users)
Then, IF your code doesn't perform then profile your code to determine what is slow, and profile your proposed performance fixes to accurately measure what the real-world performance change will be.
Having said that then yes, what you are suggesting seems sensible (you would usually expect an in-memory cache to be quicker than a database), however it also depends on what data is being returned, what the memory load of your application is, how expensive the query is, what the query parameters are etc...
You should performance test your changes before and after to determine the actual effect of your changes (including things like memory load), and you should only really be doing things like this once you have identified that these dropdowns are the cause of an unacceptable performance problem.
That's what System.Web.Helpers.WebCache class exists for.
IO is usually more expensive than memory operations (by orders of magnitude). If your database is on another machine, you would also be using network resources, so serving the values from the cache will almost certainly be faster.
But indeed, optimize in the end when you have really identified it as a performance bottleneck by measuring.
Quick answer to your question:
Use the built in .Net cache.
Additional points to ponder over..
Preferably, retrieve all master data in a single database retrieval (think stored procedure and dataset), though I do not advocate the use of stored procs in all scenarios.
As you rightly said, ensure that your data access layer checks the cache before making a round trip to the database
Also, as your drop down values do not change very often; do remember to keep a long expiry duration
Finally, based on your page design you could also look at Fragment Caching (partial page caching: user controls) which could give you bigger benefits since now you neither access the data cache nor the database.
Performance:
Again, the performance depends more on the application's load than on your direct round trips for fetching the master data. Put simply, as Thomas suggested, use the cache class!
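The check-the-cache-first pattern with a long expiry, as described in the answers above, can be sketched in a language-neutral way like this. It is an illustrative Python sketch, not the ASP.NET cache API: ExpiringCache and get_or_load are hypothetical names, and loader stands in for the database query.

```python
import time


class ExpiringCache:
    """Tiny cache-aside helper: serve from memory, reload after expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}   # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, calling loader() only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()  # the only place a database round trip happens
        self._entries[key] = (now + self._ttl, value)
        return value
```

For dropdown values that almost never change, the TTL can be hours; the cost is that an edit in the database may not show up until the entry expires.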

Oracle sequence cache aging too often

My ASP.NET application uses some sequences to generate table primary keys. The DB administrators have set the cache size to 20. The application is now under test, and a few records are added daily (say 4 for each user test session).
I've found that new test session records always use a new cache portion, as if the previous day's cached numbers had expired, losing tens of keys every day. I'd like to understand whether this is due to some mistake I might have made in my application (disposing of TableAdapters or whatever) or whether it's the usual behaviour. Are there programming best practices to take into account when handling Oracle sequences?
Since the application will not have to bear a heavy workload (say 20-40 new records a day), I was thinking it might make sense to set a smaller cache size, or none at all.
Does resizing the sequence cache imply a reset of the current index?
Thank you in advance for any hint.
The answer from Justin Cave in this thread might be interesting for you:
http://forums.oracle.com/forums/thread.jspa?threadID=640623
In a nutshell: if the sequence is not accessed frequently enough but you have a lot of "traffic" in the library cache, then the sequence might be aged out and removed from the cache. In that case the pre-allocated values are lost.
If that happens very frequently to you, it seems that your sequence is not used very often.
I guess that reducing the cache size (or completely disabling it) will not have a noticeable impact on performance in your case (also taking into account your statement of 20-40 new records a day).
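The arithmetic behind the lost keys can be illustrated with a toy model. This is not how Oracle implements sequences internally; ToySequence, age_out and simulate are hypothetical names, and the model only captures the idea that a block of cache_size values is reserved at once and the unused remainder is discarded when the sequence is aged out. It ignores other sources of gaps, such as rolled-back transactions.

```python
class ToySequence:
    """Toy model of a cached sequence; cache_size=1 behaves like no cache."""

    def __init__(self, cache_size):
        self.cache_size = max(1, cache_size)
        self._reserved_up_to = 0  # highest value already reserved as a block
        self._next = 1

    def nextval(self):
        if self._next > self._reserved_up_to:
            # Cache is empty: reserve a fresh block of values.
            self._reserved_up_to += self.cache_size
        value = self._next
        self._next += 1
        return value

    def age_out(self):
        """Model the sequence being aged out: unused cached values are lost."""
        self._next = self._reserved_up_to + 1


def simulate(cache_size, days, per_day):
    """Return the keys handed out each day, with an age-out between days."""
    seq = ToySequence(cache_size)
    history = []
    for _ in range(days):
        history.append([seq.nextval() for _ in range(per_day)])
        seq.age_out()
    return history
```

With the numbers from the question, simulate(20, 3, 4) gives [[1, 2, 3, 4], [21, 22, 23, 24], [41, 42, 43, 44]], so 16 values are skipped per day, while simulate(1, 2, 4) gives [[1, 2, 3, 4], [5, 6, 7, 8]].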
Oracle Sequences are not gap-free. Reducing the Cache size will reduce the gaps... but you will still have gaps.
The sequence is not associated to the table by the database, but by your code (via the nextval on the insert via trigger/sql/pkg api) -- on that note you may use the same sequence over multiple tables (it is not like sql server's identity where it is associated to the column/ table)
thus changing the sequence will have no impact on the indexes.
You would just need to make sure if you drop the sequence and restart it, you 'reseed' to the +1 of the current value (e.g. create sequence seqer start with 125 nocache;), but
If your application requires a gap-free set of numbers, then you cannot use Oracle sequences. You must serialize activities in the database using your own developed code.
but be forewarned, you may increase disk IO and possible transaction locking if you choose not to use sequences.
The sequence generator is useful in multiuser environments for generating unique numbers without the overhead of disk I/O or transaction locking.
to reiterate a_horse_with_no_name's comments, what is the issue with gaps in the id?
Edit
also have a look at the caching logic you should use located here:
http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/views002.htm#i1007824
If you are using the sequence for PKs and not to enforce some application logic then you shouldn't worry about gaps. However, if there is some application logic tied to sequential sequence values, you will have holes if you use sequence caching and do not have a busy system. Sequence cache values can be aged out of the library cache.
You say that your system is not very busy, in this case alter your sequence to no cache. You are in a position of taking a negligible performance hit to fix a logic issue so you might as well.
As people mentioned: Gaps shouldn't be a problem, so if you are requiring no gaps you are doing something wrong. (But I don't think this is what you want).
Reducing the cache should reduce the number of gaps, but it also decreases the performance of the sequence, especially with concurrent access to it (which shouldn't be a problem in your use case).
Changing the sequence using the alter sequence statement (http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_2011.htm) should not reset the current/next val of the sequence.
