Suppose I have a Datastore kind with two properties listed below and an extremely high insert rate overall (but low insertion rate for individual values of random_key):
random_key - a uniformly distributed large number
time - a monotonically increasing timestamp indicating the insertion time of an entity
I'm primarily concerned with queries on the composite index (random_key ASC, time DESC) and I don't care about queries on just the time field.
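For concreteness, the writes and the query look roughly like this (Python Datastore client; the kind name and the values are just illustrative):

    import random
    import time

    from google.cloud import datastore

    client = datastore.Client()

    # Write: random_key is uniformly distributed, time is the insertion timestamp.
    entity = datastore.Entity(key=client.key("Event"))
    entity.update({
        "random_key": random.getrandbits(63),
        "time": time.time(),
    })
    client.put(entity)

    # Read: equality on random_key, newest first -> needs (random_key ASC, time DESC).
    query = client.query(kind="Event")
    query.add_filter("random_key", "=", 1234567890)
    query.order = ["-time"]
    results = list(query.fetch(limit=100))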
Problem: According to the Datastore documentation, creating this composite index requires that I not exclude the random_key and time fields from auto-indexing. And according to the best practices, indexing on time will lead to the hotspotting issue, since it is monotonically increasing.
Other questions such as Google datastore - index a date created field without having a hotspot recommend prepending a random value to the timestamp to shard the data. But I'd like to try a cleaner approach that uses the more meaningful value already in the separate random_key property.
Question:
What are my options for maintaining the composite index on both fields without having any of the issues related to the auto-index on time alone?
Excluding/ignoring the hot-spotting issue of auto-indexing on time alone doesn't really change or improve things for the composite index: you still have the problem of updating an index (a composite one, but that makes no real difference) with a monotonically increasing property value, so it is still subject to the hot-spotting issue.
That's because the fundamental root cause of the hot-spotting issue, graphically illustrated in App Engine datastore tip: monotonically increasing values are bad, is the number of worker threads that the index-update workload can be distributed to:
with monotonically changing property values, consecutive index update requests tend to keep hitting the same worker thread, which can only perform them in a serialized manner - the hotspot
with random/uniformly distributed property values, consecutive index update requests can be statistically distributed to multiple workers to be executed in parallel. This is really what sharding does for monotonically changing properties as well.
The answer to the question you referenced applies equally well in the composite index case: you can use sharding for time if you have an update rate above the mentioned tipping point of 500 writes/sec.
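As a rough sketch of what that sharding could look like with the Python client (the shard count, kind, and property names are assumptions, and a separate shard property is used rather than prefixing the timestamp itself):

    import random
    import time

    from google.cloud import datastore

    NUM_SHARDS = 10  # assumption: enough shards to stay under ~500 writes/sec each

    client = datastore.Client()

    def insert(random_key):
        # The extra shard property spreads the index writes across NUM_SHARDS ranges.
        entity = datastore.Entity(key=client.key("Event"))
        entity.update({
            "random_key": random_key,
            "time": time.time(),
            "shard": random.randint(0, NUM_SHARDS - 1),
        })
        client.put(entity)

    def latest(random_key, limit=100):
        # One query per shard (requires a composite index on
        # (random_key ASC, shard ASC, time DESC)), merged and re-sorted client-side.
        merged = []
        for shard in range(NUM_SHARDS):
            q = client.query(kind="Event")
            q.add_filter("random_key", "=", random_key)
            q.add_filter("shard", "=", shard)
            q.order = ["-time"]
            merged.extend(q.fetch(limit=limit))
        merged.sort(key=lambda e: e["time"], reverse=True)
        return merged[:limit]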
But sharding complicates your app: you'd need multiple queries and client-side merging of the results. If your random_key is indeed more meaningful, you might find it more attractive instead to:
keep time unindexed (thus avoiding hot-spotting altogether)
only query by random_key (which doesn't require a composite index) and simply handle the time filtering via client-side processing (which might be less processing than combining results from sharded queries) - see the sketch below.
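A minimal sketch of that alternative, assuming the Python client and an illustrative kind name - time is excluded from indexing (so neither a built-in nor a composite index is maintained on it) and the time filtering happens in application code:

    import time

    from google.cloud import datastore

    client = datastore.Client()

    # Write: time is excluded from indexing, so no hot-spotting-prone index exists for it.
    entity = datastore.Entity(key=client.key("Event"), exclude_from_indexes=("time",))
    entity.update({
        "random_key": 1234567890,
        "time": time.time(),
    })
    client.put(entity)

    # Read: query only by random_key (built-in single-property index is enough),
    # then filter and sort by time client-side.
    query = client.query(kind="Event")
    query.add_filter("random_key", "=", 1234567890)
    one_hour_ago = time.time() - 3600
    entities = [e for e in query.fetch() if e["time"] >= one_hour_ago]
    entities.sort(key=lambda e: e["time"], reverse=True)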
Related
I'm building an app that will have to accept a surge of records in short bursts, like 100,000 in 5 minutes once every 24 hours. I've chosen Firestore in Datastore mode as the DB. I've figured out how to make keys lexically distinct so the key space is wide and I won't bottleneck on that. I can pre-populate the appropriate kind with dummy entities to ensure it can accept high insert traffic on the first day. The only remaining problem is that all records from a single surge need to have a property, say event_id, and it's going to have the same value across all of them. I need to filter by that property later on (only an equality filter), so it has to be indexed.
My concern is that this will cause hotspots in the index, but I'm not 100% sure. The docs mostly mention monotonically increasing values or narrow ranges to be a problem, not single values. Strictly speaking, however, mine is a case of a severely narrow range.
I was thinking of using a hierarchical structure, like Event/event-id/Records/my_entities_go_here, but I'm not sure whether creating a new Event and submitting entities to (an initially empty) Event/event-id/Records isn't the same as writing to an empty kind, which is slow in the beginning.
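For reference, the two layouts I'm weighing look roughly like this (Python Datastore client; kind and property names are just illustrative):

    from google.cloud import datastore

    client = datastore.Client()
    event_id = "surge-2021-06-01"  # the same value for every record in one surge

    # Flat layout: event_id is an indexed property shared by ~100,000 entities.
    record = datastore.Entity(key=client.key("Record"))
    record.update({"event_id": event_id, "payload": "..."})
    client.put(record)

    # Hierarchical layout: Event/<event_id> is the ancestor of every record.
    parent = client.key("Event", event_id)
    child = datastore.Entity(key=client.key("Record", parent=parent))
    child.update({"payload": "..."})
    client.put(child)

    # Later: equality filter on event_id (flat layout) ...
    q = client.query(kind="Record")
    q.add_filter("event_id", "=", event_id)

    # ... or an ancestor query (hierarchical layout).
    q2 = client.query(kind="Record", ancestor=parent)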
Does anyone know a way around this?
I saw in the Firestore documentation that it is a bad idea to index monotonically increasing values, as it will increase latency. In my app I want to query posts based on Unix time (a double), which is a number that will increase as time moves on - but in my case not perfectly monotonically, because people will not be posting every second. In addition, I don't think my app will exceed 4 million users. Does anyone with expertise think this will be a problem for me?
It should be no problem. Just make sure to store it as a number and not as a String. Otherwise the sorting would not work as expected.
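A minimal sketch (Python Firestore client; the collection and field names are just examples):

    import time

    from google.cloud import firestore

    db = firestore.Client()

    # Store the post time as a number (epoch seconds), not a string,
    # so ordering and range filters behave numerically.
    db.collection("posts").add({
        "text": "hello",
        "created_at": time.time(),
    })

    # Newest posts first - sorts correctly because created_at is a number.
    recent = (
        db.collection("posts")
        .order_by("created_at", direction=firestore.Query.DESCENDING)
        .limit(10)
        .stream()
    )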
This is exactly the problem that the Firestore documentation is warning you about. Your database code will incur a cost of "hotspotting" on the index for the timestamp at scale. Specifically, from that linked documentation:
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
The numbers don't have to be purely monotonic. The hotspotting happens on ranges that are used for sharding the index. The documentation just doesn't tell you what to expect for those ranges, as they can change over time as the index gains more documents.
Also from the documentation:
If you index a field that increases or decreases sequentially between documents in a collection, like a timestamp, then the maximum write rate to the collection is 500 writes per second. If you don't query based on the field with sequential values, you can exempt the field from indexing to bypass this limit.
In an IoT use case with a high write rate, for example, a collection containing documents with a timestamp field might approach the 500 writes per second limit.
If you don't have a situation where new documents are being added rapidly, it's not a near-term problem. But you should be aware that writes just don't scale up the way reads and queries against that index will. Note that the number of concurrent users is not the issue at all - it's the number of documents being added per second to an index shard, regardless of how many people are causing the behavior.
If each item in my database can only be in one of two states (pending or appended), is it efficient to designate these two states as partition keys? Or is it effective to index this state value?
It would be more effective to use a sparse index. In your case, you might add an attribute called isPending. You can add this attribute to items that are pending, and remove it once they are appended. If you create a GSI with tid as the hash key and isPending as the sort key, then only items that are pending will be in the GSI.
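A rough sketch with boto3 (the table, index, key, and attribute names are assumptions):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    # Assumes a table keyed on "id" with a GSI "tid-isPending-index" on (tid, isPending).
    table = dynamodb.Table("Records")

    # New item: include isPending so it shows up in the sparse GSI.
    table.put_item(Item={"id": "r-1", "tid": "tenant-42", "isPending": "1", "data": "..."})

    # Once the item is appended, remove the attribute; it drops out of the GSI.
    table.update_item(Key={"id": "r-1"}, UpdateExpression="REMOVE isPending")

    # Query only the pending items for a given tid via the sparse index.
    pending = table.query(
        IndexName="tid-isPending-index",
        KeyConditionExpression=Key("tid").eq("tenant-42"),
    )["Items"]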
It will depend on how you will search for these records!
For example, if you always search by record ID, it doesn't matter. But if you regularly search for the set of pending or appended records, you should think about using partitions.
You could also look at this best practices guide from AWS: https://docs.aws.amazon.com/en_us/amazondynamodb/latest/developerguide/best-practices.html
Update:
This section of the best practices guide recommends the following:
Keep related data together. Research on routing-table optimization 20 years ago found that "locality of reference" was the single most important factor in speeding up response time: keeping related data together in one place. This is equally true in NoSQL systems today, where keeping related data in close proximity has a major impact on cost and performance. Instead of distributing related data items across multiple tables, you should keep related items in your NoSQL system as close together as possible.

As a general rule, you should maintain as few tables as possible in a DynamoDB application. As emphasized earlier, most well designed applications require only one table, unless there is a specific reason for using multiple tables.

Exceptions are cases where high-volume time series data are involved, or datasets that have very different access patterns - but these are exceptions. A single table with inverted indexes can usually enable simple queries to create and retrieve the complex hierarchical data structures required by your application.

Use sort order. Related items can be grouped together and queried efficiently if their key design causes them to sort together. This is an important NoSQL design strategy.

Distribute queries. It is also important that a high volume of queries not be focused on one part of the database, where they can exceed I/O capacity. Instead, you should design data keys to distribute traffic evenly across partitions as much as possible, avoiding "hot spots."

Use global secondary indexes. By creating specific global secondary indexes, you can enable different queries than your main table can support, and that are still fast and relatively inexpensive.
I hope I could help you!
If you don't project all the attributes, then when you query by that index you can, by definition, only get the key attributes back from the result, and you then have to perform another query using the hash key to get all the other data.
This would be 2 queries just to get 1 item. Naturally that doesn't make a lot of sense, so there must be some reason why it's advantageous to project only the key attributes (KEYS_ONLY).
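To make that concrete, the two round trips look something like this (boto3; the table, index, and key names are just placeholders):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("Orders")  # assumes a KEYS_ONLY GSI "status-index" on "status"

    # 1st query: the GSI returns only the index keys plus the table's primary key.
    keys_only = table.query(
        IndexName="status-index",
        KeyConditionExpression=Key("status").eq("OPEN"),
    )["Items"]

    # 2nd round trip (a GetItem per result): fetch the full item from the base table.
    full_items = [
        table.get_item(Key={"order_id": item["order_id"]})["Item"]
        for item in keys_only
    ]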
Does it speed up replication to the GSI, since there are fewer values to copy, thereby increasing the chance of a fully consistent read taking place?
Does it lower read/write costs to the table as a whole?
My ASP.NET application uses sequences to generate table primary keys. DB administrators have set the cache size to 20. The application is now under test, and a few records are added daily (say 4 for each user test session).
I've found that new test-session records always use a new cache portion, as if the previous day's cached numbers had expired, losing tens of keys every day. I'd like to understand whether it's due to some mistake I might have made in my application (disposing of TableAdapters or whatever) or whether it's the usual behaviour. Are there programming best practices to take into account when handling Oracle sequences?
Since the application will not have to bear a heavy workload (say 20-40 new records a day), I was thinking it might make sense to set a smaller cache size, or none at all.
Does resizing the sequence cache imply resetting the current value?
Thank you in advance for any hints.
The answer from Justin Cave in this thread might be interesting for you:
http://forums.oracle.com/forums/thread.jspa?threadID=640623
In a nutshell: if the sequence is not accessed frequently enough but you have a lot of "traffic" in the library cache, then the sequence might be aged out and removed from the cache. In that case the pre-allocated values are lost.
If that happens very frequently to you, it seems that your sequence is not used very often.
I guess that reducing the cache size (or completely disabling it) will not have a noticeable impact on performance in your case (especially taking your statement of 20-40 new records a day into account).
Oracle sequences are not gap-free. Reducing the cache size will reduce the gaps... but you will still have gaps.
The sequence is not associated with the table by the database, but by your code (via the nextval call on the insert, whether through a trigger, plain SQL, or a package API). On that note, you may use the same sequence across multiple tables (it is not like SQL Server's identity, which is tied to a column/table),
thus changing the sequence will have no impact on the indexes.
You would just need to make sure that if you drop the sequence and recreate it, you 'reseed' it to the current value + 1 (e.g. create sequence seqer start with 125 nocache;). But note:
If your application requires a gap-free set of numbers, then you cannot use Oracle sequences. You must serialize activities in the database using your own developed code.
But be forewarned: you may increase disk I/O and possibly transaction locking if you choose not to use sequences.
The sequence generator is useful in multiuser environments for generating unique numbers without the overhead of disk I/O or transaction locking.
To reiterate a_horse_with_no_name's comment: what is the issue with gaps in the ID?
Edit
Also have a look at the caching logic you should use, described here:
http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/views002.htm#i1007824
If you are using the sequence for PKs and not to enforce some application logic then you shouldn't worry about gaps. However, if there is some application logic tied to sequential sequence values, you will have holes if you use sequence caching and do not have a busy system. Sequence cache values can be aged out of the library cache.
You say that your system is not very busy; in that case, alter your sequence to NOCACHE. You are in a position to take a negligible performance hit to fix a logic issue, so you might as well.
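For example, with python-oracledb (connection details and the sequence name are placeholders):

    import oracledb

    with oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1") as conn:
        with conn.cursor() as cur:
            # Disable caching so pre-allocated values can no longer be lost when the
            # sequence ages out of the library cache; this does not reset its current value.
            cur.execute("ALTER SEQUENCE orders_seq NOCACHE")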
As people have mentioned, gaps shouldn't be a problem, so if you require no gaps you are doing something wrong (but I don't think this is what you want).
Reducing the cache should reduce the number of gaps and decrease the performance of the sequence, especially with concurrent access to it (which shouldn't be a problem in your use case).
Changing the sequence using the alter sequence statement (http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_2011.htm) should not reset the current/next val of the sequence.