new count() aggregation function performance - firebase

Firestore introduced the new aggregation query count() which is pretty useful for me. I'm trying to understand the speed and cost of it.
The documentation mentions performance:
Performance depends on your index configuration and on the size of the
dataset.
and
Most queries scale based on the on the size of the result set,
not the dataset. However, aggregation queries scale based on the size
of the dataset and the number of index entries scanned.
and also the pricing:
you are charged one document read for each batch of up to 1000 index
entries matched by the query.
Now imagine I have 2 collections, the first one with 100'000 documents, and the second one with 1'000'000 documents. I run a someQuery.count() on both collections and both return 1500. Here are some questions:
Will both queries be charged the same (2 document reads)?
Will the second query take longer than the first query, since the second collection has more documents? If yes - how much longer (linear, log, etc.)?
What does the documentation mean by performance depends on your index configuration? Can I configure the collections and their indexes to make count work faster?
Would be great to get some answers, I'm relying on count heavily (I know, it's still in preview).

The charge for COUNT() queries is based on the number of index entries that are matched by that query. Essentially:
take the count you get back,
divide by 1000, rounding up,
if the result is 0, add 1.
That's the number of document reads you're charged.
The performance of a COUNT() query depends on the number of items you count. Counting more items will take longer as the database has to match a large number of index entries.
This is similar to if you'd actually retrieve the documents rather than just their count: getting more documents would also take more time. Of course: counting the documents will be faster than actually getting them, but counting more documents still takes more time than counting fewer documents.
Some queries may be possible even when you don't have a specific index for the exact field combination you use, due to Firestore's ability to perform a zig-zag-merge-join across the indexes (that feature might have a different name now, but it's pretty much still doing something similar). That reduces the number of indexes you'd need for a database, and thus reduce the associated cost. It could also affect the performance of the count, but not in a way that you can easily calculate. I recommend not trying to optimize here until you actually need it, i.e. when you notice certain queries being slower than other similar queries in your database.

Related

Large arrays in Firestore Database (Best practices)

I am populating a series of dates and temperatures that I was thinking of storing in a Firestore Database to later be consumed by the front-end with the following structure:
{
date: ['1920-01-01', '1920-01-02', '1920-01-03', '1920-01-04', '1920-01-05', ...],
values: [20, 18, 19.5, 20.5, ...]
}
The array may consider a lot of years, so it turns huge, with thousands of entries. Firestore database started complaining about returning the too many index entries for entity error, and even if I get the data uploaded, the user interface Firebase -> Firebase Database -> Panel View collapses. That happens even with less than 3000 entries array.
The fact is that the data is consumed in the front-end with an array structure very similar to the one described above (I want to plot it using Echarts library). This way, I found this structure to be the more natural way, as any other alternative will require reversing the structure to arrays in the front-end.
Nevertheless, I see that Firestore Database very clearly does not like this structure. What should I do? What is the best practice for dealing with this kind of data in Firestore?
The indexes required for the most basic queries in Firestore are automatically created for you. However, there are some limits involved. So you're getting the following error:
too many index entries for entity
Because you hit the maximum number of index entries for a document, which is 40,000. If you add too many elements into an array or you add too many fields to a document, then you can reach the maximum limit.
So most likely the number of elements that exist in the date array + the number of elements that exist in the values array is bigger than 40k, hence the error.
To solve this, you might consider creating two separate documents, one for each array. If you still hit the maximum limit, then you might consider creating a document for each hour, and not for an entire day. In this way, you'll drastically reduce the number of elements that exist in an array.
If you don't find these solutions useful, then you have to set some "Single-field index exemptions" to avoid the above error.
Firestore is not the best tool to deal with time series. The best solution I found in Firestore was creating an independent document for each day in my data. Nevertheless, that raises the number of documents I need to fetch from the front-end side and, therefore, the costs.
By using large arrays in Firestore, you easily reach the index limit, and you are forced to remove the index, which I feel is a big red flag, suggesting checking another tool.
The solution I found, in case is useful for anyone, was building my API in Flask using MongoDB as a database. Although it takes more effort than just using Firestore, it deals better with time series and brings more flexibility.

How can I know if indexing a timestamp field on a collection of documents is going to cause problems?

I saw on the Firestore documentation that it is a bad idea to index monotonically increasing values, that it will increase latency. In my app I want to query posts based on unix time which is a double and that is a number that will increase as time moves on, but in my case not perfectly monotonically because people will not be posting every second, in addition I don't think my app will exceed 4 million users. does anyone with expertise think this will be a problem for me
It should be no problem. Just make sure to store it as number and not as String. Othervise the sorting would not work as expected.
This is exactly the problem that the Firestore documentation is warning you about. Your database code will incur a cost of "hotspotting" on the index for the timestamp at scale. Specifically, from that linked documentation:
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
The numbers don't have to be purely monotonic. The hotspotting happens on ranges that are used for sharding the index. The documentation just doesn't tell you what to expect for those ranges, as they can change over time as the index gains more documents.
Also from the documentation:
If you index a field that increases or decreases sequentially between documents in a collection, like a timestamp, then the maximum write rate to the collection is 500 writes per second. If you don't query based on the field with sequential values, you can exempt the field from indexing to bypass this limit.
In an IoT use case with a high write rate, for example, a collection containing documents with a timestamp field might approach the 500 writes per second limit.
If you don't have a situation where new documents are being added rapidly, it's not a near-term problem. But you should be aware that it just doesn't not scale up like reads and queries will scale against that index. Note that number of concurrent users is not the issue at all - it's the number of documents being added per second to an index shard, regardless of how many people are causing the behavior.

How would I order my collection on timestamp and score

I have a collection with documents that have a createdAt timestamp and a score number. I sort all the documents on score for our leaderboard. But now I want to also have the daily best.
matchResults.orderBy("score").where("createdAt", ">", yesterday).startAt(someValue).limit(10);
But I found that there are limitations when using different fields.
https://firebase.google.com/docs/firestore/query-data/order-limit-data#limitations.
So how could I get the results of today in chuncks of 10 sorted on score?
You can use multiple orderBy(...) clauses to order on multiple fields, but this won't exactly meet your needs since you must first order by timestamp and only second by score.
A brute force option would be to fetch all the scores for the given day and truncate the list locally. But that of course won't work well if there are thousands of scores to load.
One simple answer would be to use a datestamp instead of timestamp:
matchResults.where("dayCreated", "=", "YYYY-MM-DD").orderBy("score").startAt(...).limit(10)
A second simple answer would be to run a Cloud Function on write events and maintain a daily top scores table separate from your scores data. If the results are frequently viewed, this would ultimately prove more economical and scalable as you would only need to record a small subset (say the top 100) by day, and can simply query that table ordering by score.
Scoreboards are extremely difficult at scale, so don't underestimate the complexity of handling every edge case. Start small and practical, focus on aggregating your results during writes and keep reads small and simple. Limit scope by listing only a top percentage for your "top scores" records and skip complex pagination schemas where possible.

How can one get count, with a where condition in DynamoDB

Let us say, We have a situation where instead of getting the total count in a table, get the count of records with a particular status.
We know DynamoDb is schemaless and still has to count each record one by one to get the total count.
And yet, How can we leverage the above need using dynamoDb queries?
While normally "Query" or "Scan" requests return all the matching items, you can pass the Select=COUNT parameter and ask to retrieve only the number of matching items, instead of the actual items. But before you go doing that, there are a few things you should know:
DynamoDB will still be reading - and you will still be paying for - all the data, even if just for being counted. Doing a "Scan" with a filter is in almost all cases out of the question, because it will read the entire data set every time. With a "Query" you can ask to read just one partition, or a contiguous range of sort-keys in one partition, which in some cases may be reasonable enough (but please think if it is, in your use case).
Even if you're not actually reading the data, and just counting, DynamoDB still does Scan and Query with "paging", i.e., your reads request will read just 1MB of data from disk, return you the partial count, and ask you to submit another request to resume the scan. Your DynamoDB library probably has a way to automate this resumption, so for example it can run thousands or whatever number of queries needed until finally finishing the scan and calculating the total sum.
In some cases, it may make sense for to maintain a counter in addition to the data. Writes will be more expensive (e.g., each write adds data and increments the counter), but reads that need this counter will be hugely cheaper - so it all depends on how much of each your workload needs.

How to retrieve a row's position within a DynamoDB global secondary index and the total?

I'm implementing a leaderboard which is backed up by DynamoDB, and their Global Secondary Index, as described in their developer guide, http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
But, two of the things that are very necessary for a leaderboard system is your position within it, and the total in a leaderboard, so you can show #1 of 2000, or similar.
Using the index, the rows are sorted the correct way, and I'd assume these calls would be cheap enough to make, but I haven't been able to find a way, as of yet, how to do it via their docs. I really hope I don't have to get the entire table every single time to know where a person is positioned in it, or the count of the entire table (although if that's not available, that could be delayed, calculated and stored outside of the table at scheduled periods).
I know DescribeTable gives you information about the entire table, but I would be applying filters to the range key, so that wouldn't suit this purpose.
I am not aware of any efficient way to get the ranking of a player. The dumb way is to do a query starting from the player with the highest point, move downward, keep incrementing your counter until you reach the target player. So for the user with lowest point, you might end up scanning the whole range.
That being said, you can still get the top 100 player with no problem (Leaders). Just do a query starting from the player with the highest point, and set the query limit to 100.
Also, for a given player, you can get 100 players around him with similar points. You just need do two queries like:
query with hashkey="" and rangekey <= his point, limit 50
query with hashkey="" and rangekey >= his point, limit 50
This was the exact same problem we were facing when we were developing our app. Following are two solutions we had come with to deal with this problem:
Query your index with scanIndex->false that will give you all top players (assuming your score/points key in range) with limit 1000. Then applying this mathematical formula y = mx+b where you can take 2 iteration, mostly 1 and last value to find out m and b, x-points, and y-rank. Based on this you will get the rank if you have user's points (this will not be exact rank value it would be approximate, google does the same if we search some thing in our mail it show
and not exact value in first call.
Get all the records and store it in cache until the next update. This is by far the best and less expensive thing we are using.
The beauty of DynamoDB is that it is highly optimized for very specific (and common) use cases. The cost of this optimization is that many other use cases cannot be achieved as easily as with other databases. Unfortunately yours is one of them. That being said, there are perfectly valid and good ways to do this with DynamoDB. I happen to have built an application that has the same requirement as yours.
What you can do is enable DynamoDB Streams on your table and process item update events with a Lambda function. Every time the number of points for a user changes you re-compute their rank and update your item. Even if you use the same scan operation to re-compute the rank, this is still much better, because it moves the bulk of the cost from your read operation to your write operation, which is kind of the point of NoSQL in the first place. This approach also keeps your point updates fast and eventually consistent (the rank will not update immediately, but is guaranteed to update properly unless there's an issue with your Lambda function).
I recommend to go with this approach and once you reach scale optimize by caching your users by rank in something like Redis, unless you have prior experience with it and can set this up quickly. Pick whatever is simplest first. If you are concerned about your leaderboard changing too often, you can reduce the cost by only re-computing the ranks of first, say, 100 users and schedule another Lambda function to run every several minutes, scan all users and update their ranks all at the same time.

Resources