How fast is counting documents in Cloud Firestore?

Last year Firestore introduced count queries, which allow you to retrieve the number of results in a query or collection without actually reading the individual documents.
The documentation for this count feature mentions:
Aggregation queries rely on the existing index configuration that your queries already use, and scale proportionally to the number of index entries scanned. This means that aggregations of small- to medium-sized data sets perform within 20-40 ms, though latency increases with the number of items counted.
And:
If a count() aggregation cannot resolve within 60 seconds, it returns a DEADLINE_EXCEEDED error.
How many documents can Firestore actually count within that 1-minute timeout?

I created some collections with many documents in a test database, and then ran COUNT() queries against them.
The code to generate the minimal documents through the Node.js Admin SDK:
const { getFirestore, FieldValue } = require("firebase-admin/firestore");

const db = getFirestore();
const col = db.collection("10m");
let count = 0;
const writer = db.bulkWriter();
while (count++ < 10_000_000) {
  // Flush periodically so pending writes don't accumulate unbounded in memory
  if (count % 1000 === 0) await writer.flush();
  writer.create(col.doc(), {
    index: count,
    createdAt: FieldValue.serverTimestamp(),
  });
}
await writer.close();
Then I counted them with:
import { collection, getCountFromServer } from "firebase/firestore";

// Count queries through the web client SDK
for (const name of ["1k", "10k", "100k", "1m", "10m"]) {
  const start = Date.now();
  const result = await getCountFromServer(collection(db, name));
  console.log(`Collection '${name}' contains ${result.data().count} docs (counting took ${Date.now() - start}ms)`);
}
And the results I got were:
count        ms
1,000        120
10,000       236
100,000      401
1,000,000    1,814
10,000,000   16,565
I ran some additional tests with limits and conditions, and the results were always in line with the above for the number of results that were actually counted. So for example, counting 10% of the collection with 10m documents took about 1½ to 2 seconds.
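For example, a count over a filtered query looks like this (a sketch using the index field written by the generation code above, not my exact test):

import { collection, getCountFromServer, query, where } from "firebase/firestore";

// Count only the documents in the first 10% of the 10m collection
const snapshot = await getCountFromServer(
  query(collection(db, "10m"), where("index", "<=", 1_000_000))
);
console.log(snapshot.data().count); // 1,000,000, counted in roughly 1.5-2 seconds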
So based on this, you can count up to around 40m documents before you reach the 60-second timeout. Honestly, given that you're charged one document read for each batch of up to 1,000 documents counted, you'll probably want to switch over to stored counters well before that.
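For reference, here's a minimal sketch of such a stored counter with the Admin SDK, assuming a hypothetical counters/10m document and reusing col from the generation code above:

const counterRef = db.collection("counters").doc("10m");

// Create each document and bump the counter in one atomic batch, using the
// increment operator so concurrent writers don't clobber each other.
async function createWithCount(data) {
  const batch = db.batch();
  batch.create(col.doc(), data);
  batch.set(counterRef, { count: FieldValue.increment(1) }, { merge: true });
  await batch.commit();
}

// Reading the count is now a single document read, regardless of collection size.
const count = (await counterRef.get()).data().count;

Keep in mind that a single document only supports a limited sustained write rate, so at high write volumes you'd typically shard this into a distributed counter.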

Related

Interrupt ZStream mapMPar processing

I have the following code which, because of Excel's max row limitation, is restricted to ~1 million rows:
ZStream.unwrap(generateStreamData).mapMPar(32) { m =>
  streamDataToCsvExcel
}
All fairly straightforward and it works perfectly. I keep track of the number of rows streamed, and then stop writing data. However, I want to interrupt all the child fibers spawned in mapMPar, something like this:
ZStream.unwrap(generateStreamData).interruptWhen(effect.true).mapMPar(32) { m =>
  streamDataToCsvExcel
}
Unfortunately the process is interrupted immediately here. I'm probably missing something obvious...
Editing the post as it needs some clarity.
My stream of data is generated by an expensive process in which data is pulled from a remote server (this data is itself calculated by an expensive process), with n fibers.
I then process the streams and stream them out to the client.
Once the processed row count has reached ~1 million, I need to stop pulling data from the remote server (i.e. interrupt all the fibers) and end the process.
Here's what I can come up with after your clarification. The ZIO 1.x version is a bit uglier because of the lack of .dropRight.
Basically, we can use takeUntilM to count the size of the elements we've received and stop once we reach the maximum size (and then use .dropRight, or the additional filter, to discard the last element that would take it over the limit).
This ensures that both:
- you only run streamDataToCsvExcel until the last possible message before hitting the size limit, and
- because streams are lazy, expensiveQuery only gets run for as many messages as you can fit within the limit (or N+1 if the last value is discarded because it would go over the limit).
import zio._
import zio.stream._

object Main extends zio.App {
  override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
    val expensiveQuery = ZIO.succeed(Chunk(1, 2))
    val generateStreamData = ZIO.succeed(ZStream.repeatEffect(expensiveQuery))
    def streamDataToCsvExcel = ZIO.unit

    def count(ref: Ref[Int], size: Int): UIO[Boolean] =
      ref.updateAndGet(_ + size).map(_ > 10)

    for {
      counter <- Ref.make(0)
      _ <- ZStream
             .unwrap(generateStreamData)
             .takeUntilM(next => count(counter, next.size)) // Count the size of messages and stop once the limit is reached
             .filterM(_ => counter.get.map(_ <= 10))        // Filter out the last message from `takeUntilM`. Ideally this would be .dropRight(1) with ZIO 2
             .mapMPar(32)(_ => streamDataToCsvExcel)
             .runDrain
    } yield ExitCode.success
  }
}
If relying on the laziness of streams doesn't work for your use case, you can trigger an interrupt of some sort from the takeUntilM condition. For example, you could update the count function to:

def count(ref: Ref[Int], size: Int): UIO[Boolean] =
  ref.updateAndGet(_ + size).map(_ > 10)
    .tapSome { case true => someFiber.interrupt }

DynamoDB pagination - knowing when there are no more results

The doc states:
The absence of LastEvaluatedKey is the only way to know that you have reached the end of the result set.
However, if you have 10 items and you Query with a limit of 10 items, you WILL get a result set with a LastEvaluatedKey, even though there are no more items after that.
Is there a reliable method to actually know when you've reached the end of the result set?
When you specify a limit (10 as per this question), DynamoDB finds that many items and does not look beyond them.
Since there are 10 items and the limit is 10, the query fills the limit exactly, so a LastEvaluatedKey is returned.
On the second attempt to read items, it finds no more items in the table and hence returns a null LastEvaluatedKey. You will need a while loop, something like below:
List<QueryResult> queryResultList = new ArrayList<>();

// Since a query returns at most 1 MB of items at a time, the absence of
// this key is what tells us that no more matching items are present in the table.
Map<String, AttributeValue> lastKeyEvaluated = null;

Map<String, AttributeValue> expressionAttributeValue = new HashMap<>();
expressionAttributeValue.put(":primary_key_value", new AttributeValue().withS(primary_key_value));

do {
    QueryRequest queryRequest = new QueryRequest()
            .withTableName(this.getDynamoTable().getTableName())
            .withIndexName(Constants.Table.INDEX_NAME)
            .withKeyConditionExpression("primary_key = :primary_key_value")
            .withExpressionAttributeValues(expressionAttributeValue)
            .withExclusiveStartKey(lastKeyEvaluated);

    QueryResult result = this.getAmazonDynamoDBClient().query(queryRequest);
    queryResultList.add(result);
    lastKeyEvaluated = result.getLastEvaluatedKey();
} while (lastKeyEvaluated != null);
I know this question is 4 years old, but I also encountered this problem and this is how I "solved" it:
In my case I was using the last evaluated key to paginate the results, passing the query size via a variable (let's call it pageSize).
Using the example from the question, we would pass the function a pageSize = 10.
How do we mitigate the lastEvaluatedKey problem when the number of results is exactly pageSize?
We query pageSize + 1 items. If response.length == pageSize + 1, there is at least one extra item beyond the page, meaning there are more pages. We then build the lastEvaluatedKey from the pageSize-th item (remember that we retrieved pageSize + 1 items). If response.length <= pageSize, there are no more pages and no lastEvaluatedKey to pass back.
Summary example: we want a page of 10 items, so we query 11 items. Two possibilities:
- We get <= 10 items: great, we return the items without a lastEvaluatedKey, because there are no more items.
- We get 11 items: there are more pages, so we return the lastEvaluatedKey of the 10th item. A small downside is that we have to "craft" this lastEvaluatedKey ourselves, as it won't be the same lastEvaluatedKey that the DynamoDB response gives us (remember that we queried 10 + 1 items, so any lastEvaluatedKey in the response would belong to the 11th item).
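Here's a sketch of that pageSize + 1 trick in Node.js with the DocumentClient; the table and key names are made up for illustration, and if the table has a sort key it must be included in the crafted key as well:

const AWS = require("aws-sdk");
const dynamo = new AWS.DynamoDB.DocumentClient();

async function getPage(primaryKeyValue, pageSize, exclusiveStartKey) {
  const result = await dynamo.query({
    TableName: "my_table", // hypothetical table
    KeyConditionExpression: "primary_key = :pk",
    ExpressionAttributeValues: { ":pk": primaryKeyValue },
    Limit: pageSize + 1, // ask for one extra item to detect further pages
    ExclusiveStartKey: exclusiveStartKey,
  }).promise();

  const hasMore = result.Items.length === pageSize + 1;
  const items = result.Items.slice(0, pageSize);
  // Craft the key from the last item of the page; result.LastEvaluatedKey
  // would point at the extra (pageSize + 1)th item.
  const lastEvaluatedKey = hasMore
    ? { primary_key: items[pageSize - 1].primary_key }
    : undefined;
  return { items, lastEvaluatedKey };
}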

How do DynamoDB + DAX work with time series?

I wonder how DAX works with time series. I want to insert some data every minute, add a TTL to remove it after 14 days, and read the last 3 hours of data after each insert:
- insert 1 KB each minute
- expire after 14 days
- after each insert, read the data for the last 3 hours
3 hours is 180 minutes, so most of the time I need the last 180 items. Sometimes data doesn't come in for a while, so there may be fewer than 180 items.
So that's 20,160 items, roughly 19 MB of data, for 14 days. How much of the DAX cache will I use while fetching the last 3 hours of data every minute? Will it be 19 MB or 180 KB?
let params = {
  TableName: 'prod_server_data',
  KeyConditionExpression: 's = :server_id and t between :time_from and :time_to',
  ExpressionAttributeValues: {
    ':server_id': serverId, // string
    ':time_from': from,     // timestamp
    ':time_to': to,         // timestamp
  },
  ReturnConsumedCapacity: 'TOTAL',
  ScanIndexForward: false,
  Limit: 1440, // 24h * 60min = 1440; 1 check every minute
};
const queryResult = await dynamo.query(params).promise();
DAX caches items and queries separately, and the query cache stores the entire response, keyed by the request parameters. In this case, set the query TTL to 1 minute, and make sure that :time_from and :time_to only have 1-minute resolution.
If you only call query once per minute, then you won't see much benefit from DAX (since it will have to go to DynamoDB every time to refresh).
If you call query multiple times per minute but only expect the data to update every minute (e.g. repeatedly refreshing a dashboard), there will only be 1 call to DynamoDB every minute to refresh, and all other requests will be served from the cache.
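To make those parameters line up, you can round the window boundaries down to whole minutes, so every call within the same minute produces an identical cache key (a sketch):

// Round the query window down to 1-minute resolution so repeated calls
// within the same minute hit the same DAX query-cache entry.
const MINUTE = 60 * 1000;
const to = Math.floor(Date.now() / MINUTE) * MINUTE;
const from = to - 180 * MINUTE; // last 3 hours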

reductio Max of a sum

I have a dataset I'm working with that consists of buildings and their electrical power use over time.
There are two aggregations on these buildings that are simple sums across the entire timespan, and I have those written. They end up looking like:
var reducer = reductio();

// How much energy is used in the whole system
reducer.value("energy").sum(function (d) {
  return +d.Energy;
});
These work great.
The third aggregation, however, is giving me some trouble. I need to find the point at which the sum across all the buildings is at its greatest. I need the max of the sum and the time it happened.
I wrote:
reducer.value("power").sum(function (d) {
return +d.Power;
}).max(function (d) {
return +d.Power;
}).aliasProp({
time: function (d, v) {
return v.Timestamp;
}
});
But this is not necessarily the biggest power use. I'm pretty sure this returns the sum, plus the time when any individual building used the most power.
So if the power values at one moment were 1, 1, 1, 15, I would end up with 18, when there might be a different moment when the values were 5, 5, 5, 5, for a total of 20. The 20 is what I need.
I am at a loss for how to get the maximum of a sum. Any advice?
Just to restate: You are grouping on time, so your group keys are time periods of some sort. What you want is to find the time period (group) for which power use is greatest.
If I'm right that this is what you want, then you would not do this in your reducer, but rather by sorting the groups. You can order groups by using the group.order method: https://github.com/crossfilter/crossfilter/wiki/API-Reference#group_order
// During group setup
group.order(function(p) { return p.power.sum; });

// Later, when you want to grab the top power group
group.top(1);
Reductio's max aggregation should just give you the maximum value that occurs within the group. So given a group with values 1,1,1,15, you would get back the value 15. It sounds like that's not what you want.
Hopefully I understood properly. If not, please comment. If you can put together an example with toy data that is public and where you can tell me what you would like to see vs what you are getting, I should be able to help out.
Update based on example:
So, what you want (based on the description in the example) is to find the maximum power usage at any given time within the selected time period. You would do the following:
var timeDim = buildings.dimension(function(d) { return d.Timestamp; });
var timeGrp = timeDim.group().reduceSum(function(d) { return d.Power; });
var maxResults = timeGrp.top(1);
Whenever you want to find the max power usage time for your current filter, just call timeGrp.top(1) and the key of that group will be the time with the maximum power.
Note: Don't filter on timeDim as the filters on a dimension are not applied to groups defined on that dimension.
Here's an updated JSFiddle that writes out the maximum group to the console: https://jsfiddle.net/esjewett/1o3robm3/1/
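For completeness, reading the result back might look like this (top(1) returns an array of { key, value } group records):

// The key is the timestamp of the peak; the value is the summed power.
var peak = timeGrp.top(1)[0];
console.log("Peak total power of " + peak.value + " at " + peak.key);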

Crossfilter total by group

I'm trying to show the total number of people in each geography when the user hovers over it using crossfilter, but my current code only shows the total across all geographies. So what is the crossfilter equivalent of the SQL query: SELECT COUNT(*) GROUP BY dma
This is my code so far:
// Geography that is being hovered over: get the DMA name and remove everything after the comma
sel_geog = layer.feature.properties.dma_1;
sel_geog = sel_geog.split(",")[0];
console.log(sel_geog);

// Crossfilter to get the total number of people in each geography
var dmaDim = voter_data.dimension(function(d) { return d.dma == sel_geog; }),
    dma_grp = dmaDim.groupAll().reduceCount().value();
console.log(dma_grp);
Crossfilter isn't meant to be used in a way where you are building new dimensions and groups for each user interaction. It's meant to build dimensions and groups before interactions take place and then update them quickly when filtering based on user interactions.
It's not really clear from this question what your data looks like or what you're trying to do, but you probably want to create a dimension and group for your dma property and then build your map based on that:
var voter_data = crossfilter(my_data);
var dmaDim = voter_data.dimension(function(d) { return d.dma; });
var dmaGroup = dmaDim.group();
At this point dmaGroup.all() will be an array of objects that looks like { key: 'dmaKey', value: 10 } where 10 is the count of all records where d.dma === 'dmaKey'. There are lots of ways you can aggregate differently with Crossfilter, but that may get you started.
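For example, a hover handler could then be a plain lookup into the pre-built group instead of constructing a new dimension on every hover (a sketch; onHover and the layer shape are assumptions based on the question):

// Build a lookup table from the group. Group values update automatically as
// filters change, so re-read the group after filtering.
var countsByDma = {};
dmaGroup.all().forEach(function(g) { countsByDma[g.key] = g.value; });

function onHover(layer) {
  var sel_geog = layer.feature.properties.dma_1.split(",")[0];
  console.log(sel_geog, countsByDma[sel_geog] || 0);
}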
