I have a Cosmos DB whose size is around 100 GB.
I have successfully created a good partition key (around 4,600 partitions over 70M records), but I still need to query on two datetime fields, which are stored as strings rather than in an epoch format.
Example JSON:
{
    "someField1": "UNKNOWN",
    "someField2": "DATA",
    "endDate": 7014541201,
    "startDate": 7054864502,
    "someField3": "0"
}
I notice that when I run select * from tbl versus select * from tbl where startDate > {someDate} AND endDate < {someDate1}, the latency difference is only around 1 s, so the filtering does not reduce my latency.
Is it better to store dates as numbers? Does Cosmos DB perform better on epoch range queries?
I am using the SQL API.
Also, when I try to add Hash indexes on startDate and endDate, each one is basically converted into two indexes.
Example:
{
    "path": "/startDate/?",
    "indexes": [
        {
            "kind": "Hash",
            "dataType": "String",
            "precision": 3
        }
    ]
},
This is converted to:
"path": "/startDate/?",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
Is that normal behaviour, or is it related to my data?
Thanks.
I checked the query metrics, and for 4k records the query to Cosmos DB executes in about 100 ms. I would like to ask: is it normal behaviour that the following query
var option = new FeedOptions { PartitionKey = new PartitionKey(partitionKey), MaxItemCount = -1 };
var query = client.CreateDocumentQuery<MyModel>(collectionLink, option)
    .Where(tl => tl.StartDate >= DateTimeToUnixTimestamp(startDate) && tl.EndDate <= DateTimeToUnixTimestamp(endDate))
    .AsEnumerable().ToList(); // ToList() synchronously drains every result page
returns 10k results (in Postman it's around 9 MB) in 10-12 s? The partition contains around 50k records.
Retrieved Document Count : 12,356
Retrieved Document Size : 12,963,709 bytes
Output Document Count : 3,633
Output Document Size : 3,819,608 bytes
Index Utilization : 29.00 %
Total Query Execution Time : 264.31 milliseconds
Query Compilation Time : 0.12 milliseconds
Logical Plan Build Time : 0.07 milliseconds
Physical Plan Build Time : 0.06 milliseconds
Query Optimization Time : 0.01 milliseconds
Index Lookup Time : 51.10 milliseconds
Document Load Time : 140.51 milliseconds
Runtime Execution Times
Query Engine Times : 55.61 milliseconds
System Function Execution Time : 0.00 milliseconds
User-defined Function Execution Time : 0.00 milliseconds
Document Write Time : 10.56 milliseconds
Client Side Metrics
Retry Count : 0
Request Charge : 904.73 RUs
I am from the CosmosDB engineering team.
Since your collection has 70M records, I assume that the observed latency is only on the first roundtrip (or first page) of results. Note that the observed latency can also be improved by tweaking FeedOptions.MaxDegreeOfParallelism to -1 when executing the query.
Regarding the difference between the two queries themselves, please note that SELECT * without a filter is a full scan query, which is probably a bit faster to return its first results than the query with two filters, which does a little more work on the local indexes across all the partitions; that may explain the observed latency.
Regarding your other question, we no longer support the Hash indexing policy on new collections. Please see here: https://learn.microsoft.com/en-us/azure/cosmos-db/index-types#index-kind . We automatically convert Hash indexes to Range with full precision.
You may also fetch QueryMetrics for your query and analyze the results to figure out why you have latency. Details are here: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-query-metrics#query-execution-metrics
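For completeness, here is a rough sketch of what a tuned, paged query could look like. The thread above uses the .NET SDK (where the relevant knobs are FeedOptions.MaxDegreeOfParallelism and MaxItemCount); this sketch uses the JavaScript SDK (@azure/cosmos) to stay consistent with the other Node.js examples in this document, and the database/container names and parameter values are assumptions, not taken from the question:

import { CosmosClient } from "@azure/cosmos";

// Hypothetical connection details -- replace with your own.
const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT!,
  key: process.env.COSMOS_KEY!,
});
const container = client.database("mydb").container("tbl");

async function queryRange(partitionKeyValue: string, startEpoch: number, endEpoch: number) {
  const iterator = container.items.query(
    {
      query: "SELECT * FROM c WHERE c.startDate >= @from AND c.endDate <= @to",
      parameters: [
        { name: "@from", value: startEpoch },
        { name: "@to", value: endEpoch },
      ],
    },
    {
      partitionKey: partitionKeyValue, // keep the query single-partition
      maxItemCount: 1000,              // page size
      maxDegreeOfParallelism: 10,      // analogue of FeedOptions.MaxDegreeOfParallelism in .NET
    }
  );

  // Drain the pages one at a time so the per-page request charge is visible.
  while (iterator.hasMoreResults()) {
    const page = await iterator.fetchNext();
    console.log(`page of ${page.resources.length} docs, ${page.requestCharge} RU`);
  }
}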
Last year Firestore introduced count queries, which allow you to retrieve the number of results in a query/collection without actually reading the individual documents.
The documentation for this count feature mentions:
Aggregation queries rely on the existing index configuration that your queries already use, and scale proportionally to the number of index entries scanned. This means that aggregations of small- to medium-sized data sets perform within 20-40 ms, though latency increases with the number of items counted.
And:
If a count() aggregation cannot resolve within 60 seconds, it returns a DEADLINE_EXCEEDED error.
How many documents can Firestore actually count within that 1 minute timeout?
I created some collections with many documents in a test database, and then ran COUNT() queries against them.
The code to generate the minimal documents through the Node.js Admin SDK:
import { initializeApp } from "firebase-admin/app";
import { FieldValue, getFirestore } from "firebase-admin/firestore";

initializeApp();
const db = getFirestore();
const col = db.collection("10m");
let count = 0;
const writer = db.bulkWriter();
while (count++ < 10_000_000) {
if (count % 1000 === 0) await writer.flush();
writer.create(col.doc(), {
index: count,
createdAt: FieldValue.serverTimestamp()
})
}
await writer.close();
Then I counted them with:
for (const name of ["1k", "10k", "1m", "10m"]) {
const start = Date.now();
const result = await getCountFromServer(collection(db, name));
console.log(`Collection '${name}' contains ${result.data().count} docs (counting took ${Date.now()-start}ms)`);
}
And the results I got were:
count          ms
-------------------------
1,000          120
10,000         236
100,000        401
1,000,000      1,814
10,000,000     16,565
I ran some additional tests with limits and conditions, and the results were always in line with the above for the number of results that were counted. So for example, counting 10% of the collection with 10m documents took about 1½ to 2 seconds.
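As a point of reference, one of those filtered counts can be expressed with a where() clause. This is only a sketch using the modular Web SDK from the counting loop above (db comes from that snippet); the 10% threshold on the index field is illustrative, not the exact query I ran:

import { collection, getCountFromServer, query, where } from "firebase/firestore";

// Count only documents whose `index` field falls in the first 10% of the
// 10m-document collection generated above (threshold chosen for illustration).
const q = query(collection(db, "10m"), where("index", "<=", 1_000_000));
const snapshot = await getCountFromServer(q);
console.log(`Matched ${snapshot.data().count} documents`);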
So based on this, you can count up to around 40m documents before you reach the 60-second timeout. Honestly, given that you're charged one document read for every 1,000 documents counted (or part thereof), you'll probably want to switch over to stored counters well before that.
I have a table with PK (String) and SK (Integer) - e.g.
PK_id SK_version Data
-------------------------------------------------------
c3d4cfc8-8985-4e5... 1 First version
c3d4cfc8-8985-4e5... 2 Second version
I can do a conditional insert to ensure we don't overwrite the PK/SK pair, using a ConditionExpression (in the Go SDK):
putWriteItem := dynamodb.Put{
    TableName:           aws.String("example_table"),
    Item:                itemMap,
    ConditionExpression: aws.String("attribute_not_exists(PK_id) AND attribute_not_exists(SK_version)"),
}
However, I would also like to ensure that SK_version is always consecutive, but I don't know how to write the expression. In pseudo-code this is:
putWriteItem := dynamodb.Put{
TableName: "example_table",
Item: itemMap,
ConditionExpression: aws.String("attribute_not_exists(PK_id) AND attribute_not_exists(SK_version) **AND attribute_exists(SK_version = :SK_prev_version)**"),
}
Can someone advise how I can write this?
In SQL I'd do something like:
INSERT INTO example_table (PK_id, SK_version, Data)
SELECT {pk}, {sk}, {data}
WHERE NOT EXISTS (
SELECT 1
FROM example_table
WHERE PK_id = {pk}
AND SK_version = {sk}
)
AND EXISTS (
SELECT 1
FROM example_table
WHERE PK_id = {pk}
AND SK_version = {sk} - 1
)
Thanks
A conditional check is applied to a single item; it cannot span multiple items. In other words, you simply need multiple conditional checks. DynamoDB has a TransactWriteItems API which performs multiple conditional checks along with writes/deletes. The code below is in Node.js.
const previousVersionCheck = {
TableName: 'example_table',
Key: {
PK_id: 'prev_pk_id',
SK_version: 'prev_sk_version'
},
ConditionExpression: 'attribute_exists(PK_id)'
}
const newVersionPut = {
TableName: 'example_table',
Item: {
// your item data
},
ConditionExpression: 'attribute_not_exists(PK_id)'
}
await documentClient.transactWrite({
TransactItems: [
{ ConditionCheck: previousVersionCheck },
{ Put: newVersionPut }
]
}).promise()
The transaction has two operations: one is a validation against the previous version, and the other is a conditional write. If either conditional check fails, the whole transaction fails.
You are hitting your head on some of the differences between a SQL and a NoSQL database. DynamoDB is, of course, a NoSQL database. It does not, out of the box, support optimistic locking. I see two straightforward options:
Use a software layer to give you locking on your DynamoDB table. This may or may not be feasible depending on how often updates are made to your table. How fast 'versions' are generated and the maximum time your application can be gated on the lock will likely tell you if this can work for you. I am not familiar with Go, but the Java API supports this. Again, this isn't a built-in feature of DynamoDB. If there is no such Go API equivalent, you could use the technique described in the link to 'lock' the table for updates. Generally speaking, locking a NoSQL database isn't a typical pattern, as it isn't exactly what these databases were created to do (part of which is achieving large scale on unstructured documents while allowing fast access by many consumers at once).
Stop using an incrementor to guarantee uniqueness. Typically, incrementors are frowned upon in DynamoDB, in part due to the lack of intrinsic support and in part because, given how DynamoDB shards, you don't want a lot of similarity between records. Using a UUID will solve the uniqueness problem, but if you are porting an existing application it means more changes to the elements that create that ID and updates to the code that reads it (perhaps adding a creation-time field so you can tell which record is the newest, or prepending or appending an epoch time to the UUID to do the same). Here is a pertinent link to an SO question explaining why to use UUIDs instead of incrementing integers; a small illustration follows.
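A rough sketch of that second option in Node.js (not from the original question; the createdAt attribute is made up for illustration, and documentClient is assumed from the snippet above):

const { randomUUID } = require("crypto");

// Hypothetical item shape: a random UUID as the key plus a creation timestamp,
// so "which version is newest" can still be answered without an incrementing SK.
const item = {
  PK_id: randomUUID(),   // e.g. "c3d4cfc8-8985-4e5f-..."
  createdAt: Date.now(), // epoch millis, used instead of a consecutive version number
  Data: "First version",
};

await documentClient.put({ TableName: "example_table", Item: item }).promise();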
Based on Hung Tran's answer, here is a Go example:
checkItem := dynamodb.TransactWriteItem{
    ConditionCheck: &dynamodb.ConditionCheck{
        TableName:           aws.String("example_table"),
        ConditionExpression: aws.String("attribute_exists(pk_id) AND attribute_exists(version)"),
        Key:                 map[string]*dynamodb.AttributeValue{"pk_id": {S: id}, "version": {N: prevVer}},
    },
}
putItem := dynamodb.TransactWriteItem{
    Put: &dynamodb.Put{
        TableName:           aws.String("example_table"),
        ConditionExpression: aws.String("attribute_not_exists(pk_id) AND attribute_not_exists(version)"),
        Item:                data,
    },
}
writeItems := []*dynamodb.TransactWriteItem{&checkItem, &putItem}
if _, err := db.TransactWriteItems(&dynamodb.TransactWriteItemsInput{TransactItems: writeItems}); err != nil {
    // either condition check failed, or another error occurred; handle it here
    return err
}
We are using Pinot's HLL support and were advised to switch from fasthll to distinctcounthll, but we get very different counts: with the same condition there is a ~1000x difference.
Example:
SELECT fasthll(my_hll), distinctcounthll(my_hll)
FROM counts_table WHERE timestamp >= 1500768000
I get results:
"aggregationResults": [
{
"function": "fastHLL_my_hll",
"value": "68685244"
}, {
"function": "distinctCountHLL_my_hll",
"value": "50535"
}]
Could anyone explain the big difference between them?
Please refer to pinot-issue-5153.
fasthll converts each string into a HyperLogLog object, which may represent thousands of unique values. distinctcounthll treats the string as a plain value, not as a HyperLogLog object, so it returns an approximation of how many unique serialized HyperLogLog strings there are, which should be close to the total number of records scanned.
fasthll is deprecated because of the poor deserialization performance. You may instead generate a BYTES column for the serialized HyperLogLog using org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog) and query it with distinctcounthll.
I wonder how DAX works with time series. I want to insert some data every minute, add a TTL to remove it after 14 days, and read the last 3 hours of data after each insert:
insert 1KB each minute
expire after 14 days
after each insert read data for the last 3 hours
3 hours is 180 minutes, so most of the time I need the last 180 items. Sometimes data stops coming in for a while, so there may be fewer than 180 items.
So there are 20,160 items, roughly 19 MB of data, over 14 days. How much data will DAX read while fetching the last 3 hours of data every minute? Will it be 19 MB or 180 KB?
let params = {
TableName: 'prod_server_data',
KeyConditionExpression: 's = :server_id and t between :time_from and :time_to',
ExpressionAttributeValues: {
':server_id': serverId, // string
':time_from': from, // timestamp
':time_to': to, // timestamp
},
ReturnConsumedCapacity: 'TOTAL',
ScanIndexForward: false,
Limit: 1440, // 24h*60 = 1440. 1 check every 1 min
};
const queryResult = await dynamo.query(params).promise();
DAX caches items and queries separately, and the query cache stores the entire response, keyed by the parameters. In this case, set the query TTL to 1 minute, and make sure that :time_from and :time_to only have 1 minute resolution.
If you only call query once per minute, then you won't see much benefit from DAX (since it will have to go to DynamoDB every time to refresh).
If you call query multiple times per minute but only expect the data to update every minute (i.e. repeatedly refreshing a dashboard) there will only be 1 call to DynamoDB every minute to refresh and all other requests will be served from the cache.
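For example, truncating the timestamps to whole minutes before building the request keeps the cache key stable within each minute (a small sketch; the values feed the query parameters shown above):

// Truncate "now" to a whole minute so that every query issued within the same
// minute uses identical :time_from / :time_to values and can be served from
// the DAX query cache.
const MINUTE_MS = 60 * 1000;
const to = Math.floor(Date.now() / MINUTE_MS) * MINUTE_MS;
const from = to - 3 * 60 * MINUTE_MS; // last 3 hours

// `from` and `to` then go into ExpressionAttributeValues as :time_from and :time_to.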
I am trying to save a set of events from a MySQL database into Elasticsearch using the JDBC input plugin with Logstash. The event records in the database contain date fields with microsecond precision; in practice, there are records that differ only by microseconds.
While importing the data, Elasticsearch truncates the microsecond date format to millisecond format. How can I save the data with microsecond precision? The Elasticsearch documentation says it follows the Joda-Time API for date formats, which does not support microseconds and truncates the value, adding a Z at the end of the timestamp.
Sample timestamp after truncation : 2018-05-02T08:13:29.268Z
Original timestamp in database : 2018-05-02T08:13:29.268482
The Z is not a result of the truncation; it is the GMT timezone indicator.
ES supports microseconds, too, provided you've specified the correct date format in your mapping.
If the date field in your mapping is specified like this:
"date": {
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"
}
Then you can index your dates with the exact microsecond precision you have in your database.
UPDATE
Here is a full re-creation that shows you that it works:
PUT myindex
{
"mappings": {
"doc": {
"properties": {
"date": {
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"
}
}
}
}
}
PUT myindex/doc/1
{
"date": "2018-05-02T08:13:29.268482"
}
Side note: the date datatype stores values with millisecond resolution in Elasticsearch, so if nanosecond-level precision is needed in date range queries, the appropriate datatype is date_nanos.
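For example, assuming Elasticsearch 7+ (where mapping types are gone and date_nanos is available), a nanosecond-resolution mapping could look like this; the index name is arbitrary:

PUT myindex_nanos
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date_nanos"
      }
    }
  }
}

PUT myindex_nanos/_doc/1
{
  "date": "2018-05-02T08:13:29.268482"
}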