Pinot fastHll and distinctCountHLL return different values - olap

We are using Pinot HLL and were advised to switch from fastHll to distinctCountHLL, but the counts we get are very different: with the same condition there is a ~1000x difference.
Example:
SELECT fasthll(my_hll), distinctcounthll(my_hll)
FROM counts_table WHERE timestamp >= 1500768000
I get results:
"aggregationResults": [
{
"function": "fastHLL_my_hll",
"value": "68685244"
}, {
"function": "distinctCountHLL_my_hll",
"value": "50535"
}]
Could anyone explain why there is such a big difference between them?

Please refer to pinot-issue-5153.
FastHll converts each string into a HyperLogLog object, which may represent thousands of unique values. DistinctCountHLL treats the string as a plain value, not as a HyperLogLog object, so it returns an approximation of how many unique serialized HyperLogLog strings there are; that value should be close to the total number of records scanned.
fastHll is deprecated because of its poor deserialization performance. You can generate a BYTES column for the serialized HyperLogLog using org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog) and query it with distinctCountHLL.
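For illustration, here is a minimal sketch of producing such a BYTES value at ingestion time. It assumes Pinot's bundled clearspring HyperLogLog implementation; the class, method, and log2m value below are made-up examples, and only the SerDe call comes from the note above.
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import org.apache.pinot.core.common.ObjectSerDeUtils;

public class HllBytesExample {
    // Build a HyperLogLog from raw values and serialize it for a BYTES column.
    public static byte[] toHllBytes(Iterable<String> rawValues) {
        HyperLogLog hll = new HyperLogLog(8); // log2m = 8, an example accuracy/size trade-off
        for (String value : rawValues) {
            hll.offer(value);
        }
        // These bytes can be ingested into a BYTES column and queried with distinctCountHLL(my_hll).
        return ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hll);
    }
}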

Related

elasticsearch - Date format requires exactly 3 decimals

I'm having trouble with date parsing in elasticsearch 7.10.1.
Here's (a relevant part of) the mapping for the index:
"utcTime": {
"type": "date",
"format": "strict_date_optional_time_nanos"
}
Date format reference.
Some of the documents are accepted, for example documents with:
"utcTime": "2021-02-17T09:50:13.173Z"
"utcTime": "2021-02-17T09:51:44.158Z"
Note that in both cases, there are exactly 3 decimals to the seconds.
This, on the other hand, is rejected:
"utcTime": "2021-02-17T09:51:45.07Z"
illegal_argument_exception: failed to parse date field [2021-02-17T09:51:45.07Z] with format [yyyy-MM-dd''T''HH:mm:ss.SSSXX]
In this case, there are only two decimals. I'm using Newtonsoft's JSON.net to do the serialization, with a format that should always include 3 decimals, but apparently it doesn't. It includes at most 3 decimals, though.
How can I tell elasticsearch to accept date formats with anywhere between 0 and 3 decimals for the seconds?
EDIT
I finally found the issue, which had nothing to do with the mapping, but rather with a pipeline processor date_index_name.
PUT _ingest/pipeline/test_reroute_pipeline
{
"description" : "Route documents to another index",
"processors" : [
{
"date_index_name": {
"field": "utcTime",
"date_rounding": "d",
"index_name_prefix": "rerouted-"
}
}
]
}
Because the date_formats parameter wasn't defined, the processor would remember the format of the first date it received. If that had 2 decimals, it would require 2 every time; if it had 3, it would require 3.
Specifying the date format solved the issue for good:
PUT _ingest/pipeline/test_reroute_pipeline
{
"description" : "Route documents to another index",
"processors" : [
{
"date_index_name": {
"field": "utcTime",
"date_rounding": "d",
"index_name_prefix": "rerouted-",
"date_formats": ["ISO8601"]
}
}
]
}
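A quick way to sanity-check the fixed pipeline is the simulate API; the document below is a made-up example using the previously rejected timestamp:
POST _ingest/pipeline/test_reroute_pipeline/_simulate
{
  "docs": [
    { "_source": { "utcTime": "2021-02-17T09:51:45.07Z" } }
  ]
}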
I just tried on a fresh new 7.10.1 cluster and it also accepted 1, 2, 3 decimals for the seconds part.
Looking at the error message you get
illegal_argument_exception: failed to parse date field [2021-02-17T09:51:45.07Z] with format [yyyy-MM-dd''T''HH:mm:ss.SSSXX]
The format that seems to be set is yyyy-MM-dd''T''HH:mm:ss.SSSXX, which is different from strict_date_optional_time_nanos, i.e. yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ.
If you check the real mapping from your index, I'm pretty sure the utcTime field doesn't have strict_date_optional_time_nanos as the format.
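For example, you can inspect the effective mapping of just that field (the index name here is a placeholder):
GET my-index/_mapping/field/utcTime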

CosmosDB query on date range + index

I have a cosmos DB whose size is around 100GB.
I successfully created a nice partition key; I have around 4,600 partitions over 70M records, but I still need to query on two datetime fields, which are stored as strings rather than in epoch format.
Example json:
"someField1": "UNKNOWN",
"someField2": "DATA",
"endDate": 7014541201,
"startDate": 7054864502,
"someField3": "0",
"someField3": "0",
I notice that when I do SELECT * FROM tbl versus SELECT * FROM tbl WHERE startDate > {someDate} AND endDate < {someDate1}, the latency difference is only around 1 s, so this filtering does not reduce my latency.
Is it better to store date types as numbers? Does Cosmos have better performance on epoch range queries?
I am using SQL API.
Also, when I try to add Hash indexes on startDate and endDate, each one is basically converted into two indexes.
Example:
"path": "/startDate/?",
"indexes": [
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
},
This is converted to:
"path": "/startDate/?",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
Is that normal behaviour, or is it related to my data?
Thanks.
I checked the query metrics, and a query for 4k records executes in CosmosDB in 100 ms. I would like to ask: is it normal behaviour that
var option = new FeedOptions { PartitionKey = new PartitionKey(partitionKey), MaxItemCount = -1};
var query= client.CreateDocumentQuery<MyModel>(collectionLink, option)
.Where(tl => tl.StartDate >= DateTimeToUnixTimestamp(startDate) && tl.EndDate <= DateTimeToUnixTimestamp(endDate))
.AsEnumerable().ToList();
this query returns 10k results (in Postman it's around 9 MB) in 10-12 s? This partition contains around 50k records.
Retrieved Document Count : 12,356
Retrieved Document Size : 12,963,709 bytes
Output Document Count : 3,633
Output Document Size : 3,819,608 bytes
Index Utilization : 29.00 %
Total Query Execution Time : 264.31 milliseconds
Query Compilation Time : 0.12 milliseconds
Logical Plan Build Time : 0.07 milliseconds
Physical Plan Build Time : 0.06 milliseconds
Query Optimization Time : 0.01 milliseconds
Index Lookup Time : 51.10 milliseconds
Document Load Time : 140.51 milliseconds
Runtime Execution Times
Query Engine Times : 55.61 milliseconds
System Function Execution Time : 0.00 milliseconds
User-defined Function Execution Time : 0.00 milliseconds
Document Write Time : 10.56 milliseconds
Client Side Metrics
Retry Count : 0
Request Charge : 904.73 RUs
I am from the CosmosDB engineering team.
Since your collection has 70M records, I assume that the observed latency is only on the first roundtrip (or first page) of results. Note that the observed latency can also be improved by tweaking FeedOptions.MaxDegreeOfParallelism to -1 when executing the query.
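For reference, a sketch of that tweak against the FeedOptions shown in the question (older DocumentDB SDK assumed; the parallelism setting mostly matters for cross-partition queries):
var option = new FeedOptions
{
    PartitionKey = new PartitionKey(partitionKey),
    MaxItemCount = -1,
    // Let the SDK pick the degree of parallelism for fetching result pages.
    MaxDegreeOfParallelism = -1
};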
Regarding the difference between the two queries themselves, please note that SELECT * without a filter is a full-scan query, which is probably a bit faster to return its first results compared to the query with two filters, which does a little more work on the local indexes across all the partitions; that may explain the observed latency.
Regarding your other question, we no longer support the Hash indexing policy on new collections. Please see here: https://learn.microsoft.com/en-us/azure/cosmos-db/index-types#index-kind . We automatically convert Hash indexes to Range with full precision.
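On current collections the equivalent is simply a Range index, which is the default. As a rough sketch, a minimal indexing policy covering only the two date paths might look like this (assuming nothing else needs to be indexed):
"indexingPolicy": {
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/startDate/?" },
    { "path": "/endDate/?" }
  ],
  "excludedPaths": [
    { "path": "/*" }
  ]
}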
You may also fetch QueryMetrics for your query and analyze the results to figure out why you have latency. Details are here: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-query-metrics#query-execution-metrics
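A sketch of fetching those metrics with the same SDK style as the code in the question (MyModel, collectionLink, partitionKey and DateTimeToUnixTimestamp are the question's own placeholders):
// Request per-query metrics; one entry is returned per physical partition.
var options = new FeedOptions
{
    PartitionKey = new PartitionKey(partitionKey),
    PopulateQueryMetrics = true
};
var query = client.CreateDocumentQuery<MyModel>(collectionLink, options)
    .Where(tl => tl.StartDate >= DateTimeToUnixTimestamp(startDate)
              && tl.EndDate <= DateTimeToUnixTimestamp(endDate))
    .AsDocumentQuery();
while (query.HasMoreResults)
{
    var page = await query.ExecuteNextAsync<MyModel>();
    foreach (var metrics in page.QueryMetrics)
    {
        Console.WriteLine($"{metrics.Key}: {metrics.Value}");
    }
}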

Firebase startAt String only takes first character

I have a structure like below under xyz
{
"pushKey000": {
"findKey": "john_1",
"userName": "john",
"topic": 1
},
"pushKey001": {
"findKey": "john_2",
"userName": "john",
"topic": 2
},
"pushKey002": {
"findKey": "joel_1",
"userName": "joel",
"topic": 1
}
}
Now I am trying to make a query to get the data of all entries whose findKey starts with "john". I tried the following (using REST as an example):
https://abc.firebaseio.com/xyz.json?orderBy="findKey"&startAt="john"
This gives me all the results including 'joel'. Basically it just uses the first character of startAt, in this case J.
This Firebase video fires the same type of query, but it also only searches by the first character.
Is there something wrong with what I am doing, or is there any other way to retrieve this using findKey? Thanks a lot in advance for the help.
PS: My .indexOn is on findKey and I can't change it.
There is nothing wrong with your code, there is something wrong with your expectations. (I always wanted to write that as an answer :))
The startAt() function works as a starting point for your query, not a filter. So in your case it will find the first occurrence of "john" and return everything from that point forward (including Joel, Kevin, Tim, etc.).
Unfortunately there is no direct way to do a query where findKey contains the string "john". But luckily there is a (partial) workaround using endAt().
Your query will look like this:
orderBy="findKey"&startAt="john"&endAt="john\uf8ff"
Here \uf8ff is a very high Unicode code point, so practically every string that starts with "john" sorts before "john\uf8ff" (please correct me if I'm wrong).
With this you can query for values that start with "john" like "johnnie", "johnn", "john". But not "1john" or "johm" or "joel".
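Putting it together with the REST call from the question (in a real URL the quotes and \uf8ff must be percent-encoded; \uf8ff becomes %EF%A3%BF):
https://abc.firebaseio.com/xyz.json?orderBy="findKey"&startAt="john"&endAt="john\uf8ff"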

How do I create a Slot that accepts a currency amount

I want to receive a dollar amount in my utterance. So, for example, if I ask Alexa:
Send $100.51 to Kroger.
(pronounced "one hundred dollars and fifty-one cents"), I want to receive the value 100.51 in a proper slot.
I have tried searching and I defined my utterance slots like this:
"slots": [
{
"name": "Amount",
"type": "AMAZON.NUMBER"
}
]
But on JSON input I only get this result:
"slots": {
"Amount": {
"name": "Amount",
"value": "?"
}
}
How can I make Alexa accept currency values?
I'm a bit confused by what you wrote in your last sentence and the code, but I'll confirm that there is no built-in intent or slot for handling currency.
So, you'll have to do it manually using AMAZON.NUMBER slot type as you seem to be trying.
I would imagine that you will want to create utterances with two AMAZON.NUMBER slots - one for dollars and one for cents.
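As a rough sketch of that idea in the interaction model (the slot names and sample utterance below are made up):
"samples": [
  "send {Dollars} dollars and {Cents} cents to kroger"
],
"slots": [
  { "name": "Dollars", "type": "AMAZON.NUMBER" },
  { "name": "Cents", "type": "AMAZON.NUMBER" }
]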
Easy: make a custom slot and just use $10.11, $.03, and $1003.84 as the samples. It will work as currency, accepting users' dollars-and-cents utterances and converting them to a dollar $XX.XX format.
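For example, a hypothetical custom slot type in the interaction model could look like this (the type name is made up):
"types": [
  {
    "name": "CURRENCY_AMOUNT",
    "values": [
      { "name": { "value": "$10.11" } },
      { "name": { "value": "$.03" } },
      { "name": { "value": "$1003.84" } }
    ]
  }
]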

Solr4 max() function seems to lose precision on TrieDateField values

I have documents that have two date fields, "published_date" and "updated_date". The updated_date field is empty until an update occurs. After there has been an update, I would like to use updated_date as the field to sort by. This is not the exact situation, but it is close enough, and indexing a single precomputed field is the least desirable solution.
So I'm trying to do "sort=max(published_date, updated_date) desc"
To validate the results I have "fl=max_date:max(published_date, updated_date), published_date, updated_date"
What I'm seeing is this:
"docs": [
{
"max_date": 1409953170000,
"published_date": "2014-09-05T21:39:01.322Z",
"updated_date": "2014-09-05T21:39:01.319Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:36:51.614Z",
"updated_date": "2014-09-05T21:36:51.611Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:38:01.111Z",
"updated_date": "2014-09-05T21:38:01.107Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:38:11.151Z",
"updated_date": "2014-09-05T21:38:11.148Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:37:36.202Z",
"updated_date": "2014-09-05T21:37:36.194Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:37:41.92Z",
"updated_date": "2014-09-05T21:37:41.915Z"
}, ...
So you can see that the max_date being sorted on does not have the same precision as the underlying timestamps. The results are out of order, and the result from max() clearly has room for more precision, since it always ends in four zeros.
So how do I make this work? Or is there a bug in solr's conversion functions?
UPDATE:
So it seems, from lucene-solr-lucene_solr_4_5_0/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java, that max() is implemented by casting its arguments to floatVal; since dates are stored as longs in a TrieField, precision is clearly being lost.
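For a sense of scale: a single-precision float has a 24-bit significand, so around 1.4e12 (epoch milliseconds in 2014) adjacent representable values are roughly 2^17 ≈ 131,000 ms apart, which is consistent with the roughly two-minute buckets visible in the max_date values above.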
Trie*-fields have their precision set by using precisionStep on the field. That way you can get the precision you need for the specific usage. Using 64 as precisionStep will in effect make the field a regular long/Date-field, where you'll have only the exact value available in your function queries (as Trie-fields index several tokens otherwise, to make the fast range search work).
Changing this will however not make the field any faster for range queries, so you might want to have one field for sorting and one for range queries (if needed).
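A sketch of what that suggestion might look like in schema.xml (the type and field names here are made up; precisionStep="0" indexes only the single full-precision token):
<fieldType name="tdate_exact" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<field name="published_date_exact" type="tdate_exact" indexed="true" stored="false"/>
<field name="updated_date_exact" type="tdate_exact" indexed="true" stored="false"/>
<copyField source="published_date" dest="published_date_exact"/>
<copyField source="updated_date" dest="updated_date_exact"/>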
I've opened a bug with SOLR to address the issue.
https://issues.apache.org/jira/browse/SOLR-6490
