I have documents with two date fields, "published_date" and "updated_date". The updated_date field is empty until an update occurs. Once an update has happened, I would like to sort by updated_date instead. This is not the exact situation, but it is close enough, and indexing a single, always-correct field is the least desirable solution.
So I'm trying to do "sort=max(published_date, updated_date) desc"
To validate the results I have "fl=max_date:max(published_date, updated_date), published_date, updated_date"
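Put together, the request parameters look roughly like this (other parameters omitted; the space inside the sort function may need URL-encoding in a real request):
sort=max(published_date,updated_date) desc
fl=max_date:max(published_date,updated_date),published_date,updated_date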
What I'm seeing is this:
"docs": [
{
"max_date": 1409953170000,
"published_date": "2014-09-05T21:39:01.322Z",
"updated_date": "2014-09-05T21:39:01.319Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:36:51.614Z",
"updated_date": "2014-09-05T21:36:51.611Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:38:01.111Z",
"updated_date": "2014-09-05T21:38:01.107Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:38:11.151Z",
"updated_date": "2014-09-05T21:38:11.148Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:37:36.202Z",
"updated_date": "2014-09-05T21:37:36.194Z"
},
{
"max_date": 1409953040000,
"published_date": "2014-09-05T21:37:41.92Z",
"updated_date": "2014-09-05T21:37:41.915Z"
}, ...
So you can see that the max_date being sorted on does not have the same precision as the underlying timestamps. The results are out of order, and the value returned by max() clearly has room for more precision, since it always ends in four zeros.
So how do I make this work? Or is there a bug in Solr's conversion functions?
UPDATE:
Looking at lucene-solr-lucene_solr_4_5_0/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java, it seems that max() is implemented by casting its arguments to floatVal. Since dates are stored as longs in a TrieField, precision is clearly being lost.
Trie*-fields have their precision set by using precisionStep on the field. That way you can get the precision you need for the specific usage. Using 64 as precisionStep will in effect make the field a regular long/Date-field, where you'll have only the exact value available in your function queries (as Trie-fields index several tokens otherwise, to make the fast range search work).
Changing this will, however, mean the field no longer supports fast range queries, so you might want to have one field for sorting and one for range queries (if needed).
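For illustration, a schema.xml definition along those lines might look like the sketch below; the tdate_sort type and the *_sort field names are made up for this example, not taken from the original setup:
<fieldType name="tdate_sort" class="solr.TrieDateField" precisionStep="64" positionIncrementGap="0"/>
<!-- separate fields used only for sorting; keep the original trie fields for range queries -->
<field name="published_date_sort" type="tdate_sort" indexed="true" stored="false"/>
<field name="updated_date_sort" type="tdate_sort" indexed="true" stored="false"/>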
I've opened a bug with SOLR to address the issue.
https://issues.apache.org/jira/browse/SOLR-6490
I have a table with PK (String) and SK (Integer) - e.g.
PK_id SK_version Data
-------------------------------------------------------
c3d4cfc8-8985-4e5... 1 First version
c3d4cfc8-8985-4e5... 2 Second version
I can do a conditional insert to ensure we don't overwrite the PK/SK pair using a ConditionExpression (in the Go SDK):
putWriteItem := dynamodb.Put{
    TableName:           aws.String("example_table"),
    Item:                itemMap,
    ConditionExpression: aws.String("attribute_not_exists(PK_id) AND attribute_not_exists(SK_version)"),
}
However, I would also like to ensure that SK_version is always consecutive, but I don't know how to write the expression. In pseudo-code this is:
putWriteItem := dynamodb.Put{
    TableName:           aws.String("example_table"),
    Item:                itemMap,
    ConditionExpression: aws.String("attribute_not_exists(PK_id) AND attribute_not_exists(SK_version) **AND attribute_exists(SK_version = :SK_prev_version)**"),
}
Can someone advise how I can write this?
In SQL I'd do something like:
INSERT INTO example_table (PK_id, SK_version, Data)
SELECT {pk}, {sk}, {data}
WHERE NOT EXISTS (
SELECT 1
FROM example_table
WHERE PK_id = {pk}
AND SK_version = {sk}
)
AND EXISTS (
SELECT 1
FROM example_table
WHERE PK_id = {pk}
AND SK_version = {sk} - 1
)
Thanks
A condition check is applied to a single item; it cannot span multiple items. In other words, you simply need multiple condition checks. DynamoDB has a TransactWriteItems API that performs multiple condition checks along with writes/deletes. The code below is in Node.js.
const previousVersionCheck = {
TableName: 'example_table',
Key: {
PK_id: 'prev_pk_id',
SK_version: 'prev_sk_version'
},
ConditionExpression: 'attribute_exists(PK_id)'
}
const newVersionPut = {
TableName: 'example_table',
Item: {
// your item data
},
ConditionExpression: 'attribute_not_exists(PK_id)'
}
await documentClient.transactWrite({
TransactItems: [
{ ConditionCheck: previousVersionCheck },
{ Put: newVersionPut }
]
}).promise()
The transaction has two operations: one is a validation against the previous version, and the other is a conditional write. If either condition check fails, the whole transaction fails.
You are hitting your head on some of the differences between a SQL and a NoSQL database. DynamoDB is, of course, a NoSQL database. It does not, out of the box, support optimistic locking. I see two straightforward options:
Use a software layer to give you locking on your DynamoDB table. This may or may not be feasible depending on how often your table is updated. How fast 'versions' are generated, and the maximum time your application can be gated on the lock, will likely tell you whether this can work for you. I am not familiar with Go, but the Java API supports this. Again, this isn't a built-in feature of DynamoDB. If there is no Go equivalent, you could use the technique described in the link to 'lock' the table for updates. Generally speaking, locking a NoSQL DB isn't a typical pattern, as it isn't exactly what the technology was created to do (part of which is achieving large scale on unstructured documents so many consumers can access them quickly at once).
Stop using an incrementing counter to guarantee uniqueness. Counters are generally frowned upon in DynamoDB, partly because of the lack of intrinsic support for them and partly because of how DynamoDB shards data: you don't want a lot of similarity between records. Using a UUID will solve the uniqueness problem, but if you are porting an existing application, that means more changes to the code that creates the ID and to the code that reads it (perhaps adding a creation-time field so you can tell which record is newest, or prepending or appending an epoch time to the UUID to do the same). Here is a pertinent link to a SO question explaining why to use UUIDs instead of incrementing integers.
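As an illustration only (not code from this answer, just a sketch of the epoch-plus-UUID idea, assuming a recent Node.js that provides crypto.randomUUID):
const crypto = require('crypto');

// Prepend the creation time so records for the same PK can be ordered by recency,
// while the UUID part guarantees uniqueness without a counter.
const versionId = `${Date.now()}_${crypto.randomUUID()}`;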
Based on Hung Tran's answer, here is a Go example:
checkItem := dynamodb.TransactWriteItem{
    ConditionCheck: &dynamodb.ConditionCheck{
        TableName:           aws.String("example_table"),
        ConditionExpression: aws.String("attribute_exists(pk_id) AND attribute_exists(version)"),
        Key:                 map[string]*dynamodb.AttributeValue{"pk_id": {S: id}, "version": {N: prevVer}},
    },
}
putItem := dynamodb.TransactWriteItem{
    Put: &dynamodb.Put{
        TableName:           aws.String("example_table"),
        ConditionExpression: aws.String("attribute_not_exists(pk_id) AND attribute_not_exists(version)"),
        Item:                data,
    },
}
writeItems := []*dynamodb.TransactWriteItem{&checkItem, &putItem}
if _, err := db.TransactWriteItems(&dynamodb.TransactWriteItemsInput{TransactItems: writeItems}); err != nil {
    // Either condition failed: the previous version does not exist,
    // or this version has already been written.
}
I have the following JSON:
{
"Dialog_1": {
"en": {
"label_1595938607000": "Label1",
"newLabel": "Label2"
}
}
}
I want to extract "Label1" by using JSONPath. The problem is that each time I get a JSON with a different number after "label_", and I'm looking for a consistent JSONPath expression that will return the value for any key that begins with "label_" (without knowing in advance the number after the underscore).
It is not possible with JSONPath; EL (Expression Language) does not have such a capability.
Besides, I think you need to review your design. Why would the key name change all the time? If it changes, then it is data, and it should be stored as a value rather than as a key name.
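If you can post-process the parsed JSON in ordinary code rather than in a JSONPath expression, a small sketch like this (plain JavaScript, using the structure from the question and assuming the raw document is in a string called jsonString) picks out the value of the first key that starts with "label_":
const data = JSON.parse(jsonString);
const en = data["Dialog_1"]["en"];

// Find the first key beginning with "label_", whatever number follows the underscore.
const labelKey = Object.keys(en).find(k => k.startsWith("label_"));
const label1 = labelKey !== undefined ? en[labelKey] : undefined; // "Label1"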
We are using Pinot's HLL support and were advised to switch from fasthll to distinctcounthll, but the counts we get are very different: with the same conditions there is a roughly 1000x difference.
Example:
SELECT fasthll(my_hll), distinctcounthll(my_hll)
FROM counts_table WHERE timestamp >= 1500768000
I get results:
"aggregationResults": [
{
"function": "fastHLL_my_hll",
"value": "68685244"
}, {
"function": "distinctCountHLL_my_hll",
"value": "50535"
}]
Could anyone explain what the big difference between them is?
Please refer to pinot-issue-5153.
FastHll converts each string into a HyperLogLog object, which may itself represent thousands of unique values. DistinctCountHLL treats the string as a plain value, not as a HyperLogLog object, so it returns an approximation of how many unique serialized HyperLogLog strings there are; that value should be close to the total number of records scanned.
fasthll is deprecated because of the poor deserialization performance. You can instead store the serialized HyperLogLog in a BYTES column, generated with org.apache.pinot.core.common.ObjectSerDeUtils.HYPER_LOG_LOG_SER_DE.serialize(hyperLogLog), and query it with distinctcounthll.
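Assuming the serialized bytes end up in a column called my_hll_bytes (a made-up name), the query would then look something like:
SELECT distinctCountHLL(my_hll_bytes)
FROM counts_table
WHERE timestamp >= 1500768000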
I have a structure like below under xyz
{
"pushKey000": {
"findKey": "john_1",
"userName": "john",
"topic": 1
},
"pushKey001": {
"findKey": "john_2",
"userName": "john",
"topic": 2
},
"pushKey002": {
"findKey": "joel_1",
"userName": "joel",
"topic": 1
}
}
Now I am trying to make a query that returns all entries whose findKey starts with "john". I tried the following (using REST as an example):
https://abc.firebaseio.com/xyz.json?orderBy="findKey"&startAt="john"
This gives me all the results, including 'joel'. Basically it seems to use only the first character of startAt, in this case 'j'.
This Firebase video fires the same type of query, but it also only searches on the first character.
Is there something wrong with what I am doing, or is there any other way to retrieve this using findKey? Thanks a lot for the help in advance.
PS: My .indexOn is on findKey and I can't change it.
There is nothing wrong with your code, there is something wrong with your expectations. (I always wanted to write that as an answer :))
The startAt() function works as a starting point for your query, not a filter. So in your case it will find the first occurrence of "john" and return everything from that point forward (including Joel, Kevin, Tim, etc.).
Unfortunately there is no direct way to do a query where findKey contains the string "john". But luckily there is a (partial) workaround using endAt().
Your query will look like this:
orderBy="findKey"&startAt="john"&endAt="john\uf8ff"
Here \uf8ff is a very high Unicode code point, commonly used as the upper bound in this kind of Firebase prefix query (please correct me if I'm wrong).
With this you can query for values that start with "john" like "johnnie", "johnn", "john". But not "1john" or "johm" or "joel".
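Applied to the REST call from the question, the full request would look like this (in a real request the \uf8ff character has to be URL-encoded, e.g. as %EF%A3%BF):
https://abc.firebaseio.com/xyz.json?orderBy="findKey"&startAt="john"&endAt="john\uf8ff"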
I am trying to learn how to use map/reduce functions with Couchbase. Until now I built report engines on SQL, using WHERE clauses with multiple terms (adding and removing terms) and modifying the GROUP BY part.
I am trying to create this report engine using views.
My problem is how to create a report that enables users to drill down into more and more detail, all the way to individual IP stats.
For example: how many clicks were there today? From which traffic source? What did they see? From which country? And so on.
My basic doc for this example looks like this:
"1"
{
"date": "2014-01-13 10:00:00",
"ip": "111.222.333.444",
"country": "US",
"source":"1",
}
"2"
{
"date": "2014-01-13 10:00:00",
"ip": "555.222.333.444",
"country": "US",
"source":"1",
}
"3"
{
"date": "2014-01-13 11:00:00",
"ip": "111.888.888.888",
"country": "US",
"source":"2",
}
"4"
{
"date": "2014-01-13 11:00:00",
"ip": "111.777.777.777",
"country": "US",
"source":"1",
}
So, on the first screen, I want the user to see how many clicks per day there are on this site.
So I need to count the number of clicks. A simple map/reduce:
MAP:
function (doc, meta) {
  emit(dateToArray(doc.date), 1);
}
Reduce:
_count
With group=true and group_level=4, this will give the sum of clicks per hour.
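For reference, querying the view with those options over the views REST API looks roughly like this (the bucket, design document, and view names are made up, and 8092 is the default views port):
GET http://localhost:8092/default/_design/clicks/_view/clicks_by_date?group=true&group_level=4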
Now, if I want to allow a breakdown by country, I need a dynamic parameter to change. From what I understand, that can only be the group level.
So assume I have added the source to the emit like this:
emit([dateToArray(doc.date), doc.source], 1);
Then grouping at level 5 will allow this split, and I can use the key to focus on a certain date. But what if I need to add a country breakdown? Do I add that to the emit as well?
This seems messy, and what if I want country stats before the source? Is there a smarter way to do this?
Second part...
What if I want to get the first count as follows:
[2014,1,28,10] {ip:"555.222.333.444","111.222.333.444","count":"2"}
I want to see all the IPs that are counted for this time.
How should I write my reduce function?
This is my current attempt, which doesn't work:
function (key, values, rereduce) {
  var result = {id: 0, count: 0};
  for (i = 0; i < values.length; i++) {
    if (rereduce) {
      result.id = result.id + (values[i]).ip + ',';
      result.count = result.count + values[i].count;
    } else {
      result.id = values.ip;
      result.count = values.length;
    }
  }
  return result;
}
I didn't get the answer format I was looking for.
I hope this is not too messy and that you can help me with this.
Thanks!!
For the first part of your question, I think you are on the right track. That is how you break down views to enable coarse drill-down. However, it is important to remember that views are not intended to store your entire documents, nor are they necessarily going to give you a clean-cut slice of data. You will probably need to do fine filtering within the access layer of your code (using LINQ, perhaps).
For the second part of your question, a reduce is not the appropriate mechanism to accomplish this. Reduce values have a very finite (and limited) size and will crash the map/reduce engine once they get too big. I suspect you have experimented with that and discovered this for yourself.
The way you worded the question, it seems like you wish to search for all IP addresses that have been counted "X" number of times. This cannot be accomplished directly in Couchbase's map/reduce architecture; however, if you simply want the count for a given IP address, that is something the map/reduce framework has built-in (just use Date + IP as a key).
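A minimal sketch of that suggestion, reusing the map-function style from the question (the exact key layout is just one possible arrangement, not the only way to do it):
function (doc, meta) {
  // Date components first, then the IP, so you can group by date and still read the IP off the key.
  emit(dateToArray(doc.date).concat([doc.ip]), 1);
}
With the built-in _count reduce and an appropriate group_level, this gives the click count per IP for a given hour or day.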