What happens when a top-k query does not find enough documents to satisfy the k constraint? - information-retrieval

I am evaluating a top-k range query using NDCG. Given a spatial area and a query keyword, my top-k range query must return k documents in the given area that are textually relevant to the query keyword.
In my scenario, the range query usually finds only one document to return. But I have to compare this query to another one that can find more objects in the given area with the same keyword. This is possible because of an approach I am testing that improves the objects' descriptions.
I cannot figure out how to use NDCG to compare these two queries in this scenario. I would like to compare Query A and Query B using NDCG@5 and NDCG@10, but Query A only finds one object. Query A will have a high NDCG value precisely because of its lower ability to find more objects (probably the value will be one, the maximum). Query B finds more objects (in my opinion, a better solution) but has a lower NDCG value than Query A.

You can consider looking at a different measure, e.g. Recall@10, if you care less about the ranking for your application.
NDCG is a measure designed for web search, where you really want to penalize a system that doesn't return the best item as the topmost result, which is why it discounts each result by its rank position. This makes sense for navigational queries like "stackoverflow": you will look quite bad if you don't return this website first.
It sounds like you are building something a little more sophisticated, where the user cares about many results. Therefore, a more recall-oriented measure (that cares about getting multiple things right more than the ranking) may make more sense.
Regarding "its lower ability to find more objects": I'd also double-check your implementation of NDCG. You always want to divide by the ideal ranking, regardless of what actually gets returned. It sounds like your Query A returns 1 correct object, while Query B returns more correct objects, but not at high ranks? Either way, Query A's DCG should be divided by the DCG of a perfect ranking -- which may contain 10, 20, or thousands of "correct" objects. It may be that you just don't have enough judgments, so your "perfect ranking" is too small, and therefore you aren't penalizing Query A enough.
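To make the normalization point concrete, here is a minimal Python sketch of NDCG@k (using the common exponential-gain formulation; the relevance judgments are made up for illustration). The key detail is that the ideal DCG is computed over all judged documents, not just the ones a query returned:

```python
import math

def dcg(relevances):
    # Standard DCG with exponential gain: (2^rel - 1) / log2(rank + 1),
    # with ranks starting at 1.
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(returned_rels, all_judged_rels, k):
    # Normalize by the ideal ranking over ALL judged documents,
    # not just the documents the system actually returned.
    ideal = sorted(all_judged_rels, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0
    return dcg(returned_rels[:k]) / ideal_dcg

# Query A returns one relevant document, but five relevant documents
# exist in the judgments: NDCG@5 is well below the maximum of 1.0.
print(ndcg_at_k([1], [1, 1, 1, 1, 1], 5))
```

If the judgment pool for the query contained only that single document, the same call would return 1.0, which matches the inflated score described in the question.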

Gremlin search for vertexes related to 2 or more specific nodes

I'm trying to produce a Gremlin query where I need to find vertices which have edges from specific other vertices. The less abstract version: I have user vertices, and those are related to group vertices (i.e. subjects in a school, so students who are in "Year 6 Maths" and "Year 6 English", etc.). An extra difficulty is that subgroups can exist in this query.
The query needs to find those users who are in 2 or more groups specified by the user.
Currently I have a working solution, but in production usage on Amazon Neptune this query performs far too poorly, even with a small amount of data. I'm sure there's a simpler way of achieving this.
g.V()
.has('id', 'group_1')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.aggregate("q-1")
.V()
.has('id', 'group_2')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.where(within("q-1"))
.aggregate("q-2")
.V()
.hasLabel("USER")
.where(within("q-2"))
// We add some more filtering here, such as search terms
.dedup()
.range(0, 10)
.values("id")
.toList()
The first major change you can make is to not bother iterating all of V() again for "USER" - that's already the output of the prior steps, so collecting "q-2" just to use it as a filter doesn't seem necessary:
g.V().
has('id', 'group_1').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
aggregate("q-1").
V().
has('id', 'group_2').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
where(within("q-1")).
// We add some more filtering here, such as search terms
dedup().
range(0, 10).
values("id")
That should already be a huge saving for your query, because the change avoids iterating the entire graph in memory (i.e. a full scan of all vertices), as there was no index lookup there.
I don't know what your additional filters are here:
# We add some more filtering here, such as search terms
but I would definitely try to filter the users earlier in the query rather than later. Perhaps consider using emit() on your repeat() steps to filter better. You should probably also dedup() your "q-1" to reduce the size of that list.
I'd be curious to know how much the initial change alone helps, as that was probably the bulk of your query cost (unless you have a really deep/wide tree of students/subgroups, I guess). Perhaps there is more that could be tweaked here, but it would be nice to know that you at least have a traversal with satisfactory performance at this point.

API design: naming "I want one more value outside time boundaries"

I'm designing an API to query the history of a value over a time period. Think of a temperature value, where you want to query all of its values for today.
I have a from and a to parameter to specify the boundaries of the query.
The values available may not exactly match the boundaries requested. For example, if from is 2016-02-17T00:00:00Z, the first value may be at 2016-02-17T00:04:30Z. To fully draw a graph of the period, it is necessary to retrieve one more value outside the given range. The value at 2016-02-16T23:59:30Z is useful, and it would be convenient for the user not to have to make another query to retrieve it.
So as the API designer I'm thinking about a parameter holding a pair of boolean values that would say, for each boundary: give me one more value if there is no value exactly on the boundary.
My question is how to name this parameter as English is not my native language.
Here are a few ideas I have so far but with which I'm not totally satisfied:
overflow=true,true
overstep=true,true
edges=true,true
I would also appreciate any links to existing APIs with that feature, either web API or in programming languages.
Is it possible to make this more of a function/RPC than a traditional REST resource endpoint? So rather than requesting data for a resource between 2 dates like
/myResource?from=x&to=x
something more like
/getGraphData?graphFrom=x&graphTo=x
Whilst it's only a naming thing, it makes it a bit more acceptable to retrieve results for a task wrapped with outer data, rather than bending the parameters' meaning and potentially giving unexpected or confusing results.
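Whatever the parameter ends up being called, its semantics might look like this minimal Python sketch (the names query_range, pad_before and pad_after are purely illustrative): include one extra sample outside a boundary only when no sample falls exactly on that boundary.

```python
from bisect import bisect_left, bisect_right

def query_range(samples, start, end, pad_before=True, pad_after=True):
    """Return the (timestamp, value) pairs in [start, end].

    pad_before/pad_after are the hypothetical boolean flags discussed
    above: include one extra sample outside a boundary when no sample
    falls exactly on it, so a graph can be drawn edge to edge.
    """
    times = [t for t, _ in samples]  # samples must be sorted by time
    lo = bisect_left(times, start)
    hi = bisect_right(times, end)
    if pad_before and (lo == len(times) or times[lo] != start) and lo > 0:
        lo -= 1
    if pad_after and (hi == 0 or times[hi - 1] != end) and hi < len(times):
        hi += 1
    return samples[lo:hi]

data = [(0, 10.0), (270, 10.5), (570, 11.0)]   # (seconds, temperature)
# No sample lies exactly at t=100, so the t=0 sample is included as padding.
print(query_range(data, 100, 600))   # → [(0, 10.0), (270, 10.5), (570, 11.0)]
```

When a sample lands exactly on a boundary, no padding is added, which matches the "if there is no value exactly on the boundary" wording in the question.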

Why is a HashSet good for search operations?

A HashSet's underlying data structure is a hashtable. How does it identify duplicates, and why is it good when our most frequent operation is search?
It uses the hash code of the object, which is a quickly computed integer. This hash code tries to be distributed as evenly as possible over all potential object values.
As a result, it can distribute the inserted values into an array (the hashtable) with a very low probability of collision. The search operation is then quite quick: compute the hash code, index into the array, compare, and get the value - usually constant time. The same actually happens when finding duplicates.
Hash code collisions are resolved as well - there can potentially be more than one value for the same entry within the hash table - and that is where equals comes into play. But collisions are rare, so they don't affect average performance significantly.
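A toy Python sketch of the mechanism described above (real implementations are more involved, with resizing and better collision handling): the hash code picks a bucket, and only that small bucket is scanned with equality checks, which is also how duplicates are rejected.

```python
class TinyHashSet:
    """Minimal sketch of how a hash set finds items and rejects duplicates."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, value):
        # hash() gives a quickly computed integer; modulo maps it to a slot.
        return self.buckets[hash(value) % len(self.buckets)]

    def add(self, value):
        bucket = self._bucket(value)
        # Only the (usually tiny) bucket is scanned with equality checks,
        # which is also how duplicates are detected and rejected.
        if value not in bucket:
            bucket.append(value)

    def __contains__(self, value):
        return value in self._bucket(value)

s = TinyHashSet()
for word in ["apple", "banana", "apple"]:   # duplicate "apple" stored once
    s.add(word)
print("apple" in s, "cherry" in s)   # → True False
```

Both add and search touch one bucket, so as long as values spread evenly across buckets, both run in roughly constant time on average.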

How to retrieve a row's position within a DynamoDB global secondary index and the total?

I'm implementing a leaderboard backed by DynamoDB and a Global Secondary Index, as described in the developer guide: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
But two things that are very necessary for a leaderboard system are your position within it and the total number of entries, so you can show #1 of 2000, or similar.
Using the index, the rows are sorted the correct way, and I'd assume these calls would be cheap enough to make, but I haven't yet been able to find a way to do it in their docs. I really hope I don't have to fetch the entire table every single time to know where a person is positioned in it, or to count the entire table (although if that's not available, the count could be delayed, calculated, and stored outside of the table at scheduled intervals).
I know DescribeTable gives you information about the entire table, but I would be applying filters to the range key, so that wouldn't suit this purpose.
I am not aware of any efficient way to get the ranking of a player. The naive way is to do a query starting from the player with the highest points, move downward, and keep incrementing your counter until you reach the target player. So for the user with the lowest points, you might end up scanning the whole range.
That being said, you can still get the top 100 players with no problem (the leaders). Just do a query starting from the player with the highest points, and set the query limit to 100.
Also, for a given player, you can get the 100 players around him with similar points. You just need to do two queries:
query with hashkey="" and rangekey <= his point, limit 50
query with hashkey="" and rangekey >= his point, limit 50
This was the exact same problem we were facing when we were developing our app. Following are two solutions we came up with to deal with it:
Query your index with ScanIndexForward=false, which will give you the top players (assuming your score/points attribute is the range key), with a limit of, say, 1000. Then apply the formula y = mx + b, taking two data points (typically the first and the last value) to find m and b, where x is points and y is rank. Based on this you can get a rank from a user's points. The result will not be the exact rank, only an approximation; Google does the same when you search in Gmail: it shows an approximate count, not the exact value, on the first call.
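A small Python sketch of that y = mx + b idea (the data is synthetic; with real score distributions the estimate is only approximate):

```python
def approximate_rank(top_scores, points):
    # Fit y = mx + b through the first and last fetched players,
    # where x is a player's points and y is their rank.
    y1, x1 = 1, top_scores[0]                    # rank 1, highest score
    y2, x2 = len(top_scores), top_scores[-1]     # last fetched rank and score
    m = (y2 - y1) / (x2 - x1)                    # rank change per point
    b = y1 - m * x1
    return round(m * points + b)

# Synthetic data: 1000 players with scores 1000 down to 1. Because these
# scores are exactly linear, the estimate happens to be exact here.
scores = list(range(1000, 0, -1))
print(approximate_rank(scores, 500))   # → 501
```

The accuracy depends entirely on how close the real score distribution is to a straight line; fitting over more than two points, or piecewise, would tighten the estimate.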
Get all the records and store them in a cache until the next update. This is by far the best and least expensive approach we are using.
The beauty of DynamoDB is that it is highly optimized for very specific (and common) use cases. The cost of this optimization is that many other use cases cannot be achieved as easily as with other databases. Unfortunately yours is one of them. That being said, there are perfectly valid and good ways to do this with DynamoDB. I happen to have built an application that has the same requirement as yours.
What you can do is enable DynamoDB Streams on your table and process item update events with a Lambda function. Every time the number of points for a user changes you re-compute their rank and update your item. Even if you use the same scan operation to re-compute the rank, this is still much better, because it moves the bulk of the cost from your read operation to your write operation, which is kind of the point of NoSQL in the first place. This approach also keeps your point updates fast and eventually consistent (the rank will not update immediately, but is guaranteed to update properly unless there's an issue with your Lambda function).
I recommend going with this approach, and once you reach scale, optimizing by caching your users by rank in something like Redis, unless you have prior experience with it and can set it up quickly. Pick whatever is simplest first. If you are concerned about your leaderboard changing too often, you can reduce the cost by only re-computing the ranks of the first, say, 100 users, and scheduling another Lambda function to run every few minutes that scans all users and updates their ranks all at once.

How tightly can Marklogic search scores be controlled?

Our database contains documents with a lot of metadata, including relationships between those documents. Fictional example:
<document>
<metadata>
<document-number>ID 12345 : 2012</document-number>
<publication-year>2012</publication-year>
<cross-reference>ID 67890 : 1995</cross-reference>
<cross-reference>ID 67890 : 1998</cross-reference>
<cross-reference>ID 67891 : 2000</cross-reference>
<cross-reference>ID 12345 : 2004</cross-reference>
<supersedes>ID 12345 : 2004</supersedes>
...
</metadata>
</document>
<document>
<metadata>
<document-number>ID 12345 : 2004</document-number>
<publication-year>2004</publication-year>
<cross-reference>ID 67890 : 1995</cross-reference>
<cross-reference>ID 67890 : 1998</cross-reference>
<cross-reference>ID 67891 : 2000</cross-reference>
<cross-reference>ID 12345 : 2012</cross-reference>
<cross-reference>ID 12345 : 2001</cross-reference>
<superseded-by>ID 12345 : 2012</superseded-by>
<supersedes>ID 12345 : 2001</supersedes>
...
</metadata>
</document>
We're using a 1-box search, based on the MarkLogic Search API, to allow users to search these documents. The search grammar describes a variety of constraints and search options, but mostly (and by default) users search via a field defined to include most of the metadata elements, with (somewhat) carefully chosen weights (what really matters here is that document-number has the highest weight).
The problem is that the business wants quite specific ordering of results, and I can't think of a way to achieve it using the search api.
The requirement that's causing trouble is that if the user's search matches a document number (say they search for "12345"), then all documents with that document number should be at the top of the result set, ordered by descending date. It's easy enough to get them to the top of the result set; document-number has the highest weight, so sorting by score works fine. The problem is that the secondary sort by date doesn't work: even though all the document-number matches have higher scores than other documents, they don't have the same score, so they end up ordered by how often the search term appears in the rest of the metadata, which isn't really meaningful at all.
What I think we really need is a way of having the search api score results simply by the highest weighted element that matches the search-term, without reference to any other matches in the document. I've had a look at the scoring algorithms and can't see one that does that; have I missed something or is this just not possible? Obviously, it doesn't have to be score that we order by; if there's some other way to get at the score of the single best match in a document and use it for sorting, that would be fine.
Is there some other solution I haven't even thought of?
I thought of doing two searches (one on document-number, and one on the whole metadata tree) and then combining the results, but that seems like it's going to cause a lot of pain with pagination and performance. Which sort-of defeats the purpose of using the search api in the first place.
I should add that it is correct to have those other matches in the result-set, so we can't just search only on document-number.
I think you've reached the limits of what the high-level search API can do for you. I have a few tricks to suggest, though. These won't be 100% robust, but they might be good enough for the business. Then you can get on with the application. Sorry if I sound cynical or dismissive, but I don't believe in micromanaging search results.
Simplest possible: re-sort the first page in memory. That first page could be a bit larger than the page you show to the user. Because it is still limited in size, you can make the rules for this fairly complex without suffering much. That would fix your 'descending date' problem. The results from page 1 wouldn't quite match up with page 2, but that might be good enough.
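As a rough Python sketch of that first-page re-sort (the record fields here are illustrative, not the actual search:search response format): exact document-number matches move to the front, newest first, while everything else keeps its original score order.

```python
from datetime import date

def resort_first_page(results, query):
    # Over-fetch page 1, then re-sort it in memory: exact document-number
    # matches come first, newest first; non-matches keep their score order.
    def sort_key(indexed):
        i, r = indexed
        if query in r["document-number"]:
            return (0, -r["date"].toordinal())   # matches: date descending
        return (1, i)                            # non-matches: original order
    return [r for _, r in sorted(enumerate(results), key=sort_key)]

page = [
    {"document-number": "ID 12345 : 2004", "date": date(2004, 1, 1)},
    {"document-number": "ID 99999 : 2010", "date": date(2010, 1, 1)},
    {"document-number": "ID 12345 : 2012", "date": date(2012, 1, 1)},
]
print([r["document-number"] for r in resort_first_page(page, "12345")])
# → ['ID 12345 : 2012', 'ID 12345 : 2004', 'ID 99999 : 2010']
```

Because the re-sort only touches the over-fetched first page, its cost stays bounded no matter how complex the rules get.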
Taking the next step up in complexity, consider using document quality to handle the descending-date issue. This approach is used by http://markmail.org among others. As each document is inserted or updated, set its document quality using a number derived from the date. This could be days or weeks or months since 1970 or some other fixed epoch. Newer results will tend to float to the top. Unless other boosts swamp the date-based boost, you might get close to what you want.
There might also be some use in analyzing the query to extract the potentially boosting terms. If necessary you could then begin a recursive run of xdmp:exists(cts:search(doc(), $query)) on each boosting term as if it were a standalone query. Bail out as soon as you find a true() result: that means you are going to boost that query term with an absurdly high weight to make it float to the top.
Once you know what the boosting term is, rewrite the entire query to set all other term weights to much lower values, perhaps even 0. The lower the weight, the less those non-boosting terms will interfere with the date-based quality and the boosting weight. If there is no boosting term, you might want to make other adjustments. All this is less expensive than it sounds, by the way. Aside from the xdmp:exists calls, it's just in-memory expression evaluation.
Again, though, these are all just tricks to nudge the scores. They won't give you the absolute control over ranking that you're looking for. In my experience, attempts to micromanage scores are doomed to failure. My bet is that your users would be happier with raw TF/IDF, whatever your business managers say.
Another way to do it is to use two searches, as you suggest. Put a range index on document-number (and ideally the document date), extract any potential document-number values from the query (search:parse, extract, then search:resolve is a good strategy), then execute a cts:element-range-query for docs matching those document-number values with date descending. If there aren't enough results to fill up your N-result page, then get the next N-x results from search api. You can keep track of the documents that were returned in the first result set and exclude those URIs from the second one. Keeping track of the pagination won't be too bad.
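The bookkeeping for merging the two result sets might look like this Python sketch, with results reduced to plain URI strings for illustration: fill pages from the range-query results first, then from the Search API results, skipping URIs already returned.

```python
def merge_pages(primary, secondary, page_size):
    # primary: document-number matches, already ordered by date descending.
    # secondary: general search-api results, ordered by score.
    seen = set()
    merged = []
    for uri in primary + secondary:
        if uri not in seen:          # exclude URIs the first set already returned
            seen.add(uri)
            merged.append(uri)
    # Page n is then a simple slice of the merged ordering.
    return [merged[i:i + page_size] for i in range(0, len(merged), page_size)]

# "a" appears in both result sets but is returned only once.
print(merge_pages(["a", "b"], ["c", "a", "d"], 2))   # → [['a', 'b'], ['c', 'd']]
```

In practice you would fetch only enough of each underlying search to fill the requested page rather than materializing both result sets up front, but the dedup-then-slice bookkeeping stays the same.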
This might not perform as well as the first solution, but the time difference for the additional range-index query combined with a shorter Search API query should be negligible in most cases.
