Our database contains documents with a lot of metadata, including relationships between those documents. Fictional example:
<document>
<metadata>
<document-number>ID 12345 : 2012</document-number>
<publication-year>2012</publication-year>
<cross-reference>ID 67890 : 1995</cross-reference>
<cross-reference>ID 67890 : 1998</cross-reference>
<cross-reference>ID 67891 : 2000</cross-reference>
<cross-reference>ID 12345 : 2004</cross-reference>
<supersedes>ID 12345 : 2004</supersedes>
...
</metadata>
</document>
<document>
<metadata>
<document-number>ID 12345 : 2004</document-number>
<publication-year>2004</publication-year>
<cross-reference>ID 67890 : 1995</cross-reference>
<cross-reference>ID 67890 : 1998</cross-reference>
<cross-reference>ID 67891 : 2000</cross-reference>
<cross-reference>ID 12345 : 2012</cross-reference>
<cross-reference>ID 12345 : 2001</cross-reference>
<superseded-by>ID 12345 : 2012</superseded-by>
<supersedes>ID 12345 : 2001</supersedes>
...
</metadata>
</document>
We're using a single-box search, based on the MarkLogic Search API, to allow users to search these documents. The search grammar describes a variety of constraints and search options, but mostly (and by default) users search by a field defined to include most of the metadata elements, with (somewhat) carefully chosen weights. What really matters here is that document-number has the highest weight.
The problem is that the business wants quite specific ordering of results, and I can't think of a way to achieve it using the search api.
The requirement that's causing trouble is that if the user's search matches a document number (say they search for "12345"), then all documents with that document number should be at the top of the result set, ordered by descending date. It's easy enough to get them at the top of the result set; document-number has the highest weight, so sorting by score works fine. The problem is that the secondary sort by date doesn't work: even though all the document-number matches have higher scores than other documents, they don't all have the same score, so they end up ordered by how often the search term appears in the rest of the metadata, which isn't really meaningful at all.
What I think we really need is a way of having the search api score results simply by the highest weighted element that matches the search-term, without reference to any other matches in the document. I've had a look at the scoring algorithms and can't see one that does that; have I missed something or is this just not possible? Obviously, it doesn't have to be score that we order by; if there's some other way to get at the score of the single best match in a document and use it for sorting, that would be fine.
Is there some other solution I haven't even thought of?
I thought of doing two searches (one on document-number, and one on the whole metadata tree) and then combining the results, but that seems like it's going to cause a lot of pain with pagination and performance, which sort of defeats the purpose of using the search api in the first place.
I should add that it is correct to have those other matches in the result-set, so we can't just search only on document-number.
I think you've reached the limits of what the high-level search API can do for you. I have a few tricks to suggest, though. These won't be 100% robust, but they might be good enough for the business. Then you can get on with the application. Sorry if I sound cynical or dismissive, but I don't believe in micromanaging search results.
Simplest possible: re-sort the first page in memory. That first page could be a bit larger than the page you show to the user. Because it is still limited in size, you can make the rules for this fairly complex without suffering much. That would fix your 'descending date' problem. The results from page 1 wouldn't quite match up with page 2, but that might be good enough.
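To make the re-sort idea concrete, here is a minimal sketch in Python (not XQuery), assuming the over-fetched first page has already been pulled into memory as a list of records in score order. The field names `number` and `date` and the function name are hypothetical, chosen only for illustration.

```python
from datetime import date

def resort_first_page(results, query_number, page_size):
    """Re-sort an over-fetched first page in memory.

    results: list of dicts in score order, each with 'number' and 'date'.
    Documents whose number contains query_number go first, newest first;
    everything else keeps its original score order.
    """
    matches = [r for r in results if query_number in r["number"]]
    others = [r for r in results if query_number not in r["number"]]
    matches.sort(key=lambda r: r["date"], reverse=True)
    return (matches + others)[:page_size]

# Over-fetched page (4 hits) for a displayed page of 3.
page = [
    {"number": "ID 12345 : 2004", "date": date(2004, 1, 1)},
    {"number": "ID 12345 : 2012", "date": date(2012, 1, 1)},
    {"number": "ID 67890 : 1995", "date": date(1995, 1, 1)},
    {"number": "ID 12345 : 2001", "date": date(2001, 1, 1)},
]
top = resort_first_page(page, "12345", page_size=3)
print([r["number"] for r in top])
```

Because only one page is re-sorted, a document-number match that scored too low to make the over-fetched page will still show up on a later page.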
Taking the next step in complexity, consider using document-quality to handle the descending-date issue. This approach is used by http://markmail.org among others. As each document is inserted or updated, set document quality using a number derived from the date. This could be days or weeks or months since 1970, or using some other fixed date. Newer results will tend to float to the top. If any other boosts tend to swamp the date-based boost, you might get close to what you want.
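As a rough illustration of deriving a quality number from a date, here is a small Python sketch. The function name and the `unit_days` parameter are hypothetical; in MarkLogic the resulting integer would be supplied as the document quality at insert or update time.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def quality_from_date(published, unit_days=1):
    """Return whole units (days by default) elapsed since 1970-01-01."""
    return (published - EPOCH).days // unit_days

# Newer documents get a larger quality value.
print(quality_from_date(date(2004, 1, 1)))
print(quality_from_date(date(2012, 1, 1)))
# Coarser granularity: weeks since 1970.
print(quality_from_date(date(2012, 1, 1), unit_days=7))
```

Coarser units (weeks or months) keep the numbers smaller, which matters because the quality value has to be balanced against the term weights rather than swamp them.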
There might also be some use in analyzing the query to extract the potentially boosting terms. If necessary you could then begin a recursive run of xdmp:exists(cts:search(doc(), $query)) on each boosting term as if it were a standalone query. Bail out as soon as you find a true() result: that means you are going to boost that query term with an absurdly high weight to make it float to the top.
Once you know what the boosting term is, rewrite the entire query to set all other term weights to much lower values, perhaps even 0. The lower the weight, the less those non-boosting terms will interfere with the date-based quality and the boosting weight. If there is no boosting term, you might want to make other adjustments. All this is less expensive than it sounds, by the way. Aside from the xdmp:exists calls, it's just in-memory expression evaluation.
Again, though, these are all just tricks to nudge the scores. They won't give you the absolute control over ranking that you're looking for. In my experience, attempts to micromanage scores are doomed to failure. My bet is that your users would be happier with raw TF/IDF, whatever your business managers say.
Another way to do it is to use two searches, as you suggest. Put a range index on document-number (and ideally the document date), extract any potential document-number values from the query (search:parse, extract, then search:resolve is a good strategy), then execute a cts:element-range-query for docs matching those document-number values with date descending. If there aren't enough results to fill up your N-result page, then get the next N-x results from search api. You can keep track of the documents that were returned in the first result set and exclude those URIs from the second one. Keeping track of the pagination won't be too bad.
This might not perform as well as the first solution, but the time difference for the additional range index query combined with a shorter search api query should be negligible enough for most.
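The pagination bookkeeping for the two-search approach can be sketched as follows. This is Python rather than XQuery, and both input lists are stand-ins: `number_hits` plays the role of the range-query results (already sorted by date descending) and `general_hits` plays the role of the search api results. In a real implementation the second list would not be materialized; the offset and the excluded URIs would be passed to the search instead.

```python
def combined_page(page, page_size, number_hits, general_hits):
    """Return one page of results: document-number matches first, then the rest.

    number_hits: URIs matching the document number, already sorted newest first.
    general_hits: URIs from the general search in score order (may overlap).
    """
    start = (page - 1) * page_size
    exclude = set(number_hits)
    # Slice of the document-number matches that falls on this page.
    first = number_hits[start:start + page_size]
    remaining = page_size - len(first)
    if remaining == 0:
        return first
    # General results with already-shown documents removed.
    rest = [u for u in general_hits if u not in exclude]
    # Offset into the general results once the number matches are used up.
    rest_start = max(0, start - len(number_hits))
    return first + rest[rest_start:rest_start + remaining]

number_hits = ["a", "b", "c"]
general_hits = ["x", "a", "y", "z", "b", "w"]
for page in range(1, 5):
    print(page, combined_page(page, 2, number_hits, general_hits))
```

The only state needed per request is the page number and the count of document-number matches, so this stays stateless on the server.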
I want to query for several strings in one attribute in DynamoDB, using Dynamoose (https://github.com/dynamoose/dynamoose).
But it returns nothing. Is this type of query allowed, or is there another syntax for it?
Code sample
Cat.query({"breed": {"contains": "Terrier","contains": "husky","contains": "wolf"}}).exec()
I want all of these breeds, so the conditions should be ORed together. Please help.
Two major things here.
First: Query in DynamoDB requires a condition where a given hashKey is equal to something. This must be either the hashKey of the table or the hashKey of an index. So even if you could get this syntax working, the query would fail, since you can't test that key for equality against multiple values. It must be hashKey = _______, with no OR statements or anything else on that first condition.
Second, just to answer your question: it seems like what you are looking for is the condition.in function. This would change your code to something like:
Cat.query("breed").in(["Terrier", "husky", "wolf"]).exec()
Of course, the code above will not work because of the first point.
If you really want to brute-force this, you can use Model.scan, changing query to scan in the syntax. However, scan operations are extremely heavy on the DB at scale. A scan looks through every document/item before applying the filter and returning the matches to you, so you get none of the optimization you would normally get. If you only have a handful of documents/items in your table, it might be worth taking the performance hit. In other cases, like exporting or backing up the data, it also makes sense. But if you are able to avoid scan operations, I would. It might require some rethinking of your DB structure, though.
Cat.scan("breed").in(["Terrier", "husky", "wolf"]).exec()
So the code above would work and I think is what you are asking for, but keep in mind the performance & cost hit you are taking here.
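One caveat worth illustrating: `in` matches whole attribute values, while the question asked for `contains`. At the DynamoDB level, several `contains()` tests can be ORed in a single filter expression. Below is a minimal Python sketch that only builds the parameters for a low-level Scan call; it does not use Dynamoose, and the helper name is hypothetical. A table name would still have to be added before sending the request.

```python
def contains_any_filter(attribute, substrings):
    """Build low-level DynamoDB Scan parameters that OR together contains() tests."""
    names = {"#attr": attribute}
    values = {}
    clauses = []
    for i, text in enumerate(substrings):
        placeholder = f":v{i}"
        # Low-level API values are typed; "S" marks a string.
        values[placeholder] = {"S": text}
        clauses.append(f"contains(#attr, {placeholder})")
    return {
        "FilterExpression": " OR ".join(clauses),
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }

params = contains_any_filter("breed", ["Terrier", "husky", "wolf"])
print(params["FilterExpression"])
```

Note that `contains` on strings is a case-sensitive substring test, so "husky" would not match "Husky". The performance and cost caveats about scans apply unchanged.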
I'm trying to write a Gremlin query that finds vertices which have edges from specific other vertices. Less abstractly: I have user vertices, and those are related to group vertices (i.e. subjects in a school, so students who are in "Year 6 Maths" and "Year 6 English" etc). An extra difficulty is that groups can contain subgroups.
The query needs to find those users who are in 2 or more groups specified by the user.
Currently I have a rough solution, but in production on Amazon Neptune this query performs far too poorly, even with a small amount of data. I'm sure there's a simpler way of achieving this :/
g.V()
.has('id', 'group_1')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.aggregate("q-1")
.V()
.has('id', 'group_2')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.where(within("q-1"))
.aggregate("q-2")
.V()
.hasLabel("USER")
.where(within("q-2"))
# We add some more filtering here, such as search terms
.dedup()
.range(0, 10)
.values("id")
.toList()
The first major change you can make is to not bother iterating all of V() again for "USER". Those vertices are already the output of the prior steps, so collecting "q-2" just to use it as a filter doesn't seem necessary:
g.V().
has('id', 'group_1').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
aggregate("q-1").
V().
has('id', 'group_2').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
where(within("q-1")).
# We add some more filtering here, such as search terms
dedup().
range(0, 10).
values("id")
That should already be a huge savings to your query because that change avoids iterating the entire graph in memory (i.e. full scan of all vertices) as there was no index lookup there.
I don't know what your additional filters are here:
# We add some more filtering here, such as search terms
but I would definitely try to filter the users earlier in your query rather than later. Perhaps consider using emit() on your repeat() steps to filter better. You should probably also dedup() your "q-1" to reduce the size of the list there.
I'd be curious to know how much the initial change I suggested helps on its own, as that was probably the bulk of your query cost (unless you have a really deep/wide tree of student/subgroups, I guess). There may be more that could be tweaked here, but it would be nice to know that you at least have a traversal with satisfactory performance at this point.
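When tuning a traversal like this, it can help to have a small reference implementation to check results against on test data. The following is a plain Python sketch, not Gremlin, that computes "users in at least N of the given groups" over an in-memory adjacency map. The data layout and function names are assumptions made for illustration.

```python
from collections import Counter, deque

def users_under(group, children, users):
    """Collect every user reachable from group through subgroups."""
    found, seen, queue = set(), {group}, deque([group])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child in users:
                found.add(child)
            elif child not in seen:
                seen.add(child)
                queue.append(child)
    return found

def users_in_at_least(groups, children, users, minimum=2):
    """Return users that belong to at least `minimum` of the given groups."""
    counts = Counter()
    for group in groups:
        # Each group contributes a deduplicated set, so a user counts once per group.
        counts.update(users_under(group, children, users))
    return sorted(u for u, n in counts.items() if n >= minimum)

children = {
    "group_1": ["alice", "sub_1"],
    "sub_1": ["bob"],
    "group_2": ["bob", "carol"],
}
users = {"alice", "bob", "carol"}
print(users_in_at_least(["group_1", "group_2"], children, users))  # ['bob']
```

Deduplicating per group before counting mirrors the suggestion to dedup() "q-1": without it, a user reachable through two subgroups of the same group would be counted twice.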
I have a few XML documents in MarkLogic which have the structure
<abc:doc>
<abc:doc-meta>
<abc:meetings>
<abc:meeting>
</abc:meeting>
<abc:meeting>
</abc:meeting>
</abc:meetings>
</abc:doc-meta>
</abc:doc>
We can have more than one <abc:meeting> element under the <abc:meetings> element.
I am trying to write a cts:search query to get only documents that have more than one <abc:meeting> element in the document.
Please advise
This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.
The simplest solution overall would be to add a count attribute on abc:meetings, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in sync with each change.
You could also just search on the presence of abc:meeting with cts:element-query(), and append an XPath predicate to count the number of elements afterwards. Something like:
cts:search(
collection(),
cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]
If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents containing meetings, hence could be expensive.
I played with the thought of leveraging cts:near-query(), but that is driven on word positions, so depends on the actual amount of tokens inside a meeting. If that were always an exact number of tokens (unlikely I'd guess), you could use the minimal-distance option on a double cts:element-query() wrapped in a cts:near-query(). It might help optimize the previous option a little though.
The most performant option I can think of right now involves adding a User-Defined aggregate Function. Unfortunately, that means compiling C++ code. I happen to have written such a UDF in the past, which you should be able to use as-is after compilation and installation. For details see:
https://github.com/grtjn/doc-count-udf
and
http://docs.marklogic.com/guide/app-dev/aggregateUDFs
HTH!
It boils down to how many "a few" is. If it's thousands or fewer, then what grtjn presents above, a cts:search plus an XPath expression, will work fine. If it's more, I'd add the count attribute to abc:meetings and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index to be able to query for "Documents that have a count of meetings of 2 or greater".
Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query on abc:meetings and multiple="true".
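The transformation that such a trigger would apply can be sketched outside MarkLogic. The Python below uses the standard library XML parser to set both the count attribute and the "multiple" attribute on abc:meetings. The namespace URI is hypothetical (the original does not give one), and in MarkLogic itself this logic would live in the trigger module rather than in Python.

```python
import xml.etree.ElementTree as ET

ABC = "http://example.com/abc"  # hypothetical namespace URI for the abc prefix
NS = {"abc": ABC}

def annotate_meeting_count(xml_text):
    """Parse a document and record how many abc:meeting children abc:meetings has."""
    root = ET.fromstring(xml_text)
    for meetings in root.iter(f"{{{ABC}}}meetings"):
        count = len(meetings.findall("abc:meeting", NS))
        meetings.set("count", str(count))
        if count > 1:
            # Simple flag for the value-query variant that needs no range index.
            meetings.set("multiple", "true")
    return root

doc = """<abc:doc xmlns:abc="http://example.com/abc">
  <abc:doc-meta>
    <abc:meetings>
      <abc:meeting/>
      <abc:meeting/>
    </abc:meetings>
  </abc:doc-meta>
</abc:doc>"""
root = annotate_meeting_count(doc)
meetings = root.find(".//abc:meetings", NS)
print(meetings.attrib)
```

Whichever attribute you choose, it has to be recomputed on every update that adds or removes a meeting, which is why a trigger is the natural place for it.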
I am evaluating a top-k range query using NDCG. Given a spatial area and a query keyword, my top-k range query must return k documents in the given area that are textually relevant to the query keyword.
In my scenario, the range query usually finds only one document to return. But I have to compare this query to another one that can find more objects in the given area, with the same keyword. This is possible because of an approach I am testing to improve object descriptions.
I can't figure out how to use NDCG to compare these two queries in this scenario. I would like to compare Query A and B using NDCG#5 and NDCG#10, but Query A only finds one object. Query A will have a high NDCG value because of its lower ability to find more objects (the value will probably be one, the maximum). Query B finds more objects (in my opinion, a better solution) but has a lower NDCG value than Query A.
You can consider looking at a different measure, e.g. Recall#10, if you care less about the ranking for your application.
NDCG is a measure designed for web search, where you really want to penalize a system that doesn't return the best item as the topmost result, which is why it discounts the gain of each result by its rank. This makes sense for navigational queries like "stackoverflow": you will look quite bad if you don't return this website first.
It sounds like you are building something a little more sophisticated, where the user cares about many results. Therefore, a more recall-oriented measure (that cares about getting multiple things right more than the ranking) may make more sense.
Regarding "its lower ability to find more objects":
I'd also double-check your implementation of NDCG: you always want to divide by the ideal ranking, regardless of what actually gets returned. It sounds like your Query A returns 1 correct object, but Query B returns more correct objects, but not at high ranks? Either way, you expect Query A to be divided by the DCG of a perfect ranking -- that means 10, 20, or thousands of "correct" objects. It may be that you just don't have enough judgments, and therefore your "perfect ranking" is too small, and therefore you aren't penalizing Query A enough.
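To make the normalization point concrete, here is a minimal Python sketch of NDCG at k in which the ideal DCG is computed from all judged documents, not from whatever the query happened to return. The document ids and judgments are made up for illustration, and the gain is the plain relevance grade with a log2 rank discount (other gain functions exist).

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(returned, judgments, k):
    """returned: ranked doc ids; judgments: dict of doc id -> graded relevance."""
    gains = [judgments.get(doc, 0) for doc in returned[:k]]
    # The ideal ranking uses every judged document, returned or not.
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

judgments = {"d1": 1, "d2": 1, "d3": 1, "d4": 1, "d5": 1}
query_a = ["d1"]                          # one correct object at rank 1
query_b = ["d9", "d1", "d2", "d3", "d4"]  # four correct objects, none at rank 1

print(ndcg_at_k(query_a, judgments, 5))
print(ndcg_at_k(query_b, judgments, 5))
# With only d1 judged, Query A looks perfect: the "too few judgments" case.
print(ndcg_at_k(query_a, {"d1": 1}, 5))
```

With five judged relevant objects, Query A scores well below Query B even though its single result is at rank 1. It only reaches 1.0 when the judgment set contains nothing beyond what it returned.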
I am working with MarkLogic.
I have a database of around 27,000 documents.
What I want to do is retrieve the keywords that occur most frequently in the documents returned by any search query.
I am currently using XQuery functions to count the frequency of each word in the set of all documents retrieved as the query result. However, this is quite inefficient.
I was thinking it would help if I could get the list of words that MarkLogic has indexed.
So is there a way to retrieve the list of indexed words from the universal index of MarkLogic?
Normally you would use something like this in MarkLogic:
(
for $v in cts:element-values(xs:QName("myelem"))
let $f := cts:frequency($v)
order by $f descending
return $v
)[1 to 10]
This kind of functionality is built into the search:search library, which is very convenient.
But unfortunately you cannot use that on values from cts:words and the like. There is a little trick that could get you close, though. Instead of using cts:frequency, you could use xdmp:estimate on a cts:search to get a fragment count:
(
for $v in cts:words()
let $f := xdmp:estimate(cts:search(collection(), $v))
order by $f descending
return $v
)[1 to 10]
Performance is lower, but this is still much faster than bluntly running through all documents.
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short, and another is quite long. How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
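As a rough illustration of how term frequency and document frequency combine, here is a small Python sketch of a log-scaled TF-IDF score. It is a textbook variant, not MarkLogic's actual formula, and it ignores term location and document length entirely.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, documents):
    """Score each document as the sum of log-scaled tf times idf per query term."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            # Document frequency: how many documents contain the term at all.
            df = sum(1 for t in tokenized.values() if term in t)
            if tf == 0 or df == 0:
                continue
            score += (1 + math.log(tf)) * math.log(n_docs / df)
        scores[name] = score
    return scores

docs = {
    "a": "the protease cleaves the protein",
    "b": "the cat sat on the mat",
    "c": "the dog chased the cat",
}
scores = tf_idf_scores(["the", "protease"], docs)
print(scores)
```

Here "the" occurs in every document, so its inverse document frequency is zero and it contributes nothing, while the rare term "protease" decides the ranking.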
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sorting for results returned by search:search from the search API or the low-level cts:search and its supporting operators.
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
I have used cts:distinctive-terms(). In my case it gives mostly wildcarded terms, which are not of much use. Further, it is suited to finding distinctive terms in a single document. When I try to run it on many documents it is quite slow.
What I want to implement is a dynamic facet populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient because it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms that are common in the result set of a search.
I tried cts:words() as suggested. It gives words similar to the search query word, along with the number of documents containing each one. What it does not take into account is the set of search result documents. It just shows the number of documents in the whole database that contain similar words, irrespective of whether those documents are present in the search result or not.