How can I find the documents accessed to answer my XQuery query?

I have the following objective: I want to find which documents contain my data when executing any kind of XQuery or XPath query. In other words, I need every document that provides result data for a given query. I am trying to do this in an eXist-db environment, but I suppose there should be something at the XQuery level.
I found the op:context-document() operator, which seems to have the functionality I want, yet as an operator it is not available to me. fn:document-uri also does not do the trick, as its $arg must be a document node; otherwise it returns an empty sequence.
Do you have any ideas? Any assistance is highly appreciated.

fn:base-uri() may help; it returns the base URI property of a node:
for $d in doc('....')/your[query]
return base-uri($d)
You can also use it to filter your documents for specific types:
collection('/path/to/documents')[ends-with(base-uri(), '.xml')]
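Putting the two together, here is a minimal sketch that lists the distinct documents contributing to a query's results (the collection path, element names and predicate are placeholders, not from your data):
let $hits := collection('/db/mydata')//record[author = 'Smith']
return distinct-values(
  for $hit in $hits
  return base-uri($hit)
)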

Use the standard XPath/XQuery function collection().
For example, using Saxon:
collection('file:///a/b/c/d?select=*.xml')[yourBooleanExpression]
selects the document nodes of all XML documents residing in the /a/b/c/d directory of the filesystem that satisfy your criteria (i.e., for which yourBooleanExpression evaluates to true()).
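If what you actually want back is which documents matched, rather than their contents, something along these lines should work (yourBooleanExpression is still your own predicate):
(: return the URI of every matching document in the directory :)
for $doc in collection('file:///a/b/c/d?select=*.xml')[yourBooleanExpression]
return document-uri($doc)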

Related

Using U-SQL MultiLevelJsonExtractor gives Error: Path returned multiple tokens

I am using the MultiLevelJsonExtractor forked on GitHub by kotvisbj. When I put a path that contains an array (body.header.items[*] or body.header.items) into the JsonPaths parameter string, I get "Error: Path returned multiple tokens". Is there a way to extract the paths in code so I can get an array, like when using the Root? I tried to explain this the best way I could; I don't have excellent C# skills, it's been a few years.
I think it would be best to ask the owner of the branch to see if he can advise you. I assume that his code expects a single token only and not an array of tokens.
You can probably achieve what you need by using code similar to this: U-SQL - Extract data from json-array

Jayway Jsonpath syntax for string array filter?

I am attempting to use the EvaluateJsonPath processor in Nifi, and am having trouble with the jayway jsonpath syntax.
My object looks like the following:
{"text":"my stuff", "tags":["abc", "xyz", "beq"]}
I want to route messages based on the tags - I want everything containing "xyz" to be routed one way, and everything not containing it to be routed another way.
Using http://jsonpath.herokuapp.com/ I've been testing and trying to figure out the syntax for filtering on a JSON object whose array of strings contains a match. I can match on an explicit index (so $.[?(@.tags[1] =~ /xyz/i)] works just fine), but I can't guarantee the order or number of entries in the tags field.
Is there a way to do this in the jayway json module? I saw filter the Json according to string in an array in JSONPATH which I've tried, but it doesn't appear to work in the simulator above.
I do not know how to do this in one EvaluateJsonPath processor step. But it can certainly be done in a two-step process:
Use EvaluateJsonPath to filter "xyz" tags out of the tags array, using a JsonPath expression like $.tags[?(@ =~ /xyz/i)] and setting the processor's Return Type to json so an array may be returned. This will result in ["xyz"] for a match and [] for non-matching flowfiles.
Use RouteOnAttribute to route based on the resulting array, with an expression like ${matchingTags:toLower():contains('xyz')}.
It might also be worth considering evaluating the JSON as text against a regular expression to match the tag.

How to remove punctuation from a database in marklogic?

I want to remove punctuation from a database of XML documents in MarkLogic. This is for preprocessing purposes for machine learning. I'm new to MarkLogic and I don't know how to do that. Is there an XQuery query that could remove punctuation?
To do a mass replacement of all text in the database and take out punctuation, you could start with something like this code (modified for your needs):
(: cts:and-query(()) matches every fragment in the database :)
for $doc in cts:search(fn:collection(), cts:and-query(()))
(: visit each text node and strip the listed punctuation characters :)
for $text in $doc//text()
return xdmp:node-replace($text, text { fn:replace($text, "[\.,;]", "") })
To be honest, that task is much less expensive to do on the source text files themselves - or in MarkLogic by treating the XML as a string during the replacement process. Updating nodes one element at a time will be expensive.
Outside of MarkLogic:
use SED or AWK or a similar tool BEFORE INGESTION
Inside of MarkLogic (as a trigger, perhaps):
use xdmp:quote to turn the XML into a string, then do the replacement on that string with fn:replace, and then make XML again with xdmp:unquote:
let $new-doc := xdmp:unquote(fn:replace(xdmp:quote($doc), "[\.,;]", ""))
Then either store it by replacing the root node with xdmp:node-replace - or store this version as a property. This all depends on whether the original (punctuated) version matters to you. Or perhaps you just want to keep the original and serve this cleansed version back to someone.
In all cases above, you have to make sure that your replacement does not murder your XML. Also, be aware of the options for the functions above (like how CDATA is handled).
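Putting those pieces together, a rough sketch of the quote/replace/unquote route (the URI is a placeholder and the character class only covers a few punctuation marks):
let $uri := '/path/to/doc.xml'  (: hypothetical document URI :)
let $doc := fn:doc($uri)
let $clean := xdmp:unquote(fn:replace(xdmp:quote($doc), "[\.,;]", ""))
(: overwrite the original root element with the cleansed copy :)
return xdmp:node-replace($doc/element(), $clean/element())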
Lastly, "This is for machine learning purposes". You do not elaborate. I think many of us here have a feeling that this solution (cleansing punctuation before insert) rubs against the very grain of MarkLogic - in which you store as-is and then have awesome index, tokenizing, stemming, collation, search support to find and return your data as you need. If you were to elaborate on your use case a bit, you may inspire others to give more MarkLogic-Specific suggestions.
It will work if you use 'punctuation-insensitive' and, if required, 'diacritic-insensitive' in cts:element-word-query().
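For example, a word query that ignores both punctuation and diacritics might look like this (the QName and search term are placeholders):
cts:search(fn:collection(),
  cts:element-word-query(xs:QName("title"), "Moby-Dick",
    ("punctuation-insensitive", "diacritic-insensitive")))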
I'm not sure if this is what you're asking, but it's technically possible to update every document in the database to remove punctuation; however, it's very expensive and I wouldn't recommend it.
Using built-in search functions, you can probably achieve the same goal without updating your documents by querying with punctuation insensitivity. For example, if you want to select documents with a title matching a string without regard to punctuation:
cts:search(//mydoc,
cts:element-word-query(xs:QName('title'), 'Moby-Dick', 'punctuation-insensitive'))
Or in an existing XQuery:
for $d in $documents
where cts:contains($d,
cts:element-word-query(xs:QName('title'), 'Moby-Dick', 'punctuation-insensitive'))
return $d/summary

Encoding wildcarding, stemming, etc in simple search

We have a simple search interface which calls the search:search($query-text) function. Is there a syntax to include control for wildcarding, stemming and case sensitivity within the single text string that the function accepts? I haven't been able to find anything in the MarkLogic docs.
See the $options parameter and the <term> and <term-option> elements at https://docs.marklogic.com/search:search. There is a guide at http://developer.marklogic.com/learn/2009-07-search-api-walkthrough and some details at http://developer.marklogic.com/learn/2009-07-search-api-walkthrough#ndbba3437f320a4a4
I don't know of any existing syntax for those options, aside from the built-in behavior of turning on wildcards when a term contains '*' or '?' and turning on case-sensitivity when the term contains capital letters.
You could develop a syntax. Implementing it might involve a custom parser along the lines of https://github.com/mblakele/xqysp then feeding the resulting cts:query into search:resolve.
Piggybacking on Eric Bloch's answer... you can always dynamically construct your options node based on input in the user interface.
For example, I often do this in order to separate the facet selection portion of the query from the text search portion and put the facet selection query in the additional-query element in the options node.
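As a rough sketch of what such a constructed options node could look like (the term options shown are just one plausible combination, not the only one):
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <term>
      <term-option>wildcarded</term-option>
      <term-option>stemmed</term-option>
      <term-option>case-insensitive</term-option>
    </term>
  </options>
(: the query text itself can still carry wildcards, e.g. "mob*" :)
return search:search("mob*", $options)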

Delete all documents matching a query?

I would like to delete all documents matching some predicates. The query I have come up with is as follows, but nothing is deleted from the database.
I suspect this is because the $doc is set to the XML value of the document rather than the document itself. Can anyone shed any light on this?
xquery version "1.0-ml";
for $doc in cts:search(fn:collection("MYCOLLECTIONNAME")/MyDocumentRoot,
cts:or-query((
cts:element-range-query (xs:QName("MyElement"), "=", "MyElementValue"),
)), "unfiltered" )
return xdmp:document-delete($doc);
The document looks like
<MyDocumentRoot>
<MyElementName>MyElementValue</MyElementName>
</MyDocumentRoot>
You are indeed passing the contents of the documents into xdmp:document-delete instead of their URIs. You could derive the URI using, for instance, fn:base-uri(), but that way all the docs you want to delete are retrieved from the database first, which is unnecessary.
Instead, enable the URI lexicon and use cts:uris to drive the deletion. It might also be wise to do the deletion in batches of, let's say, 1000 docs.
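A minimal sketch of that batched approach (assuming the URI lexicon is enabled; the query reuses the element and value from the question, and cts:element-range-query still needs a range index on that element):
xquery version "1.0-ml";
let $query := cts:and-query((
  cts:collection-query("MYCOLLECTIONNAME"),
  cts:element-range-query(xs:QName("MyElementName"), "=", "MyElementValue")
))
(: delete at most 1000 matching documents per run; repeat until nothing matches :)
for $uri in cts:uris((), ("limit=1000"), $query)
return xdmp:document-delete($uri)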
HTH!
xdmp:document-delete() takes the URI of the document rather than a node, so the simplest fix would be to wrap $doc in fn:base-uri():
return xdmp:document-delete(base-uri($doc))
But if you have the URI lexicon enabled, you can write a much faster query like this:
let $query := cts:and-query((
  cts:collection-query("MYCOLLECTIONNAME"),
  cts:or-query((...)) (: put your full or-query here :)
))
for $uri in cts:uris("", (), $query)
return xdmp:document-delete($uri)
In the latter case, you avoid having to read each document to get its URI.
xdmp:collection-delete("MYCOLLECTIONNAME") is also worth a mention.
On the last line, change to
xdmp:document-delete(fn:base-uri($doc));
