MarkLogic collate sequence in XQuery

Is there a way to modify the elements of a sequence so that only collated versions of the items are returned?
let $currencies := ('dollar', 'Dollar', 'dollar ')
return fn:collated-only($currencies, "http://marklogic.com/collation/en/S1/T00BB/AS")
=> ('dollar', 'dollar', 'dollar')

The values that are stored in the range index (that feeds the facets) are literally the first value that was encountered that compared equal to the others. (Because, the collation says you don't care...)
You can get a long way by calling
fn:replace(fn:lower-case(xdmp:diacritic-less(fn:normalize-unicode($str,"NFKC"))),"\p{P}","")
This won't be exactly the same in that it overfolds some things and underfolds others, but it may be good for your purposes.
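For readers who want to try the same folding idea outside MarkLogic, here is a rough Python sketch of that pipeline. It is only an approximation under stated assumptions: the function name `fold` is made up for illustration, and dropping combining marks after NFD decomposition is a stand-in for xdmp:diacritic-less, not a reimplementation of it or of the collation.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate a case-, diacritic- and punctuation-insensitive key (hypothetical helper)."""
    # Compatibility-normalize first, mirroring fn:normalize-unicode($str, "NFKC").
    text = unicodedata.normalize("NFKC", text)
    # Stand-in for xdmp:diacritic-less: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case, mirroring fn:lower-case.
    text = text.lower()
    # Drop punctuation: Unicode general categories starting with "P", like \p{P}.
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


currencies = ["dollar", "Dollar", "dollar ", "Dóllar!"]
folded = [fold(c) for c in currencies]
print(folded)  # ['dollar', 'dollar', 'dollar ', 'dollar']
```

Note that the trailing space in the third value survives, because whitespace is not punctuation. If whitespace should be ignored as well, an extra trimming step would be needed.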

Is this the expected output? There is no fn:collated-only function, so I'm assuming you're asking how to write such a function or whether there is such a function.
The thing is, there isn't a mapping from one string to another in collation comparisons, there is only a comparison algorithm (the Unicode Collation Algorithm) so there really is no canonical kind of string to return to you, and therefore no API to do so.
Stepping back, what is the problem you are actually trying to solve? By the rules of that collation, "dollar" and "Dollar" are equivalent, and by using it you declare you don't care which form you use, so you could use either one.

If these values are in XML elements and you have a range index using http://marklogic.com/collation/en/S1/T00BB/AS, you can do something like this:
let $ref := cts:element-reference(xs:QName("currency"), "collation=http://marklogic.com/collation/en/S1/T00BB/AS")
for $curr in cts:values($ref, (), "frequency-order")
return $curr || ": " || cts:frequency($curr)
This will produce results like:
"dollar: 15",
"euro: 12"
... and so on. The collation will disregard the differences among your sample inputs. These results could be formatted however you want. Is that what you're looking to do?

Related

Why can't the fn:substring-after XQuery function be used inside a MarkLogic TDE?

In my MarkLogic database, we have documents with distributor codes like 'DIST:5012' (DIST:XXXX, where XXXX is a four-digit number).
Currently, in my TDE, the below code works well.
However, instead of concatenating all the raw distributor codes, I want to concatenate only the number part. I used the fn:substring-after XQuery function, but it doesn't work: the distributorCode column no longer shows up in the SQL view. (Below code does not work.)
What is wrong? How can I fix that?
Both fn:substring-after and fn:string-join are listed on the TDE Dialect page.
https://docs.marklogic.com/9.0/guide/app-dev/TDE#id_99178
substring-after() expects a single string as input, not a sequence of strings.
To demonstrate, this will not work:
let $dist := ("DIST:5012", "DIST:5013")
return substring-after($dist, "DIST:")
This will:
for $dist in ("DIST:5012", "DIST:5013")
return substring-after($dist, "DIST:")
I need to double-check which XPath expressions will work in a TDE, but you might be able to apply the substring-after() function in the last step:
fn:string-join( distributors/distributor/urn/substring-after(., 'DIST:'), ';')

Is it the same to use MATCHES (* + "" + *) and no parameters in a FOR EACH in Progress 4GL?

So I made the following FOR EACH
FOR EACH insp_cd
WHERE insp_cd.status_ = 1
AND insp_cd.item MATCHES('*' + pc-itemPost + '*')
AND insp_cd.update_at < NOW:
So, when the pc-itemPost is "", should I avoid using the MATCHES? Like:
IF pc-itemPost = "" THEN DO:
FOR EACH insp_cd
WHERE insp_cd.status_ = 1
AND insp_cd.update_at < NOW:
...
END.
ELSE DO:
FOR EACH insp_cd
WHERE insp_cd.status_ = 1
AND insp_cd.item MATCHES('*' + pc-itemPost + '*')
AND insp_cd.update_at < NOW:
I know it's very slow because of the table scan, but I'd like to know if there is any difference. Thanks.
Any time that you can avoid MATCHES you should do so.
Using an IF statement to choose branches that execute different static FOR EACH statements is one way to do it. Building dynamic queries based on similar logic would be another approach.
Whether or not your two queries are "different"? Sure, they are different. They have different WHERE clauses so their specific behavior (and performance) will depend on the index structure (which we don't know).
insp_cd.item matches "*" + pc-itempost + "*"
Can be very different from:
insp_cd.item = "".
And logically it is not the same as omitting a check of insp_cd.item altogether. Logically maybe you're attempting to exclude empty values? I'm not sure what the requirement is here.
If insp_cd.item is the first component of an index, or the second component after insp_cd.Status then a variation of this query using ' = "" ' will be much more efficient than one using MATCHES.
Back to avoiding MATCHES, at a high level:
If there is no need for wild cards use "=". Equality matches are always preferred.
If the wild card is at the end of the string use BEGINS.
If the wild card is being used to signify a known list use a series of OR clauses or a LOOKUP() or build a temp-table to join in the query.
There are probably more ways to avoid MATCHES but these are the ones that spring to mind.

An XDMP-NOTANODE error using xquery in marklogic

I'm getting the XDMP-NOTANODE error when I try to run an XQuery in MarkLogic. When I loaded my xml documents I loaded meta data files with them. I'm a student and I don't have experience in XQuery.
error:
[1.0-ml] XDMP-NOTANODE: (err:XPTY0019) $article/article/front/article-meta/title-group/article-title -- xs:untypedAtomic("
") is not a node
Stack Trace
At line 3 column 77:
In xdmp:eval("(for $article in fn:distinct-values(/article/text()) &#1...", (), <options xmlns="xdmp:eval"><database>4206169969988859108</database> <root>C:\mls-projects\pu...</options>)
$article := xs:untypedAtomic("
")
1. (for $article in fn:distinct-values(/article/text())
2.
3. return (fn:distinct-values($article/article/front/article-meta/title-group/article-title)
4.
5.
Code:
(
for $article in fn:distinct-values(/article/text())
return (
fn:distinct-values($article/article/front/article-meta/title-group/article-title/text())
)
)
Every $article is bound to an atomic value (fn:distinct-values() returns a sequence of atomic values). You then try to apply a path expression (using the / operator) to $article, which is forbidden: the path operator requires its left-hand operand to be a sequence of nodes.
I am afraid your code does not make enough sense for me to suggest an actual solution. I can only pinpoint where the error is.
Furthermore, using text() at the end of a path is most of the time a bad idea. And if /article is a complex document, it is certainly not what you want. One of the text nodes you select (most likely the first one) is simply one single newline character.
What do you want to achieve?
Your $article variable is bound to an atomic value, not a node() from the article document. You can only use an XPath axis on a node.
When you apply the function distinct-values() in the for statement, it returns simple string values, not the article document or nodes from it.
You can probably make things work by using the values in a predicate filter like this:
for $article-text in fn:distinct-values(/article/text())
return
fn:distinct-values(/article[text()=$article-text]/front/article-meta/title-group/article-title/text())
Note: The above XQuery should avoid the XDMP-NOTANODE error, but there are likely easier (and more efficient) solutions for achieving your goal. If you were to post a sample of your document and describe what you are trying to achieve, we could suggest alternatives.
Bit of a wild guess, but you have two distinct-values calls in your code. That makes me think you want a unique list of articles, and then finally a unique list of article titles. I would hope you already have unique articles in your database, unless you are explicitly attempting to de-duplicate them.
In case you just want the overall unique list of article titles, I would do something like:
distinct-values(
for $article in collection()/article
return
$article/front/article-meta/title-group/article-title
)
HTH!

How can I prevent SQLite from treating a string as a number?

I would like to query an SQLite table that contains directory paths to find all the paths under some hierarchy. Here's an example of the contents of the column:
/alpha/papa/
/alpha/papa/tango/
/alpha/quebec/
/bravo/papa/
/bravo/papa/uniform/
/charlie/quebec/tango/
If I search for everything under /bravo/papa/, I would like to get:
/bravo/papa/
/bravo/papa/uniform/
I am currently trying to do this like so (see below for the long story of why I can't use more simple methods):
SELECT * FROM Files WHERE Path >= '/bravo/papa/' AND Path < '/bravo/papa0';
This works. It looks a bit weird, but it works for this example. '0' is the Unicode code point 1 greater than '/'. When ordered lexicographically, all the paths starting with '/bravo/papa/' compare greater than or equal to it and less than '/bravo/papa0'. However, in my tests, I find that this breaks down when we try this:
SELECT * FROM Files WHERE Path >= '/' AND Path < '0';
This returns no results, but it should return every row. As far as I can tell, the problem is that SQLite is treating '0' as a number, not a string. If I use '0Z' instead of '0', for example, I do get results, but I introduce a risk of getting false positives. (For example, if there actually was an entry '0'.)
The simple version of my question is: is there some way to get SQLite to treat '0' in such a query as the length-1 string containing the Unicode character '0' (which should sort after strings such as '!', '*' and '/', but before '1', '=' and 'A') instead of the integer 0 (which SQLite sorts before all strings)?
I think in this case I can actually get away with special-casing a search for everything under '/', since all my entries will always start with '/', but I'd really like to know how to avoid this sort of thing in general, as it's unpleasantly surprising in all the same ways as Javascript's "==" operator.
First approach
A more natural approach would be to use the LIKE or GLOB operator. For example:
SELECT * FROM Files WHERE Path LIKE @prefix || '%';
But I want to support all valid path characters, so I would need to use ESCAPE for the '_' and '%' symbols. Apparently this prevents SQLite from using an index on Path. (See http://www.sqlite.org/optoverview.html#like_opt ) I really want to be able to benefit from an index here, and it sounds like that's impossible using either LIKE or GLOB unless I can guarantee that none of their special characters will occur in the directory name, and POSIX allows anything other than NUL and '/', even GLOB's '*' and '?' characters.
I'm providing this for context. I'm interested in other approaches to solve the underlying problem, but I'd prefer to accept an answer that directly addresses the ambiguity of strings-that-look-like-numbers in SQLite.
Similar questions
How do I prevent sqlite from evaluating a string as a math expression?
In that question, the values weren't quoted. I get these results even when the values are quoted or passed in as parameters.
EDIT - See my answer below. The column was created with the invalid type "STRING", which SQLite treated as NUMERIC.
* Groan *. The column had NUMERIC affinity because it had accidentally been specified as "STRING" instead of "TEXT". Since SQLite didn't recognize the type name, it made it NUMERIC, and because SQLite doesn't enforce column types, everything else worked as expected, except that any time a number-like string is inserted into that column it is converted into a numeric type.
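The affinity behaviour described above can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch, not the original schema: the table names FilesString and FilesText and the helper functions are made up for illustration, and the sample rows are a subset of the paths from the question plus a bare '0'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "STRING" is not a type name SQLite recognises, so this column gets NUMERIC affinity.
conn.execute("CREATE TABLE FilesString (Path STRING)")
# "TEXT" gives the column TEXT affinity.
conn.execute("CREATE TABLE FilesText (Path TEXT)")

paths = ["/alpha/papa/", "/bravo/papa/", "/bravo/papa/uniform/", "0"]
for table in ("FilesString", "FilesText"):
    conn.executemany(f"INSERT INTO {table} (Path) VALUES (?)", [(p,) for p in paths])


def stored_types(table):
    """Return (value, storage class) for every row, in insertion order."""
    rows = conn.execute(f"SELECT Path, typeof(Path) FROM {table} ORDER BY rowid")
    return [(path, kind) for path, kind in rows]


def under_root(table):
    """Run the range query from the question against the given table."""
    query = f"SELECT Path FROM {table} WHERE Path >= '/' AND Path < '0' ORDER BY Path"
    return [row[0] for row in conn.execute(query)]


print(stored_types("FilesString")[-1])  # (0, 'integer'): the string '0' was converted
print(stored_types("FilesText")[-1])    # ('0', 'text')
print(under_root("FilesString"))        # []
print(under_root("FilesText"))          # the three paths starting with '/'
```

In the STRING table, the comparison against '0' is carried out numerically, and every text value sorts after every number, so no row qualifies. Declaring the column as TEXT keeps both the stored values and the comparison textual.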

SQLite X'...' notation with column data

I am trying to write a custom report in Spiceworks, which uses SQLite queries. This report will fetch me hard drive serial numbers that are unfortunately stored in a few different ways depending on what version of Windows and WMI were on the machine.
Three common examples (which are enough to get to the actual question) are as follows:
Actual serial number: 5VG95AZF
Hexadecimal string with leading spaces: 2020202057202d44585730354341543934383433
Hexadecimal string with leading zeroes: 3030303030303030313131343330423137454342
The two hex strings are further complicated in that even after they are converted to ASCII representation, each pair of numbers are actually backwards. Here is an example:
3030303030303030313131343330423137454342 evaluates to 00000000111430B17ECB
However, the actual serial number on that hard drive is 1141031BE7BC, without leading zeroes and with the bytes swapped around. According to other questions and answers I have read on this site, this has to do with the "endianness" of the data.
My temporary query so far looks something like this (shortened to only the pertinent section):
SELECT pd.model as HDModel,
CASE
WHEN pd.serial like "30303030%" THEN
cast(('X''' || pd.serial || '''') as TEXT)
WHEN pd.serial like "202020%" THEN
LTRIM(X'2020202057202d44585730354341543934383433')
ELSE
pd.serial
END as HDSerial
The result of that query is something like this:
HDModel            HDSerial
-----------------  -------------------------------------------
Normal Serial      5VG95AZF
202020% test case  W -DXW05CAT94843
303030% test case  X'3030303030303030313131343330423137454342'
This shows that the X'....' notation style does convert into the correct (but backwards) result of W -DXW05CAT94843 when given a fully literal number (the 202020% line). However, I need to find a way to do the same thing to the actual data in the column, pd.serial, and I can't find a way.
My initial thought was that if I could build a string representation of the X'...' notation, then perhaps cast() would evaluate it. But as you can see, that just ends up spitting out X'3030303030303030313131343330423137454342' instead of the expected 00000000111430B17ECB. This means the concatenation is working correctly, but I can't find a way to evaluate it as hex the same way as in the manual test case.
I have been googling all morning to see if there is just some syntax I am missing, but the closest I have come is this concatenation using the || operator.
EDIT: Ultimately I just want to be able to have a simple case statement in my query like this:
SELECT pd.model as HDModel,
CASE
WHEN pd.serial like "30303030%" THEN
LTRIM(X'pd.serial')
WHEN pd.serial like "202020%" THEN
LTRIM(X'pd.serial')
ELSE
pd.serial
END as HDSerial
But because pd.serial gets wrapped in single quotes, it is taken as a literal string instead of taken as the data contained in that column. My hope was/is that there is just a character or operator I need to specify, like X'$pd.serial' or something.
END EDIT
If I can get past this first hurdle, my next task will be to try and remove the leading zeroes (the way LTRIM eats the leading spaces) and reverse the bytes, but to be honest, I would be content even if that part isn't possible because it wouldn't be hard to post-process this report in Excel to do that.
If anyone can point me in the right direction I would greatly appreciate it! It would obviously be much easier if I was using PHP or something else to do this processing, but because I am trying to have it be an embedded report in Spiceworks, I have to do this all in a single SQLite query.
X'...' is SQLite's notation for a binary (BLOB) literal. If the values are strings, you can just use them as such.
This should be a start:
sqlite> select X'3030303030303030313131343330423137454342';
00000000111430B17ECB
sqlite> select ltrim(X'3030303030303030313131343330423137454342','0');
111430B17ECB
I hope this puts you on the right path.
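If the report ends up being post-processed outside SQLite, as the question suggests might be acceptable, the remaining steps (hex decoding, swapping each pair of characters, trimming the padding) could look roughly like the Python sketch below. The function name decode_serial is hypothetical, the prefix test simply mirrors the CASE branches from the question, and stripping leading zeroes assumes that real serial numbers never start with one.

```python
def decode_serial(raw: str) -> str:
    """Decode a hex-encoded, pair-swapped serial; return plain serials unchanged."""
    # Same heuristic as the CASE expression in the question (an assumption).
    if not raw.startswith(("30303030", "202020")):
        return raw
    # Hex digits -> ASCII text, e.g. '3030...' -> '0000...'.
    text = bytes.fromhex(raw).decode("ascii")
    # Swap the two characters of every pair; a trailing odd character is kept as-is.
    swapped = "".join(text[i:i + 2][::-1] for i in range(0, len(text), 2))
    # Remove the space padding, then the zero padding.
    return swapped.strip().lstrip("0")


print(decode_serial("3030303030303030313131343330423137454342"))  # 1141031BE7BC
print(decode_serial("2020202057202d44585730354341543934383433"))  # WD-WX50AC9T8434
print(decode_serial("5VG95AZF"))                                  # 5VG95AZF
```

The first result matches the serial number quoted in the question for that drive.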
