Customize sorting in MarkLogic - xquery

I want to customize the sort order of items coming from a MarkLogic database.
When sorting, double and single quotes, dollar signs, and all other punctuation should be ignored.
Below is the sample script:
xquery version "1.0-ml";
declare default collation "http://marklogic.com/collation/";
let $seq := ("a", "b", "1", "2", "$3", '"object"')
for $x in $seq
order by $x ascending
return $x

Setting the default collation, or specifying a collation in the order by clause, to http://marklogic.com/collation/en/S1/T00BB/AS handles the majority of your requirements (quote- and punctuation-insensitive sorting).
However, $ is not an ignored character. You can remove $, and any other character that you do not want to affect the sort order, by using the fn:translate() function:
xquery version "1.0-ml";
let $seq := ("a", "b", "c", "1", "2", "$3", '"object"')
for $x in $seq
order by fn:translate($x, "$", "") ascending collation "http://marklogic.com/collation/en/S1/T00BB/AS"
return $x
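
The same idea, stripping the characters that should not influence the order before comparing, can be sketched outside MarkLogic. This Python snippet is only an approximation under stated assumptions: it has no access to MarkLogic collations, so it drops every character whose Unicode category is punctuation or symbol (which covers both quotes and $) and compares the rest case-insensitively. The helper name sort_key is made up for illustration.

```python
import unicodedata


def sort_key(value):
    """Build a comparison key that ignores punctuation and symbols.

    Assumption: "ignored" means Unicode categories P* (punctuation)
    and S* (symbols, which includes "$"). This is not the MarkLogic
    collation, just an approximation of its effect on this data.
    """
    kept = [ch for ch in value if unicodedata.category(ch)[0] not in ("P", "S")]
    return "".join(kept).casefold()


seq = ["a", "b", "c", "1", "2", "$3", '"object"']
# The original strings are returned; only the comparison uses the stripped key.
print(sorted(seq, key=sort_key))
# ['1', '2', '$3', 'a', 'b', 'c', '"object"']
```

As in the XQuery version, the key is used only for ordering, so the returned items keep their quotes and dollar signs.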

You can use fn:replace(...) to delete all non-word characters (and $, which seems to be considered a word character, at least by BaseX) from each string before comparing:
let $seq := ("a", "b", "1", "2", "$3", '"object"')
for $x in $seq
order by replace($x, '[\W\$]', '') ascending
return $x

XQuery get data based on distinct values

My xml is like this:
full xml at inputxml
<course>
<reg_num>10616</reg_num>
<subj>BIOL</subj>
<crse>361</crse>
<sect>F</sect>
<title>Genetics and MolecularBiology</title>
<units>1.0</units>
<instructor>Russell</instructor>
<days>M-W-F</days>
<time>
<start_time>11:00AM</start_time>
<end_time>11:50</end_time>
</time>
<place>
<building>PSYCH</building>
<room>105</room>
</place>
</course>
I need to take distinct values for courses and return instructors that teach those courses.
This is my current code:
let $x := doc("reed.xml")/root/course
for $y in distinct-values($x/title)
let $z := $y/instructor
return ( {data($y)} ,{data($z)})
What am I doing wrong?
In the code that you posted, the for loop is iterating over the sequence of distinct title string values.
You can't XPath into those strings.
Rather, you want to XPath into the $x courses and use $y in each iteration to select the course elements whose title equals $y, then select their instructor.
let $x := doc("reed.xml")/root/course
for $y in distinct-values($x/title)
let $z := $x[title = $y]/instructor
return ($y, data($z))
In XQuery 3.0 and later you can do this with group by.
for $course in doc("reed.xml")/root/course
group by $title := $course/title
return ( <title>{$title}</title> ,
<instructors>{$course/instructor}</instructors> )
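
For readers who want to see the grouping logic outside XQuery, here is a minimal Python sketch of pairing each distinct title with the instructors of all courses that carry it. It assumes the /root/course layout from the question; only the first course below is taken from the question, the other two are made-up sample data, and the function name instructors_by_title is hypothetical.

```python
import xml.etree.ElementTree as ET

# Only the first course comes from the question; the rest is invented sample data.
SAMPLE = """<root>
  <course><title>Genetics and MolecularBiology</title><instructor>Russell</instructor></course>
  <course><title>Genetics and MolecularBiology</title><instructor>Smith</instructor></course>
  <course><title>Organic Chemistry</title><instructor>Jones</instructor></course>
</root>"""


def instructors_by_title(xml_text):
    """Map each distinct course title to the instructors teaching it."""
    root = ET.fromstring(xml_text)
    groups = {}
    for course in root.findall("course"):
        title = course.findtext("title")
        # setdefault creates the list the first time a title is seen
        groups.setdefault(title, []).append(course.findtext("instructor"))
    return groups


for title, instructors in instructors_by_title(SAMPLE).items():
    print(title, instructors)
```

The dictionary plays the role of group by: one entry per distinct title, with every matching course contributing its instructor.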

Marklogic order by ascending returns incorrect sort order

I'm using MarkLogic Server (Enterprise Edition) 7.0-6.3 and want to sort the following data in ascending order, but the collation that I used is not doing the job correctly.
Query:
for $result in ("A, 105.1", "A, 105.5", "A, 105.31", "A, 29")
order by $result ascending collation "http://marklogic.com/collation//AS/MO"
return $result
Result:
A, 29
A, 105.1
A, 105.5
A, 105.31
Here you can see that the 3rd and 4th positions of the result are in the wrong order. I'm not sure if it's a MarkLogic bug with MO. Help required to find the solution. Thank you.
Collation used:
MO (Specifies numeric ordering.)
AS (variable characters are shifted (ignored) according to the variable-top setting.)
With MO, the collation treats each run of consecutive digits as a whole number, not as part of a decimal, regardless of whether variable characters are ignored. So "01.102" sorts after "01.2": the first runs are equal ("01" = "01"), and in the second runs 102 > 2.
Example:
for $result in ("01.102", "01.2", "01.200", "01.20")
order by $result ascending collation "http://marklogic.com/collation//AS/MO"
return $result
correctly produces
01.2
01.20
01.102
01.200
This is effectively a two-key numeric sort: the first key is 01 for every value, and the second keys are 2, 20, 102 and 200.
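
The behaviour described above can be reproduced with a small Python sketch. It does not use MarkLogic or its collation; it only mimics the rule "each run of digits is compared as an integer" so that the two result lists in this thread can be checked. The function name digit_run_key is made up for illustration.

```python
import re


def digit_run_key(value):
    """Split a string into digit runs and other text.

    Digit runs become integers, everything else stays text.
    re.split with one capture group alternates: text, digits, text, ...
    """
    key = []
    for index, part in enumerate(re.split(r"(\d+)", value)):
        if not part:
            continue
        if index % 2 == 1:
            key.append((0, int(part)))  # digit run, compared numerically
        else:
            key.append((1, part))       # non-digit text, compared as text
    return key


print(sorted(["01.102", "01.2", "01.200", "01.20"], key=digit_run_key))
# ['01.2', '01.20', '01.102', '01.200']
print(sorted(["A, 105.1", "A, 105.5", "A, 105.31", "A, 29"], key=digit_run_key))
# ['A, 29', 'A, 105.1', 'A, 105.5', 'A, 105.31']
```

Both outputs match what the collation returned, which suggests the observed order is the documented digit-run behaviour and not a bug.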
Splitting each value at the first space and sorting the second part as a number produces the desired result:
xquery version "1.0-ml";
for $result in ("A, 105.1", "A, 105.5", "A, 105.31", "A, 29")
order by fn:substring-before($result, " ") , xs:float(fn:substring-after($result, " ")) ascending
return $result
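
A rough Python equivalent of this two-key sort, for comparison. It assumes every value has the shape "prefix, number" with a single space before the number, and it swaps xs:float for Python's Decimal to avoid binary rounding; split_key is a hypothetical helper name.

```python
from decimal import Decimal


def split_key(value):
    """Sort by the text before the first space, then by the number after it."""
    prefix, _, number = value.partition(" ")
    return (prefix, Decimal(number))


values = ["A, 105.1", "A, 105.5", "A, 105.31", "A, 29"]
print(sorted(values, key=split_key))
# ['A, 29', 'A, 105.1', 'A, 105.31', 'A, 105.5']
```

Here 105.31 sorts before 105.5 because the whole number is compared as a decimal value, not as separate digit runs.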

Text processing. Find a datetime in a line, copy the datetime, paste it further up the line

I have a SQL INSERT statement:
INSERT INTO public.sensor_data (doc) VALUES ('{"x": 1, "readerId": "63", "locationSensorId": "13", "lux": "0", "locationCounter": "3", "alarm": "0", "rssi": "436", "datetime": "2017/07/10 17:01:44.539910", "magnet": "5", "y": 1, "uniqueName": "hive", "v": 1, "agesent": "3741337710", "sensorId": "4000043", "z": 1, "movement": "65535"}')
It actually needs to include a column named created_at, and another value after the JSON. This must be the value of the datetime key in the JSON ('2017/07/10 17:01:44.539910').
I can add the necessary SQL (parentheses etc.) after the first value in vim.
I cannot figure out how to copy that datetime value on each line and paste it further along the line, inside the parentheses of the SQL VALUES clause (' '). Is there some magical regular expression out there?
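If I understood you correctly, you can solve this with a Vim substitution like the one below. It captures the quoted datetime value (\1) and the rest of the line up to the closing parenthesis (\2), then writes the line back with the captured value appended before that parenthesis: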
s/datetime":\s\(".\{-}"\),\(.*\))$/datetime": \1,\2, created_at: \1)/g
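
If the statements can also be processed outside Vim, the same capture-and-reinsert idea can be sketched in Python. This is an illustration under assumptions, not the answer's method: it assumes the column list is written exactly as (doc), that the statement ends with the closing parenthesis of VALUES, and that the datetime value contains no quotes. The name add_created_at is hypothetical, and the sample statement is a shortened version of the one in the question.

```python
import re

# Captures the text between the quotes that follow the "datetime" key.
DATETIME_RE = re.compile(r'"datetime":\s*"([^"]*)"')


def add_created_at(statement):
    """Copy the JSON datetime value into a new created_at column."""
    match = DATETIME_RE.search(statement)
    if match is None:
        return statement  # leave lines without a datetime key untouched
    # Extend the column list (assumes it is written exactly as "(doc)").
    statement = statement.replace("(doc)", "(doc, created_at)", 1)
    # Split at the last ")" and put the single-quoted value in front of it.
    head, _, tail = statement.rpartition(")")
    return f"{head}, '{match.group(1)}'){tail}"


stmt = """INSERT INTO public.sensor_data (doc) VALUES ('{"x": 1, "datetime": "2017/07/10 17:01:44.539910", "magnet": "5"}')"""
print(add_created_at(stmt))
```

The value is emitted in single quotes so that it reads as an SQL string literal, which is what the question asks for.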

How to optimize the substring & instring or substitute the functions for debugging

I wrote the stored procedure below to read a comma-delimited file that is moved onto the server (the "DIR" folder). When the script is executed, it parses the .csv file and assigns the data to the respective variables (xJOB_ID, xCTRL_ID, xACCT_SEC, xCREATEDON_DATE) so that I can insert it into a table. I am using Oracle SQL Developer version 4.0.0.13 in a Windows 7 environment. Fortunately, after banging my head on the table a couple of times, the code works and I have not had any issues running the script.
Example format of the file:
1111, 2, T, 10/10/2000
2222, 12345, U, 10/10/2001
5555, 123, S, 10/10/1999
MY QUESTION:
I had some difficulty using the SUBSTR and INSTR functions to parse the data. How can I improve the script so that, if debugging is ever needed, someone who did not write the stored procedure can resolve issues easily? Please let me know if that makes sense. I have included the entire script so that you can see what I am trying to accomplish.
create or replace PROCEDURE SP_INSERT_INTO_TABLE(xFILE_NAME IN VARCHAR2)
IS
--UTL_FILE is an oracle package that allows you to read and write operating system files.
TEXT_DATA UTL_FILE.FILE_TYPE;
v_ROW_LENGTH NUMBER := 1024;
v_TEXTSTRING VARCHAR2(4000);
cLINE VARCHAR2(100);
xJOB_ID NUMBER;
xCTRL_ID NUMBER;
xACCT_SEC VARCHAR2(1);
xCREATEDON_DATE DATE;
xCOUNT NUMBER := 0;
BEGIN
BEGIN
--Streams in the file data and assigns it to TEXT_DATA variable.
TEXT_DATA := UTL_FILE.FOPEN('DIR', xFILE_NAME, 'R', v_ROW_LENGTH);
END;
--Begin LOOP to get each line and assign to cLINE to extract, assign to each variable, and insert into the table
LOOP
BEGIN
--Gets each string/line up to the line terminator
UTL_FILE.GET_LINE(TEXT_DATA, v_TEXTSTRING);
EXCEPTION
WHEN NO_DATA_FOUND THEN
EXIT;
END;
--Each line is assigned to the variable cLINE.
cLINE := v_TEXTSTRING;
--Begin to parse data using SUBSTRING and INSTRING functions
BEGIN
--Extracts string from cLINE position 1 up to the first occurrence, converts it to a number, and assigns it to the variable.
xJOB_ID := TO_NUMBER(SUBSTR(cLINE, 1,INSTR(cLINE, ',', 1, 1)-1));
--Extracts string from cLINE between the 1st and 2nd occurrence, converts it to a number, and assigns it to the variable.
xCTRL_ID := TO_NUMBER(SUBSTR(cLINE, INSTR(cLINE, ',', 1, 1)+1, INSTR(cLINE, ',', 1,2)-INSTR(cLINE, ',', 1,1)-1));
--Extracts string from cLINE between the 2nd and 3rd occurrence and assigns it to the variable.
xACCT_SEC := SUBSTR(cLINE, INSTR(cLINE, ',', 1, 2) +1, INSTR(cLINE, ',', 1,3)-INSTR(cLINE, ',', 1,2) -1);
--Extracts string from cLINE after the last occurrence, converts it to a date, and assigns it the variable.
xCREATEDON_DATE := TO_DATE(SUBSTR(cLINE, INSTR(cLINE, ',', 1, 3)+1), 'MM/DD/YYYY');
INSERT INTO TABLE(JOB_ID, CTRL_ID, ACCT_SEC, CREATEDON_DATE)
VALUES(xJOB_ID, xCTRL_ID, xACCT_SEC, xCREATEDON_DATE);
COMMIT;
--Counter to count the amount of inserts
xCOUNT := xCOUNT + 1;
EXCEPTION
--Exception to handle the conversion of a string to a NUMBER or value is longer than the declared length of the variable.
WHEN VALUE_ERROR THEN
NULL;
END;
END LOOP;
DBMS_OUTPUT.PUT_LINE('RECORDS INSERTED: ' || xCOUNT);
UTL_FILE.FCLOSE(TEXT_DATA);
END;
You could use REGEXP_SUBSTR instead, since that requires just one function call per field rather than a series of SUBSTR and INSTR calls, e.g. (courtesy of Gary_W):
declare
v_str varchar2(20) := 'a,,bcd';
v_substr1 varchar2(10);
v_substr2 varchar2(10);
v_substr3 varchar2(10);
begin
v_substr1 := regexp_substr(v_str, '([^,]*)(,|$)', 1, 1, NULL, 1);
v_substr2 := regexp_substr(v_str, '([^,]*)(,|$)', 1, 2, NULL, 1);
v_substr3 := regexp_substr(v_str, '([^,]*)(,|$)', 1, 3, NULL, 1);
dbms_output.put_line(v_substr1||':'||v_substr2||':'||v_substr3);
end;
/
a::bcd
The above will work with strings that have null portions, as demonstrated. The search pattern has been split into two groups (aka subexpressions): [^,]* and ,|$.
The first group says: any character that isn't a comma ([^,]) that appears 0 or more times (*).
The second group says: comma or the end of the line.
So the whole pattern is looking for a set of any characters excluding the comma that may or may not exist which is followed by either a comma or the end of the line.
The last parameter of regexp_substr says that we want the 1st subexpression of the search pattern returned; without it, the commas would be returned as part of the string.
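
To experiment with this pattern without a database, the same two-group regular expression can be tried in Python. This is a sketch, not Oracle's implementation: nth_field is a made-up helper that returns group 1 of the n-th match, and Python reports an empty field as an empty string where Oracle would return NULL.

```python
import re

# Group 1: zero or more non-comma characters. Group 2: a comma or end of line.
FIELD_RE = re.compile(r"([^,]*)(,|$)")


def nth_field(line, n):
    """Return group 1 of the n-th match (1-based), or None if there are fewer matches."""
    for index, match in enumerate(FIELD_RE.finditer(line), start=1):
        if index == n:
            return match.group(1)
    return None


fields = [nth_field("a,,bcd", i) for i in (1, 2, 3)]
print(":".join(fields))
# a::bcd
```

The empty second field keeps its position, which is the point of allowing zero characters in the first group.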
If you're absolutely sure that none of the elements of the string will ever be null, then the following will work:
declare
v_str varchar2(20) := 'a,123,bcd';
v_substr1 varchar2(10);
v_substr2 varchar2(10);
v_substr3 varchar2(10);
begin
v_substr1 := regexp_substr(v_str, '[^,]+', 1, 1);
v_substr2 := regexp_substr(v_str, '[^,]+', 1, 2);
v_substr3 := regexp_substr(v_str, '[^,]+', 1, 3);
dbms_output.put_line(v_substr1||':'||v_substr2||':'||v_substr3);
end;
/
a:123:bcd
This is just looking for the specified occurrence of a string of characters that aren't a comma, which is slightly easier to grok (imho!) than the search pattern used in the previous example, but is a lot less robust.
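Why the simpler pattern is less robust can be seen by running it against a string with an empty field. The Python sketch below uses the same [^,]+ pattern; the helper name nonempty_fields is made up for illustration.

```python
import re


def nonempty_fields(line):
    """Return every run of one or more non-comma characters."""
    return re.findall(r"[^,]+", line)


print(nonempty_fields("a,123,bcd"))  # ['a', '123', 'bcd']
print(nonempty_fields("a,,bcd"))     # ['a', 'bcd']
```

With the empty field skipped, the "second occurrence" is bcd, so every later value shifts one position to the left.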

Selecting element attributes using boolean operators (and, or, not) in XQuery

I want to filter the results of the following XQuery:
for $units in $data//*[@id = $ids and (@xref = $a or @xref = $b)]/@id
How do I select the elements with a matching @id value and an @xref attribute that matches either $a or $b, but not both $a and $b?
Both $a and $b are node sets with tokenized values, which both act as identifiers. The wanted identifier may be stored in either $a or $b.
My intention is that if $a matches the @xref attribute, the query does not check for $b.
Best would be to use xor. Too bad there is no xor...
But != and ne do the same when both operands are booleans:
for $units in $data//*[@id = $ids and ((@xref = $a) ne (@xref = $b))]/@id
And it should be faster to use eq instead of =, but only if $a and $b each hold a single value (eq raises a type error on a sequence of more than one item):
for $units in $data//*[@id = $ids and ((@xref eq $a) ne (@xref eq $b))]/@id
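
The trick of using "not equal" on two booleans as an exclusive or is not specific to XQuery. A minimal Python sketch, with a made-up function name and made-up sample identifiers:

```python
def matches_exactly_one(xref, a, b):
    """True if xref occurs in a or in b, but not in both."""
    in_a = xref in a
    in_b = xref in b
    # For two booleans, "!=" is true exactly when they differ: that is xor.
    return in_a != in_b


a = {"u1", "u2"}
b = {"u2", "u3"}
print([x for x in ("u1", "u2", "u3", "u4") if matches_exactly_one(x, a, b)])
# ['u1', 'u3']
```

Here u2 is rejected because it is in both sets, and u4 because it is in neither.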
