searching for unicode characters in marklogic - xquery

How can I search for an encoded unicode character?
e.g. This works correctly
search:search('ģ')
but this does not:
(: $search-for received encoded with & :)
xdmp:log($search-for), (: prints "#x0123;" :)
search:search($search-for)
Output contains:
<search:qtext>&#x0123;</search:qtext>
What's the best way to detect if the string contains & and convert it?

I think some browser or other tool is fooling you. When I run this in QConsole (MarkLogic 8.0-5.2):
xdmp:document-insert('/unicode.xml', <p>hello ģ world</p>)
;
import module namespace search = "http://marklogic.com/appservices/search"
at "/MarkLogic/appservices/search/search.xqy";
let $search-for := "ģ"
return
search:search($search-for)
I get:
<search:response snippet-format="snippet" total="1" start="1" page-length="10" xmlns:search="http://marklogic.com/appservices/search">
<search:result index="1" uri="/unicode.xml" path="fn:doc("/unicode.xml")" score="36864" confidence="0.5609438" fitness="0.6934683">
<search:snippet>
<search:match path="fn:doc("/unicode.xml")/p">hello <search:highlight>ģ</search:highlight> world</search:match>
</search:snippet>
</search:result>
<search:qtext>ģ</search:qtext>
<search:metrics>
<search:query-resolution-time>PT0.003526S</search:query-resolution-time>
<search:snippet-resolution-time>PT0.001206S</search:snippet-resolution-time>
<search:total-time>PT0.005372S</search:total-time>
</search:metrics>
</search:response>
E.g. the unicode character doesn't get escaped.
HTH!

Related

How can I parse a quoted string with Parsers.jl

Julia’s CSV.jl parses csv files with quoted strings. It uses Parsers.jl to do this. Yet from the documentation of Parsers.jl it is not clear how to parse a double-quoted string on its own. How would I do that? As a secondary question, what is the supported set of escape sequences that Parsers.jl uses?
You can pass arbitrary characters to indicate quotation and escape characters via Parsers.Options. For example,
using Parsers
str = "{-1}"
oq, cq, e = UInt8('{'), UInt8('}'), UInt8('\\')
res = Parsers.xparse(Int64, str; openquotechar=oq, closequotechar=cq, escapechar=e)
x, code, tlen = res.val, res.code, res.tlen
print(x)

XSLT-style mini transformation in Xquery?

At the moment in Xquery 3.1 (in eXist 4.7) I receive XML fragments that look like the following (from eXist's Lucene full text search):
let $text :=
<tei:text>
<front>
<tei:div>
<tei:listBibl>
<tei:bibl>There is some</tei:bibl>
<tei:bibl>text in certain elements</tei:bibl>
</tei:listBibl>
</tei:div>
<tei:div>
<tei:listBibl>
<tei:bibl>which are subject <exist:match>to</exist:match> a Lucene search</tei:bibl>
<tei:bibl></tei:bibl>
<tei:listBibl>
</tei:div>
<tei:front>
<tei:body>
<tei:p>and often produces</tei:p>
<tei:p>a hit.</tei:p>
<tei:body>
<tei:text>
Currently I have Xquery send this fragment to an XSLT stylesheet in order to transform it into HTML like this:
<td>...elements which are subject <span class="search-hit">to</span> a Lucene search and often p...
Where the stylesheet's job is to return 30 characters of text before and after <exist:match/> and put the content of <exist:match/> into a span. There is only one <exist:match/> per transformation.
This all works fine. However, it's occurred to me that it is a very small job with effectively a single transformation of only one element, the rest being a sort of string-join. I therefore wonder if this can't be done efficiently in Xquery.
In trying to do this, I'm can't seem to find a way to handle the string content up to the <exist:match/> and then the string content after <exist:match/>. My idea is, in pseudo code, to output a result like:
let $textbefore := some function to get the text before <exist:match/>
let $textafter := some function to get text before <exist:match/>
return <td>...{$textbefore}
<span class="search-hit">
{$text//exist:match/text()}
</span> {$textafter}...</td>
Is this even worth doing in Xquery vs the current Xquery -> XSLT pipeline I have?
Many thanks.
I think it can be done as
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare namespace tei = "http://example.com/tei";
declare namespace exist = "http://example.com/exist";
declare option output:method 'html';
let $text :=
<tei:text>
<tei:front>
<tei:div>
<tei:listBibl>
<tei:bibl>There is some</tei:bibl>
<tei:bibl>text in certain elements</tei:bibl>
</tei:listBibl>
</tei:div>
<tei:div>
<tei:listBibl>
<tei:bibl>which are subject <exist:match>to</exist:match> a Lucene search</tei:bibl>
<tei:bibl></tei:bibl>
</tei:listBibl>
</tei:div>
</tei:front>
<tei:body>
<tei:p>and often produces</tei:p>
<tei:p>a hit.</tei:p>
</tei:body>
</tei:text>
,
$match := $text//exist:match,
$text-before-all := normalize-space(string-join($match/preceding::text(), ' ')),
$text-before := substring($text-before-all, string-length($text-before-all) - 30),
$text-after := substring(normalize-space(string-join($match/following::text(), ' ')), 1, 30)
return
<td>...{$text-before}
<span class="search-hit">
{$match/text()}
</span> {$text-after}...</td>
which is not really much of a query in XQuery either but just some XPath selection plus some possibly expensive string joining and extraction on the preceding and following axis.

How do i decode this string? \xc3\x99\xc3\xa9\xc2\x87-B[x\xc2

This is what I need to decode
\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab
it is generated by String.fromCharCode(arrayPw[i]);
but i don't understand how to decode it :(
Please help
Python:
data = "\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab"
udata = data.decode("utf-8")
asciidata = udata.encode("ascii","ignore")
JavaScript:
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
Otherwise do more research about decoding UTF-8.
https://gist.github.com/chrisveness/bcb00eb717e6382c5608
There's also an online UTF-8 decoder/encoder:
https://mothereff.in/utf-8
HINT: ÙÙé-B[x¾æEz«
duplicate of this : https://stackoverflow.com/a/70815136/5902698
You load a dataset and you have some strange characters.
Exemple :
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are chineses. So I can figure that the one who send me the data have encode it in utf-8 but should do it in 'ISO-8859-1'.
So first step, I had encoded the string, then I decode with utf-8.
so my lines are :
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is :
"'森Dyson Airwrap HS01礼'"
This works for me, but I guess that I do not really well understood under the hood. So feel free to tell me if you have further information.
Bonus. I'll try to detect when the str is in the first strange format because some of my entries are in chinese but others are in english
EDIT : The Bonus is useless. I Just use lamba on ma column to encode and decode without care about format. So I changed the encoding after loading the dataframe
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x : x.encode(_encoding, 'ignore').decode(_decoding , 'ignore'))

xQuery substring problem

I now have a full path for a file as a string like:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
However, now I need to take out only the folder path, so it will be the above string without the last back slash content like:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/"
But it seems that the substring() function in xQuery only has substring(string,start,len) or substring(string,start), I am trying to figure out a way to specify the last occurence of the backslash, but no luck.
Could experts help? Thanks!
Try out the tokenize() function (for splitting a string into its component parts) and then re-assembling it, using everything but the last part.
let $full-path := "/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
$segments := tokenize($full-path,"/")[position() ne last()]
return
concat(string-join($segments,'/'),'/')
For more details on these functions, check out their reference pages:
fn:tokenize()
fn:string-join()
fn:replace can do the job with a regular expression:
replace("/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
"[^/]+$",
"")
This can be done even with a single XPath 2.0 (subset of XQuery) expression:
substring($fullPath,
1,
string-length($fullPath) - string-length(tokenize($fullPath, '/')[last()])
)
where $fullPath should be substituted with the actual string, such as:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
The following code tokenizes, removes the last token, replaces it with an empty string, and joins back.
string-join(
(
tokenize(
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
"/"
)[position() ne last()],
""
),
"/"
)
It seems to return the desired result on try.zorba-xquery.com. Does this help?

Regular expression to convert substring to link

i need a Regular Expression to convert a a string to a link.i wrote something but it doesnt work in asp.net.i couldnt solve and i am new in Regular Expression.This function converts (bkz: string) to (bkz: show.aspx?td=string)
Dim pattern As String = "<bkz[a-z0-9$-$&-&.-.ö-öı-ış-şç-çğ-ğü-ü\s]+)>"
Dim regex As New Regex(pattern, RegexOptions.IgnoreCase)
str = regex.Replace(str, "<font color=""#CC0000"">$1</font>")
Generic remarks on your code: beside the lack of opening parentheses, you do redundant things: $-$ isn't incorrect but can be simplified into $ only. Same for accented chars.
Everybody will tell you that font tag is deprecated even in plain HTML: favor span with style attribute.
And from your question and the example in the reply, I think the expression could be something like:
\(bkz: ([a-z0-9$&.öışçğü\s]+)\)
the replace string would look like:
(bkz: <span style=""color: #C00"">$1</span>)
BUT the first $1 must be actually URL encoded.
Your regexp is in trouble because of a ')' without '('
Would:
<bkz:\s+((?:.(?!>))+?.)>
work better ?
The first group would capture what you are after.
Thanks Vonc,Now it doesnt raise error but also When i assign str to a Label.Text,i cant see the link too.Forexample after i bind str to my label,it should be viewed in view-source ;
<span id="Label1">(bkz: here)</span>
But now,it is in viewsource source;
<span id="Label1">(bkz: here)</span>

Resources