How to remove punctuation from a database in MarkLogic? - xquery

I want to remove punctuation from a database of XML documents in MarkLogic. This is a preprocessing step for machine learning. I'm new to MarkLogic and I don't know how to do that. Is there an XQuery query that could remove punctuation?

To do a mass replacement of all text in the database, and take out punctuation, you could start with something that looks like this code (modified for your needs):
(: Rewrite every text node in every document; extend the character class as needed :)
for $doc in cts:search(fn:collection(), ())
for $text in $doc//text()
return xdmp:node-replace($text, text{fn:replace($text, "[\.,;]", "")})

To be honest, that task is much less expensive to do on the source text files themselves, or in MarkLogic by treating the XML as a string during the replacement process. Updating nodes one element at a time will be expensive.
Outside of MarkLogic:
use sed, awk, or a similar tool BEFORE INGESTION
Inside of MarkLogic (as a trigger, perhaps):
use xdmp:quote to change the XML to a string, then do the replacement in a single pass with fn:replace, and then make XML again with xdmp:unquote
let $new-doc := xdmp:unquote(fn:replace(xdmp:quote($doc), "[\.,;]", ""))
Then either store the result by replacing the root node with xdmp:node-replace, or store this version as a property. This all depends on whether the original (punctuated) version matters to you. Or perhaps you just want to keep the original and serve this cleansed version back to someone.
In all cases above, you have to make sure that your replacement does not murder your XML (for example, stripping ";" from the serialized string will break entity references such as &amp;). Also, be aware of the options for the functions above (like how CDATA is handled).
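As a rough sketch of the "outside of MarkLogic" route, here is one way to strip punctuation before ingestion using Python's standard library instead of sed or awk (that substitution is mine, not part of the answer above). It only touches text content, so tags and attribute values are left alone. The function name strip_punctuation and the character class [.,;] are assumptions for illustration; adjust them for your data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical character class, mirroring the "[\.,;]" used in the XQuery above.
PUNCT = re.compile(r"[.,;]")


def strip_punctuation(xml_text: str) -> str:
    """Return xml_text with punctuation removed from text content only."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # .text is the text before the first child, .tail the text after
        # the element's closing tag; together they cover all text content.
        if element.text:
            element.text = PUNCT.sub("", element.text)
        if element.tail:
            element.tail = PUNCT.sub("", element.tail)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_punctuation('<doc a="1.5"><p>Hello, world; bye.</p></doc>')
print(cleaned)
```

Note that ElementTree drops comments by default and may rewrite namespace prefixes, so treat this as a starting point rather than a drop-in preprocessing step.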
Lastly, "This is for machine learning purposes". You do not elaborate. I think many of us here have a feeling that this solution (cleansing punctuation before insert) rubs against the very grain of MarkLogic, in which you store as-is and then have awesome indexing, tokenizing, stemming, collation, and search support to find and return your data as you need. If you were to elaborate on your use case a bit, you may inspire others to give more MarkLogic-specific suggestions.

It will work if you use the 'punctuation-insensitive' option (and, if required, 'diacritic-insensitive') in cts:element-word-query().

I'm not sure if this is what you're asking, but it's technically possible to update every document in the database to remove punctuation; however, it's very expensive and I wouldn't recommend it.
Using built-in search functions, you can probably achieve the same goal without updating your documents by querying with punctuation insensitivity. For example, if you want to select documents with a title matching a string regardless of punctuation:
cts:search(//mydoc,
cts:element-word-query(xs:QName('title'), 'Moby-Dick', 'punctuation-insensitive'))
Or in an existing XQuery:
for $d in $documents
where cts:contains($d,
cts:element-word-query(xs:QName('title'), 'Moby-Dick', 'punctuation-insensitive'))
return $d/summary

Related

Encoding wildcarding, stemming, etc in simple search

We have a simple search interface which calls the search:search($query-text) function. Is there a syntax to include control for wildcarding, stemming and case sensitivity within the single text string that the function accepts? I haven't been able to find anything in the MarkLogic docs.
See the $options parameter and the <term> and <term-option> elements at https://docs.marklogic.com/search:search . There is a guide at http://developer.marklogic.com/learn/2009-07-search-api-walkthrough
and some details at http://developer.marklogic.com/learn/2009-07-search-api-walkthrough#ndbba3437f320a4a4
I don't know of any existing syntax for those options, aside from the built-in behavior of turning on wildcards when a term contains '*' or '?' and turning on case-sensitivity when the term contains capital letters.
You could develop a syntax. Implementing it might involve a custom parser along the lines of https://github.com/mblakele/xqysp then feeding the resulting cts:query into search:resolve.
Piggybacking on Eric Bloch's answer... you can always dynamically construct your options node based on input in the user interface.
For example, I often do this in order to separate the facet selection portion of the query from the text search portion and put the facet selection query in the additional-query element in the options node.

XML XQuery basic querying results

I'm here with a question that I hope can be answered, which is really quite silly and basic.
I have a file of authors in the format of:
<authorRoot>
<author>
<info tags on author>
</author>
etc
</authorRoot>
and all I wish to do is, through FLWOR, return a list where each 'author' and its information is a different value, so when I run the query, the result should come out looking like
1. <author><info>.....</info></author>
2. <author><info>.....</info></author>
etc
and I am CERTAIN that something as simple as that should just be the following code
xquery version "1.0";
for $x in //author
return $x
yet when I do so, the query result comes out as
1.<author><info>...</info></author><author><info>...</info></author><author><info>...</info></author><author><info>...</info></author><author><info>...</info></author>....etc
I'm relatively new to XQuery, and I'm using Altova XMLSpy. I've done similar exercises as basic as this (where I have a file with a similar layout and I use essentially the same code, resulting in an XQuery result page of multiple values, not just one long one), but for this file it just doesn't seem to work! Is it something with my code that I'm just not seeing? Or could it be the file, perhaps?
Thank you for whatever input you have on the situation.
Well, your reasoning is correct.
It is just a formatting issue: it seems Altova prints the entire sequence on a single line without line breaks.
You can also try it in my XQuery online tester, there you can see that the sequence is as you expected it to be.
If you watch this demo video of Altova XMLSpy and advance to 2:35 you will see how clicking on one of the toolbar buttons (which appears to be labeled "Pretty-print") will format the results of your XQuery as nicely indented XML.

How can I find documents accessed to answer my XQuery query?

I have the following objective: I want to find which documents contain my data when executing any kind of XQuery or XPath. In other words, I need every document that provides result data for a given query. I am trying to do this in an eXist-db environment, but I suppose there should be something at the XQuery level.
I found the op:context-document() operator, which seems to have the functionality I want, yet, as an operator, it is not available to me. fn:document-uri also does not do the trick, as its $arg must be a document node; otherwise it returns an empty sequence.
Do you have any idea in mind? All the assistance is highly appreciated.
fn:base-uri() may help; it returns the base URI property of a node:
for $d in doc('....')/your[query]
return base-uri($d)
You can also use it to filter your documents for specific types:
collection('/path/to/documents')[ends-with(base-uri(), '.xml')]
Use the standard XPath / XQuery function collection() .
For example, using Saxon:
collection('file:///a/b/c/d?select=*.xml')[yourBooleanExpression]
selects the document nodes of all XML documents, residing in the /a/b/c/d directory of the filesystem, that satisfy your criteria (yourBooleanExpression evaluates to true())

SQLite - hex value to char or string

I'm writing a tool to exploit SQL injections. I'm trying to add support for SQLite now and I'm facing a problem: if I need to insert a string but quotes are escaped, in MySQL I can use 0x65..., or in Postgres CHR(65)||.... But in SQLite I can't find any way of doing this without using quotes.
Can anyone help me?
Thanks in advance
I don't believe there's a general solution. You may be able to assemble your string using parlor tricks if it contains the right characters. E.g., substr(quote(hex(0)),1,1) will return "'", upper(substr(typeof(cast(0 as text)),3,1)) will return "X", etc. I doubt you can get the whole alphabet this way, but it might be enough for whatever injection you're planning.
I don't know of an equivalent; however, you can check the documentation to see if there is anything you can use:
http://www.sqlite.org/lang_corefunc.html
http://www.sqlite.org/lang_aggfunc.html
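If you want to check the substr/quote/typeof expressions mentioned above against your own SQLite build, a minimal sketch using Python's built-in sqlite3 module and an in-memory database might look like this. The helper name eval_sql is hypothetical and exists only for this illustration.

```python
import sqlite3


def eval_sql(expr: str):
    """Evaluate a single SQL expression in a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute("SELECT " + expr).fetchone()[0]
    finally:
        conn.close()


# hex(0) yields the text 30, quote() wraps it in single quotes,
# and substr(..., 1, 1) keeps only the leading quote character.
quote_char = eval_sql("substr(quote(hex(0)),1,1)")

# typeof(cast(0 as text)) is the string text; its third character is x.
upper_x = eval_sql("upper(substr(typeof(cast(0 as text)),3,1))")

print(repr(quote_char), repr(upper_x))
```

Neither expression contains a quote character itself, which is the point of the trick.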

A Minor, but annoying niggle - Why does ASP.Net set SQL Server Guids to lowercase?

I'm doing some client-side stuff with Javascript/JQuery with .Net controls which expose their GUID/UniqueIdentifier IDs on the front end to allow them to be manipulated. During debugging something is driving me crazy: The GUIDs in the db are stored in uppercase, however by the time they make it to the front end they're in lowercase.
This means I can't quickly copy and paste IDs into the browser's console to execute JS on the fly when devving/debugging. I have found a just-about-workable way of doing this but I was wondering if anyone knew why this behaviour is the case and whether there is any way of forcing GUIDs to stay uppercase.
According to the MSDN docs, the Guid.ToString() method will produce a lowercase string.
As to why it does that - apparently RFC 4122 states it should be this way.
The hexadecimal values "a" through "f" are output as lower case characters and are case insensitive on input.
Also check this question on SO - net-guid-uppercase-string-format.
So the best thing you can do is to call ToUpper() on your GUID strings, and add an extension method as shown in the other answer.
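The lowercase-on-output, case-insensitive-on-input convention is not specific to .NET. As a quick illustration outside the .NET stack, here is a sketch using Python's standard uuid module (my choice of language, and the GUID value is just an arbitrary example): the string form comes out lowercase, and the upper- and lowercase spellings parse to the same value.

```python
import uuid

# Arbitrary example value, written in uppercase as it might appear in the database.
g = uuid.UUID("6F9619FF-8B86-D011-B42D-00C04FC964FF")

lower_form = str(g)           # the standard string form is lowercase
upper_form = lower_form.upper()

# Parsing is case-insensitive, so both spellings identify the same GUID.
same = uuid.UUID(upper_form) == uuid.UUID(lower_form)

print(lower_form, upper_form, same)
```

So if the only goal is matching IDs while debugging, comparing them case-insensitively may be enough without forcing uppercase everywhere.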
If you're using an Eval template, then I'd see if you can do this via an Extension method.
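Something like:
using System;
public static class GuidExtensions
{
    // Extension methods must live in a static class. An empty format falls back to the default "D" layout.
    public static string ToUpperString(this Guid guid, string format = "")
    {
        return guid.ToString(format).ToUpper();
    }
}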
And then in your Eval block,
myGuid.ToUpperString("B")
Or however you need it to look.
I'm on my Mac at the moment so I can't test that, but it should work if you've got the right .Net version.

Resources