I want to search for a phrase with search:search and highlight that phrase (not the individual words). For example, if I search with search:search("lease coral"), I get the following output:
<search:match path="fn:doc(&quot;abc.xml&quot;)/*:text">testing <search:highlight>Lease</search:highlight> <search:highlight>CORAL</search:highlight></search:match>
It is highlighting lease and coral separately, but I want it to highlight "Lease Coral" together, as a single phrase. Is there any way to get this result?
The answer is at http://docs.marklogic.com/guide/search-dev/search-api#id_44520
"any phrase" Anything within the double-quote marks is treated as a phrase. The example matches documents having the phrase "any phrase" (without the double-quote marks).
You can experiment with this using search:parse:
import module namespace search = "http://marklogic.com/appservices/search"
at "/MarkLogic/appservices/search/search.xqy";
search:parse('high time')
=>
<cts:and-query strength="20" qtextjoin="" qtextgroup="( )" xmlns:cts="http://marklogic.com/cts">
  <cts:word-query qtextref="cts:text">
    <cts:text>high</cts:text>
  </cts:word-query>
  <cts:word-query qtextref="cts:text">
    <cts:text>time</cts:text>
  </cts:word-query>
</cts:and-query>
That's an AND of two word-query terms. Now try this:
import module namespace search = "http://marklogic.com/appservices/search"
at "/MarkLogic/appservices/search/search.xqy";
search:parse('"high time"')
=>
<cts:word-query qtextpre="&quot;" qtextref="cts:text" qtextpost="&quot;" xmlns:cts="http://marklogic.com/cts">
  <cts:text>high time</cts:text>
</cts:word-query>
That's a single word-query term - but its text is the whole phrase, which is what you want. Note the nested quotes in the call to search:parse: the outer single quotes delimit the XQuery string, and the inner double quotes mark the phrase.
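So passing the quoted phrase to search:search should return matches where the whole phrase, not each word, is wrapped in a single search:highlight. A minimal sketch using your original query text (untested against your data):

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: the inner double quotes make the Search API treat "lease coral" as one phrase :)
search:search('"lease coral"')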
Related
I have $text = "Hello 😀😃😄 💜 🙏🏻 🦦üäö$"
I want to remove just the emojis from the text using XQuery. How can I do that?
Expected result : "Hello üäö$"
I tried to use:
replace($text, '\p{IsEmoticons}+', '')
but it didn't work.
It just removed the smileys.
Result now: "Hello 💜 🙏🏻 🦦üäö$"
Expected result : "Hello üäö$"
Thanks in advance :)
I outlined the approach in my answer to the original question, which I updated based on your comment asking about how to strip out 💜.
Quoting from that expanded answer:
The "Emoticons" block doesn't contain all characters commonly associated with "emoji." For example, 💜 (Purple Heart, U+1F49C), according to a site like https://www.compart.com/en/unicode/U+1F49C that lets you look up Unicode character information, is from:
Miscellaneous Symbols and Pictographs, U+1F300 - U+1F5FF
This block is not available in XPath or XQuery processors, since it is neither listed in the XML Schema 1.0 spec nor in the list of Unicode block names for use in XSD regular expressions, the blocks that XPath and XQuery processors conforming to XML Schema 1.1 are required to support.
For characters from blocks not available in XPath or XQuery, you can manually construct character classes. For example, given the purple heart character above, we can match it as follows:
replace("Purple 💜 heart", "[🌀-🗿]", "")
This returns the expected result (the heart character is stripped; the spaces around it remain):
Purple  heart
This approach can be applied to 🙏🏻, 🦦, or any other character:
Locate the character's Unicode block.
Craft your regular expression with the block name (if available in XPath) or a character class built from the block's range, as in the sketch below.
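For example, applying those two steps to 🦦: a lookup of the same kind puts the Otter character (U+1F9A6) in the Supplemental Symbols and Pictographs block, U+1F900 - U+1F9FF, which also has no Is... block escape in XSD regexes, so we build the character class from the block's first and last code points (a sketch with a made-up sample string; verify the block boundaries for your own characters):

replace("An 🦦 appears", "[&#x1F900;-&#x1F9FF;]", "")

This strips the otter character, leaving its surrounding spaces, just as in the purple heart example.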
Alternatively, rather than locating the blocks of characters you want to strip out, you could identify the blocks of characters you want to preserve. For example, given the example string in the original post, perhaps the goal is to preserve only those characters in the "Basic Latin" block. To do so, we can match characters NOT in this block via the \P Category Escape:
xquery version "3.1";
let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üäö$"
return
replace($text, "\P{IsBasicLatin}", "")
This query returns:
Hello $
Notice that this has stripped out the characters with diacritics, which perhaps isn't desired. These characters belong to the Latin-1 Supplement block. To preserve characters from both the Basic Latin and Latin-1 Supplement blocks, we'd need to adjust the query as follows:
xquery version "3.1";
let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üäö$"
return
replace($text, "[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]", "")
... which returns:
Hello üäö$
This now preserves the characters with diacritics.
To be precise about the characters you preserve or remove, you need to consult the Unicode blocks and charts.
I have a list of words and a list of stem rules.
I need to stem the words whose suffixes are in the stem rules list. I got a hint from a friend that I can use pipeline methods.
For example, if I have:
stem=['less','ship','ing','les','ly','es','s']
text=['friends','friendly','keeping','friendship']
I should get: 'friend', 'friend', 'keep', 'friend'
You can find and remove those patterns using regular expressions (the re module):
import re
text = ['friends', 'friendly', 'keeping', 'friendship']
stems = [
    # strip a matching suffix; the trailing $ anchors the pattern to the end of the word
    re.sub(r'(less|ship|ing|les|ly|es|s)$', '', word)
    for word in text
]
print(stems)
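With the suffix-anchored pattern this prints the expected stems:
['friend', 'friend', 'keep', 'friend']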
I wonder whether someone can help me please.
I have the following URI in GA: /invite/accept-invitation/accepted/B
Which I'd like to change to: /invite/accept-invitation/accepted
I've tried a 'Search and Replace' filter as follows:
Search String - /invite/accept-invitation/accepted/*
Replace String - /invite/accept-invitation/accepted
But the result I get is:
/inviteaccept-invitation/accepted/B
Could someone tell me where I've gone wrong with this please?
Many thanks and kind regards
Chris
The Google Analytics "Search and Replace" filter uses regular expressions. More precisely:
Replace string is either a regular string or it can refer to group patterns in the search expression using backslash-escaped single digits (\0 to \9).
More details are available in the filter settings UI, which also links to the relevant documentation.
So in your case, the search string would be something like this:
\/invite\/accept-invitation\/accepted\/\w+
In this expression each / is escaped with a backslash. The last part of the URI is matched with \w+, which
matches any word character (equal to [a-zA-Z0-9_]), between one and unlimited times, as many times as possible.
The Replace string doesn't have to be a regular expression, so your original version can be used:
/invite/accept-invitation/accepted
Putting this together in the filter settings gives the desired output in my test view.
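For reference, a rough sketch of the filter configuration (the field name comes from the standard GA filter UI; adjust it to the field you are actually filtering on):

Filter Type:    Custom > Search and Replace
Field:          Request URI
Search String:  \/invite\/accept-invitation\/accepted\/\w+
Replace String: /invite/accept-invitation/accepted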
I have a question.
My text file contains lines such as:
1.1 Description.
This is the description.
1.1.1 Quality Assurance
Random sentence.
1.6.1 Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1 Description
1.1.1 Quality Assurance
1.6.1 Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2<-grep("^[0-9.]+", txt1, value = TRUE)
file<-write(txt2, "text3.txt")
which results in:
1.1 Description.
1.1.1 Quality Assurance
1.6.1 Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means that if your regular expression matches anything in a line, the whole line is returned. Your regular expression matches numbers at the beginning of a line, so every line that begins with a number gets selected.
It seems that your goal is not to return the whole line, but only the part up to the first period or the end of the line.
So, you need to adjust the regular expression to be more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It selects numbers with dots, followed by a space, followed by anything, and stops matching things when a . comes or the line ends. I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is extracting from the given lines only the matching portion, and not the whole line in which the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which helps us extract those matches (perl = TRUE enables the lookahead):
txt1 <- readLines("text1.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(?=\\.|$)", txt1, perl = TRUE))
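Assuming text1.txt contains the sample lines above, this returns just the numbered headings:

[1] "1.1 Description"         "1.1.1 Quality Assurance" "1.6.1 Quality Control"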
I need to find a special character that, if I put it in the middle of a word, the SQLite FTS MATCH will ignore as if it were not there, e.g.:
Text body: book's
If my match string is 'books', I need "book's" to be returned as a result,
with no problem using either the porter or the simple tokenizer.
I tried many characters for this, like book!s, book?s, book|s, book,s, book:s…, but when searching with MATCH for 'books', none of them is returned.
I don't understand why.
I am using contentless FTS4 tables and external content FTS4 tables. My text body has many such characters in its words, and they should be ignored when searching.
I cannot change the MATCH query because I do not know where in the word the special character is. Also, I need to keep the original word length equal to the length of the FTS index word in order to use matchinfo() or snippet(); as such, I cannot remove these characters from the text body.
The default tokenizers do not ignore punctuation characters but treat them as word separators.
So the text body or match string book's will end up as two words, book and s.
These will never match a single word like books.
To ignore characters like ', you have to install your own custom tokenizer.
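If you want to see the separator behaviour described above, here is a small demonstration with the default tokenizer (the table and column names are made up for the example; contentless and external content tables behave the same way at the tokenizer level):

CREATE VIRTUAL TABLE notes USING fts4(body, tokenize=simple);
INSERT INTO notes(body) VALUES ('book''s');

-- "book's" was indexed as two tokens, "book" and "s"
SELECT * FROM notes WHERE notes MATCH 'books';     -- no rows
SELECT * FROM notes WHERE notes MATCH 'book';      -- returns the row
SELECT * FROM notes WHERE notes MATCH '"book s"';  -- the two-token phrase also returns the row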