XQuery : Search for a combination of different letters - xquery

I have following xml structure
<image>
<id>88091942</id>
<imageType>Primary</imageType>
<format>pdf</format>
<status timestamp="2019-11-20T12:20:02.616Z">Accepted</status>
<size/>
<languageCode>
<val>eng</val>
</languageCode>
<comments/>
<effectiveDate>2013-01-01T00:00:00.000Z</effectiveDate>
<extractedText> Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ABCDE remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including B.C.D.E versions of Lorem Ipsum.</extractedText>
</image>
I am trying to search for ABCDE in extractedText. The text being searched for can appear in any format, e.g. BCDE, A.B.C.D.E., Abcde, B.C.D.E, or any other combination. If any of the combinations is present, the result should return the text; otherwise an empty string.
Below is the code snippet that I am trying to use:
let $id := $image/cd:id/string()
let $text := $extractedText[contains(., "*ABCDE*" OR "*A.B.C.D.E.*&quot OR "ABCDE01"")]/string()
return fn:string-join(($id,$text),"!!!!")
I get the following result:
88091942
whereas I should be getting:
88091942!!!!Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ABCDE remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including B.C.D.E versions of Lorem Ipsum.
Any help is much appreciated.

The original requestor has mentioned that the content resides within MarkLogic, so I will give a MarkLogic-centric response for these use cases based on the information available. Tuning a full implementation, including case sensitivity, may require more understanding of the underlying MarkLogic features:
Case: "A.B C", "A B D.C", etc.
Under the hood, all content is indexed. The word indexes tokenize content and apply rules about case sensitivity and diacritic sensitivity when storing it. Word boundaries are based on a sane set of defaults related to whitespace and punctuation marks. That means the data is already well prepared for the items with some separation between characters. We can see this by analyzing a sample of the above:
xdmp:describe(cts:tokenize("A.B D.C"))
This shows that spaces are ignored and punctuation is recognized; for the sample, the results are:
(cts:word("A"), cts:punctuation("."), cts:word("B"), ...)
This means we simply need to take one more thing into account: the relation of each word to the others in the text. For that, we ensure that the words (A, B, C, D) are near each other. The database setting called word positions may help performance here; I left it off for my sample of 300k docs. Our query is then as simple as:
cts:search(doc(), cts:near-query(cts:word-query(("A", "B", "C", "D")), 1))
Breaking it down:
doc() is just the easiest searchable expression; for the JavaScript API, you would not have this.
Then we ask for word queries on the words A, B, C, D. The inner workings of cts:word-query() will already expand this into a list of or-queries.
This is all constrained by position: the word-query results must be within one position of each other.
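To make the position constraint concrete, here is a rough Python analogue (illustration only: near_all is a hypothetical helper, and MarkLogic's tokenizer and near-query semantics are richer than this sketch):

```python
import re

def near_all(text, words, distance=1):
    # crude analogue of cts:near-query: split on punctuation/whitespace,
    # then require each matched token to sit within `distance` of the next
    tokens = [t for t in re.split(r"\W+", text) if t]
    positions = [i for i, t in enumerate(tokens) if t in words]
    if len(positions) < len(words):
        return False
    return all(b - a <= distance for a, b in zip(positions, positions[1:]))

near_all("prefix A.B D.C suffix", {"A", "B", "C", "D"})  # True: tokens are adjacent
```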
MarkLogic has many features. For the above, where I had whitespace and punctuation available, I simply used the word-search features out of the box.
Case: "ABCD"
This use case is completely different: it concerns a full word. I could start doing some heavy work by configuring wildcard indexes down to single characters. That would probably work, but it would be expensive. Instead, I think of it differently. The sample looks like there is a finite set of combinations. If that is the case, the fastest solution would probably be to calculate the permutations of ABCD (ACBD, etc.) in advance and feed them to a word or value query.
You could also marry that approach with a lexicon on the element, expanding the terms to only those actually present in the system, and still just pass them as a sequence into a search.
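The precomputation step is cheap for a handful of letters; a quick sketch in plain Python (not MarkLogic code):

```python
from itertools import permutations

# enumerate every ordering of the letters once, up front; the resulting
# sequence can then be passed into a word or value query
terms = ["".join(p) for p in permutations("ABCD")]
len(terms)  # 24 orderings for 4 letters
```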

contains(., "A OR B OR C") is searching for the literal string "A OR B OR C".
You want contains(., "A") or contains(., "B") or contains(., "C").
Alternatively, you can reformulate that as matches(., "A|B|C").
Or if the strings are related, like ABCDE, A.B.C.D.E., Abcde, B.C.D.E, then you could try something like
contains(. => upper-case() => translate('.', ''), "ABCDE")
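The same normalize-then-contains idea, sketched in Python for illustration (normalized_contains is a hypothetical name, not part of any API):

```python
def normalized_contains(text, needle="ABCDE"):
    # mirrors upper-case() followed by translate('.', ''):
    # uppercase the text, drop the dots, then do a plain substring test
    return needle in text.upper().replace(".", "")

normalized_contains("A.B.C.D.E. versions")  # True
```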

Your examples are not very clear, but it looks like you want this type of regular expression in your XQuery:
let $id := /image/id
let $text := /image/extractedText[matches(.,'((a|A)\.)?(b|B)\.(c|C)\.(d|D)\.(e|E)')]
return fn:string-join(($id,$text),"!!!!")
Output:
88091942!!!! Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ABCDE remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including B.C.D.E versions of Lorem Ipsum.
If you want an XPath/XQuery-specific regular-expression flavor, you might use flags as well, as in matches(., '(A\.)?B\.C\.D\.E', 'i') to ignore case.
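For comparison, the equivalent pattern and flag in Python's re module (an illustration only; Python's regex flavor differs slightly from the XPath/XQuery one):

```python
import re

# (A\.)?B\.C\.D\.E with case-insensitive matching, like matches(..., 'i')
pattern = re.compile(r"(A\.)?B\.C\.D\.E", re.IGNORECASE)
bool(pattern.search("including B.C.D.E versions of Lorem Ipsum"))  # True
```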

R - Searching text with NEAR regex

I have a vector containing text, broken up, like the following:
words = c("Lorem Ipsum is simply dummy text of the",
          "printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s",
          "when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
          "It has survived not only five ,centuries, but also the leap into electronic")
I am using the following regex to find where the words "dummy" and "text" appear within 6 words of each other:
grep("\b(?:dummy\\W+(?:\\w+\\W+){1,6}?text|text\\W+(?:\\w+\\W+){1,6}?dummy)\b", words)
However, it's returning no matches despite 'dummy text' appearing in the first element.
Any idea where I am going wrong?
The \b in "\b" matches a backspace char; you need to double-escape it, \\b, to make it match a word boundary.
After fixing the typo, you need to pay attention to the limiting quantifiers. {1,6}? is a lazy quantifier matching one to six occurrences (as few as possible, but still as many as necessary to find a valid match) of the modified subpattern. It means there must be at least one word between dummy and text.
So, you need to use
pattern <- "\\b(?:dummy\\W+(?:\\w+\\W+){0,6}text|text\\W+(?:\\w+\\W+){0,6}dummy)\\b"
Details
\b - a word boundary
(?: - start of a non-capturing group
dummy - a dummy word
\W+ - one or more non-word chars
(?:\w+\W+){0,6} - zero to six occurrences of one or more word chars followed with one or more non-word chars
text - a text word
| - or
text - a text word
\W+ - one or more non-word chars
(?:\w+\W+){0,6} - zero to six occurrences of one or more word chars followed with one or more non-word chars
dummy - a dummy word
) - end of the non-capturing group
\b - a word boundary
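Since the fixed pattern uses plain PCRE-style syntax, it behaves the same in other engines; as a quick sanity check, here it is in Python (the double backslashes are only an R string-literal concern):

```python
import re

# the corrected pattern: word boundary, then "dummy" and "text" in either
# order with at most six words between them
pattern = re.compile(
    r"\b(?:dummy\W+(?:\w+\W+){0,6}text|text\W+(?:\w+\W+){0,6}dummy)\b"
)
bool(pattern.search("Lorem Ipsum is simply dummy text of the"))  # True
```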

How to extend a sprache parser to cope with leading and trailing free text

I have a sprache parser that successfully recognizes a variety of complex strings.
I now have to find these strings if they are embedded in free text. Is this possible?
For example, "FJ21 [7-20]" and "7.2x1.2 FULL" are examples of strings that my parser can match.
I need to be able to find them within text such as:
"The quick brown FJ21 [7-20] jumps over the lazy 7.2x1.2 FULL"

Trouble finding string followed by variable whitespace and numbers in R with regex

I'm attempting to use some regex to find the lines in a series of documents so I can accurately subset the information. First, some sample data.
text <- c("BAR 02/ BLAHBLAH ",
" 27/ LOCATION: BLAH-TOWN",
" 2013 BLAH;BLAH",
" BAR 09/ 10/ BOOHAABLAH ",
" 25/ 14/ LOREM IPSUM, ",
" 2014 2014 LOREM LORE LOT",
" BAR BLAH MUH BLAH NO BLAH")
I am attempting to find the elements of the vector where BAR is followed ONLY by numbers. The number of whitespace characters is variable, but the lines I am interested in capturing always have numbers next. I am using the base R grep() function and have tried a number of approaches; no positive-lookahead configuration I have found so far seems to catch it.
Some of the things I have tried so far.
grep("(BAR\\b(?=\\s*[0-9]))", text, perl= T)
grep("(BAR\\b(?=\\s*\\b[0-9]))", text, perl= T)
grep("(BAR\\b\\s*\\d\\d\/)", text, perl = T)
grep("BAR\\s*[0-9]",text,perl=T)
grep("BAR\\s*(?![^A-Za-z])",text,perl=T)
Where am I going wrong? I've heard a bit about tidyr, but none of what I've read on it shows any more promise than grep.
I will provide the answer based on your feedback. It appears you modified the character vector, changing BAR to VIOL, and introduced Unicode whitespace into the strings.
Thus, the following should work in your case:
grep("(*UCP)VIOL\\s+[0-9]", text, perl=TRUE)
The (*UCP) PCRE verb will make \s match any Unicode whitespaces.
In other environments (not your case), where the TRE engine (base R's default) has Unicode-aware POSIX character classes, one might also use
grep("VIOL[[:space:]]+[0-9]", text)
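For contrast, Python's re module is Unicode-aware by default for str patterns, so \s already matches a no-break space with no (*UCP)-style verb (illustration only):

```python
import re

# \u00a0 is a no-break space, the kind of Unicode whitespace that needs
# (*UCP) in base R's PCRE mode but works out of the box here
has_match = bool(re.search(r"VIOL\s+[0-9]", "VIOL\u00a025/ LOCATION"))
```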

D7 Multiple multi value fields not showing correctly in views

I have a node type which contains a multiple value field, users can fill out multiple instances of this field. Looks something like this, where "error" is the multiple value field:
Item
  Description
  Error
    error type
    error date
Some items have multiple error entries, like:
Item A
  Lorem ipsum
  Error 1
    type X
    01-01-2014
  Error 2
    type Y
    21-03-2014
Item B
  Lorem ipsum
  Error 1
    type X
    01-04-2014
  Error 2
    type Y
    11-05-2014
Now when I want to generate a table in views, it shows 4 rows (which is correct, 1 row for every Item + error), but the corresponding error type and date are wrong:
Item | Description | Error type | Error date
A    | Lorem ipsum | Type X     | 01-01-2014
A    | Lorem ipsum | Type Y     | 01-01-2014
B    | Lorem ipsum | Type X     | 01-04-2014
B    | Lorem ipsum | Type Y     | 01-04-2014
I tried using the aggregation option and group by entity ID, but then I end up with 2 rows (item A and B).
Any suggestions?
After rigorous testing I found out I was using the wrong module for multi-value field groups. If you want to achieve the above, use the Field Collection module. Then create a relationship in Views to the field containing the field collection, and voilà: you can output all subvalues in separate rows.

XQuery - Why there is difference in result?

<Docs>
<Doc>
<Title>Electromagnetic Fields</Title>
<Info>
<Vol name="Physics"/>
<Year>2006</Year>
</Info>
<SD>
<Info>
<Para>blah blah blah.<P>blh blah blah.</P></Para>
</Info>
</SD>
<LD>
<Info>
<Para>blah blah blah.<P>blah blah blah.</P></Para>
<Para>blah blah blah.<P>blah blah blah.</P></Para>
<Para>blah blah blah.<P>emf waves blah.</P></Para>
<Para>blah blah blah.<B>emf waves</B> blah.</Para>
<Para>blah blah blah.<P>emf waves blah.</P></Para>
<Para>blah waves blah.<B>emf</B> waves blah.</Para>
<Para>emf blah blah.<I>waves blah.</I></Para>
<Para>blah blah blah.<B>emf waves</B> blah.</Para>
<Para>blah blah blah.<P><I>emf</I> waves blah.</P></Para>
</Info>
</LD>
</Doc>
</Docs>
Query 1 -
for $x in ft:search("Article", ("emf","waves"), map{'mode':='all words'})/ancestor::*:Doc
return $x/Title
I am getting 62 Hits
Query 2 -
for $x in ft:search("Article", ("emf","waves"), map{'mode':='all words'})
return $x/ancestor::*:Doc/Title
I am getting 159 Hits
Query 3 -
for $x in doc("Article")/Doc[Info[Vol/@name="Physics" and Year ge "2006" and Year le "2010"]]
[SD/Info/Para/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/text() contains text {"emf","waves"} all words or
LD/Info/Para/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/B/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/I/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/U/text() contains text {"emf","waves"} all words]
return $x/Title
This results in 224 hits. In the third query, I am using all the nodes that are actually present; I, B and U stand for italic, bold and underlined text.
Why this difference?
Queries 1 and 2 pretty much look the same, however the path expression in Q1 results in Doc elements. So if there are multiple matching nodes below a single Doc, that Doc will count just once in Q1, whereas each node is counted individually in Q2. This is due to the fact that the node sequence resulting from a path expression, by definition, is duplicate-free.
Q3 is different: while Q1 and Q2 depend on the properties of a full-text index, Q3 doesn't. If, for example, the index is case-sensitive, you'll get fewer results from it than from a contains text predicate.
So from the quoted counts, I'd assume that the text index comes up with 159 matching nodes in 62 documents, while being specified as more restrictive than a plain contains text.
Your first query searches for Doc elements which have a certain property, and returns one result for each such Doc element.
Your second query searches for nodes of any kind which have a (related) property, and returns one result for each such node.
Your third query searches for text nodes which have another (related) property.
Whenever there are Doc elements containing more than one node matching the full-text search criterion, the first and second queries will return different numbers of hits. And similarly for the third query, vis-a-vis the others.
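The counting difference can be modeled in a few lines of Python (toy data; the names are illustrative):

```python
# several matching nodes can share one Doc ancestor; a path expression
# deduplicates its result sequence, while a per-node return does not
matches = [("doc1", "para1"), ("doc1", "para2"), ("doc2", "para3")]
per_node = len(matches)                      # analogous to Query 2
per_doc = len({doc for doc, _ in matches})   # analogous to Query 1
```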
