XQuery - Why is there a difference in the results?

<Docs>
<Doc>
<Title>Electromagnetic Fields</Title>
<Info>
<Vol name="Physics"/>
<Year>2006</Year>
</Info>
<SD>
<Info>
<Para>blah blah blah.<P>blh blah blah.</P></Para>
</Info>
</SD>
<LD>
<Info>
<Para>blah blah blah.<P>blah blah blah.</P></Para>
<Para>blah blah blah.<P>blah blah blah.</P></Para>
<Para>blah blah blah.<P>emf waves blah.</P></Para>
<Para>blah blah blah.<B>emf waves</B> blah.</Para>
<Para>blah blah blah.<P>emf waves blah.</P></Para>
<Para>blah waves blah.<B>emf</B> waves blah.</Para>
<Para>emf blah blah.<I>waves blah.</I></Para>
<Para>blah blah blah.<B>emf waves</B> blah.</Para>
<Para>blah blah blah.<P><I>emf</I> waves blah.</P></Para>
</Info>
</LD>
</Doc>
</Docs>
Query 1 -
for $x in ft:search("Article", ("emf","waves"), map { 'mode': 'all words' })/ancestor::*:Doc
return $x/Title
I am getting 62 hits.
Query 2 -
for $x in ft:search("Article", ("emf","waves"), map { 'mode': 'all words' })
return $x/ancestor::*:Doc/Title
I am getting 159 hits.
Query 3 -
for $x in doc("Article")/Doc[Info[Vol/@name = "Physics" and Year ge "2006" and Year le "2010"]]
[SD/Info/Para/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/text() contains text {"emf","waves"} all words or
LD/Info/Para/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/B/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/I/text() contains text {"emf","waves"} all words or
SD/Info/Para/P/U/text() contains text {"emf","waves"} all words]
return $x/Title
This results in 224 hits. In the third query, I am listing all the element paths that actually occur in the documents; I, B and U stand for italic, bold and underlined text.
Why the difference?

Queries 1 and 2 look much the same, but the path expression in Q1 yields Doc elements. So if there are multiple matching nodes below a single Doc, that Doc counts just once in Q1, whereas each matching node is counted individually in Q2. This is because the node sequence resulting from a path expression is, by definition, duplicate-free.
Q3 is different again: while Q1 and Q2 depend on the properties of a full-text index, Q3 doesn't. If, for example, the index is case-sensitive, you will get fewer results from it than from a contains text predicate.
So from the quoted counts, I'd assume that the full-text index finds 159 matching nodes in 62 documents, while being more restrictive than a plain contains text.

Your first query searches for Doc elements which have a certain property, and returns one result for each such Doc element.
Your second query searches for nodes of any kind which have a (related) property, and returns one result for each such node.
Your third query searches for text nodes which have another (related) property.
Whenever there are Doc elements containing more than one node matching the full-text search criterion, the first and second queries will return different numbers of hits. And similarly for the third query, vis-a-vis the others.
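The counting difference can be sketched outside XQuery as well. Here is a minimal Python sketch (the hit and document names are invented) of counting one result per matching node versus one per distinct ancestor document:

```python
# Three full-text hits, two of which live under the same Doc (invented names).
matches = ["hit1", "hit2", "hit3"]
ancestor_doc = {"hit1": "docA", "hit2": "docA", "hit3": "docB"}

# Q2 style: one result per matching node.
per_node = [ancestor_doc[m] for m in matches]

# Q1 style: a path expression yields a duplicate-free node sequence,
# so each Doc counts once, however many hits it contains.
per_doc = list(dict.fromkeys(per_node))

print(len(per_node))  # 3
print(len(per_doc))   # 2
```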

Related

How to add text at beginning of Word document?

I'd like to add content at the beginning of a Word document. Important note: this document already has content; I want to add content BEFORE the text that is already in the file. Something like:
# input file text
blah blah blah
blah blah blah
# output file text
This added paragraph1
blah blah blah
blah blah blah
I'm using the officer package in R. I'm trying to open a file, add a line at the beginning, and save it under a different name:
library('officer')
sample_doc <- read_docx("inputfile.docx")
cursor_begin(sample_doc)
sample_doc <- body_add_par(sample_doc, "This added paragraph1")
print(sample_doc, target = "outputfile.docx")
Unfortunately, the cursor_begin command doesn't seem to work: the new paragraph is appended to the end of the document. I don't know if I'm misreading the documentation. Could someone give me a hint?
EDIT:
There was a suggestion below to use pos="before" to indicate where to insert the text - before or after the cursor. For example
body_add_par(sample_doc, "This added paragraph1", pos="before")
Unfortunately, this solution works only for documents with a single paragraph of text. With one paragraph, setting pos="before" moves the new text up a line whether or not you use cursor_begin. With more than one paragraph it still gives something like:
# input file text
blah blah blah
blah blah blah
# output file text
blah blah blah
This added paragraph1
blah blah blah
so it is not the solution I'm looking for.
Actually, I think cursor_begin is working, but maybe not the way you think: it selects the first paragraph. But when you use body_add_par, the default is pos="after"; you need "before". Also, when you call cursor_begin you must save the result back into sample_doc.
This should work for you:
library('officer')
sample_doc <- read_docx("inputfile.docx")
sample_doc <- cursor_begin(sample_doc)
sample_doc <- body_add_par(sample_doc,
"This added paragraph1", pos="before")
print(sample_doc, target = "outputfile.docx")

Scrapy - Remove comma and whitespace from getall() results

Would there be an effective way to remove the commas directly from the results yielded by getall()?
As an example, the data I'm trying to retrieve is in this format:
<div>
Text 1
<br>
Text 2
<br>
Text 3
</div>
My current selector for this is:
response.xpath("//div//text()").getall()
Which does get the correct data but they come out as:
Text 1,
Text 2,
Text 3
instead of
Text 1
Text 2
Text 3
I understand that they are recognized as a list, which is where the commas come from, but is there a direct function to remove them without affecting commas within the text itself?
I'm just going to leave the solution I used in case someone needs it:
tc = response.xpath("//div//text()").getall()  # xpath selector
tcl = "".join(tc)  # convert the list into a single string
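If the goal is a list of clean strings rather than one mashed-together string, stripping each item is usually safer than a plain join. A small sketch (the raw list stands in for what getall() returns):

```python
# Simulated getall() output: text nodes with surrounding whitespace.
raw = ["\n  Text 1\n  ", "\n  Text 2\n  ", "\n  Text 3\n  "]

# Strip whitespace and drop empty entries; join only if one string is wanted.
cleaned = [t.strip() for t in raw if t.strip()]
print(cleaned)             # ['Text 1', 'Text 2', 'Text 3']
print("\n".join(cleaned))  # one string, one line per text node
```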

How to select part of a text in R

I have an HTML file which consists of 5 different articles, and I would like to extract each of these articles separately in R and run some analysis per article. Each article starts with <doc> and ends with </doc> and also has a document number. Example:
<doc>
<docno> NA123455-0001 </docno>
<docid> 1 </docid>
<p>
NASA one-year astronaut Scott Kelly speaks after coming home to Houston on
March 3, 2016. Behind Kelly,
from left to right: U.S. Second Lady Jill Biden; Kelly's identical twin
brother, Mark;
John Holdren, Assistant to the President for Science and ...
</p>
</doc>
<doc>
<docno> KA25637-1215 </docno>
<docid> 65 </docid>
<date>
<p>
February 1, 2014, Sunday
</p>
</date>
<section>
<p>
WASHINGTON -- Former Republican presidential nominee Mitt Romney
is charging into the increasingly divisive 2016 GOP
White House sweepstakes Thursday with a harsh takedown of front-runner
Donald Trump, calling him a "phony" and exhorting fellow
</p>
</type>
</doc>
<doc>
<docno> JN1234567-1225 </docno>
<docid> 67 </docid>
<date>
<p>
March 5, 2003
</p>
</date>
<section>
<p>
SEOUL—New U.S.-led efforts to cut funding for North Korea's nuclear weapons
program through targeted
sanctions risk faltering because of Pyongyang's willingness to divert all
available resources to its
military, even at the risk of economic collapse ...
</p>
</doc>
I have read the file with readLines() and combined all the lines using
articles <- paste(articles, collapse = " ")
I would like to select the first article, which is between <doc> and </doc>, and assign it to article1, the second one to article2, and so on.
Could you please advise how to construct the function in order to select each one of these articles separately?
You could use strsplit, which splits strings on whatever text or regex you give it. It will give you a list with one item for each part of the string between the splitting string, which you can then subset into different variables, if you like. (You could use other regex functions, as well, if you prefer.)
splitArticles <- strsplit(articles, '<doc>')
You'll still need to chop out the </doc> tags (plus a lot of other cruft, if you just want the text), but it's a start.
A more typical way to do the same thing would be to use a package for html scraping/parsing. Using the rvest package, you'd need something like
library(rvest)
read_html(articles) %>% html_nodes('doc') %>% html_text()
which will give you a character vector of the contents of <doc> tags. It may take more cleaning, especially if there are whitespace characters that you need to clean. Picking your selector carefully for html_nodes may help you avoid some of this; it looks like if you used p instead of doc, you're more likely to just get the text.
The simplest solution is to use strsplit:
art_list <- unlist(strsplit(articles, "<doc>"))
art_list <- art_list[art_list != ""]
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
for (i in seq_along(art_list)) {
  assign(paste("article", ids[i], sep = "_"),
         gsub("</doc>.*", "", art_list[i]))
}
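For comparison, the same split-and-extract idea can be sketched in Python with a non-greedy regex (the html string here is an invented stand-in for the file contents read from disk):

```python
import re

# Invented stand-in for the file contents.
html = ("<doc><docno> NA123455-0001 </docno><p>first article</p></doc>"
        "<doc><docno> KA25637-1215 </docno><p>second article</p></doc>")

# Non-greedy match of everything between <doc> and </doc>.
articles = re.findall(r"<doc>(.*?)</doc>", html, flags=re.DOTALL)
print(len(articles))  # 2
```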

Match sentences containing a given word

I have a sentence column in my table. I wish to select all sentences which contain a given word.
Words contain only the following letters: a-z, áéíóú
The only other character is a single space, separating each word in the sentence. There are no spaces at the start or end of a sentence. So sentences look like this:
"i am here"
"no im here"
Selecting sentences containing the word "i" should match only the first sentence above.
How should I select these rows from my table?
Based on the information you have provided, there are four cases you need to cover.
The word you are searching for is the first word in the sentence. In this case there is no space before the word, but there is a space after it.
The word you are searching for is neither the first nor the last word in the sentence. In this case there is a space on either side of the word.
The word you are searching for is the last word in the sentence. In this case there is a space before the word but none after it.
The sentence consists of that word alone. In this case there are no spaces at all.
So your query would look something like this ...
SELECT * FROM YourTableName
WHERE SentenceCol LIKE 'YOUR-SEARCH-WORD %'
OR SentenceCol LIKE '% YOUR-SEARCH-WORD %'
OR SentenceCol LIKE '% YOUR-SEARCH-WORD'
OR SentenceCol = 'YOUR-SEARCH-WORD'
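The same whole-word test can be sketched in Python with a regex, where start-of-string and end-of-string play the role of the missing spaces in the LIKE patterns:

```python
import re

def contains_word(sentence, word):
    # A space (or the start/end of the sentence) must appear on both sides.
    pattern = r"(^| )" + re.escape(word) + r"( |$)"
    return re.search(pattern, sentence) is not None

print(contains_word("i am here", "i"))   # True
print(contains_word("no im here", "i"))  # False
```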

SQLite Virtual Table Match Escape character

I'm working on an applications where the indices are stored in a SQLite FTS3 virtual table. We are implementing full text matches which means we send through queries like:
select * from blah where term match '<insert term here>'
That's all well and good until the term we want to match contains a hyphen, because the SQLite match syntax interprets bacon-and-eggs as bacon, not and, not eggs.
Does anyone know of an escape character to make the fts table ignore the hyphen? I tried adding an ESCAPE '\' clause and using \ before each hyphen but the match statement rejects that syntax.
Thanks.
There are lots of strings that FTS considers "special" and that need to be escaped. The easiest way to handle them is to add double quotes around the string you want to search for.
Example 1: Say the term you want to search for is bacon-and-eggs.
select * from blah where term match '"bacon-and-eggs"'
This also treats the whole string as a phrase, so the same words in a different order won't generate a hit. To get around that, you can quote each word separately.
Example 2: Say the term you want to search for is bacon and eggs.
select * from blah where term match '"bacon" "and" "eggs"'
Hope this helps someone!
This question is old and involves fts3, but I thought I would add an update to show how you can do this with the newer fts5.
Let's start by setting up a test environment on the command line:
$ sqlite3 ":memory:"
Then creating an fts5 table that can handle the dash:
sqlite> CREATE VIRTUAL TABLE IF NOT EXISTS blah USING fts5(term, tokenize="unicode61 tokenchars '-'");
Notice the subtle use of double and single quotes in the tokenize value.
With setup out of the way, let's add some values to search for:
sqlite> INSERT INTO blah (term) VALUES ('bacon-and-eggs');
sqlite> INSERT INTO blah (term) VALUES ('bacon');
sqlite> INSERT INTO blah (term) VALUES ('eggs');
Then let's actually search for them:
sqlite> SELECT * from blah WHERE term MATCH '"bacon-and-eggs"';
bacon-and-eggs
sqlite> SELECT * from blah WHERE term MATCH '"bacon"*';
bacon-and-eggs
bacon
Once again, notice the subtle use of double and single quotes for the search term.
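The same fts5 setup can be reproduced from Python's built-in sqlite3 module (assuming the bundled SQLite was compiled with FTS5, which is typical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# tokenchars '-' makes the hyphen part of a token instead of a separator.
con.execute("""CREATE VIRTUAL TABLE blah
               USING fts5(term, tokenize="unicode61 tokenchars '-'")""")
con.executemany("INSERT INTO blah (term) VALUES (?)",
                [("bacon-and-eggs",), ("bacon",), ("eggs",)])

# Double quotes keep the hyphenated term as a single phrase token.
hits = [row[0] for row in
        con.execute("SELECT term FROM blah WHERE term MATCH '\"bacon-and-eggs\"'")]
print(hits)  # ['bacon-and-eggs']
```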
FTS ignores all non-alphanumeric characters in the index. Before sending the search term to FTS you can convert it to
bacon NEAR/0 and NEAR/0 eggs
to search for adjacent words (keep and in lowercase so it is treated as a term rather than as the boolean operator).
