//parent::* in XPath? - r

Consider this simple example:
library(xml2)
x <- read_xml("<body>
<p>Some <b>text</b>.</p>
<p>Some <b>other</b> <b>text</b>.</p>
<p>No bold here!</p>
</body>")
Now I want to find all the parents of the nodes containing the string "other".
To do so, I run
> xml_find_all(x, "//b[contains(.,'other')]//parent::*")
{xml_nodeset (2)}
[1] <p>Some <b>other</b> <b>text</b>.</p>
[2] <b>other</b>
I do not understand why I get the <b>other</b> element as well. In my view there is only one parent, which is the first node.
Is this a bug?

Change
//b[contains(.,'other')]//parent::*
which selects along descendant-or-self (and you don't want self) and then parent, to
//b[contains(.,'other')]/parent::*
which selects along the parent axis only. That eliminates <b>other</b> from the selection.
Or, better yet, use this XPath:
//p[b[contains(.,'other')]]
if you want to select all p elements with a b child whose string-value contains an "other" substring, or
//p[b = 'other']
if b's string-value is supposed to equal other. See also What does contains() do in XPath?
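If it helps to see the two axes side by side, the same document can be evaluated outside R, for example with Python's lxml (both xml2 and lxml sit on top of libxml2, so the results match the output above); this is purely an illustration:
# Minimal sketch with Python's lxml; the XPath strings are the ones discussed above.
from lxml import etree

doc = etree.fromstring(
    "<body>"
    "<p>Some <b>text</b>.</p>"
    "<p>Some <b>other</b> <b>text</b>.</p>"
    "<p>No bold here!</p>"
    "</body>"
)

# '//' is short for /descendant-or-self::node()/, so the text node inside
# <b>other</b> is visited too, and its parent (the <b> itself) is returned.
print([e.tag for e in doc.xpath("//b[contains(.,'other')]//parent::*")])  # ['p', 'b']

# A single '/' steps straight onto the parent axis: only the <p> is left.
print([e.tag for e in doc.xpath("//b[contains(.,'other')]/parent::*")])   # ['p']

# Or select the <p> directly via a condition on its <b> child.
print([e.tag for e in doc.xpath("//p[b[contains(.,'other')]]")])          # ['p']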

Related

QDomElement::text() without child element texts?

I have an xml like:
<a>
<b>1</b>
<c>2</c>
<d>3</d>
</a>
and a recursive function that parses the QDomDocument that wraps it. The function iterates over QDomNodes, converts them into QDomElements and calls the text() method to get the data.
Unfortunately QDomElement::text() works at the <a> level too and returns 123, so it gathers the texts of all nested elements.
I would like it to return an empty string instead, because I would rather not check the tagName() value, as there may be plenty of tag names. I would rather recognise a node by whether or not it has text inside than the other way around. Is this doable? Is there a method that will return an empty string for <a> and the text values at the <b>, <c>, <d> levels?
P.S. QDomNode::nodeValue() returns an empty string for all elements.
It seems I was wrong, because I was not iterating over the QDomNodes that cannot be converted to QDomElements. And according to this answer:
This is required by the DOM specification:
The Text interface represents the textual content (termed character data in XML) of an Element or Attr. If there is no markup inside an element's content, the text is contained in a single object implementing the Text interface that is the only child of the element. If there is markup, it is parsed into a list of elements and Text nodes that form the list of children of the element.
I have no markup inside the <b>-like elements. So at the <b> element's level I get el.childNodes().size() == 1, el.firstChild().isElement() == false, and el.firstChild().nodeValue() returns a non-empty value!
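Purely as an illustration (Python's xml.dom.minidom rather than QDom, but it follows the same DOM rule quoted above), the check boils down to asking whether an element's only child is a Text node:
# Distinguishing "leaf" elements (single Text child) from container elements,
# using Python's DOM implementation instead of QDom. The whitespace-only text
# nodes from the pretty-printed XML in the question are omitted here for brevity.
from xml.dom import Node, minidom

doc = minidom.parseString("<a><b>1</b><c>2</c><d>3</d></a>")

def own_text(el):
    # Return the element's text only when its sole child is a Text node.
    kids = el.childNodes
    if len(kids) == 1 and kids[0].nodeType == Node.TEXT_NODE:
        return kids[0].nodeValue
    return ""  # container elements such as <a> have element children instead

for el in doc.getElementsByTagName("*"):
    print(el.tagName, repr(own_text(el)))
# a ''
# b '1'
# c '2'
# d '3'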

Robot Framework: How can I have this kind of result => Page MUST (and not SHOULD) contain element

I'm new to Robot Framework and I would like some help with my issue.
On a page, I would like to verify that a word is present in a given locator. I have used the Page Should Contain Element keyword and it works partially: the elements that contain the word are found, but there is no error when the other elements (of the same type) don't contain this word.
Example (I am working on a list of sale listings):
The page contains many locators of this type :
//*[@class="resultats mode_liste ng-scope"]/div[@ng-repeat="annonce in resultats.data.annonces "]//h2/a/strong[@class="ng-binding"]
A "div" element contains "locator1" which contains "House"
A "div" element contains "locator1" which contains "House"
A "div" element contains "locator1" which contains "Box"
Etc...
So I have written the keyword
Page Should Contain Element    //*[@class="resultats mode_liste ng-scope"]/div[@ng-repeat="annonce in resultats.data.annonces "]//h2/a/strong[@class="ng-binding"][contains(., "House")]
but the result is not what I expected (which would be an error whenever a word other than "House" is found in locator1).
And I would like to have the following result: All the locator1 elements MUST contain the word "House". If the locator1 contains a different word, then the test must fail.
According to the Library documentation, you can set a "limit" parameter which allows you to set the number of elements to check for on the page.
E.g. if you wanted to check that the element appears twice on the page, you can set limit to 2:
Page Should Contain Element    //*[@class="resultats mode_liste ng-scope"]/div[@ng-repeat="annonce in resultats.data.annonces "]//h2/a/strong[@class="ng-binding"][contains(., "House")]    limit=2
From doc:
"The limit argument can used to define how many elements the page should contain. When limit is None (default) page can contain one or more elements. When limit is a number, page must contain same number of elements."
Having said that, if you're using an older version of the Library, you may need to use a different keyword like "Xpath Should Match X Times" but it is deprecated in new version
The other issue with not failing based on other words in the element is because the Xpath you're using is specifically looking at just elements that contain the text "House" therefore the other elements will be ignored completely when running this.
If you wanted to check to ensure that no other text is contained in elements and you know what they would be then you could use additional keyword
Page Should Not Contain Element //*[#class="resultats mode_liste ng-scope"]/div[#ng-repeat="annonce in resultats.data.annonces "]//h2/a/strong[#class="ng-binding"][contains(., "Box")]
However if you will not know what the text will be then you may need to approach this differently by getting the web element text values and cycling through them in a loop and checking the text of each one matches "House"
It would look something like this:
Check Element Contains Text 'House'
    @{elements}=    Get WebElements    //*[@class="resultats mode_liste ng-scope"]/div[@ng-repeat="annonce in resultats.data.annonces "]//h2/a/strong[@class="ng-binding"]
    FOR    ${element}    IN    @{elements}
        ${text}=    Get Text    ${element}
        Should Be Equal As Strings    ${text}    House
    END
I'm unable to test it to be absolutely sure though.
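For comparison, here is roughly the same check written directly against the Python Selenium bindings that SeleniumLibrary wraps; only the XPath comes from the question, while the URL and driver set-up are placeholders:
# Rough Python equivalent of the Robot Framework loop above; everything except
# the XPath locator is placeholder scaffolding.
from selenium import webdriver
from selenium.webdriver.common.by import By

LOCATOR = ('//*[@class="resultats mode_liste ng-scope"]'
           '/div[@ng-repeat="annonce in resultats.data.annonces "]'
           '//h2/a/strong[@class="ng-binding"]')

driver = webdriver.Chrome()
driver.get("https://example.com/results")  # placeholder URL

elements = driver.find_elements(By.XPATH, LOCATOR)
assert elements, "page contains no result rows at all"
for element in elements:
    # Fail as soon as any row holds something other than 'House'.
    assert element.text.strip() == "House", f"unexpected text: {element.text!r}"
driver.quit()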

Unable to find xpath list trying to use wild card contains text or style

I am trying to find an XPath for this site, specifically the XPath under “Main Lists”. So far I have:
//div[starts-with(@class, ('sm-CouponLink_Label'))]
However this finds 32 matches…
`//div[starts-with(@class, ('sm-CouponLink_Label'))]`[contains(text(),'*')or[contains(Style(),'*')]
Unfortunately, in this case I want to use XPaths and not CSS.
It is for this site, my code is here, and here's an image of the XPath I am after.
I have also tried:
CSS: div:nth-child(1) > .sm-MarketContainer_NumColumns3 > div > div
Xpath equiv...: //div[1]//div[starts-with(@class, ('sm-MarketContainer_NumColumns3'))]//div//div
Though it does not appear to work.
UPDATED
WORKING CSS: div.sm-Market:has(div >div:contains('Main Lists')) * > .sm-CouponLink_Label
Xpath: //div[Contains(@class, ('sm-Market'))]//preceding::('Main Lists')//div[Contains(@class, ('sm-CouponLink_Label'))]
Not working as of yet..
Though I am unsure whether Selenium has an equivalent for :has
Alternatively...
Something like:
//div[contains(text(),"Main Lists")]//following::div[contains(@class,"sm-Market")]//div[contains(@class,"sm-CouponLink_Label")]//preceding::div[contains(@class,"sm-Market_HeaderOpen ")]
(wrong area)
You can get all the required elements with the piece of code below:
league_names = [league for league in driver.find_elements_by_xpath('//div[normalize-space(@class)="sm-Market" and .//div="Main Lists"]//div[normalize-space(@class)="sm-CouponLink_Label"]') if league.text]
This should return you a list of only the non-empty nodes.
If I understand this correctly, you want to narrow down the result of your first XPath further, to return only the div elements that have inner text or a style attribute. In this case you can use the following XPath:
//div[starts-with(@class, ('sm-CouponLink_Label'))][@style or text()]
UPDATE
As you clarified further, you want to get the div elements with class 'sm-CouponLink_Label' that reside in the 'Main Lists' section. For this purpose, you should try to incorporate 'Main Lists' into the XPath somehow. This is one possible way (formatted for readability):
//div[
div/div/text()='Main Lists'
]//div[
starts-with(@class, 'sm-CouponLink_Label')
and
normalize-space()
]
Notice how normalize-space() is used to filter out the empty div elements from the result. This should return 5 elements, as expected, when tested in Chrome.
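If you want to plug that XPath into the Selenium Python bindings used in the other answer, a sketch could look like this (again, the driver set-up and URL are placeholders):
# Sketch only: the XPath below is the one above, collapsed onto fewer lines.
from selenium import webdriver
from selenium.webdriver.common.by import By

XPATH = ('//div[div/div/text()="Main Lists"]'
         '//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]')

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder: the page with the "Main Lists" section

labels = driver.find_elements(By.XPATH, XPATH)
print([label.text for label in labels])  # expected: the five non-empty labels
driver.quit()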

Scrape all child paragraphs under heading (preferably rvest)

My objective is to use the library(tm) toolkit on a pretty big Word document. The Word document has sensible typography, so we have h1 for the main sections and some h2 and h3 subheadings. I want to compare and text-mine each section (the text below each h1; the subheadings are of little importance, so they can be included or excluded).
My strategy is to export the Word document to HTML and then use the rvest package to extract the paragraphs.
library(rvest)
# the file has latin-1 chars
#Sys.setlocale(category="LC_ALL", locale="da_DK.UTF-8")
# small example html file
file <- rvest::html("https://83ae1009d5b31624828197160f04b932625a6af5.googledrive.com/host/0B9YtZi1ZH4VlaVVCTGlwV3ZqcWM/tidy.html", encoding = 'utf-8')
nodes <- file %>%
rvest::html_nodes("h1>p") %>%
rvest::html_text()
I can extract all the <p> with html_nodes("p"), but that's just one big soup. I need to analyze each h1 separately.
The best would probably be a list, with a vector of p tags for each h1 heading. And maybe a loop with something like for (i in 1:length(html_nodes(fil, "h1"))) (html_children(html_nodes(fil, "h1")[i])) (which is not working).
Bonus if there is a way to tidy Word's HTML from within rvest.
Note that > is the child combinator; the selector that you currently have looks for p elements that are children of an h1, which doesn't make sense in HTML and so returns nothing.
If you inspect the generated markup, at least in the example document that you've provided, you'll notice that every h1 element (as well as the heading for the table of contents, which is marked up as a p instead) has an associated parent div:
<body lang="EN-US">
<div class="WordSection1">
<p class="MsoTocHeading"><span lang="DA" class='c1'>Indholdsfortegnelse</span></p>
...
</div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
<div class="WordSection2">
<h1><a name="_Toc285441761"><span lang="DA">Interview med Jakob skoleleder på
a_skolen</span></a></h1>
...
</div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
<div class="WordSection3">
<h1><a name="_Toc285441762"><span lang="DA">Interviewet med Andreas skoleleder på
b_skolen</span></a></h1>
...
</div>
</body>
All of the p elements in each section denoted by an h1 are found in its respective parent div. With this in mind, you could simply select p elements that are siblings of each h1. However, since rvest doesn't currently have a way to select siblings from a context node (html_nodes() only supports looking at a node's subtree, i.e. its descendants), you will need to do this another way.
Assuming HTML Tidy creates a structure where every h1 is in a div that is directly within body, you can grab every div except the table of contents using the following selector:
sections <- html_nodes(file, "body > div ~ div")
In your example document, this should result in div.WordSection2 and div.WordSection3. The table of contents is represented by div.WordSection1, and that is excluded from the selection.
Then extract the paragraphs from each div:
for (section in sections) {
paras <- html_nodes(section, "p")
# Do stuff with paragraphs in each section...
print(length(paras))
}
# [1] 9
# [1] 8
As you can see, length(paras) corresponds to the number of p elements in each div. Note that some of them contain nothing but an &nbsp;, which may be troublesome depending on your needs. I'll leave dealing with those outliers as an exercise for the reader.
Unfortunately, no bonus points for me as rvest does not provide its own HTML Tidy functionality. You will need to process your Word documents separately.
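Purely as an illustration of the grouping asked for in the question (each heading mapped to a vector of its paragraphs), here is the same section logic sketched in Python with BeautifulSoup instead of rvest, run on a cut-down copy of the structure shown above:
# Group the <p> elements of each "WordSection" div under the text of its <h1>.
# The HTML string is a trimmed version of the Tidy output quoted earlier.
from bs4 import BeautifulSoup

html = """
<body>
  <div class="WordSection1"><p class="MsoTocHeading">Indholdsfortegnelse</p></div>
  <div class="WordSection2"><h1>Interview med Jakob</h1><p>para 1</p><p>para 2</p></div>
  <div class="WordSection3"><h1>Interviewet med Andreas</h1><p>para 3</p></div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
sections = {}
# Skip the table-of-contents div by keeping only divs that contain an <h1>.
for div in soup.select("body > div"):
    heading = div.find("h1")
    if heading is None:
        continue
    sections[heading.get_text(strip=True)] = [p.get_text() for p in div.find_all("p")]

print(sections)
# {'Interview med Jakob': ['para 1', 'para 2'],
#  'Interviewet med Andreas': ['para 3']}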

Please what is the difference between [attribute~=value] and [attribute*=value]

I cannot find the difference between these two selectors. Both seem to do the same thing, i.e. select tags based on a specific attribute value containing a given string.
For [attribute~=value] : http://www.w3schools.com/cssref/sel_attribute_value_contains.asp
For [attribute*=value] : http://www.w3schools.com/cssref/sel_attr_contain.asp
The first one ([attribute~=value]) is a whitespace-separated search...
<!-- Would match -->
<div class="value another"></div>
...and the second ([attribute*=value]) is a substring search...
<!-- Would match -->
<div class="a_value"></div>
W3Schools doesn't appear to make this distinction very clear. Use a better resource.
[attribute~="value"] selects elements that contain a given word delimited by spaces while [attribute*="value"] selects elements that contain the given substring.
For example, [data-test~="value"] would not match on the below div while [data-test*="value"] would.
<div data-test="my values go here"></div>
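A quick way to see the difference is to run both selectors over the same markup, for example with Python's BeautifulSoup (its soupsieve backend follows the standard CSS attribute-selector matching rules):
# [attr~=value] matches whole space-separated words; [attr*=value] matches any
# substring. The ids just make the output easy to read.
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="a" data-test="my values go here"></div>'
    '<div id="b" data-test="value here"></div>',
    "html.parser",
)

print([d["id"] for d in soup.select('[data-test~="value"]')])  # ['b']  ("values" is not the word "value")
print([d["id"] for d in soup.select('[data-test*="value"]')])  # ['a', 'b']  (substring match)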
