Scrape all child paragraphs under heading (preferable rvest)

My objective is to use the library(tm) toolkit on a fairly large Word document. The document has sensible typography, so there is an h1 for each main section, plus some h2 and h3 subheadings. I want to compare and text-mine each section (the text below each h1; the subheadings are of little importance, so they can be included or excluded).
My strategy is to export the Word document to HTML and then use the rvest package to extract the paragraphs.
library(rvest)
# the file has latin-1 chars
#Sys.setlocale(category="LC_ALL", locale="da_DK.UTF-8")
# small example html file
file <- rvest::html("https://83ae1009d5b31624828197160f04b932625a6af5.googledrive.com/host/0B9YtZi1ZH4VlaVVCTGlwV3ZqcWM/tidy.html", encoding = 'utf-8')
nodes <- file %>%
rvest::html_nodes("h1>p") %>%
rvest::html_text()
I can extract all the <p> elements with html_nodes("p"), but that's just one big soup. I need to analyze each h1 separately.
The best would probably be a list, with a vector of p tags for each h1 heading. And maybe a loop with something like for (i in 1:length(html_nodes(file, "h1"))) (html_children(html_nodes(file, "h1")[i])) (which is not working).
Bonus if there is a way to tidy Word's HTML from within rvest.

Note that > is the child combinator; the selector that you currently have looks for p elements that are children of an h1, which doesn't make sense in HTML and so returns nothing.
If you inspect the generated markup, at least in the example document that you've provided, you'll notice that every h1 element (as well as the heading for the table of contents, which is marked up as a p instead) has an associated parent div:
<body lang="EN-US">
<div class="WordSection1">
<p class="MsoTocHeading"><span lang="DA" class='c1'>Indholdsfortegnelse</span></p>
...
</div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
<div class="WordSection2">
<h1><a name="_Toc285441761"><span lang="DA">Interview med Jakob skoleleder på
a_skolen</span></a></h1>
...
</div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
<div class="WordSection3">
<h1><a name="_Toc285441762"><span lang="DA">Interviewet med Andreas skoleleder på
b_skolen</span></a></h1>
...
</div>
</body>
All of the p elements in each section denoted by an h1 are found in its respective parent div. With this in mind, you could simply select p elements that are siblings of each h1. However, since rvest doesn't currently have a way to select siblings from a context node (html_nodes() only supports looking at a node's subtree, i.e. its descendants), you will need to do this another way.
Assuming HTML Tidy creates a structure where every h1 is in a div that is directly within body, you can grab every div except the table of contents using the following selector:
sections <- html_nodes(file, "body > div ~ div")
In your example document, this should result in div.WordSection2 and div.WordSection3. The table of contents is represented by div.WordSection1, and that is excluded from the selection.
Then extract the paragraphs from each div:
for (section in sections) {
paras <- html_nodes(section, "p")
# Do stuff with paragraphs in each section...
print(length(paras))
}
# [1] 9
# [1] 8
As you can see, length(paras) corresponds to the number of p elements in each div. Note that some of them contain nothing but an &nbsp;, which may be troublesome depending on your needs. I'll leave dealing with those outliers as an exercise for the reader.
Unfortunately, no bonus points for me as rvest does not provide its own HTML Tidy functionality. You will need to process your Word documents separately.
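The grouping described above (one list of paragraph texts per section div, skipping the table of contents) can also be sketched outside R. The following is a minimal Python illustration using only the standard library; the DOC string is a hypothetical, well-formed stand-in for the tidied Word export, and paragraphs_by_section is an illustrative name, not part of rvest or any other library. It assumes, as the answer does, that every h1 sits in its own div directly under body.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the tidied Word export.
DOC = """<html><body>
<div class="WordSection1"><p class="MsoTocHeading">Indholdsfortegnelse</p></div>
<div class="WordSection2">
  <h1>Interview med Jakob</h1>
  <p>First answer.</p>
  <p>&#160;</p>
  <p>Second <b>answer</b>.</p>
</div>
<div class="WordSection3">
  <h1>Interviewet med Andreas</h1>
  <p>Only answer.</p>
</div>
</body></html>"""


def paragraphs_by_section(markup):
    """Map each h1 heading to the list of non-empty paragraph texts in its div."""
    body = ET.fromstring(markup).find("body")
    sections = {}
    # Skip the first div (table of contents), mirroring "body > div ~ div".
    for div in body.findall("div")[1:]:
        # Assumes every remaining div contains an h1.
        heading = "".join(div.find("h1").itertext()).strip()
        texts = ("".join(p.itertext()).strip() for p in div.iter("p"))
        # strip() also removes non-breaking spaces, so "&nbsp;"-only
        # paragraphs end up empty and are dropped here.
        sections[heading] = [t for t in texts if t]
    return sections


sections = paragraphs_by_section(DOC)
for heading, paras in sections.items():
    print(heading, len(paras))
```

Each value in the resulting dictionary can then be fed to a text-mining step on its own, which is the per-h1 structure the question asks for.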

Related

Simplifying long CSS selectors

I have the following CSS selector:
#AllContextMenus :not(.menu-iconic-left):not(.menu-accel):not(.menu-accel-left):not(.menu-accel-container):not(.menu-accel-container-left):not(.menu-iconic-accel):not(.menu-right)::before
For readability purposes, I like to keep all code lines under 100 characters.
Is there any way to simplify, optimize, or write this CSS selector without changing what it matches and without reducing performance?
For example, is there any type of "and" operator that can be used within :not()?

You generally can't simplify a selector without changing the semantics of what it matches.
But you can break a selector up into multiple lines at many points to meet maximum line length requirements. Just use a comment and put the line break inside the comment. Like this:
#AllContextMenus :not(.menu-iconic-left)/*
*/:not(.menu-accel)/*
*/:not(.menu-accel-left)/*
*/:not(.menu-accel-container)/*
*/:not(.menu-accel-container-left)/*
*/:not(.menu-iconic-accel)/*
*/:not(.menu-right)::before
#AllContextMenus :not(.menu-iconic-left)/*
*/:not(.menu-accel)/*
*/:not(.menu-accel-left)/*
*/:not(.menu-accel-container)/*
*/:not(.menu-accel-container-left)/*
*/:not(.menu-iconic-accel)/*
*/:not(.menu-right)::before {
color:red;
content:'TEST '
}
<section id="AllContextMenus">
<div class="a">A</div>
<div class="menu-iconic-accel">menu-iconic-accel</div>
</section>
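As a rough check that the wrapped form is the same selector, the sketch below strips the comments with a regular expression and compares the result with the original one-line selector. This is only a simplified model of how a CSS parser discards comments (it ignores comments inside strings, for instance), and strip_css_comments is a hypothetical helper name.

```python
import re

# The original one-line selector from the question.
SINGLE_LINE = (
    "#AllContextMenus :not(.menu-iconic-left):not(.menu-accel)"
    ":not(.menu-accel-left):not(.menu-accel-container)"
    ":not(.menu-accel-container-left):not(.menu-iconic-accel)"
    ":not(.menu-right)::before"
)

# The wrapped form, with every line break hidden inside a comment.
WRAPPED = """#AllContextMenus :not(.menu-iconic-left)/*
*/:not(.menu-accel)/*
*/:not(.menu-accel-left)/*
*/:not(.menu-accel-container)/*
*/:not(.menu-accel-container-left)/*
*/:not(.menu-iconic-accel)/*
*/:not(.menu-right)::before"""


def strip_css_comments(css):
    """Remove /* ... */ comments, including ones that span line breaks."""
    return re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)


same_selector = strip_css_comments(WRAPPED) == SINGLE_LINE
longest_line = max(len(line) for line in WRAPPED.splitlines())
print(same_selector, longest_line)
```

Because the comment swallows the newline, no whitespace is left between the :not() parts, so no descendant combinator is introduced.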

Forming a sequence of css selectors as argument to pup to get particular value from javadoc html

It isn't often that I attempt to implement something that attempts to integrate three different languages (four, if you count bash), sort of.
I want to write a little tool that scans the HTML files in the Java JDK javadoc package, focusing on blocks like the following:
<dl>
<dt><span class="simpleTagLabel">Since:</span></dt>
<dd>1.8</dd>
</dl>
I want to get the "1.8" value out of this.
So, I figured I would find a command-line tool that can parse HTML and figure out how to extract this.
I found the "pup" tool (which is written in "go"), and it seems to be close, but I now have to figure out the correct sequence of CSS selectors to get to this element. I've tried several variations, but nothing that really is doing what I need.
Update:
The answer from Sølve Tornøe comes close, and in fact I can implement somewhat of a kludge to get the data I want.
If I just use 'dl dt + dd', it gives me a lot of elements that match that pattern. Ideally, I wish I could do something like 'dl dt (> span[class="simpleTagLabel"]) + dd', where the "> span ..." thing is used for matching, but having it "pop back up" after matching the span, so it can look for peers of "dt". I imagine there's no way to do this in CSS.
My big kludge workaround is to assume that all of my real candidate elements have the text "1." in them. With that big assumption, I can use 'dl dt + dd:contains("1.")'. This at least works with the data I'm working with.

You can combine the child combinator (>) and the adjacent sibling combinator (+) with the element selectors (dl, dt, dd) as follows:
dl > dt + dd
This translates to: give me the dd element that is an adjacent sibling of a dt that is itself a child of a dl.
console.log(document.querySelector('dl > dt + dd').innerText)
dl > dt + dd {
color: salmon;
}
<dl>
<dt><span class="simpleTagLabel">Since:</span></dt>
<dd>1.8</dd>
</dl>

If you're willing to use XPath instead of css selectors, you can easily step up through parent nodes of matched elements. This can be done with the perl XML::XPath command line tool, or xmllint:
$ xpath -q -e "//dt/span[contains(@class,'simpleTagLabel')]/../../dd/text()" < test.html
1.8
$ xmllint --xpath "//dt/span[contains(@class,'simpleTagLabel')]/../../dd/text()" test.html
1.8
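The same "match the span, then step back up" idea can be tried without extra tools. This is a minimal sketch using Python's standard xml.etree.ElementTree, which supports only a subset of XPath: it has no contains(), so the predicate below uses an exact attribute match instead. The PAGE markup is a hypothetical, well-formed sample; real javadoc HTML may need an HTML-tolerant parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample with one matching and one non-matching block.
PAGE = """<html><body>
<dl>
  <dt><span class="simpleTagLabel">Since:</span></dt>
  <dd>1.8</dd>
</dl>
<dl>
  <dt><span class="paramLabel">Parameters:</span></dt>
  <dd>ignored</dd>
</dl>
</body></html>"""

root = ET.fromstring(PAGE)

# Match the labelled span, climb two levels (span -> dt -> dl),
# then descend to the dd children of that dl.
path = ".//dt/span[@class='simpleTagLabel']/../../dd"
values = [dd.text for dd in root.findall(path)]
print(values)
```

Like the XPath above, this returns every dd in the matched dl, not only the one adjacent to the dt.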

//parent::* in XPath?

Consider this simple example
library(xml2)
x <- read_xml("<body>
<p>Some <b>text</b>.</p>
<p>Some <b>other</b> <b>text</b>.</p>
<p>No bold here!</p>
</body>")
Now, I want to find all the parents of the nodes containing the string other
To do so, I run
> xml_find_all(x, "//b[contains(.,'other')]//parent::*")
{xml_nodeset (2)}
[1] <p>Some <b>other</b> <b>text</b>.</p>
[2] <b>other</b>
I do not understand why I also get the <b>other</b> element as well. In my view there is only one parent, which is the first node.
Is this a bug?

Change
//b[contains(.,'other')]//parent::*
which selects descendant-or-self (and you don't want self) and parent, to
//b[contains(.,'other')]/parent::*
which selects purely along parent, to eliminate <b>other</b> from the selection.
Or, better yet, use this XPath:
//p[b[contains(.,'other')]]
if you want to select all p elements with a b child whose string-value contains an "other" substring, or
//p[b = 'other']
if b's string-value is supposed to equal other. See also What does contains() do in XPath?
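For comparison, here is a small Python sketch of the two working forms using the standard xml.etree.ElementTree module. Its XPath subset has no contains(), so both paths use the exact-match variant; the [.='text'] predicate requires Python 3.7 or later. The render helper is only for display.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring("""<body>
<p>Some <b>text</b>.</p>
<p>Some <b>other</b> <b>text</b>.</p>
<p>No bold here!</p>
</body>""")


def render(elem):
    """Serialize an element without its trailing whitespace."""
    return ET.tostring(elem, encoding="unicode").strip()


# Step along the parent axis only: the b element itself is not returned.
via_parent = body.findall(".//b[.='other']/..")

# Select the p directly by testing its b child.
via_predicate = body.findall(".//p[b='other']")

print([render(p) for p in via_predicate])
```

Both paths return the single p element that has the matching b child.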

Unable to find xpath list trying to use wild card contains text or style

I am trying to find an XPath for the elements under “Main Lists” on this site. I have so far:
//div[starts-with(@class, ('sm-CouponLink_Label'))]
However, this finds 32 matches…
//div[starts-with(@class, ('sm-CouponLink_Label'))][contains(text(),'*')or[contains(Style(),'*')]
Unfortunately in this case I am wanting to use XPaths and not CSS.
It is for this site, my code is here and here's an image of XPATH I am after.
I have also tried:
CSS: div:nth-child(1) > .sm-MarketContainer_NumColumns3 > div > div
Xpath equiv...: //div[1]//div[starts-with(@class, ('sm-MarketContainer_NumColumns3'))]//div//div
Though it does not appear to work.
UPDATED
WORKING CSS: div.sm-Market:has(div >div:contains('Main Lists')) * > .sm-CouponLink_Label
Xpath: //div[Contains(@class, ('sm-Market'))]//preceding::('Main Lists')//div[Contains(@class, ('sm-CouponLink_Label'))]
Not working as of yet.
Though I am unsure whether Selenium has an equivalent for :has.
Alternatively...
Something like:
//div[contains(text(),"Main Lists")]//following::div[contains(@class,"sm-Market")]//div[contains(@class,"sm-CouponLink_Label")]//preceding::div[contains(@class,"sm-Market_HeaderOpen ")]
(wrong area)

You can get all the required elements with the following piece of code:
league_names = [league for league in driver.find_elements_by_xpath('//div[normalize-space(@class)="sm-Market" and .//div="Main Lists"]//div[normalize-space(@class)="sm-CouponLink_Label"]') if league.text]
This should return a list containing only the non-empty nodes.

If I understand this correctly, you want to narrow the result of your first XPath down to only those div elements that have inner text or a style attribute. In this case you can use the following XPath:
//div[starts-with(@class, ('sm-CouponLink_Label'))][@style or text()]
UPDATE
As you clarified further, you want to get div with class 'sm-CouponLink_Label' that resides in the 'Main Lists' section. For this purpose, you should try to incorporate the 'Main Lists' in the XPath somehow. This is one possible way (formatted for readability) :
//div[
div/div/text()='Main Lists'
]//div[
starts-with(@class, 'sm-CouponLink_Label')
and
normalize-space()
]
Notice how normalize-space() is used to filter out empty div elements from the result. This should return 5 elements, as expected.
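The logic of that expression can be spelled out procedurally. The sketch below is a Python illustration on a hypothetical fragment: the class names come from the question, but the nesting is an assumption, not the real page's markup. It uses the standard xml.etree.ElementTree module, whose XPath subset lacks starts-with() and normalize-space(), so those two tests are done in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the class names are taken from the question.
PAGE = """<div id="page">
<div class="sm-Market">
  <div class="sm-Market_HeaderOpen "><div>Main Lists</div></div>
  <div class="sm-CouponLink_Label ">England</div>
  <div class="sm-CouponLink_Label "> </div>
  <div class="sm-CouponLink_Label ">Spain</div>
</div>
<div class="sm-Market">
  <div class="sm-Market_HeaderOpen "><div>Other</div></div>
  <div class="sm-CouponLink_Label ">Elsewhere</div>
</div>
</div>"""


def main_list_labels(root):
    """Collect non-empty label texts from the section headed 'Main Lists'."""
    labels = []
    for section in root.iter("div"):
        # Equivalent of div[div/div/text()='Main Lists'].
        if not any(d.text == "Main Lists" for d in section.findall("div/div")):
            continue
        for div in section.iter("div"):
            # Equivalent of starts-with(@class, 'sm-CouponLink_Label').
            is_label = div.get("class", "").startswith("sm-CouponLink_Label")
            # Stand-in for the normalize-space() test: drop blank labels.
            text = "".join(div.itertext()).strip()
            if is_label and text:
                labels.append(text)
    return labels


labels = main_list_labels(ET.fromstring(PAGE))
print(labels)
```

The whitespace-only label and the label in the other section are both filtered out, which is the narrowing the XPath is meant to achieve.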

How to convert complex xpath to css

I have a complex HTML structure and am new to CSS. I want to convert my XPath to CSS, as XPath may have a performance impact in IE.
XPath from Firebug: .//*[@id='T_I:3']/span/a
I fine-tuned it to: //div[@id='Overview']/descendant::*[@id='T_I:3']/span/a
Now I need the corresponding CSS. Is that possible?

First of all, I don't think your "finetuning" did the best possible job. An element id should be unique in the document and is therefore usually cached by modern browsers (which means that id lookup is instant). You can help the XPath engine by using the id() function.
Therefore, the XPath expression would be: id('T_I:3')/span/a (yes, that's a valid XPath 1.0 expression).
Anyway, to convert this to CSS, you'd use: #T_I:3 > span > a
Your "finetuned" expression converted would be: div#Overview #T_I:3 > span > a, but seriously, you only need one id selection.
The hashtag # is an id selector.
The space () is a descendant combinator.
The > sign is a child combinator.
EDIT based on a good comment by Fréderic Hamidi:
I don't think #T_I:3 is valid (the colon would be confused with the
start of a pseudo-class). You would have to find a way to escape it.
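The underscore can be escaped in the same way, although this is only needed for very old parsers: CSS 2.1 and later allow an unescaped underscore in identifiers. For the escaping techniques, see this SO question: Handling a colon in an element ID in a CSS selector.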
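The final CSS selector would be:
#T\5FI\3A 3 > span > a
The space after \3A terminates the hex escape; without it, \3A3 would be read as the single code point U+03A3 rather than a colon followed by 3.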
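Hex escapes in CSS take up to six hexadecimal digits and swallow one optional whitespace character after them, so an escape that is followed by a hex digit needs that terminating space. The sketch below illustrates this in Python with two hypothetical helpers, css_escape_ident and css_unescape. Both are deliberately simplified (they do not cover every rule of the browser's CSS.escape(), such as a leading hyphen followed by a digit) and are not part of any library.

```python
import re


def css_escape_ident(ident):
    """Hex-escape characters that are not safe in a CSS identifier.

    Simplified: keeps ASCII letters, '-', '_', non-leading digits and
    non-ASCII characters; everything else becomes a hex escape followed
    by a terminating space.
    """
    out = []
    for i, ch in enumerate(ident):
        safe_ascii = ch.isascii() and (
            ch.isalpha() or ch in "-_" or (ch.isdigit() and i > 0)
        )
        if safe_ascii or ord(ch) >= 0x80:
            out.append(ch)
        else:
            out.append("\\{:X} ".format(ord(ch)))
    return "".join(out)


def css_unescape(text):
    """Decode hex escapes: 1-6 hex digits plus one optional whitespace."""
    return re.sub(
        r"\\([0-9A-Fa-f]{1,6})\s?",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


escaped = css_escape_ident("T_I:3")
print("#" + escaped + " > span > a")

# Without the space, the digit 3 is absorbed into the escape.
print(css_unescape(r"T\5FI\3A3"))
print(css_unescape(r"T\5FI\3A 3"))
```

The second-to-last line decodes to an id ending in a Greek capital sigma, while the last one round-trips to T_I:3.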
