I have this HTML:
<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
</div>
I am using this code to extract text:
html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")
And getting the result:
"(22) where?"
But I need only the text that belongs to "p" itself, excluding text inside its child nodes. I want to get this text:
"where?"
Is there any way to exclude child nodes when getting the text?
macOS 10.11.6 (15G31), RStudio 0.99.903, R 3.3.1 (2016-06-21)
If you are sure the text you want always comes last, you can use:
doc %>% html_nodes(xpath=".//p/text()[last()]") %>% xml_text(trim = TRUE)
Alternatively, you can use the following to select all "non-empty" strings:
doc %>% html_nodes(xpath=".//p/text()[normalize-space()]") %>% xml_text(trim = TRUE)
For more details on normalize-space() see https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space
A third option would be to use the xml2 package directly via:
doc %>% xml2::xml_find_chr(xpath="normalize-space(.//p/text())")
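As a quick check, the options can be exercised on a small self-contained example. The markup below is hypothetical: it puts "(22)" inside a child element of the `<p>`, which is the situation the question describes.

```r
library(xml2)
library(rvest)

# hypothetical markup: "(22)" sits in a child node of <p>
doc <- read_html("<div class='activityBody postBody thing'><p><a>(22)</a>\nwhere?</p></div>")

# option 1: the last text() node of each <p>
last_txt <- doc %>% html_nodes(xpath = ".//p/text()[last()]") %>% xml_text(trim = TRUE)

# option 3: normalize-space() collapses whitespace and returns a single string
norm_txt <- doc %>% xml_find_chr(xpath = "normalize-space(.//p/text())")
```

Both give "where?" here because the only text node that is a direct child of `<p>` is the one after the `<a>` element.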
This will grab all the text from <p> children (which means it won't include text from sub-nodes that aren't "text emitters"):
library(xml2)
library(rvest)
library(purrr)
txt <- '<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
<p>
stays
<b>disappears</b>
<a>disappears</a>
<span>disappears</span>
stays
</p>
</div>'
doc <- read_xml(txt)
html_nodes(doc, xpath="//p") %>%
map_chr(~paste0(html_text(html_nodes(., xpath="./text()"), trim=TRUE), collapse=" "))
## [1] "where?" "stays stays"
Unfortunately, that's pretty "lossy" (you lose the <b>, <span>, etc. content), but this or @Floo0's (also potentially lossy) solution may work sufficiently for you.
If you use the XML package you can actually edit nodes (i.e. delete node elements).
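For completeness, here is a rough sketch of that node-editing approach with the XML package: drop the element children of each <p> and keep only its direct text. The markup is a hypothetical stand-in for the question's page.

```r
library(XML)

# hypothetical markup: "(22)" inside a child element of <p>
doc <- htmlParse("<div><p><a>(22)</a> where?</p></div>")

# delete every element child of each <p>, leaving only direct text nodes
for (p in getNodeSet(doc, "//p")) {
  for (kid in xmlChildren(p)) {
    if (inherits(kid, "XMLInternalElementNode")) removeNodes(kid)
  }
}

own_text <- trimws(sapply(getNodeSet(doc, "//p"), xmlValue))
```

Note this mutates the parsed document in place, so re-parse if you still need the original.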
I performed text mining on files that I am preparing for publication. There are several XML files that contain text within segments (see the basic example below). Due to copyright restrictions, I have to make sure that the files I publish do not contain the whole text, while someone who already has the texts should be able to 'reconstruct' the files. To make sure that one can still perform basic text mining (e.g. counting lengths), the segment length should not change. Therefore I am looking for a way to replace every word except the first and last one in all segments with dummy/placeholder text.
Basic example:
Input:
<text>
<div>
<seg xml:id="A">Lorem ipsum dolor sit amet</seg>
<seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg>
</div>
</text>
Output:
<text>
<div>
<seg xml:id="A">Lorem blank blank blank amet</seg>
<seg xml:id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>
You can use rapply() to recursively replace values in a nested list. Suppose data.xml contains your input:
library(tidyverse)
library(xml2)
read_xml("data.xml") %>%
as_list() %>%
rapply(how = "replace", function(x) {
tokens <-
x %>%
str_split(" ") %>%
simplify()
n_tokens <- length(tokens)
c(
tokens[[1]],
rep("blank", n_tokens - 2),
tokens[[n_tokens]]
) %>%
paste0(collapse = " ")
}) %>%
as_xml_document() %>%
write_xml("data2.xml")
Output file data2.xml:
<?xml version="1.0" encoding="UTF-8"?>
<text>
<div>
<seg id="A">Lorem blank blank blank amet</seg>
<seg id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>
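One caveat: the rapply() snippet assumes every segment has at least two words, since rep("blank", n_tokens - 2) errors when a segment contains a single token. A guarded variant of the masking step (mask_segment is a hypothetical helper name) could look like:

```r
# hypothetical guard: leave one- and two-word segments unchanged,
# mask everything between the first and last word otherwise
mask_segment <- function(x) {
  tokens <- strsplit(x, " ")[[1]]
  n <- length(tokens)
  if (n <= 2) return(x)
  paste(c(tokens[1], rep("blank", n - 2), tokens[n]), collapse = " ")
}
```

You could drop this in as the function passed to rapply() in place of the anonymous function above.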
I am trying to remove some paragraphs from an XML document in R, but I want to keep the XML structure/class. Here's some example text and my failed attempts:
library(xml2)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
xml_find_all(text, './/caption//p') %>% xml_remove() # deletes text
xml_find_all(text, './/caption//p') %>% xml_text() # removes paragraphs but also XML structure
Here's what I would like to end up with (just the paragraphs in the caption removed):
ideal_text = read_xml("<paper> <caption>The main title A sub title</caption> <p>The opening paragraph.</p> </paper>")
ideal_text
It looks like this requires multiple steps: find the node, copy the text, remove the contents of the node, and then update.
library(xml2)
library(magrittr)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
# find the caption
caption <- xml_find_all(text, './/caption')
#store existing text
replacement <- caption %>% xml_find_all('.//p') %>% xml_text() %>% paste(collapse = " ")
#remove the desired text
caption %>% xml_find_all('.//p') %>% xml_remove()
#replace the caption
xml_text(caption) <- replacement
text #test
{xml_document}
<paper>
[1] <caption>The main title A sub title</caption>
[2] <p>The opening paragraph.</p>
Most likely you will need to obtain the vector/list of caption nodes and then step through them one-by-one with a loop.
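Such a loop might look like the following sketch (the two-caption document is hypothetical):

```r
library(xml2)

# hypothetical document with two captions
paper <- read_xml("<paper><caption><p>One</p><p>Two</p></caption><caption><p>Three</p></caption></paper>")

# for each caption: collect its paragraph text, drop the <p> nodes,
# then write the collected text back into the caption
for (cap in xml_find_all(paper, ".//caption")) {
  ps <- xml_find_all(cap, ".//p")
  replacement <- paste(xml_text(ps), collapse = " ")
  xml_remove(ps)
  xml_text(cap) <- replacement
}

caps_text <- xml_text(xml_find_all(paper, ".//caption"))
```

Each caption keeps its position in the tree; only its <p> children are flattened into plain text.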
I am creating a table with a column of hyperlinks, but those hyperlinks are very long, and I want to replace the long text with an image, so that clicking the image opens the link in a new tab.
For example, with this code
df = iris[c(1,51,101),]
df$hyperlink = c("https://en.wikipedia.org/wiki/Iris_setosa", "https://en.wikipedia.org/wiki/Iris_versicolor", "https://en.wikipedia.org/wiki/Iris_virginica")
kable(df,format = "html")%>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)
I obtain the last column as hyperlinks, but what I would like is to put an image that, when clicked, opens the URL (preferably in a new window or tab).
You add clickable images by adding the appropriate HTML tags: <a href='...'></a> for hyperlinks and <img src='...'/> for images. Simply place the image tag between the opening and closing anchor tags. Also, be sure to include escape = FALSE in the kable() call to make it work.
library(kableExtra)
library(dplyr)
df = iris[c(1,51,101),]
df$hyperlink = c("<a href='https://en.wikipedia.org/wiki/Iris_setosa'><img src='setosa.png' /></a>",
"<a href='https://en.wikipedia.org/wiki/Iris_versicolor'><img src='versicolor.png' /></a>",
"<a href='https://en.wikipedia.org/wiki/Iris_virginica'><img src='virginica.png' /></a>")
kable(df,escape=FALSE,format = "html")%>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)
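If you generate many such cells, a small helper keeps the tag-building in one place; this one also adds target='_blank' so the link opens in a new tab, which the question asked for. The img_link name and the width default are hypothetical choices.

```r
# hypothetical helper: build a clickable image cell for a kable table
img_link <- function(url, img, width = 50) {
  sprintf("<a href='%s' target='_blank'><img src='%s' width='%d'/></a>",
          url, img, width)
}

cell <- img_link("https://en.wikipedia.org/wiki/Iris_setosa", "setosa.png")
```

Remember the resulting strings still need escape = FALSE in kable() to be rendered as HTML rather than shown literally.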
I'm scraping all the text from a website that occurs in a specific class of div. In the following example, I want to extract everything that's in a div of class "a".
site <- "<div class='a'>Hello, world</div>
<div class='b'>Good morning, world</div>
<div class='a'>Good afternoon, world</div>"
My desired output is...
"Hello, world"
"Good afternoon, world"
The code below extracts the text from every div, but I can't figure out how to include only class="a".
library(tidyverse)
library(rvest)
site %>%
read_html() %>%
html_nodes("div") %>%
html_text()
# [1] "Hello, world" "Good morning, world" "Good afternoon, world"
With Python's BeautifulSoup, it would look something like soup.find_all("div", class_="a").
The CSS selector for div with class = "a" is div.a:
site %>%
read_html() %>%
html_nodes("div.a") %>%
html_text()
Or you can use XPath:
html_nodes(xpath = "//div[@class='a']")
site %>%
read_html() %>%
html_nodes(xpath = '//*[@class="a"]') %>%
html_text()
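One caveat if you go the XPath route: @class='a' is an exact string match, so it misses elements with multiple classes (e.g. class='a b'), which the CSS selector div.a handles. The standard workaround is the contains/concat idiom (the two-div markup below is hypothetical):

```r
library(rvest)

# hypothetical markup: one multi-class div that should match, one that should not
site2 <- "<div class='a b'>multi</div><div class='ab'>other</div>"

multi <- read_html(site2) %>%
  html_nodes(xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]") %>%
  html_text()
```

Padding the class attribute with spaces prevents a false match on class='ab'.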
This bit of code:
library(tidyverse)
library(rvest)
url <- "http://www.imdb.com/title/tt4116284/"
director <- read_html(url) %>%
html_nodes(".summary_text+ .credit_summary_item .itemprop") %>%
html_text()
Will grab the plain text value "Chris McKay" (the director of the new LEGO Batman Movie). The underlying hyperlink href address, however, points to:
http://www.imdb.com/name/nm0003021?ref_=tt_ov_dr
I want that. How can I adjust my css selector to grab the underlying hyperlink href address?
Take the href attr of the parent a tag:
director <- read_html(url) %>%
html_nodes(".summary_text+ .credit_summary_item span a") %>%
html_attr('href')
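Since the live IMDb page may change, here is the same pattern on a self-contained snippet; the markup below is a hypothetical reconstruction of the relevant fragment, using the href from the question.

```r
library(rvest)

# hypothetical fragment of the IMDb credits block
page <- '<div class="credit_summary_item"><span class="itemprop"><a href="/name/nm0003021?ref_=tt_ov_dr">Chris McKay</a></span></div>'

href <- read_html(page) %>%
  html_nodes(".credit_summary_item span a") %>%
  html_attr("href")
```

html_text() on the same node set would still return "Chris McKay"; html_attr() is what switches the extraction from the element's text to one of its attributes.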