XML in R: remove paragraphs but keep the XML class

I am trying to remove some paragraphs from an XML document in R, but I want to keep the XML structure/class. Here's some example text and my failed attempts:
library(xml2)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
xml_find_all(text, './/caption//p') %>% xml_remove() # deletes text
xml_find_all(text, './/caption//p') %>% xml_text() # removes paragraphs but also XML structure
Here's what I would like to end up with (just the paragraphs in the caption removed):
ideal_text = read_xml("<paper> <caption>The main title A sub title</caption> <p>The opening paragraph.</p> </paper>")
ideal_text

It looks like this requires multiple steps: find the node, copy its text, remove the contents of the node, and then write the text back.
library(xml2)
library(magrittr)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
# find the caption
caption <- xml_find_all(text, './/caption')
# store the existing text
replacement <- caption %>% xml_find_all('.//p') %>% xml_text() %>% paste(collapse = " ")
# remove the paragraph nodes
caption %>% xml_find_all('.//p') %>% xml_remove()
# replace the caption's content with the stored text
xml_text(caption) <- replacement
text # test
{xml_document}
<paper>
[1] <caption>The main title A sub title</caption>
[2] <p>The opening paragraph.</p>
If the document contains more than one caption, you will most likely need to obtain the vector/list of caption nodes and then step through them one by one with a loop.
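A minimal sketch of that loop, assuming a made-up document with two captions (the second caption and the variable names are illustrative, not from the question). The same find, copy, remove, update steps are applied to each caption in turn:

```r
library(xml2)

# Hypothetical input: two captions, each holding one or more <p> elements
doc <- read_xml("<paper><caption><p>The main title</p><p>A sub title</p></caption><caption><p>Figure one</p></caption><p>The opening paragraph.</p></paper>")

captions <- xml_find_all(doc, ".//caption")
for (i in seq_along(captions)) {
  caption <- captions[[i]]
  # collect the paragraphs inside this caption only
  paragraphs <- xml_find_all(caption, ".//p")
  # keep their text, joined with a single space
  replacement <- paste(xml_text(paragraphs), collapse = " ")
  # drop the <p> nodes, then write the text back into the caption
  xml_remove(paragraphs)
  xml_text(caption) <- replacement
}

caption_text <- xml_text(xml_find_all(doc, ".//caption"))
caption_text
```

Because xml2 nodes are references into the document, the changes made inside the loop show up in `doc` itself; the paragraph outside the captions is left alone.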

Related

Replace Text in XML files with placeholder text

I performed text mining on files that I am preparing for publication right now. There are several XML files that contain text within segments (see basic example below). Due to copyright restrictions, I have to make sure that the files that I am going to publish do not contain the whole text while someone who has the texts should be able to 'reconstruct' the files. To make sure that one can still perform basic text mining (= count lengths), the segment length should not change. Therefore I am looking for a way to replace every word except for the first and last one in all segments with dummy / placeholder text.
Basic example:
Input:
<text>
<div>
<seg xml:id="A">Lorem ipsum dolor sit amet</seg>
<seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg>
</div>
</text>
Output:
<text>
<div>
<seg xml:id="A">Lorem blank blank blank amet</seg>
<seg xml:id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>
rapply can recursively replace values in a nested list.
Let data.xml be the file containing your input:
library(tidyverse)
library(xml2)
read_xml("data.xml") %>%
  as_list() %>%
  rapply(how = "replace", function(x) {
    tokens <-
      x %>%
      str_split(" ") %>%
      simplify()
    n_tokens <- length(tokens)
    # segments with at most two words have nothing to blank out
    if (n_tokens <= 2) return(x)
    c(
      tokens[[1]],
      rep("blank", n_tokens - 2),
      tokens[[n_tokens]]
    ) %>%
      paste0(collapse = " ")
  }) %>%
  as_xml_document() %>%
  write_xml("data2.xml")
Output file data2.xml:
<?xml version="1.0" encoding="UTF-8"?>
<text>
<div>
<seg id="A">Lorem blank blank blank amet</seg>
<seg id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>
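Note that the output above shows id rather than xml:id. As an alternative sketch, the segments can be edited in place with xml2 instead of going through as_list(); since the tree is not rebuilt, attributes should be left as they were. The helper name blank_middle and the inline input string are assumptions made for illustration:

```r
library(xml2)

# Hypothetical helper: keep the first and last word, replace the rest
blank_middle <- function(s, placeholder = "blank") {
  tokens <- strsplit(trimws(s), "\\s+")[[1]]
  n <- length(tokens)
  # nothing to replace in segments of one or two words
  if (n <= 2) return(s)
  paste(c(tokens[1], rep(placeholder, n - 2), tokens[n]), collapse = " ")
}

# Inline copy of the example input (a file could be read with read_xml("data.xml"))
doc <- read_xml('<text><div><seg xml:id="A">Lorem ipsum dolor sit amet</seg><seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg></div></text>')

segs <- xml_find_all(doc, ".//seg")
# assign one replacement string per <seg> node
xml_text(segs) <- vapply(xml_text(segs), blank_middle, character(1), USE.NAMES = FALSE)

seg_text <- xml_text(xml_find_all(doc, ".//seg"))
seg_text
```

The word count of every segment stays the same, so length-based text mining still works on the result.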

R - kableExtra create column with link

I am creating a table with a column with hyperlinks, but those hyperlinks are very long and I wanted to substitute the long text for an image, to click it and to open the link in a new tab.
For example, with this code
df = iris[c(1,51,101),]
df$hyperlink = c("https://en.wikipedia.org/wiki/Iris_setosa", "https://en.wikipedia.org/wiki/Iris_versicolor", "https://en.wikipedia.org/wiki/Iris_virginica")
kable(df,format = "html")%>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)
I get the last column as hyperlinks, but what I would like is an image that, when clicked, opens the URL (preferably in a new window or tab).
You add clickable images by adding the appropriate HTML tags: <a href='...'></a> is for hyperlinks, and <img src='...'> is for images. Simply place the image tag between the opening and closing <a> tags. Also, be sure to include escape=FALSE in the kable call so the HTML is rendered rather than escaped.
library(kableExtra)
library(dplyr)
df = iris[c(1,51,101),]
df$hyperlink = c("<a href='https://en.wikipedia.org/wiki/Iris_setosa'><img src='setosa.png' /></a>",
"<a href='https://en.wikipedia.org/wiki/Iris_versicolor'><img src='versicolor.png' /></a>",
"<a href='https://en.wikipedia.org/wiki/Iris_virginica'><img src='virginica.png' /></a>")
kable(df,escape=FALSE,format = "html")%>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)
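To open the link in a new tab, as the question asks, the standard HTML attribute target='_blank' can be added to each <a> tag. The sketch below builds the column with sprintf() instead of typing each tag by hand; the image file names are the same placeholders used above and are assumed to exist next to the output document:

```r
# URLs from the question and placeholder image files (assumed to exist)
urls <- c("https://en.wikipedia.org/wiki/Iris_setosa",
          "https://en.wikipedia.org/wiki/Iris_versicolor",
          "https://en.wikipedia.org/wiki/Iris_virginica")
imgs <- c("setosa.png", "versicolor.png", "virginica.png")

df <- iris[c(1, 51, 101), ]
# target='_blank' asks the browser to open the link in a new tab or window
df$hyperlink <- sprintf("<a href='%s' target='_blank'><img src='%s' /></a>", urls, imgs)

df$hyperlink[1]
```

The resulting data frame can be passed to kable(df, escape = FALSE, format = "html") exactly as in the answer above.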

Scraping the content of all div tags with a specific class

I'm scraping all the text from a website that occurs in a specific class of div. In the following example, I want to extract everything that's in a div of class "a".
site <- "<div class='a'>Hello, world</div>
<div class='b'>Good morning, world</div>
<div class='a'>Good afternoon, world</div>"
My desired output is...
"Hello, world"
"Good afternoon, world"
The code below extracts the text from every div, but I can't figure out how to include only class="a".
library(tidyverse)
library(rvest)
site %>%
read_html() %>%
html_nodes("div") %>%
html_text()
# [1] "Hello, world" "Good morning, world" "Good afternoon, world"
With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").
The CSS selector for div with class = "a" is div.a:
site %>%
read_html() %>%
html_nodes("div.a") %>%
html_text()
Or you can use XPath, for example html_nodes(xpath = "//div[@class='a']"):
site %>%
read_html() %>%
html_nodes(xpath = '//*[@class="a"]') %>%
html_text()
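One difference worth knowing: the CSS selector div.a matches any div whose class list contains a, while the XPath test on the class attribute with = 'a' only matches when the attribute is exactly "a". The sketch below uses a modified input (the third div carries a second class, c, which is an assumption for illustration) and shows an XPath expression that mimics the CSS behaviour:

```r
library(rvest)

# Modified example: the third div has two classes
site <- "<div class='a'>Hello, world</div>
<div class='b'>Good morning, world</div>
<div class='a c'>Good afternoon, world</div>"
page <- read_html(site)

# CSS: matches every div that has "a" among its classes
css_match <- page %>% html_nodes("div.a") %>% html_text()

# XPath, exact attribute value: misses class='a c'
exact_match <- page %>% html_nodes(xpath = "//div[@class='a']") %>% html_text()

# XPath, token match: pad the class list with spaces and look for ' a '
token_match <- page %>%
  html_nodes(xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]") %>%
  html_text()
```

For pages where elements carry several classes, the CSS selector is usually the simpler choice.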

How to extract text only from parent HTML node (excluding child node)?

I have a code:
<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
</div>
I am using this code to extract text:
html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")
And getting the result:
"(22) where?"
But I need only "p" text, excluding text that could be inside "p" in child nodes. I have to get this text:
"where"
Is there any way to exclude child nodes when getting the text?
Mac OS 10.11.6 (15G31), RStudio Version 0.99.903, R version 3.3.1 (2016-06-21)
If you are sure the text you want always comes last you can use:
doc %>% html_nodes(xpath=".//p/text()[last()]") %>% xml_text(trim = TRUE)
Alternatively, you can use the following to select all "non-empty" strings:
doc %>% html_nodes(xpath=".//p/text()[normalize-space()]") %>% xml_text(trim = TRUE)
For more details on normalize-space() see https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space
A third option is to use the xml2 package directly:
doc %>% xml2::xml_find_chr(xpath="normalize-space(.//p/text())")
This will grab all the text from <p> children (which means it won't include text from sub-nodes that aren't "text emitters"):
library(xml2)
library(rvest)
library(purrr)
txt <- '<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
<p>
stays
<b>disappears</b>
<a>disappears</a>
<span>disappears</span>
stays
</p>
</div>'
doc <- read_xml(txt)
html_nodes(doc, xpath="//p") %>%
map_chr(~paste0(html_text(html_nodes(., xpath="./text()"), trim=TRUE), collapse=" "))
## [1] "where?" "stays stays"
Unfortunately, that's pretty "lossy" (you lose <b>, <span>, etc.), but this or @Floo0's (also potentially lossy) solution may work sufficiently for you.
If you use the XML package you can actually edit nodes (i.e. delete node elements).
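The XML package itself is not shown here. As a rough sketch of the same idea, xml2 can also delete nodes in place with xml_remove(): remove the element children of <p>, then read whatever text is left. The input string is a made-up example, and the call modifies the parsed document, so re-read the source if the original tree is still needed:

```r
library(xml2)

# Made-up input: text mixed with child elements
doc <- read_xml("<div><p>stays <b>disappears</b> stays</p></div>")

p <- xml_find_first(doc, ".//p")
# delete every element child of <p>, leaving only its own text nodes
xml_remove(xml_find_all(p, "./*"))

# collapse the leftover whitespace into single spaces
own_text <- gsub("\\s+", " ", trimws(xml_text(p)))
own_text
```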

Regex to remove white space between tags in gsub R

How can I remove white space or tabs between tags without removing it from inside the tags? I tried gsub but didn't succeed:
gsub("(^>)\\s(^<)", "", x)
Given a string like :
"<div class=\"panel\">\n <div class=\"shortcode\">\n\t <div class=\"article-\"> text text text text </div> \n </div>\n </div>"
Desired output:
<div class=\"panel\"><div class=\"shortcode\"><div class=\"article-\"> text text text text </div></div></div>
You could try using a lookaround:
gsub("(?<=\\>)(\\s*)(?=\\<)", "", x, perl = TRUE)
## [1] "<div class=\"panel\"><div class=\"shortcode\"><div class=\"article-\"> text text text text </div></div></div>"
We can use the fact that the tags have \n between them giving particularly simple solutions:
1) If s is the input string then:
gsub("\\s*\n\\s*", "", s)
(If \t cannot appear within tags as is the case in the question then the pattern could alternately be written as " *[\n\t] *".)
2) Another way is:
paste(sapply(strsplit(s, "\n"), trimws), collapse = "")
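The newline-based patterns rely on every gap between tags containing a \n. A small sketch with a made-up string where the tags are separated only by spaces or a tab shows the difference: the lookaround version still strips the gaps, while the newline version leaves the string unchanged.

```r
# Made-up input: gaps between tags contain spaces and a tab, but no newline
x <- "<ul> <li> first item </li>\t<li>second</li> </ul>"

# lookaround: remove whitespace that sits directly between '>' and '<'
by_tags <- gsub("(?<=>)\\s*(?=<)", "", x, perl = TRUE)

# newline-based: only removes whitespace runs that contain a newline
by_newline <- gsub("\\s*\n\\s*", "", x)

by_tags
by_newline
```

Text inside the tags (" first item ") keeps its surrounding spaces in both cases, because that whitespace is not enclosed by '>' on the left and '<' on the right.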
