Underlying hyperlink href address from css selector - r

This bit of code:
library(tidyverse)
library(rvest)
url <- "http://www.imdb.com/title/tt4116284/"
director <- read_html(url) %>%
  html_nodes(".summary_text+ .credit_summary_item .itemprop") %>%
  html_text()
will grab the plain text value "Chris McKay" (the director of the new LEGO Batman Movie). The underlying hyperlink's href, however, points to:
http://www.imdb.com/name/nm0003021?ref_=tt_ov_dr
I want that. How can I adjust my css selector to grab the underlying hyperlink href address?

Take the href attribute of the parent a tag:
director <- read_html(url) %>%
  html_nodes(".summary_text+ .credit_summary_item span a") %>%
  html_attr("href")
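To see the difference between html_text() and html_attr() without a network call, here is a self-contained sketch; the inline HTML below is a simplified stand-in for the IMDb fragment, not its actual markup:

```r
library(rvest)

# Inline HTML standing in for the relevant fragment of the IMDb page
page <- read_html(
  "<div class='credit_summary_item'>
     <span class='itemprop'><a href='/name/nm0003021?ref_=tt_ov_dr'>Chris McKay</a></span>
   </div>")

node <- html_node(page, ".credit_summary_item span a")
html_text(node)         # the visible name
html_attr(node, "href") # the underlying link
```

Note the href here is site-relative; IMDb resolves it against http://www.imdb.com.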

Scraping image inside canvas with RSelenium

I'm trying to get the main image on this website
The problem is that the main image is loaded via the HTML canvas tag. When I inspect the source code, the image is loaded in here, with no information on any file:
<canvas id="image1" class="mesImages ui-draggable" style="position: fixed; width: …ransform: rotate(0deg);" title="" tabindex="-1" width="359" height="542">
Is it possible to use RSelenium (and not Python) to get the image in .png or .jpg?
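Untested against this exact site, but the usual trick is to have the browser serialize the canvas itself with JavaScript's toDataURL(), return it through executeScript(), and base64-decode the payload in R. The helper names below are mine; the canvas id image1 comes from the markup above:

```r
# Strip the "data:image/png;base64," prefix from a data URL,
# leaving only the raw base64 payload
strip_data_url_prefix <- function(data_url) {
  sub("^data:image/[a-z]+;base64,", "", data_url)
}

# Assumes remDr is an open RSelenium session on the page with the canvas
save_canvas_png <- function(remDr, canvas_id, path) {
  data_url <- remDr$executeScript(
    sprintf("return document.getElementById('%s').toDataURL('image/png');",
            canvas_id))[[1]]
  writeBin(base64enc::base64decode(strip_data_url_prefix(data_url)), path)
}

# With a live session: save_canvas_png(remDr, "image1", "image1.png")
```

One caveat: if the canvas was drawn from a cross-origin image, the browser taints it and toDataURL() throws a security error, in which case no client-side approach will work.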
Gathering links with RSelenium
library(tidyverse)
library(RSelenium)
library(netstat)
rD <- rsDriver(browser = "firefox", port = free_port())
remDr <- rD[["client"]]
remDr$navigate("https://archives.landes.fr/arkotheque/visionneuse/visionneuse.php?arko=YTo3OntzOjQ6ImRhdGUiO3M6MTA6IjIwMjItMTAtMTgiO3M6MTA6InR5cGVfZm9uZHMiO3M6MTE6ImFya29fc2VyaWVsIjtzOjQ6InJlZjEiO3M6MToiNCI7czo0OiJyZWYyIjtzOjM6IjE3MyI7czoyMjoiZm9yY2VfbnVtX2ltYWdlX2RlcGFydCI7aTozNDA7czoxNjoidmlzaW9ubmV1c2VfaHRtbCI7YjoxO3M6MjE6InZpc2lvbm5ldXNlX2h0bWxfbW9kZSI7czo0OiJwcm9kIjt9")
accept <- remDr$findElement("xpath", '//*[@id="licence_clic_bouton_accepter"]')
accept$clickElement()
more <- remDr$findElement("css selector", "#arkoVision_barToggleButton_bottom")
more$clickElement()
content <- remDr$getPageSource()[[1]]
links <- content %>%
  read_html() %>%
  html_elements(".imgThumb_container") %>%
  html_children() %>%
  html_attr("rel") %>%
  .[!is.na(.)]
> links %>% head()
[1] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=13a087d1f5.jpg"
[2] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=2159e78416.jpg"
[3] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=65786a44b1.jpg"
[4] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=702fbd57fa.jpg"
[5] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=9c4421d51e.jpg"
[6] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=a4b68fd913.jpg"
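To actually fetch those files, the img_prot.php name suggests the server checks the session, so you will likely need the browser's cookies. A sketch (untested against this server; the helper name is mine) that copies cookies from the RSelenium session into download.file():

```r
# Build a Cookie header value from RSelenium's getAllCookies() output,
# which is a list of lists each carrying $name and $value
format_cookie_header <- function(cookies) {
  paste(vapply(cookies, function(ck) paste0(ck$name, "=", ck$value),
               character(1)),
        collapse = "; ")
}

# With the remDr and links from above (needs a live session, hence commented):
# cookie_header <- format_cookie_header(remDr$getAllCookies())
# for (i in seq_along(links)) {
#   download.file(links[i], sprintf("page_%03d.jpg", i), mode = "wb",
#                 headers = c(Cookie = cookie_header))
# }
```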

xml in R, remove paragraphs but keep xml class

I am trying to remove some paragraphs from an XML document in R, but I want to keep the XML structure/class. Here's some example text and my failed attempts:
library(xml2)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
xml_find_all(text, './/caption//p') %>% xml_remove() # deletes text
xml_find_all(text, './/caption//p') %>% xml_text() # removes paragraphs but also XML structure
Here's what I would like to end up with (just the paragraphs in the caption removed):
ideal_text = read_xml("<paper> <caption>The main title A sub title</caption> <p>The opening paragraph.</p> </paper>")
ideal_text
It looks like this requires multiple steps. Find the node, copy the text, remove the contents of the node and then update.
library(xml2)
library(magrittr)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
# find the caption
caption <- xml_find_all(text, './/caption')
#store existing text
replacement <- caption %>% xml_find_all('.//p') %>% xml_text() %>% paste(collapse = " ")
#remove the desired text
caption %>% xml_find_all( './/p') %>% xml_remove()
#replace the caption
xml_text(caption) <- replacement
text #test
{xml_document}
<paper>
[1] <caption>The main title A sub title</caption>
[2] <p>The opening paragraph.</p>
Most likely you will need to obtain the vector/list of caption nodes and then step through them one-by-one with a loop.
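Following that advice, a sketch that loops over every caption (a second caption is added here purely to show the iteration):

```r
library(xml2)

doc <- read_xml("<paper>
  <caption><p>The main title</p> <p>A sub title</p></caption>
  <caption><p>Another</p> <p>caption</p></caption>
  <p>The opening paragraph.</p>
</paper>")

caps <- xml_find_all(doc, ".//caption")
for (i in seq_along(caps)) {
  caption <- caps[[i]]
  # flatten this caption's paragraph text, drop the <p> nodes, write it back
  flat <- paste(xml_text(xml_find_all(caption, ".//p")), collapse = " ")
  xml_remove(xml_find_all(caption, ".//p"))
  xml_text(caption) <- flat
}
```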

R - kableExtra create column with link

I am creating a table with a column of hyperlinks, but those hyperlinks are very long, so I want to replace the long text with an image that can be clicked to open the link in a new tab.
For example, with this code
df = iris[c(1,51,101),]
df$hyperlink = c("https://en.wikipedia.org/wiki/Iris_setosa", "https://en.wikipedia.org/wiki/Iris_versicolor", "https://en.wikipedia.org/wiki/Iris_virginica")
kable(df, format = "html") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)
I obtain the last column as hyperlinks, but what I would like is an image that, when clicked, opens the URL (preferably in a new window or tab).
You add clickable images by adding the appropriate HTML tags: <a href='...'></a> for hyperlinks and <img src='...'> for images. Simply place the image tag between the opening and closing anchor tags. Also, be sure to include escape = FALSE in the kable() call so the HTML is rendered rather than escaped.
library(kableExtra)
library(dplyr)
df = iris[c(1,51,101),]
df$hyperlink = c("<a href='https://en.wikipedia.org/wiki/Iris_setosa'><img src='setosa.png' /></a>",
                 "<a href='https://en.wikipedia.org/wiki/Iris_versicolor'><img src='versicolor.png' /></a>",
                 "<a href='https://en.wikipedia.org/wiki/Iris_virginica'><img src='virginica.png' /></a>")
kable(df, escape = FALSE, format = "html") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)
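With more rows, building those strings by hand gets error-prone. A small helper keeps the quoting in one place; the function name and the target='_blank' attribute (which makes the link open in a new tab, as the question asks) are my additions:

```r
# Hypothetical helper: an image wrapped in a link that opens in a new tab
link_image <- function(url, img) {
  sprintf("<a href='%s' target='_blank'><img src='%s' /></a>", url, img)
}

df <- iris[c(1, 51, 101), ]
df$hyperlink <- link_image(
  c("https://en.wikipedia.org/wiki/Iris_setosa",
    "https://en.wikipedia.org/wiki/Iris_versicolor",
    "https://en.wikipedia.org/wiki/Iris_virginica"),
  c("setosa.png", "versicolor.png", "virginica.png"))
```

sprintf() is vectorized over both arguments, so one call fills the whole column.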

Scraping the content of all div tags with a specific class

I'm scraping all the text from a website that occurs in a specific class of div. In the following example, I want to extract everything that's in a div of class "a".
site <- "<div class='a'>Hello, world</div>
<div class='b'>Good morning, world</div>
<div class='a'>Good afternoon, world</div>"
My desired output is...
"Hello, world"
"Good afternoon, world"
The code below extracts the text from every div, but I can't figure out how to include only class="a".
library(tidyverse)
library(rvest)
site %>%
  read_html() %>%
  html_nodes("div") %>%
  html_text()
# [1] "Hello, world" "Good morning, world" "Good afternoon, world"
With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").
The CSS selector for div with class = "a" is div.a:
site %>%
  read_html() %>%
  html_nodes("div.a") %>%
  html_text()
Or you can use XPath: html_nodes(xpath = "//div[@class='a']")
site %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="a"]') %>%
  html_text()
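One caveat with the XPath versions: [@class='a'] matches only when the class attribute is exactly "a". If a div carries several classes (say class='a big'), div.a still matches but the exact-match XPath does not; the standard workaround pads the attribute with spaces and searches for the padded class name:

```r
library(rvest)

site2 <- "<div class='a'>Hello, world</div>
<div class='a big'>Good afternoon, world</div>
<div class='b'>Good morning, world</div>"

site2 %>%
  read_html() %>%
  html_nodes(xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]") %>%
  html_text()
```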

How to extract text only from parent HTML node (excluding child node)?

I have this markup:
<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
</div>
I am using this code to extract text:
html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")
And getting the result:
"(22) where?"
But I need only the text of "p" itself, excluding text that could be inside child nodes of "p". I want to get this text:
"where?"
Is there any way to exclude child nodes when getting the text?
Mac OS 10.11.6 (15G31), RStudio Version 0.99.903, R version 3.3.1 (2016-06-21)
If you are sure the text you want always comes last, you can use:
doc %>% html_nodes(xpath=".//p/text()[last()]") %>% xml_text(trim = TRUE)
Alternatively, you can use the following to select all "non-empty" text nodes:
doc %>% html_nodes(xpath=".//p/text()[normalize-space()]") %>% xml_text(trim = TRUE)
For more details on normalize-space() see https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space
A third option is to use the xml2 package directly via:
doc %>% xml2::xml_find_chr(xpath="normalize-space(.//p/text())")
This will grab all the direct text children of each <p> (which means it won't include text from element sub-nodes such as <b> or <span>):
library(xml2)
library(rvest)
library(purrr)
txt <- '<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
<p>
stays
<b>disappears</b>
<a>disappears</a>
<span>disappears</span>
stays
</p>
</div>'
doc <- read_xml(txt)
html_nodes(doc, xpath = "//p") %>%
  map_chr(~paste0(html_text(html_nodes(., xpath = "./text()"), trim = TRUE), collapse = " "))
## [1] "where?" "stays stays"
Unfortunately, that's pretty "lossy" (you lose <b>, <span>, etc.), but this or @Floo0's (also potentially lossy) solution may work sufficiently for you.
If you use the XML package you can actually edit nodes (i.e. delete node elements).
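The same node surgery is also possible with xml2: delete the child elements of <p> outright, and only its own text survives. A self-contained sketch:

```r
library(xml2)

doc2 <- read_xml("<p> stays <b>disappears</b> <span>disappears</span> stays </p>")

# Remove every child *element* of <p>; its direct text nodes remain
xml_remove(xml_find_all(doc2, "//p/*"))
xml_text(doc2, trim = TRUE)
```

Like the other approaches, this is destructive and lossy, so work on a copy if you need the original document later.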