Scraping image inside canvas with RSelenium

I'm trying to get the main image on this website.
The problem is that the image is drawn into an HTML canvas element. When I inspect the source code, this is all I find, with no reference to any image file:
<canvas id="image1" class="mesImages ui-draggable" style="position: fixed; width: …ransform: rotate(0deg);" title="" tabindex="-1" width="359" height="542">
Is it possible to use RSelenium (and not Python) to get the image in .png or .jpg?
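One possible approach (an untested sketch, assuming `remDr` is a connected RSelenium client already on the page, and that the canvas is not "tainted" by cross-origin drawing, which would make `toDataURL()` throw) is to ask the browser itself to serialize the canvas to a base64-encoded PNG, then decode it in R with the base64enc package:

```r
library(RSelenium)
library(base64enc)

# Ask the browser to serialize the canvas pixels to a data URL.
# executeScript() returns a list, so take the first element.
data_url <- remDr$executeScript(
  "return document.getElementById('image1').toDataURL('image/png');"
)[[1]]

# Strip the "data:image/png;base64," prefix and write the raw bytes to disk.
b64 <- sub("^data:image/png;base64,", "", data_url)
writeBin(base64decode(b64), "canvas_image.png")
```

The element id `image1` is taken from the canvas tag shown above; adjust it if the page uses a different id per image.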

Gathering links with RSelenium
library(tidyverse)
library(RSelenium)
library(netstat)
rD <- rsDriver(browser = "firefox", port = free_port())
remDr <- rD[["client"]]
remDr$navigate("https://archives.landes.fr/arkotheque/visionneuse/visionneuse.php?arko=YTo3OntzOjQ6ImRhdGUiO3M6MTA6IjIwMjItMTAtMTgiO3M6MTA6InR5cGVfZm9uZHMiO3M6MTE6ImFya29fc2VyaWVsIjtzOjQ6InJlZjEiO3M6MToiNCI7czo0OiJyZWYyIjtzOjM6IjE3MyI7czoyMjoiZm9yY2VfbnVtX2ltYWdlX2RlcGFydCI7aTozNDA7czoxNjoidmlzaW9ubmV1c2VfaHRtbCI7YjoxO3M6MjE6InZpc2lvbm5ldXNlX2h0bWxfbW9kZSI7czo0OiJwcm9kIjt9")
accept <- remDr$findElement("xpath", '//*[@id="licence_clic_bouton_accepter"]')
accept$clickElement()
more <- remDr$findElement("css selector", "#arkoVision_barToggleButton_bottom")
more$clickElement()
content <- remDr$getPageSource()[[1]]
links <- content %>%
  read_html() %>%
  html_elements(".imgThumb_container") %>%
  html_children() %>%
  html_attr("rel") %>%
  .[!is.na(.)]
> links %>% head()
[1] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=13a087d1f5.jpg"
[2] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=2159e78416.jpg"
[3] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=65786a44b1.jpg"
[4] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=702fbd57fa.jpg"
[5] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=9c4421d51e.jpg"
[6] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=a4b68fd913.jpg"
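With the links gathered, a simple loop could save each image to disk. This is an untested sketch: the `img_prot.php` endpoint may require the session cookies held by the Selenium browser, in which case plain `download.file()` requests could be refused and you would need to forward the cookies (e.g. via httr) instead.

```r
# Download each gathered link to an images/ directory.
dir.create("images", showWarnings = FALSE)
for (i in seq_along(links)) {
  dest <- file.path("images", sprintf("page_%03d.jpg", i))
  download.file(links[i], destfile = dest, mode = "wb")
  Sys.sleep(1)  # be polite to the archive server
}
```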

Related

Rselenium org.openqa.selenium.WebDriverException

Using RSelenium, I want to click a button on this page. The page has two views, a desktop mode and a mobile mode, and I think the desktop view is hidden and does not allow you to click.
server <- phantomjs(port=5005L)
remDr <- remoteDriver(browserName = "phantomjs", port=5005L)
remDr$open()
remDr$navigate("https://catalogo.movistar.com.pe/samsung-galaxy-note-10-plus")
remDr$screenshot(display = TRUE)
element<- remDr$findElement(using = 'css selector', "body > div:nth-child(6) > div.container.detail-view > div.detail-main-head-container > div.detail-main-right > div.phone-details-specs > div.phone-specs > div.phone-specs-planes > div > div > div.owl-stage-outer > div > div:nth-child(4)")
element$clickElement()
remDr$screenshot(display = TRUE)
Apparently I must show the hidden view to be able to click, because in mobile mode I cannot reach the button.
Your help please. Here is a screenshot of my screen with the output from running the code.
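One thing to try (a sketch, not a confirmed fix): PhantomJS defaults to a small viewport, so the site likely serves its mobile layout. Widening the window before navigating may render the desktop view; failing that, clicking through JavaScript bypasses Selenium's visibility checks. The shortened CSS selector below is an assumption standing in for the long selector above.

```r
# Force a desktop-sized viewport so the desktop layout is rendered.
remDr$setWindowSize(width = 1920, height = 1080)
remDr$navigate("https://catalogo.movistar.com.pe/samsung-galaxy-note-10-plus")

# If the element is still reported as hidden/unclickable, click it via JS.
sel <- "div.owl-stage-outer div > div:nth-child(4)"  # hypothetical shortened selector
remDr$executeScript(sprintf("document.querySelector('%s').click();", sel))
```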

Scraping the content of all div tags with a specific class

I'm scraping all the text from a website that occurs in a specific class of div. In the following example, I want to extract everything that's in a div of class "a".
site <- "<div class='a'>Hello, world</div>
<div class='b'>Good morning, world</div>
<div class='a'>Good afternoon, world</div>"
My desired output is...
"Hello, world"
"Good afternoon, world"
The code below extracts the text from every div, but I can't figure out how to include only class="a".
library(tidyverse)
library(rvest)
site %>%
  read_html() %>%
  html_nodes("div") %>%
  html_text()
# [1] "Hello, world" "Good morning, world" "Good afternoon, world"
With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").
The CSS selector for div with class = "a" is div.a:
site %>%
  read_html() %>%
  html_nodes("div.a") %>%
  html_text()
Or you can use XPath:
html_nodes(xpath = "//div[@class='a']")
site %>%
  read_html() %>%
  html_nodes(xpath = '//*[@class="a"]') %>%
  html_text()

Underlying hyperlink href address from css selector

This bit of code:
library(tidyverse)
library(rvest)
url <- "http://www.imdb.com/title/tt4116284/"
director <- read_html(url) %>%
  html_nodes(".summary_text+ .credit_summary_item .itemprop") %>%
  html_text()
Will grab the plain text value "Chris McKay" (the director of the new LEGO Batman Movie). The underlying hyperlink href address, however, points to:
http://www.imdb.com/name/nm0003021?ref_=tt_ov_dr
I want that. How can I adjust my css selector to grab the underlying hyperlink href address?
Take the href attribute of the parent a tag:
director <- read_html(url) %>%
  html_nodes(".summary_text+ .credit_summary_item span a") %>%
  html_attr('href')

How to extract text only from parent HTML node (excluding child node)?

I have a code:
<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
</div>
I am using this code to extract text:
html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")
And getting the result:
"(22) where?"
But I need only the "p" text itself, excluding any text inside child nodes of "p". I have to get this text:
"where?"
Is there any way to exclude child nodes when getting the text?
Mac OS 10.11.6 (15G31), RStudio Version 0.99.903, R version 3.3.1 (2016-06-21)
If you are sure the text you want always comes last you can use:
doc %>% html_nodes(xpath=".//p/text()[last()]") %>% xml_text(trim = TRUE)
Alternatively you can use the following to select all "non-empty" text strings:
doc %>% html_nodes(xpath=".//p/text()[normalize-space()]") %>% xml_text(trim = TRUE)
For more details on normalize-space() see https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space
A third option is to use the xml2 package directly via:
doc %>% xml2::xml_find_chr(xpath="normalize-space(.//p/text())")
This will grab all the text from <p> children (which means it won't include text from sub-nodes that aren't "text emitters"):
library(xml2)
library(rvest)
library(purrr)
txt <- '<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
<p>
stays
<b>disappears</b>
<a>disappears</a>
<span>disappears</span>
stays
</p>
</div>'
doc <- read_xml(txt)
html_nodes(doc, xpath = "//p") %>%
  map_chr(~paste0(html_text(html_nodes(., xpath = "./text()"), trim = TRUE), collapse = " "))
## [1] "where?" "stays stays"
Unfortunately, that's pretty "lossy" (you lose <b>, <span>, etc.), but this or @Floo0's (also potentially lossy) solution may work sufficiently for you.
If you use the XML package you can actually edit nodes (i.e. delete node elements).
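The same editing is also possible with xml2 (which rvest is built on): delete the child element nodes first, then read the text that remains. A sketch against the txt snippet above, re-parsing it so the original doc is left untouched:

```r
library(xml2)

# Re-parse the snippet, then delete every child *element* of each <p>,
# keeping only the bare text nodes.
doc2 <- read_xml(txt)
p <- xml_find_all(doc2, "//p")
xml_remove(xml_find_all(p, "./*"))
xml_text(p, trim = TRUE)
```

Note that `xml_remove()` modifies the document in place, which is why this works on a fresh copy.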

Scrolling page in RSelenium

How can I manually scroll to the bottom (or top) of a page with the RSelenium WebDriver? I have an element that only becomes available when it is visible on the page.
Assuming you have set up:
library(RSelenium)
startServer()
remDr <- remoteDriver()
remDr$open()
remDr$setWindowSize(width = 800, height = 300)
remDr$navigate("https://www.r-project.org/about.html")
You could scroll to the bottom like this:
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))
And you could scroll to the top like this:
webElem$sendKeysToElement(list(key = "home"))
And in case you want to scroll down just a bit, use
webElem$sendKeysToElement(list(key = "down_arrow"))
The names of the keys are in selKeys.
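An alternative that does not depend on which element has keyboard focus is to scroll via JavaScript (a sketch, assuming `remDr` is connected as above):

```r
# Scroll to the bottom of the page...
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")

# ...back to the top...
remDr$executeScript("window.scrollTo(0, 0);")

# ...or bring a specific element into view before interacting with it.
webElem <- remDr$findElement("css", "body")
remDr$executeScript("arguments[0].scrollIntoView(true);", list(webElem))
```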
