This bit of code:
library(tidyverse)
library(rvest)
url <- "http://www.imdb.com/title/tt4116284/"
director <- read_html(url) %>%
html_nodes(".summary_text+ .credit_summary_item .itemprop") %>%
html_text()
Will grab the plain text value "Chris McKay" (the director of the new LEGO Batman Movie). The underlying hyperlink href address, however, points to:
http://www.imdb.com/name/nm0003021?ref_=tt_ov_dr
I want that. How can I adjust my CSS selector to grab the underlying hyperlink's href address?
Take the href attr of the parent a tag:
director <- read_html(url) %>%
html_nodes(".summary_text+ .credit_summary_item span a") %>%
html_attr('href')
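Note that html_attr() returns whatever is in the attribute, which on pages like this is often a relative path such as "/name/nm0003021?ref_=tt_ov_dr". A minimal follow-up sketch, assuming that is the case, using xml2::url_absolute() to rebuild the full address:
# join the (possibly relative) href onto the base page URL
director_url <- xml2::url_absolute(director, url)
director_url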
I'm trying to get the main image on this website
The problem is that the main image is loaded via the HTML canvas tag. When I inspect the source code, the image is drawn into this element, with no reference to any image file:
<canvas id="image1" class="mesImages ui-draggable" style="position: fixed; width: …ransform: rotate(0deg);" title="" tabindex="-1" width="359" height="542">
Is it possible to use RSelenium (and not Python) to get the image in .png or .jpg?
Gathering links with RSelenium
library(tidyverse)
library(RSelenium)
library(netstat)
rD <- rsDriver(browser = "firefox", port = free_port())
remDr <- rD[["client"]]
remDr$navigate("https://archives.landes.fr/arkotheque/visionneuse/visionneuse.php?arko=YTo3OntzOjQ6ImRhdGUiO3M6MTA6IjIwMjItMTAtMTgiO3M6MTA6InR5cGVfZm9uZHMiO3M6MTE6ImFya29fc2VyaWVsIjtzOjQ6InJlZjEiO3M6MToiNCI7czo0OiJyZWYyIjtzOjM6IjE3MyI7czoyMjoiZm9yY2VfbnVtX2ltYWdlX2RlcGFydCI7aTozNDA7czoxNjoidmlzaW9ubmV1c2VfaHRtbCI7YjoxO3M6MjE6InZpc2lvbm5ldXNlX2h0bWxfbW9kZSI7czo0OiJwcm9kIjt9")
accept <- remDr$findElement("xpath", '//*[@id="licence_clic_bouton_accepter"]')
accept$clickElement()
more <- remDr$findElement("css selector", "#arkoVision_barToggleButton_bottom")
more$clickElement()
content <- remDr$getPageSource()[[1]]
links <- content %>%
read_html() %>%
html_elements(".imgThumb_container") %>%
html_children() %>%
html_attr("rel") %>%
.[!is.na(.)]
> links %>% head()
[1] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=13a087d1f5.jpg"
[2] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=2159e78416.jpg"
[3] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=65786a44b1.jpg"
[4] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=702fbd57fa.jpg"
[5] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=9c4421d51e.jpg"
[6] "https://archives.landes.fr/arkotheque/visionneuse/img_prot.php?i=a4b68fd913.jpg"
I am trying to remove some paragraphs from an XML document in R, but I want to keep the XML structure/class. Here's some example text and my failed attempts:
library(xml2)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
xml_find_all(text, './/caption//p') %>% xml_remove() # deletes text
xml_find_all(text, './/caption//p') %>% xml_text() # removes paragraphs but also XML structure
Here's what I would like to end up with (just the paragraphs in the caption removed):
ideal_text = read_xml("<paper> <caption>The main title A sub title</caption> <p>The opening paragraph.</p> </paper>")
ideal_text
It looks like this requires multiple steps. Find the node, copy the text, remove the contents of the node and then update.
library(xml2)
library(magrittr)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
# find the caption
caption <- xml_find_all(text, './/caption')
# store the existing text
replacement <- caption %>% xml_find_all('.//p') %>% xml_text() %>% paste(collapse = " ")
# remove the original <p> nodes
caption %>% xml_find_all('.//p') %>% xml_remove()
# write the collapsed text back into the caption
xml_text(caption) <- replacement
text #test
{xml_document}
<paper>
[1] <caption>The main title A sub title</caption>
[2] <p>The opening paragraph.</p>
Most likely you will need to obtain the vector/list of caption nodes and then step through them one-by-one with a loop.
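A minimal sketch of that loop, re-reading the example document and assuming every <caption> should have its <p> children flattened the same way:
text <- read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
caps <- xml_find_all(text, ".//caption")
for (i in seq_along(caps)) {
  cap <- caps[[i]]
  ps <- xml_find_all(cap, ".//p")
  if (length(ps) == 0) next
  # collapse the paragraph text, drop the <p> nodes, then write the text back
  flat <- paste(xml_text(ps), collapse = " ")
  xml_remove(ps)
  xml_text(cap) <- flat
}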
I am creating a table with a column of hyperlinks, but those hyperlinks are very long, and I want to replace the long text with an image that can be clicked to open the link in a new tab.
For example, with this code
library(knitr)
library(kableExtra)
df = iris[c(1,51,101),]
df$hyperlink = c("https://en.wikipedia.org/wiki/Iris_setosa", "https://en.wikipedia.org/wiki/Iris_versicolor", "https://en.wikipedia.org/wiki/Iris_virginica")
kable(df, format = "html") %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)
I obtain the last column as plain-text hyperlinks, but what I would like is to show an image that, when clicked, opens the URL (preferably in a new window or tab).
You add clickable images by adding the appropriate HTML tags: <a href='...'></a> is for hyperlinks, and <img src='...' /> is for images. Simply place the image tag between the opening and closing anchor tags. Also, be sure to include escape=FALSE in the kable() call so the HTML is rendered rather than escaped.
library(kableExtra)
library(dplyr)
df = iris[c(1,51,101),]
df$hyperlink = c("<a href='https://en.wikipedia.org/wiki/Iris_setosa'><img src='setosa.png' /</a>",
"<a href='https://en.wikipedia.org/wiki/Iris_versicolor'><img src='versicolor.png' /></a>",
"<a href='https://en.wikipedia.org/wiki/Iris_virginica'><img src='virginica.png' /></a>")
kable(df, escape = FALSE, format = "html") %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)
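If the link should open in a new tab, adding target='_blank' to the anchor does the trick. A sketch with the same hypothetical image files, then rendered with the same kable() call as above:
# target='_blank' asks the browser to open each link in a new tab
df$hyperlink = sprintf(
  "<a href='%s' target='_blank'><img src='%s' /></a>",
  c("https://en.wikipedia.org/wiki/Iris_setosa",
    "https://en.wikipedia.org/wiki/Iris_versicolor",
    "https://en.wikipedia.org/wiki/Iris_virginica"),
  c("setosa.png", "versicolor.png", "virginica.png")
)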
I'm scraping all the text from a website that occurs in a specific class of div. In the following example, I want to extract everything that's in a div of class "a".
site <- "<div class='a'>Hello, world</div>
<div class='b'>Good morning, world</div>
<div class='a'>Good afternoon, world</div>"
My desired output is...
"Hello, world"
"Good afternoon, world"
The code below extracts the text from every div, but I can't figure out how to include only class="a".
library(tidyverse)
library(rvest)
site %>%
read_html() %>%
html_nodes("div") %>%
html_text()
# [1] "Hello, world" "Good morning, world" "Good afternoon, world"
With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").
The CSS selector for div with class = "a" is div.a:
site %>%
read_html() %>%
html_nodes("div.a") %>%
html_text()
Or you can use XPath:
html_nodes(xpath = "//div[@class='a']")
site %>%
read_html() %>%
html_nodes(xpath = '//*[@class="a"]') %>%
html_text()
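One caveat with the @class='a' predicate: XPath compares the whole attribute string, so a node like <div class='a b'> would be missed (the CSS selector div.a does not have this problem). A sketch of the usual token-matching idiom for that case:
site %>%
  read_html() %>%
  html_nodes(xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]") %>%
  html_text()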
I have this code:
<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
</div>
I am using this code to extract text:
html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")
And getting the result:
"(22) where?"
But I need only the text that belongs to <p> itself, excluding any text inside child nodes of <p>. I need to get this text:
"where"
Is there any way to exclude child nodes while getting the text?
Mac OS 10.11.6 (15G31), RStudio Version 0.99.903, R version 3.3.1 (2016-06-21)
If you are sure the text you want always comes last you can use:
doc %>% html_nodes(xpath=".//p/text()[last()]") %>% xml_text(trim = TRUE)
Alternatively, you can use the following to select all "non-empty" strings:
doc %>% html_nodes(xpath=".//p/text()[normalize-space()]") %>% xml_text(trim = TRUE)
For more details on normalize-space() see https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space
A third option would be to use the xml2 package directly via:
doc %>% xml2::xml_find_chr(xpath="normalize-space(.//p/text())")
This will grab only the text nodes that are direct children of <p>, which means it won't include text held inside sub-nodes such as <b>, <a>, or <span>:
library(xml2)
library(rvest)
library(purrr)
txt <- '<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
<p>
stays
<b>disappears</b>
<a>disappears</a>
<span>disappears</span>
stays
</p>
</div>'
doc <- read_xml(txt)
html_nodes(doc, xpath="//p") %>%
map_chr(~paste0(html_text(html_nodes(., xpath="./text()"), trim=TRUE), collapse=" "))
## [1] "where?" "stays stays"
Unfortunately, that's pretty "lossy" (you lose the <b>, <span>, etc. content), but this or @Floo0's (also potentially lossy) solution may work well enough for you.
If you use the XML package, you can actually edit nodes (i.e. delete node elements).
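A minimal sketch of that approach with the XML package, reusing the txt snippet from above (the parsing options are an assumption and may need adjusting for a real page):
library(XML)
doc2 <- htmlParse(txt, asText = TRUE)
# drop every element child of <p> (<b>, <a>, <span>, ...) so only the bare text remains
removeNodes(getNodeSet(doc2, "//p/*"))
# read back the remaining text and squash the leftover whitespace
gsub("\\s+", " ", trimws(xpathSApply(doc2, "//p", xmlValue)))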