rvest: how to follow_link an image in a webpage? - r

I need to click a link that is actually an image in the HTML file (the UCR logo at the top left). How should I do this?
I have the following code:
url <- "http://ringmaster.cs.ucr.edu/Rings.html"
p <- html_session(url)
p %>% follow_link("")
The html code for the logo is:
<a href ="http://www.ucr.edu/">
<img class="pos_fixed" src="images/ucr_logo.jpg" >
</a>
I greatly appreciate it.

You can use:
p %>% follow_link(css = "#container > a:nth-child(1)")
Have a look at ?follow_link: you can also supply a CSS or XPath selector.
Also have a look at http://selectorgadget.com/ for how to get the CSS selector.

Try this:
library(rvest)
url <- "http://ringmaster.cs.ucr.edu/Rings.html"
p <- read_html(url) %>% html_node("a") %>% html_attr("href")  # href of the first <a> on the page
Now p contains the URL you need.
More on rvest http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
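As a minimal offline sketch of the idea in the answers above, the logo markup from the question can be parsed from a string and the wrapping link located with an XPath predicate. The XPath `//a[img]` (an `a` element that has an `img` child) is an assumption chosen for illustration and does not come from either answer; the `logo_html` string, including its `#container` wrapper and the extra text link, is made up to stand in for the live page.

```r
library(rvest)

# Inline stand-in for the live page: the logo markup from the question,
# wrapped in a made-up container with one extra, made-up text link.
logo_html <- '<div id="container">
  <a href="http://www.ucr.edu/">
    <img class="pos_fixed" src="images/ucr_logo.jpg">
  </a>
  <a href="other.html">Some text link</a>
</div>'

page <- read_html(logo_html)

# Select the first <a> that has an <img> child, then read its href.
logo_href <- page %>%
  html_node(xpath = "//a[img]") %>%
  html_attr("href")

logo_href
```

Since follow_link also accepts an xpath argument, the same expression should in principle work on a live session as `p %>% follow_link(xpath = "//a[img]")`, though that has not been checked against the actual page.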

Related

Rendering HTML pages using Plumber in R

I understand that plumber is better suited to building an API than a full-fledged website. That said, I am trying to display dynamic data from the db (mongo) in HTML. Everything works, but the approach I use (heavily inspired by a Titanic example) might not be the best one. Here is an example of my homepage:
#' Return result
#' @get /
#' @serializer html
function(ht) {
title <- "Title"
body_intro <- "Welcome to R.gift!"
body_model <- paste("This is just a test ...page from the db with name <b>", single[1], "</b>")
body_msg <- paste("the home page title is <b>", single[2] , "</b>",
"The home page content:<b>",
single[3], "</b>",
sep = "\n")
css <- ' <link href="https://cloud.typography.com/7/css/fonts.css" rel="stylesheet">'
about <- 'about page'
contact <- 'contact page'
result <- paste(
"<head>",
css,
"</head>",
"<html>",
'<div style="font-family: Giant Background A, Giant Background B;font-style: normal;font-weight: 400;font-size:20pt;">',
"<h1>", title, "</h1>",
"<body>",
body_intro, "</div>",
'<div style="font-family: Gotham A, Gotham B;font-style: normal;font-weight: 400;font-size:16pt;">',
"<p>", body_model, "</p>",
"<p>", body_msg, "</p>",
"<p>", about, "</p>",
"<p>", contact, "</p>",
"</div>",
"</body>",
"</html>",
collapse = "\n"
)
return(result)
}
So my question is whether there is a more elegant way to achieve the same result, perhaps with a semi-templating system. The solution might be obvious (I am very new to R, so bear with me). I know that plumber can serve static files with
#* @assets ./files/static
list()
but I assume this wouldn't allow me to pass variables into index.html, for example?
The ideal scenario is having just tags like in any templating system.
Thanks in advance!
You can try something like this:
basicHTML <- "The home page title is <b> %s </b> The home page content: <b> %s </b>" # create a template in a file or in the script.
single <- c(title = "Hello World", bodyMsg = "lorem ipsum") # set up your parameters.
finalHTML <- sprintf(basicHTML, single[1], single[2])
Or maybe you feel more comfortable with this approach: https://shiny.rstudio.com/articles/templates.html
I hope this helps you.
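To make the templating idea a little more concrete, here is a minimal base-R sketch that swaps named `{{placeholder}}` tags for values. The `render_template` helper and the `{{...}}` syntax are hypothetical, introduced only for illustration; they are not part of plumber or of the answer above.

```r
# Hypothetical helper: replace each {{name}} placeholder in `template`
# with the matching entry of the named list `values`.
render_template <- function(template, values) {
  for (name in names(values)) {
    placeholder <- paste0("{{", name, "}}")
    # fixed = TRUE treats the placeholder as a literal string, not a regex.
    template <- gsub(placeholder, values[[name]], template, fixed = TRUE)
  }
  template
}

# The template could equally be read from a file; here it is an inline string.
template <- "<h1>{{title}}</h1>\n<p>The home page content: <b>{{body}}</b></p>"
page <- render_template(template, list(title = "Hello World", body = "lorem ipsum"))
cat(page)
```

This sketch does no HTML escaping, so values coming from a database would need sanitizing first. Packages such as whisker offer a fuller Mustache-style version of the same idea.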
My advice would be to do away with server-side rendering and do everything on the client side. You could use a frontend library like Vue (if you understand HTML and basic JavaScript, you can use Vue) and then send a GET request to the backend for the data you need. Then use that data to customize the UI. This approach lets R focus on the data and the frontend focus on the presentation.

How to extract image URL from the <script> in the html code in R?

I use rvest to extract information from the link.
But this time there is no image URL in the html_attr("src") under the respective html node.
The source code is:
<img alt="product name " class="cz-img large_img image_size img_slider_1060571227 img_2" id="d3-view_2" itemprop="image" style="height: auto;" src="">
<script>
var image_url = "https://images.xyz.com/i/314183/large/swatch-image20160708-13472-dh956c.jpg?1467959305";
$('.img_2').attr('src',image_url);
$('.img_2').on('load', function(){
$('.image_message_color').show();
});
</script>
I usually use:
#Get image_url
image_url<-link %>%
html_nodes("#d3-view_1") %>%
html_attr("src")
image_url
But here, the src is empty.
There are 3 or 4 images like this, and what I want to extract is images.xyz.com/i/314183/large/swatch-image20160708-13472-dh956c.jpg?1467959305
Please help.
Had the same issue. For me it worked when I added a html_nodes("img") before the html_attr("src"):
library(rvest)
html <- read_html("webpage url")
html %>%
html_nodes("tr+ tr th") %>% # adjust to your path
html_nodes("img") %>%
html_attr("src")
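I suggest using regular expressions to extract the image URLs. Here is a sample:
html <- readLines("webpage link")
# Escape the dots and use a group (jpg|gif|png); [jpg|gif|png] would be a character class matching a single character.
images <- regmatches(html, regexpr("https://images\\.xyz\\.com[^\"']+\\.(jpg|gif|png)", html))
You can adapt the regular expression to your scenario.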
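As a sketch of the regex approach applied to the question's own markup, the URL can be captured straight from the `var image_url = "..."` assignment. The `html_lines` vector below is a shortened copy of the source shown in the question, and the pattern assumes the assignment is always written in exactly that form.

```r
# Shortened copy of the question's source; a real page would come from readLines().
html_lines <- c(
  '<img alt="product name " class="cz-img large_img" id="d3-view_2" src="">',
  '<script>',
  'var image_url = "https://images.xyz.com/i/314183/large/swatch-image20160708-13472-dh956c.jpg?1467959305";',
  "$('.img_2').attr('src',image_url);",
  '</script>'
)

# Capture whatever sits between the quotes of `var image_url = "..."`.
pattern <- 'var image_url = "([^"]+)"'
hits <- regmatches(html_lines, regexec(pattern, html_lines))

# regexec yields the full match followed by the capture group;
# drop lines without a match and keep group 1 from the rest.
image_urls <- vapply(Filter(length, hits), function(m) m[2], character(1))
image_urls
```

Because every line is checked, the same code picks up all 3 or 4 images if each has its own `image_url` assignment on a separate line.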

rvest - select href tag string

I am using rvest.
> pgsession %>% jump_to(urls[2]) %>% read_html() %>% html_nodes("a")
{xml_nodeset (114)}
[1] Date
[2] Kennwort ändern
[3] Benutzernamen ändern
[4] Abmelden
...
However, I would only like to get back the tags whose href attribute contains Mitglieder/Detail.
For example, the result should look like this:
[1] /Mitglieder/Detail/1213412
...
I tried, for example, a[href~=\"Mitglieder\ as the CSS selector, but I get nothing back.
Any suggestions on how to change this CSS selector?
I appreciate your replies!
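One possible approach, sketched on made-up markup: in CSS, `~=` only matches a whole whitespace-separated word in the attribute value, whereas `*=` matches a substring, so `a[href*='Mitglieder/Detail']` may be closer to what is wanted here. The `sample_html` string and its hrefs are illustrative assumptions, not the real page.

```r
library(rvest)

# Made-up markup loosely modelled on the output above; hrefs are illustrative only.
sample_html <- '<html><body>
  <a href="/Date">Date</a>
  <a href="/Konto/Abmelden">Abmelden</a>
  <a href="/Mitglieder/Detail/1213412">Member one</a>
  <a href="/Mitglieder/Detail/1213413">Member two</a>
</body></html>'

# Keep only anchors whose href contains the substring, then read the href values.
detail_links <- read_html(sample_html) %>%
  html_nodes("a[href*='Mitglieder/Detail']") %>%
  html_attr("href")

detail_links
```

In the session pipeline from the question, the same selector would replace "a" in html_nodes(), followed by html_attr("href").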

R - html_nodes doesn't find selector

I wanted to scrape some data with the "rvest" package from the URL http://www.finanzen.ch/kurse/historisch/ABB/SWL/1.1.2001_27.10.2015
I wanted to get the table with the following selector (copied via inspect option from chrome):
#historic-price-list > div > div.content > table
But html_nodes doesn't work:
> url="http://www.finanzen.ch/kurse/historisch/ABB/SWL/1.1.2001_27.10.2015"
> css_selector="#historic-price-list > div > div.content > table"
> html(url) %>% html_nodes(css_selector)
{xml_nodeset (0)}
What I can find is:
> css_selector="#historic-price-list"
> html(url) %>% html_nodes(css_selector)
{xml_nodeset (1)}
[1] <div id="historic-price-list"/>
But it doesn't go any further.
Does anyone have an idea why?

Retrieve Image Source with XPath in R using XML-Package

I'd like to retrieve the image source within the below HTML code block, but can't find the right syntax.
library(XML)
library(RCurl)
script <- getURL("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
(doc <- htmlParse(script))
<div class="divider"><hr></div>
<div id="contentblock"><div id="content">
<h1>Alle Angaben</h1>
<p>Zu der von Ihnen gewählten Pflanzenart liegen folgende Informationen vor:</p>
<p>Wissenschaftlicher Name: Poa badensis agg. </p>
<p>Deutscher Name: Artengruppe Badener Rispengras</p>
<p>Familienzugehörigkeit: Poaceae, Süßgräser</p>
<p>Status: keine Angaben </p>
<p class="centeredcontent"><img border="0" src="../bilder/Arten/dummy.tmb.jpg"></p>
Desired result:
"../bilder/Arten/dummy.tmb.jpg"
Any pointers are greatly appreciated!
Try the following:
script <- getURL("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
doc <- htmlTreeParse(script, useInternalNodes = TRUE)
img <- xpathSApply(doc, '//*/p[@class="centeredcontent"]/img', xmlAttrs)
> img[2]
[1] "../bilder/Arten/dummy.tmb.jpg"
Using the internal representation may be necessary.
EDIT:
I just looked up htmlParse, and it's equivalent to htmlTreeParse(useInternalNodes = TRUE).
@Martin Morgan thanks, I have added it below:
doc <- htmlParse("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
xpathSApply(doc, '//*/p[@class="centeredcontent"]/img/@src')
Use:
//div[@id='contentblock']
/div/p[@class='centeredcontent']
/img/@src
This selects the src attribute of any img element that is a child of a p element whose class attribute has the value "centeredcontent", where that p element is a child of a div that is itself a child of a div whose id attribute has the value "contentblock".
If you want to get the value of this attribute directly, use:
string(//div[@id='contentblock']
/div/p[@class='centeredcontent']
/img/@src)
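A minimal sketch of the first expression above, run with the XML package against a trimmed copy of the question's HTML. The `snippet` string is an assumption standing in for the downloaded page, and it is parsed from text rather than fetched.

```r
library(XML)

# Trimmed copy of the HTML block from the question, parsed from a string.
snippet <- '<div id="contentblock"><div id="content">
  <h1>Alle Angaben</h1>
  <p>Status: keine Angaben </p>
  <p class="centeredcontent"><img border="0" src="../bilder/Arten/dummy.tmb.jpg"></p>
</div></div>'

doc <- htmlParse(snippet, asText = TRUE)

# Select the src attribute and convert the result to a plain character vector.
xpath <- "//div[@id='contentblock']/div/p[@class='centeredcontent']/img/@src"
img_src <- as.character(xpathSApply(doc, xpath))
img_src
```

The as.character() call is there only to drop the attribute name that xpathSApply attaches to the result.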

Resources