rvest - select href tag string - css

I am using rvest.
> pgsession %>% jump_to(urls[2]) %>% read_html() %>% html_nodes("a")
{xml_nodeset (114)}
[1] Date
[2] Kennwort ändern
[3] Benutzernamen ändern
[4] Abmelden
...
However, I would like to get back only the links whose href attribute contains Mitglieder/Detail.
For example, the result should look like this:
[1] /Mitglieder/Detail/1213412
...
I tried, for example, a[href~=\"Mitglieder\ as the CSS selector, but I get nothing back.
Any suggestions on how to change this CSS selector?
I appreciate your replies!
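One likely cause is the operator: in CSS, [href~="x"] matches only when x is one complete whitespace-separated word of the attribute value, while [href*="x"] matches any substring. A minimal sketch of the substring form follows; the inline HTML is a made-up stand-in for the real page, which is not shown in the question.

```r
library(rvest)

# Hypothetical markup standing in for the page behind urls[2]
page <- read_html('<html><body>
  <a href="/Account/Logout">Abmelden</a>
  <a href="/Mitglieder/Detail/1213412">Member A</a>
  <a href="/Mitglieder/Detail/99">Member B</a>
</body></html>')

# *= selects every <a> whose href contains the given substring
links <- page %>%
  html_nodes('a[href*="Mitglieder/Detail"]') %>%
  html_attr("href")

print(links)
```

In the session pipeline from the question, the same selector and html_attr("href") call would replace html_nodes("a").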

Related

R rvest keeping italics in text when scraping

I'm looking to scrape some messages from an online message board.
Currently I am using:
html_nodes(conv,'.talk-post.message') %>%
html_text(trim = TRUE)
For the message:
I'm back now and slowly getting back to speed.
This gives:
"\nI'm back now and slowly getting back to speed.\n"
Which works fine, but removes all html formatting. I would like to retain an indication of where the text has italics tags (similarly for underlining and bold).
I appreciate I could use toString.XMLNode instead, but then that keeps all html tags, not just the three required.
"{xml_nodeset (1)}\n[1] <div class=\"talk-post message\">\\n<p><i>I'm back now and slowly getting back to speed.</i><br>
Are there any more elegant solutions to this?
You can use the XML package to get the full markup inside each div, including the formatting tags.
> library(XML)
> txtNode <- "<div><i>Hello</i></div><div><b>World</b></div><div><b><i>!</i></b></div>"
> html <- htmlParse(txtNode)
> html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div><i>Hello</i></div>
<div><b>World</b></div>
<div><b><i>!</i></b></div>
</body></html>
>
> lNode <- getNodeSet(html, "//div")
> lNode
[[1]]
<div>
<i>Hello</i>
</div>
[[2]]
<div>
<b>World</b>
</div>
[[3]]
<div>
<b>
<i>!</i>
</b>
</div>
attr(,"class")
[1] "XMLNodeSet"
>
> lapply(lNode, function(x) toString.XMLNode(x[[1]]))
[[1]]
[1] "<i>Hello</i> "
[[2]]
[1] "<b>World</b> "
[[3]]
[1] "<b>\n <i>!</i>\n</b> "
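To keep only the three formatting tags rather than all markup, one option is to serialize the node's children and strip every other tag with a regular expression. This is a rough sketch, not part of the answer above: keep_inline_tags is a hypothetical helper name, the sample HTML is invented, and the pattern only keeps bare <i>, <b> and <u> tags (a tag with attributes, such as <b class="x">, would be removed).

```r
library(xml2)

# Hypothetical helper: return the inner HTML of a node, keeping only
# bare <i>, <b> and <u> tags and dropping all other tags
keep_inline_tags <- function(node) {
  inner <- paste(as.character(xml_contents(node)), collapse = "")
  # Remove any tag that is not exactly <i>, </i>, <b>, </b>, <u> or </u>
  stripped <- gsub("<(?!/?(i|b|u)>)[^>]*>", "", inner, perl = TRUE)
  trimws(stripped)
}

# Invented sample message
doc <- read_html('<div class="talk-post message"><p><i>I am back now</i> and <b>slowly</b> getting back to speed.</p><br></div>')
node <- xml_find_first(doc, "//div")
result <- keep_inline_tags(node)
print(result)
```

A regular expression is a blunt tool for HTML, so this is only reasonable when the posts contain simple inline markup.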

Web scraping with rvest - Unexpected behaviour

I would like to scrape all the links from this web page with rvest: http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15
I have tried with the following:
library(rvest)
url <- read_html('http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15')
nodes <- html_nodes(x = url, css = 'a') %>%
html_attr('href')
Rather than getting all of them, I only got 3. I had a look at the HTML structure of the page and there are definitely more links - particularly in the table.
I then tried to get those ones - the table is in the block_content div:
url <- read_html('http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15')
nodes <- html_nodes(x = url, css = '.block_content') %>%
html_attr('href')
I didn't get any. How do I go ahead?
url <- 'http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/1'
library(XML)
doc <- htmlParse(url)
# Select the href attribute of every <a> element
links <- xpathSApply(doc, "//a/@href")
links
href
"#contentIntCenter"
href
"#menu2"
href
"http://www.interno.it"
href
"/index.html"
href
"/ser/revisori_intro.html"
href
"/docum/index.html"
href
"/ser/index.html"
href
"/ser/revisori_intro.html"
href
"/apps/revisori.php/corsi_condivisi"
href
"/ser/revisori/rev_datisintesi.html"
href
"/apps/revisori.php/albo_revisori"
href
"/apps/revisori.php/situazione_organo_ente"
href
"/apps/revisori.php/get_estraz"
href
"/apps/revisori.php/enti_organo_scadenza"
href
"/apps/revisori.php/register"
href
"/apps/revisori.php"
href
"/ser/revisori/rev_faq.html"
href
"/ser/revisori/rev_circolari.html"
href
"/ser/revisori/rev_comunicati.html"
href
"/ser/revisori/rev_algoritmo.html"
href
"/ser/revisori/rev_contributo.html"
href
"/ser/revisori/rev_comefare.html"
href
"/ser/revisori/rev_contatti.html"
href
"/ser/tbel_intro.html"
href
"#"
href
"/ser/revisori/20170711_16990elencorev.pdf"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/28366#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/31365#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/13681#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/33752#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/11324#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/38169#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/29081#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/27175#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/36459#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/21036#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/31852#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/13244#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/4532#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/32139#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/10652#combo1"
href
"#"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/30"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/45"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/60"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/75"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/16980"
href
"#inizio"
href
"#"
href
"#"
href
"#"
href
"#"
href
"/note_legali.html"
href
"http://www.gazzettaufficiale.it"
href
"http://www.italia.gov.it"
href
"http://www.governoitaliano.it"
url <- read_html("http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15")
url %>% html_nodes("a") %>% html_attr("href") %>% grep(pattern = "http", value = TRUE)
The grep() call keeps only the links that contain "http":
[1] "http://www.interno.it"
[2] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/1512#combo1"
[3] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/37640#combo1"
[4] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/5185#combo1"
[5] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/36196#combo1"
[6] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/2028#combo1"
[7] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/8882#combo1"
[8] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/18386#combo1"
[9] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/354#combo1"
[10] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/31841#combo1"
[11] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/18165#combo1"
[12] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/14787#combo1"
[13] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/37955#combo1"
[14] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/37414#combo1"
[15] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/13739#combo1"
[16] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/11640#combo1"
[17] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/"
[18] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/1/"
[19] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/30"
[20] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/45"
[21] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/60"
[22] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/75"
[23] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/90"
[24] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/30"
[25] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/16980"
[26] "http://www.gazzettaufficiale.it"
[27] "http://www.italia.gov.it"
[28] "http://www.governoitaliano.it"
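The second attempt in the question selects the .block_content element itself, which has no href attribute, so html_attr('href') has nothing to return. A minimal sketch of selecting the links inside that block instead, using invented inline HTML (the example.com URLs and the base URL are assumptions for illustration only):

```r
library(rvest)

# Invented markup: a div with class block_content holding a table of links
page <- read_html('<div class="block_content"><table><tr>
  <td><a href="http://example.com/a">A</a></td>
  <td><a href="/b">B</a></td>
</tr></table></div>')

# Descendant selector: every <a> anywhere inside .block_content
hrefs <- page %>%
  html_nodes(".block_content a") %>%
  html_attr("href")

# Keep only links that start with http, as the grep() answer does
absolute_only <- grep("^http", hrefs, value = TRUE)

# Alternatively, resolve relative links against a base URL (assumed here)
resolved <- xml2::url_absolute(hrefs, "http://example.com/dir/page")

print(hrefs)
print(resolved)
```

Filtering with grep() drops relative links, while url_absolute() keeps them in resolved form; which one fits depends on what you need.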

How to extract image URL from the <script> in the html code in R?

I use rvest to extract information from the link.
But this time there is no image URL in the html_attr("src") under the respective html node.
The source code is:
<img alt="product name " class="cz-img large_img image_size img_slider_1060571227 img_2" id="d3-view_2" itemprop="image" style="height: auto;" src="">
<script>
var image_url = "https://images.xyz.com/i/314183/large/swatch-image20160708-13472-dh956c.jpg?1467959305";
$('.img_2').attr('src',image_url);
$('.img_2').on('load', function(){
$('.image_message_color').show();
});
</script>
I usually use:
#Get image_url
image_url<-link %>%
html_nodes("#d3-view_1") %>%
html_attr("src")
image_url
But here, the src is empty.
There are 3 or 4 images set this way, and what I want to extract is images.xyz.com/i/314183/large/swatch-image20160708-13472-dh956c.jpg?1467959305
Please help.
I had the same issue. For me it worked when I added an html_nodes("img") before the html_attr("src"):
library(rvest)
html <- read_html("webpage url")
html %>%
html_nodes("tr+ tr th") %>% # adjust to your path
html_nodes("img") %>%
html_attr("src")
I suggest using a regular expression to extract the image URLs. Here is a sample:
html <- readLines("webpage link")
# Match the image host, then everything up to a .jpg, .gif or .png extension
images <- regmatches(html, regexpr("https://images\\.xyz\\.com[^\"]+\\.(jpg|gif|png)", html))
Adjust the regular expression to suit your case.
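Since the URL lives in the text of the <script> element rather than in src, another option is to read the script text with rvest and capture the quoted value. This is a sketch under assumptions: the inline HTML is trimmed from the snippet in the question, and the pattern assumes the variable is always written as var image_url = "...".

```r
library(rvest)

# Trimmed copy of the markup from the question
page <- read_html('<html><body>
<img class="img_2" id="d3-view_2" src="">
<script>
var image_url = "https://images.xyz.com/i/314183/large/swatch-image20160708-13472-dh956c.jpg?1467959305";
</script>
</body></html>')

# Text content of every <script> element
script_text <- page %>% html_nodes("script") %>% html_text()

# Capture whatever sits between the quotes after image_url =
pattern <- 'var image_url = "([^"]+)"'
matches <- regmatches(script_text, regexec(pattern, script_text))

# Drop scripts without a match and keep the captured group
image_urls <- unname(vapply(Filter(length, matches), function(m) m[2], character(1)))
print(image_urls)
```

With 3 or 4 images on the page, each in its own script block, this yields one URL per matching block.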

R- html_nodes doesnt find selector

I wanted to scrape some data with the "rvest" package from the URL http://www.finanzen.ch/kurse/historisch/ABB/SWL/1.1.2001_27.10.2015
I wanted to get the table with the following selector (copied via inspect option from chrome):
#historic-price-list > div > div.content > table
But html_nodes doesn't work:
> url="http://www.finanzen.ch/kurse/historisch/ABB/SWL/1.1.2001_27.10.2015"
> css_selector="#historic-price-list > div > div.content > table"
> html(url) %>% html_nodes(css_selector)
{xml_nodeset (0)}
What I can find is:
> css_selector="#historic-price-list"
> html(url) %>% html_nodes(css_selector)
{xml_nodeset (1)}
[1] <div id="historic-price-list"/>
But it doesn't go any further.
Does anyone have an idea why?
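The output above shows the div as an empty element, so the HTML that rvest downloads has nothing inside it for the longer selector to match. One possible explanation, which this thread does not confirm, is that the browser fills the table in later with JavaScript. A small sketch of how to check whether a container is empty in the downloaded HTML, using a stand-in document:

```r
library(rvest)

# Stand-in for the downloaded page: the container exists but is empty
page <- read_html('<div id="historic-price-list"></div>')

container <- html_node(page, "#historic-price-list")

# Number of child elements in the container as delivered by the server
n_children <- length(html_children(container))

# Any selector that reaches below the container then finds nothing
tables <- html_nodes(page, "#historic-price-list table")

print(n_children)
print(length(tables))
```

If the count is zero for the real page as well, the data has to come from somewhere other than the initial HTML.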

rvest: how to follow_link an image in a webpage?

I need to follow a link that is actually an image in the HTML file (the UCR logo at the top left). How should I do this?
I have the following code:
url <- "http://ringmaster.cs.ucr.edu/Rings.html"
p <- html_session(url)
p %>% follow_link("")
The html code for the logo is:
<a href ="http://www.ucr.edu/">
<img class="pos_fixed" src="images/ucr_logo.jpg" >
</a>
I greatly appreciate it.
You can use:
p %>% follow_link(css = "#container > a:nth-child(1)")
Have a look at ?follow_link: you can also supply a CSS or XPath selector.
Also have a look at http://selectorgadget.com/ to see how to get the CSS selector.
Try this:
library(rvest)
url <- "http://ringmaster.cs.ucr.edu/Rings.html"
p <- read_html(url) %>% html_node("a") %>% html_attr("href")
Now p contains the URL you need.
More on rvest http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
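If the logo is not the first link on the page, an XPath expression can pick the <a> by the image it wraps. A minimal sketch, assuming inline HTML modeled on the snippet in the question (the container div and the second link are invented):

```r
library(rvest)

# Markup modeled on the question; the second link is invented
page <- read_html('<div id="container">
  <a href="other.html">Other</a>
  <a href="http://www.ucr.edu/"><img class="pos_fixed" src="images/ucr_logo.jpg"></a>
</div>')

# Select the <a> that has an <img> child whose src mentions ucr_logo
logo_href <- page %>%
  html_node(xpath = '//a[img[contains(@src, "ucr_logo")]]') %>%
  html_attr("href")

print(logo_href)
```

The same XPath string could be passed to follow_link(xpath = ...) on the session, as mentioned in the first answer.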

Resources