Web scraping with rvest - Unexpected behaviour - r

I would like to scrape all the links from this web page with rvest: http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15
I have tried with the following:
library(rvest)
url <- read_html('http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15')
nodes <- html_nodes(x = url, css = 'a') %>%
  html_attr('href')
Rather than getting all of them, I only got 3. I had a look at the HTML structure of the page and there are definitely more links - particularly in the table.
I then tried to get those ones - the table is in the block_content div:
url <- read_html('http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15')
nodes <- html_nodes(x = url, css = '.block_content') %>%
  html_attr('href')
I didn't get any. How do I go ahead?

library(XML)
url <- 'http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/1'
doc <- htmlParse(url)
# select the href attribute of every <a> element
links <- xpathSApply(doc, "//a/@href")
links
href
"#contentIntCenter"
href
"#menu2"
href
"http://www.interno.it"
href
"/index.html"
href
"/ser/revisori_intro.html"
href
"/docum/index.html"
href
"/ser/index.html"
href
"/ser/revisori_intro.html"
href
"/apps/revisori.php/corsi_condivisi"
href
"/ser/revisori/rev_datisintesi.html"
href
"/apps/revisori.php/albo_revisori"
href
"/apps/revisori.php/situazione_organo_ente"
href
"/apps/revisori.php/get_estraz"
href
"/apps/revisori.php/enti_organo_scadenza"
href
"/apps/revisori.php/register"
href
"/apps/revisori.php"
href
"/ser/revisori/rev_faq.html"
href
"/ser/revisori/rev_circolari.html"
href
"/ser/revisori/rev_comunicati.html"
href
"/ser/revisori/rev_algoritmo.html"
href
"/ser/revisori/rev_contributo.html"
href
"/ser/revisori/rev_comefare.html"
href
"/ser/revisori/rev_contatti.html"
href
"/ser/tbel_intro.html"
href
"#"
href
"/ser/revisori/20170711_16990elencorev.pdf"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/28366#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/31365#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/13681#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/33752#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/11324#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/38169#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/29081#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/27175#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/36459#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/21036#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/31852#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/13244#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/4532#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/32139#combo1"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/0/idA/10652#combo1"
href
"#"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/30"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/45"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/60"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/75"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15"
href
"http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/16980"
href
"#inizio"
href
"#"
href
"#"
href
"#"
href
"#"
href
"/note_legali.html"
href
"http://www.gazzettaufficiale.it"
href
"http://www.italia.gov.it"
href
"http://www.governoitaliano.it"

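library(rvest)
url <- read_html("http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15")
# keep only the hrefs that contain "http" (relative links are dropped)
url %>% html_nodes("a") %>% html_attr("href") %>% grep(pattern = "http", value = TRUE)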
[1] "http://www.interno.it"
[2] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/1512#combo1"
[3] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/37640#combo1"
[4] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/5185#combo1"
[5] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/36196#combo1"
[6] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/2028#combo1"
[7] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/8882#combo1"
[8] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/18386#combo1"
[9] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/354#combo1"
[10] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/31841#combo1"
[11] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/18165#combo1"
[12] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/14787#combo1"
[13] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/37955#combo1"
[14] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/37414#combo1"
[15] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/13739#combo1"
[16] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15/idA/11640#combo1"
[17] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/"
[18] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/1/"
[19] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/30"
[20] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/45"
[21] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/60"
[22] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/75"
[23] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/90"
[24] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/30"
[25] "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/16980"
[26] "http://www.gazzettaufficiale.it"
[27] "http://www.italia.gov.it"
[28] "http://www.governoitaliano.it"
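Note that filtering on "http" drops relative links such as /index.html. As a minimal sketch of an alternative, written with Python's standard library rather than rvest, relative links can be resolved against the page URL instead of being discarded. The `LinkCollector` class, the `resolve_links` helper and the HTML fragment are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def resolve_links(html, base_url):
    """Hypothetical helper: return absolute URLs, skipping in-page anchors."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs if not h.startswith("#")]


# Made-up fragment that mimics the kinds of links listed above.
sample = """
<a href="#menu2">skip</a>
<a href="/ser/index.html">services</a>
<a href="http://www.interno.it">ministry</a>
"""
base = "http://finanzalocale.interno.it/apps/revisori.php/albo_revisori/elencoRevisori/indice/15"
links = resolve_links(sample, base)
print(links)
```

Whether in-page anchors like "#menu2" should be skipped depends on what the links are needed for; here they are dropped because they point back into the same page.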

Related

How to get href value from a link using its class name with CSS selector in scrapy?

<a class="a-link-normal a-text-normal" href="/Art-Dutch-Republic-1585-Everyman/dp/0297833693/ref=sr_1_1?keywords=9780297833697&qid=1574351815&sr=8-1">
<span class="a-size-medium a-color-base a-text-normal">Art of the Dutch Republic 1585 - 1718 (Everyman Art Library)</span>
</a>
How to get Value of href using CSS selector or Xpath?
Here is an example:
def parse(self, response):
    # iterate over every matching href attribute
    for href in response.xpath("//a[@class='class-name']/@href"):
        # extract the href as a string
        url = href.extract()
CSS selectors example:
links = response.css("a.a-link-normal.a-text-normal::attr(href)").extract()
Try this: response.css('.a-link-normal ::attr(href)').extract()
You can achieve this with the following selector:
a.your_class_name::attr(href)
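One difference between the XPath and CSS approaches is worth keeping in mind: an XPath test on `@class` with `=` compares the whole attribute string, while a CSS class selector matches individual space-separated class names. The sketch below illustrates this with Python's standard `xml.etree.ElementTree` instead of Scrapy. The wrapping `<div>` and the `&amp;` escaping are assumptions added so that the fragment parses as XML.

```python
import xml.etree.ElementTree as ET

# The <a> element from the question, wrapped in a <div> and with "&" escaped
# as "&amp;" so that it is well-formed XML (both are assumptions of this sketch).
snippet = """
<div>
  <a class="a-link-normal a-text-normal" href="/Art-Dutch-Republic-1585-Everyman/dp/0297833693/ref=sr_1_1?keywords=9780297833697&amp;qid=1574351815&amp;sr=8-1">
    <span class="a-size-medium a-color-base a-text-normal">Art of the Dutch Republic 1585 - 1718 (Everyman Art Library)</span>
  </a>
</div>
"""
root = ET.fromstring(snippet)

# Exact comparison against the full attribute value: one class alone does not match.
exact_single = root.findall(".//a[@class='a-link-normal']")
exact_full = root.findall(".//a[@class='a-link-normal a-text-normal']")

# Token-based match, which is how a CSS selector like a.a-link-normal behaves.
by_token = [a for a in root.iter("a") if "a-link-normal" in a.get("class", "").split()]

hrefs = [a.get("href") for a in by_token]
print(len(exact_single), len(exact_full), hrefs)
```

For elements with several classes, the CSS form is therefore usually the less fragile choice.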

R rvest keeping italics in text when scraping

I'm looking to scrape some messages from an online message board.
Currently I am using:
html_nodes(conv,'.talk-post.message') %>%
html_text(trim = TRUE)
For the message:
I'm back now and slowly getting back to speed.
This gives:
"\nI'm back now and slowly getting back to speed.\n"
Which works fine, but removes all html formatting. I would like to retain an indication of where the text has italics tags (similarly for underlining and bold).
I appreciate I could use toString.XMLNode instead, but then that keeps all html tags, not just the three required.
"{xml_nodeset (1)}\n[1] <div class=\"talk-post message\">\\n<p><i>I'm back now and slowly getting back to speed.</i><br>
Are there any more elegant solutions to this?
You can use the XML package to get the full string inside each div, markup included.
> library(XML)
> txtNode <- "<div><i>Hello</i></div><div><b>World</b></div><div><b><i>!</i></b></div>"
> html <- htmlParse(txtNode)
> html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div><i>Hello</i></div>
<div><b>World</b></div>
<div><b><i>!</i></b></div>
</body></html>
>
> lNode <- getNodeSet(html, "//div")
> lNode
[[1]]
<div>
<i>Hello</i>
</div>
[[2]]
<div>
<b>World</b>
</div>
[[3]]
<div>
<b>
<i>!</i>
</b>
</div>
attr(,"class")
[1] "XMLNodeSet"
>
> lapply(lNode, function(x) toString.XMLNode(x[[1]]))
[[1]]
[1] "<i>Hello</i> "
[[2]]
[1] "<b>World</b> "
[[3]]
[1] "<b>\n <i>!</i>\n</b> "
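To keep only the italic, bold and underline tags and drop everything else, one option is to walk the markup and re-emit just those three tags. The following is a minimal sketch of that idea in Python's standard library, not in rvest or XML. The `KeepInlineTags` class and `keep_formatting` helper are hypothetical, and the closing `</p></div>` in the sample is assumed, because the output shown in the question is cut off.

```python
from html import escape
from html.parser import HTMLParser


class KeepInlineTags(HTMLParser):
    """Rebuild the text, keeping only the tags listed in `keep`."""

    def __init__(self, keep=("i", "b", "u")):
        super().__init__()
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.parts.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so that characters like "<" stay valid in the output.
        self.parts.append(escape(data, quote=False))


def keep_formatting(html):
    """Hypothetical helper: strip every tag except <i>, <b> and <u>."""
    parser = KeepInlineTags()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Sample based on the node shown in the question; the closing tags are assumed.
post = ('<div class="talk-post message">\n<p><i>I\'m back now and slowly '
        'getting back to speed.</i><br></p>\n</div>')
result = keep_formatting(post)
print(result)
```

The same approach extends to other tags by passing a different `keep` tuple.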

CSS Selector For Content Crawler (Finding Category Post URL)

code
div title="xxxx" class= "xxx"
a href = "/xx/xxx"
div class = "xx"
There are multiple blocks like this, and I am trying to pull all the "a href" links from them. Please help, thanks.
Assuming your a tag is nested inside the div tag with class 'xxx':
<div title="xxxx" class= "xxx">
<a href = "/xx/xxx">
<div class = "xx">
your CSS selector would be
.xxx a { your css properties }
If your anchor tag is not nested inside the div but immediately follows it at the same level:
.xxx+a { your css properties }
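For a crawler, the descendant selector is the part that matters: collect the href of every a tag that sits anywhere inside an element with class xxx. A minimal sketch of that selection logic in plain Python follows. It assumes well-nested HTML; the `DescendantLinks` class and the sample markup are made up, and a real crawler would normally use its library's CSS selector support instead.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DescendantLinks(HTMLParser):
    """Collect href values of <a> tags nested inside a container class."""

    def __init__(self, container_class):
        super().__init__()
        self.container_class = container_class
        self.stack = []   # one bool per open element: is it a matching container?
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An <a> counts only if some currently open ancestor is a container.
        if tag == "a" and any(self.stack) and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        if tag not in VOID_TAGS:
            classes = (attrs.get("class") or "").split()
            self.stack.append(self.container_class in classes)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: each end tag closes the innermost element.
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


# Made-up markup following the structure sketched in the question.
sample = """
<div title="t1" class="xxx"><a href="/cat/one"><div class="xx">One</div></a></div>
<div title="t2" class="xxx"><a href="/cat/two"><div class="xx">Two</div></a></div>
<a href="/outside">Outside</a>
"""
parser = DescendantLinks("xxx")
parser.feed(sample)
print(parser.hrefs)
```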

How to extract image URL from the <script> in the html code in R?

I use rvest to extract information from the link.
But this time there is no image URL in the html_attr("src") under the respective html node.
The source code is:
<img alt="product name " class="cz-img large_img image_size img_slider_1060571227 img_2" id="d3-view_2" itemprop="image" style="height: auto;" src="">
<script>
var image_url = "https://images.xyz.com/i/314183/large/swatch-image20160708-13472-dh956c.jpg?1467959305";
$('.img_2').attr('src',image_url);
$('.img_2').on('load', function(){
$('.image_message_color').show();
});
</script>
I usually use:
#Get image_url
image_url <- link %>%
  html_nodes("#d3-view_1") %>%
  html_attr("src")
image_url
But here, the src is empty.
There are 3 or 4 images like this, and what I want to extract is images.xyz.com/i/314183/large/swatch-image20160708-13472-dh956c.jpg?1467959305
Please help.
Had the same issue. For me it worked when I added a html_nodes("img") before the html_attr("src"):
library(rvest)
html <- read_html("webpage url")
html %>%
  html_nodes("tr+ tr th") %>% # adjust to your path
  html_nodes("img") %>%
  html_attr("src")
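I suggest using regular expressions to extract the image URLs. Here is a sample:
html <- readLines("webpage link")
# escape the literal dots and use a group, not a character class, for the extensions;
# the trailing [^\"]* keeps any query string up to the closing quote
images <- regmatches(html, regexpr("https://images\\.xyz\\.com[^\"]+\\.(jpg|gif|png)[^\"]*", html))
You can adapt the regular expression to your scenario.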
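A minimal sketch of the same regex idea in Python's `re` module, run against the script line from the question. Note that alternatives between file extensions need a group such as `(jpg|gif|png)`; square brackets would define a set of single characters instead. The variable names are made up for illustration.

```python
import re

# The <script> line from the question.
script_line = ('var image_url = "https://images.xyz.com/i/314183/large/'
               'swatch-image20160708-13472-dh956c.jpg?1467959305";')

# Literal dots are escaped; the final [^"]* stops at the closing quote,
# so the query string after the extension is kept.
pattern = re.compile(r'https://images\.xyz\.com/[^"]+\.(?:jpg|gif|png)[^"]*')

match = pattern.search(script_line)
image_url = match.group(0) if match else None
print(image_url)
```

With several images on the page, `pattern.findall(text)` over the whole source would return one URL per script block, assuming each follows the same shape.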

rvest - select href tag string

I am using rvest.
> pgsession %>% jump_to(urls[2]) %>% read_html() %>% html_nodes("a")
{xml_nodeset (114)}
[1] Date
[2] Kennwort ändern
[3] Benutzernamen ändern
[4] Abmelden
...
However, I would like to get back only the a tags whose href attribute contains Mitglieder/Detail.
For example, the result should look like this:
[1] /Mitglieder/Detail/1213412
...
I tried, for example, a[href~=\"Mitglieder\ as the CSS selector, but I get nothing back as a result.
Any suggestions how to change this css selector?
I appreciate your replies!
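In CSS, `[attr~=value]` matches only when `value` is one of the whitespace-separated words of the attribute, whereas `[attr*=value]` matches any substring, so a selector along the lines of `a[href*='Mitglieder/Detail']` is probably closer to what is needed here. The sketch below only emulates the two operators in plain Python to show the difference; it is not rvest code, and the second and third href values are made up.

```python
def matches_word(value, word):
    """Emulate [attr~=word]: word must be a whole whitespace-separated token."""
    return word in value.split()


def matches_substring(value, needle):
    """Emulate [attr*=needle]: needle may appear anywhere in the value."""
    return needle in value


# The first href follows the question's example; the others are made up.
hrefs = ["/Mitglieder/Detail/1213412", "/Konto/Abmelden", "/Mitglieder/Liste"]

word_hits = [h for h in hrefs if matches_word(h, "Mitglieder")]
substring_hits = [h for h in hrefs if matches_substring(h, "Mitglieder/Detail")]
print(word_hits)
print(substring_hits)
```

Because an href contains no spaces, the whole path is a single word, which is why the `~=` form finds nothing.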
