Web scraping of nested links with R

I would like to scrape the links that are nested in the name of each property. The script below runs, but it retrieves only NAs instead of the URLs. Could you tell me what I am missing in this snippet?
Thank you
# Test
library(rvest)
library(dplyr)
link <- "https://www.sreality.cz/hledani/prodej/byty/brno?_escaped_fragment_="
page <- read_html(link)
price <- page %>%
  html_elements(".norm-price.ng-binding") %>%
  html_text()
name <- page %>%
  html_elements(".name.ng-binding") %>%
  html_text()
location <- page %>%
  html_elements(".locality.ng-binding") %>%
  html_text()
href <- page %>%
  html_nodes(".name.ng-binding") %>%
  html_attr("href") %>%
  paste("https://www.sreality.cz", ., sep = "")
flat <- data.frame(price, name, location, href, stringsAsFactors = FALSE)

Your CSS selector picked the element inside the anchor rather than the anchor itself, so there is no href attribute to read. This should work:
page %>%
  html_nodes("a.title") %>%
  html_attr("ng-href") %>%
  paste0("https://www.sreality.cz", .)
paste0(...) is a shorthand for paste(..., sep = '').
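For reference, the corrected selector drops straight into the original data-frame assembly. A minimal sketch, assuming each of the four selectors still returns exactly one value per listing on that page:
library(rvest)
library(dplyr)

link <- "https://www.sreality.cz/hledani/prodej/byty/brno?_escaped_fragment_="
page <- read_html(link)

price    <- page %>% html_elements(".norm-price.ng-binding") %>% html_text()
name     <- page %>% html_elements(".name.ng-binding") %>% html_text()
location <- page %>% html_elements(".locality.ng-binding") %>% html_text()

# read the href from the anchor itself (a.title), not from the span inside it
href <- page %>%
  html_elements("a.title") %>%
  html_attr("ng-href") %>%
  paste0("https://www.sreality.cz", .)

flat <- data.frame(price, name, location, href, stringsAsFactors = FALSE)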

Another way, using the JS path copied from the browser (note this also needs stringr for str_subset):
library(stringr)
page %>%
  html_nodes('#page-layout > div.content-cover > div.content-inner > div.transcluded-content.ng-scope > div > div > div.content > div > div:nth-child(4) > div > div:nth-child(n)') %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  str_subset('detail') %>%
  unique() %>%
  paste("https://www.sreality.cz", ., sep = "")
[1] "https://www.sreality.cz/detail/prodej/byt/4+1/brno-zabrdovice-tkalcovska/1857071452"
[2] "https://www.sreality.cz/detail/prodej/byt/3+kk/brno--/1336764508"
[3] "https://www.sreality.cz/detail/prodej/byt/2+kk/brno-stary-liskovec-u-posty/3639359836"
[4] "https://www.sreality.cz/detail/prodej/byt/2+1/brno-reckovice-druzstevni/3845994844"
[5] "https://www.sreality.cz/detail/prodej/byt/2+1/brno-styrice-jilova/1102981468"
[6] "https://www.sreality.cz/detail/prodej/byt/1+kk/brno-dolni-herspice-/1961502812"

How to select "href" of a web page of a specific "target"?

<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">
I just want to extract the "href" (for example from the HTML tag above) in order to concatenate it with the domain name of this website, "https://kurier.at", and scrape all articles on the home page.
I tried the following code:
library(rvest)
library(lubridate)
kurier_wbpg <- read_html("https://kurier.at")
# I just want the "a" tags which come with the attribute "_self"
articleLinks <- kurier_wbpg %>%
  html_elements("a") %>%
  html_elements(css = "tag[attribute=_self]") %>%
  html_attr("href") %>%
  paste("https://kurier.at", ., sep = "")
When I execute up to the html_attr("href") part of the above code block, the result I get is
character(0)
I think something is wrong with how I am selecting the HTML element tag.
Could I get some help with this?
You need to narrow down your CSS to the second teaser block image, which you can do by using the naming conventions of the classes. You can use url_absolute() to add the domain.
library(rvest)
library(magrittr)
url <- 'https://kurier.at/'
result <- read_html(url) %>%
  html_element('.teasers-2 .image') %>%
  html_attr('href') %>%
  url_absolute(url)
Same principle to get all teasers:
results <- read_html(url) %>%
  html_elements('.teaser .image') %>%
  html_attr('href') %>%
  url_absolute(url)
Not sure if you want the bottom block of 5 included. If so, you can again use classes:
articles <- read_html(url) %>%
  html_elements('.teaser-title') %>%
  html_attr('href') %>%
  url_absolute(url)
It works with xpath -
library(rvest)
kurier_wbpg <- read_html("https://kurier.at")
articleLinks <- kurier_wbpg %>%
  html_elements("a") %>%
  html_elements(xpath = '//*[@target="_self"]') %>%
  html_attr('href') %>%
  paste0("https://kurier.at", .)
articleLinks
# [1] "https://kurier.at/plus"
# [2] "https://kurier.at/coronavirus"
# [3] "https://kurier.at/politik"
# [4] "https://kurier.at/politik/inland"
# [5] "https://kurier.at/politik/ausland"
#...
#...
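If the next step is to pull the text of each article, a follow-up loop could look like the sketch below. The body selector ".article-main p" is an assumption about kurier.at's article markup, not something taken from the answers above; inspect an article page and adjust it.
# continuing from the articleLinks vector built above
get_article_text <- function(link) {
  read_html(link) %>%
    html_elements(".article-main p") %>%  # assumed selector -- verify on the site
    html_text() %>%
    paste(collapse = " ")
}

texts <- sapply(head(articleLinks, 5), get_article_text)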

Using rvest to webscrape multiple pages

I am trying to extract all speeches given by Melania Trump from 2016-2020 at the following link: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush. I am trying to use rvest to do so. Here is my code thus far:
# get main link
link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"
# main page
page <- read_html(link)
# extract speech titles
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
  html_attr("href") %>% paste("https://www.presidency.ucsb.edu/", ., sep = "")
title_links
# extract year of speech
year <- page %>% html_nodes(".date-display-single") %>% html_text()
# extract name of person giving speech
flotus <- page %>% html_nodes(".views-field-title-1.nowrap") %>% html_text()
get_text <- function(title_link){
  speech_page = read_html(title_links)
  speech_text = speech_page %>% html_nodes(".field-docs-content p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_page)
}
text = sapply(title_links, FUN = get_text)
I am having trouble with the following line of code:
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
  html_attr("href") %>% paste("https://www.presidency.ucsb.edu/", ., sep = "")
title_links
In particular, title_links yields a series of links like this: "https://www.presidency.ucsb.eduNA", rather than the individual web pages. Does anyone know what I am doing wrong here? Any help would be appreciated.
You are querying the wrong css node.
Try:
page %>% html_elements(css = "td.views-field-title a") %>% html_attr('href')
[1] "https://www.presidency.ucsb.edu/documents/remarks-mrs-laura-bush-the-national-press-club"
[2] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-un-commission-the-status-women-international-womens-day"
[3] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-colorado-early-childhood-cognitive-development-summit"
[4] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-10th-anniversary-the-holocaust-memorial-museum-and-opening-anne"
[5] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-preserve-america-initiative-portland-maine"
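Putting it together: a sketch of the full loop with the corrected selector, plus the two small fixes the question's get_text() needs (read title_link rather than title_links, and return the text rather than the page). The .field-docs-content selector is taken from the question and assumed to still match.
library(rvest)

link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"
page <- read_html(link)

title <- page %>% html_elements("td.views-field-title") %>% html_text()

# the href sits on the <a> inside the cell, not on the <td> itself
title_links <- page %>%
  html_elements("td.views-field-title a") %>%
  html_attr("href") %>%
  url_absolute(link)  # harmless if the hrefs are already absolute, as above

get_text <- function(title_link) {
  read_html(title_link) %>%
    html_elements(".field-docs-content p") %>%
    html_text() %>%
    paste(collapse = " ")
}

text <- sapply(title_links, FUN = get_text)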

rvest how to get last page number

Trying to get the last page number:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/de/wohnung-kaufen"
page <- read_html(url)
last_page_number <- page %>%
  html_nodes("#pageSelection > select > option") %>%
  html_text() %>%
  length()
The result is empty for some reason.
I can access the pages by this url, for example to get page #3:
https://www.immobilienscout24.de/Suche/de/wohnung-kaufen?pagenumber=3
You are heading in the right direction, but I think you have the wrong CSS selectors. Try:
library(rvest)
url <- 'https://www.immobilienscout24.de/Suche/de/wohnung-kaufen'
url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  html_text() %>%
  tail(1L)
#[1] "1650"
An alternative:
url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  magrittr::extract2(length(.)) %>%
  html_text()
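Once the last page number is known, the ?pagenumber= pattern from the question can be used to generate every results URL. A sketch (the listing selectors themselves are left out, since they are not part of this question):
library(rvest)

url <- "https://www.immobilienscout24.de/Suche/de/wohnung-kaufen"

last_page <- url %>%
  read_html() %>%
  html_nodes("div.select-container select option") %>%
  html_text() %>%
  tail(1L) %>%
  as.integer()

# build the paginated URLs using the ?pagenumber= pattern from the question
page_urls <- paste0(url, "?pagenumber=", seq_len(last_page))
head(page_urls, 3)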

R web-scraping on a multiple-level website with non dynamic URLs

I apologize in case I have missed a previous topic on this matter.
I want to scrape this website
http://www.fao.org/countryprofiles/en/
In particular, this page includes a lot of links to country info. Those links' structure is:
http://www.fao.org/countryprofiles/index/en/?iso3=KAZ
http://www.fao.org/countryprofiles/index/en/?iso3=AFG
and each of these pages includes a News section I am interested in.
Of course, I could scrape page-by-page but that would be a waste of time.
I tried the following but that is not working:
countries <- read_html("http://www.fao.org/countryprofiles/en/") %>%
  html_nodes(".linkcountry") %>%
  html_text()
country_news <- list()
sub <- html_session("http://www.fao.org/countryprofiles/en/")
for(i in countries[1:100]){
  page <- sub %>%
    follow_link(i) %>%
    read_html()
  country_news[[i]] <- page %>%
    html_nodes(".white-box") %>%
    html_text()
}
Any idea?
You can get all of the child pages from the top-level page:
stem = 'http://www.fao.org'
top_level = paste0(stem, '/countryprofiles/en/')
all_children = read_html(top_level) %>%
  # ? and = are required to skip /iso3list/en/
  html_nodes(xpath = '//a[contains(@href, "?iso3=")]/@href') %>%
  html_text() %>%
  paste0(stem, .)
head(all_children)
# [1] "http://www.fao.org/countryprofiles/index/en/?iso3=AFG"
# [2] "http://www.fao.org/countryprofiles/index/en/?iso3=ALB"
# [3] "http://www.fao.org/countryprofiles/index/en/?iso3=DZA"
# [4] "http://www.fao.org/countryprofiles/index/en/?iso3=AND"
# [5] "http://www.fao.org/countryprofiles/index/en/?iso3=AGO"
# [6] "http://www.fao.org/countryprofiles/index/en/?iso3=ATG"
If you are not comfortable with xpath, the CSS version would be:
read_html(top_level) %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  grep("?iso3=", ., value = TRUE, fixed = TRUE) %>%
  paste0(stem, .)
Now you can loop over those pages and extract what you want, as sketched below.
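A sketch of that loop, reusing the .white-box selector from the question (with a short pause between requests so the server is not hammered):
country_news <- lapply(all_children, function(u) {
  Sys.sleep(1)                    # be polite to the server
  read_html(u) %>%
    html_nodes(".white-box") %>%  # selector taken from the question
    html_text()
})
names(country_news) <- all_children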

R: rvest: How to use ifelse for nodes?

I'm scraping this page:
https://www.linio.com.pe/c/tv-y-video/televisores
And I want to extract the current price of the TVs. The problem is that some prices are inside a <div> and others (fewer of them) inside a <span> tag.
I'm wondering if it's possible to use an 'ifelse' construct to get all the current prices for the TVs.
#Reads Linio's HTML
linio <- read_html("https://www.linio.com.pe/c/tv-y-video/televisores", encoding = "ISO-8859-1")
#Extracts prices inside the div tag
linio %>% html_nodes("div.price-section div.price-secondary") %>% html_text()
#Extracts prices inside the span tag
linio %>% html_nodes("div.price-section span.price-secondary") %>% html_text()
I was trying this to combine the prices from the div and the span tags:
linio %>% ifelse(length(html_nodes("div.price-section div.price-secondary") %>% html_text())==0, html_nodes("div.price-section span.price-secondary") %>% html_text(), html_nodes("div.price-section div.price-secondary")) %>% html_text()
Without success... why can't you be consistent, Linio front-end developers...!
There are multiple ways to accomplish that:
Drop the div/span altogether using:
linio %>% html_nodes("div.price-section .price-secondary") %>% html_text()
This selects all elements with class price-secondary inside div.price-section.
More specific
To select only the div and span tags inside div.price-section, you can use:
linio %>%
  html_nodes("div.price-section div.price-secondary, div.price-section span.price-secondary") %>%
  html_text()
For a full CSS selector reference see
https://www.w3schools.com/cssref/css_selectors.asp
Minimal CSS selector
To find a minimal CSS selector, have a look at http://selectorgadget.com/
In your case this would be:
linio %>% html_nodes(".price-secondary") %>% html_text
This selects all elements with class price-secondary
Test that all return the same result
res1 <- linio %>% html_nodes("div.price-section .price-secondary") %>% html_text()
res2 <- linio %>%
  html_nodes("div.price-section div.price-secondary, div.price-section span.price-secondary") %>%
  html_text()
res3 <- linio %>% html_nodes(".price-secondary") %>% html_text()
all(res1 == res2) # TRUE
all(res2 == res3) # TRUE
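If the scraped strings then need to become numbers, a small cleanup step can follow. A sketch, assuming the prices come back as text in the usual "S/ 1,299" style with no decimal cents:
precios <- linio %>% html_nodes(".price-secondary") %>% html_text()

# keep only the digits (drops the currency symbol and thousands separators);
# adjust the pattern if prices can contain decimal cents
precios_num <- as.numeric(gsub("[^0-9]", "", precios))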
