How to select "href" of a web page of a specific "target"?

<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">
I just want to extract the "href" (for example, from the HTML tag above), prepend the domain of the website, "https://kurier.at", and then scrape all articles on the home page.
I tried the following code
library(rvest)
library(lubridate)
kurier_wbpg <- read_html("https://kurier.at")
# I just want the "a" tags which come with the attribute "_self"
articleLinks <- kurier_wbpg %>% html_elements("a")%>%
html_elements(css = "tag[attribute=_self]") %>%
html_attr("href")%>%
paste("https://kurier.at",.,sep = "")
When I execute up to the html_attr("href") part of the above code block, the result I get is
character(0)
I think something is wrong with how I select the HTML element.
Could someone help me with this?

You need to narrow down your css to the second teaser block image which you can do by using the naming conventions of the classes. You can use url_absolute() to add the domain.
library(rvest)
library(magrittr)
url <- 'https://kurier.at/'
result <- read_html(url) %>%
html_element('.teasers-2 .image') %>%
html_attr('href') %>%
url_absolute(url)
Same principle to get all teasers:
results <- read_html(url) %>%
html_elements('.teaser .image') %>%
html_attr('href') %>%
url_absolute(url)
Not sure if you want the bottom block of 5 included. If so, you can again use classes
articles <- read_html(url) %>%
html_elements('.teaser-title') %>%
html_attr('href') %>%
url_absolute(url)

It works with xpath -
library(rvest)
kurier_wbpg <- read_html("https://kurier.at")
articleLinks <- kurier_wbpg %>%
html_elements("a") %>%
html_elements(xpath = '//*[@target="_self"]') %>%
html_attr('href') %>%
paste0("https://kurier.at",.)
articleLinks
# [1] "https://kurier.at/plus"
# [2] "https://kurier.at/coronavirus"
# [3] "https://kurier.at/politik"
# [4] "https://kurier.at/politik/inland"
# [5] "https://kurier.at/politik/ausland"
#...
#...
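
The same filter can also be written as a CSS attribute selector. In the question, "tag[attribute=_self]" is a template in which "tag" and "attribute" are placeholders; used literally it looks for an element named tag, which does not exist, hence character(0). Chaining html_elements("a") with a second html_elements() call also searches inside each anchor rather than filtering the anchors themselves. A minimal offline sketch, assuming toy markup (the paths and link texts below are made up and only stand in for the real page):

```r
library(rvest)

# Toy markup standing in for the real home page; paths are hypothetical
page <- minimal_html('
  <a class="image" target="_self" href="/politik/inland/a/1">A</a>
  <a class="image" target="_blank" href="https://example.org/ad">Ad</a>
  <a class="title" target="_self" href="/sport/b/2">B</a>
')

# CSS attribute selector: element name, then [attribute="value"]
article_links <- page %>%
  html_elements('a[target="_self"]') %>%
  html_attr("href") %>%
  xml2::url_absolute("https://kurier.at/")

article_links
```

On the live site, read_html("https://kurier.at") would replace the minimal_html() call; whether every target="_self" anchor there is an article link is not guaranteed.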

Related

Web scraping of nested links with R

I would like to scrape the links that are nested in the name of each property. The script below runs, but instead of the URLs it returns only NAs. Could you tell me what I am missing in this snippet?
Thank you
# Test
library(rvest)
library(dplyr)
link <- "https://www.sreality.cz/hledani/prodej/byty/brno?_escaped_fragment_="
page <- read_html(link)
price <- page %>%
html_elements(".norm-price.ng-binding") %>%
html_text()
name <- page %>%
html_elements(".name.ng-binding") %>%
html_text()
location <- page %>%
html_elements(".locality.ng-binding") %>%
html_text()
href <- page %>%
html_nodes(".name.ng-binding") %>%
html_attr("href") %>% paste("https://www.sreality.cz", ., sep="")
flat <- data.frame(price, name, location, href, stringsAsFactors = FALSE)
Your CSS selector picked the anchors' inline html instead of the anchor. This should work:
page %>%
html_nodes("a.title") %>%
html_attr("ng-href") %>%
paste0("https://www.sreality.cz", .)
paste0(...) being a shorthand for paste(..., sep = '')
Another way, using the JS path (str_subset() comes from stringr):
library(stringr)
page %>%
html_nodes('#page-layout > div.content-cover > div.content-inner > div.transcluded-content.ng-scope > div > div > div.content > div > div:nth-child(4) > div > div:nth-child(n)') %>%
html_nodes('a') %>% html_attr('href') %>% str_subset('detail') %>% unique() %>% paste("https://www.sreality.cz", ., sep="")
[1] "https://www.sreality.cz/detail/prodej/byt/4+1/brno-zabrdovice-tkalcovska/1857071452"
[2] "https://www.sreality.cz/detail/prodej/byt/3+kk/brno--/1336764508"
[3] "https://www.sreality.cz/detail/prodej/byt/2+kk/brno-stary-liskovec-u-posty/3639359836"
[4] "https://www.sreality.cz/detail/prodej/byt/2+1/brno-reckovice-druzstevni/3845994844"
[5] "https://www.sreality.cz/detail/prodej/byt/2+1/brno-styrice-jilova/1102981468"
[6] "https://www.sreality.cz/detail/prodej/byt/1+kk/brno-dolni-herspice-/1961502812"
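
The NA result in the question can be reproduced offline. This is a small sketch on toy markup (the class names mirror the question, the hrefs are invented): html_attr() reads the attribute from the element that was selected, so selecting the span inside the anchor yields NA, while selecting the anchor yields the link.

```r
library(rvest)

# Toy markup: an anchor wrapping a span, as in the listing; hrefs are hypothetical
page <- minimal_html('
  <a class="title" href="/detail/prodej/byt/1"><span class="name">Flat 1</span></a>
  <a class="title" href="/detail/prodej/byt/2"><span class="name">Flat 2</span></a>
')

# The span has no href attribute, so html_attr() returns NA for each match
from_span <- page %>% html_elements(".name") %>% html_attr("href")

# Selecting the anchor itself yields the links
from_anchor <- page %>% html_elements("a.title") %>% html_attr("href")

from_span
from_anchor
```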

Scraping links from a web page at a specific position

I am trying to collect a number of links from a website.
For example I have the following and my idea was to collect the link where it says leer más which is where I get the xpath from.
url = "https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181"
x <- GET(url, add_headers('user-agent' = desktop_agents[sample(1:10, 1)]))
x %>%
read_html() %>%
html_nodes(xpath = '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/section/article[1]/div/a/p/span[2]')
This gives me the following but not the link:
{xml_nodeset (1)}
[1] <span class="re-CardDescription-link">Leer más</span>
Additionally, I thought about collecting all links:
x %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href")
This gives me a lot of links but not the links to the individual webpages I want.
I would like to have a list of links such as:
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-terraza-trastero-ascensor-amueblado-internet/162262978/d
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-trastero-ascensor-amueblado/159750574/d
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-jardin-zona-comunitaria-ascensor-patio-amueblado-parking-television-internet-piscina/162259162/d
Those links are stored inside a JavaScript object within a script tag. You can use a regex to pull out the string that defines the object, unescape it so that jsonlite can parse it, and then map over the parsed JSON to extract just the URLs of interest:
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
link <- 'https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181'
p <- read_html(link) %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)".*?;')[,2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
links <- purrr::map(data$initialSearch$result$realEstates, ~ .x$detail$`es-ES` %>% url_absolute(link))
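
The regex-then-parse idea can be tried on a simplified page first. The sketch below is an assumption-laden toy: the variable name __DATA__, the JSON shape, and the example.com base URL are all invented, and the JSON is embedded directly rather than wrapped in an escaped JSON.parse("...") string as on the real site, so no unescaping step is needed here.

```r
library(rvest)
library(jsonlite)

# Toy page: the script content and the name __DATA__ are hypothetical
page <- minimal_html(
  '<script>window.__DATA__ = {"items":[{"url":"/a/1"},{"url":"/b/2"}]};</script>'
)

# Read the raw JavaScript source from the script tag
script_text <- page %>% html_element("script") %>% html_text()

# Keep only the object literal between the assignment and the closing semicolon
json_string <- sub("^.*window\\.__DATA__ = (\\{.*\\});.*$", "\\1", script_text)

# Parse the JSON and resolve the relative URLs against a (made-up) base
data <- fromJSON(json_string)
links <- xml2::url_absolute(data$items$url, "https://example.com/")

links
```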

html_attr "href" does not extract link

I want to download the file that is in the tab "Dossier" with the text "Modul 4" here:
https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier
First I want to get the link.
My code for that is the following:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(".gba-download__text") %>%
.[[4]] %>%
html_attr("href")
(I know the piece .[[4]] is not really good, this is not my full code.)
This leads to NA and I don't understand why.
Similar questions couldn't help here.
Allan already left a concise answer. But let me leave another way. If you check the page source, you can see that the target is in .gba-download-list. (There are actually two of them.) So get that part and walk down to href part. Once you get urls, you can use grep() to identify a link containing Modul4. I used unique() in the end to remove a dupe.
read_html("https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier") %>%
html_nodes(".gba-download-list") %>%
html_nodes("a") %>%
html_attr("href") %>%
grep(pattern = "Modul4", value = TRUE) %>%
unique()
[1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"
It's easier to get to a specific node if you use xpath :
library(rvest)
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(xpath = "//span[contains(text(),'Modul 4')]/..") %>%
.[[1]] %>%
html_attr("href")
#> [1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"
I have another solution now and want to share it:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes("a.download-helper") %>%
html_attr("href") %>%
.[str_detect(., "Modul4")] %>%
unique
It is faster to use a CSS attribute selector with the contains operator (*=) to target the href by substring. In addition, only a single matching node needs to be returned.
library(rvest)
url <- "https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier"
link <- read_html(url) %>%
html_node("[href*='Modul4']") %>%
html_attr("href") %>% url_absolute(url)
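
To see how the substring selector behaves, here is a small offline sketch. The file names are invented; only the "Modul4" substring and the g-ba.de domain come from the thread. html_element() (singular) returns the first match, which is why no unique() call is needed.

```r
library(rvest)

# Toy download list; the file paths are hypothetical
page <- minimal_html('
  <a href="/downloads/x/Modul1_Example.pdf">Modul 1</a>
  <a href="/downloads/x/Modul4A_Example.pdf">Modul 4</a>
  <a href="/downloads/x/Modul4A_Example.pdf">Modul 4 (duplicate)</a>
')

# [href*="..."] matches any element whose href contains the substring
link <- page %>%
  html_element('[href*="Modul4"]') %>%
  html_attr("href") %>%
  xml2::url_absolute("https://www.g-ba.de/")

link
```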

rvest how to get last page number

Trying to get the last page number:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/de/wohnung-kaufen"
page <- read_html(url)
last_page_number <- page %>%
html_nodes("#pageSelection > select > option") %>%
html_text() %>%
length()
The result is empty for some reason.
I can access the pages by this url, for example to get page #3:
https://www.immobilienscout24.de/Suche/de/wohnung-kaufen?pagenumber=3
You are heading in the right direction, but I think your CSS selector is wrong. Try:
library(rvest)
url <- 'https://www.immobilienscout24.de/Suche/de/wohnung-kaufen'
url %>%
read_html() %>%
html_nodes('div.select-container select option') %>%
html_text() %>%
tail(1L)
#[1] "1650"
An alternative :
url %>%
read_html() %>%
html_nodes('div.select-container select option') %>%
magrittr::extract2(length(.)) %>%
html_text()
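
Once the last page number is known, it can be turned into the list of page URLs. A minimal sketch on toy markup, assuming the select box has the structure targeted above and that the pagenumber query parameter from the question is the right one to use; the three options are made up:

```r
library(rvest)

# Toy pagination widget standing in for the real page
page <- minimal_html('
  <div class="select-container">
    <select>
      <option value="1">1</option>
      <option value="2">2</option>
      <option value="3">3</option>
    </select>
  </div>
')

# Text of the last option, converted to an integer
last_page <- page %>%
  html_elements("div.select-container select option") %>%
  html_text() %>%
  tail(1) %>%
  as.integer()

# One URL per result page
base_url <- "https://www.immobilienscout24.de/Suche/de/wohnung-kaufen"
page_urls <- paste0(base_url, "?pagenumber=", seq_len(last_page))

page_urls
```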

R web-scraping on a multiple-level website with non dynamic URLs

I apologize in case I have not found a previous topic on this matter.
I want to scrape this website
http://www.fao.org/countryprofiles/en/
In particular, this page includes a lot of links to country info pages. The structure of those links is:
http://www.fao.org/countryprofiles/index/en/?iso3=KAZ
http://www.fao.org/countryprofiles/index/en/?iso3=AFG
and each of these pages includes a News section I am interested in.
Of course, I could scrape page-by-page but that would be a waste of time.
I tried the following but that is not working:
countries <- read_html("http://www.fao.org/countryprofiles/en/") %>%
html_nodes(".linkcountry") %>%
html_text()
country_news <- list()
sub <- html_session("http://www.fao.org/countryprofiles/en/")
for(i in countries[1:100]){
page <- sub %>%
follow_link(i) %>%
read_html()
country_news[[i]] <- page %>%
html_nodes(".white-box") %>%
html_text()
}
Any idea?
You can get all of the child pages from the top-level page:
stem = 'http://www.fao.org'
top_level = paste0(stem, '/countryprofiles/en/')
all_children = read_html(top_level) %>%
# ? and = are required to skip /iso3list/en/
html_nodes(xpath = '//a[contains(@href, "?iso3=")]/@href') %>%
html_text %>% paste0(stem, .)
head(all_children)
# [1] "http://www.fao.org/countryprofiles/index/en/?iso3=AFG"
# [2] "http://www.fao.org/countryprofiles/index/en/?iso3=ALB"
# [3] "http://www.fao.org/countryprofiles/index/en/?iso3=DZA"
# [4] "http://www.fao.org/countryprofiles/index/en/?iso3=AND"
# [5] "http://www.fao.org/countryprofiles/index/en/?iso3=AGO"
# [6] "http://www.fao.org/countryprofiles/index/en/?iso3=ATG"
If you are not comfortable with xpath, the CSS version would be:
all_children = read_html(top_level) %>%
html_nodes('a') %>% html_attr('href') %>%
grep("?iso3=", ., value = TRUE, fixed = TRUE) %>% paste0(stem, .)
Now you can loop over those pages & extract what you want
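
One way to structure that loop is to keep the extraction in a small function and apply it to each parsed page. The sketch below runs offline on toy pages; the helper name extract_news and the page contents are made up, and the .white-box class is taken from the question without checking that it still matches the live site.

```r
library(rvest)

# Hypothetical helper: text of every .white-box block in one parsed page
extract_news <- function(doc) {
  doc %>% html_elements(".white-box") %>% html_text(trim = TRUE)
}

# Toy pages standing in for read_html(<country URL>); content is invented
pages <- list(
  AFG = minimal_html('<div class="white-box">News A</div>'),
  ALB = minimal_html('<div class="white-box">News B</div><div class="white-box">News C</div>')
)

country_news <- lapply(pages, extract_news)

# Against the live site this would become something like:
# country_news <- lapply(all_children, function(u) extract_news(read_html(u)))

country_news
```

Keeping the parsing step separate from the download step makes it easier to test the selector on one page before requesting all of them.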

Resources