R: rvest: How to use ifelse for nodes?

I'm scraping this page:
https://www.linio.com.pe/c/tv-y-video/televisores
I want to extract the current price of each TV. The problem is that some prices are inside a <div> and others (fewer) inside a <span> tag.
I'm wondering if it's possible to use an 'ifelse' construct to get all the current prices for the TVs.
library(rvest)
# Reads Linio's HTML
linio <- read_html("https://www.linio.com.pe/c/tv-y-video/televisores", encoding = "ISO-8859-1")
# Extracts prices inside the div tag
linio %>% html_nodes("div.price-section div.price-secondary") %>% html_text()
# Extracts prices inside the span tag
linio %>% html_nodes("div.price-section span.price-secondary") %>% html_text()
I was trying this to combine the prices from the div and the span tags:
linio %>% ifelse(length(html_nodes("div.price-section div.price-secondary") %>% html_text())==0, html_nodes("div.price-section span.price-secondary") %>% html_text(), html_nodes("div.price-section div.price-secondary")) %>% html_text()
Without success... why can't you be consistent, Linio front-end developers...!

There are multiple ways to accomplish that:
Drop the div/span altogether using:
linio %>% html_nodes("div.price-section .price-secondary") %>% html_text()
This selects all elements with class price-secondary inside div.price-section.
More specific
To select only the div and span tags inside div.price-section you can use:
linio %>%
  html_nodes("div.price-section div.price-secondary, div.price-section span.price-secondary") %>%
  html_text()
For a full CSS selector reference see
https://www.w3schools.com/cssref/css_selectors.asp
Minimal CSS selector
To find a minimal CSS selector, have a look at http://selectorgadget.com/
In your case this would be:
linio %>% html_nodes(".price-secondary") %>% html_text
This selects all elements with class price-secondary
Test that all return the same result
res1 <- linio %>% html_nodes("div.price-section .price-secondary") %>% html_text()
res2 <- linio %>%
  html_nodes("div.price-section div.price-secondary, div.price-section span.price-secondary") %>%
  html_text()
res3 <- linio %>% html_nodes(".price-secondary") %>% html_text()
all(res1 == res2) # TRUE
all(res2 == res3) # TRUE
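For completeness, the if/else idea from the question can also be expressed per price block instead of on the whole page: iterate over each div.price-section node and fall back to the span when the div is missing. This is only a sketch; it relies on html_node() returning a missing node (so html_text() gives NA) when nothing matches:
prices <- linio %>%
  html_nodes("div.price-section") %>%
  lapply(function(block) {
    # Try the <div> price first; html_text() yields NA if the node is missing
    div_price <- block %>% html_node("div.price-secondary") %>% html_text()
    if (!is.na(div_price)) div_price
    else block %>% html_node("span.price-secondary") %>% html_text()
  }) %>%
  unlist()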

Related

Web scraping of nested links with R

I would like to web scrape the links that are nested in the property names. The script below works, however it retrieves only NAs instead of the URLs. Could you tell me what I am missing in the script snippet?
Thank you
# Test
library(rvest)
library(dplyr)
link <- "https://www.sreality.cz/hledani/prodej/byty/brno?_escaped_fragment_="
page <- read_html(link)
price <- page %>%
  html_elements(".norm-price.ng-binding") %>%
  html_text()
name <- page %>%
  html_elements(".name.ng-binding") %>%
  html_text()
location <- page %>%
  html_elements(".locality.ng-binding") %>%
  html_text()
href <- page %>%
  html_nodes(".name.ng-binding") %>%
  html_attr("href") %>%
  paste("https://www.sreality.cz", ., sep = "")
flat <- data.frame(price, name, location, href, stringsAsFactors = FALSE)
Your CSS selector picked up the anchors' inner HTML instead of the anchors themselves. This should work:
page %>%
html_nodes("a.title") %>%
html_attr("ng-href") %>%
paste0("https://www.sreality.cz", .)
paste0(...) is shorthand for paste(..., sep = '').
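If you then want to plug the corrected links back into the data frame from the question, a sketch (assuming price, name and location have one entry per listing, just like href) would be:
href <- page %>%
  html_nodes("a.title") %>%
  html_attr("ng-href") %>%
  paste0("https://www.sreality.cz", .)
flat <- data.frame(price, name, location, href, stringsAsFactors = FALSE)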
Another way, using the JS path:
library(stringr)
page %>%
  html_nodes('#page-layout > div.content-cover > div.content-inner > div.transcluded-content.ng-scope > div > div > div.content > div > div:nth-child(4) > div > div:nth-child(n)') %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  str_subset('detail') %>%
  unique() %>%
  paste("https://www.sreality.cz", ., sep = "")
[1] "https://www.sreality.cz/detail/prodej/byt/4+1/brno-zabrdovice-tkalcovska/1857071452"
[2] "https://www.sreality.cz/detail/prodej/byt/3+kk/brno--/1336764508"
[3] "https://www.sreality.cz/detail/prodej/byt/2+kk/brno-stary-liskovec-u-posty/3639359836"
[4] "https://www.sreality.cz/detail/prodej/byt/2+1/brno-reckovice-druzstevni/3845994844"
[5] "https://www.sreality.cz/detail/prodej/byt/2+1/brno-styrice-jilova/1102981468"
[6] "https://www.sreality.cz/detail/prodej/byt/1+kk/brno-dolni-herspice-/1961502812"

How to select "href" of a web page of a specific "target"?

<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">
I just want to extract the "href" (for example from the HTML tag above) in order to concatenate it with the domain name of this website, "https://kurier.at", and web scrape all the articles on the home page.
I tried the following code
library(rvest)
library(lubridate)
kurier_wbpg <- read_html("https://kurier.at")
# I just want the "a" tags that have the attribute target="_self"
articleLinks <- kurier_wbpg %>% html_elements("a")%>%
html_elements(css = "tag[attribute=_self]") %>%
html_attr("href")%>%
paste("https://kurier.at",.,sep = "")
When I execute up to the html_attr("href") part of the above code block, the result I get is
character(0)
I think something is wrong with how I select the HTML element. Could I get some help with this?
You need to narrow down your CSS to the image of the second teaser block, which you can do by using the naming conventions of the classes. You can use url_absolute() to add the domain.
library(rvest)
library(magrittr)
url <- 'https://kurier.at/'
result <- read_html(url) %>%
html_element('.teasers-2 .image') %>%
html_attr('href') %>%
url_absolute(url)
Same principle to get all teasers:
results <- read_html(url) %>%
html_elements('.teaser .image') %>%
html_attr('href') %>%
url_absolute(url)
Not sure if you want the bottom block of 5 included. If so, you can again use classes
articles <- read_html(url) %>%
html_elements('.teaser-title') %>%
html_attr('href') %>%
url_absolute(url)
It works with XPath:
library(rvest)
kurier_wbpg <- read_html("https://kurier.at")
articleLinks <- kurier_wbpg %>%
  html_elements("a") %>%
  html_elements(xpath = '//*[@target="_self"]') %>%
  html_attr('href') %>%
  paste0("https://kurier.at", .)
articleLinks
articleLinks
# [1] "https://kurier.at/plus"
# [2] "https://kurier.at/coronavirus"
# [3] "https://kurier.at/politik"
# [4] "https://kurier.at/politik/inland"
# [5] "https://kurier.at/politik/ausland"
#...
#...
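Note that this also returns section links such as https://kurier.at/plus and https://kurier.at/politik. If you only want article pages, one option (an assumption based on the example link in the question, whose URL ends in a numeric article id) is to filter on that pattern:
# Keep only links ending in a numeric article id
articleLinks[grepl("/[0-9]+$", articleLinks)]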

extract a specific table from wikipedia in R

I want to extract the 20th table from a Wikipedia page https://en.wikipedia.org/wiki/...
I currently use this code, but it only extracts the first table on the page.
the_url <- "https://en.wikipedia.org/wiki/..."
tb <- the_url %>% read_html() %>%
html_node("table") %>%
html_table(fill = TRUE)
What should I do to get the specific one? Thank you!!
Instead of indexing by position, which could change, you can anchor on the relationship to the element with id Prize_money. Return just a single node for efficiency, and avoid long XPaths as they can be fragile.
library(rvest)
table <- read_html('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money') %>%
  html_node(xpath = "//*[@id='Prize_money']/parent::h4/following-sibling::table[1]") %>%
  html_table(fill = T)
Since you have a specific table you want to scrape, you can identify it in the html_nodes() call by using the XPath of the webpage element:
library(dplyr)
library(rvest)
the_url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup"
the_url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>%
html_table(fill=TRUE)
Try this code.
library(rvest)
webpage <- read_html("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup")
tbls <- html_nodes(webpage, "table")
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[3:4] %>%
html_table(fill = TRUE)
str(tbls_ls)
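If you do want to index by position as in the original question (assuming the 20th table does not move around), you can subset the node set directly instead of taking .[3:4]:
tb20 <- webpage %>%
  html_nodes("table") %>%
  .[[20]] %>%
  html_table(fill = TRUE)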

html_attr "href" does not extract link

I want to download the file that is in the tab "Dossier" with the text "Modul 4" here:
https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier
First I want to get the link.
My code for that is the following:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(".gba-download__text") %>%
.[[4]] %>%
html_attr("href")
(I know the piece .[[4]] is not really good, this is not my full code.)
This leads to NA and I don't understand why.
Similar questions couldn't help here.
Allan already left a concise answer, but let me offer another way. If you check the page source, you can see that the target is in .gba-download-list. (There are actually two of them.) So get that part and walk down to the href attributes. Once you have the URLs, you can use grep() to identify the link containing Modul4. I used unique() at the end to remove a duplicate.
read_html("https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier") %>%
html_nodes(".gba-download-list") %>%
html_nodes("a") %>%
html_attr("href") %>%
grep(pattern = "Modul4", value = TRUE) %>%
unique()
[1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"
It's easier to get to a specific node if you use XPath:
library(rvest)
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(xpath = "//span[contains(text(),'Modul 4')]/..") %>%
.[[1]] %>%
html_attr("href")
#> [1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"
I have another solution now and want to share it:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes("a.download-helper") %>%
html_attr("href") %>%
.[str_detect(., "Modul4")] %>%
unique
It is faster to use a CSS selector with the contains operator (*=) to target the href by substring. In addition, only a single node match needs to be returned:
library(rvest)
url <- "https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier"
link <- read_html(url) %>%
html_node("[href*='Modul4']") %>%
html_attr("href") %>% url_absolute(url)

rvest how to get last page number

Trying to get the last page number:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/de/wohnung-kaufen"
page <- read_html(url)
last_page_number <- page %>%
html_nodes("#pageSelection > select > option") %>%
html_text() %>%
length()
The result is empty for some reason.
I can access the pages by URL; for example, to get page #3:
https://www.immobilienscout24.de/Suche/de/wohnung-kaufen?pagenumber=3
You are headed in the right direction, but I think you have the wrong CSS selectors. Try:
library(rvest)
url <- 'https://www.immobilienscout24.de/Suche/de/wohnung-kaufen'
url %>%
read_html() %>%
html_nodes('div.select-container select option') %>%
html_text() %>%
tail(1L)
#[1] "1650"
An alternative:
url %>%
read_html() %>%
html_nodes('div.select-container select option') %>%
magrittr::extract2(length(.)) %>%
html_text()
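Either way the result is a character string, so a sketch of turning it into a number and building the paged URLs mentioned in the question could look like this:
last_page <- url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  html_text() %>%
  tail(1L) %>%
  as.integer()
# URLs of the form ...?pagenumber=N, as in the question
page_urls <- paste0(url, "?pagenumber=", seq_len(last_page))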
