rvest how to get last page number - r

Trying to get the last page number:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/de/wohnung-kaufen"
page <- read_html(url)
last_page_number <- page %>%
  html_nodes("#pageSelection > select > option") %>%
  html_text() %>%
  length()
The result is empty for some reason.
I can access individual pages via the URL; for example, to get page #3:
https://www.immobilienscout24.de/Suche/de/wohnung-kaufen?pagenumber=3

You are headed in the right direction, but I think you have the wrong CSS selectors. Try:
library(rvest)
url <- 'https://www.immobilienscout24.de/Suche/de/wohnung-kaufen'
url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  html_text() %>%
  tail(1L)
#[1] "1650"
An alternative :
url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  magrittr::extract2(length(.)) %>%
  html_text()

not able to scrape review score from airbnb using Rvest

library(rvest)
library(dplyr)
page = "https://www.airbnb.ae/rooms/585742764031233504?preview_for_ml=true&source_impression_id=p3_1660929108_esIxWS5HCyk890Im"
### for average Review score
page %>% read_html() %>% html_nodes("._17p6nbba") %>% html_text2()
### for review count
page %>% read_html() %>% html_nodes("span._s65ijh7") %>% html_text2()
Both are returning "character(0)"
You can get this from the JSON embedded in the page, no Selenium needed:
library(rvest)
library(jsonlite)
json1 <- read_html("https://www.airbnb.ae/rooms/585742764031233504?preview_for_ml=true&source_impression_id=p3_1660929108_esIxWS5HCyk890Im") %>%
  html_element(xpath = "/html/body//script[@id='data-deferred-state']") %>%
  html_text() %>%
  fromJSON()
json1$niobeMinimalClientData[[1]][[2]]$data$presentation$stayProductDetailPage$sections$metadata$loggingContext$eventDataLogging
The trick I learned for these cases is to download the raw HTML with read_html(url), write it to disk with xml2::write_html(), open the file in Chrome, inspect, search (Ctrl+F) for a known value such as 4.50, find the element containing it, and then parse the JSON.
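The script-tag pattern can be sketched offline on a tiny stand-in document; the JSON structure below is invented for illustration, and only the data-deferred-state id mirrors the real page:

```r
library(rvest)
library(jsonlite)

# A minimal stand-in for the real page: a <script> tag carrying JSON,
# like Airbnb's "data-deferred-state" blob (contents fabricated).
doc <- minimal_html('
  <script id="data-deferred-state" type="application/json">
    {"reviews": {"rating": 4.5, "count": 12}}
  </script>
')

stats <- doc %>%
  html_element(xpath = "//script[@id='data-deferred-state']") %>%
  html_text() %>%
  fromJSON()

stats$reviews$rating
#> [1] 4.5
```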

Different number of rows for headlines and urls when web-scraping Google News using rvest

Testing out different keywords on Google News to web-scrape headlines and urls, but somehow some keywords do not have matching number of headlines and urls.
library(rvest)
library(stringr)
library(magrittr)
link = "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
data.frame(headline, url)
Results:
Error in data.frame(headline, url) :
arguments imply differing number of rows: 82, 85
But with other keywords, this seems to work fine.
link = "https://news.google.com/search?q=international%20petroleum&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
data.frame(headline, url)
Does anyone know the cause of this, and how to fix it? Thanks
With those selectors you are extracting headlines from different nodes than the hrefs, and there doesn't seem to be a fixed 1:1 relation between the two. At the time of writing, your first search returns results with some nested headlines, which is probably why your headline and url counts do not match.
Get the url and text from the same node and you should be covered:
url <- "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline_links <- read_html(url) %>% html_nodes('a.DY5T1d')
data.frame(
  headline = headline_links %>% html_text(),
  url = headline_links %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
)

character (0) after scraping webpage in read_html

I'm trying to scrape "1,335,000" from the screenshot below (the number is at the bottom of the screenshot). I wrote the following code in R.
t2<-read_html("https://fortune.com/company/amazon-com/fortune500/")
employee_number <- t2 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//*[contains(@class, 'info__value--2AHH7')]") %>%
  rvest::html_text()
However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?
As Dave2e pointed out, the page renders its content with JavaScript, so plain rvest can't retrieve it.
url = "https://fortune.com/company/amazon-com/fortune500/"
#launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>%
  html_text()
[1] "1,335,000"
Data is loaded dynamically from a script tag, so there is no need for the expense of a browser. You could either extract the entire JavaScript object within the script tag, pass it to jsonlite to handle as JSON, and extract what you want, or, if you are just after the employee count, regex it out of the response text.
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)
page <- read_html('https://fortune.com/company/amazon-com/fortune500/')
data <- page %>%
  html_element('#preload') %>%
  html_text() %>%
  stringr::str_match("PRELOADED_STATE__ = (.*);") %>%
  .[, 2] %>%
  jsonlite::parse_json()
print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)
#shorter version
print(page %>% html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% .[, 2] %>% as.integer() %>% format(big.mark = ","))
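The regex route can be checked offline on a stand-in for the page source. The JSON snippet below is fabricated; only its "employees" key mirrors the real page:

```r
library(stringr)
library(magrittr)

# Fabricated fragment standing in for the page's preloaded-state JSON.
src <- '{"name":"Amazon","employees":"1335000","rank":"2"}'

# Capture the digits after "employees", convert, and add thousands separators.
employees <- str_match(src, '"employees":"(\\d+)?"')[, 2] %>%
  as.integer() %>%
  format(big.mark = ",")

employees
#> [1] "1,335,000"
```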

Using rvest to webscrape multiple pages

I am trying to extract all speeches given by Melania Trump from 2016-2020 at the following link: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush. I am trying to use rvest to do so. Here is my code thus far:
# get main link
link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"
# main page
page <- read_html(link)
# extract speech titles
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
# extract year of speech
year <- page %>% html_nodes(".date-display-single") %>% html_text()
# extract name of person giving speech
flotus <- page %>% html_nodes(".views-field-title-1.nowrap") %>% html_text()
get_text <- function(title_link){
  speech_page = read_html(title_link)
  speech_text = speech_page %>% html_nodes(".field-docs-content p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_text)
}
text = sapply(title_links, FUN = get_text)
I am having trouble with the following lines of code:
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
In particular, title_links yields a series of links like this: "https://www.presidency.ucsb.eduNA", rather than the individual web pages. Does anyone know what I am doing wrong here? Any help would be appreciated.
You are querying the wrong css node.
Try:
page %>% html_elements(css = "td.views-field-title a") %>% html_attr('href')
[1] "https://www.presidency.ucsb.edu/documents/remarks-mrs-laura-bush-the-national-press-club"
[2] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-un-commission-the-status-women-international-womens-day"
[3] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-colorado-early-childhood-cognitive-development-summit"
[4] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-10th-anniversary-the-holocaust-memorial-museum-and-opening-anne"
[5] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-preserve-america-initiative-portland-maine"
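The selector fix can be verified offline on a minimal fragment; the table markup below is a guess at the site's structure, for illustration only:

```r
library(rvest)

# Fabricated fragment mimicking the archive's results table: the <a>
# inside each td.views-field-title carries both the title and the href.
doc <- minimal_html('
  <table>
    <tr><td class="views-field-title"><a href="/documents/speech-one">Speech one</a></td></tr>
    <tr><td class="views-field-title"><a href="/documents/speech-two">Speech two</a></td></tr>
  </table>
')

links <- doc %>% html_elements(css = "td.views-field-title a")
df <- data.frame(
  title = links %>% html_text(trim = TRUE),
  url = links %>% html_attr("href") %>% paste0("https://www.presidency.ucsb.edu", .)
)
df
```

Grabbing title and href from the same `<a>` node also guarantees the two columns line up row for row.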

R: Using Rvest to loop through list

I am trying to scrape the prices, areas, and addresses of all flats from this homepage (https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden).
Getting the data for one list element with rvest and XPath works fine (see code), but I don't know how to get the ID of each list element in order to loop through all of them.
Here is part of the HTML with the data-go-to-expose-id I need for the loop. How can I get all the IDs?
<span class="slick-bg-layer"></span><img alt="Immobilienbild" class="gallery__image block height-full" src="https://pictures.immobilienscout24.de/listings/541dfd45-c75a-4da7-a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80"></a>
And here is my current R-code to fetch the data from one list element:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
address <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[2]/div[2]/a') %>% html_text()
price <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[1]/dd') %>% html_text()
area <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[2]/dd') %>% html_text()
Does this get what you are after?
library("tidyverse")
library("httr")
library("rvest")
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
x <- read_html(url)
x %>%
  html_nodes("#listings") %>%
  html_nodes(".result-list__listing") %>%
  html_attr("data-id")
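With the IDs in hand, the hard-coded result-103049161 XPaths from the question can be generated per listing. A sketch on a fabricated fragment that mirrors the result-&lt;id&gt; pattern (the listing markup below is invented for illustration):

```r
library(rvest)

# Fabricated fragment: two listings carrying data-id attributes, each
# wrapping a div with id "result-<id>" as on the real page.
doc <- minimal_html('
  <ul id="listings">
    <li class="result-list__listing" data-id="103049161">
      <div id="result-103049161"><a>Musterstrasse 1, Dresden</a></div>
    </li>
    <li class="result-list__listing" data-id="103049162">
      <div id="result-103049162"><a>Beispielweg 2, Dresden</a></div>
    </li>
  </ul>
')

# Collect all listing IDs, then build the per-listing XPath for each one.
ids <- doc %>%
  html_nodes("#listings .result-list__listing") %>%
  html_attr("data-id")

addresses <- sapply(ids, function(id) {
  doc %>%
    html_node(xpath = sprintf('//*[@id="result-%s"]//a', id)) %>%
    html_text()
})
addresses
```

The same sprintf() trick applies to the price and area XPaths from the question; reading the page once with read_html() and reusing the parsed document avoids three separate downloads.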
