R: Using Rvest to loop through list - r

I am trying to scrape the prices, areas and addresses of all flats listed on this page (https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden)
Getting the data for one list element with rvest and XPath works fine (see code), but I don't know how to get the ID of each list element so that I can loop through all of them.
Here is part of the HTML with the data-go-to-expose-id I need for the loop. How can I get all the IDs?
<span class="slick-bg-layer"></span><img alt="Immobilienbild" class="gallery__image block height-full" src="https://pictures.immobilienscout24.de/listings/541dfd45-c75a-4da7-a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80"></a>
And here is my current R-code to fetch the data from one list element:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
address <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[2]/div[2]/a') %>% html_text()
price <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[1]/dd') %>% html_text()
area <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[2]/dd') %>% html_text()

Does this get what you are after?
library("tidyverse")
library("httr")
library("rvest")
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
x <- read_html(url)
x %>%
html_nodes("#listings") %>%
html_nodes(".result-list__listing") %>%
html_attr("data-id")
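Once the listing nodes are selected, the other fields can be read relative to each listing instead of building one XPath per ID. The following is only a sketch on an inline HTML stand-in: the class names `address`, `price` and `area` are assumptions for illustration and will differ from the real page's markup.

```r
library(rvest)

# Inline stand-in for the results page; the markup and the class names
# "address", "price" and "area" are hypothetical.
html <- '
<ul id="listings">
  <li class="result-list__listing" data-id="101">
    <a class="address">Hauptstr. 1, Dresden</a>
    <dl><dd class="price">500 EUR</dd></dl>
    <dl><dd class="area">50 m2</dd></dl>
  </li>
  <li class="result-list__listing" data-id="102">
    <a class="address">Nebenstr. 2, Dresden</a>
    <dl><dd class="price">650 EUR</dd></dl>
  </li>
</ul>'

page <- read_html(html)

# One node per flat
listings <- page %>% html_nodes("#listings .result-list__listing")

# html_node() on a node set returns exactly one result per listing,
# so a missing field becomes NA instead of shifting the columns
flats <- data.frame(
  id      = html_attr(listings, "data-id"),
  address = listings %>% html_node(".address") %>% html_text(trim = TRUE),
  price   = listings %>% html_node(".price") %>% html_text(trim = TRUE),
  area    = listings %>% html_node(".area") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

print(flats)
```

Because every column comes from the same node set, the rows stay aligned even when a listing lacks one of the fields.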

Related

webscraping blocked by captcha in rvest

I am webscraping this page
https://www.casa.it/vendita/residenziale/milano/
with the following code
library(rvest)
library(tidyverse)
library(httr2)
#useragents list
ua <- read_html(paste0("https://developers.whatismybrowser.com/useragents/explore/hardware_type_specific/computer/")) %>%
html_nodes(".code") %>%
html_text(trim = TRUE)
df_apartments <- list()
for (i in 2:80) {
#read page
user_agent <- sample(ua, 1)
Sys.sleep(3)
page <- paste0("https://www.casa.it/vendita/residenziale/milano?page=",i) %>%
request() %>%
req_user_agent(user_agent) %>%
req_perform() %>%
resp_body_html()
#read the parent nodes
apartments <- page %>% html_nodes(xpath= "//div[@class='art-infos is-clickable']")
# parse information from each of the parent nodes
price <- apartments %>% html_node(xpath= ".//p[@class='c-txt--f0']") %>% html_text(trim = TRUE)
rooms <- apartments %>% html_node(xpath= ".//div[@class='grid-item info-features__item grid-item grid-item--behavior-fixed']") %>% html_text(trim = TRUE)
area <- apartments %>% html_nodes(xpath= ".//div[@class='grid-item info-features__item grid-item grid-item--behavior-fixed']") %>% html_text(trim = TRUE)
quartiere <- apartments %>% html_nodes(xpath= ".//p[@class='art-addr__addrs-subt c-txt--f5 art-addr__txt']") %>% html_text(trim = TRUE)
description <- apartments %>% html_node( xpath= ".//p[@class='art-addr__txt']") %>% html_text()
# put the data together into a data frame add to list
df_apartments[[i]] <- data.frame(price, rooms, area, description,quartiere)
print(paste("Page:",i,user_agent))
}
#combine all data frames into 1
apartments_finals <- bind_rows(df_apartments)
When I run the code I get a 403 error in RStudio because the page is blocked by a CAPTCHA. Is there a way to solve this without using RSelenium? I already tried Sys.sleep() and rotating the user agent, but neither works.
I also tried using RSelenium to click on the CAPTCHA, but that doesn't work either.
Any suggestions? Tell me if anything is unclear.

character (0) after scraping webpage in read_html

I'm trying to scrape "1,335,000" from the screenshot below (the number is at the bottom of the screenshot). I wrote the following code in R.
t2<-read_html("https://fortune.com/company/amazon-com/fortune500/")
employee_number <- t2 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//*[contains(@class, 'info__value--2AHH7')]") %>%
rvest::html_text()
However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?
As Dave2e pointed out, the page builds its content with JavaScript, so rvest alone cannot see the value.
url = "https://fortune.com/company/amazon-com/fortune500/"
#launch browser
library(RSelenium)
library(rvest)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[@id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>%
html_text()
[1] "1,335,000"
The data is loaded dynamically from a script tag, so there is no need for the expense of a browser. You could either extract the entire JavaScript object from the script, pass it to jsonlite to handle as JSON and then extract what you want, or, if you are only after the employee count, regex it out of the response text.
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)
page <- read_html('https://fortune.com/company/amazon-com/fortune500/')
data <- page %>% html_element('#preload') %>% html_text() %>%
stringr::str_match(. , "PRELOADED_STATE__ = (.*);") %>% .[, 2] %>% jsonlite::parse_json()
print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)
#shorter version
print(page %>%html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% .[,2] %>% as.integer() %>% format(big.mark=","))
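The script-tag approach can be tried offline on a small stand-in. This is a minimal sketch: the inline HTML and the shape of the JSON object (`company`, `employees`) are assumptions for illustration and are much simpler than the real page's state object.

```r
library(rvest)
library(jsonlite)

# Inline stand-in for a page that ships its data inside a script tag;
# the JSON structure here is hypothetical.
html <- '<html><body>
<script id="preload">window.__PRELOADED_STATE__ = {"company":{"name":"Example Corp","employees":"1335000"}};</script>
</body></html>'

page <- read_html(html)

# Raw JavaScript source of the script tag
script_text <- page %>% html_node("#preload") %>% html_text()

# Drop everything up to the first "=" and the trailing ";" to leave JSON
json_text <- sub("^[^=]*=\\s*", "", script_text, perl = TRUE)
json_text <- sub(";\\s*$", "", json_text, perl = TRUE)

state <- fromJSON(json_text)

employees <- as.integer(state$company$employees)
formatted <- format(employees, big.mark = ",")
print(formatted)
```

Parsing the whole object is more work than a single regex, but it keeps the other fields available without another request.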

extract a specific table from wikipedia in R

I want to extract the 20th table from a Wikipedia page https://en.wikipedia.org/wiki/...
I now use this code, but it only extracts the first heading table.
the_url <- "https://en.wikipedia.org/wiki/..."
tb <- the_url %>% read_html() %>%
html_node("table") %>%
html_table(fill = TRUE)
What should I do to get the specific one? Thank you!!
Instead of indexing by position, which could change when tables are added or removed, you could anchor on the table's relationship to the element with id Prize_money. Return just a single node for efficiency. Avoid longer XPaths as they can be fragile.
library(rvest)
table <- read_html('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money') %>%
html_node(xpath = "//*[@id='Prize_money']/parent::h4/following-sibling::table[1]") %>%
html_table(fill = TRUE)
Since you have a specific table you want to scrape, you can identify it in the html_node() call by using the XPath of the page element:
library(dplyr)
library(rvest)
the_url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup"
the_url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>%
html_table(fill=TRUE)
Try this code.
library(rvest)
webpage <- read_html("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup")
tbls <- html_nodes(webpage, "table")
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[3:4] %>%
html_table(fill = TRUE)
str(tbls_ls)
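The difference between selecting by position and selecting by anchor can be seen on a small inline page. This is a sketch with made-up markup; only the id `Prize_money` and the XPath pattern are taken from the answers above.

```r
library(rvest)

# Inline stand-in with two tables; the content is made up.
html <- '<html><body>
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<h4><span id="Prize_money">Prize money</span></h4>
<p>Intro text</p>
<table>
  <tr><th>Position</th><th>Amount</th></tr>
  <tr><td>Champions</td><td>38</td></tr>
</table>
</body></html>'

page <- read_html(html)

# By position: breaks silently if a table is inserted earlier in the page
by_index <- page %>% html_nodes("table") %>% .[[2]] %>% html_table()

# By anchor: first table that follows the heading holding the id
by_anchor <- page %>%
  html_node(xpath = "//*[@id='Prize_money']/parent::h4/following-sibling::table[1]") %>%
  html_table()

print(by_anchor)
```

Both calls return the same table here; the anchored version keeps working if another table is added above the heading.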

Any tip for start scraping an e-commerce site with RVEST?

I am trying to scrape some data from an e-commerce site using rvest. I haven't found any good examples to guide me. Any ideas?
Let's put as an example how I started:
library(rvest)
library(purrr)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#Reading the HTML code from the website
webpage <- read_html(url_base)
#Using CSS selectors to scrape the titles section
title_html <- html_nodes(webpage,'.main-title')
#Converting the title data to text
title <- html_text(title_html)
head(title)
#Using CSS selectors to scrape the price section
price <- html_nodes(webpage,'.item__price')
price <- html_text(price)
price
So, I would like to do two basic things:
Open each product page and take some data from it.
Paginate through all pages.
Any help?
Thank you.
Scraping that info is not difficult and is doable with rvest.
What you need to do is get all the hrefs and loop over them. To do that, use html_attr().
The following code should do the job:
library(tidyverse)
library(rvest)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages[1] <- url_base
#create an empty table to store results
result_table <- tibble()
for(page in all_pages){
page_source <- read_html(page)
title <- html_nodes(page_source,'.item__info-title') %>% html_text()
price <- html_nodes(page_source,'.item__price') %>% html_text()
item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
temp_table <- tibble(title = title, price = price, item_link = item_link)
result_table <- bind_rows(result_table,temp_table)
}
After you get link to each item, you can loop on the item links.
To view more pages
As you can see, there is a pattern in the suffix; you can simply increase the number by 50 each time to navigate to further pages.
> all_pages
[1] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/"
[2] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_51"
[3] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_101"
[4] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_151"
[5] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_201"
[6] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_251"
[7] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_301"
[8] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_351"
[9] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_401"
[10] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_451"
So we can do this:
str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",seq.int(from = 51,by = 50,length.out = 40))
Scrape each page
Let's use this page as an example: https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM
pagesource <- read_html("https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM")
n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
current_table <- tibble(product_description = product_description,
product_price = product_price,
n_vendor = n_vendor,
n_opinion = n_opinion)
print(current_table)
# A tibble: 1 x 4
product_description product_price n_vendor n_opinion
<chr> <chr> <chr> <chr>
1 Protector Funda Clear Cover Samsung Galaxy Note 8 14 14vendidos 2
You can loop the code chunk above and get all info.
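Note that `remove_nt()` is not a function from rvest or the tidyverse, and its definition is not shown in this answer. A minimal sketch, assuming it simply deletes newline and tab characters (which would be consistent with the "14vendidos" value in the output above):

```r
# Hypothetical stand-in for remove_nt(); the original definition is not shown.
# Assumption: it deletes newlines, carriage returns and tabs from the text.
remove_nt <- function(x) {
  gsub("[\n\r\t]", "", x)
}

cleaned <- remove_nt("\n\t14\n\tvendidos\n")
print(cleaned)
```

If a space between the pieces is preferred, `html_text(trim = TRUE)` followed by a whitespace-collapsing helper such as `stringr::str_squish()` is another option.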
Let's combine it all together
The following code should work, you can remove the 5-page limit to scrape all product information.
library(tidyverse)
library(rvest)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages <- c(url_base,
str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",
seq.int(from = 51,by = 50,length.out = 40)))
#create an empty table to store results
result_table <- tibble()
for(page in all_pages[1:5]){ #as an example, only scrape the first 5 pages
page_source <- read_html(page)
title <- html_nodes(page_source,'.item__info-title') %>% html_text()
price <- html_nodes(page_source,'.item__price') %>% html_text()
item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
temp_table <- tibble(title = title, price = price, item_link = item_link)
result_table <- bind_rows(result_table,temp_table)
}
#loop on result table(item_link):
product_table <- tibble()
for(i in 1:nrow(result_table)){
pagesource <- read_html(result_table[[i,"item_link"]])
n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
currency_symbol <- pagesource %>% html_node(".price-tag-symbol") %>% html_text()
n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
current_table <- tibble(product_description = product_description,
currency_symbol = currency_symbol,
product_price = product_price,
n_vendor = n_vendor,
n_opinion = n_opinion,
item_link = result_table[[i,"item_link"]])
product_table <- bind_rows(product_table,current_table)
}
Some issues
There are still some bugs in the code, for example:
On this page, there are two items that match the css selector, which may break the code. There are some solutions though:
Store result in a list instead of a table
Use a more accurate CSS selector
Concatenate the strings whenever there is more than one result, etc.
You can choose any methods that fit your requirement.
Also, if you want to scrape in quantity, you may want to use tryCatch to prevent any errors from breaking your loop.
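Two of these ideas, concatenating multiple matches and guarding each iteration with tryCatch, can be sketched together. The snippet below works on inline HTML strings rather than live pages; `scrape_price()` and the sample markup are hypothetical, and the empty string stands in for a failed download.

```r
library(rvest)

# Stand-ins for downloaded product pages (hypothetical markup).
# "b" has two price matches; "c" simulates a failed download.
pages <- c(
  a = '<div><span class="price-tag-fraction">14</span></div>',
  b = '<div><span class="price-tag-fraction">20</span><span class="price-tag-fraction">25</span></div>',
  c = ""
)

# Hypothetical helper: returns all matching prices as one string
scrape_price <- function(html) {
  if (!nzchar(html)) stop("empty response")
  prices <- read_html(html) %>%
    html_nodes(".price-tag-fraction") %>%
    html_text(trim = TRUE)
  paste(prices, collapse = " | ")
}

# tryCatch turns an error on one page into NA so the loop keeps going
results <- lapply(pages, function(html) {
  tryCatch(scrape_price(html), error = function(e) NA_character_)
})

product_prices <- unlist(results)
print(product_prices)
```

Collapsing to a single string keeps one row per product, at the cost of having to split the value again later if the individual prices matter.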
About APIs
Using an API is entirely different from web scraping; you may want to read some tutorials on APIs if you want to use one.

extract all the possible text from a webpage in R

I used this script to extract the text from a webpage
library(RCurl)
library(XML)
url <- "http://www.dlink.com/it/it"
doc <- getURL(url)
#get the text from the body
html <- htmlTreeParse(doc, useInternal = TRUE)
txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
txt<-toString(txt)
The problem is that it only picks up the words on the first page. How can I extend it to the whole website?
I'd go with rvest to scrape the links and purrr to iterate:
library(rvest)
library(purrr)
url <- "http://www.dlink.com/it/it"
r <- read_html(url) %>%
html_nodes('a') %>%
html_attr('href') %>%
Filter(function(f) !is.na(f) & !grepl(x = f, pattern = '#|facebook|linkedin|twitter|youtube'), .) %>%
map(~{
print(.x)
html_session(url) %>%
jump_to(.x) %>%
read_html() %>%
html_nodes('body') %>%
html_text() %>%
toString()
})
I filtered out social networks and dead links from the list of links; some more tuning might be in order.
Be advised that you will be scraping a lot of garbage. Some more targeting of what to scrape inside each page might be needed (i.e. something more specific than the whole body tag).
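The link filtering and the more targeted extraction can be checked offline before crawling. This is a sketch on inline HTML: the markup and the `#main` container are assumptions, and the real site will need its own selector.

```r
library(rvest)
library(xml2)

base_url <- "http://www.dlink.com/it/it"

# Inline stand-in for the start page; markup and ids are hypothetical.
html <- '<html><body>
<a href="/it/it/products">Products</a>
<a href="#top">Top</a>
<a>No href</a>
<a href="https://www.facebook.com/dlink">Facebook</a>
<a href="/it/it/support">Support</a>
<div id="main"><p>Main text.</p></div>
<div id="footer">Footer junk</div>
</body></html>'

page <- read_html(html)

hrefs <- page %>% html_nodes("a") %>% html_attr("href")

# Drop missing hrefs, in-page anchors and social media links
keep <- !is.na(hrefs) & !grepl("#|facebook|linkedin|twitter|youtube", hrefs)

# Resolve relative links against the start URL and remove duplicates
links <- unique(url_absolute(hrefs[keep], base_url))

# Target one container instead of the whole body to avoid boilerplate text
main_text <- page %>% html_node("#main") %>% html_text(trim = TRUE)

print(links)
print(main_text)
```

Resolving the links up front also makes it easy to restrict the crawl to the same host before fetching anything.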
