R web-scraping on a multiple-level website with non dynamic URLs

I apologize in case I have not found a previous topic on this matter.
I want to scrape this website
http://www.fao.org/countryprofiles/en/
In particular, this page includes many links to country profiles. The links are structured like this:
http://www.fao.org/countryprofiles/index/en/?iso3=KAZ
http://www.fao.org/countryprofiles/index/en/?iso3=AFG
and each of these pages includes a News section I am interested in.
Of course, I could scrape page-by-page but that would be a waste of time.
I tried the following but that is not working:
countries <- read_html("http://www.fao.org/countryprofiles/en/") %>%
html_nodes(".linkcountry") %>%
html_text()
country_news <- list()
sub <- html_session("http://www.fao.org/countryprofiles/en/")
for(i in countries[1:100]){
page <- sub %>%
follow_link(i) %>%
read_html()
country_news[[i]] <- page %>%
html_nodes(".white-box") %>%
html_text()
}
Any idea?

You can get all of the child pages from the top-level page:
stem = 'http://www.fao.org'
top_level = paste0(stem, '/countryprofiles/en/')
all_children = read_html(top_level) %>%
# ? and = are required to skip /iso3list/en/
html_nodes(xpath = '//a[contains(@href, "?iso3=")]/@href') %>%
html_text %>% paste0(stem, .)
head(all_children)
# [1] "http://www.fao.org/countryprofiles/index/en/?iso3=AFG"
# [2] "http://www.fao.org/countryprofiles/index/en/?iso3=ALB"
# [3] "http://www.fao.org/countryprofiles/index/en/?iso3=DZA"
# [4] "http://www.fao.org/countryprofiles/index/en/?iso3=AND"
# [5] "http://www.fao.org/countryprofiles/index/en/?iso3=AGO"
# [6] "http://www.fao.org/countryprofiles/index/en/?iso3=ATG"
If you are not comfortable with xpath, the CSS version would be:
html_nodes('a') %>% html_attr('href') %>%
grep("?iso3=", ., value = TRUE, fixed = TRUE) %>% paste0(stem, .)
Now you can loop over those pages and extract what you want.
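A minimal sketch of that loop, under stated assumptions: two temporary local files stand in for the country pages so the example runs without network access, make_page() is a hypothetical helper that exists only for this demo, and ".white-box" is the selector taken from the question (whether it matches the News section on the live site is not verified here). With the real all_children vector you would pass the URLs instead of file paths.

```r
library(rvest)

# Hypothetical helper: writes a tiny HTML page to a temp file (demo only)
make_page <- function(news) {
  path <- tempfile(fileext = ".html")
  writeLines(
    sprintf('<html><body><div class="white-box">%s</div></body></html>', news),
    path
  )
  path
}

# Stand-in for the vector of child URLs; names are the ISO3 codes
all_children <- c(AFG = make_page("News A"), ALB = make_page("News B"))

# lapply() keeps the names, so the result is a named list keyed by country
country_news <- lapply(all_children, function(u) {
  read_html(u) %>%
    html_nodes(".white-box") %>%
    html_text(trim = TRUE)
})

# With real URLs the ISO3 code can be recovered from the query string
iso3 <- sub(".*iso3=", "", "http://www.fao.org/countryprofiles/index/en/?iso3=AFG")
```

When running this against the live site, adding a short Sys.sleep() between requests is a common courtesy.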

Related

How can I do a loop of url to scrape it?

I want to scrape a website with 10 pages. I wrote a function to scrape one page, but I have to pass it all 10 URLs one after another, like this:
scraper <- function(link){
page = read_html(link)
titulo = page %>% html_nodes("h4 a") %>% html_text()
tipo = page %>% html_nodes("h4+ .row .col-md-4") %>% html_text()
data = page %>% html_nodes("p.col-md-6") %>% html_text()
protocolo = page %>% html_nodes(".row:nth-child(3) .col-md-4") %>% html_text()
situacao = page %>% html_nodes(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text()
regime = page %>% html_nodes("p.col-md-4:nth-child(2)") %>% html_text()
quorum = page %>% html_nodes(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text()
autoria = page %>% html_nodes(".row:nth-child(5) .col-md-12") %>% html_text()
assunto = page %>% html_nodes(".row:nth-child(6) .col-md-12") %>% html_text()
result <- data.frame(titulo, tipo, data, protocolo, situacao, regime, quorum, autoria, assunto)
return(result)
}
link1 <- "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=1&Documento=117&Modulo=8&AnoInicial=2022"
result1 <- scraper(link1)
link2 <- "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=2&Documento=117&Modulo=8&AnoInicial=2022"
result2 <- scraper(link2)
How can I pass all the links at once? Maybe with a loop?

The URLs for your pages differ only in the Pagina= parameter. You can therefore loop over the page numbers and build the URLs dynamically. Here I use a small custom function for that, and purrr::map_df, which binds the results into one data frame.
For the reprex I loop over only the first two pages:
library(rvest)
make_url <- function(page) {
paste0(
"https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
page, "&Documento=117&Modulo=8&AnoInicial=2022"
)
}
result <- purrr::map_df(1:2, function(page) {
url <- make_url(page)
scraper(url)
}, .id = "page")
dim(result)
#> [1] 30 10
unique(result$page)
#> [1] "1" "2"
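Because paste0() is vectorised, make_url() can also build every URL in a single call, and the per-page results can be bound afterwards. The following is only a sketch: fake_scraper() is a hypothetical stand-in for the real scraper() (it just records the URL, so the example runs offline), and dplyr::bind_rows() is used here as an alternative to map_df.

```r
library(purrr)
library(dplyr)

make_url <- function(page) {
  paste0(
    "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
    page, "&Documento=117&Modulo=8&AnoInicial=2022"
  )
}

# Hypothetical stand-in for scraper(): one row per URL, no network access
fake_scraper <- function(link) {
  data.frame(link = link, stringsAsFactors = FALSE)
}

pages <- 1:3
urls <- make_url(pages)          # all URLs at once

results <- map(urls, fake_scraper)
names(results) <- pages          # list names become the "page" column

combined <- bind_rows(results, .id = "page")
```

Swapping fake_scraper for the real scraper and setting pages to 1:10 should give one data frame covering all ten pages.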

How to select "href" of a web page of a specific "target"?

<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">
I just want to extract the "href" (for example, from the HTML tag above) so that I can concatenate it with the domain name of the website, "https://kurier.at", and scrape all articles on the home page.
I tried the following code
library(rvest)
library(lubridate)
kurier_wbpg <- read_html("https://kurier.at")
# I just want the "a" tags that have the attribute target="_self"
articleLinks <- kurier_wbpg %>% html_elements("a")%>%
html_elements(css = "tag[attribute=_self]") %>%
html_attr("href")%>%
paste("https://kurier.at",.,sep = "")
When I execute up to the html_attr("href") part of the above code block, the result I get is
character(0)
I think something is wrong with how I select the HTML element.
Can anyone help?

You need to narrow down your css to the second teaser block image which you can do by using the naming conventions of the classes. You can use url_absolute() to add the domain.
library(rvest)
library(magrittr)
url <- 'https://kurier.at/'
result <- read_html(url) %>%
html_element('.teasers-2 .image') %>%
html_attr('href') %>%
url_absolute(url)
Same principle to get all teasers:
results <- read_html(url) %>%
html_elements('.teaser .image') %>%
html_attr('href') %>%
url_absolute(url)
Not sure if you want the bottom block of 5 included. If so, you can again use classes
articles <- read_html(url) %>%
html_elements('.teaser-title') %>%
html_attr('href') %>%
url_absolute(url)

It works with XPath:
library(rvest)
kurier_wbpg <- read_html("https://kurier.at")
articleLinks <- kurier_wbpg %>%
html_elements("a") %>%
html_elements(xpath = '//*[@target="_self"]') %>%
html_attr('href') %>%
paste0("https://kurier.at",.)
articleLinks
# [1] "https://kurier.at/plus"
# [2] "https://kurier.at/coronavirus"
# [3] "https://kurier.at/politik"
# [4] "https://kurier.at/politik/inland"
# [5] "https://kurier.at/politik/ausland"
#...
#...
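For completeness, the selector in the question returned character(0) for two reasons: "tag[attribute=_self]" is a template rather than a real selector (the element name and attribute name have to be filled in), and chaining a second html_elements() call searches inside each "a" element instead of filtering them. A sketch of the plain CSS attribute selector, run on an inline HTML snippet made up for this example rather than on the live page:

```r
library(rvest)

# Made-up snippet that mimics the structure of the teaser links
page <- read_html('<html><body>
  <a class="image teaser-image" target="_self" href="/politik/inland/article-1">A</a>
  <a class="image teaser-image" target="_blank" href="https://example.org/ad">Ad</a>
  <a target="_self" href="/sport/article-2">B</a>
</body></html>')

# a[target="_self"] matches <a> elements whose target attribute equals "_self"
article_links <- page %>%
  html_elements('a[target="_self"]') %>%
  html_attr("href") %>%
  xml2::url_absolute("https://kurier.at")
```

url_absolute() resolves relative paths against the base and leaves links that are already absolute untouched, which is safer than pasting the domain in front of every href.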

Using rvest to webscrape multiple pages

I am trying to extract all speeches given by Melania Trump from 2016-2020 at the following link: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush. I am trying to use rvest to do so. Here is my code thus far:
# get main link
link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"
# main page
page <- read_html(link)
# extract speech titles
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
# extract year of speech
year <- page %>% html_nodes(".date-display-single") %>% html_text()
# extract name of person giving speech
flotus <- page %>% html_nodes(".views-field-title-1.nowrap") %>% html_text()
get_text <- function(title_link){
speech_page = read_html(title_link)
speech_text = speech_page %>% html_nodes(".field-docs-content p") %>%
html_text() %>% paste(collapse = ",")
return(speech_text)
}
text = sapply(title_links, FUN = get_text)
I am having trouble with the following line of code:
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
In particular, title_links yields a series of links like this: "https://www.presidency.ucsb.eduNA", rather than the individual web pages. Does anyone know what I am doing wrong here? Any help would be appreciated.

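You are querying the wrong CSS node: the td cell itself has no href attribute, so html_attr() returns NA. Select the a element inside the cell instead.
Try: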
page %>% html_elements(css = "td.views-field-title a") %>% html_attr('href')
[1] "https://www.presidency.ucsb.edu/documents/remarks-mrs-laura-bush-the-national-press-club"
[2] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-un-commission-the-status-women-international-womens-day"
[3] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-colorado-early-childhood-cognitive-development-summit"
[4] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-10th-anniversary-the-holocaust-memorial-museum-and-opening-anne"
[5] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-preserve-america-initiative-portland-maine"
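The difference between the two selectors, and where the "eduNA" strings come from, can be reproduced on a small inline table. The HTML and the "/documents/remarks-example" path below are made up for this sketch; only the class name comes from the question.

```r
library(rvest)

page <- read_html('<html><body><table><tr>
  <td class="views-field-title"><a href="/documents/remarks-example">Remarks example</a></td>
</tr></table></body></html>')

# The <td> itself has no href attribute, so html_attr() returns NA
cell_href <- page %>% html_elements("td.views-field-title") %>% html_attr("href")

# Selecting the <a> inside the cell returns the actual link
link_href <- page %>% html_elements("td.views-field-title a") %>% html_attr("href")

# paste0() converts NA to the string "NA", hence the "...eduNA" results
broken <- paste0("https://www.presidency.ucsb.edu", cell_href)

# url_absolute() resolves the relative path against the site root
fixed <- xml2::url_absolute(link_href, "https://www.presidency.ucsb.edu/")
```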

Trouble mapping a function to a list of scraped links using rvest

I am trying to apply a function that extracts a table from a list of scraped links. I am at the final stage where I am applying the get_injury_data function to the links - I have been having issues with successfully executing this. I get the following error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
I wonder if anyone can help me spot where I am going wrong. The code is as follows:
library(tidyverse)
library(rvest)
# create a function to grab the team links
get_team_links <- function(url){
url %>%
read_html %>%
html_nodes('td.hauptlink a') %>%
html_attr('href') %>%
.[. != '#'] %>% # remove entries that are just '#'
paste0('https://www.transfermarkt.com', .) %>% # prepend the site root to each relative link
unique() %>% # keep only unique links
as_tibble() %>% # turn the strings into a tibble
rename("links" = "value") %>% # rename the value column
filter(!grepl('profil', links)) %>% # drop player profile links
filter(!grepl('spielplan', links)) %>% # drop additional team pages (fixtures)
mutate(links = gsub("startseite", "kader", links)) # point the link at the detailed squad page
}
# create a function to grab the player links
get_player_links <- function(url){
url %>%
read_html %>%
html_nodes('td.hauptlink a') %>%
html_attr('href') %>%
.[. != '#'] %>% # remove entries that are just '#'
paste0('https://www.transfermarkt.com', .) %>% # prepend the site root to each relative link
unique() %>% # keep only unique links
as_tibble() %>% # turn the strings into a tibble
rename("links" = "value") %>% # rename the value column
filter(grepl('profil', links)) %>% # keep only player profile links
mutate(links = gsub("profil", "verletzungen", links)) # point the link at the injury page
}
# create a function to get the injury dataset
get_injury_data <- function(url){
url %>%
read_html() %>%
html_nodes('#yw1') %>%
html_table()
}
# get team links and save it as team_links
team_links <- get_team_links('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
# get player links and by mapping the function on to the player_injury_links dataset
# and then unnest the list of lists as a long list
player_injury_links <- team_links %>%
mutate(links = map(team_links$links, get_player_links)) %>%
unnest(links)
# using the player_injury_links list, create a dataset by scraping the player injury pages
player_injury_data <- map(player_injury_links$links, get_injury_data)

Solution
The problem was that some of the links I was scraping led to pages without any injury data.
To get around this, I wrapped get_injury_data in possibly() from the purrr package, which returns a fallback value (here NULL) instead of raising an error.
The corrected line is as follows (the function is mapped over the links column, as in the original call):
player_injury_data <- player_injury_links$links %>%
purrr::map(purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE))
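How possibly() behaves can be seen on inline HTML strings. This is only a sketch: get_table() is a hypothetical variant of get_injury_data that takes an HTML string and raises an explicit error when nothing matches, and the snippet assumes the element with id "yw1" is the table itself, which may not hold on the live site.

```r
library(rvest)
library(purrr)

# Hypothetical variant of get_injury_data(): errors when #yw1 is absent
get_table <- function(html) {
  node <- read_html(html) %>% html_element("#yw1")
  if (inherits(node, "xml_missing")) stop("no injury table on this page")
  html_table(node)
}

with_table <- '<html><body><table id="yw1">
  <tr><th>Injury</th><th>Days</th></tr>
  <tr><td>Knee</td><td>12</td></tr>
</table></body></html>'
without_table <- '<html><body><p>No entries</p></body></html>'

# possibly() wraps the function so that errors become NULL
safe_get_table <- possibly(get_table, otherwise = NULL, quiet = TRUE)

results <- map(list(with_table, without_table, with_table), safe_get_table)

# compact() drops the NULL entries left by the failed pages
kept <- compact(results)
```

Keeping the NULL entries until the end (rather than dropping them inside the loop) makes it easy to check which links failed.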

Any tip for start scraping an e-commerce site with RVEST?

I am trying to scrape some data from an e-commerce site using rvest. I haven't found any good examples to guide me. Any ideas?
As an example, here is how I started:
library(rvest)
library(purrr)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#Reading the HTML code from the website
webpage <- read_html(url_base)
#Using CSS selectors to scrape the titles section
title_html <- html_nodes(webpage,'.main-title')
#Converting the title data to text
title <- html_text(title_html)
head(title)
#Using CSS selectors to scrape the price section
price <- html_nodes(webpage,'.item__price')
price <- html_text(price)
price
So, I would like to do two basic things:
Visit each product page and extract some data from it.
Paginate through all result pages.
Any help?
Thank you.

Scraping that information is not difficult and can be done with rvest.
You need to collect all the hrefs and loop over them, which you can do with html_attr().
The following code should do the job:
library(tidyverse)
library(rvest)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages[1] <- url_base
#create an empty table to store results
result_table <- tibble()
for(page in all_pages){
page_source <- read_html(page)
title <- html_nodes(page_source,'.item__info-title') %>% html_text()
price <- html_nodes(page_source,'.item__price') %>% html_text()
item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
temp_table <- tibble(title = title, price = price, item_link = item_link)
result_table <- bind_rows(result_table,temp_table)
}
Once you have the link to each item, you can loop over the item links.
To view more pages
As you can see, there is a pattern in the suffix: the offset increases by 50 for each page, so you can generate the URLs for further pages directly.
> all_pages
[1] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/"
[2] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_51"
[3] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_101"
[4] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_151"
[5] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_201"
[6] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_251"
[7] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_301"
[8] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_351"
[9] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_401"
[10] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_451"
So we can do this:
str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",seq.int(from = 51,by = 50,length.out = 40))
Scrape each page
Let's use this page as an example: https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM
pagesource <- read_html("https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM")
n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
current_table <- tibble(product_description = product_description,
product_price = product_price,
n_vendor = n_vendor,
n_opinion = n_opinion)
print(current_table)
# A tibble: 1 x 4
product_description product_price n_vendor n_opinion
<chr> <chr> <chr> <chr>
1 Protector Funda Clear Cover Samsung Galaxy Note 8 14 14vendidos 2
You can loop over the code chunk above to get the information for every item.
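Note that remove_nt() is not defined anywhere in this answer, and as far as I know it is not exported by rvest or the tidyverse. Judging by the name and by the "14vendidos" output above, it presumably strips newlines and tabs. A hypothetical definition along those lines, which would need to be run before the code above:

```r
# Hypothetical helper (an assumption, not the answer author's code):
# removes newline, carriage-return and tab characters from a string
remove_nt <- function(x) {
  gsub("[\n\r\t]", "", x)
}

cleaned <- remove_nt("\n\t14\n\tvendidos\n")
```

If you would rather keep the words separated, replacing the matches with a space and wrapping the result in trimws() is a small variation.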
Let's combine it all together
The following code should work; remove the 5-page limit to scrape all product information.
library(tidyverse)
library(rvest)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages <- c(url_base,
str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",
seq.int(from = 51,by = 50,length.out = 40)))
#create an empty table to store results
result_table <- tibble()
for(page in all_pages[1:5]){ #as an example, only scrape the first 5 pages
page_source <- read_html(page)
title <- html_nodes(page_source,'.item__info-title') %>% html_text()
price <- html_nodes(page_source,'.item__price') %>% html_text()
item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
temp_table <- tibble(title = title, price = price, item_link = item_link)
result_table <- bind_rows(result_table,temp_table)
}
#loop on result table(item_link):
product_table <- tibble()
for(i in 1:nrow(result_table)){
pagesource <- read_html(result_table[[i,"item_link"]])
n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
currency_symbol <- pagesource %>% html_node(".price-tag-symbol") %>% html_text()
n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
current_table <- tibble(product_description = product_description,
currency_symbol = currency_symbol,
product_price = product_price,
n_vendor = n_vendor,
n_opinion = n_opinion,
item_link = result_table[[i,"item_link"]])
product_table <- bind_rows(product_table,current_table)
}
Some issues
There are still some bugs in the code, for example:
On this page, there are two items that match the css selector, which may break the code. There are some solutions though:
Store result in a list instead of a table
Use a more accurate CSS selector
Concatenate the strings whenever there is more than one result, and so on.
You can choose any methods that fit your requirement.
Also, if you want to scrape in quantity, you may want to use tryCatch to prevent any errors from breaking your loop.
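A rough sketch of the tryCatch idea, combined with the "concatenate" option from the list above. It runs offline: one temporary file stands in for a product page with two matching prices, and a path that does not exist stands in for a failing request. scrape_item() is a hypothetical name, and the selector is the one used earlier in this answer.

```r
library(rvest)

# A local page with two elements matching the price selector
good <- tempfile(fileext = ".html")
writeLines(
  '<html><body><span class="price-tag-fraction">14</span><span class="price-tag-fraction">20</span></body></html>',
  good
)
# A path that does not exist, to trigger an error in read_html()
missing <- file.path(tempdir(), "does-not-exist.html")

scrape_item <- function(path) {
  tryCatch({
    page <- read_html(path)
    prices <- page %>% html_elements(".price-tag-fraction") %>% html_text()
    data.frame(
      item_link = path,
      # collapse multiple matches into one string so each item stays one row
      product_price = paste(prices, collapse = " | "),
      stringsAsFactors = FALSE
    )
  }, error = function(e) {
    message("Skipping ", path, ": ", conditionMessage(e))
    NULL  # a failed item yields NULL instead of stopping the loop
  })
}

rows <- lapply(c(good, missing), scrape_item)

# Drop the failed items before binding the rows together
product_table <- do.call(rbind, Filter(Negate(is.null), rows))
```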
About APIs
Using an API is quite different from web scraping; if you want to go that route, you may want to read some tutorials on APIs first.
