I want to scrape a website with 10 pages. I created a function to scrape the site, but I had to list all 10 URLs one after another, like this:
library(rvest)

# scrape one results page into a data frame
scraper <- function(link){
  page <- read_html(link)
  titulo <- page %>% html_nodes("h4 a") %>% html_text()
  tipo <- page %>% html_nodes("h4+ .row .col-md-4") %>% html_text()
  data <- page %>% html_nodes("p.col-md-6") %>% html_text()
  protocolo <- page %>% html_nodes(".row:nth-child(3) .col-md-4") %>% html_text()
  situacao <- page %>% html_nodes(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text()
  regime <- page %>% html_nodes("p.col-md-4:nth-child(2)") %>% html_text()
  quorum <- page %>% html_nodes(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text()
  autoria <- page %>% html_nodes(".row:nth-child(5) .col-md-12") %>% html_text()
  assunto <- page %>% html_nodes(".row:nth-child(6) .col-md-12") %>% html_text()
  result <- data.frame(titulo, tipo, data, protocolo, situacao, regime, quorum, autoria, assunto)
  return(result)
}
link1 <- "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=1&Documento=117&Modulo=8&AnoInicial=2022"
result1 <- scraper(link1)
link2 <- "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=2&Documento=117&Modulo=8&AnoInicial=2022"
result2 <- scraper(link2)
How can I avoid listing every link separately? Maybe with a loop?
The URLs for your pages differ only in the Pagina= part. Hence you can easily loop over the page numbers and create the URLs dynamically, for which I use a small custom function and purrr::map_df(), which binds the results into one data frame.
For the reprex I only loop over the first two pages:
library(rvest)
make_url <- function(page) {
paste0(
"https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
page, "&Documento=117&Modulo=8&AnoInicial=2022"
)
}
result <- purrr::map_df(1:2, function(page) {
url <- make_url(page)
scraper(url)
}, .id = "page")
dim(result)
#> [1] 30 10
unique(result$page)
#> [1] "1" "2"
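To cover all 10 pages mentioned in the question, the same call just needs a wider range; a minimal sketch (untested beyond the two-page reprex above):

result_all <- purrr::map_df(1:10, function(page) {
  scraper(make_url(page))
}, .id = "page")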
I am trying to extract all speeches given by Melania Trump from 2016-2020 at the following link: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush. I am trying to use rvest to do so. Here is my code thus far:
# get main link
link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"
# main page
page <- read_html(link)
# extract speech titles
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
# extract year of speech
year <- page %>% html_nodes(".date-display-single") %>% html_text()
# extract name of person giving speech
flotus <- page %>% html_nodes(".views-field-title-1.nowrap") %>% html_text()
get_text <- function(title_link){
  speech_page = read_html(title_link)
  speech_text = speech_page %>% html_nodes(".field-docs-content p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_text)
}
text = sapply(title_links, FUN = get_text)
I am having trouble with the following line of code:
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
In particular, title_links yields a series of links like this: "https://www.presidency.ucsb.eduNA", rather than the individual web pages. Does anyone know what I am doing wrong here? Any help would be appreciated.
You are querying the wrong CSS node: the href attribute lives on the a element inside the td, not on the td itself.
Try:
page %>% html_elements(css = "td.views-field-title a") %>% html_attr('href')
[1] "https://www.presidency.ucsb.edu/documents/remarks-mrs-laura-bush-the-national-press-club"
[2] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-un-commission-the-status-women-international-womens-day"
[3] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-colorado-early-childhood-cognitive-development-summit"
[4] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-10th-anniversary-the-holocaust-memorial-museum-and-opening-anne"
[5] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-preserve-america-initiative-portland-maine"
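A minimal sketch of how the corrected selector could feed the rest of the pipeline, reusing link and the get_text() helper from the question; treat it as untested:

library(rvest)
library(purrr)

page <- read_html(link)

# the hrefs returned by the corrected selector are already absolute,
# so no paste() prefix is needed
title_links <- page %>%
  html_elements("td.views-field-title a") %>%
  html_attr("href")

# fetch and collapse the text of every speech
texts <- map_chr(title_links, get_text)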
Trying to get the last page number:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/de/wohnung-kaufen"
page <- read_html(url)
last_page_number <- page %>%
html_nodes("#pageSelection > select > option") %>%
html_text() %>%
length()
The result is empty for some reason.
I can access the pages by this url, for example to get page #3:
https://www.immobilienscout24.de/Suche/de/wohnung-kaufen?pagenumber=3
You are heading in the right direction, but I think you have the wrong CSS selectors. Try:
library(rvest)
url <- 'https://www.immobilienscout24.de/Suche/de/wohnung-kaufen'
url %>%
read_html() %>%
html_nodes('div.select-container select option') %>%
html_text() %>%
tail(1L)
#[1] "1650"
An alternative:
url %>%
read_html() %>%
html_nodes('div.select-container select option') %>%
magrittr::extract2(length(.)) %>%
html_text()
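Once you have the last page number as a count, you could build every results-page URL with the ?pagenumber= pattern from the question. A sketch, assuming that pattern keeps working:

last_page <- url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  html_text() %>%
  tail(1L) %>%
  as.integer()

# one URL per results page
page_urls <- paste0(url, "?pagenumber=", seq_len(last_page))
head(page_urls, 3)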
The following code scrapes all questions and answers with their authors and dates, but I cannot figure out how to also scrape answers that span more than one page, for example the second question here:
https://www.healthboards.com/boards/aspergers-syndrome/index2.html
Asperger's and talking to yourself
The answers are on 2 pages: 15 on the first page and 3 on the second. I am only getting the answers on the first page.
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(RCurl)
library(xlsx)
#install.packages("xlsx")
# Scrape thread titles, thread links, authors and number of views
url <- "https://www.healthboards.com/boards/aspergers-syndrome/index2.html"
h <- read_html(url)
threads <- h %>%
html_nodes("#threadslist .alt1 div > a") %>%
html_text()
threads
thread_links <- h %>%
html_nodes("#threadslist .alt1 div > a") %>%
html_attr(name = "href")
thread_links
thread_starters <- h %>%
html_nodes("#threadslist .alt1 div.smallfont") %>%
html_text() %>%
str_replace_all(pattern = "\t|\r|\n", replacement = "")
thread_starters
views <- h %>%
html_nodes(".alt2:nth-child(6)") %>%
html_text() %>%
str_replace_all(pattern = ",", replacement = "") %>%
as.numeric()
# Custom functions to scrape author IDs and posts
scrape_posts <- function(link){
read_html(link) %>%
html_nodes(css = ".smallfont~ hr+ div") %>%
html_text() %>%
str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
str_trim()
}
scrape_dates <- function(link){
read_html(link) %>%
html_nodes(css = "table[id^='post'] td.thead:first-child") %>%
html_text() %>%
str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
str_trim()
}
scrape_author_ids <- function(link){
h <- read_html(link) %>%
html_nodes("div")
id_index <- h %>%
html_attr("id") %>%
str_which(pattern = "postmenu")
h %>%
`[`(id_index) %>%
html_text() %>%
str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
str_trim()
}
htmls <- map(thread_links, getURL)
# Create master dataset
master_data <-
tibble(threads, thread_starters,thread_links) %>%
mutate(
post_author_id = map(htmls, scrape_author_ids),
post = map(htmls, scrape_posts),
fec=map(htmls, scrape_dates)
) %>%
select(threads: post_author_id, post, thread_links,fec) %>%
unnest()
master_data$thread_starters
threads
post
titles<-master_data$threads
therad_starters<-master_data$thread_starters
#views<-master_data$views
post_author<-master_data$post_author_id
post<-master_data$post
fech<-master_data$fec
employ.data <- data.frame(titles, therad_starters, post_author, post,fech)
write.xlsx(employ.data, "C:/2.xlsx")
I can't figure out how to include the second page as well.
Taking a quick look at your code and the website, there is a td with class vbmenu_control that holds the number of pages (in your case, "Page 2 of 2"). You could use some simple regex, such as:
a <- "page 2 of 2"
b <- as.numeric(gsub("page \\d+ of ", "", a))
b
#> [1] 2
Then add a conditional on b > 1. If it is TRUE, you can loop-scrape through the links ...-talking-yourself-i.html, with i taking the values from 1 up to the number of pages.
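A minimal, untested sketch of that idea, combining the vbmenu_control hint with the URL pattern above (the thread_page_links() helper and its URL handling are my own assumptions):

library(rvest)
library(stringr)

# given a thread's first-page link, return the links of all its pages,
# based on the "Page X of Y" text in td.vbmenu_control
thread_page_links <- function(link) {
  page_info <- read_html(link) %>%
    html_nodes("td.vbmenu_control") %>%
    html_text() %>%
    str_subset(regex("page \\d+ of \\d+", ignore_case = TRUE))

  if (length(page_info) == 0) return(link)  # single-page thread

  n_pages <- as.numeric(str_extract(str_trim(page_info[1]), "\\d+$"))
  if (is.na(n_pages) || n_pages <= 1) return(link)

  # pages 2..n follow the "<thread>-i.html" pattern mentioned above
  c(link, paste0(sub("\\.html$", "", link), "-", 2:n_pages, ".html"))
}

The existing thread_links could then be expanded with unlist(map(thread_links, thread_page_links)) before being passed to getURL(), so scrape_posts(), scrape_dates() and scrape_author_ids() see every page of every thread.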
So I'm trying to scrape data from a site that lists the clubs at my school. I have a working script that scrapes the surface-level data from the site; however, I can get more data by clicking the "more information" link for each club, which leads to the club's profile page. I would like to scrape the data from that page (specifically the Facebook link). How can I do this?
Below you'll see my current attempt at this.
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
get_table <- function(page, count) {
#find group names
name_text <- html_nodes(page,".grpl-name a") %>% html_text()
df <- data.frame(name_text, stringsAsFactors = FALSE)
#find text description
desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
df$desc_text <- trimws(desc_text)
#find emails
# find the parent nodes with html_nodes
# then find the contact information from each parent using html_node
email_nodes<-html_nodes(page, "div.grpl-grp") %>% html_node( ".grpl-contact a") %>% html_text()
df$emails<-email_nodes
category_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-type") %>% html_text()
df$category<-category_nodes
pic_nodes <-html_nodes(page, "div.grpl-grp") %>% html_node( ".grpl-logo img") %>% html_attr("src")
df$logo <- paste0("https://uws-community.symplicity.com/", pic_nodes)
more_info_nodes <- html_nodes(page, ".grpl-moreinfo a") %>% html_attr("href")
df$more_info <- paste0("https://uws-community.symplicity.com/", more_info_nodes)
sub_page <- page %>% follow_link(css = ".grpl-moreinfo a")
df$fb <- html_node(sub_page, "#dnf_class_values_student_group__facebook__widget") %>% html_text()
if(count != 44) {
return (rbind(df, get_table(page %>% follow_link(css = ".paging_nav a:last-child"), count + 1)))
} else{
return (df)
}
}
RSO_data <- get_table(page, 0)
The part where I try to get the Facebook link is this:
sub_page <- page %>% follow_link(css = ".grpl-moreinfo a")
df$fb <- html_node(sub_page, "#dnf_class_values_student_group__facebook__widget") %>% html_text()
However, this returns an error. What am I doing wrong? Is there a way to scrape the data from each club's separate page?
Use an XPath expression to extract the desired node, based on its id.
df$fb <- html_node(sub_page, xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>% html_text()
# > html_node(sub_page, xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>% html_text()
# [1] "https://www.facebook.com/17thavehouse/?fref=ts"
You will, however, need to loop through all of your clubs to open each subpage and extract the Facebook links.
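A rough, untested sketch of that loop, using the more_info column built inside get_table() (get_fb_link() is a hypothetical helper; if the profile pages are only reachable through the logged-in session, you would need to jump there via the session object rather than a plain read_html()):

library(rvest)
library(purrr)

# visit one club profile and pull the Facebook field
get_fb_link <- function(profile_url) {
  read_html(profile_url) %>%
    html_node(xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>%
    html_text()
}

df$fb <- map_chr(df$more_info, get_fb_link)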
I am trying to scrape some data from an e-commerce site using rvest. I haven't found any good examples to guide me. Any ideas?
As an example, here is how I started:
library(rvest)
library(purrr)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#Reading the HTML code from the website
webpage <- read_html(url_base)
#Using CSS selectors to scrape the titles section
title_html <- html_nodes(webpage,'.main-title')
#Converting the title data to text
title <- html_text(title_html)
head(title)
#Using CSS selectors to scrape the price section
price <- html_nodes(webpage,'.item__price')
price <- html_text(price)
price
So, I would like to do two basic things:
Enter each product page and take some data from it.
Paginate through all the pages.
Any help?
Thank you.
Scraping that info is not difficult and is doable with rvest.
What you need to do is get all the hrefs and loop over them. To do that, you need html_attr().
The following code should do the job:
library(tidyverse)
library(rvest)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages[1] <- url_base
#create an empty table to store results
result_table <- tibble()
for(page in all_pages){
page_source <- read_html(page)
title <- html_nodes(page_source,'.item__info-title') %>% html_text()
price <- html_nodes(page_source,'.item__price') %>% html_text()
item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
temp_table <- tibble(title = title, price = price, item_link = item_link)
result_table <- bind_rows(result_table,temp_table)
}
After you get the link to each item, you can loop over the item links.
To view more pages
As you can see, there is a pattern in the URL suffix; you can simply increase the number by 50 each time to navigate to more pages.
> all_pages
[1] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/"
[2] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_51"
[3] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_101"
[4] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_151"
[5] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_201"
[6] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_251"
[7] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_301"
[8] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_351"
[9] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_401"
[10] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_451"
So we can do this:
str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",seq.int(from = 51,by = 50,length.out = 40))
Scrape each page
Let's use this page as an example: https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM
pagesource <- read_html("https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM")
n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
current_table <- tibble(product_description = product_description,
product_price = product_price,
n_vendor = n_vendor,
n_opinion = n_opinion)
print(current_table)
# A tibble: 1 x 4
product_description product_price n_vendor n_opinion
<chr> <chr> <chr> <chr>
1 Protector Funda Clear Cover Samsung Galaxy Note 8 14 14vendidos 2
You can loop over the code chunk above to get all the info.
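Note that remove_nt() is not an rvest or tidyverse function; it appears to be a small custom helper that strips tabs and newlines from the scraped text. A plausible definition (an assumption, not the original author's code):

# assumed helper: collapse tabs/newlines and trim surrounding whitespace
remove_nt <- function(x) {
  trimws(gsub("[\r\n\t]+", " ", x))
}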
Let's combine it all together
The following code should work; you can remove the 5-page limit to scrape all product information.
library(tidyverse)
library(rvest)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages <- c(url_base,
str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",
seq.int(from = 51,by = 50,length.out = 40)))
#create an empty table to store results
result_table <- tibble()
for(page in all_pages[1:5]){ #as an example, only scrape the first 5 pages
page_source <- read_html(page)
title <- html_nodes(page_source,'.item__info-title') %>% html_text()
price <- html_nodes(page_source,'.item__price') %>% html_text()
item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
temp_table <- tibble(title = title, price = price, item_link = item_link)
result_table <- bind_rows(result_table,temp_table)
}
#loop on result table(item_link):
product_table <- tibble()
for(i in 1:nrow(result_table)){
pagesource <- read_html(result_table[[i,"item_link"]])
n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
currency_symbol <- pagesource %>% html_node(".price-tag-symbol") %>% html_text()
n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
current_table <- tibble(product_description = product_description,
currency_symbol = currency_symbol,
product_price = product_price,
n_vendor = n_vendor,
n_opinion = n_opinion,
item_link = result_table[[i,"item_link"]])
product_table <- bind_rows(product_table,current_table)
}
Some issues
There are still some bugs in the code, for example:
On some item pages there are two elements that match the CSS selector, which may break the code. There are some possible solutions, though:
Store the result in a list instead of a table
Use a more precise CSS selector
Concatenate the strings whenever there is more than one match
etc.
You can choose whichever method fits your requirements.
Also, if you want to scrape at scale, you may want to use tryCatch to prevent errors from breaking your loop.
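For example, a minimal sketch of guarding one page read with tryCatch (the safe_read() name and the NULL fallback are my own choices):

# return NULL instead of erroring when a page fails to load
safe_read <- function(url) {
  tryCatch(read_html(url), error = function(e) NULL)
}

# inside the for-loop over result_table rows:
pagesource <- safe_read(result_table[[i, "item_link"]])
if (is.null(pagesource)) next  # skip this item and keep the loop going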
About APIs
An API is a completely different approach from web scraping; you may want to read some more tutorials about APIs if you want to go that route.