I have a tibble containing one column which stores a hyperlink in each row. Now I want to map over these links using map_dfr, passing the links one after another through read_html(.x[.x]) %>%
html_node(".body-copy-lg") %>% html_text. If I do so, I always end up with the error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=3].
This tells me that read_html is basically saying: "Hey, stop throwing more than one string at me at the same time."
So did I make a mistake in the mapper? Is this a bug? I really can't see why the mapper function does not grab each element one after another.
What I have tried so far:
target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"
adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]
bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))
report <- map(".read-more", ~html_nodes(bribe, .x) %>%
html_attr(.x[[1]][[1]][[1]], name = "href"))[[1]] %>%
as_tibble(.name_repair = "unique") %>%
bind_rows() %>%
rename( ...1 = value) %>%
adverts() %>%
map_dfr(~read_html(.x[.x]) %>%
html_node(".body-copy-lg") %>%
html_text)
Do not mind the call to rename(), which was needed to make adverts() usable in this case.
You're forgetting that most functions in R are vectorized, so map or apply functions are often unnecessary. In your case, map is needed only in the final step of getting the html text, because read_html accepts a single URL at a time.
The syntax you are using in map is also puzzling, and I think you should review ?map to get a better handle on it. For instance, you use multiple .x or extracted values where you should just be using .x to refer to the sub-element of the object you are iterating over.
library(tidyverse)
library(rvest)
target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"
adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]
bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))
report <- html_nodes(bribe, ".read-more") %>%
html_attr("href") %>%
as_tibble(.name_repair = "unique") %>%
filter(str_detect(value, target_regex, negate = TRUE)) %>%
mutate(text = map_chr(value, ~read_html(.x) %>%
html_node(".body-copy-lg") %>%
html_text))
report
# A tibble: 3 x 2
value text
<chr> <chr>
1 http://ipaidabribe.com/reports/paid/paid-bribe-to-settle-matter… "\r\n Place: Nelamangala Police Station, Bangalore\nDate of incident: 5th Jan 2020, 3PM…
2 http://ipaidabribe.com/reports/paid/paid-500-rs-bribe-at-nizamu… "\r\n My Brother Mahesh Prasad travelling on PNR number 4822171124 train no 12721 Ni…
3 http://ipaidabribe.com/reports/paid/drone-air-follow-focus-wire… "\r\n This new Silencer Air+ is a tremendously versatile and resourceful follow focus, z…
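To see why the original call failed, it may help to note that map functions treat a data frame as a list of columns, so inside map_dfr the .x was the whole link column (three strings) rather than a single link. Below is a minimal offline sketch of that difference; the inline HTML strings and the names pages, col_lengths and texts are made up for illustration and stand in for the real pages.

```r
library(purrr)
library(rvest)

# Hypothetical stand-ins for the three downloaded pages
pages <- c(
  "<html><body><p class='body-copy-lg'>first report</p></body></html>",
  "<html><body><p class='body-copy-lg'>second report</p></body></html>",
  "<html><body><p class='body-copy-lg'>third report</p></body></html>"
)

# Mapping over a data frame visits each COLUMN, so .x has length 3 here
col_lengths <- map_int(data.frame(value = pages), length)

# Mapping over the character vector visits one string at a time
texts <- map_chr(pages, ~ read_html(.x) %>%
  html_node(".body-copy-lg") %>%
  html_text())
```

This is why the answer maps over the value column inside mutate() instead of piping the whole tibble into map_dfr.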
Related
I have been working on some R code. The purpose is to collect the average word length and other stats about the words in a section of a website with 50 pages. Collecting the stats is no problem; that is the easy part. However, getting my code to collect the stats over all 50 pages is the hard part: it only ever seems to output information from the first page. See the code below and ignore the poor indentation.
install.packages(c('tidytext', 'tidyverse'))
library(tidyverse)
library(tidytext)
library(rvest)
library(stringr)
websitePage <- read_html('http://books.toscrape.com/catalogue/page-1.html')
textSort <- websitePage %>%
html_nodes('.product_pod a') %>%
html_text()
for (page_result in seq(from = 1, to = 50, by = 1)) {
link = paste0('http://books.toscrape.com/catalogue/page-',page_result,'.html')
page = read_html(link)
# Creates a tibble
textSort.tbl <- tibble(text = textSort)
textSort.tidy <- textSort.tbl %>%
unnest_tokens(word, text)
}
# Finds the average word length
textSort.tidy %>%
map(nchar) %>%
map(mean)
# Finds the most common words
textSort.tidy %>%
count(word, sort = TRUE)
# Removes the stop words and then finds most common words
textSort.tidy %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
# Counts the number of times the word "Girl" is in the text
textSort.tidy %>%
count(word) %>%
filter(word == "Girl")
You can use lapply/map to extract the text from multiple links.
library(rvest)
link <- paste0('http://books.toscrape.com/catalogue/page-',1:50,'.html')
result <- lapply(link, function(x) x %>%
read_html %>%
html_nodes('.product_pod a') %>%
html_text)
You can continue using lapply if you want to apply other functions to text.
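As a rough sketch of what "continuing with lapply" could look like, the snippet below computes the average word length per page with sapply. The result list here is a hand-written stand-in for the scraped titles (an assumption, so the example runs without network access), and avg_word_length is a name introduced for illustration.

```r
# Hypothetical stand-in for `result`: one character vector of titles per page
result <- list(
  c("A Light in the Attic", "Tipping the Velvet"),
  c("Soumission")
)

# Split each page's titles into words and average the word lengths
avg_word_length <- sapply(result, function(titles) {
  words <- unlist(strsplit(titles, "\\s+"))
  mean(nchar(words))
})

avg_word_length
```

To get statistics over all pages at once, unlist(result) can be passed through the same steps.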
I'm building a scraper that pulls the name of a player and the years he played, for thousands of different players. I have built an otherwise successful function to do this, but in some instances the table with the other half of the data I need (years played) does not exist. I'd like to add a way to tell the scraper to skip these instances. Here is the code:
(note: the object "url_final" is the list of active webpage URLs of which there are many)
library(rvest)
library(curl)
library(tidyverse)
library(httr)
df <- map_dfr(.x = url_final,
.f = function(x){Sys.sleep(.3); cat(1);
fyr <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
html_table() %>%
.[[1]]
fyr <- fyr %>%
select(1) %>%
mutate(name = str_extract(string = x, pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
})
Here is an example of an active page in which you can recreate the scrape by replacing "url_final" as the .x call in map_dfr with:
https://www.sports-reference.com/cbb/players/karl-aaker-1.html
Here is an example of one of the instances in which there is no table and thus returns an error breaking the loop of the scrape.
https://www.sports-reference.com/cbb/players/karl-aaker-1.html
How about adding tryCatch, which will skip any URL that raises an error?
library(tidyverse)
library(rvest)
df <- map_dfr(.x = url_final,
.f = function(x){Sys.sleep(.3); cat(1);
tryCatch({
fyr <- read_html(curl::curl(x,
handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
html_table() %>% .[[1]]
fyr <- fyr %>%
select(1) %>%
mutate(name = str_extract(string = x,
pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
}, error = function(e) message('Skipping url ', x))
})
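The same skip-on-error pattern can be tried without touching the network. In this sketch, fetch_row is a hypothetical stand-in for the scraping function: it fails for any input that does not start with "ok", the handler returns NULL, and the NULLs are dropped before binding the rows.

```r
# Hypothetical stand-in for the scraping step: errors when "the table is missing"
fetch_row <- function(x) {
  tryCatch({
    if (!grepl("^ok", x)) stop("no table on page")
    data.frame(name = x, stringsAsFactors = FALSE)
  }, error = function(e) {
    message("Skipping ", x, ": ", conditionMessage(e))
    NULL  # returned in place of a data frame
  })
}

rows <- lapply(c("ok-a", "bad-b", "ok-c"), fetch_row)

# Drop the skipped entries, then bind the remaining rows
df <- do.call(rbind, Filter(Negate(is.null), rows))
```

Printing the error message alongside the URL makes it easier to tell a missing table from, say, a connection problem.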
The code below works if I remove the Sys.sleep() from within the map() function. I tried to research the error ('Don't know how to pluck from a closure') but I haven't found much on that topic.
Does anyone know where I can find documentation on this error, and any help on why it is happening and how to prevent it?
library(rvest)
library(tidyverse)
library(stringr)
# lets assume 3 pages only to do it quickly
page <- (0:18)
# no need to create a list. Just a vector
urls = paste0("https://www.mlssoccer.com/players?page=", page)
# define this function that collects the player's name from a url
get_the_names = function( url){
url %>%
read_html() %>%
html_nodes("a.name_link") %>%
html_text()
}
# map the urls to the function that gets the names
players = map(urls, get_the_names) %>%
# turn into a single character vector
unlist() %>%
# make lower case
tolower() %>%
# replace the `space` to underscore
str_replace_all(" ", "-")
# Now create a vector of player urls
player_urls = paste0("https://www.mlssoccer.com/players/", players )
# define a function that reads the 3rd table of the url
get_the_summary_stats <- function(url){
url %>%
read_html() %>%
html_nodes("table") %>%
html_table() %>% .[[3]]
}
# lets read 3 players only to speed things up [otherwise it takes a significant amount of time to run...]
a_few_players <- player_urls[1:5]
# get the stats
tables = a_few_players %>%
# important step so I can name the rows I get in the table
set_names() %>%
#map the player urls to the function that reads the 3rd table
# note the `safely` wrap around the get_the_summary_stats' function
# since there are players with no stats and causes an error (eg.brenden-aaronson )
# the output will be a list of lists [result and error]
map(., ~{ Sys.sleep(5)
safely(get_the_summary_stats) }) %>%
# collect only the `result` output (the table) INTO A DATA FRAME
# There is also an `error` output
# also, name each row with the players name
map_df("result", .id = "player") %>%
#keep only the player name (remove the www.mls.... part)
mutate(player = str_replace(player, "https://www.mlssoccer.com/players/", "")) %>%
as_tibble()
tables <- tables %>% separate(Match,c("awayTeam","homeTeam"), extra= "drop", fill = "right")
purrr::safely(...) returns a function, so your map(., ~{ Sys.sleep(5); safely(get_the_summary_stats) }) is returning functions, not any data. In R, a "closure" is a function and its enclosing environment, which is why the later map_df("result") complains that it cannot pluck from a closure.
Tilde notation is a tidyverse-specific method of more-terse anonymous functions. Typically (e.g., with lapply) one would use lapply(mydata, function(x) get_the_summary_stats(x)). In tilde notation, the same thing is written as map(mydata, ~ get_the_summary_stats(.))
So, re-write to:
... %>% map(~ { Sys.sleep(5); safely(get_the_summary_stats)(.); })
From comments by @r2evans
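A small offline sketch of the distinction, using log as a stand-in for get_the_summary_stats (safe_log, res and ok are names made up for this example):

```r
library(purrr)

# safely() wraps a function and returns a NEW function, not a result
safe_log <- safely(log)
is.function(safe_log)

# The wrapped function still has to be called on each element
res <- map(list(10, "a"), ~ safe_log(.x))

res[[1]]$result  # the value of log(10)
res[[2]]$error   # the captured error condition for log("a")

# Keep only the successful results
ok <- compact(map(res, "result"))
```

Each element of res is a list with a result and an error component, exactly one of which is NULL.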
Does anyone know what the difference is between by_row and rowwise? I am trying to scrape 3 simple websites, and I can't seem to get either approach to work, so I'm not sure if I am just using purrr/dplyr wrong.
Data:
structure(list(beer_brewerid = c("8481", "3228", "10325"), link =
c("https://www.ratebeer.com/beer/8481/", "https://www.ratebeer.com/beer/3228/", "https://www.ratebeer.com/beer/10325/" ), scrapedname = c("", "", "")), .Names = c("beer_brewerid", "link", "scrapedname"), row.names = c(NA, 3L), class = "data.frame")
For every URL(or row), I would like to scrape the webpage using the following function:
dplyr approach:
table %>%
rowwise() %>%
read_html() %>%
extract2(2) %>%
html_nodes("#_brand4 span") %>%
html_text()
purrr approach:
#Apply function to each row
table %>%
by_row(..f = parserows(), collate = c("rows"), .to = "scrapedname")
#Takes in row
parserows = function(){
read_html() %>%
extract2(., 2) %>%
html_nodes("#_brand4 span") %>%
html_text()
}
In the purrr approach I keep getting an error where x is missing with no default. Shouldn't the value be coming from the row number? Otherwise I'd be writing a for loop specifying what index the row number is located at.
Using this magrittr piping, I keep getting timeout errors with my code. So:
How do I avoid timeout errors when using purrr/dplyr to iterate over all the elements in my df? Should I be looking at using tryCatch, or some sort of error handling mechanism to capture errors when they occur?
Is rowwise/by_row really meant for this task? I think these functions are meant to iterate over every element within a row, which is not exactly what I am trying to solve with the problem at hand. Thanks.
output = table$link %>%
extract() %>%
map(read_html) %>%
html_nodes(row,"#_brand4 span") %>%
html_text(row)
Here is what @Thomas K's suggestions could look like:
First with purrr only:
library(purrr)
library(dplyr)
library(httr)
library(xml2)
library(rvest)
table$link %>%
purrr::set_names() %>%
map(read_html) %>%
map(html_node, "#_brand4 span") %>%
map(html_text)
# $`https://www.ratebeer.com/beer/8481/`
# [1] "Föroya Bjór"
#
# $`https://www.ratebeer.com/beer/3228/`
# [1] "King Brewing Company"
#
# $`https://www.ratebeer.com/beer/10325/`
# [1] "Bavik-De Brabandere"
(Note that there is no need to use html_nodes (plural) here; html_node (singular) is enough, since only one node per page is wanted.)
A mixed dplyr/purrr alternative, which lets you keep each html doc in a tidy dataframe, if you need to reuse them:
res <-
table %>%
mutate(html = map(link, read_html),
brand_node = map(html, html_node, "#_brand4 span"),
scrapedname = map_chr(brand_node, html_text))
The html and brand_node columns are stored as external pointers and are not very print-friendly, so here is the resulting dataframe without them:
select(res, - html, - brand_node)
# beer_brewerid link scrapedname
# 1 8481 https://www.ratebeer.com/beer/8481/ Föroya Bjór
# 2 3228 https://www.ratebeer.com/beer/3228/ King Brewing Company
# 3 10325 https://www.ratebeer.com/beer/10325/ Bavik-De Brabandere
glimpse(res)
# Observations: 3
# Variables: 5
# $ beer_brewerid <chr> "8481", "3228", "10325"
# $ link <chr> "https://www.ratebeer.com/beer/8481/", "https://www.ratebeer.com/beer/3228/", "https://www.ratebeer.com/beer/10325/"
# $ scrapedname <chr> "Föroya Bjór", "King Brewing Company", "Bavik-De Brabandere"
# $ html <list> [<html lang="en">, <html lang="en">, <html lang="en">]
# $ brand_node <list> [<span itemprop="name">, <span itemprop="name">, <span itemprop="name">]
For the timeout issue, you could, also per @Thomas K's comment, simply wrap read_html in safely() or possibly() (which are indeed alternatives to tryCatch):
safe_read_html <- possibly(read_html, otherwise = read_html("<html></html>"))
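The fallback behaviour can be checked offline. In this sketch the first input is a made-up literal HTML string and the second is a file path assumed not to exist, so read_html fails on it and the empty fallback document is used instead (inputs and brands are names introduced here):

```r
library(purrr)
library(rvest)

safe_read_html <- possibly(read_html, otherwise = read_html("<html></html>"))

inputs <- c(
  # hypothetical page content, parsed as literal HTML
  "<html><body><div id='_brand4'><span>King Brewing Company</span></div></body></html>",
  # assumed not to exist on disk, so read_html() raises an error
  "this-file-does-not-exist.html"
)

brands <- inputs %>%
  map(safe_read_html) %>%
  map(html_node, "#_brand4 span") %>%
  map_chr(html_text)
```

The failed input ends up as NA rather than stopping the whole pipeline.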
But to address the (possible) real issue that you're going too hard on the server, I would suggest httr::RETRY() that lets you, well, retry, with "exponential backoff times":
safe_retry_read_html <- possibly(~ read_html(RETRY("GET", url = .x)), otherwise = read_html("<html></html>"))
A good practice when scraping is to go real gentle on the server, so you could even manually add an offset time before each request, with Sys.sleep(1 + runif(1)) for instance.
table$link %>%
c("https://www.wrong-url.foobar") %>%
purrr::set_names() %>%
map(~ {
Sys.sleep(1 + runif(1))
safe_retry_read_html(.x)
}) %>%
map(html_node, "#_brand4 span") %>%
map_chr(html_text)
# https://www.ratebeer.com/beer/8481/ https://www.ratebeer.com/beer/3228/
# "Föroya Bjór" "King Brewing Company"
# https://www.ratebeer.com/beer/10325/ https://www.wrong-url.foobar
# "Bavik-De Brabandere" NA
Lastly, there is your separate question about by_row()/rowwise().
First, note that by_row has been removed from the development version of purrr, and moved to a separate package, purrrlyr, where it's deprecated anyway, and it's recommended to "use a combination of: tidyr::nest(); dplyr::mutate(); purrr::map()"
From help("rowwise"), rowwise is mostly meant to be "used for the results of do() when you create list-variables".
So, no, neither is "really meant for this task", they would be superfluous.
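For per-row work, the mutate() plus map() combination is usually all that is needed. A minimal sketch, where id_from_link is a hypothetical per-link function standing in for the scraping step and beers is a cut-down copy of the question's data:

```r
library(dplyr)
library(purrr)

beers <- data.frame(
  link = c("https://www.ratebeer.com/beer/8481/",
           "https://www.ratebeer.com/beer/3228/"),
  stringsAsFactors = FALSE
)

# Hypothetical per-link function; a real one would call read_html() etc.
id_from_link <- function(link) sub(".*/beer/(\\d+)/$", "\\1", link)

# map_chr() calls the function once per element of the link column
beers <- beers %>%
  mutate(beer_id = map_chr(link, id_from_link))
```

Because map_chr iterates over the column's elements, no rowwise() or by_row() is involved.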
Question
I wanted to rvest specific parts of the websites (car sales platform).
The CSS is frankly too confusing for me to figure out what's wrong on my own.
#### scraping the website www.otomoto.pl with used cars #####
baseURL_otomoto = "https://www.otomoto.pl/osobowe/?page="
i <- 1
for ( i in 1:7000 )
{
link = paste0(baseURL_otomoto,i)
out = read_html(link)
print(i)
print(link)
### building year
build_year = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
html_text() %>%
str_replace_all("\n","") %>%
str_replace_all("\r","") %>%
str_trim()
mileage = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[2]') %>%
html_text() %>%
str_replace_all("\n","") %>%
str_replace_all("\r","") %>%
str_trim()
volume = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[3]') %>%
html_text() %>%
str_replace_all("\n","") %>%
str_replace_all("\r","") %>%
str_trim()
fuel_type = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[4]') %>%
html_text() %>%
str_replace_all("\n","") %>%
str_replace_all("\r","") %>%
str_trim()
price = html_nodes(out, xpath = '//div[@class="offer-item__price"]') %>%
html_text() %>%
str_replace_all("\n","") %>%
str_replace_all("\r","") %>%
str_trim()
link = html_nodes(out, xpath = '//div[@class="offer-item__title"]') %>%
html_text() %>%
str_replace_all("\n","") %>%
str_replace_all("\r","") %>%
str_trim()
offer_details = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul') %>%
html_text() %>%
str_replace_all("\n","") %>%
str_replace_all("\r","") %>%
str_trim()
Any guesses what might be the reason for this behaviour?
PS#1.
How can I rvest all build_type, mileage and fuel_type data from the offers available on the analysed website at once, as a data.frame? Using classes (xpath = '//div[@class=...) didn't work in my case.
PS#2.
I wanted to rvest details of the actual offers using, for instance:
gear_type = html_nodes(out, xpath = '//*[@id="parameters"]/ul[1]/li[10]/div') %>%
html_text() %>%
str_replace_all("\n","") %>%
str_replace_all("\r","") %>%
str_trim()
the arguments
in ul[a] are for a in (1:2) &
in li[b] are for b in (1:12)
Unfortunately, this approach fails, as the resulting data frame is empty. Any guesses why?
First and foremost, learn about CSS selectors and XPath. Your selectors are very long and extremely fragile (some of them did not work for me at all a mere two weeks later). For example, instead of:
html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
html_text()
you can write:
html_nodes(out, css="[data-code=year]") %>% html_text()
Second, read the documentation of the libraries that you use. The str_replace_all pattern may be a regular expression, which saves you one call (use str_replace_all("[\n\r]", "") instead of str_replace_all("\n","") %>% str_replace_all("\r","")). html_text can do text trimming for you, which means that str_trim() is not needed at all.
Third, if you copy-paste some code, step back and think about whether a function wouldn't be a better solution; usually it would. In your case, personally, I would probably skip the str_replace_all calls until the data cleaning step, when I would call them on the data.frame holding the entire scraped data.
To create a data.frame from your data, call the data.frame() function with column names and content, like this:
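As one possible shape for such a function, the sketch below wraps the select, trim and clean steps in a helper. The name extract_field and the tiny inline HTML document are assumptions made for illustration; the data-code attributes mirror the ones used in this answer.

```r
library(rvest)
library(stringr)

# Hypothetical miniature of a listing page
out <- read_html(
  "<ul><li data-code='year'>\n  2015\r\n</li><li data-code='mileage'> 118 000 km </li></ul>"
)

# One helper instead of a copy-pasted pipeline per field
extract_field <- function(doc, css) {
  doc %>%
    html_nodes(css) %>%
    html_text(trim = TRUE) %>%          # trimming replaces str_trim()
    str_replace_all("[\n\r]", "")       # one regex instead of two calls
}

build_year <- extract_field(out, "[data-code=year]")
mileage    <- extract_field(out, "[data-code=mileage]")
```

Adding a new field then takes one line with a different selector.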
data.frame(build_year = build_year,
mileage = mileage,
volume = volume,
fuel_type = fuel_type,
price = price,
link = link,
offer_details = offer_details)
Or you could initialize the data.frame with one column only and then add further vectors as columns:
output_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
output_df$volume <- html_nodes(out, css="[data-code=engine_capacity]") %>%
html_text(TRUE)
Finally, you should note that data.frame columns must all be the same length, while some of the data that you scrape is optional. At the moment of writing this answer, a few offers had no engine capacity and no offer description. You have to use two selector calls in succession (as a single CSS selector will not match what doesn't exist). But even then, html_nodes will silently drop missing data. This can be worked around by piping the html_nodes output to an html_node call:
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
html_node("[data-code=engine_capacity]") %>%
html_text(TRUE)
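The difference is easy to see on a toy document. In this sketch the inline HTML is made up: it contains two offers, the second of which has no engine capacity (doc, flat and per_offer are names introduced for the example).

```r
library(rvest)

# Hypothetical page with two offers; the second lacks engine_capacity
doc <- read_html(paste0(
  "<html><body>",
  "<ul class='offer-item__params'>",
  "<li data-code='year'>2015</li><li data-code='engine_capacity'>1 600 cm3</li>",
  "</ul>",
  "<ul class='offer-item__params'><li data-code='year'>2018</li></ul>",
  "</body></html>"
))

# A single selector only returns what exists: one value for two offers
flat <- doc %>%
  html_nodes("[data-code=engine_capacity]") %>%
  html_text(trim = TRUE)

# html_nodes() then html_node() yields one entry per offer, NA where missing
per_offer <- doc %>%
  html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(trim = TRUE)
```

Only per_offer has the same length as the other columns, so only it can be assigned into the data.frame safely.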
The final version of my approach to the loop internals is below. Just make sure that you initialize an empty data.frame before calling it and that you merge the output of the current iteration with the final data frame (using, for example, rbind), or each iteration will overwrite the results of the previous one. Or you could use do.call(rbind, lapply()), which is idiomatic R for such a task.
As a side note, when scraping a large amount of quickly changing data, consider decoupling the data downloading and data processing steps. Imagine that there is some corner case that you haven't accounted for which causes R to terminate. How will you proceed if such a condition appears in the middle of your iterations? The longer you stay on one page, the more duplicates you introduce (as more offers appear and existing ones are pushed down onto further pages), and the more offers you miss (as sales are concluded and offers disappear forever).
current_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
current_df$mileage = html_nodes(out, css="[data-code=mileage]") %>%
html_text(TRUE)
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
html_node("[data-code=engine_capacity]") %>%
html_text(TRUE)
current_df$fuel_type = html_nodes(out, css="[data-code=fuel_type]") %>%
html_text(TRUE)
current_df$price = out %>% html_nodes(xpath="//div[@class='offer-price']//span[contains(@class, 'number')]") %>%
html_text(TRUE)
current_df$link = out %>% html_nodes(css = "div.offer-item__title h2 > a") %>%
html_text(TRUE) %>%
str_replace_all("[\n\r]", "")
current_df$offer_details = out %>% html_nodes("div.offer-item__title") %>%
html_node("h3") %>%
html_text(TRUE)
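The do.call(rbind, lapply()) idiom mentioned above could be sketched as follows. The two inline HTML strings stand in for downloaded result pages, and parse_page is a hypothetical helper that holds a cut-down version of the loop internals; both are assumptions for illustration.

```r
library(rvest)

# Hypothetical stand-ins for two downloaded result pages
pages <- c(
  paste0("<html><body><ul class='offer-item__params'>",
         "<li data-code='year'>2015</li><li data-code='mileage'>118 000 km</li>",
         "</ul></body></html>"),
  paste0("<html><body><ul class='offer-item__params'>",
         "<li data-code='year'>2018</li><li data-code='mileage'>40 000 km</li>",
         "</ul><ul class='offer-item__params'>",
         "<li data-code='year'>2011</li>",
         "</ul></body></html>")
)

# One data.frame per page, one row per offer, NA for missing fields
parse_page <- function(html) {
  offers <- read_html(html) %>% html_nodes("ul.offer-item__params")
  data.frame(
    build_year = offers %>% html_node("[data-code=year]") %>% html_text(trim = TRUE),
    mileage    = offers %>% html_node("[data-code=mileage]") %>% html_text(trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Bind all per-page data frames into one
all_offers <- do.call(rbind, lapply(pages, parse_page))
```

With real pages, the lapply input would be the vector of page URLs (or of files saved to disk beforehand, if downloading and processing are decoupled).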