Extract div class text and sub tables in rvest - web-scraping

I am trying to recreate a table from this website under "Battle Pass Rewards." The final result should be a data.frame with each of the following areas as different columns:
[Screenshot: target scrape information]
The table has three "tr" tags, but rvest is merging the 2nd and 3rd on scrape. I'm not sure why.
library(rvest)

fnite_s2 <- read_html("https://fortnite.fandom.com/wiki/Season_2")

fnite_s2 %>%
  html_table(fill = TRUE) %>%
  .[2]
For example, "Blue Squire Outfit" is scrapped when "Blue Squire" is in a separate td tag from
"Outfit".
The other issue is that the rarity (the blue background) is set in a div tag such as the following:
<div class="rarity-background uncommon">
I need to be able to scrape the "uncommon" part of the div tag and add it as another column as well.
EDIT: I was able to grab most things, but I'm still stuck on grabbing the div tag information.
library(rvest)
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

fnite_bp <-
  read_html("https://fortnite.fandom.com/wiki/Season_2") %>%
  html_nodes(".listing") %>%
  html_table(fill = T) %>%
  # Convert to table
  as_tibble(.name_repair = "unique") %>%
  # Transpose to long
  t() %>%
  # Convert back to table
  as_tibble() %>%
  # Add tier number column
  rownames_to_column(var = "tier") %>%
  # Convert to long to mutate content types for both tiers
  pivot_longer(-tier, names_to = "type", values_to = "content_string") %>%
  mutate(
    type = if_else(type == "V1", "free", "paid"),
    content_name = str_extract(content_string, '[^\n]+'),
    content_type = str_replace(content_string, content_name, ""),
    content_type = str_replace(content_type, "Free ", ""),
    amount = as.integer(str_extract(content_string, "\\d+")),
    amount = if_else(type == "paid" | (type == "free" & content_string != ""),
                     replace_na(amount, 1), NA_integer_)
  )
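For the rarity itself, one option is to read the class attribute instead of the rendered text. A minimal sketch, assuming the class quoted above is used consistently and that the nodes come back in the same order as the table rows (the selector is a guess, not verified against the page):

library(rvest)
library(stringr)

page <- read_html("https://fortnite.fandom.com/wiki/Season_2")

# Grab every div whose class starts with "rarity-background" and keep the
# second class token, e.g. "uncommon" or "rare".
rarity <- page %>%
  html_nodes("div[class^='rarity-background']") %>%
  html_attr("class") %>%
  str_remove("^rarity-background\\s*")

If the order matches the rewards table, this vector can then be bound on as an extra column.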

Related

Using a for loop to get bulk tweets from the Twitter API v2 endpoint in R

I am trying to collect more tweets than is allowed in a single query, hence I am using a for loop to automate this.
library(httr)
library(jsonlite)
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)
library(readr)

tweets <- tibble()

for (i in 1:10) {
  response <- httr::GET(url = url_tweet,
                        httr::add_headers(.headers = headers),
                        query = params)

  json_data <- httr::content(response, as = "text") %>%
    fromJSON(flatten = TRUE)

  data_sep <- enframe(unlist(json_data)) %>%
    mutate(
      id2  = name %>% str_extract("[0-9]+$"),  # ensure unique rows
      name = name %>% str_remove("[0-9]+$") %>% str_remove("^data.")
    ) %>%
    pivot_wider(names_from = name, values_from = value) %>%
    select(tweet_id = id, text,
           user_id   = includes.users.id,
           user_name = includes.users.username,
           likes     = public_metrics.like_count,
           retweets  = public_metrics.retweet_count,
           quotes    = public_metrics.quote_count) %>%
    type_convert()

  tweets <- rbind(tweets, data_sep)
}
I have run the code individually and there is nothing wrong with any of it, but when I try to loop it I get this error
Error in `select()`:
! Can't subset columns that don't exist.
x Column `id` doesn't exist.
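One hedged guess at the cause: if a single iteration returns no data element (an empty page of results or a rate-limit response), pivot_wider never produces an id column and select() fails. A minimal defensive check along those lines, purely as an illustration and not a confirmed fix:

# inside the for loop, right after fromJSON():
if (is.null(json_data$data) || length(json_data$data) == 0) {
  message("No data returned on iteration ", i, "; skipping")
  next
}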

Web page not found in web scraping, how can I find it in R?

I've been working with R for about a year and love it. I've gotten into text mining recently and have had some difficulty. I'm trying to create a data frame with information from a website. I've been scraping the data and have been able to create two variables successfully. In attempting to create the third variable, it's not working. When I view the table that I've made, the content for that variable says "Sorry webpage cannot be found." But I know it's there! Any thoughts? Thanks everyone!
link = "https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391/"
page = read_html(link)
title = page %>% html_nodes(".newsLst_mod a") %>% html_text()
slinks = page %>% html_nodes(".newsLst_mod a") %>%
html_attr("href") %>% paste("https://www.fmprc.gov.cn", ., sep = "")
date = page %>% html_nodes(".newsLst_mod span") %>% html_text()
Somewhere here is where I run into trouble... I get 'p' when using Selector Gadget and put that in the html_nodes function; however, this doesn't seem to work and I'm coming up empty. If I adjust the scraping a little, the table might have nothing in that column when I view it.
get_s = function(slinks) {
  speeches_link = read_html(slinks)
  speech_words = speeches_link %>% html_nodes("p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_words)
}

words = sapply(slinks, FUN = get_s)
speeches = data.frame(title, date, words, stringsAsFactors = FALSE)

[Screenshot: what the table looks like]
The prefix that you need to paste onto each relative URL is https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391, not just the domain.
Try the following:
library(rvest)

slinks = page %>% html_nodes(".newsLst_mod a") %>%
  html_attr("href") %>% trimws(whitespace = '\\.') %>%
  paste0("https://www.fmprc.gov.cn/mfa_eng/wjdt_665385/zyjh_665391", .)

get_s = function(slinks) {
  speeches_link = read_html(slinks)
  speech_words = speeches_link %>% html_nodes("p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_words)
}

words = sapply(slinks, FUN = get_s)
words
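If you want the same final table as in the question, the last step would be the question's own data.frame call:

# Assemble the table of speeches from the pieces scraped above.
speeches = data.frame(title, date, words, stringsAsFactors = FALSE)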

Web Scraping Across multiple pages R

I have been working on some R code. The purpose is to collect the average word length and other stats about the words in a section of a website with 50 pages. Collecting the stats is no problem and is the easy part. However, getting my code to collect the stats over all 50 pages is the hard part; it only ever seems to output information from the first page. See the code below.
install.packages(c('tidytext', 'tidyverse'))
library(tidyverse)
library(tidytext)
library(rvest)
library(stringr)

websitePage <- read_html('http://books.toscrape.com/catalogue/page-1.html')
textSort <- websitePage %>%
  html_nodes('.product_pod a') %>%
  html_text()

for (page_result in seq(from = 1, to = 50, by = 1)) {
  link = paste0('http://books.toscrape.com/catalogue/page-', page_result, '.html')
  page = read_html(link)

  # Creates a tibble
  textSort.tbl <- tibble(text = textSort)
  textSort.tidy <- textSort.tbl %>%
    unnest_tokens(word, text)
}
# Finds the average word length
textSort.tidy %>%
  map(nchar) %>%
  map(mean)

# Finds the most common words
textSort.tidy %>%
  count(word, sort = TRUE)

# Removes the stop words and then finds most common words
textSort.tidy %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)

# Counts the number of times the word "Girl" is in the text
textSort.tidy %>%
  count(word) %>%
  filter(word == "Girl")
You can use lapply/map to extract the text from multiple links.
library(rvest)

link <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')

result <- lapply(link, function(x) x %>%
                   read_html %>%
                   html_nodes('.product_pod a') %>%
                   html_text)
You can continue using lapply if you want to apply other functions to text.
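For example, a sketch of one way to continue, reusing the tidytext calls from the question (the object names are the question's own):

library(tidyverse)
library(tidytext)

# Flatten all 50 pages into one tibble, then tokenise as before.
textSort.tidy <- tibble(text = unlist(result)) %>%
  unnest_tokens(word, text)

# Average word length across every page
textSort.tidy %>%
  summarise(mean_length = mean(nchar(word)))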

Map a tbl of hyperlinks into read_html

I have a tibble containing one column which stores a hyperlink in each row. Now I want to map over these links using map_dfr, passing the links one after another through read_html(.x[.x]) %>% html_node(".body-copy-lg") %>% html_text. If I do so, I always end up with this error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=3].
This tells me that read_html is basically saying: "Hey, stop throwing more than one string at me at a time."
So did I make a mistake in the mapper? Is this a bug? I really can't see why the mapper function does not grab each element one after another.
What I tried so far:
target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"
adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]
bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))
report <- map(".read-more", ~html_nodes(bribe, .x) %>%
html_attr(.x[[1]][[1]][[1]], name = "href"))[[1]] %>%
as_tibble(.name_repair = "unique") %>%
bind_rows() %>%
rename( ...1 = value) %>%
adverts() %>%
map_dfr(~read_html(.x[.x]) %>%
html_node(".body-copy-lg") %>%
html_text)
Do not mind the call of rename(), which is basically something that needed to be done to make adverts() usable in this case.
You're forgetting that most functions in R are vectorized, and that using map or apply functions is often unnecessary. In your case, it is only needed in the final step of getting the html text.
The syntax you are using in map() is also puzzling, and I think you should review ?map to get a better handle on it. For instance, you use multiple .x references or extracted values where you should just be using .x to refer to the sub-element of the object you are iterating over.
library(tidyverse)
library(rvest)

target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"

adverts <- function(df) df[!grepl(target_regex, df$...1, perl = T), ]

bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))

report <- html_nodes(bribe, ".read-more") %>%
  html_attr("href") %>%
  as_tibble(.name_repair = "unique") %>%
  filter(str_detect(value, target_regex, negate = TRUE)) %>%
  mutate(text = map_chr(value, ~read_html(.x) %>%
                          html_node(".body-copy-lg") %>%
                          html_text))

report
# A tibble: 3 x 2
value text
<chr> <chr>
1 http://ipaidabribe.com/reports/paid/paid-bribe-to-settle-matter… "\r\n Place: Nelamangala Police Station, Bangalore\nDate of incident: 5th Jan 2020, 3PM…
2 http://ipaidabribe.com/reports/paid/paid-500-rs-bribe-at-nizamu… "\r\n My Brother Mahesh Prasad travelling on PNR number 4822171124 train no 12721 Ni…
3 http://ipaidabribe.com/reports/paid/drone-air-follow-focus-wire… "\r\n This new Silencer Air+ is a tremendously versatile and resourceful follow focus, z…

rvest web content scraping issue / car trading website

Question
I wanted to rvest specific parts of a website (a car sales platform).
The CSS is frankly too confusing for me to figure out what's wrong on my own.
#### scraping the website www.otomoto.pl with used cars #####
library(rvest)
library(stringr)

baseURL_otomoto = "https://www.otomoto.pl/osobowe/?page="

i <- 1
for (i in 1:7000) {
  link = paste0(baseURL_otomoto, i)
  out = read_html(link)
  print(i)
  print(link)

  ### building year
  build_year = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()

  mileage = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[2]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()

  volume = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[3]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()

  fuel_type = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[4]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()

  price = html_nodes(out, xpath = '//div[@class="offer-item__price"]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()

  link = html_nodes(out, xpath = '//div[@class="offer-item__title"]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()

  offer_details = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
}
Any guesses what might be the reason for this behaviour?
PS#1.
How do I rvest all build_year, mileage and fuel_type data from the offers available on the analysed website at once, as a data.frame? Using classes (xpath = '//div[@class=...') didn't work in my case.
PS#2.
I wanted to rvest details of the actual offers using, for instance:
gear_type = html_nodes(out, xpath = '//*[@id="parameters"]/ul[1]/li[10]/div') %>%
  html_text() %>%
  str_replace_all("\n", "") %>%
  str_replace_all("\r", "") %>%
  str_trim()
where the arguments in ul[a] run over a in 1:2 and in li[b] over b in 1:12.
Unfortunately, this concept fails, as the resulting data frame is empty. Any guesses why?
First and foremost, learn about CSS selectors and XPath. Your selectors are very long and extremely fragile (some of them did not work for me at all, a mere two weeks later). For example, instead of:
html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
  html_text()
you can write:
html_nodes(out, css="[data-code=year]") %>% html_text()
Second, read the documentation of the libraries that you use. The str_replace_all pattern may be a regular expression, which saves you one call (use str_replace_all("[\n\r]", "") instead of str_replace_all("\n", "") %>% str_replace_all("\r", "")). html_text can do text trimming for you, which means that str_trim() is not needed at all.
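Put together, those two points collapse the question's cleaning pipeline to something like this (just a sketch using the data-code selector shown above; out is the page object from the question):

build_year = html_nodes(out, css = "[data-code=year]") %>%
  html_text(trim = TRUE) %>%          # built-in trimming replaces str_trim()
  str_replace_all("[\n\r]", "")       # one regex call instead of two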
Third, if you copy-paste some code, step back and think whether a function wouldn't be a better solution; usually it would. In your case, personally, I would probably skip the str_replace_all calls until the data-cleaning step, when I would call them on the data.frame holding the entire scraped data.
To create a data.frame from your data, call the data.frame() function with column names and content, like this:
data.frame(build_year = build_year,
           mileage = mileage,
           volume = volume,
           fuel_type = fuel_type,
           price = price,
           link = link,
           offer_details = offer_details)
Or you could initialize the data.frame with one column only and then add further vectors as columns:
output_df <- data.frame(build_year = html_nodes(out, css = "[data-code=year]") %>% html_text(TRUE))
output_df$volume <- html_nodes(out, css = "[data-code=engine_capacity]") %>%
  html_text(TRUE)
Finally, you should note that data.frame columns must all be the same length, while some of the data you scrape is optional. At the time of writing this answer, a few offers had no engine capacity and no offer description. You have to use two selector calls in succession (as a single CSS selector will not match what doesn't exist), but html_nodes on its own will silently drop missing data. This can be worked around by piping the html_nodes output into an html_node call, which returns NA for nodes it cannot find:
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)
The final version of my approach to the loop internals is below. Just make sure that you initialize an empty data.frame before the loop and that you merge the output of the current iteration into the final data frame (using, for example, rbind), or each iteration will overwrite the results of the previous one. Or you could use do.call(rbind, lapply(...)), which is idiomatic R for such a task. A sketch of that surrounding loop appears after the code below.
As a side note, when scraping large amounts of quickly changing data, consider decoupling the data-downloading and data-processing steps. Imagine that there is some corner case you haven't accounted for which causes R to terminate. How will you proceed if such a condition appears in the middle of your iterations? The longer you spend on one page, the more duplicates you introduce (as new offers appear and existing ones are pushed down to later pages) and the more offers you miss (as sales are concluded and offers disappear forever).
current_df <- data.frame(build_year = html_nodes(out, css = "[data-code=year]") %>% html_text(TRUE))

current_df$mileage = html_nodes(out, css = "[data-code=mileage]") %>%
  html_text(TRUE)

current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)

current_df$fuel_type = html_nodes(out, css = "[data-code=fuel_type]") %>%
  html_text(TRUE)

current_df$price = out %>% html_nodes(xpath = "//div[@class='offer-price']//span[contains(@class, 'number')]") %>%
  html_text(TRUE)

current_df$link = out %>% html_nodes(css = "div.offer-item__title h2 > a") %>%
  html_text(TRUE) %>%
  str_replace_all("[\n\r]", "")

current_df$offer_details = out %>% html_nodes("div.offer-item__title") %>%
  html_node("h3") %>%
  html_text(TRUE)
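A sketch of the surrounding loop described above (the accumulation pattern is the one suggested earlier; output_df is just an illustrative name):

output_df <- data.frame()  # initialize an empty data.frame before the loop

for (i in 1:7000) {
  out <- read_html(paste0("https://www.otomoto.pl/osobowe/?page=", i))

  # ... build current_df for this page exactly as above ...

  # merge this iteration's results into what has been collected so far
  output_df <- rbind(output_df, current_df)
}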
