Web Scraping returns empty in R

I'm trying to scrape prices from Bloomberg. I can get the current price as shown below, but I can't get the previous price. What's wrong?
library(rvest)
url <- "https://www.bloomberg.com/quote/WORLD:IND"
price <- read_html(url) %>%
html_nodes("div.overviewRow__66339412a5 span.priceText__06f600fa3e") %>%
html_text()
prevprice <- read_html(url) %>%
html_nodes("div.value__7e29a7c90d") %>%
html_text() #returns 0
prevprice <- read_html(url) %>%
html_nodes(xpath = '//section') %>%
html_text() %>%
as.data.frame() #didn't find the price
Thanks in advance.

So, there are at least two initial options:
Extract from the script tag that the info is pulled from. When the browser runs JavaScript, this info is used to populate the page as you see it. With rvest/httr, JavaScript is not run, so you would need to extract the value from the script tag rather than from where it ends up on the rendered webpage.
Or, you can calculate the previous price from the percentage change and the current price. There may be a very small margin of inaccuracy here due to rounding.
I show both of the above options in the code below.
I've also adapted the CSS selector list to use attribute = value selectors with the starts-with operator (^). This makes the code more robust, as the classes in the HTML appear to be dynamic, with only the start of the class attribute value being stable.
library(httr2)
library(tidyverse)
library(rvest)
url <- "https://www.bloomberg.com/quote/WORLDT:IND"
headers <- c("user-agent" = "mozilla/5.0")
page <- request(url) |>
(\(x) req_headers(x, !!!headers))() |>
req_perform() |>
resp_body_html()
# extract direct
prev_price <- page |>
html_text() |>
stringr::str_match("previousClosingPriceOneTradingDayAgo%22%3A(\\d+\\.?\\d+?)%2C") |>
(\(m) m[, 2])() # the base pipe |> has no "." placeholder, so subset via an anonymous function
curr_price <- page |>
html_element("[class^=priceText]") |>
html_text() |>
str_replace_all(",", "") |>
as.numeric()
# calculate from the current price and the percentage change (keeping the sign of the change)
change <- page |>
html_element("[class^=changePercent]") |>
html_text() |>
str_extract("-?[\\d\\.]+") |>
as.numeric()
prev_price_calculated <- curr_price / (1 + change / 100)
print(curr_price)
print(change)
print(prev_price)
print(prev_price_calculated)
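If you would rather target the script content explicitly for option 1, here is a minimal sketch that searches only the script tags rather than the full page text; it assumes Bloomberg still embeds the value URL-encoded under the same key.
```
# Sketch: pull the URL-encoded payload from the script tags only and match
# the same key as above (assumes the site still ships the data this way).
script_text <- page |>
  html_elements("script") |>
  html_text() |>
  paste(collapse = " ")

prev_price_from_script <- as.numeric(
  stringr::str_match(script_text, "previousClosingPriceOneTradingDayAgo%22%3A(\\d+\\.?\\d+?)%2C")[, 2]
)
```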

Related

R scraping when SelectorGadget cannot find valid paths

I tried to scrape the data for each country from interactive pie charts here: https://transparencyreport.google.com/eu-privacy/overview?site_types=start:1453420800000;end:1633219199999;country:&lu=site_types
But SelectorGadget does not allow me to select the data points on the pie charts. How do I resolve this?
library(rvest)
library(dplyr)
link = "https://transparencyreport.google.com/eu-privacy/overview?site_types=start:1453420800000;end:1633219199999;country:&lu=site_types"
page = read_html(link)
percentage = page %>% html_nodes("#content_types div") %>% html_text()
"#content_types div" returns void.
If you inspect the page and look at the "Network" tab, you can see the API call being made to get the data. The end number in the request is the last millisecond of today. There is some junk at the beginning of the response, but the rest is JSON. You'll have to figure out what the category numbers mean; maybe there is documentation of the API somewhere.
library(magrittr)
link <- "https://transparencyreport.google.com/transparencyreport/api/v3/"
parms <- paste0("europeanprivacy/siteinfo/urlsbycontenttype?start=1453420800000&end=",
1000 * ((Sys.Date() + 1) %>% as.POSIXct() %>% as.numeric()) - 1)
page <- httr::GET(paste0(link, parms))
data <- page %>% httr::content(as = "text") %>%
substr(., regexpr("\\[\\[.*\\]\\]", .), .Machine$integer.max) %>%
jsonlite::fromJSON() %>% .[[1]] %>% .[[2]] %>% as.data.frame()

Fix error in a For Loop used to extract reviews of a product given different urls

I'm trying to extract the reviews of a product on Amazon. The reviews sit at the same URL with different page numbers. Run manually, the script below works, but I have to change the page number in the URL and the name of the tibble by hand and re-run it each time to get a new tibble.
Since that is quite tedious for almost 70 pages, I tried to write a for loop (shown below the manual version) to do the same thing, but it gives me an error.
MANUAL
```
library(tidyr)
library(rvest)
url_reviews <- "https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_16?ie=UTF8&reviewerType=all_reviews&pageNumber=16"
doc <- read_html(url_reviews) # Assign results to `doc`
# Review Title
doc %>%
html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
html_text() -> review_title
# Review Text
doc %>%
html_nodes("[class='a-size-base review-text review-text-content']") %>%
html_text() -> review_text
# Number of stars in review
doc %>%
html_nodes("[data-hook='review-star-rating']") %>%
html_text() -> review_star
# Return a tibble
page_16 <- data.frame(review_title,
                      review_text,
                      review_star,
                      page = 16)
```
FOR LOOP
```
range <- 12:82
url_max <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", range, "?ie=UTF8&reviewerType=all_reviews&pageNumber=", range)
for (i in 1:length(url_max)) {
  doc <- read_html(url_max[i]) # Assign results to `doc`
  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title
  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text
  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star
  paste0("page_", range) <- tibble(review_title,
                                   review_text,
                                   review_star,
                                   page = paste0("a", i))
}
```
Here's another alternative that defines a function and then uses lapply() to run it sequentially.
This approach may also be helpful for repeating the scrape for different products. The function accepts two parameters: the first, i, is the page number and the second, product, is the product for which you are gathering reviews. The function constructs the url by pasting in the appropriate page number.
While I used lapply(), the function below could also be dropped into the map_df() call in Ronak's answer, which saves the explicit bind_rows() step; a sketch of that variant follows the code.
library(dplyr)
library(rvest)
library(stringr)
retrieve_reviews <- function(i, product) {
  urlstr <- "https://www.amazon.it/product-reviews/${product}/ref=cm_cr_getr_d_paging_btm_next_${i}?ie=UTF8&reviewerType=all_reviews&pageNumber=${i}"
  url <- str_interp(urlstr, list(product = product, i = i))
  doc <- read_html(url) # Assign results to `doc`
  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title
  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text
  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star
  return(tibble(
    title = review_title,
    text = review_text,
    star = review_star,
    page = paste0("a", i)
  ))
}
range <- 12:82
product <- "B07WTHVQZH"
reviews <- lapply(range, retrieve_reviews, product) %>%
bind_rows()
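For reference, the map_df() variant mentioned above is just a drop-in replacement for the lapply()/bind_rows() pair; a minimal sketch:
```
# Sketch: same retrieve_reviews() function, run through purrr::map_df()
# instead of lapply() + bind_rows().
library(purrr)
reviews <- map_df(range, retrieve_reviews, product = product)
```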
You can use map_df from purrr to run the loop.
library(rvest)
page_numbers <- 12:82
purrr::map_df(page_numbers, ~{
  url_reviews <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_16?ie=UTF8&reviewerType=all_reviews&pageNumber=", .x)
  doc <- read_html(url_reviews) # Assign results to `doc`
  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title
  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text
  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star
  # Return a tibble
  data.frame(review_title,
             review_text,
             review_star,
             page = .x)
}) -> result
result
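One caveat when looping over roughly 70 pages: a single failed request aborts the whole map_df() call. Below is a hedged sketch of one way to guard against that by wrapping each page scrape in tryCatch() and pausing between requests; the 1-second delay is an arbitrary choice, not something the site documents.
```
# Sketch: skip pages that fail to download instead of aborting the loop,
# and pause briefly between requests.
library(rvest)

scrape_page <- function(page_number) {
  url_reviews <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_16?ie=UTF8&reviewerType=all_reviews&pageNumber=", page_number)
  Sys.sleep(1) # arbitrary politeness delay
  tryCatch({
    doc <- read_html(url_reviews)
    data.frame(
      review_title = doc %>% html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>% html_text(),
      review_text  = doc %>% html_nodes("[class='a-size-base review-text review-text-content']") %>% html_text(),
      review_star  = doc %>% html_nodes("[data-hook='review-star-rating']") %>% html_text(),
      page = page_number
    )
  }, error = function(e) NULL) # NULL results are simply dropped when the rows are bound
}

result <- purrr::map_df(page_numbers, scrape_page)
```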

Rvest scraping google news with different number of rows

I am using rvest to scrape Google News.
However, I encounter missing values in the "Time" element from time to time, depending on the keyword. Since some values are missing, building the data frame of results fails with a "different number of rows" error.
Is there any way to fill in NA for these missing values?
Below is an example of the code I am using.
library(rvest)
library(dplyr)
# Search holds the query keyword
html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))
dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))
news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)
Without knowing the exact page you were looking at, I tried the main Google News page.
With rvest, html_node (without the s) always returns a value, even if it is NA. Therefore, to keep the vectors the same length, one needs to find the common parent node for all of the desired data nodes and then parse the desired information from each of those parent nodes.
Assuming the Title node is the most complete, I went up one level with xml_parent() and attempted to retrieve the same number of description nodes; this didn't work. Going two levels up with xml_parent() %>% xml_parent() seems to work.
library(rvest)
url <-"https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)
Title = html_dat %>% html_nodes('.DY5T1d') %>% html_text()
# Link = dat$Link
Link = html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/",Link)
#Find the common parent node
#(this was trial and error) Tried the parent then the grandparent
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description = Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time = Titlenodes %>% html_node('.WW6dff') %>% html_text()
answer <- data.frame(Title, Time, Description, Link)

Rvest, looping through elements on a page in order to follow a link at each element?

So I'm trying to scrape data from a site that contains club data for clubs at my school. I've got a good script going that scrapes the surface-level data from the site; however, I can get more data by clicking the "more information" link at each club, which leads to the club's profile page. I would like to scrape the data from that page (specifically the Facebook link).
Below you'll see my current attempt at this.
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
get_table <- function(page, count) {
  # find group names
  name_text <- html_nodes(page, ".grpl-name a") %>% html_text()
  df <- data.frame(name_text, stringsAsFactors = FALSE)

  # find text description
  desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
  df$desc_text <- trimws(desc_text)

  # find emails
  # find the parent nodes with html_nodes
  # then find the contact information from each parent using html_node
  email_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-contact a") %>% html_text()
  df$emails <- email_nodes

  category_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-type") %>% html_text()
  df$category <- category_nodes

  pic_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-logo img") %>% html_attr("src")
  df$logo <- paste0("https://uws-community.symplicity.com/", pic_nodes)

  more_info_nodes <- html_nodes(page, ".grpl-moreinfo a") %>% html_attr("href")
  df$more_info <- paste0("https://uws-community.symplicity.com/", more_info_nodes)

  sub_page <- page %>% follow_link(css = ".grpl-moreinfo a")
  df$fb <- html_node(sub_page, xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>% html_text()

  if (count != 44) {
    return(rbind(df, get_table(page %>% follow_link(css = ".paging_nav a:last-child"), count + 1)))
  } else {
    return(df)
  }
}
RSO_data <- get_table(page, 0)
The current error I'm getting is:
Error in `$<-.data.frame`(`*tmp*`, "logo", value = "https://uws-community.symplicity.com/") :
replacement has 1 row, data has 0
I know I need to make a function that will go through each element and follow the link, then mapply that function to the dataframe df. However I don't know how I'd go about making that function so that it would work correctly.
Your error says that you are trying to combine two different dimensions: the replacement value has one row, but your data frame has zero rows. Also, add page <- html_session(url) inside your function.
This is a reproducible example of your error message:
x = data.frame()
x[1] <- c(1)
I haven't checked your whole code, but the error is in there; go through it step by step and you will find the place where you've created an empty data.frame and then tried to assign a value to it.
Good luck.
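As for the per-club function the asker describes (follow each "more information" link and pull the Facebook field), here is a rough sketch. It assumes the profile pages can be fetched with a plain read_html() outside the session; if they require the session state, use follow_link()/jump_to() on the session object instead. Note that the XPath needs @id, not #id.
```
library(rvest)

# Sketch: fetch the Facebook link from one club profile page.
# Assumes the profile URL is reachable without the logged-in session.
get_facebook <- function(profile_url) {
  read_html(profile_url) %>%
    html_node(xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>%
    html_text()
}

# Apply it to every "more information" URL collected by get_table().
RSO_data$fb <- vapply(RSO_data$more_info, get_facebook, character(1))
```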

Scraping Lineup Data From Football Reference Using R

I always seem to have a problem scraping reference sites using either Python or R. Whenever I use my normal XPath approach (Python) or rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for (x in boxscore_links) {
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath = '//*[@id="div_home_starters"]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath = '//*[@id="div_home_starters"]') %>% html_table()
  # code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the urls for all boxscores in 2016, and the for loop goes to each boxscore page with the hopes of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally it should contain the table or elements of the table I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. It's shown with my own example, but I'm sure you can apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
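The same comment trick should work for the boxscore starters tables in your loop; a rough sketch for a single boxscore page is below. The table ids home_starters and vis_starters are my reading of the commented-out markup, so verify them against the page source.
```
library(rvest)

# Sketch: the starters tables sit inside HTML comments, so extract the
# comments, re-parse them as HTML, then pick the tables out by id.
box_url <- "https://www.pro-football-reference.com/boxscores/201609110rav.htm"
commented_html <- read_html(box_url) %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html()

home_starters <- commented_html %>% html_node('table#home_starters') %>% html_table()
away_starters <- commented_html %>% html_node('table#vis_starters') %>% html_table()
```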
