R scraping when SelectorGadget cannot find valid paths

I tried to scrape the data for each country from interactive pie charts here: https://transparencyreport.google.com/eu-privacy/overview?site_types=start:1453420800000;end:1633219199999;country:&lu=site_types
But SelectorGadget does not let me select the data points on the pie charts. How do I resolve this?
library(rvest)
library(dplyr)
link = "https://transparencyreport.google.com/eu-privacy/overview?site_types=start:1453420800000;end:1633219199999;country:&lu=site_types"
page = read_html(link)
percentage = page %>% html_nodes("#content_types div") %>% html_text()
"#content_types div" returns void.

If you inspect the page and look at the "Network" tab, you can see the API call being made to get the data.
The end number is the last millisecond of today.
There is some junk at the beginning of the response, but the rest is JSON.
You'll have to figure out what the category numbers mean; maybe there is documentation for the API somewhere.
library(magrittr)

link <- "https://transparencyreport.google.com/transparencyreport/api/v3/"
parms <- paste0("europeanprivacy/siteinfo/urlsbycontenttype?start=1453420800000&end=",
                1000 * ((Sys.Date() + 1) %>% as.POSIXct() %>% as.numeric()) - 1)
page <- httr::GET(paste0(link, parms))

# Drop the junk prefix: keep everything from the first "[[" onward and parse it as JSON
data <- page %>% httr::content(as = "text") %>%
  substr(., regexpr("\\[\\[.*\\]\\]", .), .Machine$integer.max) %>%
  jsonlite::fromJSON() %>% .[[1]] %>% .[[2]] %>% as.data.frame()

Related

Web Scraping returns empty in R

I'm trying to scrape prices from Bloomberg. I can get the current price as shown below, but I can't get the previous price. What am I doing wrong?
library(rvest)

url <- "https://www.bloomberg.com/quote/WORLD:IND"

price <- read_html(url) %>%
  html_nodes("div.overviewRow__66339412a5 span.priceText__06f600fa3e") %>%
  html_text()

prevprice <- read_html(url) %>%
  html_nodes("div.value__7e29a7c90d") %>%
  html_text() # returns 0

prevprice <- read_html(url) %>%
  html_nodes(xpath = '//section') %>%
  html_text() %>%
  as.data.frame() # didn't find the price
Thanks in advance.
So, there are at least two initial options:
Extract from the script tag that the info is pulled from. When the browser runs JavaScript, this info is used to populate the page as you see it. With rvest/httr, JavaScript is not run, so you need to extract from the script tag rather than from where it ends up on the rendered webpage.
Or, you can calculate the previous price from the percentage change and the current price. There might be a very small margin of inaccuracy here due to rounding.
I show both of the above options in the code below.
I've also adapted the CSS selectors to use attribute = value selectors with the starts-with operator (^). This makes the code more robust, as the classes in the HTML appear to be dynamic, with only the start of each class attribute value being stable.
library(httr2)
library(tidyverse)
library(rvest)

url <- "https://www.bloomberg.com/quote/WORLDT:IND"
headers <- c("user-agent" = "mozilla/5.0")

page <- request(url) |>
  (\(x) req_headers(x, !!!headers))() |>
  req_perform() |>
  resp_body_html()

# Option 1: extract directly from the page source (the value sits in a URL-encoded blob)
prev_price <- page |>
  html_text() |>
  stringr::str_match("previousClosingPriceOneTradingDayAgo%22%3A(\\d+\\.?\\d+?)%2C")
prev_price <- prev_price[, 2]

curr_price <- page |>
  html_element("[class^=priceText]") |>
  html_text() |>
  str_replace_all(",", "") |>
  as.numeric()

# Option 2: calculate from the current price and the percentage change (approximate)
change <- page |>
  html_element("[class^=changePercent]") |>
  html_text() |>
  str_extract("[\\d\\.]+") |>
  as.numeric()

adjustment <- 100 - change
prev_price_calculated <- curr_price * (adjustment / 100)

print(curr_price)
print(change)
print(prev_price)
print(prev_price_calculated)

How to scrape a table created using datawrapper using rvest?

I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Following is the code I have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But I get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the datawrapper page to extract Table 1.
I would be really grateful if someone could point out the mistake I am making, or suggest a solution for scraping this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the CSV download URL from the iframe page and then request that CSV.
library(rvest)
library(magrittr)
library(vroom)
library(stringr)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')
iframe <- page %>%
  html_element('iframe[title^="Table 1"]') %>%
  html_attr('src')

id <- read_html(iframe) %>%
  html_element('meta') %>%
  html_attr('content') %>%
  str_match('/(\\d+)/') %>%
  .[, 2]

csv_url <- paste(iframe, id, 'dataset.csv', sep = '/')
data <- vroom(csv_url, show_col_types = FALSE)

Rvest Scraping the correct date with xpath or html selector on IMDB

I was trying to scrape an actor's birthdate from IMDB.
I don't receive the same result as is provided in the original code.
Birthdate provided in the code, and the desired format: "yyyy-mm-dd"
library(rvest)
link = "https://www.imdb.com/name/nm0000093/"
page = read_html(link)
Date <- page %>% html_nodes(xpath = '//*[@id="name-born-info"]/time') %>% html_text('')
Do I need to select another node??
Thanks!
No luck with these either:
Date <- page %>% html_nodes(xpath = '/html/body/div[2]/div/div[2]/div/div[1]/div/div/table/tbody/tr[3]/td/div[3]/time') %>% html_attr('')
Date <- page %>% html_nodes(xpath = '/html/body/div[2]/div/div[2]/div/div[1]/div/div/table/tbody/tr[3]/td/div[3]/time') %>% html_text('')
Date <- page %>% html_nodes(xpath = '//*[@id="name-born-info"]/time') %>% html_attr('')
I think you can just retrieve the single node and then pass the actual attribute name to html_attr(). I would also switch to the faster CSS-type selector:
Date <- page %>% html_node('time') %>% html_attr('datetime')
Failing that, you can regex it out:
stringr::str_match(page %>% html_text(), 'birthDate":\\s?"(\\d{4}-\\d{2}-\\d{2})"')[1, 2]
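The regex works because the date very likely sits inside a JSON blob embedded in the page source. As a sketch, assuming the page exposes a script[type="application/ld+json"] node that carries a birthDate field, you could also parse that block directly instead of regexing the whole page:
library(rvest)
library(jsonlite)

page <- read_html("https://www.imdb.com/name/nm0000093/")

# Assumption: a JSON-LD script block exists and contains birthDate (check the page source)
ld <- page %>%
  html_element('script[type="application/ld+json"]') %>%
  html_text() %>%
  fromJSON()

# birthDate may sit at the top level or be nested (e.g. under mainEntity), so check both
Date <- if (!is.null(ld$birthDate)) ld$birthDate else ld$mainEntity$birthDate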

Rvest scraping google news with different number of rows

I am using rvest to scrape Google News.
However, I encounter missing values in the element "Time" from time to time with different keywords. Since those values are missing, building the data frame of scraping results ends up with a "differing number of rows" error.
Is there any way to fill in NA for these missing values?
Below is an example of the code I am using.
html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)
Without knowing the exact page you were looking at, I tried the first Google News page.
In rvest, html_node (without the s) always returns a value, even if it is NA. Therefore, to keep the vectors the same length, you need to find the common parent node for all of the desired data nodes and then parse the desired information out of each of those parent nodes.
Assuming the Title node is the most complete, I went up one level with xml_parent() and attempted to retrieve the same number of description nodes; this didn't work. I then tried two levels up using xml_parent() %>% xml_parent(), which seems to work.
library(rvest)
url <-"https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)
Title = html_dat %>% html_nodes('.DY5T1d') %>% html_text()
# Link = dat$Link
Link = html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/",Link)
#Find the common parent node
#(this was trial and error) Tried the parent then the grandparent
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description = Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time = Titlenodes %>% html_node('.WW6dff') %>% html_text()
answer <- data.frame(Title, Time, Description, Link)

Scraping Lineup Data From Football Reference Using R

I seem to always have a problem scraping reference sites using either Python or R. Whenever I use my normal XPath approach (Python) or rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)

url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)

table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)

for(x in boxscore_links) {
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)

  home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()

  # code that will bind lineup tables with some master table -- to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the urls for all boxscores in 2016, and the for loop goes to each boxscore page with the hopes of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table, or elements of the table, I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done: the tables on these pages are wrapped in HTML comments, so you select the comment nodes, extract their text, and re-parse that as HTML. This uses my own example, but I'm sure you could apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
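Applied to one of the boxscore pages from the question, a sketch could look like the following (the table id "home_starters" is an assumption taken from the question's XPath, so verify it against the page source):
library(rvest)

# Sketch: apply the same comment-extraction trick to a single boxscore page.
# Assumption: the starters table id is "home_starters" (taken from the question's XPath).
boxscore_url <- "https://www.pro-football-reference.com/boxscores/201609110rav.htm"

home_starters <- boxscore_url %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>% # select comments
  html_text() %>%                       # extract comment text
  paste(collapse = '') %>%              # collapse to single string
  read_html() %>%                       # reread as HTML
  html_node('table#home_starters') %>%  # starters table hidden in a comment
  html_table()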
