How to scrape dynamic pages when there's only one URL - R

I am trying to scrape data from this webpage: https://www.premierleague.com/stats/top/players/saves. However, there are two pages of data I want to scrape. I have been able to scrape the first page with the code below:
library(RSelenium)
library(rvest)
library(dplyr)

# remDr is an already-running RSelenium remote driver
remDr$navigate("https://www.premierleague.com/stats/top/players/saves")
epl <- read_html(remDr$getPageSource()[[1]])

# Extract each column of the stats table
rank   <- epl %>% html_nodes(".statsTableContainer .rank") %>% html_text()
player <- epl %>% html_nodes(".playerName") %>% html_text()
club   <- epl %>% html_nodes(".statNameSecondary") %>% html_text()
stat   <- epl %>% html_nodes(".statsTableContainer .text-centre") %>% html_text()

str(rank)
str(player)
str(club)
str(stat)

Saves <- data.frame(rank, player, club, stat)
I have been using the RSelenium package for the scraping. For the second page there isn't a different URL; you have to click the arrow on the side. How do I scrape the second page when there's only an arrow to click?
I haven't been able to try anything as I'm not sure where to even start; I've not come across this problem before.
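There is no answer in the thread, but one possible approach is to use RSelenium to click the pagination arrow and then re-read the page source. This is only a sketch: the CSS selector used for the arrow (".paginationNextContainer") is an assumption and would need to be confirmed in the browser's inspector.
library(RSelenium)
library(rvest)
library(dplyr)

# Click the "next page" arrow, wait for the table to re-render, then re-parse.
# NOTE: ".paginationNextContainer" is a guessed selector - inspect the page to
# find the real class of the arrow element.
next_arrow <- remDr$findElement(using = "css selector", ".paginationNextContainer")
next_arrow$clickElement()
Sys.sleep(3)  # give the JavaScript time to load page 2 of the table

epl2 <- read_html(remDr$getPageSource()[[1]])
rank2 <- epl2 %>% html_nodes(".statsTableContainer .rank") %>% html_text()
# ...repeat for player, club and stat, then rbind() with the first page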

Related

How can I scrape the complete dataset from Yahoo Finance with rvest?

I'm trying to get the complete dataset of Bitcoin historical data from Yahoo Finance via web scraping. This is my first attempt:
library(rvest)
library(tidyverse)

crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url, css = "table")
cryp_table <- html_table(cryp_table, fill = TRUE) %>%
  as.data.frame()
In the link that I provide to read_html(), a long time period is already selected; however, it only gets the first 101 rows, and the last row is the loading message you see when you keep scrolling. This is my second attempt, but I get the same result:
col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")

cryp_table <- col_page %>%
  html_nodes(xpath = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>%
  html_table(fill = TRUE)

cryp_final <- cryp_table[[1]]
How can I get the whole dataset?
I think you can get the download link. If you look at the Network tab in your browser's developer tools, you will see the download link; in this case:
"https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"
This link looks a lot like the URL of the site itself, i.e., we can modify the page URL to build the download link and read the CSV directly. See the code:
library(stringr)
library(magrittr)

site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"

download_link <- site %>%
  # drop everything up to "quote/", the "/history" path segment and "&frequency=1d"
  stringr::str_remove_all(".+(?<=quote/)|/history?|&frequency=1d") %>%
  # the download endpoint uses "events=" instead of "filter="
  stringr::str_replace("filter", "events") %>%
  # prepend the download host
  stringr::str_c(base_download, .)

readr::read_csv(download_link)
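Since the query parameters are visible in the download URL above, a simpler variant (my suggestion, not part of the original answer) is to build the download link directly and only swap in the ticker and the two period timestamps:
library(readr)

# A sketch assuming the same query parameters as the download URL shown above;
# only the ticker and the period timestamps are swapped in.
download_link <- sprintf(
  "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%s&period2=%s&interval=1d&events=history&includeAdjustedClose=true",
  "BTC-USD", 1480464000, 1638230400
)
btc <- read_csv(download_link)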

Using rvest to extract data from a webpage but keep getting NA, "", or {xml_nodeset (0)}

I'm trying to use rvest to extract the actual and projected medals from this page:
https://projects.fivethirtyeight.com/olympics-medal-count/
but for some reason I can't get any text from it.
I've been trying multiple variations of
library(rvest)

page <- read_html("https://projects.fivethirtyeight.com/olympics-medal-count/")

page %>%
  html_nodes('.countries') %>%
  html_nodes('.actual-rank') %>%
  html_text()

page %>%
  html_nodes('.actual-rank') %>%
  html_text()

page %>%
  html_nodes('div.rank-value') %>%
  html_text()
But I never get anything other than some variation of blank/missing data.
Any help or direction would be really appreciated.
Thanks.
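This question has no answer in the thread, but a first diagnostic step (my suggestion) is to check whether those class names exist in the raw HTML at all; if they don't, the table is being rendered by JavaScript, so a plain read_html() will never see it and a browser-driven tool such as RSelenium (as in the first question above) would be needed. A minimal check:
library(rvest)

page <- read_html("https://projects.fivethirtyeight.com/olympics-medal-count/")

# How many nodes does each selector match in the static HTML?
length(html_nodes(page, ".countries"))
length(html_nodes(page, ".actual-rank"))
length(html_nodes(page, "div.rank-value"))
# If these are all 0, the medal table is built client-side by JavaScript and
# the data simply isn't in the HTML that read_html() downloads.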

How can I scrape an embedded tweet? [R]

I am trying to scrape an embedded tweet on a website. I believe the tweet is loaded via JSON. Ideally, I would be able to simply scrape the embedded tweet's ID. As far as I can tell, this data should be available with the CSS selector '#twitter-widget-0', but nothing is returned when I scrape using rvest.
My code is below:
library(rvest)

page <- "https://deutsch.rt.com/amerika/86714-rund-woche-nach-russland-auch-china-schickt-militaer-nach-venezuela/"

read_html(page) %>%
  html_nodes('#twitter-widget-0') %>%
  html_text()
Something like this might help
library(dplyr)
library(rvest)
page %>%
  read_html() %>%
  html_nodes("div.rtcode") %>%
  html_text()
#[1] "#Venezuela#China#Russia#Caracas#Chinese army soldiers arrived in
#Venezuela #Chinese People’s Liberation Army soldiers, as part of a
#cooperation program, #arrived, after delivering humanitarian supplies, to one
#of Venezuelan military #facilities. pic.twitter.com/HwZ9Ee67d0— Sukhoi Su-57
#frazor\U0001f1f7\U0001f1fa\U0001f1ee\U0001f1f3 (#I30mki) 1. April 2019"
Or if you want the unique twitter URL
page %>%
  read_html() %>%
  html_nodes("div.rtcode a") %>%
  html_attr("href") %>%
  grep("status", ., value = TRUE)
#[1] "https://twitter.com/I30mki/status/1112578904835981312?ref_src=twsrc%5Etfw"

Scraping a very specific section of a website?

I have a list of URLs (mesa$fullerurl) for documents, and I am trying to scrape a specific section of text from each one (the paragraphs on Risk Factors). The problem is that there is no unique HTML tag that I can see for this section. The best approach I can think of is to tell R to grab the text from the Risk Factors heading up to the next heading and put that in a new data frame, k10, but I am not sure how to specify this in R. Thanks!
Here is an example of the document that I am trying to scrape from:
https://www.sec.gov/Archives/edgar/data/72903/000007290319000010/xcel1231201810-k.htm
library(rvest)
library(purrr)

# Get the filing index pages for Xcel Energy (CIK 0000072903) 10-K filings
sec <- read_html("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000072903&type=10-k&dateb=&owner=exclude&count=40")

xcel <- sec %>%
  html_nodes("#documentsbutton") %>%
  html_attr("href")
xcel <- data.frame(xcel)

# Build the full URL of each filing index page
xcel$xcell <- paste0("https://www.sec.gov", xcel$xcell)
xcel$fullurl <- paste0(xcel$xcell, xcel$xcel)
as.character(xcel$fullurl)

# Read the document table from each index page and keep only the 10-K rows
mesa <- map_dfr(xcel$fullurl, ~ .x %>% read_html() %>% html_table() %>% .[[1]])
mesa <- subset(mesa, mesa$Type == "10-K" | mesa$Type == "10-K/A" | mesa$Type == "10-K405")
mesa

# Turn each index URL into its directory URL and append the document file name
s <- gsub("(.*)/.*", "\\1", xcel$fullurl)
table(xcel$fullurl)
xcel$fullurl <- s
xcel$fullurl <- paste0(xcel$fullurl, "/")
mesa$fullerurl <- paste0(xcel$fullurl, mesa$Document)
as.character(mesa$fullerurl)
mesa$Document[mesa$Document == ""] <- NA
mesa$fullerurl

# Below is the problematic part
k10 <- map_dfr(mesa$fullerurl, ~ .x %>% read_html("") %>% html_nodes("") %>% html_text(""))  # selectors still to be filled in

R Web scraping from different URLs

I am web scraping a page at
http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
From this URL, I have built up a data frame with the following code:
library(tidyverse)
library(rvest)

dflist <- map(.x = 1:417, .f = function(x) {
  Sys.sleep(5)
  url <- "http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk="
  read_html(url) %>%
    html_nodes(".title a") %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)
I have repeated the same code in order to get all the data I was interested in, and it seems to work fine, although it is of course a little slow due to the Sys.sleep() call.
My issue arose when I tried to scrape the individual project descriptions that should be included in the data frame.
For instance, the first project description is at
http://catalog.ihsn.org/index.php/catalog/7118/study-description
the second project description is at
http://catalog.ihsn.org/index.php/catalog/6606/study-description
and so forth.
My problem is that I can't find a dynamic way to scrape all the project pages and insert them into the data frame, since the number in the URLs is neither sequential nor at the end of the link.
To make things clearer, this is the structure of the website I am scraping:
1.http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
1.1. http://catalog.ihsn.org/index.php/catalog/7118
1.1.a http://catalog.ihsn.org/index.php/catalog/7118/related_materials
1.1.b http://catalog.ihsn.org/index.php/catalog/7118/study-description
1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary
I have successfully scraped level 1 but not level 1.1.b (study-description), the one I am interested in, since the dynamic element of the URL (in this case: 7118) is not consistent across the website's more than 6,000 pages at that level.
You have to extract the deeper URLs from the .title a nodes and then scrape those as well. Here's a small example of how to do that using rvest and the tidyverse:
library(tidyverse)
library(rvest)

scraper <- function(x) {
  Sys.sleep(5)
  url <- sprintf("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=%s&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=", x)
  html <- read_html(url)
  # Grab both the title text and the href of each project link
  tibble(title       = html_nodes(html, ".title a") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".title a") %>% html_attr("href"))
}

result <- map_df(1:2, scraper) %>%
  mutate(study_description = map(project_url,
                                 ~ read_html(sprintf("%s/study-description", .x)) %>%
                                   html_node(".xsl-block") %>%
                                   html_text()))
This doesn't cover everything you want to do, but it should show you an approach.
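One small follow-up (my addition, not part of the original answer): study_description comes back as a list-column because map() returns a list. If you want a plain character column, map_chr() works, and it is probably worth pausing inside that inner loop too so you don't hammer the server:
# Same idea, but returning a character column and pausing between requests
result <- map_df(1:2, scraper) %>%
  mutate(study_description = map_chr(project_url, function(u) {
    Sys.sleep(5)
    read_html(sprintf("%s/study-description", u)) %>%
      html_node(".xsl-block") %>%
      html_text(trim = TRUE)
  }))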
