How can i scrape the complete dataset from yahoo finance with rvest - r

Im trying to get the complete data set for bitcoin historical data from yahoo finance via web scraping, this is my first option code chunk:
library(rvest)
library(tidyverse)
crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url,css = "table")
cryp_table <- html_table(cryp_table,fill = T) %>%
as.data.frame()
I the link that i provide to read_html() a long period of time is already selected, however it just get the first 101 rows and the last row is the loading message that you get when you keep scrolling, this is my second shot but i get the same:
col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <-
col_page %>%
html_nodes(xpath = '//*[#id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>%
html_table(fill = T)
cryp_final <- cryp_table[[1]]
How can i get the whole dataset?

I think you can get the link of download, if you view the Network, you see the link of download, in this case:
"https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"
Well, this link looks like the url of the site, i.e., we can modify the url link to get the download link and read the csv. See the code:
library(stringr)
library(magrittr)
site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"
download_link <- site %>%
stringr::str_remove_all(".+(?<=quote/)|/history?|&frequency=1d") %>%
stringr::str_replace("filter", "events") %>%
stringr::str_c(base_download, .)
readr::read_csv(download_link)

Related

webscrape unicef data with rvest

I wan wanting to automate downloading of some unicef data from https://data.unicef.org/indicator-profile/ using rvest or a simila r package. I have noticed that there are indicator codes, but I am having trouble identifying the correct codes and actually downloading the data.
Upon inspecting element, there is a data-inner-wrapper class that seems like it might be useful. You can access a download link by going to a page associated with an indicator and specifying a time period. For example, CME_TMY5T9 is the code for Deaths aged 5 to 9.
The data is available by going to
https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=.CME_TMY5T9..&startPeriod=2017&endPeriod=2022` and then clicking a download link.
If anyone could help me figure out how to get all the data, that would be fantastic. Thanks
library(rvest)
library(dplyr)
library(tidyverse)
page = "https://data.unicef.org/indicator-profile/"
df = read_html(page) %>%
#html_nodes("div.data-inner-wrapper")
html_nodes(xpath = "//div[#class='data-inner-wrapper']")
EDIT: Alternatively, downloading all data for each country would be possible. I think that would just require getting the download link or getting at at the data within the table (since country codes arent much of an issue)
This shows all the data for Afghanistan. I just need to figure out a programmatic way of actually downloading the data....
https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=AFG..&startPeriod=1970&endPeriod=2022
You are on the right track! When you visit the website https://data.unicef.org/indicator-profile/, it does not directly contain the indicator codes, because these are loaded dynamically at a later point. You can try using the "network analysis" function of your webbrowser and look at the different requests your browser does to fully load a webpage. The one you are looking for, with all the indicator codes is here: https://uni-drp-rdm-api.azurewebsites.net/api/indicators
library(httr)
library(jsonlite)
library(glue)
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
You also correctly identied the URL, which lets you download individual datasets in the data browser. Now you just needed to find the one that pops up, when you actually download an excel file and recursively add in the differnt helix-codes from the indicators. I have not tried applying this to all indicators, for some the url might differ and you might get incomplete data or errors. But this should get you started.
GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[3]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
content(as = "text") %>%
read_csv()
This might be a good place to get started on how to mimick requests that your browser executes. https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
Here is what I did based on the very helpful code from #Datapumpernickel
library(dplyr)
library(httr)
library(jsonlite)
library(glue)
library(tidyverse)
library(tictoc)
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
#browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
tic()
FULL_DF = NULL
for(i in seq(1,length(unique(indicators$helixCode)),1)){
# Set up a trycatch loop to keep on going when it encounters errors
tryCatch({
print(paste0("Processing : ", i, " of 546 ", indicators$helixCode[i]))
TMP = GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[i]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
content(as = "text") %>%
read_csv(col_types = cols())
# # Basic formatting for variables I want
TMP = TMP %>%
select(`Geographic area`, Indicator, Sex, TIME_PERIOD, OBS_VALUE) %>%
mutate(description = indicators$helixCode[i]) %>%
rename(country = `Geographic area`,
variablename = Indicator,
disaggregation = Sex,
year = TIME_PERIOD,
value = OBS_VALUE)
# rbind each indicator to the full dataframe
FULL_DF = FULL_DF %>% rbind(TMP)
},
error = function(cond){
cat("\n WARNING COULD NOT PROCESS : ", i, " of 546 ", indicators$helixCode[i])
message(cond)
return(NA)
}
)
}
toc()
# Save the data
rio::export(FULL_DF, "unicef-data.csv")

R - Using SelectorGadget to grab a dataset

I am trying to grab Hawaii-specific data from this site: https://www.opentable.com/state-of-industry. I want to get the data for Hawaii from every table on the site. This is done after selecting the State tab.
In R, I am trying to use rvest library with SelectorGadget.
So far I've tried
library(rvest)
html <- read_html("https://www.opentable.com/state-of-industry")
html %>%
html_element("tbody") %>%
html_table()
However, this isn't giving me what I am looking for yet. I am getting the Global dataset instead in a tibble. So any suggestions on how grab the Hawaii dataset from the State tab?
Also, is there a way to download the dataset that clicks on Download dataset tab? I can also then work from the csv file.
All the page data is stored in a script tag where it is pulled from dynamically in the browser. You can regex out the JavaScript object containing all the data, and write a custom function to extract just the info for Hawaii as shown below. Function get_state_index is written to accept a state argument, in case you wish to view other states' information.
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
get_state_index <- function(states, state) {
return(match(T, map(states, ~ {
.x$name == state
})))
}
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook
hawaii_dataset <- tibble(
date = fullbook$headers %>% unlist() %>% as.Date(),
yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
Regex:

html_nodes return an empty list for complex table

I would like to webscraping the table in the following website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats
I am using the following code but it is not working, thank you in advance.
library(rvest)
library(xml2)
library(dplyr)
link <- "https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
page<- read_html(link)
rank<- page %>% html_nodes(".sorting_2") %>% html_text()
university<-page %>% html_nodes(".ranking-institution-title ") %>% html_text()
statistics<-page %>% html_nodes(".stats") %>% html_text()
The Terms and Services of this site state: "Use data mining, robot, spider, scraping or similar automated data gathering, extraction or publication tools for any purpose."
That being said, you can read the json file that #QHarr found:
library(jsonlite)
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)
head(x$data) # give you the data frame with universities
Now you have a well structured R list. The $data element contains a data frame with the stats of each university in rows. The other 3 list elements only provide supplementary information.

R Web scraping from different URLs

I am web scraping a page at
http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
From this url, I have built up a dataframe through the following code:
dflist <- map(.x = 1:417, .f = function(x) {
Sys.sleep(5)
url <- ("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=")
read_html(url) %>%
html_nodes(".title a") %>%
html_text() %>%
as.data.frame()
}) %>% do.call(rbind, .)
I have repeated the same code in order to get all the data I was interested in and it seems to work perfectly, although is of course a little slow due to the Sys.sleep() thing.
My issue has raised once I have tried to scrape the single projects descriptions that should be included in the dataframe.
For instance, the first project description is at
http://catalog.ihsn.org/index.php/catalog/7118/study-description
the second project description is at
http://catalog.ihsn.org/index.php/catalog/6606/study-description
and so forth.
My problem is that I can't find a dynamic way to scrape all the projects' pages and insert them in the data frame, being the number in the URLs not progressive nor at the end of the link.
To make things clearer, this is the structure of the website I am scraping:
1.http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
1.1. http://catalog.ihsn.org/index.php/catalog/7118
1.1.a http://catalog.ihsn.org/index.php/catalog/7118/related_materials
1.1.b http://catalog.ihsn.org/index.php/catalog/7118/study-description
1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary
I have scraped successfully level 1. but cannot level 1.1.b. (study-description) , the one I am interested in, since the dynamic element of the URL (in this case: 7118) is not consistent in the website's above 6000 pages of that level.
You have to extract the deeper urls from the .title a and then scrape those as well. Here's a small example on how to do that using rvest and the tidyverse
library(tidyverse)
library(rvest)
scraper <- function(x) {
Sys.sleep(5)
url <- sprintf("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=%s&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=", x)
html <- read_html(url)
tibble(title = html_nodes(html, ".title a") %>% html_text(trim = TRUE),
project_url = html_nodes(html, ".title a") %>% html_attr("href"))
}
result <- map_df(1:2, scraper) %>%
mutate(study_description = map(project_url, ~read_html(sprintf("%s/study-description", .x)) %>% html_node(".xsl-block") %>% html_text()))
This isn't complete as to all the things you want to do, but should show you an approach.

Scrape website data using rvest

I am trying to scrape the data corresponding to Table 5 from the following link: https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls
As suggested, I used SelectorGadget to find the relevant CSS match, and the one I found that contained all the data (as well as some extraneous information) was "#page_content"
I've tried the following code, which yield errors:
fbi <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
fbi %>%
html_node("#page_content") %>%
html_table()
Error: html_name(x) == "table" is not TRUE
#Try extracting only the first column:
fbi %>%
html_nodes(".group0") %>%
html_table()
Error: html_name(x) == "table" is not TRUE
#Directly feed fbi into html_table
data = fbi %>% html_table(fill = T)
#This output creates a list of 3 elements, where within list 1 and 3, there are many missing values.
Any help would be greatly appreciated!
You can download the excel file directly. After that you should look into the excel file and take data that you want into a csv file. After that you can work on the data. Below is the code for doing the same.
library(rvest)
library(stringr)
page <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
pageAdd <- page %>%
html_nodes("a") %>% # find all links
html_attr("href") %>% # get the url
str_subset("\\.xls") %>% # find those that end in xls
.[[1]]
mydestfile <- "D:/Kumar/table5.xls" # change the path and file name as per your system
download.file(pageAdd, mydestfile, mode="wb")
The data is not in a very formatted way. Hence downloading it in R, will be more confusing. To me this appears to be the best way to solve your problem.

Resources