Scrape website data using rvest - css

I am trying to scrape the data corresponding to Table 5 from the following link: https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls
As suggested, I used SelectorGadget to find the relevant CSS match, and the one I found that contained all the data (as well as some extraneous information) was "#page_content"
I've tried the following code, which yields errors:
library(rvest)
fbi <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
fbi %>%
  html_node("#page_content") %>%
  html_table()
Error: html_name(x) == "table" is not TRUE
# Try extracting only the first column:
fbi %>%
  html_nodes(".group0") %>%
  html_table()
Error: html_name(x) == "table" is not TRUE
# Directly feed fbi into html_table
data <- fbi %>% html_table(fill = TRUE)
# This produces a list of 3 elements; elements 1 and 3 contain many missing values.
Any help would be greatly appreciated!

You can download the Excel file directly. After that, open the file, pull the data you want into a CSV, and work with that. Below is the code for downloading the file.
library(rvest)
library(stringr)
page <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
pageAdd <- page %>%
  html_nodes("a") %>%        # find all links
  html_attr("href") %>%      # get the urls
  str_subset("\\.xls") %>%   # keep those containing .xls
  .[[1]]                     # take the first match
mydestfile <- "D:/Kumar/table5.xls" # change the path and file name as per your system
download.file(pageAdd, mydestfile, mode="wb")
The data in the file is not laid out in a tidy way, so parsing it directly in R will be more confusing. To me this appears to be the best way to solve your problem.
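If you want to stay in R after the download, here is a minimal sketch using readxl; the number of header rows to skip is an assumption you will need to check against the actual workbook:
library(readxl)
# Read the downloaded workbook; FBI tables usually have a few header rows
# above the data, so `skip` is a guess to adjust after inspecting the file.
table5 <- read_excel(mydestfile, skip = 3)
head(table5)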

Related

How can I scrape the complete dataset from Yahoo Finance with rvest

I'm trying to get the complete historical data set for Bitcoin from Yahoo Finance via web scraping. This is my first attempt:
library(rvest)
library(tidyverse)
crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url, css = "table")
cryp_table <- html_table(cryp_table, fill = TRUE) %>%
  as.data.frame()
In the link that I provide to read_html(), a long time period is already selected; however, it only gets the first 101 rows, and the last row is the loading message you see when you keep scrolling. This is my second attempt, but I get the same result:
col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <-
  col_page %>%
  html_nodes(xpath = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>%
  html_table(fill = TRUE)
cryp_final <- cryp_table[[1]]
How can I get the whole dataset?
I think you can get the download link: if you look at the Network tab in your browser's developer tools, you will see the download link, in this case:
"https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"
Well, this link looks a lot like the URL of the page itself, i.e., we can modify the page URL to get the download link and read the CSV. See the code:
library(stringr)
library(magrittr)
site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"
download_link <- site %>%
  stringr::str_remove_all(".+(?<=quote/)|/history?|&frequency=1d") %>% # drop the site prefix, "/history" and the frequency parameter
  stringr::str_replace("filter", "events") %>%                         # the download API uses "events" where the page URL has "filter"
  stringr::str_c(base_download, .)                                     # prepend the download base URL
readr::read_csv(download_link)
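Equivalently, since the query parameters are the same, you could build the download link directly with paste0(); the ticker and period values below are taken from the question's URL, and whether Yahoo keeps this endpoint available is an assumption:
ticker <- "BTC-USD"
download_link <- paste0(
  "https://query1.finance.yahoo.com/v7/finance/download/", ticker,
  "?period1=1480464000&period2=1638230400",
  "&interval=1d&events=history&includeAdjustedClose=true"
)
btc <- readr::read_csv(download_link)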

html_nodes return an empty list for complex table

I would like to webscraping the table in the following website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats
I am using the following code, but it returns empty results. Thank you in advance.
library(rvest)
library(xml2)
library(dplyr)
link <- "https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
page<- read_html(link)
rank <- page %>% html_nodes(".sorting_2") %>% html_text()
university <- page %>% html_nodes(".ranking-institution-title") %>% html_text()
statistics <- page %>% html_nodes(".stats") %>% html_text()
The Terms and Services of this site state that you may not: "Use data mining, robot, spider, scraping or similar automated data gathering, extraction or publication tools for any purpose."
That being said, you can read the JSON file that @QHarr found:
library(jsonlite)
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)
head(x$data) # gives you the data frame with universities
Now you have a well structured R list. The $data element contains a data frame with the stats of each university in rows. The other 3 list elements only provide supplementary information.
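For example, to pull out just the rank, institution name, and overall score, something like the sketch below should work; the exact column names are assumptions, so check names(x$data) first:
library(dplyr)
rankings <- x$data %>%
  select(rank, name, scores_overall) %>% # column names assumed; verify with names(x$data)
  head(25)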

Webscraping in R From Dataframe

From the following data frame
I am trying to use the rvest package to scrape each word's part of speech and synonyms from the website https://www.thesaurus.com/browse/research?s=t into a CSV.
I am not sure how to have R search each word of the data frame and pull its part of speech and synonyms.
install.packages("rvest")
install.packages("xml2")
library(xml2)
library(rvest)
library(dplyr)
words <- data.frame("keywords" = c("research", "survey", "staff", "outpatient", "consent"))
html<- read_html("https://www.merriam-webster.com/thesaurus/research")
html %>%
  html_nodes(".mw-list") %>%
  html_text() %>%
  head(n = 1) # take the first record
If you search for [your term] on Thesaurus.com, you end up on a page of the form "https://www.thesaurus.com/browse/[your term]". Knowing this, you can build the URLs of all the pages for the terms you're interested in. After that you can iterate with the map() function from the purrr package to get the information you want:
library(purrr)
# It makes more sense to just keep "words" as a vector for now
words <- c("research", "survey", "staff", "outpatient", "consent")
htmls <- paste0("https://www.thesaurus.com/browse/", words)
info_list <- map(htmls, ~ .x %>%
  read_html() %>%
  html_node(".mw-list") %>% # this selector comes from the merriam-webster example above; adjust it for thesaurus.com
  html_text())
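As a small follow-up, you could name the resulting list after the words so each result is easy to look up:
names(info_list) <- words
info_list[["research"]]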

Scraping tabulated data from ballotpedia.org with rvest

I'm trying to scrape tabulated data on previous US statewide election results, and I think ballotpedia.org is a good place to get this data from, as URLs are in a consistent format for all states.
Here's the code I set up to test it:
library(dplyr)
library(rvest)
# STEP 1 - URL COMPONENTS TO SCRAPE FROM
senate_base_url <- "https://ballotpedia.org/United_States_Senate_elections_in_"
senate_state_urls <- gsub(" ", "_", state.name)
senate_year_urls <- c(",_2012", ",_2014", ",_2016")
# TEST
test_url <- paste0(senate_base_url, senate_state_urls[10], senate_year_urls[2])
This results in the following URL: https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014
Using the 'selectorgadget' chrome plugin, I selected the table in question containing the election result, and tried parsing it into R as follows:
test_data <- read_html(test_url)
test_data <- test_data %>%
  html_node(xpath = '//*[@id="collapsibleTable0"]') %>%
  html_table()
However, I'm getting the following error:
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
Furthermore, the R object test_data yields a list with 2 empty elements.
Can anyone tell me what I'm doing wrong here? Is the html_table() function the wrong one? Using html_text() simply returns an NA character vector. Any help would be greatly appreciated, thanks very much :).
Your XPath statement is incorrect, so the html_node function returns an empty result.
Here is a solution using the HTML tags: look for a table tag within a center tag.
library(rvest)
test_page <- read_html(test_url)
test_data <- test_page %>% html_nodes("center table") %>% html_table()
Or, to retrieve the fully collapsed table, use the table tag with its class name:
collapsedtable <- test_page %>% html_nodes("table.collapsible") %>%
  html_table(fill = TRUE)
This works for me:
library(httr)
library(XML)
r <- httr::GET("https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014")
XML::readHTMLTable(rawToChar(r$content))[[2]]
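Since the question's goal is to repeat this for every state and year, here is a sketch of the full loop built from the URL components in STEP 1; whether every state/year combination has a page with the same table layout is an assumption, so failed requests are simply skipped:
library(purrr)
library(rvest)
combos <- expand.grid(state = senate_state_urls, year = senate_year_urls,
                      stringsAsFactors = FALSE)
get_tables <- function(state, year) {
  url <- paste0(senate_base_url, state, year)
  tryCatch(
    read_html(url) %>% html_nodes("center table") %>% html_table(fill = TRUE),
    error = function(e) NULL # not every state holds a Senate election in every year
  )
}
results <- map2(combos$state, combos$year, get_tables)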

Web scraping - selection of a table

I'm trying to extract the table of historical data from Yahoo Finance website.
First, by inspecting the source code I found that the data is actually in an HTML table, so I suspect that html_table() from rvest should be able to handle it. However, I can't find a way to reach it from R. I've tried giving the function the full page, but it did not fetch the right table:
url <- "https://finance.yahoo.com/quote/^FTSE/history?period1=946684800&period2=1470441600&interval=1mo&filter=history&frequency=1mo"
read_html(url) %>% html_table(fill = TRUE)
# Returns only:
# [[1]]
# X1 X2
# 1 Show all results for Tip: Use comma to separate multiple quotes Search
Second, I've found an xpath selector for the particular table, but I am still unsuccessful in fetching the data:
xpath1 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]/table'
read_html(url) %>% html_node(xpath = xpath1)
# Returns an empty nodeset:
# {xml_nodeset (0)}
By removing the last term from the selector I get a non-empty nodeset, however, still no table:
xpath2 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]'
read_html(url) %>% html_node(xpath = xpath2) %>% html_table(fill = TRUE)
# Error: html_name(x) == "table" is not TRUE
What am I doing wrong? Any help would be appreciated!
EDIT: I've found that html_text() with the last xpath returns
read_html(url) %>% html_node(xpath = xpath2) %>% html_text()
[1] "Loading..."
which suggests that the table is not yet loaded when R reads the page. This would explain why it fails to see the table. Question: is there any way of bypassing that loading text?
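One option, along the lines of the Bitcoin answer above, is to skip the JavaScript-rendered page entirely and hit Yahoo's CSV download endpoint; the endpoint and parameters below mirror the BTC-USD example with the periods from this question's URL, and it is an assumption that the same endpoint works for ^FTSE (note the ^ is URL-encoded as %5E):
library(readr)
download_link <- paste0(
  "https://query1.finance.yahoo.com/v7/finance/download/%5EFTSE",
  "?period1=946684800&period2=1470441600",
  "&interval=1mo&events=history&includeAdjustedClose=true"
)
ftse <- read_csv(download_link)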