I'm trying to scrape tabulated data on previous US statewide election results, and I think ballotpedia.org is a good place to be getting this data from - as URLs are in a consistent format for all states.
Here's the code I set up to test it:
library(dplyr)
library(rvest)
# STEP 1 - URL COMPONENTS TO SCRAPE FROM
senate_base_url <- "https://ballotpedia.org/United_States_Senate_elections_in_"
senate_state_urls <- gsub(" ", "_", state.name)
senate_year_urls <- c(",_2012", ",_2014", ",_2016")
# TEST
test_url <- paste0(senate_base_url, senate_state_urls[10], senate_year_urls[2])
this results in the following URL: https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014
Using the 'selectorgadget' chrome plugin, I selected the table in question containing the election result, and tried parsing it into R as follows:
test_data <- read_html(test_url)
test_data <- test_data %>%
html_node(xpath = '//*[#id="collapsibleTable0"]') %>%
html_table()
However, I'm getting the following error:
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
Furthermore, the R object test_data yields a list with 2 empty elements.
Can anyone tell me what I'm doing wrong here? Is the html_table() function the wrong one? Using html_text() simply returns an NA character vector. Any help would be greatly appreciated, thanks very much :).
Your xpath statement is incorrect, thus the html_node function is returning a null value.
Here is a solution using the html tags. "Look for a table tag within a center tag"
library(rvest)
test_data <- read_html(test_url)
test_data <- test_data %>% html_nodes("center table") %>% html_table()
Or to retrieve the fully collapsed table use the html tag with class name:
collapsedtable<-test_data %>% html_nodes("table.collapsible") %>%
html_table(fill=TRUE)
this works for me:
library(httr)
library(XML)
r <- httr::GET("https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014")
XML::readHTMLTable(rawToChar(r$content))[[2]]
Related
I would like to webscraping the table in the following website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats
I am using the following code but it is not working, thank you in advance.
library(rvest)
library(xml2)
library(dplyr)
link <- "https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
page<- read_html(link)
rank<- page %>% html_nodes(".sorting_2") %>% html_text()
university<-page %>% html_nodes(".ranking-institution-title ") %>% html_text()
statistics<-page %>% html_nodes(".stats") %>% html_text()
The Terms and Services of this site state: "Use data mining, robot, spider, scraping or similar automated data gathering, extraction or publication tools for any purpose."
That being said, you can read the json file that #QHarr found:
library(jsonlite)
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)
head(x$data) # give you the data frame with universities
Now you have a well structured R list. The $data element contains a data frame with the stats of each university in rows. The other 3 list elements only provide supplementary information.
I'm trying to extract a bit of information under the node /html/head/script[16] from a website (here) but am unable to do so.
nykaa <- "https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934"
obj <- read_html(nykaa)
extracted_json <- obj %>%
html_nodes(xpath = "/html/head/script[16]") %>%
html_text(trim = TRUE)
Currently, my output for the above code is null. But I would like to extract the data under the above mentioned node in an organized manner.
You can use regex to grab the javascript object inside that script tag and then pass to jsonlite and parse. You need to root around a bit to get what you want from that but it is all there
library(rvest)
library(magrittr)
library(stringr)
library(jsonlite)
p <- read_html('https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934') %>% html_text()
all_data <- jsonlite::parse_json(str_match_all(p,'window\\.__PRELOADED_STATE__ = (.*)')[[1]][,2])
I'm trying to extract the table of historical data from Yahoo Finance website.
First, by inspecting the source code I've found that it's actually a table, so I suspect that html_table() from rvest should be able to work with it, however, I can't find a way to reach it from R. I've tried providing the function with just the full page, however, it did not fetch the right table:
url <- https://finance.yahoo.com/quote/^FTSE/history?period1=946684800&period2=1470441600&interval=1mo&filter=history&frequency=1mo
read_html(url) %>% html_table(fill = TRUE)
# Returns only:
# [[1]]
# X1 X2
# 1 Show all results for Tip: Use comma to separate multiple quotes Search
Second, I've found an xpath selector for the particular table, but I am still unsuccessful in fetching the data:
xpath1 <- '//*[#id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]/table'
read_html(url) %>% html_node(xpath = xpath1)
# Returns an empty nodeset:
# {xml_nodeset (0)}
By removing the last term from the selector I get a non-empty nodeset, however, still no table:
xpath2 <- '//*[#id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]'
read_html(url) %>% html_node(xpath = xpath2) %>% html_table(fill = TRUE)
# Error: html_name(x) == "table" is not TRUE
What am I doing wrong? Any help would be appreciated!
EDIT: I've found that html_text() with the last xpath returns
read_html(url) %>% html_node(xpath = xpath2) %>% html_text()
[1] "Loading..."
which suggests that the table is not yet loaded when R did the read. This would explain why it failed to see the table. Question: any ways of bypassing that loading text?
I am trying to scrape the data corresponding to Table 5 from the following link: https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls
As suggested, I used SelectorGadget to find the relevant CSS match, and the one I found that contained all the data (as well as some extraneous information) was "#page_content"
I've tried the following code, which yield errors:
fbi <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
fbi %>%
html_node("#page_content") %>%
html_table()
Error: html_name(x) == "table" is not TRUE
#Try extracting only the first column:
fbi %>%
html_nodes(".group0") %>%
html_table()
Error: html_name(x) == "table" is not TRUE
#Directly feed fbi into html_table
data = fbi %>% html_table(fill = T)
#This output creates a list of 3 elements, where within list 1 and 3, there are many missing values.
Any help would be greatly appreciated!
You can download the excel file directly. After that you should look into the excel file and take data that you want into a csv file. After that you can work on the data. Below is the code for doing the same.
library(rvest)
library(stringr)
page <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")
pageAdd <- page %>%
html_nodes("a") %>% # find all links
html_attr("href") %>% # get the url
str_subset("\\.xls") %>% # find those that end in xls
.[[1]]
mydestfile <- "D:/Kumar/table5.xls" # change the path and file name as per your system
download.file(pageAdd, mydestfile, mode="wb")
The data is not in a very formatted way. Hence downloading it in R, will be more confusing. To me this appears to be the best way to solve your problem.
I am trying to extract a table using html_table and the rvest package
library(rvest)
test <- html("http://www.privacyrights.org/data-breach/new?title=")
test %>% html_table(html_nodes("table.data-breach-table")[[1]])
however, I keep getting an error
Error in UseMethod("html_nodes"): no applicable method for
'html_nodes' applied to an object of class "character"
If you are going to nest parenthesized calls anyway, why bother with piping?
html_table(html_nodes(test, "table.data-breach-table")[[1]])
Otherwise go full pipe and use magrittr as well:
library(magrittr)
test %>%
html_nodes("table.data-breach-table") %>%
extract2(1) %>%
html_table()
NOTE:
the URL you are using does not have the table you want anyway
you should be using the newest rvest and using read_html
As far as why it wasn't working, you were piping test incorrectly and html_nodes was operating on the tableā¦ string instead of the parsed HTML document it expects.
Since you're trying to scrape breaches, this may be of help:
library(rvest)
library(dplyr)
library(pbapply)
urls <- sprintf("http://www.privacyrights.org/data-breach?title=&page=%d", 1:94)
pblapply(urls, function(URL) {
pg <- read_html(URL)
tab <- html_nodes(pg, "table")[3]
rows <- html_nodes(tab, "tr:not(.data-breach-bottom)")
bind_rows(lapply(seq(2, length(rows)-2, by=2), function(i) {
tds_1 <- html_nodes(rows[i], "td")
tds_2 <- html_text(html_nodes(rows[i+1], "td"), trim=TRUE)
data_frame(date_public=html_text(tds_1[1], TRUE),
name_loc=html_text(tds_1[2], TRUE),
entity=html_text(tds_1[3], TRUE),
type=html_text(tds_1[4], TRUE),
recs=html_text(tds_1[5], TRUE),
descr=tds_2[1])
}))
}) -> things
It's from an older gitst of mine. You'll need to add a randomized sleep delay to that if you do plan on scraping all their breaches.
NOTE also that it's skewed data and be very aware of it's limitations as you attempt to use it (I do data breach research for a living).