I'd like to scrape a table of NBA team stats with rvest. I've tried using:
the table element:
library(rvest)
url_nba <- "http://stats.nba.com/teams/advanced/#!?sort=TEAM_NAME&dir=-1"
team_stats <- url_nba %>% read_html %>% html_nodes('table') %>% html_table
the XPath (via Google Chrome inspect):
team_stats <- url_nba %>%
read_html %>%
html_nodes(xpath="/html/body/main/div[2]/div/div[2]/div/div/nba-stat-table/div[1]/div[1]/table") %>%
html_table
the CSS selector (via Mozilla inspect):
team_stats <- url_nba %>%
read_html %>%
html_nodes(".nba-stat-table__overflow > table:nth-child(1)") %>%
html_table
but with no luck. Any help would be greatly appreciated.
This question is very similar to this one: How to select a particular section of JSON Data in R?
The data you are requesting is not stored in the HTML, hence the failures with rvest. It is delivered via an XHR request and can be accessed directly:
library(httr)
library(jsonlite)
nba<-GET('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=' )
Once the data is loaded into the nba variable, use httr and jsonlite to clean up the data:
# access the data
out <- content(nba, as = "text") %>% fromJSON(flatten = FALSE)
# convert into a data frame; run str(out) first to inspect the structure
df <- data.frame(out$resultSets$rowSet)
names(df) <- out$resultSets$headers[[1]]
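As a follow-up, the parsed values arrive as text, so you will probably want to coerce the stat columns to numeric. A minimal sketch, assuming the first two columns are the team ID and team name (check names(df) to confirm before running):
# convert everything except the assumed identifier columns to numeric
df[-(1:2)] <- lapply(df[-(1:2)], function(x) as.numeric(as.character(x)))
str(df)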
I highly recommend reading the answer to the question which I linked above.
Related
I would like to web-scrape the table on the following website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats
I am using the following code, but it is not working. Thank you in advance.
library(rvest)
library(xml2)
library(dplyr)
link <- "https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
page<- read_html(link)
rank <- page %>% html_nodes(".sorting_2") %>% html_text()
university <- page %>% html_nodes(".ranking-institution-title") %>% html_text()
statistics <- page %>% html_nodes(".stats") %>% html_text()
The Terms of Service of this site prohibit the following: "Use data mining, robot, spider, scraping or similar automated data gathering, extraction or publication tools for any purpose."
That being said, you can read the JSON file that @QHarr found:
library(jsonlite)
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)
head(x$data) # gives you the data frame with universities
Now you have a well-structured R list. The $data element contains a data frame with the stats of each university in rows. The other three list elements only provide supplementary information.
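If you only need a few columns, here is a small sketch; the column names rank, name and scores_overall are my assumption, so confirm them against names(x$data) first:
# list the columns that are actually available
names(x$data)
# hypothetical selection -- replace with the names you actually find
rankings <- x$data[, c("rank", "name", "scores_overall")]
head(rankings)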
From the following data frame, I am trying to use the rvest package to scrape each word's part of speech and synonyms from the website https://www.thesaurus.com/browse/research?s=t into a CSV.
I am not sure how to have R look up each word of the data frame and pull its part of speech and synonyms.
install.packages("rvest")
install.packages("xml2")
library(xml2)
library(rvest)
library(dplyr)
words <- data.frame(keywords = c("research", "survey", "staff", "outpatient", "consent"))
html <- read_html("https://www.merriam-webster.com/thesaurus/research")
html %>% html_nodes(".mw-list") %>% html_text() %>%
  head(n = 1) # take the first record
If you search [your term] on thesaurus.com, you will end up on the following HTML page: "https://www.thesaurus.com/browse/[your term]". If you know this, you can build the URLs of all the pages for the terms you're interested in. After that you should be able to iterate with the map() function from the purrr package to get the information you want:
# It makes more sense to just keep "words" as a vector for now
library(purrr)
library(rvest)

words <- c("research", "survey", "staff", "outpatient", "consent")
htmls <- paste0("https://www.thesaurus.com/browse/", words)

info_list <- map(htmls, ~ .x %>%
  read_html() %>%
  html_node(".mw-list") %>%  # selector from the Merriam-Webster example above; adjust it for thesaurus.com
  html_text())
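To keep each result paired with the word that produced it and save everything to a CSV (the original goal), a rough follow-up sketch; it assumes each element of info_list is a single character string, and the file name is arbitrary:
library(tibble)

# pair each scraped block of text with its search term
synonym_df <- tibble(
  keyword  = words,
  raw_text = map_chr(info_list, ~ .x[1])
)

# export for further cleaning into part of speech / synonyms
write.csv(synonym_df, "thesaurus_results.csv", row.names = FALSE)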
I'm trying to extract a bit of information under the node /html/head/script[16] from a website (URL in the code below) but am unable to do so.
nykaa <- "https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934"
obj <- read_html(nykaa)
extracted_json <- obj %>%
html_nodes(xpath = "/html/head/script[16]") %>%
html_text(trim = TRUE)
Currently, my output for the above code is empty. I would like to extract the data under the above-mentioned node in an organized manner.
You can use a regex to grab the JavaScript object inside that script tag and then pass it to jsonlite to parse. You will need to root around a bit to get what you want, but it is all there:
library(rvest)
library(magrittr)
library(stringr)
library(jsonlite)
p <- read_html('https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934') %>% html_text()
# pull out the JSON assigned to window.__PRELOADED_STATE__ and parse it
all_data <- jsonlite::parse_json(
  str_match_all(p, 'window\\.__PRELOADED_STATE__ = (.*)')[[1]][, 2]
)
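From here it is a matter of exploring the parsed list to find the product fields you need; a minimal sketch with no assumptions about the exact element names, which can change between pages:
# top-level keys of the preloaded state
names(all_data)
# one level of structure to see where the product details sit
str(all_data, max.level = 1)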
I'm trying to scrape tabulated data on previous US statewide election results, and I think ballotpedia.org is a good place to get this data from, since its URLs follow a consistent format for all states.
Here's the code I set up to test it:
library(dplyr)
library(rvest)
# STEP 1 - URL COMPONENTS TO SCRAPE FROM
senate_base_url <- "https://ballotpedia.org/United_States_Senate_elections_in_"
senate_state_urls <- gsub(" ", "_", state.name)
senate_year_urls <- c(",_2012", ",_2014", ",_2016")
# TEST
test_url <- paste0(senate_base_url, senate_state_urls[10], senate_year_urls[2])
This results in the following URL: https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014
Using the SelectorGadget Chrome plugin, I selected the table containing the election result and tried parsing it into R as follows:
test_data <- read_html(test_url)
test_data <- test_data %>%
html_node(xpath = '//*[@id="collapsibleTable0"]') %>%
html_table()
However, I'm getting the following error:
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
Furthermore, the R object test_data yields a list with 2 empty elements.
Can anyone tell me what I'm doing wrong here? Is the html_table() function the wrong one? Using html_text() simply returns an NA character vector. Any help would be greatly appreciated, thanks very much :).
Your XPath statement is incorrect, so the html_node function returns an empty (xml_missing) node.
Here is a solution using the HTML tags: "look for a table tag within a center tag".
library(rvest)
page <- read_html(test_url)
test_data <- page %>% html_nodes("center table") %>% html_table()
Or to retrieve the fully collapsed table use the html tag with class name:
collapsedtable <- page %>% html_nodes("table.collapsible") %>%
  html_table(fill = TRUE)
This works for me:
library(httr)
library(XML)
r <- httr::GET("https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014")
XML::readHTMLTable(rawToChar(r$content))[[2]]
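To extend either answer to every state and year from the question's setup, a hedged sketch of the loop; it reuses senate_base_url, senate_state_urls and senate_year_urls from the question, keeps the "center table" selector from the first answer, and wraps each request in tryCatch because some pages may lack the table or fail to load:
library(rvest)

all_urls <- paste0(senate_base_url,
                   rep(senate_state_urls, each = length(senate_year_urls)),
                   senate_year_urls)

results <- lapply(all_urls, function(u) {
  Sys.sleep(1)  # be polite between requests
  tryCatch(
    read_html(u) %>% html_nodes("center table") %>% html_table(fill = TRUE),
    error = function(e) NULL  # keep going if a page cannot be parsed
  )
})
names(results) <- all_urls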
I am trying to scrape data from ADM finance. I am using the rvest library in R to pull the data. Below is the code I am running:
library(rvest)
url ="https://www.e-adm.com/futr/futr_composite_window.asp"
table1 = html(url) %>% html_nodes(".miniText tr:nth-child(1) td:nth-child(1) .smTextBlk") %>% html_nodes("table") %>%html_table
table2 = html(url) %>% html_nodes(".miniText tr:nth-child(1) td:nth-child(2) .smTextBlk") %>% html_nodes("table") %>%html_table
and I am getting the following warning message with no data:
Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
My objective is to pull all the tables from this website. It would be a great help if anyone could help me with the code. Thanks in advance!
library(rvest)
url <- "https://www.e-adm.com/futr/futr_composite_window.asp"
tableList <- read_html(url) %>%
html_nodes(".miniText") %>%
html_nodes("td table") %>%
html_table()
This creates a list of the nine tables on the linked website.
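A quick follow-up sketch to check what came back, with no assumptions beyond the list produced above:
# how many tables were found and their dimensions
length(tableList)
lapply(tableList, dim)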