rvest only returns the headers

The code below returns only the column headers. I have tried several approaches, but with no luck.
library(rvest)

the <- read_html("https://www.timeshighereducation.com/world-university-rankings/2018/regional-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats")
rating <- the %>%
  html_nodes("table") %>%
  html_table()
rating

The issue is that the table is loaded dynamically by JavaScript after the initial page load, so read_html() only sees the page before the table is filled in. There are several ways to deal with this.
One of the simplest in this case is to use RSelenium as a WebDriver and collect the rendered page with:
library(RSelenium)
library(rvest)

url <- "https://www.timeshighereducation.com/world-university-rankings/2018/regional-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"

# Start a Selenium server and browser session, then load the page
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate(url)

# Parse the fully rendered page source
page <- read_html(remDr$getPageSource()[[1]])
table <- page %>% html_nodes("table") %>% html_table()
table
Another way is to read the JSON payload of the underlying request the page makes; the corresponding URL is https://www.timeshighereducation.com/sites/default/files/the_data_rankings/asia_university_rankings_2018_limit0_c36ae779f4180136af6e4bf9e6fc1081.json.
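For example, a minimal sketch with jsonlite (the internal structure of the payload is an assumption here; inspect the parsed object to confirm where the rows actually live):
library(jsonlite)

# URL of the JSON payload the page fetches (copied from the browser's Network tab);
# the hash in the filename may change when the site refreshes its data
json_url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/asia_university_rankings_2018_limit0_c36ae779f4180136af6e4bf9e6fc1081.json"
raw <- fromJSON(json_url)

# Assumption: the ranking rows sit under a "data" element
rankings <- raw$data
str(rankings)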
Hope this helps.
Gottavianoni

Related

How can I scrape the complete dataset from Yahoo Finance with rvest

I'm trying to get the complete Bitcoin historical dataset from Yahoo Finance via web scraping. This is my first attempt:
library(rvest)
library(tidyverse)

crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url, css = "table")
cryp_table <- html_table(cryp_table, fill = TRUE) %>%
  as.data.frame()
In the link that I pass to read_html(), a long period of time is already selected; however, it only gets the first 101 rows, and the last row is the loading message you see when you keep scrolling. This was my second attempt, but I get the same result:
col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- col_page %>%
  html_nodes(xpath = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>%
  html_table(fill = TRUE)
cryp_final <- cryp_table[[1]]
How can I get the whole dataset?
I think you can get the download link: if you look at the Network tab in your browser's developer tools, you will see it. In this case:
"https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"
This link closely mirrors the page URL, i.e., we can rewrite the page URL into the download link and read the CSV directly. See the code:
library(stringr)
library(magrittr)

site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"

download_link <- site %>%
  # drop the site prefix up to "quote/", the "/history" path segment, and &frequency=1d
  stringr::str_remove_all(".+(?<=quote/)|/history?|&frequency=1d") %>%
  # the download endpoint uses "events" where the page uses "filter"
  stringr::str_replace("filter", "events") %>%
  stringr::str_c(base_download, .)

readr::read_csv(download_link)
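Alternatively, since the download endpoint's query parameters are now known, you can skip the string surgery and assemble the link directly; a sketch reusing the same ticker and Unix timestamps as in the question:
# Build the download URL from its known components
ticker  <- "BTC-USD"
period1 <- 1480464000  # start date as a Unix timestamp
period2 <- 1638230400  # end date as a Unix timestamp
download_link <- sprintf(
  "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%s&period2=%s&interval=1d&events=history&includeAdjustedClose=true",
  ticker, period1, period2
)
readr::read_csv(download_link)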

New to rvest package - trying to scrape a basic table from a webpage using R

I'm just trying to scrape the first table at this link (titled "Standings - Points"): https://www.fantrax.com/fantasy/league/vu0zoow2kk7bh64k/standings
Following some documentation and previous posts on here, I've tried:
library(rvest)

data <- read_html("https://www.fantrax.com/fantasy/league/vu0zoow2kk7bh64k/standings")
tables <- data %>% html_table(fill = TRUE)
and
data <- read_html("https://www.fantrax.com/fantasy/league/vu0zoow2kk7bh64k/standings")
tables <- html_nodes(data, "table")
Neither was able to pick anything up from that page; R just shows a blank return for each. I'm hoping/guessing it's something simple that I'm missing.
Using RSelenium. The page builds its tables with JavaScript, so they are absent from the raw HTML; driving a real browser and reading the rendered source works:
library(RSelenium)
library(dplyr)
library(rvest)

# Launch the browser
rD <- rsDriver(browser = "firefox", port = 4551L, verbose = FALSE)
remDr <- rD[["client"]]

url <- "https://www.fantrax.com/fantasy/league/vu0zoow2kk7bh64k/standings"
remDr$navigate(url)

# Extract the tables from the rendered page source
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table()
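When you are done, it's good practice to shut things down so the port is released (standard RSelenium cleanup):
# Close the browser session and stop the Selenium server
remDr$close()
rD$server$stop()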

Why am I getting the wrong output although the code is correct?

I'm scraping this website and I'm sure the selector for the titles is correct, but the output is not what I want.
library(rvest)

url <- "https://www.kariyer.net/is-ilanlari/#&kw=data%20scientist"
titles <- read_html(url) %>%
  html_nodes("div.col-9 a.link.position") %>%
  html_text()
How do I get rid of this result?
The URL is the culprit. This worked for me as expected:
library(rvest)

url <- "https://www.kariyer.net/is-ilanlari/kw=data%20scientist"
titles <- read_html(url) %>%
  html_nodes("div.col-9 a.link.position") %>%
  html_attr("data-title")

R Web scraping from different URLs

I am web scraping a page at
http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
From this URL, I have built up a dataframe with the following code:
library(rvest)
library(purrr)

dflist <- map(.x = 1:417, .f = function(x) {
  Sys.sleep(5)
  url <- paste0("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=",
                x, "&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=")
  read_html(url) %>%
    html_nodes(".title a") %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)
I have repeated the same code to get all the data I was interested in, and it seems to work perfectly, although it is of course a little slow due to the Sys.sleep() calls.
My issue arose once I tried to scrape the individual project descriptions that should be included in the dataframe.
For instance, the first project description is at
http://catalog.ihsn.org/index.php/catalog/7118/study-description
the second project description is at
http://catalog.ihsn.org/index.php/catalog/6606/study-description
and so forth.
My problem is that I can't find a dynamic way to scrape all the project pages and insert them into the data frame, since the number in the URLs is neither sequential nor at the end of the link.
To make things clearer, this is the structure of the website I am scraping:
1.http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
1.1. http://catalog.ihsn.org/index.php/catalog/7118
1.1.a http://catalog.ihsn.org/index.php/catalog/7118/related_materials
1.1.b http://catalog.ihsn.org/index.php/catalog/7118/study-description
1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary
I have successfully scraped level 1, but not level 1.1.b (study-description), the one I am interested in, since the dynamic element of the URL (in this case: 7118) is not consistent across the website's more than 6,000 pages at that level.
You have to extract the deeper URLs from the .title a nodes and then scrape those as well. Here's a small example of how to do that using rvest and the tidyverse:
library(tidyverse)
library(rvest)

scraper <- function(x) {
  Sys.sleep(5)
  url <- sprintf("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=%s&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=", x)
  html <- read_html(url)
  tibble(title = html_nodes(html, ".title a") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".title a") %>% html_attr("href"))
}

result <- map_df(1:2, scraper) %>%
  mutate(study_description = map(project_url,
                                 ~ read_html(sprintf("%s/study-description", .x)) %>%
                                   html_node(".xsl-block") %>%
                                   html_text()))
This isn't complete for everything you want to do, but it should show you the approach.
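One practical note: across 6,000+ study pages some requests will inevitably fail, and a single error would abort the whole mutate(). Wrapping the reader in purrr::possibly() is a common safeguard; a sketch reusing the selectors from above:
# Return NA instead of raising an error when a page cannot be read or parsed
read_description <- possibly(
  function(u) {
    read_html(sprintf("%s/study-description", u)) %>%
      html_node(".xsl-block") %>%
      html_text(trim = TRUE)
  },
  otherwise = NA_character_
)

result <- map_df(1:2, scraper) %>%
  mutate(study_description = map_chr(project_url, read_description))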

Scraping table of NBA stats with rvest

I'd like to scrape a table of NBA team stats with rvest. I've tried using:
the table element:
library(rvest)

url_nba <- "http://stats.nba.com/teams/advanced/#!?sort=TEAM_NAME&dir=-1"
team_stats <- url_nba %>% read_html() %>% html_nodes("table") %>% html_table()
the XPath (via Google Chrome's inspector):
team_stats <- url_nba %>%
  read_html() %>%
  html_nodes(xpath = "/html/body/main/div[2]/div/div[2]/div/div/nba-stat-table/div[1]/div[1]/table") %>%
  html_table()
the CSS selector (via Firefox's inspector):
team_stats <- url_nba %>%
  read_html() %>%
  html_nodes(".nba-stat-table__overflow > table:nth-child(1)") %>%
  html_table()
but with no luck. Any help would be greatly appreciated.
This question is very similar to this one: How to select a particular section of JSON Data in R?
The data you are requesting is not stored in the HTML, hence the failures with rvest. It is instead fetched through an XHR request, which can be accessed directly:
library(httr)
library(jsonlite)
library(magrittr)  # for the pipe used below

nba <- GET('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
Once the response is stored in the nba variable, use httr and jsonlite to clean up the data:
# Access the response body as text and parse the JSON
out <- content(nba, as = "text") %>% fromJSON(flatten = FALSE)

# Convert into a data frame; run str(out) to inspect the structure
df <- data.frame(out$resultSets$rowSet)
names(df) <- out$resultSets$headers[[1]]
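Note that fromJSON() returns the rowSet values as character, so the stat columns need converting before analysis; in recent R versions, base type.convert() handles this in one call:
# Convert character columns to their natural types (numeric where possible)
df <- type.convert(df, as.is = TRUE)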
I highly recommend reading the answer to the question which I linked above.
