Web scraping data from a finance website using R (rvest)

I am trying to scrape data from ADM finance using the rvest library in R. Below is the code I am running:
library(rvest)
url ="https://www.e-adm.com/futr/futr_composite_window.asp"
table1 <- html(url) %>%
  html_nodes(".miniText tr:nth-child(1) td:nth-child(1) .smTextBlk") %>%
  html_nodes("table") %>%
  html_table()
table2 <- html(url) %>%
  html_nodes(".miniText tr:nth-child(1) td:nth-child(2) .smTextBlk") %>%
  html_nodes("table") %>%
  html_table()
and I get the following warning message and no data:
Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
My objective is to pull all the tables from this website. It would be a great help if anyone could help me with the code. Thanks in advance!

library(rvest)

url <- "https://www.e-adm.com/futr/futr_composite_window.asp"
tableList <- read_html(url) %>%
  html_nodes(".miniText") %>%
  html_nodes("td table") %>%
  html_table()
This creates a list of the 9 tables on the linked page.
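If the goal is a single data frame rather than a list, the tables can be named and stacked. A minimal sketch, assuming the tables have compatible column types (dplyr's `bind_rows()` will error if the same column is character in one table and numeric in another):

```r
library(rvest)
library(dplyr)

url <- "https://www.e-adm.com/futr/futr_composite_window.asp"

tableList <- read_html(url) %>%
  html_nodes(".miniText") %>%
  html_nodes("td table") %>%
  html_table()

# Name each table by its position so individual tables stay easy to
# pull out, then stack them with an id column recording the source.
names(tableList) <- paste0("table_", seq_along(tableList))
combined <- bind_rows(tableList, .id = "source_table")
```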

Related

html_nodes return an empty list for complex table

I would like to scrape the table on the following website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats
I am using the following code, but it is not working. Thank you in advance.
library(rvest)
library(xml2)
library(dplyr)
link <- "https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
page <- read_html(link)
rank <- page %>% html_nodes(".sorting_2") %>% html_text()
university <- page %>% html_nodes(".ranking-institution-title") %>% html_text()
statistics <- page %>% html_nodes(".stats") %>% html_text()
The Terms of Service of this site state that you may not "Use data mining, robot, spider, scraping or similar automated data gathering, extraction or publication tools for any purpose."
That said, you can read the JSON file that @QHarr found:
library(jsonlite)
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)
head(x$data) # gives you the data frame of universities
Now you have a well-structured R list. The $data element contains a data frame with the stats of each university in rows; the other three list elements only provide supplementary information.
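To work with just a few columns of that data frame, ordinary subsetting is enough. A sketch, assuming the JSON dump exposes fields named `rank`, `name`, and `scores_overall` (the exact field names come from the live file and may change):

```r
library(jsonlite)

url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json"
x <- read_json(url, simplifyVector = TRUE)

# Keep only rank, institution name and overall score; the column
# names here are assumptions about the current JSON schema.
rankings <- x$data[, c("rank", "name", "scores_overall")]
head(rankings)
```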

rvest only returns the headers

The code below returns only the column headers. I have tried several ways to do it, but with no luck.
library(rvest)
the <- read_html("https://www.timeshighereducation.com/world-university-rankings/2018/regional-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats")
rating <- the %>%
  html_nodes("table") %>%
  html_table()
rating
The issue is that the page is delivered before the table is loaded; the table is filled in afterwards by JavaScript. There are several ways to work around this.
One of the simplest in this case is to use RSelenium as a webdriver and collect the rendered results with:
library(RSelenium)
library(rvest)
url <- "https://www.timeshighereducation.com/world-university-rankings/2018/regional-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate(url)
page <- read_html(remDr$getPageSource()[[1]])
table <- page %>% html_nodes("table") %>% html_table()
table
Another way is to interpret the JSON result of the website's data request; the corresponding URL is https://www.timeshighereducation.com/sites/default/files/the_data_rankings/asia_university_rankings_2018_limit0_c36ae779f4180136af6e4bf9e6fc1081.json.
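That JSON endpoint can be read directly with jsonlite, in the same way as in the previous answer. A sketch, assuming (based on the 2021 file discussed earlier) that the `$data` element holds one row per university with `rank` and `name` fields:

```r
library(jsonlite)

# URL taken from the answer above; the structure of the response is an
# assumption carried over from the 2021 rankings JSON.
url <- "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/asia_university_rankings_2018_limit0_c36ae779f4180136af6e4bf9e6fc1081.json"
x <- read_json(url, simplifyVector = TRUE)

asia <- x$data
head(asia[, c("rank", "name")])
```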
Hope this helps.

Scraping table with rvest

I'm attempting to scrape this table of occupation codes from IPUMS: https://usa.ipums.org/usa-action/variables/OCC1950#codes_section.
I have (ugly) code that works (see below), but I was trying to use rvest and failing. I was wondering if there is a tidier way to do it?
Inspecting the page, I think the table has the selector #dataTable > table > tbody, but I haven't seen > in a selector before, and putting that selector in html_node() doesn't work:
"https://usa.ipums.org/usa-action/variables/OCC1950#codes_section" %>%
  read_html() %>%
  html_node('#dataTable > table > tbody') %>%
  html_table()
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
Nor does using just #dataTable as the selector:
"https://usa.ipums.org/usa-action/variables/OCC1950#codes_section" %>%
  read_html() %>%
  html_node('#dataTable') %>%
  html_table()
Error: html_name(x) == "table" is not TRUE
The following does work, using RCurl, jsonlite, and stringr to slice through the page source. Perhaps the fact that the table seems to be embedded as JSON explains why my rvest attempt fails, but I'm not sure.
library(RCurl)
library(jsonlite)
library(magrittr)
library(stringr)
txt <- "https://usa.ipums.org/usa-action/variables/OCC1950#codes_section" %>%
  getURL()

ipums <- txt %>%
  str_extract(".*Farmer.*") %>%
  str_extract("\\[.*false\\}\\]") %>%
  fromJSON()
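A quick diagnostic helps explain both rvest errors above. Most likely the node matched by `#dataTable` is a container `<div>` that JavaScript fills with the table after load, so the raw HTML contains no `<table>` there; checking the node's tag name makes this visible (that the result is "div" is an assumption about the live page, sketched here rather than verified):

```r
library(rvest)
library(magrittr)

# If this prints something other than "table" (e.g. "div"), html_table()
# cannot apply, which matches the 'html_name(x) == "table"' error.
"https://usa.ipums.org/usa-action/variables/OCC1950#codes_section" %>%
  read_html() %>%
  html_node("#dataTable") %>%
  html_name()
```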

Scraping table of NBA stats with rvest

I'd like to scrape a table of NBA team stats with rvest. I've tried using:
the table element
library(rvest)
url_nba <- "http://stats.nba.com/teams/advanced/#!?sort=TEAM_NAME&dir=-1"
team_stats <- url_nba %>% read_html %>% html_nodes('table') %>% html_table
the xpath (via google chrome inspect)
team_stats <- url_nba %>%
read_html %>%
html_nodes(xpath="/html/body/main/div[2]/div/div[2]/div/div/nba-stat-table/div[1]/div[1]/table") %>%
html_table
the css selector (via mozilla inspect):
team_stats <- url_nba %>%
read_html %>%
html_nodes(".nba-stat-table__overflow > table:nth-child(1)") %>%
html_table
but with no luck. Any help would be greatly appreciated.
This question is very similar to this one: How to select a particular section of JSON Data in R?
The data you are requesting is not stored in the HTML, hence the failures with rvest. It is instead delivered by an XHR request, which can be accessed directly:
library(httr)
library(jsonlite)
library(magrittr)  # for the %>% pipe used below
nba <- GET('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
Once the response is loaded into the nba variable, use httr and jsonlite to clean up the data:
# access the data
out <- content(nba, as = "text") %>% fromJSON(flatten = FALSE)

# convert into a data frame
# run str(out) to determine the structure
df <- data.frame(out$resultSets$rowSet)
names(df) <- out$resultSets$headers[[1]]
I highly recommend reading the answer to the question which I linked above.
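One follow-up worth noting: the values inside rowSet arrive as character strings, so the stat columns usually need converting before analysis. A sketch of that cleanup; treating every column except TEAM_ID and TEAM_NAME as numeric is an assumption about this particular endpoint:

```r
library(httr)
library(jsonlite)

nba <- GET('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
out <- fromJSON(content(nba, as = "text"), flatten = FALSE)

df <- data.frame(out$resultSets$rowSet, stringsAsFactors = FALSE)
names(df) <- out$resultSets$headers[[1]]

# Convert the stat columns from character to numeric; which columns are
# genuinely numeric is an assumption about this endpoint's schema.
num_cols <- setdiff(names(df), c("TEAM_ID", "TEAM_NAME"))
df[num_cols] <- lapply(df[num_cols], as.numeric)
```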

Web scraping - selection of a table

I'm trying to extract the table of historical data from Yahoo Finance website.
First, by inspecting the source code I found that it is actually a table, so I suspect that html_table() from rvest should be able to work with it; however, I can't find a way to reach it from R. I've tried providing the function with just the full page, but it did not fetch the right table:
url <- "https://finance.yahoo.com/quote/^FTSE/history?period1=946684800&period2=1470441600&interval=1mo&filter=history&frequency=1mo"
read_html(url) %>% html_table(fill = TRUE)
# Returns only:
# [[1]]
# X1 X2
# 1 Show all results for Tip: Use comma to separate multiple quotes Search
Second, I've found an xpath selector for the particular table, but I am still unsuccessful in fetching the data:
xpath1 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]/table'
read_html(url) %>% html_node(xpath = xpath1)
# Returns an empty nodeset:
# {xml_nodeset (0)}
By removing the last term from the selector I get a non-empty nodeset, however, still no table:
xpath2 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]'
read_html(url) %>% html_node(xpath = xpath2) %>% html_table(fill = TRUE)
# Error: html_name(x) == "table" is not TRUE
What am I doing wrong? Any help would be appreciated!
EDIT: I've found that html_text() with the last xpath returns
read_html(url) %>% html_node(xpath = xpath2) %>% html_text()
[1] "Loading..."
which suggests that the table had not yet loaded when R read the page. This would explain why it failed to see the table. Question: is there any way of bypassing that loading text?
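This is the same JavaScript-loading situation as the rankings question earlier in this thread, so the same RSelenium workaround should apply: render the page in a real browser, wait, then parse the rendered source. A sketch; the Sys.sleep delay and the assumption that the history table is among the `<table>` nodes after rendering are both guesses about the live page:

```r
library(RSelenium)
library(rvest)

url <- "https://finance.yahoo.com/quote/^FTSE/history?period1=946684800&period2=1470441600&interval=1mo&filter=history&frequency=1mo"

# Start a browser session, navigate, and let the JavaScript run.
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(5)  # crude wait for the table to finish loading

# Parse the rendered page source instead of the raw HTML.
page <- read_html(remDr$getPageSource()[[1]])
tables <- page %>% html_nodes("table") %>% html_table(fill = TRUE)
remDr$close()
```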
