Using rvest to scrape ASX - web-scraping

I'm trying to scrape data from the ASX (Australian Stock Exchange) site. For example, on the BHP page on the ASX, at the bottom of the page there is a collection of fundamentals data. The selector for the values, e.g. EPS, is:
#company_key_statistics > div > div.panel-body.row > div:nth-child(3) > table > tbody > tr:nth-child(8) > td
I tried
library(rvest)
ASX_bhp <- read_html("https://www2.asx.com.au/markets/company/bhp")
ASX_data <- ASX_bhp |> html_elements("td") |> html_text()
Instead of "td", I have also tried "tr", "#company_key_statistics", and the whole selector string, but all return an empty character vector. I also tried html_nodes() instead of html_elements().
How should I extract fundamental data from this site?

All that data is fetched and rendered through JavaScript, so it's not available to rvest (at least not through that URL). But you can use their API:
library(jsonlite)
bhp <- fromJSON("https://asx.api.markitdigital.com/asx-research/1.0/companies/bhp/key-statistics")
bhp$data$earningsPerShare
#> [1] 5.95708
Created on 2022-09-19 with reprex v2.0.2
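
If you need more figures than EPS, or the same figures for several companies, the call generalises easily. A minimal sketch, assuming the endpoint follows the same URL pattern for every ASX ticker code (worth confirming in the browser's network tab first):
library(jsonlite)

# Hedged sketch: assumes the key-statistics endpoint uses the same URL
# pattern for every ASX ticker code.
get_key_stats <- function(ticker) {
  url <- paste0("https://asx.api.markitdigital.com/asx-research/1.0/companies/",
                tolower(ticker), "/key-statistics")
  fromJSON(url)$data
}

stats <- get_key_stats("bhp")
names(stats)             # list every statistic the endpoint returns
stats$earningsPerShare   # the EPS value shown above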

Related

read_html not pulling info from tables

I am trying to gather the locations of the documents found on the SEC website. When I use the read_html() function, the table selector returns an empty set and I'm not sure why. When I inspect the elements in my web browser, I can see that the nodes in the table are populated, but that information is not being carried over into my session.
test_url <- "https://www.sec.gov/edgar/search/#/dateRange=custom&entityName=(CIK%25200000887568)&startdt=1980-01-01&enddt=2021-06-23&filter_forms=10-K"
pg <- read_html(test_url) %>%
  html_nodes(css = "#hits > table")
# But it's empty
xml_attrs(pg, xml_child(pg[[1]], 2))
[[1]]
class
"table"
Thank you for any and all help!
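This looks like the same situation as the ASX question above: everything after the "#" in that URL is a client-side route that is never sent to the server, so read_html() only receives the empty page shell, and the results table is filled in afterwards by JavaScript. A hedged check, just to confirm there is nothing static to scrape:
library(rvest)

# Hedged check: the fragment (everything after "#") never reaches the server,
# so only the bare search shell comes back; the hits table has no data rows.
pg <- read_html("https://www.sec.gov/edgar/search/")
length(html_nodes(pg, "#hits table tr"))
#> 0 rows here would confirm the results are injected later by JavaScript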

Finding all csv links from website using R

I am trying to download the data files from the ICE website (https://www.theice.com/clear-us/risk-management#margin-rates) containing info on margin strategy. I tried to do so by applying the following code in R:
library(rvest)
library(stringr)

page <- read_html("https://www.theice.com/clear-us/risk-management#margin-rates")
raw_list <- page %>%      # takes the page above for which we've read the html
  html_nodes("a") %>%     # find all links in the page
  html_attr("href") %>%   # get the url for these links
  str_subset("\\.csv")    # find those that end in csv only
However, it only finds two csv files. That is, it doesn't detect any of the files displayed when clicking on Margin Rates and going to Historic ICE Risk Model Parameter. See below:
raw_list
[1] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Asset_Haircuts_History.csv"
[2] "/publicdocs/iosco_reporting/haircut_history/icus/ICUS_Currency_Haircuts_History.csv"
I am wondering how I can do that so later on I can select the files and download them.
Thanks a lot in advance
We can look at the network traffic in browser devtools to find the url for each dropdown action.
The Historic ICE Risk Model Parameter dropdown pulls from this page:
https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml;jsessionid=7945F3FE58331C88218978363BA8963C?getParameterFileTable&category=Historical
We remove the jsessionid (per QHarr's comment) and use that as our endpoint:
endpoint <- "https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml?getParameterFileTable&category=Historical"
page <- read_html(endpoint)
Then we can get the full csv list:
raw_list <- page %>%
  html_nodes(".table-partitioned a") %>%  # add specificity as QHarr suggests
  html_attr("href")
Output:
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210310.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERMONTH_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_INTERCONTRACT_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_SCANNING_20210226.CSV'
'/publicdocs/clear_us/irmParameters/ICUS_MARGIN_STRATEGY_20210226.CSV'
...
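From there, downloading is just a matter of prepending the site's base URL to each relative path. A sketch (the MARGIN_STRATEGY filter and the destination folder are arbitrary examples):
# Sketch: turn the relative hrefs into full URLs and download them.
base_url <- "https://www.theice.com"
dest_dir <- "ice_margin_files"                 # example local folder
dir.create(dest_dir, showWarnings = FALSE)

strategy_files <- raw_list[grepl("MARGIN_STRATEGY", raw_list)]  # e.g. just the strategy files
for (f in strategy_files) {
  download.file(paste0(base_url, f), file.path(dest_dir, basename(f)), mode = "wb")
}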
It seems the page does not load that part of the content immediately, so it is missing from your request. The network monitor indicates that a file "ClearUSRiskArrayFiles.shtml" is loaded about 400 ms later. That file seems to provide the required links once you specify the year and month in the URL.
library(rvest)
library(stringr)
page <- read_html("https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml?getRiskArrayTable=&type=icus&year=2021&month=03")
raw_list <- page %>%   # takes the page above for which we've read the html
  html_nodes("a") %>%  # find all links in the page
  html_attr("href")
head(raw_list[grepl("csv", raw_list)], 3L)
#> [1] "/publicdocs/irm_files/icus/2021/03/NYB0312E.csv.zip"
#> [2] "/publicdocs/irm_files/icus/2021/03/NYB0311E.csv.zip"
#> [3] "/publicdocs/irm_files/icus/2021/03/NYB0311F.csv.zip"
Created on 2021-03-12 by the reprex package (v1.0.0)
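Since the year and month are plain query parameters in that URL, looping over several periods is straightforward. A sketch (the months chosen are arbitrary examples):
# Sketch: build one URL per month and collect every csv link it lists.
months <- sprintf("%02d", 1:3)    # example: Jan-Mar 2021
urls <- paste0("https://www.theice.com/iceriskmodel/ClearUSRiskArrayFiles.shtml",
               "?getRiskArrayTable=&type=icus&year=2021&month=", months)

all_links <- unlist(lapply(urls, function(u) {
  read_html(u) %>% html_nodes("a") %>% html_attr("href")
}))
all_links <- all_links[grepl("\\.csv", all_links)]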

rvest scraping, getting td for particular th (translation from Python)

Hi StackOverflow users,
Sorry for a silly question.
My question is a bit general, but here's an example:
Suppose I'm scraping Wikipedia infobox info on the official webpages of US cities. So, for a given list of Wikipedia URLs, I need the last row of the infobox (the box on the right of the page), which contains the website information.
In Python I would do it this way; however, I cannot figure out how to do it in R:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Los_Angeles")
if r:
    text = r.text
    soup = BeautifulSoup(text, 'lxml')

def get_website(soup):
    for tr in soup.find("table", class_="infobox")("tr"):
        if tr.th and 'Website' in tr.th.text:
            print(tr.td)
            s = tr.td.p.string
            return s
There's a better way in both Python & R via XPath.
library(rvest)
pg <- read_html("https://en.wikipedia.org/wiki/Los_Angeles")
html_node(pg, xpath = ".//table[contains(@class,'infobox') and
                        tr[contains(., 'Website')]]/tr[last()]/td//a") -> last_row_link
html_text(last_row_link)
## [1] "Official website"
html_attr(last_row_link, "href")
## [1] "https://www.lacity.org/"
I made the assumption that you really wanted the href attribute of the link in the last <tr>, but the last() expression in the XPath is the essential ingredient. The final td//a says (essentially) "once you find the <td> in the <tr> we just found, look in its element subtree for an anchor tag".
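Since the question mentions a list of Wikipedia URLs, the same XPath can be wrapped in a small helper and applied to each page. A sketch (the second city URL is just an example):
# Sketch: reuse the XPath above for each page in a vector of city URLs.
get_official_site <- function(url) {
  pg <- read_html(url)
  link <- html_node(pg, xpath = ".//table[contains(@class,'infobox') and
                                  tr[contains(., 'Website')]]/tr[last()]/td//a")
  html_attr(link, "href")   # NA if a page has no matching infobox row
}

city_pages <- c("https://en.wikipedia.org/wiki/Los_Angeles",
                "https://en.wikipedia.org/wiki/Chicago")     # example URLs
vapply(city_pages, get_official_site, character(1))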
Is there any particular identifier for the td or th you want?
But if you want the tr element of the table with class infobox, similar to your code, here is what I would do:
require(rvest)
# read the webpage
webpage <- read_html("https://en.wikipedia.org/wiki/Los_Angeles")
# extract the url-link element of table with class infobox
your_infobox_tr <- webpage %>% html_nodes(".infobox") %>% html_nodes(".url>a")
# extract the href link content
your_href <- your_infobox_tr %>% html_attr(name='href')
Or if you like a one-liner
your_wanted_link <- read_html("https://en.wikipedia.org/wiki/Los_Angeles") %>% html_nodes(".infobox") %>% html_nodes(".url>a") %>% html_attr(name="href")
FYI: if you do not know what %>% is, it is the pipe operator, which you get by installing the magrittr package.

Scraping Kickstarter With R?

I tried to scrape Kickstarter. However, I do not get a result when I try to extract the URLs that refer to the projects.
This should be one of the results:
https://www.kickstarter.com/projects/1534822242/david-bowie-hunger-city-photo-story?ref=category_ending_soon
and this is my code:
main.page1 <- read_html(x = "https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
urls1 <- main.page1 %>%                            # feed `main.page1` to the next step
  html_nodes(".block.img-placeholder.w100p") %>%   # get the CSS nodes
  html_attr("href")                                # extract the URLs
Does anyone see where I go wrong?
First declare all the packages you use - I had to go search to realise I needed rvest:
> library(rvest)
> library(dplyr)
Get your HTML:
> main.page1 <- read_html(x ="https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
As that stands, the data for each project is stashed in a data-project attribute in a bunch of divs. Some Javascript (I suspect built using the React framework) in the browser will normally fill the other DIVs in and get the images, format the links etc. But you have just grabbed the raw HTML so that isn't available. But the raw data is.... So....
The relevant divs appear to be class "react-disc-landing" so this gets the data as text strings:
> data = main.page1 %>%
    html_nodes("div.react-disc-landing") %>%
    html_attr("data-project")
These things appear to be JSON strings:
> substr(data[[1]],1,80)
[1] "{\"id\":208460273,\"photo\":{\"key\":\"assets/017/007/465/9b725fdf5ba1ee63e8987e26a1d33"
So let's use the rjson package to decode the first one:
> library(rjson)
> jdata = fromJSON(data[[1]])
jdata is now a very complex nested list. Use str(jdata) to see what is in it. I'm not sure what bit of it you want, but maybe this URL:
> jdata$urls$web$project
[1] "https://www.kickstarter.com/projects/1513052868/sense-of-place-by-jose-davila"
If not, the URL you want must be in that structure somewhere.
Repeat over data[[i]] to get all links.
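For example, as a quick sketch:
# Sketch: decode every data-project string and pull out the project URL.
project_urls <- sapply(data, function(x) fromJSON(x)$urls$web$project)
head(project_urls, 3)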
Note that you should check the site's T&Cs to make sure you are allowed to do this, and also see if there's an API you should really be using.

Clean Data Scraped from teambhp website using rvest in R

I am scraping data in R using the rvest package. I want to scrape user comments and reviews from car pages on teambhp.com.
I am doing this for the link below:
Team BHP review
I am writing the following code in R:
library(rvest)
library(httr)
library(httpuv)
team_bhp <- read_html(httr::GET("http://www.team-bhp.com/forum/official-new-car-reviews/172150-tata-zica-official-review.html"))
all_tables <- team_bhp %>%
  html_nodes(".tcat:nth-child(1) , #posts strong , hr+ div") %>%
  html_text()
But I am getting all the text in one list, and it still contains extra spaces and "\t\n" even though I apply the html_text() function. How can I clean it and convert it to a data frame?
Also, I want to do this for all the car reviews available on the website. How can I recursively traverse all of them?
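As a starting point, a minimal cleaning sketch: collapse runs of tabs and newlines into single spaces, trim the ends, drop empty strings, and put the rest into a data frame (the column name is arbitrary):
# Sketch: squeeze all whitespace runs (spaces, \t, \n) to single spaces,
# trim the ends, and drop strings that end up empty.
cleaned <- trimws(gsub("\\s+", " ", all_tables))
cleaned <- cleaned[cleaned != ""]
reviews <- data.frame(text = cleaned, stringsAsFactors = FALSE)
head(reviews)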
