Scraping a table from a website using rvest - r

I am trying to scrape the table from the treasury website.
https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldYear&year=2019
What I have so far downloads the page:
library("rvest")
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldAll"
data <- url %>%
html()
But I cannot seem to get it into a table format with the following:
data %>%
html_table()

It's better to first use CSS to locate the node that contains the table. The table is big (around 7,400 rows); it took about 30 seconds to render using html_table().
library(rvest)
library(httr)
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldAll"
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
# html_session() was renamed to session() in rvest 1.0.0
data <- session(url, user_agent(ua))
# Select the table node by CSS first, then parse only that node
data %>%
html_node("table.t-chart") %>%
html_table()
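
To illustrate the "select the node first, then parse it" idea without hitting the network, here is a minimal sketch on a made-up inline page. The HTML, the "nav" table, and the column names are assumptions for illustration only; the class name t-chart is the one used in the answer above.

```r
library(rvest)

# Hypothetical page with two tables; only the second has class "t-chart"
page <- read_html('<html><body>
  <table class="nav"><tr><td>menu</td></tr></table>
  <table class="t-chart">
    <tr><th>Date</th><th>1 Mo</th></tr>
    <tr><td>01/02/2019</td><td>2.40</td></tr>
    <tr><td>01/03/2019</td><td>2.42</td></tr>
  </table>
</body></html>')

# Narrow down to the one table of interest, then convert just that node
rates <- page %>%
  html_element("table.t-chart") %>%
  html_table()

print(rates)
```

Calling html_table() on the whole document would instead return a list with one entry per table, so selecting the node first keeps the result to a single data frame.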

Related

spotifyr web scraping: "Request Failed [429]" error

I want to get artist data from the Spotify API in R using spotifyr.
First I scrape the data and save what I need in data frames. Then I try to get artist details with the get_artist_audio_features() function from the spotifyr library, but every time I encounter the same error.
My code is below (the first part can be ignored; you may run it as is to scrape the data and proceed directly to the last chunk of code):
library(rvest)
library(dplyr)
library(spotifyr)
upcoming_artists <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists)
upcoming_artists <- upcoming_artists %>%
html_elements("tbody") %>%
html_table() %>%
`[[`(1) %>%
tidyr::separate_rows(X2, sep = "\n")
upcoming_artists_india = upcoming_artists[33:35,]
# Preallocate an empty list with one slot per artist
upcoming_artists_india_list = vector("list", length = nrow(upcoming_artists_india))
for(i in 1:nrow(upcoming_artists_india))
{
upcoming_artists_india$X2[i] <- paste("'",upcoming_artists_india$X2[i],"'",sep="")
}
for(i in 1:nrow(upcoming_artists_india))
{
upcoming_artists_india_list[[i]] <- get_artist_audio_features(upcoming_artists_india$X2[i])
}
The error code which I'm getting is:
Request failed [429]
This error is always followed by some random time and the entire error may look something like:
Request failed [429]. Retrying in 59423 seconds...
The code hasn't run a single time and I'm unable to comprehend the error.
You have to pass a user agent as a parameter:
upcoming_artists <- read_html(upcoming_artists,user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36")

rvest visualize current session html page

I am able to navigate to a password protected website using rvest and session()
library(rvest)
library(httr)
url_login <- "https://www.myurl.com/Home/Login"
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
url_myProfile <- "https://www.myurl.com/?page=profile"
pgsession<-session(url_login, user_agent(uastring))
pgsession$url
pgform<-html_form(pgsession)[[1]]
filled_form<-html_form_set(pgform, loginID="xxx", pinCode="yyy")
mysess <- session_submit(pgsession, filled_form)
mysess <- session_jump_to(pgsession, url_myProfile)
Now I would like to view the exact html of where the session is at the moment, potentially by saving the html locally. Using the below, I see a page which seems to be correct, but without any images for example, so I was wondering if there was a better way to ensure the session is where I need it to be.
myhtml <- read_html(mysess)
write_xml(myhtml, "myprofile.html")

Scrape data table using xpath in R

I am fairly familiar with R, but have 0 experience with web scraping. I had looked around and cannot seem to figure out why my web scraping is "failing." Here is my code including the URL I want to scrape (the ngs-data-table to be specific):
library(rvest)
webpage <- read_html("https://nextgenstats.nfl.com/stats/rushing/2020/REG/1#yards")
tbls <- html_nodes(webpage, xpath = '/html/body/div[2]/div[3]/main/div/div/div[3]')
# also attempted the XPath '//*[@id="stats-rushing-view"]/div[3]', but neither worked
tbls
I am not receiving any errors with the code but I am receiving:
{xml_nodeset (0)}
I know this isn't a ton of code, I have tried multiple different xpaths as well. I know I will eventually need more code to be more specific for web-scraping, but I figured even the code above would at least begin to point me in the right direction? Any help would be appreciated. Thank you!
The data is stored as JSON. Here is a method to download and process that file.
library(httr)
#URL for week 6 data
url <- "https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=6"
#create a user agent
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
#download the information
resp <- httr::GET(url, verbose(), user_agent(ua), add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))
#parse the JSON body; flatten = TRUE flattens nested data frames into one
answer <- jsonlite::fromJSON(content(resp, as = "text"), flatten = TRUE)
answer$stats
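
As a rough offline sketch of what the fromJSON(..., flatten = TRUE) step does, the snippet below parses a small made-up payload. The field names (season, stats, player, name, yards) are hypothetical and are not the real response schema of the NFL endpoint.

```r
library(jsonlite)

# Hypothetical JSON text standing in for the body returned by the API
txt <- '{"season": 2020,
         "stats": [
           {"player": {"name": "A"}, "yards": 100},
           {"player": {"name": "B"}, "yards": 80}
         ]}'

# flatten = TRUE turns the nested "player" object into a plain column
answer <- fromJSON(txt, flatten = TRUE)
stats <- answer$stats

print(names(stats))
print(stats)
```

Without flatten = TRUE, the player column would itself be a data frame nested inside stats, which is harder to work with in most downstream code.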
The content of that table is generated dynamically: check it by saving the page from your browser (or, with your code, write_html(webpage,'test.html')) and then opening the saved file. So you probably can't capture it with rvest. Browser simulation packages like RSelenium will solve the problem.

Scrape dynamic data behind a chart using R

Is there a way in R to extract the historical stock quotes behind the chart on this website?
https://www.egx.com.eg/en/stocksdata.aspx?ISIN=EGS60322C012
I tried my best to figure out the headers and query from the Network tab in the Chrome browser, with no luck!
Thanks in advance!
I also tried this code, for reference:
library(httr)      # GET(), user_agent(), headers(), content()
library(jsonlite)  # fromJSON()
rl <- "https://www.egx.com.eg/content/charts/dcp_401410da-0-4.swf?guid=339b28e6-3e74-4477-badb-b49c52a851a5"
r <- GET(url = rl, user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"))
headers(r)
text <- content(r, as = "text", encoding = "UTF-8")
text
dput(text,"test.txt")
data <- fromJSON(text)
data

R - How to make a click on webpage using rvest or rcurl

I want to download data from this webpage
The data can be easily scraped with rvest.
The code may look like this:
library(rvest)
library(pipeR)
url <- "http://www.tradingeconomics.com/"
css <- "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1"
data <- url %>>%
html() %>>%
html_nodes(css) %>>%
html_table()
But there is a problem for webpages like this.
There is a + button to show the data of all the countries, but the default is just data of 50 countries.
So if I use the code, I can just scrape data of 50 countries.
The + button is made in javascript, so I want to know if there is a way in R to click the button and then scrape the data.
Sometimes it's better to attack the problem at the ajax web-request level. For this site, you can use Chrome's dev tools and watch the requests. To build the table (the whole table, too) it makes a POST to the site with various ajax-y parameters. Just replicate that, do a bit of data-munging of the response and you're good to go:
library(httr)
library(rvest)
library(dplyr)
res <- POST("http://www.tradingeconomics.com/",
encode="form",
user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.50 Safari/537.36"),
add_headers(`Referer`="http://www.tradingeconomics.com/",
`X-MicrosoftAjax`="Delta=true"),
body=list(
`ctl00$AjaxScriptManager1$ScriptManager1`="ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$UpdatePanel1|ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$LinkButton1",
`__EVENTTARGET`="ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$LinkButton1",
`srch-term`="",
`ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$GridView1$ctl01$DropDownListCountry`="top",
`ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$ParameterContinent`="",
`__ASYNCPOST`="false"))
res_t <- content(res, as="text")
# Drop the first line of the response, then rejoin the rest
res_h <- paste0(unlist(strsplit(res_t, "\r\n"))[-1], collapse="\n")
css <- "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1"
# html() is deprecated in favour of read_html()
tab <- read_html(res_h) %>%
html_nodes(css) %>%
html_table()
tab[[1]]$COUNTRIESWORLDAMERICAEUROPEASIAAUSTRALIAAFRICA
glimpse(tab[[1]])
Another alternative would have been to use RSelenium to go to the page, click the "+" and then scrape the resultant table.
