Scrape dynamic data behind a chart using R

Is there a way in R to extract the historical stock quotes behind the chart on this page:
https://www.egx.com.eg/en/stocksdata.aspx?ISIN=EGS60322C012
I did my best to figure out the headers and query string from the Network tab in the Chrome browser, with no luck.
Thanks in advance!
I also tried this code, for reference:
library(httr)
library(jsonlite)

# URL of the Flash (.swf) chart object found in the Network tab
rl <- "https://www.egx.com.eg/content/charts/dcp_401410da-0-4.swf?guid=339b28e6-3e74-4477-badb-b49c52a851a5"
r <- GET(url = rl, user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"))
headers(r)

# Read the response body as text and try to parse it as JSON
text <- content(r, as = "text", encoding = "UTF-8")
text
dput(text, "test.txt")
data <- fromJSON(text)
data
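For what it's worth, the pattern that usually works for chart data like this is to find the XHR request that returns JSON in the Network tab and replay it with httr. Below is a minimal sketch of that pattern; the endpoint path and query parameters are hypothetical placeholders, not the real EGX API, and have to be replaced with the request you see in the Network tab.

library(httr)
library(jsonlite)

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"

# Placeholder endpoint and parameters -- copy the real ones from the Network tab
resp <- GET(
  "https://www.egx.com.eg/some/chart/endpoint",
  query = list(ISIN = "EGS60322C012", period = "1Y"),
  user_agent(ua),
  add_headers(Referer = "https://www.egx.com.eg/en/stocksdata.aspx?ISIN=EGS60322C012")
)
stop_for_status(resp)
quotes <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(quotes)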

Related

spotifyr web scraping: "Request Failed [429]" error

I want to get artist data from the Spotify API using the spotifyr package in R.
First I scrape the data and save what I need in data frames; then, using the get_artist_audio_features() function from the spotifyr library, I try to get the artist details, but every time I run it I hit the same error.
My code is below (the first part can be ignored; you may run it as-is to do the scraping and proceed directly to the last chunk of code):
library(rvest)
library(dplyr)
library(spotifyr)

# Scrape the artist table from the Spotify newsroom post
upcoming_artists <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists)
upcoming_artists <- upcoming_artists %>%
  html_elements("tbody") %>%
  html_table() %>%
  `[[`(1) %>%
  tidyr::separate_rows(X2, sep = "\n")

# Keep only the Indian artists and pre-allocate a list for the results
upcoming_artists_india <- upcoming_artists[33:35, ]
upcoming_artists_india_list <- vector("list", nrow(upcoming_artists_india))

# Wrap each artist name in single quotes
for (i in 1:nrow(upcoming_artists_india)) {
  upcoming_artists_india$X2[i] <- paste("'", upcoming_artists_india$X2[i], "'", sep = "")
}

# Query the Spotify API for each artist's audio features
for (i in 1:nrow(upcoming_artists_india)) {
  upcoming_artists_india_list[[i]] <- get_artist_audio_features(upcoming_artists_india$X2[i])
}
The error I'm getting is:
Request failed [429]
The error is always followed by a seemingly random wait time, so the full message looks something like:
Request failed [429]. Retrying in 59423 seconds...
The code hasn't completed a single run and I'm unable to work out what is causing the error.
You have to pass a user agent as a parameter:
upcoming_artists <- read_html(upcoming_artists,user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36")
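If the 429 keeps coming back even with the user agent, it usually means the Spotify Web API itself is rate-limiting the calls. A minimal sketch of a workaround, assuming you have valid SPOTIFY_CLIENT_ID / SPOTIFY_CLIENT_SECRET credentials (the sleep time is an arbitrary choice):

# Sketch only: requires real Spotify API credentials in these env vars
Sys.setenv(SPOTIFY_CLIENT_ID = "xxx", SPOTIFY_CLIENT_SECRET = "yyy")
access_token <- get_spotify_access_token()

for (i in 1:nrow(upcoming_artists_india)) {
  upcoming_artists_india_list[[i]] <- tryCatch(
    get_artist_audio_features(upcoming_artists_india$X2[i]),
    error = function(e) NULL  # skip an artist instead of aborting the loop
  )
  Sys.sleep(5)  # throttle requests to stay under the rate limit
}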

rvest visualize current session html page

I am able to navigate to a password-protected website using rvest and session():
library(rvest)
library(httr)

url_login <- "https://www.myurl.com/Home/Login"
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
url_myProfile <- "https://www.myurl.com/?page=profile"

# Start a session with a browser-like user agent, fill in the login form, then jump to the profile page
pgsession <- session(url_login, user_agent(uastring))
pgsession$url
pgform <- html_form(pgsession)[[1]]
filled_form <- html_form_set(pgform, loginID = "xxx", pinCode = "yyy")
mysess <- session_submit(pgsession, filled_form)
mysess <- session_jump_to(pgsession, url_myProfile)
Now I would like to view the exact HTML of the page the session is on at the moment, potentially by saving the HTML locally. Using the code below, I see a page that seems to be correct, but without any images, for example, so I was wondering whether there is a better way to make sure the session is where I need it to be.
myhtml <- read_html(mysess)
write_xml(myhtml, "myprofile.html")
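One way to double-check where the session actually ended up (a sketch; assumes the session fields exposed by rvest 1.x) is to look at its URL and HTTP status before writing the file:

# Confirm the session followed the jump and the request succeeded
mysess$url                          # URL the session is currently on
httr::status_code(mysess$response)  # should be 200

# Images and CSS are separate requests, so the saved HTML will not embed them
myhtml <- read_html(mysess)
xml2::write_html(myhtml, "myprofile.html")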

Scrape data table using xpath in R

I am fairly familiar with R but have zero experience with web scraping. I have looked around and cannot figure out why my scraping is "failing." Here is my code, including the URL I want to scrape (the ngs-data-table specifically):
library(rvest)

webpage <- read_html("https://nextgenstats.nfl.com/stats/rushing/2020/REG/1#yards")
tbls <- html_nodes(webpage, xpath = '/html/body/div[2]/div[3]/main/div/div/div[3]')
# also attempted the XPath '//*[@id="stats-rushing-view"]/div[3]', but neither worked
tbls
I am not receiving any errors from the code, but I am getting:
{xml_nodeset (0)}
I know this isn't a ton of code, and I have tried multiple different XPaths as well. I know I will eventually need more code to do the actual scraping, but I figured even the code above would at least point me in the right direction? Any help would be appreciated. Thank you!
The data is stored as JSON. Here is a method to download and process that file.
library(httr)

# URL for week 6 data
url <- "https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=6"

# create a user agent
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"

# download the information
content <- httr::GET(url, verbose(), user_agent(ua), add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))
answer <- jsonlite::fromJSON(content(content, as = "text"), flatten = TRUE)
answer$stats
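If you need more than one week, here is a sketch that repeats the same request over the week parameter and stacks the results into one data frame (the endpoint and headers are taken from the answer above; the helper function and week range are just an example):

library(httr)
library(jsonlite)
library(dplyr)

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"

get_week <- function(week) {
  url <- paste0("https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=", week)
  res <- GET(url, user_agent(ua),
             add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))
  fromJSON(content(res, as = "text"), flatten = TRUE)$stats
}

# Weeks 1 to 6 of the 2020 regular season, stacked into one data frame
all_weeks <- bind_rows(lapply(1:6, get_week))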
The content of that table is generated dynamically: you can check this by saving the page from your browser (or, with your code, write_html(webpage, 'test.html')) and then opening the saved file. So you probably can't capture it with rvest alone. A browser-automation package such as RSelenium will solve the problem.
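For completeness, a minimal RSelenium sketch (assuming a working local Selenium/driver installation; the port, browser, wait time, and table selector are arbitrary choices) that lets the JavaScript render before handing the HTML to rvest:

library(RSelenium)
library(rvest)

# Start a browser session (requires a working Selenium/driver setup)
driver <- rsDriver(browser = "firefox", port = 4555L, verbose = FALSE)
remDr <- driver$client

remDr$navigate("https://nextgenstats.nfl.com/stats/rushing/2020/REG/1#yards")
Sys.sleep(5)  # give the JavaScript time to render the table

# The "table" selector is a guess and may need adjusting for this page
page <- read_html(remDr$getPageSource()[[1]])
tbl <- page %>% html_element("table") %>% html_table()

remDr$close()
driver$server$stop()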

Scraping a table from a website using rvest

I am trying to scrape the table from the treasury website.
https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldYear&year=2019
What I have so far collects the page:
library("rvest")
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldAll"
data <- url %>%
  read_html()
But I cannot seem to get it into a table format with the following:
data %>%
  html_table()
It's better to first use CSS to locate the node that contains the table. The table is big (around 7,400 rows), so it takes about 30 seconds to render with html_table().
library("rvest")
library(httr)
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldAll"
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
data <- html_session(url,user_agent(ua))
data %>%
html_node("table.t-chart") %>%
html_table()
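As a follow-up, the columns come back as text; here is a sketch of tidying them up, where the column names ("Date" as the first column, yields in the rest) are assumptions about what html_table() returns for this page:

yields <- data %>%
  html_node("table.t-chart") %>%
  html_table()

# Assumed column layout: first column is Date, the rest are yields in percent
yields$Date <- as.Date(yields$Date, format = "%m/%d/%y")
yields[, -1] <- lapply(yields[, -1], as.numeric)
head(yields)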

R XML Parse for a web address

I am trying to download weather data, similar to the question asked here: How to parse XML to R data frame.
But when I run the first line in that example, I get "Error: 1: failed to load HTTP resource". I've checked that the URL is valid. Here is the line I'm referring to:
data <- xmlParse("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
I've managed to find a workaround with the following, but would like to understand why the first line didn't work.
library(XML)

testfile <- "G:/Self Improvement/R Working Directory/test.xml"
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url, testfile, mode = "wb")  # get the data into the test file
data <- xmlParse(testfile)
Appreciate any insights.
You can download the file by setting a UserAgent as follows:
require(httr)
UA <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
my_url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
doc <- GET(my_url, user_agent(UA))
Now have a look at content(doc, "text") to see that it is the same file you see in the browser.
Then you can parse it via XML or xml2. I find xml2 easier, but that is just my taste; both work.
data <- XML::xmlParse(content(doc, "text"))
data2 <- xml2::read_xml(content(doc, "text"))
Why do I have to use a user agent?
From the RCurl FAQ: http://www.omegahat.org/RCurl/FAQ.html
Why doesn't RCurl provide a default value for the useragent that some sites require?
This is a matter of philosophy. Firstly, libcurl doesn't specify a default value and it is a framework for others to build applications. Similarly, RCurl is a general framework for R programmers to create applications to make "Web" requests. Accordingly, we don't set the user agent either. We expect the R programmer to do this. R programmers using RCurl in an R package to make requests to a site should use the package name (and also the version of R) as the user agent and specify this in all requests.
Basically, we expect others to specify a meaningful value for useragent so that they identify themselves correctly.
Note that users (not recommended for programmers) can set the R option named RCurlOptions via R's option() function. The value should be a list of named curl options. This is used in each RCurl request merging these values with those specified in the call. This allows one to provide default values.
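To illustrate that last point, a one-line sketch of setting a session-wide default user agent for RCurl (the string itself is just an example):

# Default user agent merged into every RCurl request in this session
options(RCurlOptions = list(useragent = "myRscript/1.0 (R)"))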
I suspect http://forecast.weather.gov/ rejects all requests without a user agent.
I downloaded this URL to a text file, then read the file's content and parsed it as XML data. Here is my code:
rm(list = ls())
require(XML)
require(xml2)
require(httr)

url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url = url, "url.txt")

# Parse the downloaded file and convert it to a list
data <- xmlParse("url.txt")
xml_data <- xmlToList(data)

location <- as.list(xml_data[["data"]][["location"]][["point"]])
start_time <- unlist(xml_data[["data"]][["time-layout"]][
  names(xml_data[["data"]][["time-layout"]]) == "start-valid-time"])
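To go from the parsed list to a data frame, here is a sketch continuing from xml_data and start_time above; the DWML element names are assumptions based on the question linked at the top and may need adjusting for your file.

# Sketch: first <temperature> block under <parameters>; DWML files typically
# contain several (dew point, hourly temperature, ...), so the index may differ
params <- xml_data[["data"]][["parameters"]]
temp_node <- params[names(params) == "temperature"][[1]]
temp_vals <- unlist(temp_node)
temperature <- as.numeric(temp_vals[names(temp_vals) == "value"])

# Lengths may differ if several time layouts are present
forecast <- data.frame(
  time = start_time[seq_along(temperature)],
  temperature = temperature
)
head(forecast)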
