I am able to navigate to a password-protected website using rvest and session():
library(rvest)
library(httr)
url_login <- "https://www.myurl.com/Home/Login"
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
url_myProfile <- "https://www.myurl.com/?page=profile"
pgsession<-session(url_login, user_agent(uastring))
pgsession$url
pgform<-html_form(pgsession)[[1]]
filled_form<-html_form_set(pgform, loginID="xxx", pinCode="yyy")
mysess <- session_submit(pgsession, filled_form)
mysess <- session_jump_to(pgsession, url_myProfile)
Now I would like to view the exact html of where the session is at the moment, potentially by saving the html locally. Using the below, I see a page which seems to be correct, but without any images for example, so I was wondering if there was a better way to ensure the session is where I need it to be.
myhtml <- read_html(mysess)
write_xml(myhtml, "myprofile.html")
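One likely reason the saved copy shows no images is that the page refers to them with relative paths, which no longer resolve once the file is opened from disk. The sketch below is only an illustration under that assumption: the HTML snippet and image paths are hypothetical stand-ins for `read_html(mysess)`, and `base_url` is assumed to be the address the session is currently on. It rewrites each `src` to an absolute URL before saving.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for read_html(mysess): one relative and one absolute image.
myhtml <- minimal_html('
  <img src="/images/avatar.png">
  <img src="https://cdn.example.com/banner.png">
')

# Assumed address of the page the session is on (url_myProfile in the question).
base_url <- "https://www.myurl.com/?page=profile"

# Rewrite every <img src> to an absolute URL so the local copy can still load it.
imgs <- html_elements(myhtml, "img")
xml_attr(imgs, "src") <- url_absolute(html_attr(imgs, "src"), base_url)

# Inspect the rewritten sources, then save the modified document.
html_attr(imgs, "src")
write_html(myhtml, "myprofile.html")
```

To confirm where the session actually is, printing `mysess$url` (as done above with `pgsession$url`) is usually the quicker check; images behind the login may still fail to load in a browser that does not share the session cookies.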
I want to get artist data from the Spotify API in R using the spotifyr package.
First I scrape a page and save the data I need in data frames. Then I call get_artist_audio_features() from the spotifyr library to get the artist details, but every time I hit the same error.
My code is below (the first part can be ignored; you can run it as is to do the scraping and go straight to the last chunk of code):
library(rvest)
library(dplyr)
library(spotifyr)
upcoming_artists <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists)
upcoming_artists <- upcoming_artists %>%
html_elements("tbody") %>%
html_table() %>%
`[[`(1) %>%
tidyr::separate_rows(X2, sep = "\n")
upcoming_artists_india = upcoming_artists[33:35,]
# Preallocate an empty list with one slot per artist
upcoming_artists_india_list = vector("list", length = nrow(upcoming_artists_india))
for(i in 1:nrow(upcoming_artists_india))
{
upcoming_artists_india$X2[i] <- paste("'",upcoming_artists_india$X2[i],"'",sep="")
}
for(i in 1:nrow(upcoming_artists_india))
{
upcoming_artists_india_list[[i]] <- get_artist_audio_features(upcoming_artists_india$X2[i])
}
The error code which I'm getting is:
Request failed [429]
The error is always followed by a seemingly random wait time, so the full message looks something like:
Request failed [429]. Retrying in 59423 seconds...
The code has not run successfully even once, and I cannot make sense of the error.
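You have to send a user-agent with the request:
# read_html() itself has no user-agent argument, so set it with httr::GET() and parse the response
upcoming_artists <- read_html(httr::GET(upcoming_artists, httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36")))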
I am trying to scrape data off this website using the rvest package.
https://www.footballdb.com/games/index.html?lg=NFL&yr=2021
But when I run my code, I get an error that I don't recognize. I am unsure whether I am using the right HTML class.
Here is the HTML I see when I inspect the element:
And here is my code:
#Downloading data - 2021 schedule
library(rvest)
url <- "https://www.footballdb.com/games/index.html?lg=NFL&yr=2021"
data <- url %>%
html_nodes("statistics") %>%
html_table()
I got a 403 error when I ran your code. Most likely the default user-agent is rejected, or the website treats the rvest call as a bot.
So I used the httr package to set a user-agent, and the following works for your URL.
library(httr)
library(rvest)
tmp_user_agent<- 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
page_response <- GET(url, user_agent(tmp_user_agent))
df_lists<-page_response%>%
read_html() %>%
html_nodes(".statistics") %>% # classes are selected with a leading dot
html_table()
df_lists[[1]] # first table; likewise [[2]], [[3]], ...
But it's advisable to check whether the website allows scraping before doing anything large-scale or using its data commercially.
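To illustrate the selector difference mentioned in the code comment above, here is a small offline sketch. The table content is hypothetical and only mimics an element with `class="statistics"`; it is not the real footballdb markup.

```r
library(rvest)

# Hypothetical page with a single table carrying class="statistics".
page <- minimal_html('
  <table class="statistics">
    <tr><th>Team</th><th>Score</th></tr>
    <tr><td>A</td><td>21</td></tr>
  </table>
')

# Without the dot, the selector looks for <statistics> tags.
length(html_elements(page, "statistics"))

# With the dot, the selector looks for elements whose class is "statistics".
length(html_elements(page, ".statistics"))

# html_table() on the matched nodes returns a list with one data frame per table.
tables <- html_table(html_elements(page, ".statistics"))
tables[[1]]
```

The first count shows why the original selector returns nothing even once the page downloads correctly.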
I am fairly familiar with R, but have 0 experience with web scraping. I have looked around and cannot figure out why my web scraping is "failing." Here is my code, including the URL I want to scrape (the ngs-data-table, to be specific):
library(rvest)
webpage <- read_html("https://nextgenstats.nfl.com/stats/rushing/2020/REG/1#yards")
tbls <- html_nodes(webpage, xpath = '/html/body/div[2]/div[3]/main/div/div/div[3]')
#also attempted using this XPath '//*[@id="stats-rushing-view"]/div[3]' but neither worked
tbls
I am not receiving any errors with the code but I am receiving:
{xml_nodeset (0)}
I know this isn't a ton of code, I have tried multiple different xpaths as well. I know I will eventually need more code to be more specific for web-scraping, but I figured even the code above would at least begin to point me in the right direction? Any help would be appreciated. Thank you!
The data is stored as JSON. Here is a method to download and process that file.
library(httr)
#URL for week 6 data
url <- "https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=6"
#create a user agent
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
#download the information
# named "response" so the object does not share a name with httr::content()
response <- httr::GET(url, verbose(), user_agent(ua), add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))
answer <- jsonlite::fromJSON(content(response, as = "text"), flatten = TRUE)
answer$stats
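For readers unsure what `flatten = TRUE` does here, the following offline sketch may help. The JSON string and its field names (`rushYards`, `player`, `displayName`, `position`) are made up for illustration and are not the actual API response; the point is only that nested objects become dotted column names in the resulting data frame.

```r
library(jsonlite)

# Hypothetical payload: field names are invented, not taken from the real API.
txt <- '{
  "season": 2020,
  "stats": [
    {"rushYards": 112, "player": {"displayName": "Player A", "position": "RB"}},
    {"rushYards": 87,  "player": {"displayName": "Player B", "position": "RB"}}
  ]
}'

# flatten = TRUE turns the nested "player" object into separate columns.
answer <- fromJSON(txt, flatten = TRUE)

# The "stats" element is a plain data frame, one row per record.
names(answer$stats)
answer$stats
```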
The content of that table is generated dynamically: check it by saving the page from your browser (or, with your code, write_html(webpage,'test.html')) and then opening the saved file. So you probably can't capture it with rvest. Browser simulation packages like RSelenium will solve the problem.
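The same check can be done in code by counting the table elements in the HTML the server actually returned. The sketch below is offline and rests on an assumption: the snippet is a hypothetical example of what a dynamically rendered page might look like before its scripts run, namely an empty container plus a script tag.

```r
library(rvest)

# Hypothetical static HTML: an empty container that a script would fill in later.
static_page <- minimal_html('
  <div id="stats-rushing-view"></div>
  <script src="app.js"></script>
')

# The container may be present in the static HTML...
container <- html_elements(static_page, "#stats-rushing-view")
length(container)

# ...while the table inside it is absent until the browser runs the script.
length(html_elements(container, "table"))
```

If the table count is zero for the downloaded page but the table is visible in the browser, the data is being loaded separately, which is when a JSON endpoint or a browser-automation package is worth looking for.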
Is there a way in R to extract the historical stock quotes behind the chart on this website?
https://www.egx.com.eg/en/stocksdata.aspx?ISIN=EGS60322C012
I tried my best to figure out the headers and query from the Network tab in the Chrome browser, with no luck!
Thanks in advance!
I also tried this code, for reference:
library(httr)     # GET(), user_agent(), headers(), content()
library(jsonlite) # fromJSON()
rl <- "https://www.egx.com.eg/content/charts/dcp_401410da-0-4.swf?guid=339b28e6-3e74-4477-badb-b49c52a851a5"
r <- GET(url = rl, user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"))
headers(r)
text <- content(r, as = "text", encoding = "UTF-8")
text
dput(text,"test.txt")
data <- fromJSON(text)
data
I am trying to scrape the table from the treasury website.
https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldYear&year=2019
What I have so far fetches the page:
library("rvest")
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldAll"
data <- url %>%
read_html()
But I cannot seem to get it into a table format since I have a function.
data %>%
html_table()
It's better to first use a CSS selector to locate the node that contains the table. The table is big (around 7,400 rows); html_table() took about 30 seconds to process it.
library("rvest")
library(httr)
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldAll"
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
data <- session(url, user_agent(ua))
data %>%
html_node("table.t-chart") %>%
html_table()
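As a rough offline illustration of why selecting the node first helps, the sketch below uses a hypothetical page with two tables. Only the `t-chart` class comes from the answer above; the column names and values are invented.

```r
library(rvest)

# Hypothetical page: a layout table plus the data table with class "t-chart".
page <- minimal_html('
  <table class="layout"><tr><td>navigation</td></tr></table>
  <table class="t-chart">
    <tr><th>Date</th><th>1 Mo</th></tr>
    <tr><td>01/02/19</td><td>2.40</td></tr>
  </table>
')

# html_table() on the whole document parses every table and returns a list.
all_tables <- html_table(page)
length(all_tables)

# Selecting the node first parses only that table and returns one data frame.
rates <- page %>%
  html_element("table.t-chart") %>%
  html_table()
rates
```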