Scrape data table using xpath in R - r

I am fairly familiar with R but have no experience with web scraping. I have looked around and cannot figure out why my web scraping is "failing." Here is my code, including the URL I want to scrape (the ngs-data-table, to be specific):
library(rvest)
webpage <- read_html("https://nextgenstats.nfl.com/stats/rushing/2020/REG/1#yards")
tbls <- html_nodes(webpage, xpath = '/html/body/div[2]/div[3]/main/div/div/div[3]')
# also attempted this XPath: '//*[@id="stats-rushing-view"]/div[3]', but neither worked
tbls
I am not receiving any errors with the code but I am receiving:
{xml_nodeset (0)}
I know this isn't a lot of code, and I have tried several different XPaths as well. I know I will eventually need more specific code for web scraping, but I figured even the code above would at least point me in the right direction. Any help would be appreciated. Thank you!

The data is stored as JSON. Here is a method to download and process that file.
library(httr)
#URL for week 6 data
url <- "https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=6"
#create a user agent
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
#download the information
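# store the response under its own name so it does not shadow httr::content()
response <- httr::GET(url, verbose(), user_agent(ua), add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))
# parse the JSON body; flatten = TRUE expands nested fields into plain columns
answer <- jsonlite::fromJSON(content(response, as = "text", encoding = "UTF-8"), flatten = TRUE)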
answer$stats
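To see what the parsing step produces without touching the network, here is a minimal sketch that runs jsonlite::fromJSON() on a small inline JSON string. The field names (player, displayName, rushYards) are made up for illustration and are not taken from the real API response.

```r
library(jsonlite)

# Hypothetical, heavily shortened stand-in for the API response body
json_text <- '{
  "season": 2020,
  "stats": [
    {"player": {"displayName": "Player A"}, "rushYards": 120},
    {"player": {"displayName": "Player B"}, "rushYards": 95}
  ]
}'

# flatten = TRUE turns the nested "player" object into ordinary columns
answer <- fromJSON(json_text, flatten = TRUE)
stats <- answer$stats

# stats is a regular data frame with one row per player
str(stats)
```

With the real response, answer$stats should be usable in the same way as any other data frame.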

The content of that table is generated dynamically: check it by saving the page from your browser (or, with your code, write_html(webpage,'test.html')) and then opening the saved file. So you probably can't capture it with rvest. Browser simulation packages like RSelenium will solve the problem.
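The symptom in the question (no error, but an empty node set) can be reproduced offline. The sketch below uses a hypothetical page skeleton, standing in for the HTML a server might send before any JavaScript has run; the element names are assumptions, not the real markup of the site.

```r
library(rvest)

# Hypothetical static HTML: the table container exists but has not been filled in yet
static_page <- minimal_html('
  <div id="stats-rushing-view">
    <div>header</div>
    <div>filters</div>
    <div class="ngs-data-table"></div>
  </div>')

# Note the @id syntax for attribute tests in XPath
container <- html_elements(static_page, xpath = '//*[@id="stats-rushing-view"]/div[3]')

# The rows a browser would show are simply not in the static document
rows <- html_elements(container, "tr")

length(container)
length(rows)
```

A query that matches nothing returns an empty node set rather than raising an error, which is why the original code appeared to run cleanly.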

Related

spotifyr web scraping: "Request Failed [429]" error

I want to get artist data from the Spotify API in R using the spotifyr package.
First I scrape a list of artists and save the relevant rows in a data frame. Then I call the get_artist_audio_features() function from the spotifyr library to get artist details, but every time I encounter the same error.
My code is below (the first part can be ignored; you can run it as is to do the scraping and proceed directly to the last chunk of code):
library(rvest)
library(dplyr)
library(spotifyr)
upcoming_artists <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists)
upcoming_artists <- upcoming_artists %>%
html_elements("tbody") %>%
html_table() %>%
`[[`(1) %>%
tidyr::separate_rows(X2, sep = "\n")
upcoming_artists_india = upcoming_artists[33:35,]
upcoming_artists_india_list = vector("list", nrow(upcoming_artists_india)) # preallocate one slot per artist
for(i in 1:nrow(upcoming_artists_india))
{
upcoming_artists_india$X2[i] <- paste("'",upcoming_artists_india$X2[i],"'",sep="")
}
for(i in 1:nrow(upcoming_artists_india))
{
upcoming_artists_india_list[[i]] <- get_artist_audio_features(upcoming_artists_india$X2[i])
}
The error code which I'm getting is:
Request failed [429]
This error is always followed by some random time and the entire error may look something like:
Request failed [429]. Retrying in 59423 seconds...
The code hasn't run a single time and I'm unable to comprehend the error.
You have to pass a user agent as a parameter:
upcoming_artists <- read_html(upcoming_artists,user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36")

rvest visualize current session html page

I am able to navigate to a password protected website using rvest and session()
library(rvest)
library(httr)
url_login <- "https://www.myurl.com/Home/Login"
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
url_myProfile <- "https://www.myurl.com/?page=profile"
pgsession<-session(url_login, user_agent(uastring))
pgsession$url
pgform<-html_form(pgsession)[[1]]
filled_form<-html_form_set(pgform, loginID="xxx", pinCode="yyy")
mysess <- session_submit(pgsession, filled_form)
mysess <- session_jump_to(pgsession, url_myProfile)
Now I would like to view the exact HTML of where the session currently is, potentially by saving the HTML locally. Using the code below, I see a page that seems to be correct but lacks images, for example, so I was wondering whether there is a better way to confirm that the session is where I need it to be.
myhtml <- read_html(mysess)
write_xml(myhtml, "myprofile.html")
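One possible reason the saved page shows no images (an assumption here, not something the question confirms) is that the HTML refers to them with relative paths, which no longer resolve once the file is opened from disk. A minimal sketch of rewriting such paths to absolute URLs before saving, using a hypothetical two-image page in place of the real session:

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for read_html(mysess); the real page would come from the session
myhtml <- minimal_html(
  '<img src="/images/logo.png"><img src="https://cdn.example.com/a.png">'
)

# Assumed base URL: the address the session is currently at
base_url <- "https://www.myurl.com/?page=profile"

# Resolve every image source against the base URL; absolute URLs stay unchanged
imgs <- html_elements(myhtml, "img")
xml_attr(imgs, "src") <- url_absolute(xml_attr(imgs, "src"), base_url)

srcs <- xml_attr(imgs, "src")
# write_xml(myhtml, "myprofile.html") would now save the rewritten page
```

Stylesheets and scripts referenced with relative paths would need the same treatment on their href and src attributes.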

Issues with scraping data off a website using R

I am trying to scrape data off this website using the rvest package.
https://www.footballdb.com/games/index.html?lg=NFL&yr=2021
But when I run my code, I get an error that I don't recognize. I am unsure whether I am using the right HTML class.
Here is the HTML I see when I inspect the element.
And here is my code:
#Downloading data - 2021 schedule
library(rvest)
url <- "https://www.footballdb.com/games/index.html?lg=NFL&yr=2021"
data <- url %>%
html_nodes("statistics") %>%
html_table()
I got a 403 error when I ran your code. Most likely the site rejects the default user agent or thinks the rvest call is a bot.
So I used the httr package to set a user agent, and the following works for your URL.
library(httr)
library(rvest)
tmp_user_agent<- 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
page_response <- GET(url, user_agent(tmp_user_agent))
df_lists <- page_response %>%
read_html() %>%
html_nodes(".statistics") %>% # classes are selected with a leading dot
html_table()
df_lists[[1]] # likewise [[2]], [[3]], [[4]], ...
It is advisable, though, to check whether the website allows scraping before doing anything large scale or using the data commercially.
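The difference between the selector in the question ("statistics") and the one in this answer (".statistics") can be checked offline. This is a small sketch on a made-up one-row table, not the real footballdb markup:

```r
library(rvest)

# Hypothetical page with a single table carrying the class "statistics"
page <- minimal_html('
  <table class="statistics">
    <tr><th>Team</th><th>Score</th></tr>
    <tr><td>A</td><td>21</td></tr>
  </table>')

# Without the dot, the selector looks for a <statistics> tag and finds nothing
by_tag <- html_elements(page, "statistics")

# With the dot, it matches elements whose class is "statistics"
by_class <- html_elements(page, ".statistics")

# html_table() on a node set returns a list with one data frame per table
tables <- html_table(by_class)
tables[[1]]
```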

R XML Parse for a web address

I am trying to download weather data, similar to the question asked here: How to parse XML to R data frame
but when I run the first line in the example, I get "Error: 1: failed to load HTTP resource". I've checked that the URL is valid. Here is the line I'm referring to:
data <- xmlParse("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
I've managed to find a workaround with the following, but would like to understand why the first line didn't work.
testfile <- "G:/Self Improvement/R Working Directory/test.xml"
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url, testfile, mode="wb") # get data into test
data <- xmlParse(testfile)
Appreciate any insights.
You can download the file by setting a UserAgent as follows:
require(httr)
UA <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
my_url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
doc <- GET(my_url, user_agent(UA))
Now have a look at content(doc, "text") to confirm that it is the same file you see in the browser.
Then you can parse it via XML or xml2. I find xml2 easier but that is just my taste. Both work.
data <- XML::xmlParse(content(doc, "text"))
data2 <- xml2::read_xml(content(doc, "text"))
Why do I have to use a user agent?
From the RCurl FAQ: http://www.omegahat.org/RCurl/FAQ.html
Why doesn't RCurl provide a default value for the useragent that some sites require?
This is a matter of philosophy. Firstly, libcurl doesn't specify a default value and it is a framework for others to build applications. Similarly, RCurl is a general framework for R programmers to create applications to make "Web" requests. Accordingly, we don't set the user agent either. We expect the R programmer to do this. R programmers using RCurl in an R package to make requests to a site should use the package name (and also the version of R) as the user agent and specify this in all requests.
Basically, we expect others to specify a meaningful value for useragent so that they identify themselves correctly.
Note that users (not recommended for programmers) can set the R option named RCurlOptions via R's option() function. The value should be a list of named curl options. This is used in each RCurl request merging these values with those specified in the call. This allows one to provide default values.
I suspect that http://forecast.weather.gov/ rejects all requests without a user agent.
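Following the FAQ's advice to identify yourself, here is a minimal sketch of building a descriptive user agent with httr. The tool name "myscraper/0.1" is a hypothetical placeholder; no request is sent in this snippet.

```r
library(httr)

# Hypothetical tool name plus the running R version, as the FAQ suggests
ua_string <- paste0("myscraper/0.1 (R ", as.character(getRversion()), ")")

# user_agent() returns a request configuration object that GET() accepts
ua <- user_agent(ua_string)

# It would be used like this (not run here):
# doc <- GET(my_url, ua)
ua_string
```

Whether a given site accepts a descriptive string like this or only browser-like strings is something you would have to test against that site.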
I downloaded the URL to a text file first, then read the file's contents and parsed them as XML. Here is my code:
require(XML)
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
# Save the response locally; calling xmlParse(url) directly fails with the HTTP error from the question
download.file(url = url, destfile = "url.txt")
data <- xmlParse("url.txt")
xml_data <- xmlToList(data)
# Coordinates of the forecast point
location <- as.list(xml_data[["data"]][["location"]][["point"]])
# Collect every start-valid-time entry from the time layout
start_time <- unlist(xml_data[["data"]][["time-layout"]][
names(xml_data[["data"]][["time-layout"]]) == "start-valid-time"])
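The same extraction can be sketched with xml2 and XPath, which one of the other answers mentions as an alternative. The XML below is a tiny hand-written fragment that only imitates the element names used in the code above; the values are invented.

```r
library(xml2)

# Hypothetical, heavily shortened imitation of the forecast document
dwml <- read_xml('
<dwml>
  <data>
    <location><point latitude="29.80" longitude="-82.41"/></location>
    <time-layout>
      <start-valid-time>2024-01-01T00:00:00-05:00</start-valid-time>
      <start-valid-time>2024-01-01T01:00:00-05:00</start-valid-time>
    </time-layout>
  </data>
</dwml>')

# Coordinates are stored as attributes of the <point> element
point <- xml_find_first(dwml, "//location/point")
lat <- as.numeric(xml_attr(point, "latitude"))
lon <- as.numeric(xml_attr(point, "longitude"))

# One character value per <start-valid-time> element
start_time <- xml_text(xml_find_all(dwml, "//time-layout/start-valid-time"))
```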

R - How to make a click on webpage using rvest or rcurl

I want to download data from this webpage
The data can be easily scraped with rvest.
The code may look like this:
library(rvest)
library(pipeR)
url <- "http://www.tradingeconomics.com/"
css <- "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1"
data <- url %>>%
read_html() %>>%
html_nodes(css) %>>%
html_table()
But there is a problem with pages like this one.
There is a + button that shows the data for all countries, but by default only 50 countries are shown.
So with this code I can scrape data for only 50 countries.
The + button is implemented in JavaScript, so I want to know whether there is a way in R to click the button and then scrape the data.
Sometimes it's better to attack the problem at the ajax web-request level. For this site, you can use Chrome's dev tools and watch the requests. To build the table (the whole table, too) it makes a POST to the site with various ajax-y parameters. Just replicate that, do a bit of data-munging of the response and you're good to go:
library(httr)
library(rvest)
library(dplyr)
res <- POST("http://www.tradingeconomics.com/",
encode="form",
user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.50 Safari/537.36"),
add_headers(`Referer`="http://www.tradingeconomics.com/",
`X-MicrosoftAjax`="Delta=true"),
body=list(
`ctl00$AjaxScriptManager1$ScriptManager1`="ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$UpdatePanel1|ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$LinkButton1",
`__EVENTTARGET`="ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$LinkButton1",
`srch-term`="",
`ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$GridView1$ctl01$DropDownListCountry`="top",
`ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$ParameterContinent`="",
`__ASYNCPOST`="false"))
res_t <- content(res, as="text")
# Drop the first line of the AJAX response and rejoin the rest as HTML
res_h <- paste(unlist(strsplit(res_t, "\r\n"))[-1], collapse="\n")
css <- "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1"
tab <- read_html(res_h) %>%
html_nodes(css) %>%
html_table()
tab[[1]]$COUNTRIESWORLDAMERICAEUROPEASIAAUSTRALIAAFRICA
glimpse(tab[[1]])
Another alternative would have been to use RSelenium to go to the page, click the "+" and then scrape the resultant table.
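The data-munging step in the httr answer (drop the first line of the response, then parse the remainder as HTML) can be tried offline. The response text below is invented and far simpler than what the site returns; the id "grid" is a hypothetical placeholder for the long GridView id.

```r
library(rvest)

# Hypothetical AJAX response: one non-HTML header line followed by an HTML fragment
res_t <- paste0(
  "12|updatePanel|header\r\n",
  "<table id=\"grid\">\r\n",
  "<tr><th>Country</th></tr>\r\n",
  "<tr><td>A</td></tr>\r\n",
  "</table>"
)

# Split on CRLF, discard the first line, and glue the rest back together
response_lines <- unlist(strsplit(res_t, "\r\n"))
res_h <- paste(response_lines[-1], collapse = "\n")

# Parse the remaining fragment and pull out the table by id
tab <- read_html(res_h) %>%
  html_element("#grid") %>%
  html_table()
```

The real response may carry more than one leading non-HTML line, so it is worth printing the first few lines before deciding how many to drop.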
