I am trying to scrape data off this website using the rvest package.
https://www.footballdb.com/games/index.html?lg=NFL&yr=2021
But when I run my code, I get an error that I don't recognize. I am unsure whether I'm using the right HTML class.
Here is the HTML I see when I inspect the element:
And here is my code:
#Downloading data - 2021 schedule
library(rvest)
url <- "https://www.footballdb.com/games/index.html?lg=NFL&yr=2021"
data <- url %>%
html_nodes("statistics") %>%
html_table()
I got a 403 error when I used your code; most likely the default user agent is being rejected and the website thinks the rvest call is a bot.
Hence I used the httr package to set up a user agent, and the following works for your URL:
library(httr)
library(rvest)

url <- "https://www.footballdb.com/games/index.html?lg=NFL&yr=2021"

# Pretend to be a regular browser so the site doesn't block the request
tmp_user_agent <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9"

page_response <- GET(url, user_agent(tmp_user_agent))

df_lists <- page_response %>%
  read_html() %>%
  html_nodes(".statistics") %>%  # classes are queried with a leading dot
  html_table()

df_lists[[1]]  # first table; the others are df_lists[[2]], df_lists[[3]], ...
But it's advisable to check whether the website allows scraping before doing anything large-scale or using their data commercially.
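For example, the robotstxt package can check a site's robots.txt before you scrape (a minimal sketch; whether this particular path is allowed, and the site's terms of use, are still yours to verify):

library(robotstxt)

# TRUE if the site's robots.txt allows a generic bot ("*") to fetch this path
paths_allowed("https://www.footballdb.com/games/index.html?lg=NFL&yr=2021")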
I've come up with a partial solution to my question, but still need help getting to the end.
My issue is that I can no longer get JSON from a website using R, but I can still access it via my browser:
library(rvest)
library(httr)
library(jsonlite)
library(dplyr)
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
fromJSON(website)
Now gives me:
Error in open.connection(con, "rb") : SSL certificate problem: certificate has expired

But I'm still able to visit the site in Chrome.
Ideally I'd like to find a way to get this working using fromJSON()
I don't entirely understand what's causing this error, so I tried a bunch of different solutions. I found that I could at least read the html using this:
website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'
doc <- website %>%
  httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
  read_html()
However, from here I'm stuck: I'm struggling to parse doc into the JSON I need. I've tried things like doc %>% xmlParse() %>% xmlToList() %>% toJSON() %>% fromJSON(), but the result comes out as gibberish.
So my question comes down to: 1) is there a way to get around the SSL certificate problem so that I can use fromJSON() directly again? And 2) if not, how can I convert the HTML document into usable JSON?
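Something like the sketch below is what I'm aiming for, pulling the JSON text straight out of the httr response instead of going through read_html() (assuming the endpoint still returns plain JSON and that skipping certificate verification is acceptable here):

library(httr)
library(jsonlite)

website <- 'http://api.draftkings.com/sites/US-DK/sports/v1/sports?format=json'

# Fetch the raw response without SSL verification, then parse the body text as JSON
resp <- GET(website, config(ssl_verifypeer = FALSE))
sports <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(sports, max.level = 1)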
I am fairly familiar with R but have zero experience with web scraping. I have looked around and cannot figure out why my web scraping is "failing." Here is my code, including the URL I want to scrape (the ngs-data-table specifically):
library(rvest)
webpage <- read_html("https://nextgenstats.nfl.com/stats/rushing/2020/REG/1#yards")
tbls <- html_nodes(webpage, xpath = '/html/body/div[2]/div[3]/main/div/div/div[3]')
# also attempted the XPath '//*[@id="stats-rushing-view"]/div[3]', but neither worked
tbls
I am not receiving any errors with the code but I am receiving:
{xml_nodeset (0)}
I know this isn't a ton of code; I have tried multiple different XPaths as well. I know I will eventually need more specific code for the actual scraping, but I figured even the code above would at least point me in the right direction. Any help would be appreciated. Thank you!
The data is stored as JSON. Here is a method to download and process that file.
library(httr)

# URL for week 6 data (adjust the week parameter for other weeks)
url <- "https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=6"

# create a user agent
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"

# download the information (the site expects a browser-like user agent and a Referer header)
resp <- httr::GET(url, verbose(), user_agent(ua),
                  add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))

# parse the JSON body into a data frame
answer <- jsonlite::fromJSON(content(resp, as = "text"), flatten = TRUE)
answer$stats
The content of that table is generated dynamically: you can check this by saving the page from your browser (or, with your code, write_html(webpage, 'test.html')) and then opening the saved file; the table will be missing. So you probably can't capture it with rvest. Browser-simulation packages like RSelenium will solve the problem.
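For what it's worth, a minimal RSelenium sketch of that approach (assuming a local browser can be driven via rsDriver(); the generic "table" selector and the sleep time may need adjusting):

library(RSelenium)
library(rvest)

# Start a browser session (starts a local Selenium server)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

remDr$navigate("https://nextgenstats.nfl.com/stats/rushing/2020/REG/1#yards")
Sys.sleep(5)  # give the JavaScript time to render the table

# Hand the rendered page source to rvest
page <- read_html(remDr$getPageSource()[[1]])
tbls <- page %>% html_nodes("table") %>% html_table()

remDr$close()
driver$server$stop()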
I'm trying to scrape data from a webpage, but I get a 404 error for the URLs below. However, the 404 link contains data I need, which I can see in the browser. Here's an example:
library(tidyverse)
library(rvest)
url <- "http://www.uscho.com/scoreboard/division-i-men/20172018/composite-schedule/"
link_list <- url %>%
read_html() %>%
html_nodes("td:nth-child(13) a") %>%
html_attr("href") %>%
{paste0("http://www.uscho.com", .)}
Now, for example, open the 200th link here (http://www.uscho.com/recaplink.php?gid=1_970_20172018) in your web browser. You'll get this:
I don't actually want the 404 error page; what I want is the URL in the address bar, which (after some manipulation) I can use to get the actual page I'm after ("https://www.uscho.com/recaps/?p=171810970").
This URL, however, doesn't show up in R. Running read_html(link_list[200]), I only get a 404 error.
Any idea how I can get the URL from the browser within R?
To get a page's canonical URL within R using rvest, you can look in the page's metadata:
library(rvest)
library(tidyverse)
url <- "https://stackoverflow.com/questions/50555460/scrape-data-in-url-from-404-error-scrape"
url %>%
read_html() %>%
  html_nodes(xpath = '//meta[@property="og:url"]') %>%
html_attr('content')
#[1] "https://stackoverflow.com/questions/50555460/scrape-data-in-url-from-404-error-scrape"
However, this will not suffice for your case. I think it would be better for you to use RSelenium to scrape the data dynamically. It might be slower, but it is most certainly a solution to your problem. You can check out this tutorial on how to do so.
EDIT:
I'm not really experienced with splashr, but I do know that RSelenium differs from rvest: Selenium simulates a real browser session, while rvest simply issues HTTP requests, so rvest errors out when a 404 is returned. Selenium can instead just wait (for example with setImplicitWaitTimeout()) while the page redirects, and you can then capture the URL it lands on with remoteDriver$getCurrentUrl().
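A rough sketch of that approach, assuming a Selenium session can be started with rsDriver() and that link_list comes from the question's code:

library(RSelenium)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

# Wait up to 10 seconds for the page instead of failing immediately
remDr$setImplicitWaitTimeout(milliseconds = 10000)

remDr$navigate(link_list[200])           # the recaplink.php URL that 404s in rvest
final_url <- remDr$getCurrentUrl()[[1]]  # the URL the browser actually ended up on
final_url

remDr$close()
driver$server$stop()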
I am trying to download weather data, similar to the question asked here: How to parse XML to R data frame
but when I run the first line in the example, I get "Error: 1: failed to load HTTP resource". I've checked that the URL is valid. Here is the line I'm referring to:
data <- xmlParse("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
I've managed to find a workaround with the following, but I would like to understand why the first line didn't work.
testfile <- "G:/Self Improvement/R Working Directory/test.xml"
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url, testfile, mode="wb") # get data into test
data <- xmlParse(testfile)
Appreciate any insights.
You can download the file by setting a UserAgent as follows:
require(httr)
UA <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
my_url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
doc <- GET(my_url, user_agent(UA))
Now have a look at content(doc, "text") to confirm that it is the same file you see in the browser.
Then you can parse it via XML or xml2. I find xml2 easier but that is just my taste. Both work.
data <- XML::xmlParse(content(doc, "text"))
data2 <- xml2::read_xml(content(doc, "text"))
Why do I have to use a user agent?
From the RCurl FAQ: http://www.omegahat.org/RCurl/FAQ.html
Why doesn't RCurl provide a default value for the useragent that some sites require?
This is a matter of philosophy. Firstly, libcurl doesn't specify a default value and it is a framework for others to build applications. Similarly, RCurl is a general framework for R programmers to create applications to make "Web" requests. Accordingly, we don't set the user agent either. We expect the R programmer to do this. R programmers using RCurl in an R package to make requests to a site should use the package name (and also the version of R) as the user agent and specify this in all requests.
Basically, we expect others to specify a meaningful value for useragent so that they identify themselves correctly.
Note that users (not recommended for programmers) can set the R option named RCurlOptions via R's options() function. The value should be a list of named curl options. This is used in each RCurl request, merging these values with those specified in the call. This allows one to provide default values.
I suspect that http://forecast.weather.gov/ rejects all requests that don't set a user agent.
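For example, following that FAQ, a session-wide default user agent for RCurl-based requests can be set with options() (a sketch; the agent string is just an illustrative placeholder, and this affects RCurl calls such as RCurl::getURL(), not httr or xmlParse's built-in fetcher):

library(RCurl)
library(XML)

# Per the RCurl FAQ: a default merged into every RCurl request in this session
options(RCurlOptions = list(useragent = "my-weather-script/0.1"))

my_url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"

# Fetch with RCurl (which now sends the user agent) and parse the returned text as XML
doc <- xmlParse(getURL(my_url), asText = TRUE)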
I downloaded this URL to a text file, then read the file back in and parsed it as XML. Here is my code:
rm(list=ls())
require(XML)
require(xml2)
require(httr)
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url = url, destfile = "url.txt")
data <- xmlParse("url.txt")
xml_data <- xmlToList(data)
location <- as.list(xml_data[["data"]][["location"]][["point"]])
start_time <- unlist(xml_data[["data"]][["time-layout"]][
names(xml_data[["data"]][["time-layout"]]) == "start-valid-time"])
I want to download data from this webpage
The data can be easily scraped with rvest.
The code may look like this:
library(rvest)
library(pipeR)
url <- "http://www.tradingeconomics.com/"
css <- "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1"
data <- url %>>%
html() %>>%
html_nodes(css) %>>%
html_table()
But there is a problem for webpages like this.
There is a "+" button that shows the data for all countries, but by default only 50 countries are displayed.
So with the code above, I can only scrape the data for those 50 countries.
The "+" button is implemented in JavaScript, so I want to know whether there is a way in R to click the button and then scrape the full table.
Sometimes it's better to attack the problem at the ajax web-request level. For this site, you can use Chrome's dev tools and watch the requests. To build the table (the whole table, too) it makes a POST to the site with various ajax-y parameters. Just replicate that, do a bit of data-munging of the response and you're good to go:
library(httr)
library(rvest)
library(dplyr)
res <- POST("http://www.tradingeconomics.com/",
encode="form",
user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.50 Safari/537.36"),
add_headers(`Referer`="http://www.tradingeconomics.com/",
`X-MicrosoftAjax`="Delta=true"),
body=list(
`ctl00$AjaxScriptManager1$ScriptManager1`="ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$UpdatePanel1|ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$LinkButton1",
`__EVENTTARGET`="ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$LinkButton1",
`srch-term`="",
`ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$GridView1$ctl01$DropDownListCountry`="top",
`ctl00$ContentPlaceHolder1$defaultUC1$CurrencyMatrixAllCountries1$ParameterContinent`="",
`__ASYNCPOST`="false"))
res_t <- content(res, as="text")
res_h <- paste0(unlist(strsplit(res_t, "\r\n"))[-1], sep="", collapse="\n")
css <- "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1"
tab <- read_html(res_h) %>%
html_nodes(css) %>%
html_table()
tab[[1]]$COUNTRIESWORLDAMERICAEUROPEASIAAUSTRALIAAFRICA <- NULL  # drop the junk column created by the mashed-together header row
glimpse(tab[[1]])
Another alternative would have been to use RSelenium to go to the page, click the "+" and then scrape the resultant table.
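A rough sketch of that RSelenium alternative (the CSS id for the "+" button is a guess inferred from the form field names above and would need to be checked against the live page):

library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

remDr$navigate("http://www.tradingeconomics.com/")

# Hypothetical selector: locate and click the "+" expander, then let the table reload
plus_btn <- remDr$findElement(using = "css selector",
                              "#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_LinkButton1")
plus_btn$clickElement()
Sys.sleep(5)

# Scrape the expanded table from the rendered page source
tab <- read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes("#ctl00_ContentPlaceHolder1_defaultUC1_CurrencyMatrixAllCountries1_GridView1") %>%
  html_table()

remDr$close()
driver$server$stop()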