R XML Parse for a web address

R XML Parse for a web address - r

I am trying to download weather data, similar to the question asked here: How to parse XML to R data frame
but when I run the first line in the example, I get "Error: 1: failed to load HTTP resource". I've checked that the URL is valid. Here is the line I'm referring to:
data <- xmlParse("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
I've managed to find a work around with the following, but would like to understand why the first line didn't work.
testfile <- "G:/Self Improvement/R Working Directory/test.xml"
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url, testfile, mode="wb") # get data into test
data <- xmlParse(testfile)
Appreciate any insights.

You can download the file by setting a UserAgent as follows:
require(httr)
UA <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
my_url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
doc <- GET(my_url, user_agent(UA))
Now have a look at content(doc, "text") to see that it is the file you see in the browser
Then you can parse it via XML or xml2. I find xml2 easier but that is just my taste. Both work.
data <- XML::xmlParse(content(doc, "text"))
data2 <- xml2::read_xml(content(doc, "text"))
Why do i have to use a user agent?
From the RCurl FAQ: http://www.omegahat.org/RCurl/FAQ.html
Why doesn't RCurl provide a default value for the useragent that some sites require?
This is a matter of philosophy. Firstly, libcurl doesn't specify a default value and it is a framework for others to build applications. Similarly, RCurl is a general framework for R programmers to create applications to make "Web" requests. Accordingly, we don't set the user agent either. We expect the R programmer to do this. R programmers using RCurl in an R package to make requests to a site should use the package name (and also the version of R) as the user agent and specify this in all requests.
Basically, we expect others to specify a meaningful value for useragent so that they identify themselves correctly.
Note that users (not recommended for programmers) can set the R option named RCurlOptions via R's option() function. The value should be a list of named curl options. This is used in each RCurl request merging these values with those specified in the call. This allows one to provide default values.
I suspect http://forecast.weather.gov/ to reject all requests without a UserAgent.

I downloaded this url to a text file. After that, I get the content of the file and parse it to XML data. Here is my code:
rm(list=ls())
require(XML)
require(xml2)
require(httr)
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url=url,"url.txt" )
xmlParse(url)
data <- xmlParse("url.txt")
xml_data <- xmlToList(data)
location <- as.list(xml_data[["data"]][["location"]][["point"]])
start_time <- unlist(xml_data[["data"]][["time-layout"]][
names(xml_data[["data"]][["time-layout"]]) == "start-valid-time"])

Related

Issues with scraping data off a website using R

I am trying to scrape data off this website using the rvest package.
https://www.footballdb.com/games/index.html?lg=NFL&yr=2021
But when I run my code, I'm getting an error that I don't recognize. I am unsure if I not using the right html class.
here is the html I see when I inspect element
And here is my code:
#Downloading data - 2021 schedule
library(rvest)
url <- "https://www.footballdb.com/games/index.html?lg=NFL&yr=2021"
data <- url %>%
html_nodes("statistics") %>%
html_table()

I got error 403 when used your code and its likely incorrect user-agent used or website thinks rvest call is a bot.
Hence used httr package to setup a user-agent and the following works for your url.
library(httr)
library(rvest)
tmp_user_agent<- 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
page_response <- GET(url, user_agent(tmp_user_agent))
df_lists<-page_response%>%
read_html() %>%
html_nodes(".statistics") %>% #classes are queries with dot
html_table()
df_lists[1] # 2,3,4.....
But its advisable to check if the website allows scraping before doing anything large scale or getting their data for any commercial use.

Scrape data table using xpath in R

I am fairly familiar with R, but have 0 experience with web scraping. I had looked around and cannot seem to figure out why my web scraping is "failing." Here is my code including the URL I want to scrape (the ngs-data-table to be specific):
library(rvest)
webpage <- read_html("https://nextgenstats.nfl.com/stats/rushing/2020/REG/1#yards")
tbls <- html_nodes(webpage, xpath = '/html/body/div[2]/div[3]/main/div/div/div[3]')
#also attempted using this Xpath '//*[#id="stats-rushing-view"]/div[3]' but neither worked
tbls
I am not receiving any errors with the code but I am receiving:
{xml_nodeset (0)}
I know this isn't a ton of code, I have tried multiple different xpaths as well. I know I will eventually need more code to be more specific for web-scraping, but I figured even the code above would at least begin to point me in the right direction? Any help would be appreciated. Thank you!

The data is stored as JSON. Here is a method to download and process that file.
library(httr)
#URL for week 6 data
url <- "https://nextgenstats.nfl.com/api/statboard/rushing?season=2020&seasonType=REG&week=6"
#create a user agent
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
#download the information
content <-httr::GET(url, verbose() , user_agent(ua), add_headers(Referer = "https://nextgenstats.nfl.com/stats/rushing/2020/REG/1"))
answer <-jsonlite::fromJSON(content(content, as = "text") ,flatten = TRUE)
answer$stats

The content of that table is generated dynamically: check it by saving the page from your browser (or, with your code, write_html(webpage,'test.html')) and then opening the saved file. So you probably can't capture it with rvest. Browser simulation packages like RSelenium will solve the problem.

Access sharepoint folders in R

I'm currently trying to access sharepoint folders in R. I read multiple articles addressing that issue but all the proposed solutions don't seem to work in my case.
I first tried to upload a single .txt file using the httr package, as follows:
URL <- "<domain>/<file>/<subfile>/document.txt"
r <- httr::GET(URL, httr::authenticate("username","password",type="any"))
I get the following error:
Error in curl::curl_fetch_memory(url, handle = handle) :
URL using bad/illegal format or missing URL
I then tried another package that use a similar syntax (RCurl):
URL <- "<domain>/<file>/<subfile>/document.txt"
r <- getURL(URL, userpwd = "username:password")
I get the following error:
Error in function (type, msg, asError = TRUE) :
I tried many other ways of linking R to sharepoint, but these two seemed the most straightforward. (also, my URL doesn't seem to be the problem since it works when I run it in my web browser).
Ultimately, I want to be able to upload a whole sharepoint folder to R (not only a single document). Something that would really help is to set my sharepoint folder as my working directory and use the base::list.files() function to list files in my folder, but I doubt thats possible.
Does anyone have a clue how I can do that?

I created an R library called sharepointr for doing just that.
What I basically did was:
Create App Registration
Add permissions
Get credentials
Make REST calls
The Readme.md for the repository has a full description, and here is an example:
# Install
install.packages("devtools")
devtools::install_github("esbeneickhardt/sharepointr")
# Parameters
client_id <- "insert_from_first_step"
client_secret <- "insert_from_first_step"
tenant_id <- "insert_from_fourth_step"
resource_id <- "insert_from_fourth_step"
site_domain <- "yourorganisation.sharepoint.com"
sharepoint_url <- "https://yourorganisation.sharepoint.com/sites/MyTestSite"
# Get Token
sharepoint_token <- get_sharepoint_token(client_id, client_secret, tenant_id, resource_id, site_domain)
# Get digest value
sharepoint_digest_value <- get_sharepoint_digest_value(sharepoint_token, sharepoint_url)
# List folders
sharepoint_path <- "Shared Documents/test"
get_sharepoint_folder_names(sharepoint_token, sharepoint_url, sharepoint_digest_value, sharepoint_path)

Using R to download data automatically

I want to download all the data in either pdf or excel for
each State X Crop Year X Standard Reports combination from this website.
I followed this tutorial to do what I want.
Download data from URL
However, I hit an error on the second line.
driver <- rsDriver()
Error in subprocess::spawn_process(tfile, ...) :
group termination: could not assign process to a job: Access is denied
Are there any alternative methods that I could use to download these data?

First, check robots.txt on the website if there is any. Then read the terms and conditions if there is any. And it is always important to throttle the request below.
After checking all the terms and conditions, the code below should get you started:
library(httr)
library(xml2)
link <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"
r <- GET(link)
doc <- read_html(content(r, "text"))
#write_html(doc, "temp.html")
states <- sapply(xml_find_all(doc, ".//select[#name='DdlState']/option"), function(x)
setNames(xml_attr(x, "value"), xml_text(x)))
states <- states[!grepl("^Select", names(states))]
years <- sapply(xml_find_all(doc, ".//select[#name='DdlYear']/option"), function(x)
setNames(xml_attr(x, "value"), xml_text(x)))
years <- years[!grepl("^Select", names(years))]
rptfmt <- sapply(xml_find_all(doc, ".//select[#name='DdlFormat']/option"), function(x)
setNames(xml_attr(x, "value"), xml_text(x)))
stdrpts <- unlist(lapply(xml_find_all(doc, ".//td/a"), function(x) {
id <- xml_attr(x, "id")
if (grepl("^TreeView1t", id)) return(setNames(id, xml_text(x)))
}))
get_vs <- function(doc) sapply(xml_find_all(doc, ".//input[#type='hidden']"), function(x)
setNames(xml_attr(x, "value"), xml_attr(x, "name")))
fmt <- rptfmt[2] #Excel format
for (sn in names(states)) {
for (yn in names(years)) {
for (srn in seq_along(stdrpts)) {
s <- states[sn]
y <- years[yn]
sr <- stdrpts[srn]
r <- POST(link,
body=as.list(c("__EVENTTARGET"="DdlState",
"__EVENTARGUMENT"="",
"__LASTFOCUS"="",
"TreeView1_ExpandState"="ennnn",
"TreeView1_SelectedNode"="",
"TreeView1_PopulateLog"="",
get_vs(doc),
DdlState=unname(s),
DdlYear=0,
DdlFormat=1)),
encode="form")
doc <- read_html(content(r, "text"))
treeview <- c("__EVENTTARGET"="TreeView1",
"__EVENTARGUMENT"=paste0("sStandard Reports\\", srn),
"__LASTFOCUS"="",
"TreeView1_ExpandState"="ennnn",
"TreeView1_SelectedNode"=unname(stdrpts[srn]),
"TreeView1_PopulateLog"="")
vs <- get_vs(doc)
ddl <- c(DdlState=unname(s), DdlYear=unname(y), DdlFormat=unname(fmt))
r <- POST(link, body=as.list(c(treeview, vs, ddl)), encode="form")
if (r$headers$`content-type`=="application/vnd.ms-excel")
writeBin(content(r, "raw"), paste0(sn, "_", yn, "_", names(stdrpts)[srn], ".xls"))
Sys.sleep(5)
}
}
}

Here is my best attempt:
If you look in the network activities you will see a post request is sent:
Request body data:
If you scroll down you will see the form data that is used.
body <- structure(list(`__EVENTTARGET` = "TreeView1", `__EVENTARGUMENT` = "sStandard+Reports%5C4",
`__LASTFOCUS` = "", TreeView1_ExpandState = "ennnn", TreeView1_SelectedNode = "TreeView1t4",
TreeView1_PopulateLog = "", `__VIEWSTATE` = "", `__VIEWSTATEGENERATOR` = "",
`__VIEWSTATEENCRYPTED` = "", `__EVENTVALIDATION` = "", DdlState = "35",
DdlYear = "2001", DdlFormat = "1"), .Names = c("__EVENTTARGET",
"__EVENTARGUMENT", "__LASTFOCUS", "TreeView1_ExpandState", "TreeView1_SelectedNode",
"TreeView1_PopulateLog", "__VIEWSTATE", "__VIEWSTATEGENERATOR",
"__VIEWSTATEENCRYPTED", "__EVENTVALIDATION", "DdlState", "DdlYear",
"DdlFormat"))
There are certain session related values:
attr_names <- c("__EVENTVALIDATION", "__VIEWSTATEGENERATOR", "__VIEWSTATE", "__VIEWSTATEENCRYPTED")
You could add them like this:
setAttrNames <- function(attr_name){
name <- doc %>%
html_nodes(xpath = glue("//*[#id = '{attr_name}']")) %>%
html_attr(name = "value")
body[[attr_name]] <<- name
}
Then you can add this session specific values:
library(rvest)
library(glue)
url <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"
doc <- url %>% GET %>% content("text") %>% read_html
sapply(attr_names, setAttrNames)
Sending the request:
Then you can send the request:
response <- POST(
url = url,
encode = "form",
body = body,
hdrs
)
response$status_code # still indicates that we have an error in the request.
Follow up ideas:
I checked for cookies. There is a session cookie, but it does not seem to be necessary for the request.
Adding headers.
Trying to set the request headers
header <- structure(c("aps.dac.gov.in", "keep-alive", "3437", "max-age=0",
"https://aps.dac.gov.in", "1", "application/x-www-form-urlencoded",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
"?1", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"same-origin", "navigate", "https://aps.dac.gov.in/LUS/Public/Reports.aspx",
"gzip, deflate, br", "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"), .Names = c("Host",
"Connection", "Content-Length", "Cache-Control", "Origin", "Upgrade-Insecure-Requests",
"Content-Type", "User-Agent", "Sec-Fetch-User", "Accept", "Sec-Fetch-Site",
"Sec-Fetch-Mode", "Referer", "Accept-Encoding", "Accept-Language"
))
hdrs <- header %>% add_headers
response <- POST(
url = url,
encode = "form",
body = body,
hdrs
)
But i get a timeout for this request.
Note: The site does not seem to have a robots.txt. But check the Terms and Conditions of the site.

I tried running these 2 lines myself at work and got somewhat a more explicit error message than you.
Could not open chrome browser.
Client error message:
Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
Further Details: run errorDetails method
Check server log for further details.
It might be because if you are at work without admin privileges, R can't create a child process.
As a matter of fact I used to run into absolutely awful problems myself trying to build a bot using RSelenium. rsDriver() was not consistent at all and kept crashing. I had to include it in a loop with error catching in order to keep it running, but then I had to find out and delete gigabytes of temp files manually.
I tried to install Docker and spent a lot of time doing the setup but finally it wasn't supported on my Windows non-professional edition.
Solution: Selenium from Python is very well documented, never crashes, works like a charm. Coding in the interactive Spyder editor from Anaconda feels almost like R.
And of course you can use something like system("python myscript.py") from R in order to get the process started and the resulting files back into R if you wish so.
EDIT: No admin privileges are required at all for Anaconda or Selenium. I run it myself without any problem from work. If you have trouble with pip install commands being SSL-blocked like me you can bypass it using the --trusted-host argument.

Selenium is useful when you must run the javascript on a webpage. For websites that don't require the javascript to be run (i.e. if the information you're after is contained within the webpage HTML), rvest or httr are your best bets.
In your case though, to download a file, simply use download.file(), which is a function in base R.
The website in your question is currently down (so I can't see it), but here's an example using a random file from another website
download.file("https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf", "mygreatfile.pdf")
To check that it worked
dir()
# [1] "mygreatfile.pdf"
Depending on how the website is structured, you may be able to obtain a list of the file urls, then loop through them in R downloading one after another.
Lastly, an extra tip. Depending on the file type, and what you're doing with them, you may be able to read them directly into R (instead of saving them first). For example read.csv() works with a url to directly read the csv from the web. Other read functions may be able to do the same.
Update
I currently see an internal 500 error when I visit the site, but I can see the site via the wayback machine, so I can see there is indeed javascript on the webpage. When the site is back up and running, I will attempt to download the files

How to download an .xlsx file from a dropbox (https:) location

I'm trying to adopt the Reproducible Research paradigm but meet people who like looking at Excel rather than text data files half way, by using Dropbox to host Excel files which I can then access using the .xlsx package.
Rather like downloading and unpacking a zipped file I assumed something like the following would work:
# Prerequisites
require("xlsx")
require("ggplot2")
require("repmis")
require("devtools")
require("RCurl")
# Downloading data from Dropbox location
link <- paste0(
"https://www.dropbox.com/s/",
"{THE SHA-1 KEY}",
"{THE FILE NAME}"
)
url <- getURL(link)
temp <- tempfile()
download.file(url, temp)
However, I get Error in download.file(url, temp) : unsupported URL scheme
Is there an alternative to download.file that will accept this URL scheme?
Thanks,
Jon

You have the wrong URL - the one you are using just goes to the landing page. I think the actual download URL is different, I managed to get it sort of working using the below.
I actually don't think you need to use RCurl or the getURL() function, and I think you were leaving out some relatively important /'s in your previous formulation.
Try the following:
link <- paste("https://dl.dropboxusercontent.com/s",
"{THE SHA-1 KEY}",
"{THE FILE NAME}",
sep="/")
download.file(url=link,destfile="your.destination.xlsx")
closeAllConnections()

UPDATE:
I just realised there is a source_XlsxData function in the repmis package, which in theory should do the job perfectly.
Also the function below works some of the time but not others, and appears to get stuck at the GET line. So, a better solution would be very welcome.
I decided to try taking a step back and figure out how to download a raw file from a secure (https) url. I adapted (butchered?) the source_url function in devtools to produce the following:
download_file_url <- function (
url,
outfile,
..., sha1 = NULL)
{
require(RCurl)
require(devtools)
require(repmis)
require(httr)
require(digest)
stopifnot(is.character(url), length(url) == 1)
filetag <- file(outfile, "wb")
request <- GET(url)
stop_for_status(request)
writeBin(content(request, type = "raw"), filetag)
close(filetag)
}
This seems to work for producing local versions of binary files - Excel included. Nicer, neater, smarter improvements in this gratefully received.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R XML Parse for a web address - r

Related

Issues with scraping data off a website using R

Scrape data table using xpath in R

Access sharepoint folders in R

Using R to download data automatically

How to download an .xlsx file from a dropbox (https:) location

Categories

Resources