I'm trying to get a data table off of a website using the RCurl package. My code works successfully for the URL that you get to by clicking through the website:
http://statsheet.com/mcb/teams/air-force/game_stats/
Once I try to select previous years (which is what I want), my code no longer works.
Example link:
http://statsheet.com/mcb/teams/air-force/game_stats?season=2012-2013
I'm guessing this has something to do with the reserved symbol(s) in the year-specific address. I've tried URLencode as well as manually encoding the address, but that hasn't worked either.
My code:
library(RCurl)
library(XML)
#Define URL
theurl <- URLencode("http://statsheet.com/mcb/teams/air-force/game_stats?season=2012-2013", reserved=TRUE)
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[1]/thead[1]/tr[2]/th", xmlValue)
results <- xpathSApply(pagetree,"//*/table[1]/tbody/tr/td", xmlValue)
content <- as.data.frame(matrix(results, ncol = 19, byrow = TRUE))
testtablehead <- c("W/L","Opponent",tablehead[c(2:18)])
names(content) <- testtablehead
The relevant error that R returns:
Error in function (type, msg, asError = TRUE) :
Could not resolve host: http%3a%2f%2fstatsheet.com%2fmcb%2fteams%2fair-force%2fgame_stats%3fseason%3d2012-2013; No data record of requested type
Does anyone have an idea what the problem is and how to fix it?
Skip the unneeded encoding and the separate download step; htmlTreeParse can take the URL directly:
library(XML)
url <- "http://statsheet.com/mcb/teams/air-force/game_stats?season=2012-2013"
pagetree <- htmlTreeParse(url, useInternalNodes = TRUE)
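From there, the table extraction from the question should work unchanged, for example (reusing the XPath expressions and column layout from the question):
tablehead <- xpathSApply(pagetree, "//*/table[1]/thead[1]/tr[2]/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[1]/tbody/tr/td", xmlValue)
content <- as.data.frame(matrix(results, ncol = 19, byrow = TRUE))
names(content) <- c("W/L", "Opponent", tablehead[2:18])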
Related
I am trying to download an Excel file from a SharePoint site into an R data frame, but I keep getting errors. I have searched forums for solutions to this issue; although I found many suggestions, none of them worked for me. Here is what I have tried so far.
METHOD 1:
library(readxl)
library(httr)
url1 <- 'http://<companyname>.sharepoint.com/sites/<sitename>/Shared%20Documents/General/TRACKERS/<FolderName>/<TrackerName>.xlsx?d=wbae96ce171e14926863e453a8bec146a?Web=0'
GET(url1,write_disk(tf <- tempfile(fileext = ".xlsx")))
df <- read_excel(tf,sheet = "sheetname")
OUTPUT 1:
GET(url1,write_disk(tf <- tempfile(fileext = ".xlsx")))
Response ['https://<companyname>.sharepoint.com/sites/<sitename>/Shared%20Documents/General/TRACKERS/<FolderName>/<TrackerName>.xlsx?d=wbae96ce171e14926863e453a8bec146a?Web=0']
Date: 2020-08-06 08:43
Status: 400
Content-Type: text/html; charset=us-ascii
Size: 311 B
<ON DISK> C:\Users\<username>\AppData\Local\Temp\RtmpuM3YpD\file2c646e4c5d50.xlsx
df <- read_excel(tf,sheet = "sheetname")
Error: Evaluation error: zip file 'C:\Users\<username>\AppData\Local\Temp\RtmpuM3YpD\file2c646e4c5d50.xlsx' cannot be opened.
Please note that I added "?Web=0" at the end of the URL to make the xlsx file download directly.
METHOD 2:
url1 <- 'http://<companyname>.sharepoint.com/sites/<sitename>/Shared%20Documents/General/TRACKERS/<FolderName>/<TrackerName>.xlsx?d=wbae96ce171e14926863e453a8bec146a?Web=0'
destfile <- "C:/Users/<username> /Downloads/<TrackerName>.xlsx"
download.file(url = url1,destfile = destfile)
df <- read_excel(destfile,sheet = "sheetname")
OUTPUT 2:
trying URL …
cannot open URL …
HTTP status was '403 FORBIDDEN'
Error in download.file(url = url1, destfile = destfile) :
cannot open URL …
METHOD 3:
url1 <- 'http://<companyname>.sharepoint.com/sites/<sitename>/Shared%20Documents/General/TRACKERS/<FolderName>/<TrackerName>.xlsx?d=wbae96ce171e14926863e453a8bec146a?Web=0'
GET(url1,authenticate("<myusername>","<mypassword>", type = "any"),write_disk(tf <- tempfile(fileext = ".xls")))
df <- read_excel(tf,sheet = "sheetname")
OUTPUT 3:
GET(url1,authenticate("<myusername>","<mypassword>", type = "any"),write_disk(tf <- tempfile(fileext = ".xls")))
Response ['https://<companyname>.sharepoint.com/sites/<sitename>/Shared%20Documents/General/TRACKERS/<FolderName>/<TrackerName>.xlsx?d=wbae96ce171e14926863e453a8bec146a?Web=0']
Date: 2020-08-06 09:04
Status: 400
Content-Type: text/html; charset=us-ascii
Size: 311 B
<ON DISK> C:\Users\<username>\AppData\Local\Temp\RtmpuM3YpD\file2c6456bd6d20.xlsx
df <- read_excel(tf,sheet = "sheetname")
Error: Evaluation error: zip file 'C:\Users\<username>\AppData\Local\Temp\RtmpuM3YpD\file2c6456bd6d20.xlsx' cannot be opened.
Of course, initially I tried reading the Excel file from SharePoint directly (Method 4 below), but that didn't work. Then I tried the methods above, first downloading the Excel file and then importing it into a data frame.
METHOD 4:
library(xlsx)  # read.xlsx() with file/sheetName arguments comes from the xlsx package
url1 <- 'http://<companyname>.sharepoint.com/sites/<sitename>/Shared%20Documents/General/TRACKERS/<FolderName>/<TrackerName>.xlsx?d=wbae96ce171e14926863e453a8bec146a?Web=0'
df <- read.xlsx(file = url1, sheetName = "sheetname")
OUTPUT 4:
Error in loadWorkbook(file, password = password) :
Cannot find <url> …
I encountered the same issue as you did, and I suspected there was a problem with the URL. So instead of copying the URL directly from the browser's address bar, I did this:
1. Find your file in the SharePoint site and click the "Show actions" (three dots) button for your file.
2. Click "Details", then find "Path" at the end of the details pane.
3. Click the "Copy" icon.
Finally, use the copied path as the URL in METHOD 1:
GET(url1,write_disk(tf <- tempfile(fileext = ".xlsx")))
readxl::read_excel(tf, sheet = "Sheet 1")
This works for me. Hope it's useful to you.
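Putting the steps together, a minimal self-contained sketch (the URL below is a placeholder for the path copied from the "Details" pane, and "Sheet 1" is just an example sheet name):
library(httr)
library(readxl)

# Placeholder: paste the path copied from SharePoint's "Details" pane here
url1 <- "https://<companyname>.sharepoint.com/sites/<sitename>/Shared%20Documents/General/TRACKERS/<FolderName>/<TrackerName>.xlsx"

# Download to a temp file, then read it
GET(url1, write_disk(tf <- tempfile(fileext = ".xlsx"), overwrite = TRUE))
df <- read_excel(tf, sheet = "Sheet 1")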
Is there a way to do error handling with the read_excel function in R when the file exists but can't be read for some other reason (e.g., the wrong format)?
Just to illustrate, my piece of code is as follows:
f <- GET(url, authenticate(":", ":", type="ntlm"), write_disk(tf <- tempfile(tmpdir = here("data/temp"), fileext = ".xlsx")))
dt <- read_excel(tf)
where url contains the http file address.
I would like to check whether read_excel returns an error so that I can handle it properly and keep the R Markdown document from stopping.
Thanks in advance!
Looks like a duplicate question. The code below is modified from the answer found here. The SomeOtherFunction() in the code below is a placeholder for whatever function you want to run if there is an error.
f <- GET(url, authenticate(":", ":", type="ntlm"), write_disk(tf <- tempfile(tmpdir = here("data/temp"), fileext = ".xlsx")))
t <- try(read_excel(tf))
if("try-error" %in% class(t)) SomeOtherFunction()
I want to download a number of Excel files from a website. The website requires a username and password, but I can log in manually before running the code. After logging in, if I paste the URL into my browser (Chrome), it downloads the file for me. But when I try the same thing in R, the closest I get is a text file that looks like the HTML of the website from which I need to download the file. Also note the URL structure: there is an "occ4.xlsx" in the middle, which I believe is the file.
I can also explain the other parameters in the URL:
country=JP (country - Japan)
jloc=state-1808 (state id - it will change with states)
...
...
time frame etc etc
Here is what I have tried:
Iteration 1 (inbuilt methods):
url <- "https://www.wantedanalytics.com/wa/counts/occ4.xlsx?country=JP&jloc=state-1808&mapview=msa&methodology=available&t%5Bsegment%5D%5Bperiod_prior%5D=count&t%5Bsegment%5D%5Bperiod_timeframe%5D=count&t%5Bsegment%5D%5Bperiod_type%5D=&t%5Bsegment%5D%5Bqty%5D=1000&t%5Btimeframe%5D=f2013-10-17-2017-02-17&timeframe=f2013-09-28-2017-02-17"
url_ns <- "http://www.wantedanalytics.com/wa/counts/occ4.xlsx?country=JP&jloc=state-1808&mapview=msa&methodology=available&t%5Bsegment%5D%5Bperiod_prior%5D=count&t%5Bsegment%5D%5Bperiod_timeframe%5D=count&t%5Bsegment%5D%5Bperiod_type%5D=&t%5Bsegment%5D%5Bqty%5D=1000&t%5Btimeframe%5D=f2013-10-17-2017-02-17&timeframe=f2013-09-28-2017-02-17"
destfile <- "test"
download.file(url, destfile,method="auto")
download.file(url, destfile,method="wininet")
download.file(url, destfile,method="auto", mode="wb")
download.file(url, destfile,method="wininet", mode="wb")
download.file(url_ns, destfile,method="auto")
download.file(url_ns, destfile,method="wininet")
download.file(url_ns, destfile,method="auto", mode="wb")
download.file(url_ns, destfile,method="wininet", mode="wb")
#all of above download the webpage and not the file
Iteration 2 (using RCurl):
# install.packages("RCurl")
library(RCurl)
library(readxl)
x <- getURL(url)
y <- getURL(url, ssl.verifypeer = FALSE)
z <- getURL(url, ssl.verifypeer = FALSE, ssl.verifyhost=FALSE)
identical(x,y) #TRUE
identical(y,z) #TRUE
x
[1] "<html><body>You are being redirected.</body></html>"
# Note the text about the redirect
out <- readxl::read_xlsx(textConnection(x)) # I know it won't work
#Error in read_fun(path = path, sheet = sheet, limits = limits, shim = shim, :
#  Expecting a single string value: [type=integer; extent=1].
w = substr(x,36,nchar(x)-31) #removing redirect text
identical(w,url) # FALSE
out <- readxl::read_xlsx(textConnection(w))
#Error in read_fun(path = path, sheet = sheet, limits = limits, shim = shim, :
#  Expecting a single string value: [type=integer; extent=1].
download.file(w, destfile,method="auto")
#Downloads the webpage again
download.file(url_ns,destfile,method="libcurl")
#Downloads the webpage again
I also tried the downloader package, but got the same results.
I can't share the username and password in this question, but if you want to try your hand at this problem, let me know by comment/PM and I will share them with you.
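One approach worth trying for authenticated downloads like this (not tested against this site) is to reuse the session cookie from the browser you already logged in with. A rough sketch with httr, reusing the url object from Iteration 1; the cookie name "_session_id" and its value are placeholders you would look up in Chrome's developer tools:
library(httr)
library(readxl)

# Placeholder cookie copied from the logged-in browser session
cookie_value <- "<cookie-value-from-browser>"

resp <- GET(url,
            set_cookies(`_session_id` = cookie_value),
            write_disk(tf <- tempfile(fileext = ".xlsx"), overwrite = TRUE))

# Only try to read the file if the request was not redirected to a login page
if (status_code(resp) == 200) {
  out <- readxl::read_xlsx(tf)
}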
On this website, I want to enter the code "539300" in the top search box and then get back either the resulting URL or some content from the page (using XPath).
library(rvest); library(httr); library(RCurl)
url <- "http://www.moneycontrol.com"
res <- POST(url, body = list(search_str = "539300"), encode = "form")
pg <- read_html(content(res, as="text", encoding="UTF-8"))
html_node(pg, xpath = '//*[@id="nChrtPrc"]/div[3]/h1')
This returns a missing node instead of the expected element:
{xml_missing}
<NA>
Or just use the RCurl and XML libraries, going directly to the stock's quote page instead of posting the search form:
library(RCurl)
library(XML)
url <- "http://www.moneycontrol.com/india/stockpricequote/miscellaneous/akspintex/AKS01"
curl <- getCurlHandle()
html <- getURL(url,curl=curl, .opts = list(ssl.verifypeer = FALSE),followlocation=TRUE)
doc <- htmlParse(html, encoding = "UTF-8")
h1 <- xpathSApply(doc, "//*[@id='nChrtPrc']/div[3]/h1//text()")
print(h1)
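If you would rather stay with rvest (as in the question), a similar sketch against the same quote page, using the corrected @id XPath, might look like this:
library(rvest)

url <- "http://www.moneycontrol.com/india/stockpricequote/miscellaneous/akspintex/AKS01"
pg <- read_html(url)
html_text(html_node(pg, xpath = '//*[@id="nChrtPrc"]/div[3]/h1'))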
I'm using R to scrape a list of ~1,000 URLs. The script often fails in a way which is not reproducible; when I re-run it, it may succeed or it may fail at a different URL. This leads me to believe that the problem may be caused by my internet connection momentarily dropping or by a momentary error on the server whose URL I'm scraping.
How can I design my R code to continue to the next URL if it encounters an error? I've tried using the try function but that doesn't seem to work for this scenario.
library(XML)
df <- data.frame(URL=c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/"))
for (i in 1:nrow(df)) {
URL <- df$URL[i]
# Exception handling
Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
if(inherits(Test, "try-error")) next
HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
Result <- xpathSApply(HTML, "//li", xmlValue)
print(URL)
print(Result[1])
}
Let's assume that the URL to be scraped is accessible at this step:
Test <- try(htmlTreeParse(URL, useInternalNodes = TRUE), silent = TRUE)
if(inherits(Test, "try-error")) next
But then the URL stops working just before this step:
HTML <- htmlTreeParse(URL, useInternalNodes = TRUE)
Then htmlTreeParse won't work, R will throw a warning/error, and my for loop will break. I want the for loop to continue to the next URL to be scraped; how can I accomplish this?
Thanks
Try this:
library(XML)
library(httr)
df <- c("http://www.google.com/", "http://www.ask.com/", "http://www.bing.com/")
for (i in 1:length(df)) {
URL <- df[i]
response <- GET(URL)
if (response$status_code != 200) next
HTML <- htmlTreeParse(content(response,type="text"),useInternalNodes=T)
Result <- xpathSApply(HTML, "//li", xmlValue)
if (length(Result) == 0) next
print(URL)
print(Result[1])
}
# [1] "http://www.ask.com/"
# [1] "\n \n Answers \n "
# [1] "http://www.bing.com/"
# [1] "Images"
So there are potentially (at least) two things going on here: the http request fails, or there are no <li> tags in the response. This uses GET(...) in the httr package to return the whole response and check the status code. It also checks for absence of <li> tags.
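One caveat, since the question mentions the connection momentarily dropping: GET() itself throws an error (rather than returning a status code) when the request cannot be made at all, so you may want to wrap that call too. A small sketch extending the loop above:
for (i in 1:length(df)) {
  URL <- df[i]
  # Treat connection-level failures the same as bad status codes
  response <- tryCatch(GET(URL), error = function(e) NULL)
  if (is.null(response) || response$status_code != 200) next
  HTML <- htmlTreeParse(content(response, type = "text"), useInternalNodes = TRUE)
  Result <- xpathSApply(HTML, "//li", xmlValue)
  if (length(Result) == 0) next
  print(URL)
  print(Result[1])
}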