I'm trying to download data from Yahoo using this code:
library(quantmod)
getSymbols("WOW", auto.assign=F)
This has worked for me on every occasion in the past, but now, 5 days before my group assignment is due, I receive this error:
Error in download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, : cannot download all files
In addition: Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, :
URL 'https://ichart.finance.yahoo.com/table.csv?
s=WOW&a=0&b=01&c=2007&d=4&e=17&f=2017&g=d&q=q&y=0&z=WOW&x=.csv': status was
'502 Bad Gateway'
The price history CSV URLs appear to have changed.
Old:
https://chart.finance.yahoo.com/table.csv?s=AAPL&a=2&b=17&c=2017&d=3&e=17&f=2017&g=d&ignore=.csv
New:
https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1492438581&period2=1495030581&interval=1d&events=history&crumb=XXXXXXX
The new version appends a "crumb" field, which appears to reflect cookie information from the user's browser. It seems they are intentionally blocking automated downloads of price histories, forcing queries to supply information that validates cookies set in a web browser.
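For illustration only, here is a rough sketch of how one might fetch a crumb programmatically with httr instead of copying it from the browser; the "CrumbStore" token and its location in the page HTML are assumptions about how the site behaved at the time, so treat this as a starting point rather than a guaranteed recipe:
library(httr)
get_yahoo_crumb <- function(symbol = "AAPL") {
  # load the quote page; httr keeps the cookies it sets for later requests to the same host
  page <- GET(paste0("https://finance.yahoo.com/quote/", symbol, "/history"))
  stop_for_status(page)
  body <- content(page, as = "text", encoding = "UTF-8")
  # the crumb is assumed to be embedded as "CrumbStore":{"crumb":"..."} in the page source
  sub('.*"CrumbStore":\\{"crumb":"([^"]+)".*', "\\1", body)
}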
The fix is detailed at https://github.com/joshuaulrich/quantmod/issues/157
Essentially:
remotes::install_github("joshuaulrich/quantmod", ref="157_yahoo_502")
# or
devtools::install_github("joshuaulrich/quantmod", ref="157_yahoo_502")
Version 0.4-9 of quantmod fixes this issue, and is now available on CRAN.
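So updating from CRAN should now be enough:
install.packages("quantmod")  # 0.4-9 or later includes the Yahoo fix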
I've always wondered why Yahoo was so nice as to provide data downloads, and how screwed I would be if they stopped doing it. Fortunately, help is on the way, courtesy of Joshua Ulrich.
Superfluous as it now may be, I coded a fix that shows one approach to get around the download problem.
library(xts)
getSymbols.yahoo.fix <- function(symbol,
                                 from = "2007-01-01",
                                 to = Sys.Date(),
                                 period = c("daily", "weekly", "monthly"),
                                 envir = globalenv(),
                                 crumb = "YourCrumb",
                                 DLdir = "~/Downloads/") { #1
  # build the yahoo query
  query1 <- paste("https://query1.finance.yahoo.com/v7/finance/download/", symbol, "?", sep = "")
  fromPosix <- as.numeric(as.POSIXlt(from))
  toPosix <- as.numeric(as.POSIXlt(to))
  query2 <- paste("period1=", fromPosix, "&period2=", toPosix, sep = "")
  interval <- switch(period[1], daily = "1d", weekly = "1wk", monthly = "1mo")
  query3 <- paste("&interval=", interval, "&events=history&crumb=", crumb, sep = "")
  yahooURL <- paste(query1, query2, query3, sep = "")
  # requires a browser to be open
  utils::browseURL("https://www.google.com")
  # run the query - downloads the security as a csv file
  # DLdir defaults to the download directory in the browser preferences
  utils::browseURL(yahooURL)
  # wait 500 msec for the download to complete - mileage may vary
  Sys.sleep(time = 0.5)
  yahooCSV <- paste(DLdir, symbol, ".csv", sep = "")
  yahooDF <- utils::read.csv(yahooCSV, header = TRUE)
  # -------
  # if you get: Error in file(file, "rt") : cannot open the connection
  # it's because the csv file has not completed downloading -
  # try increasing the time in Sys.sleep(time = x)
  # -------
  # delete the csv file
  file.remove(yahooCSV)
  # convert the date from character to Date format
  yahooDF$Date <- as.Date(yahooDF$Date)
  # convert to xts
  yahoo.xts <- xts(yahooDF[, -1], order.by = yahooDF$Date)
  # assign the xts object to the specified environment
  # default is globalenv()
  assign(symbol, yahoo.xts, envir = as.environment(envir))
  print(symbol)
} #1
It works like this:
1. Go to https://finance.yahoo.com/quote/AAPL/history?p=AAPL
2. Right-click "Download Data" and copy the link.
3. Copy the crumb that follows "&crumb=" and use it in the function call.
4. Set DLdir to the default download directory in your browser preferences.
5. Set envir = as.environment("yourEnvir") - this defaults to globalenv().
6. After downloading, the csv file is removed from your download directory to avoid clutter.
7. Note that this will leave an "untitled" window open in the browser.
As a simple test: getSymbols.yahoo.fix("AAPL")
You can also use getSymbols.yahoo.fix with lapply to get a list of asset data:
from <- "2016-04-01"
to <- Sys.Date()
period <- "daily"
envir <- globalenv()
crumb <- "yourCrumb"
DLdir <- "~/Downloads/"
assetList <- c("AAPL", "ADBE", "AMAT")
lapply(assetList, getSymbols.yahoo.fix, from = from, to = to, period = period, envir = envir, crumb = crumb, DLdir = DLdir)
Coded in RStudio on Mac OS X 10.11 with Safari as my default browser. It also appears to work with Chrome, but you will need to use the cookie crumb for Chrome. I use a cookie blocker but had to whitelist finance.yahoo.com to retain the cookie across browser sessions.
getSymbols.yahoo.fix might be useful. quantmod::getSymbols, of necessity, has more code built in for options and exception handling. I'm coding for personal work, so I often lift the pieces of code I need from package functions. I haven't benchmarked getSymbols.yahoo.fix because, of course, I don't have a working version of getSymbols for comparison. Besides, I couldn't pass up the opportunity to post my first Stack Overflow answer.
I too am encountering this error. A user on the MrExcel forum (jonathanwang003) explains that the new URL uses Unix timestamps for dates. The updated VBA code would look something like this:
qurl = "https://query1.finance.yahoo.com/v7/finance/download/" & Symbol
qurl = qurl & "?period1=" & (StartDate - DateSerial(1970, 1, 1)) * 86400 & _
       "&period2=" & (EndDate - DateSerial(1970, 1, 1)) * 86400 & _
       "&interval=1d&events=history&crumb=" & Crumb

QueryQuote:
With Sheets(Symbol).QueryTables.Add(Connection:="URL;" & qurl, Destination:=Sheets(Symbol).Range("a1"))
    .BackgroundQuery = True
    .TablesOnlyFromHTML = False
    .Refresh BackgroundQuery:=False
    .SaveData = True
End With
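The (StartDate - DateSerial(1970, 1, 1)) * 86400 arithmetic is simply the conversion to a Unix timestamp (seconds since 1970-01-01 UTC). For reference, the R equivalent is:
# Unix timestamp for a given calendar date (seconds since 1970-01-01 UTC)
as.numeric(as.POSIXct("2017-04-17", tz = "UTC"))
#> 1492387200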
The missing piece here is how to retrieve the "Crumb" field that contains the cookie information from the browser. Anyone have any ideas? I found this post, which may help: https://www.mrexcel.com/forum/excel-questions/1001259-when-using-querytables-what-posttext-syntax-click-button-webpage.html (look at the last post by john_w).
Try Google. The CSV is just a little different (it does not include the adjusted price, and the date is in another format).
http://www.google.com/finance/historical?q=NASDAQ:ADBE&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv
http://www.google.com/finance/historical?q=BVMF:PETR4&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv
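For example, one of those CSVs can be pulled straight into R; this is only a sketch, and the column names and the %d-%b-%y date format are assumptions about Google's output at the time:
# read Google Finance history directly (no crumb required)
goog_url <- "http://www.google.com/finance/historical?q=NASDAQ:ADBE&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv"
adbe <- read.csv(goog_url, stringsAsFactors = FALSE)
# Google's Date column uses a different format, e.g. "2-Aug-12"
adbe$Date <- as.Date(adbe$Date, format = "%d-%b-%y")
head(adbe)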
This seems like a simple problem, but I've been struggling with it for a few days. This is a minimum working example rather than the actual problem:
This question seemed similar, but I was unable to use its answer to solve my problem.
In a browser, I go to this URL, click [Search] (no need to make any choices from the lists), and then click [Download Results] (choosing, for example, the Xlsx option). The file then downloads.
To automate this in R I have tried:
library(rvest)
url1 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <- html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
Using Chrome Developer tools I find the url being used to initiate the download, so I try:
library(httr)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
res <- GET(url = url2, query = list(format = "xlsx"))
However, this does not download the file:
> res$content
raw(0)
I also tried
download.file(url = paste0(url2, "?format=xlsx") , destfile = "down.xlsx", mode = "wb")
But this downloads nothing:
> Content type '' length 0 bytes
> downloaded 0 bytes
Note that, in the browser, pasting url2 with the format query added does initiate the download (after doing the search from url1).
I thought that I should somehow be using the session info from the initial code block to do the download, but so far I can't see how.
Thanks in advance for any help!
You are almost there, and your intuition about using the session info is correct.
You just need to use rvest::jump_to to navigate to the second URL and then write the response to disk:
library(rvest)
url1 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <- html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
#### The above is your original code - below is the additional code you need:
download <- jump_to(subform, paste0(url2, "?format=xlsx"))
writeBin(download$response$content, "down.xlsx")
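If you then want the results in R rather than just on disk, the saved workbook can be read back in. This is just a sketch and assumes the readxl package; any xlsx reader would do:
library(readxl)
results <- read_excel("down.xlsx")  # read the file written by writeBin above
head(results)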
I'm trying to download some images from a website. I have a series of urls of images that I have to download. So I run it with this code :
dlphoto <- function(x){
  print(x)
  setTimeLimit(5)
  Sys.sleep(0.3)
  download.file(x, destfile = basename(x))
}
This function, however, has one major problem:
When I run my vector of 15000 URLs through it, it freezes the entire R session and stops responding to anything. If I run the URLs separately, it works fine, and it also works for, say, the first 50 URLs. But when I run 100, for example, it freezes as well. So can you please help me figure this out?
At first I was calling it like this:
dlphoto(allimage[,2])
Then I changed to this:
dlphoto(allimage[c(1:50),2])
dlphoto(allimage[c(51:100),2])
dlphoto(allimage[c(101:150),2])
dlphoto(allimage[c(151:200),2])
and so on until 15000. But it still freezes a lot, and each time it dies I have to close R, find where the process got to, and start again from there. I also get this warning message regularly:
Error in download.file(x, destfile = basename(x)) :
reached CPU time limit
Also, can you help me make the downloaded photos save to
/Users/name/Desktop/M2/Mémoire M2/Scrapingtest/photos
Thanks a lot!
A couple of improvements are possible. I have assumed that the OP is using download.file from the base utils package, which only supports a single file per call unless method is set to "libcurl" and quiet = TRUE.
Hence the fix is to use method = "libcurl" and quiet = TRUE in the download.file call. The changed function:
dlphoto <- function(x){
  print(x)
  download.file(x, destfile = basename(x), method = "libcurl", quiet = TRUE)
}
OR
download.file(x , destfile = basename(x), method="libcurl", quiet = TRUE)
Note: in both of the above cases, the progress bar will not be displayed.
I think the value of timeout from options() is good enough to ensure that download.file returns in case of delays.
The return value from download.file should be checked; any non-zero value indicates failure.
If you want to see progress bars (which is probably not needed for 15000 files in one go), then the function should be modified to handle one file at a time. The modified function would be:
# This function will display a progress bar for each file
dlphoto <- function(x){
  for(file in x){
    print(file)
    download.file(file, destfile = basename(file))
  }
}
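To also cover the second part of the question (saving into a specific folder) together with the return-value check suggested above, here is a hedged sketch; the destination path is taken from the question and is assumed to exist, and dlphoto_dir is just an illustrative name:
photodir <- "/Users/name/Desktop/M2/Mémoire M2/Scrapingtest/photos"

dlphoto_dir <- function(urls, destdir = photodir) {
  for (u in urls) {
    dest <- file.path(destdir, basename(u))
    # check the return code and keep going on failures instead of freezing the whole run
    status <- tryCatch(
      download.file(u, destfile = dest, method = "libcurl", quiet = TRUE),
      error = function(e) 1L  # treat errors as a non-zero return code
    )
    if (status != 0) message("Failed: ", u)
    Sys.sleep(0.3)  # be polite to the server
  }
}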
I am writing a program to collect all of the daily .csv files from this page. However, for some of the files, I get the error message:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open URL 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05042016_DailyAbsenceData.csv': HTTP status was '404 Not Found'
Here is an example from the May 12, 2016 file:
read.csv(url("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv"))
The bizarre thing is, if you go to the website, find the link to that file and click it, R no longer gives the error and reads the file correctly. What is going on here and how can I read those files without having to click them manually? (Note, only the first one of you is going to be able to replicate the problem, because clicking the file fixes it for the rest.)
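A quick way to see what the server actually returns for a given file, before pointing read.csv at it, is to check the HTTP status. This is only a diagnostic sketch and assumes the httr package:
library(httr)
resp <- HEAD("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv")
status_code(resp)  # reportedly 404 until the file has been opened once via the site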
Ultimately, I want to use the following loop to collect all the files:
# Create a vector of dates. This is the interval data is collected from.
dates = seq(as.Date("2016-05-1"), as.Date("2016-05-30"), by="days")
# Format to match the filename prefixes
dates = strftime(dates, '%m%d%Y')
# Create the vector of a file names I want read.
file.names = paste(dates,"_DailyAbsenceData.csv", sep = "")
# A loop that reads the .csv files into a list of data frame
daily.truancy = list()
for (i in 1:length(dates)) {
  # tryCatch prevents the loop from stopping when read.csv cannot access a file
  tryCatch({
    daily.truancy[[i]] = read.csv(url(paste("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/", file.names[i], sep = "")), sep = ",")
    stop("School day") # this indicates that the file was successfully read into the list
  }, error = function(e){cat("ERROR :", conditionMessage(e), "\n")})
}
# Unlist the daily data to a large panel
daily.truancy.2016 <- do.call("rbind", daily.truancy)
Note that the same error message is given for days when there is, in fact, no file (weekends). This is not a problem.
Since the pages are dynamically generated, the url function will not work here, but RSelenium was expressly designed for such tasks.
I want to thank @jdharrison for this superb package as well as his answers to challenging questions; see his answers page for more examples.
Basic setup procedure is explained here: RSelenium Setup
To extract the elementID of the element we are interested in, the easiest way is to right-click on it and click "Inspect" in Chrome (I am not sure about other browsers, but they should have similar functionality, possibly under a different name).
This will open a side window containing the html tags for the selected element.
library(RSelenium)
RSelenium:::startServer()
# you can replace the browser name with your version, e.g. firefox
remDr <- remoteDriver(browserName = "chrome")
remDr$open(silent = TRUE)
appURL <- 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/AttendanceReports.aspx'

monthYearCounter = 1
# total months to download
totalMonths = 2
remDr$navigate(appURL)

for(monthYearCounter in 1:totalMonths) {
  # active monthYear on the page, e.g. April 2017
  monthYearElem = remDr$findElement("xpath", "//td[contains(@style,'width:70%')]")
  # highlights the element in yellow for visual feedback
  monthYearElem$highlightElement()
  # extract text
  monthYearText = unlist(monthYearElem$getElementAttribute("innerHTML"))
  cat(paste0("Processing month year=", monthYearText, "\n"))

  # for a particular month all the CSV files are listed in a table;
  # extract the elementID of all CSV files using the pattern "imgBtnXls"
  csvFilesElemList = remDr$findElements("xpath", "//input[contains(@id,'imgBtnXls')]")

  # for each element, click it to save the file to the default download location,
  # with a delay between consecutive requests to avoid burdening the server
  lapply(csvFilesElemList, function(x) {
    x$clickElement()
    # be nice, do not overload the server with rapid requests!
    Sys.sleep(60)
  })

  # go to the previous month
  remDr$findElement("xpath", "//a[contains(@title,'Go to the previous month')]")$clickElement()
}
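Once the clicks have finished and the CSVs are sitting in the browser's default download directory, they can be combined much like the loop in the question. A sketch, where the download path is an assumption and the file-name pattern follows the URLs shown in the question:
csv_dir <- "~/Downloads"  # adjust to your browser's download location
files <- list.files(csv_dir, pattern = "_DailyAbsenceData\\.csv$", full.names = TRUE)
daily.truancy <- lapply(files, read.csv)
daily.truancy.2016 <- do.call(rbind, daily.truancy)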
My R workflow is usually such that I have a file open into which I type R commands, and I’d like to execute those commands in a separately opened R shell.
The easiest way of doing this is to say source('the-file.r') inside R. However, this always reloads the whole file which may take considerable time if big amounts of data are processed. It also requires me to specify the filename again.
Ideally, I’d like to source only a specific line (or lines) from the file (I’m working on a terminal where copy&paste doesn’t work).
source doesn’t seem to offer this functionality. Is there another way of achieving this?
Here's another way with just R:
source2 <- function(file, start, end, ...) {
  file.lines <- scan(file, what = character(), skip = start - 1, nlines = end - start + 1, sep = '\n')
  file.lines.collapsed <- paste(file.lines, collapse = '\n')
  source(textConnection(file.lines.collapsed), ...)
}
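For example, to run only lines 5 through 10 of a script (the file name is a placeholder):
source2("the-file.r", start = 5, end = 10)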
Using the right tool for the job …
As discussed in the comments, the real solution is to use an IDE that allows sourcing specific parts of a file. There are many existing solutions:
For Vim, there’s Nvim-R.
For Emacs, there’s ESS.
And of course there’s the excellent stand-alone RStudio IDE.
As a special point of note, all of the above solutions work both locally and on a server (accessed via an SSH connection, say). R can even be run on an HPC cluster — it can still communicate with the IDEs if set up properly.
… or … not.
If, for whatever reason, none of the solutions above work, here’s a small module[gist] that can do the job. I generally don’t recommend using it, though.1
#' (Re-)source parts of a file
#'
#' \code{rs} loads, parses and executes parts of a file as if entered into the R
#' console directly (but without implicit echoing).
#'
#' @param filename character string of the filename to read from. If missing,
#'   use the last-read filename.
#' @param from first line to parse.
#' @param to last line to parse.
#' @return the value of the last evaluated expression in the source file.
#'
#' @details If both \code{from} and \code{to} are missing, the default is to
#'   read the whole file.
rs = local({
    last_file = NULL

    function (filename, from, to = if (missing(from)) -1 else from) {
        if (missing(filename)) filename = last_file
        stopifnot(! is.null(filename))
        stopifnot(is.character(filename))
        force(to)
        if (missing(from)) from = 1
        source_lines = scan(filename, what = character(), sep = '\n',
                            skip = from - 1, n = to - from + 1,
                            encoding = 'UTF-8', quiet = TRUE)
        result = withVisible(eval.parent(parse(text = source_lines)))
        last_file <<- filename # Only save the filename once successfully sourced.
        if (result$visible) result$value else invisible(result$value)
    }
})
Usage example:
# Source the whole file:
rs('some_file.r')
# Re-source everything (same file):
rs()
# Re-source just the fifth line:
rs(from = 5)
# Re-source lines 5–10:
rs(from = 5, to = 10)
# Re-source everything up until line 7:
rs(to = 7)
1 Funny story: I recently found myself on a cluster with a messed-up configuration that made it impossible to install the required software, but desperately needing to debug an R workflow due to a looming deadline. I literally had no choice but to copy and paste lines of R code into the console manually. This is a situation in which the above might come in handy. And yes, that actually happened.