I'm trying to download data from Yahoo using this code:
library(quantmod)
getSymbols("WOW", auto.assign=F)
This has worked for me on every occasion in the past, but now, five days before my group assignment is due, I receive this error:
Error in download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, : cannot download all files
In addition: Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, :
URL 'https://ichart.finance.yahoo.com/table.csv?
s=WOW&a=0&b=01&c=2007&d=4&e=17&f=2017&g=d&q=q&y=0&z=WOW&x=.csv': status was
'502 Bad Gateway'
The price history CSV URLs appear to have changed.
Old:
https://chart.finance.yahoo.com/table.csv?s=AAPL&a=2&b=17&c=2017&d=3&e=17&f=2017&g=d&ignore=.csv
New:
https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1492438581&period2=1495030581&interval=1d&events=history&crumb=XXXXXXX
The new version appends a "crumb" field which appears to reflect cookie information in the user's browser. It seems Yahoo is intentionally blocking automated downloads of price histories, forcing queries to supply information that validates a cookie set in a web browser.
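For illustration only, here is a hedged sketch of how one might reproduce that cookie-plus-crumb handshake by hand with httr, assuming the crumb is embedded in the quote page source under a "CrumbStore" key (that pattern is my assumption; crumbs containing escaped characters may need extra handling). The proper fix is the quantmod patch below.
library(httr)

# Fetch the quote page to obtain a session cookie and scrape the crumb
sym  <- "WOW"
page <- GET(paste0("https://finance.yahoo.com/quote/", sym, "/history"))
txt  <- content(page, as = "text", encoding = "UTF-8")

# Assumed pattern: "CrumbStore":{"crumb":"..."} somewhere in the page source
m     <- regmatches(txt, regexpr('CrumbStore":\\{"crumb":"[^"]+', txt))
crumb <- sub('.*"', "", m)

p1 <- as.integer(as.POSIXct("2007-01-01", tz = "UTC"))
p2 <- as.integer(Sys.time())
dl <- paste0("https://query1.finance.yahoo.com/v7/finance/download/", sym,
             "?period1=", p1, "&period2=", p2,
             "&interval=1d&events=history&crumb=", crumb)

# The crumb only validates together with the cookie set on the first request,
# so forward that cookie explicitly
ck  <- cookies(page)
res <- GET(dl, set_cookies(.cookies = setNames(ck$value, ck$name)))
wow <- read.csv(text = content(res, as = "text", encoding = "UTF-8"))
head(wow)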
The fix is detailed at https://github.com/joshuaulrich/quantmod/issues/157
Essentially:
remotes::install_github("joshuaulrich/quantmod", ref="157_yahoo_502")
# or
devtools::install_github("joshuaulrich/quantmod", ref="157_yahoo_502")
Version 0.4-9 of quantmod fixes this issue, and is now available on CRAN.
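As a quick sanity check once the updated package is installed (nothing here beyond the call from the original question):
install.packages("quantmod")  # 0.4-9 or later from CRAN
library(quantmod)
wow <- getSymbols("WOW", auto.assign = FALSE)
head(wow)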
I've always wondered why Yahoo was so nice as to provide data downloads, and how screwed I would be if they ever stopped. Fortunately, help is on the way, courtesy of Joshua Ulrich.
Superfluous as it now may be, I coded a fix that shows one approach to get around the download problem.
library(xts)
getSymbols.yahoo.fix <- function(symbol,
                                 from   = "2007-01-01",
                                 to     = Sys.Date(),
                                 period = c("daily", "weekly", "monthly"),
                                 envir  = globalenv(),
                                 crumb  = "YourCrumb",
                                 DLdir  = "~/Downloads/") {
  # build the yahoo query
  query1    <- paste("https://query1.finance.yahoo.com/v7/finance/download/", symbol, "?", sep = "")
  fromPosix <- as.numeric(as.POSIXlt(from))
  toPosix   <- as.numeric(as.POSIXlt(to))
  query2    <- paste("period1=", fromPosix, "&period2=", toPosix, sep = "")
  interval  <- switch(period[1], daily = "1d", weekly = "1wk", monthly = "1mo")
  query3    <- paste("&interval=", interval, "&events=history&crumb=", crumb, sep = "")
  yahooURL  <- paste(query1, query2, query3, sep = "")
  # requires a browser to be open
  utils::browseURL("https://www.google.com")
  # run the query - downloads the security as a csv file
  # DLdir defaults to the download directory in the browser preferences
  utils::browseURL(yahooURL)
  # wait 500 msec for the download to complete - mileage may vary
  Sys.sleep(time = 0.5)
  yahooCSV <- paste(DLdir, symbol, ".csv", sep = "")
  yahooDF  <- utils::read.csv(yahooCSV, header = TRUE)
  # -------
  # if you get: Error in file(file, "rt") : cannot open the connection
  # it's because the csv file has not completed downloading -
  # try increasing the time in Sys.sleep(time = x)
  # -------
  # delete the csv file
  file.remove(yahooCSV)
  # convert the date from character to Date format
  yahooDF$Date <- as.Date(yahooDF$Date)
  # convert to xts
  yahoo.xts <- xts(yahooDF[, -1], order.by = yahooDF$Date)
  # assign the xts object to the specified environment (default is globalenv())
  assign(symbol, yahoo.xts, envir = as.environment(envir))
  print(symbol)
}
It works like this:
1. Go to https://finance.yahoo.com/quote/AAPL/history?p=AAPL
2. Right-click on "Download Data" and copy the link
3. Copy the crumb after "&crumb=" and use it in the function call
4. Set DLdir to the default download directory in your browser preferences
5. Set envir = as.environment("yourEnvir") - defaults to globalenv()
6. After downloading, the csv file is removed from your download directory to avoid clutter
7. Note that this will leave an "untitled" window open in the browser
As a simple test: getSymbols.yahoo.fix("AAPL")
You can also use getSymbols.yahoo.fix with lapply to get a list of asset data
from <- "2016-04-01"
to <- Sys.Date()
period <- "daily"
envir <- globalenv()
crumb <- "yourCrumb"
DLdir <- "~/Downloads/"
assetList <- c("AAPL", "ADBE", "AMAT")
lapply(assetList, getSymbols.yahoo.fix, from = from, to = to, period = period, envir = envir, crumb = crumb, DLdir = DLdir)
Coded in RStudio on Mac OS X 10.11, using Safari as my default browser. It also appears to work with Chrome, but you will need to use the cookie crumb from Chrome. I use a cookie blocker and had to whitelist finance.yahoo.com to retain the cookie across browser sessions.
getSymbols.yahoo.fix might be useful. quantmod::getSymbols, of necessity, has more code built in for options and exception handling. I'm coding for personal work, so I often lift the pieces of code I need from package functions. I haven't benchmarked getSymbols.yahoo.fix because, of course, I don't have a working version of getSymbols for comparison. Besides, I couldn't pass up the opportunity to post my first Stack Overflow answer.
I too am encountering this error. A user on the MrExcel forum (jonathanwang003) explains that the new URL uses Unix timestamps for dates. The updated VBA code would look something like this:
qurl = "https://query1.finance.yahoo.com/v7/finance/download/" & Symbol
qurl = qurl & "?period1=" & (StartDate - DateSerial(1970, 1, 1)) * 86400 & _
    "&period2=" & (EndDate - DateSerial(1970, 1, 1)) * 86400 & _
    "&interval=1d&events=history&crumb=" & Crumb

QueryQuote:
With Sheets(Symbol).QueryTables.Add(Connection:="URL;" & qurl, Destination:=Sheets(Symbol).Range("a1"))
    .BackgroundQuery = True
    .TablesOnlyFromHTML = False
    .Refresh BackgroundQuery:=False
    .SaveData = True
End With
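For anyone doing the same conversion in R rather than VBA, a small sketch of the date-to-Unix-timestamp arithmetic the new URL expects (the crumb is still a placeholder you must supply yourself):
# Days since 1970-01-01 times 86400 gives seconds since the epoch (midnight UTC),
# mirroring the (Date - DateSerial(1970, 1, 1)) * 86400 term in the VBA above
to_unix <- function(d) as.numeric(as.Date(d)) * 86400

qurl <- paste0("https://query1.finance.yahoo.com/v7/finance/download/AAPL",
               "?period1=", to_unix("2017-03-17"),
               "&period2=", to_unix("2017-04-17"),
               "&interval=1d&events=history&crumb=", "YourCrumb")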
The missing piece here is how to retrieve the "crumb" field that carries the cookie information from the browser. Does anyone have any ideas? I found this post, which may help: https://www.mrexcel.com/forum/excel-questions/1001259-when-using-querytables-what-posttext-syntax-click-button-webpage.html (look at the last post, by john_w).
Try Google. The CSV is just a little different (it does not include the adjusted price, and the date is in a different format).
http://www.google.com/finance/historical?q=NASDAQ:ADBE&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv
http://www.google.com/finance/historical?q=BVMF:PETR4&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv
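A hedged sketch of reading one of those Google CSVs and normalising the date column; the "%d-%b-%y" format (e.g. "2-Aug-12") is my assumption about how Google renders dates, so adjust it if your download looks different:
gurl <- paste0("http://www.google.com/finance/historical?q=NASDAQ:ADBE",
               "&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv")
adbe <- read.csv(gurl, stringsAsFactors = FALSE)
# Google's date column is not ISO formatted, unlike Yahoo's
adbe$Date <- as.Date(adbe$Date, format = "%d-%b-%y")
head(adbe)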
This seems like a simple problem, but I've been struggling with it for a few days. This is a minimal working example rather than the actual problem.
This question seemed similar, but I was unable to use its answer to solve my problem.
In a browser, I go to this url, and click on [Search] (no need to make any choices from the lists), and then on [Download Results] (choosing, for example, the Xlsx option). The file then downloads.
To automate this in R I have tried:
library(rvest)
url1 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <- html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
Using Chrome Developer tools I find the url being used to initiate the download, so I try:
library(httr)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
res <- GET(url = url2, query = list(format = "xlsx"))
However this does not download the file:
> res$content
raw(0)
I also tried
download.file(url = paste0(url2, "?format=xlsx") , destfile = "down.xlsx", mode = "wb")
But this downloads nothing:
> Content type '' length 0 bytes
> downloaded 0 bytes
Note that, in the browser, pasting url2 and adding the format query does initiate the download (after doing the search from url1).
I thought that I should somehow be using the session info from the initial code block to do the download, but so far I can't see how.
Thanks in advance for any help!
You are almost there and your intuition is correct about using the session info.
You just need to use rvest::jump_to to navigate to the second url and then write it to disk:
library(rvest)
url1 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <- html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
#### The above is your original code - below is the additional code you need:
download <- jump_to(subform, paste0(url2, "?format=xlsx"))
writeBin(download$response$content, "down.xlsx")
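As a quick follow-up check that the workbook actually arrived, you could open it straight away (readxl is used here purely for illustration and is not part of the original answer):
library(readxl)
results <- read_excel("down.xlsx")
head(results)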
I'm trying to download some images from a website. I have a series of image URLs that I have to download, so I run them through this function:
dlphoto <- function(x){
  print(x)
  setTimeLimit(5)
  Sys.sleep(0.3)
  download.file(x, destfile = basename(x))
}
This function has, however, one major problem: when I run my vector of 15000 URLs through it, it freezes the entire R session and stops reacting to anything. However, if I run the URLs separately, it works fine, and if I run, for example, URLs 1:50, it works too. But when I pass 1:100, it freezes as well. Can you please help me figure this out?
At first I was calling it with this line:
dlphoto(allimage[,2])
then I changed to this one :
dlphoto(allimage[c(1:50),2])
dlphoto(allimage[c(51:100),2])
dlphoto(allimage[c(101:150),2])
dlphoto(allimage[c(151:200),2])
and so on up to 15000.
But it still freezes a lot, and each time it dies I have to close R, work out how far the process got, and restart from there. I also regularly get this message:
Error in download.file(x, destfile = basename(x)) :
reached CPU time limit
Also, can you help me make sure the downloaded photos are saved in
/Users/name/Desktop/M2/Mémoire M2/Scrapingtest/photos
Thanks a lot!
There are a couple of possible improvements. I have assumed that the OP is using download.file from the base utils package, which handles only a single file per call unless method is set to "libcurl" and quiet = TRUE.
Hence the fix is to use method = "libcurl" and quiet = TRUE in the download.file call. The changed function:
dlphoto <- function(x){
  print(x)
  download.file(x, destfile = basename(x), method = "libcurl", quiet = TRUE)
}
or, since method = "libcurl" accepts a vector of URLs, simply the single call:
download.file(x, destfile = basename(x), method = "libcurl", quiet = TRUE)
Note: in both cases above, the progress bar will not be displayed.
I think the timeout value from options() is good enough to ensure that download.file returns in case of delays.
The return value from download.file should be checked; any non-zero value indicates failure.
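A hedged illustration of that return-value check on a single URL (the URL here is a placeholder, not from the question):
ret <- tryCatch(
  download.file("https://example.com/img001.jpg", destfile = "img001.jpg",
                method = "libcurl", quiet = TRUE),
  error = function(e) 1L   # treat a hard failure the same as a non-zero status
)
if (ret != 0) message("download failed")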
If you want to see progress bars (which is probably not needed for 15000 files in one go), then the function should be modified to handle one file at a time. The modified function will be:
# This function will display a progress bar for each file
dlphoto <- function(x){
  for (file in x) {
    print(file)
    download.file(file, destfile = basename(file))
  }
}
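The question also asked for the photos to land in a specific folder. A sketch of that, using the path from the question and file.path() to build the destination (dlphoto_to is just an illustrative name, and the return-value check repeats the point above):
photo_dir <- "/Users/name/Desktop/M2/Mémoire M2/Scrapingtest/photos"

dlphoto_to <- function(x, dir = photo_dir) {
  for (file in x) {
    ret <- tryCatch(
      download.file(file, destfile = file.path(dir, basename(file)), quiet = TRUE),
      error = function(e) 1L
    )
    if (ret != 0) warning("download failed for: ", file)
  }
}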
I am writing a program to collect all of the daily .csv files from this page. However, for some of the files, I get the error message:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open URL 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05042016_DailyAbsenceData.csv': HTTP status was '404 Not Found'
Here is an example from the May 12, 2016 file:
read.csv(url("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv"))
The bizarre thing is, if you go to the website, find the link to that file and click it, R no longer gives the error and reads the file correctly. What is going on here and how can I read those files without having to click them manually? (Note, only the first one of you is going to be able to replicate the problem, because clicking the file fixes it for the rest.)
Ultimately, I want to use the following loop to collect all the files:
# Create a vector of dates. This is the interval the data is collected over.
dates = seq(as.Date("2016-05-01"), as.Date("2016-05-30"), by = "days")
# Format to match the filename prefixes
dates = strftime(dates, '%m%d%Y')
# Create the vector of file names I want to read.
file.names = paste(dates, "_DailyAbsenceData.csv", sep = "")
# A loop that reads the .csv files into a list of data frames
daily.truancy = list()
for (i in 1:length(dates)) {
  tryCatch({ # this prevents the loop from stopping on an error when read.csv cannot access the file
    daily.truancy[[i]] = read.csv(url(paste("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/", file.names[i], sep = "")), sep = ",")
    stop("School day") # this indicates that the file was successfully read into the list
  }, error = function(e) {cat("ERROR :", conditionMessage(e), "\n")})
}
# Unlist the daily data to a large panel
daily.truancy.2016 <- do.call("rbind", daily.truancy)
Note that the same error message is given for days when there is, in fact, no file (weekends). This is not a problem.
Since the pages are dynamically generated, the url function will not work here, but RSelenium was expressly designed for such tasks.
I want to thank @jdharrison for this superb package, as well as for his answers to challenging questions - see his answers page for more examples.
The basic setup procedure is explained here: RSelenium Setup
To extract the element ID of interest, the easiest way is to right-click on the element and click "Inspect" in Chrome. (I'm not sure about other browsers; they should have similar functionality, possibly under a different name.)
This will open a side window containing the HTML tags for the selected element.
library(RSelenium)
RSelenium:::startServer()
#you can replace browser name with your version e.g. firefox
remDr <- remoteDriver(browserName = "chrome")
remDr$open(silent = TRUE)
appURL <- 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/AttendanceReports.aspx'
monthYearCounter = 1
#total months to download
totalMonths = 2
remDr$navigate(appURL)
for (monthYearCounter in 1:totalMonths) {
  # Active monthYear on the page, e.g. April 2017
  monthYearElem = remDr$findElement("xpath", "//td[contains(@style,'width:70%')]")
  # highlights the element in yellow for visual feedback
  monthYearElem$highlightElement()
  # extract text
  monthYearText = unlist(monthYearElem$getElementAttribute("innerHTML"))
  cat(paste0("Processing month year=", monthYearText, "\n"))
  # For a particular month all the CSV files are listed in a table;
  # extract the elementID of all CSV files using the pattern "imgBtnXls"
  csvFilesElemList = remDr$findElements("xpath", "//input[contains(@id,'imgBtnXls')]")
  # For all elements, click to save the file to the default download location.
  # Ensure a delay between consecutive requests to avoid burdening the servers.
  lapply(csvFilesElemList, function(x) {
    x$clickElement()
    # Be nice, do not overload the servers with rapid requests!
    Sys.sleep(60)
  })
  # Go to the previous month
  remDr$findElement("xpath", "//a[contains(@title,'Go to the previous month')]")$clickElement()
}
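Once the files have landed in the browser's download folder, they can be read back into the list the question was building. A small sketch under two assumptions of mine: the download folder path, and that the downloaded files keep the "_DailyAbsenceData.csv" naming pattern from the direct-download URLs.
# Assumed default download folder; adjust to your browser settings
dl_dir <- "~/Downloads"
csvs   <- list.files(dl_dir, pattern = "_DailyAbsenceData\\.csv$", full.names = TRUE)
daily.truancy <- lapply(csvs, read.csv)
daily.truancy.2016 <- do.call("rbind", daily.truancy)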
I have a question about downloading files. I know how to download a single file using the download.file function. I need to download multiple files from a particular site, each file corresponding to a different date. I have a series of dates from which I can build the URL to download each file. I know for a fact that for some particular dates the files are missing from the website, and consequently my code stops at that point. I then have to manually reset the date index (increment it by 1) and re-run the code. Since I have to download more than 1500 files, I was wondering whether I can somehow capture the 'absence of the file' so that, instead of stopping, the code continues with the next date in the array.
Below is the dput of a part of the date array:
dput(head(fnames,10))
c("20060102.trd", "20060103.trd", "20060104.trd", "20060105.trd",
"20060106.trd", "20060109.trd", "20060110.trd", "20060112.trd",
"20060113.trd", "20060116.trd")
This file has 1723 dates. Below is the code that I am using:
for (i in 1:length(fnames)) {
  file <- paste(substr(fnames[i], 7, 8), substr(fnames[i], 5, 6), substr(fnames[i], 1, 4), sep = "")
  URL <- paste("http://xxxxx_", file, ".zip", sep = "")
  download.file(URL, paste(file, "zip", sep = "."))
  unzip(paste(file, "zip", sep = "."))
}
The program works fine until it encounters a date for which the file is missing, and then it stops. Is there a way to capture this, print the missing file name (the variable 'file'), and move on to the next date in the array?
Please help.
I apologize that I have not shared the exact URL. In case it becomes difficult to simulate the issue, then please let me know.
* Trying to incorporate @Paul's suggestion.
I worked on a smaller dataset.
dput(testnames) is
c("20120214.trd", "20120215.trd", "20120216.trd", "20120217.trd",
"20120221.trd")
I know that the file corresponding to the date '20120216' is missing from the website. I altered my code to incorporate the tryCatch function; here it is:
tryCatch({
  for (i in 1:length(testnames)) {
    file <- paste(substr(testnames[i], 7, 8), substr(testnames[i], 5, 6), substr(testnames[i], 1, 4), sep = "")
    URL <- paste("http://xxxx_", file, ".zip", sep = "")
    download.file(URL, paste(file, "zip", sep = "."))
    unzip(paste(file, "zip", sep = "."))
  }
},
error = function(e) {
  cat(file, '\n')
  i = i + 1
},
warning = function(w) {
  message('cannot unzip')
  i = i + 1
})
It runs fine for the first two dates and, as expected, throws an error for the third one. I am facing two issues:
1. When I exclude the warning block, it gives me the missing file name, as coded in the error block. But when I include the warning block, it only issues the warning and somehow doesn't execute the error block. Why is that?
2. In either case, the code stops after reading "20120216.trd" and doesn't proceed to the next file, which is what I want it to do. Isn't incrementing the variable i sufficient for that?
Please advise.
You can do this using tryCatch. This function will try the operation you feed it and provide you with a way of dealing with errors. In your case, for example, an error could simply lead to skipping the file and ignoring the error:
skip_with_message = simpleError('Did not work out')
tryCatch(print(bla), error = function(e) skip_with_message)
# <simpleError: Did not work out>
Notice that the error here is that the bla object does not exist.
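Applied to the loop in the question, the key point is that tryCatch goes inside the for loop, around each individual download, rather than around the whole loop; that way one missing date prints its file name and the loop simply carries on. A sketch using the same variables and placeholder URL as the question:
for (i in 1:length(fnames)) {
  file <- paste(substr(fnames[i], 7, 8), substr(fnames[i], 5, 6),
                substr(fnames[i], 1, 4), sep = "")
  URL <- paste("http://xxxxx_", file, ".zip", sep = "")
  tryCatch({
    download.file(URL, paste(file, "zip", sep = "."))
    unzip(paste(file, "zip", sep = "."))
  }, error = function(e) {
    # report the missing file and move on to the next date
    cat("missing or failed:", file, "\n")
  })
}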