I'm trying to download some images from a website. I have a series of image URLs to download, so I run them through this function:
dlphoto <- function(x){
print(x)
setTimeLimit(5)
Sys.sleep(0.3)
download.file(x , destfile = basename(x))
}
This function has one major problem, however: when I run my vector of 15000 URLs through it, it freezes the entire R session and stops responding to anything. If I run the URLs one at a time it works fine, and a batch like 1:50 works too, but a batch of 1:100 freezes it again. So can you please help me figure this out?
At first I was using this line to call it:
dlphoto(allimage[,2])
then I changed to this:
dlphoto(allimage[c(1:50),2])
dlphoto(allimage[c(51:100),2])
dlphoto(allimage[c(101:150),2])
dlphoto(allimage[c(151:200),2])
and so on up to 15000. But it still freezes a lot, and each time it dies I have to close R, work out how far the process got, and restart from there. I also get this error message regularly:
Error in download.file(x, destfile = basename(x)) :
reached CPU time limit
Also, can you help me make the downloaded photos save into
/Users/name/Desktop/M2/Mémoire M2/Scrapingtest/photos
thanks a lot !!
There are a couple of possible improvements. I have assumed that the OP is using download.file from the base utils package, which handles only a single file per call unless method = "libcurl" is used with quiet = TRUE.
Hence the fix is to pass method = "libcurl" and quiet = TRUE to download.file. The changed function:
dlphoto <- function(x){
print(x)
download.file(x , destfile = basename(x), method="libcurl", quiet = TRUE)
}
Or call download.file directly on the whole vector of URLs:
download.file(x , destfile = basename(x), method="libcurl", quiet = TRUE)
Note: in both cases above, no progress bar will be displayed.
I think the default timeout value from options() is good enough to ensure download.file returns in case of delays.
The return value of download.file should also be checked: any non-zero value indicates failure.
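A minimal sketch of that check, which also saves the photos into the folder asked about in the question (the folder is assumed to exist already):
photodir <- "/Users/name/Desktop/M2/Mémoire M2/Scrapingtest/photos"

dlphoto <- function(urls) {
  failed <- character(0)
  for (u in urls) {
    status <- try(download.file(u,
                                destfile = file.path(photodir, basename(u)),
                                method = "libcurl", quiet = TRUE),
                  silent = TRUE)
    # download.file() returns 0 on success; an error or non-zero code means failure
    if (inherits(status, "try-error") || status != 0) failed <- c(failed, u)
  }
  failed  # returns the URLs that did not download
}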
If you want to see progress bars (probably not needed for 15,000 files in one go), the function should be modified to handle one file at a time. The modified function will be:
# This function will display a progress bar for each file
dlphoto <- function(x){
for(file in x){
print(file)
download.file(file , destfile = basename(file))
}
}
I have a script that loops through a selection of NetCDF files. The files are opened, the data extracted, then closed again. I have used this many times before and it works with no issue. I was recently sent a new selection of files to run through the same code. I can check the files individually using the ncdf4 package and the nc_open() function; the files look fine and are not corrupt. However, when I run through the loop the function will not let me open the files, and I get this error:
Error in R_nc4_open: NetCDF: HDF error
When I step through the loop manually to check, all is fine and the file opens. It just cannot be opened from inside the loop. There is no issue with the code itself.
Has anyone come across this before, with non-corrupt NetCDF files throwing this error only occasionally? Even outside the loop I can run the code and get the error the first time, then run it again without changing anything and the connection works.
I'm not sure how to troubleshoot this one, so I'm just looking for advice as to why this might be happening.
Code snippet:
library(ncdf4)

targetYear <- '2005-2019'
variables <- c('CHL','SSH')
ncNam <- list.files(folderdir, '.nc', recursive = TRUE)

for(v in 1:length(variables)){
  varNam <- unlist(unique(variables))[v]
  # Get names corresponding to variable
  varLs <- ncNam[grep(varNam, basename(ncNam))]
  varLs <- varLs[grep(targetYear, varLs)]
  varLs <- varLs[1]
  export <- paste0(exportdir, varNam, '/')
  dir.create(export, recursive = TRUE)

  if(varNam == 'Proximity1km' | varNam == 'Proximity200m' | varNam == 'ProximityCoast' | varNam == 'Bathymetry'){
    fileNam <- varLs
    ncfilename <- paste0(folderdir, fileNam)
    print(ncfilename)
    # Read ncfile
    ncfile <- nc_open(ncfilename)
    nc_close(ncfile)
    gc()
  } else {
    fileNam <- varLs
    ncfilename <- paste0(folderdir, fileNam)
    print(ncfilename)
    # Read ncfile
    ncfile <- nc_open(ncfilename)
    nc_close(ncfile)
    gc()
  }
}
I figured out the issue. It was to do with the error-detection filter in the .nc files.
I removed the filter and the files work fine inside the loop. Still a bit strange.
Perhaps the ncdf4 package is not up to date with this filtering.
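For anyone hitting the same thing, here is a hypothetical sketch of stripping the filters by rewriting each file with nccopy, called from R. It assumes the NetCDF command-line utilities are installed and that your nccopy build supports "-F none" (remove all filters); folderdir is the directory from the question.
ncFiles <- list.files(folderdir, pattern = "\\.nc$", full.names = TRUE)
for (f in ncFiles) {
  out <- sub("\\.nc$", "_nofilter.nc", f)
  # "-F none" asks nccopy to drop all per-variable filters while copying
  system2("nccopy", args = c("-F", "none", shQuote(f), shQuote(out)))
}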
I have some code which compacts and repairs a number of MS Access databases:
library(RDCOMClient)
library(stringr)
accfolders <- list.dirs('C:\\users\\username\\accessdb\\',recursive = FALSE,full.names=F)[-1] #need -1 to exclude current dir
accfolders <- paste0("C:\\users\\username\\accessdb\\",accfolders)
#launch access
oApp <- COMCreate("Access.Application")
for (folder in accfolders) {
accfiles <- list.files(path=folder, pattern="\\.mdb", full.names=TRUE)
print(paste("working in dir", folder))
for (file in accfiles){
print (paste("working in db", file))
bkfile <- sub(".mdb", "_bk.mdb", file)
oApp$CompactRepair(file, bkfile, FALSE)
file.copy(bkfile, file, overwrite = TRUE)
file.remove(bkfile)
}
#print(paste("completed", folder))
}
oApp$quit()
gc()
However, sometimes the code returns this following error:
<checkErrorInfo> 80020009
Error: Exception occurred.
This error seems to happen somewhat randomly, on the oApp$CompactRepair call in the inner for loop.
I can't figure out why: it happens with random .mdb files rather than a specific one. Sometimes I run the code and there is no issue at all; other times it produces the error.
Since I can't figure it out, could I capture this error somehow and just skip that element in the for loop, so the code doesn't stop?
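One way to do that, sketched here with the variable names from the code above, is to wrap the COM call in tryCatch so a failing database is logged and skipped rather than stopping the loop:
for (file in accfiles) {
  print(paste("working in db", file))
  bkfile <- sub(".mdb", "_bk.mdb", file)
  ok <- tryCatch({
    oApp$CompactRepair(file, bkfile, FALSE)
    TRUE
  }, error = function(e) {
    message("CompactRepair failed for ", file, ": ", conditionMessage(e))
    FALSE
  })
  # only replace the original if the compact/repair actually produced a backup
  if (ok && file.exists(bkfile)) {
    file.copy(bkfile, file, overwrite = TRUE)
    file.remove(bkfile)
  }
}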
I'm trying to run a function within a function, based loosely off of this. However, since xPDF can convert PDFs to PNGs, I skipped the ImageMagick conversion step, as well as the faulty logic with the function(i) process, since pdftopng requires a root name (which is "ocrbook-000001.png" in this case) and it throws an error when looking for a PNG named after the original PDF.
My issue is now with getting Tesseract to do anything with my PNG files. I get the error:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Here is my code:
lapply(myfiles, function(i){
shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
lapply(mypngs, function(z){
shell(shQuote(paste0("tesseract ", z, " out")))
file.remove(paste0(z))
})
})
The issue was apparently the DPI being set too high for Tesseract to handle. Changing the pdftopng DPI parameter from 600 to 150 appears to have corrected it; there seems to be a maximum DPI beyond which Tesseract no longer knows what to do.
I have also changed my code from a static naming convention to a more dynamic one that mimics the files' original names.
dest <- "C:\\users\\YOURNAME\\desktop"
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
lapply(files, function(i){
shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i,".pdf", " ",i)))
})
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
lapply(myppms, function(y){
shell(shQuote(paste0("magick ", y,".ppm"," ",y,".tif")))
file.remove(paste0(y,".ppm"))
})
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
lapply(mytiffs, function(z){
shell(shQuote(paste0("tesseract ", z,".tif", " ",z)))
file.remove(paste0(z,".tif"))
})
Background
It sounds like you already solved your problem. Yay! I'm writing this answer because I encountered a very similar problem calling tesseract from R and wanted to share some of the workarounds I came up with in case anyone else stumbles across the post and needs further troubleshooting ideas.
In my case I was converting a batch of faxes (about 3,000 individual PDF files, most of them between 1 and 15 pages) to text. I used an apply function to store the text of each fax as a separate entry in a list (length = number of faxes ≈ 3,000). The list was then unlisted into a vector, that vector was combined with a vector of file names to make a data frame, and finally I wrote the data frame to a csv file (see below for the code I used).
The problem was I kept getting the same string of errors that you got:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Followed by this error: error in FUN(X[[i]], ...) : basic_string::_M_construct null not valid
What I think the problem is
What was weird for me was that I re-ran the code multiple times and it was always a different fax where the error occurred. It also seemed to occur more often when I was doing something else that used a lot of RAM or CPU (opening Microsoft Teams, etc.). I tried changing the DPI as suggested in the first answer and that didn't seem to help.
It was also noticeable that while this code was running I was regularly using close to 100% of RAM and 50% of CPU (based on the Windows Task Manager).
When I ran this process (on a similar batch of about 3,000 faxes) on a Linux machine with significantly more RAM and CPU, I never encountered this problem.
basic_string::_M_construct null not valid appears to be a C++ error. I'm not familiar with C++, but it sounds like a bit of a catch-all error indicating that something that should have been created wasn't.
Based on all that, I think the problem is that R runs out of memory, and in response the memory available to some of the underlying tesseract processes somehow gets throttled. That leaves too little memory to convert a PDF to a PNG and then extract the text, which is what throws these errors: a text blob doesn't get created where one is expected, leading to the final C++ error basic_string::_M_construct null not valid. It's possible that lowering the DPI is what gave your process enough memory to complete, but maybe the fundamental underlying problem was the memory, not the DPI.
Possible workarounds
So, I'm not sure about any of what I just said, but running with that assumption, here are some ideas I came up with for people running the tesseract package in R who encounter similar problems:
Switch from RStudio to Rgui: this alone solved my problem. I was able to complete the whole 3,000-fax process without any errors using Rgui. Rgui also used 100-400 MB instead of the 1,000+ MB that RStudio used, and about 25% of CPU instead of 50%. Putting R on the path and running it from the console, or running R in the background, might reduce memory use even further.
Close any memory-intensive processes while the code is running: Microsoft Teams, videoconferencing, streaming, Docker on Windows and the Windows Subsystem for Linux are all huge memory hogs.
Lower the DPI: as suggested by the first answer, this would probably also reduce memory use.
Break the process up: running the faxes in batches of about 500 would probably also reduce the amount of working memory R has to hold before writing to file (see the sketch below).
These are all quick and easy solutions that can be done from R without having to learn C++ or upgrade hardware. A more durable solution would probably require more customization of the tesseract parameters, implementing the process in C++, changing memory-allocation settings for R and the operating system, or buying more RAM.
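As a rough sketch of the batching idea (using the ocr2() function and fax_file_paths defined in the example code below), you could process roughly 500 faxes at a time and write each batch to its own csv so partial results survive a crash:
batch_size <- 500
batches <- split(fax_file_paths, ceiling(seq_along(fax_file_paths) / batch_size))
for (b in seq_along(batches)) {
  paths <- batches[[b]]
  # OCR this batch only, then write it out before moving on
  texts <- vapply(paste0("./raw_data/", paths), ocr2, character(1))
  write.csv(data.frame(file_name = paths, file_text = texts),
            file = sprintf("./finished_data/faxes_batch_%02d.csv", b),
            row.names = FALSE)
}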
Example Code
# Load Libraries
library(tesseract)
dir.create("finished_data")
# Define Functions
ocr2 <- function(pdf_path){
# tell tesseract which language to guess
eng <- tesseract("eng")
#convert to png first
#pngfile <- pdftools::pdf_convert(pdf_path, dpi = 300)
# tell tesseract to convert the pdf at pdf_path
seperated_pages <- tesseract::ocr(pdf_path, engine = eng)
#combine all the pages into one page
combined_pages <- paste(seperated_pages, collapse = "**new page**")
# I delete png files as I go to avoid overfilling the hard drive
# because work computer has no hard drive space :'(
png_file_paths <- list.files(pattern = "png$")
file.remove(png_file_paths)
combined_pages
}
# find pdf_paths
fax_file_paths <- list.files(path="./raw_data",
pattern = "pdf$",
recursive = TRUE)
#this converts all the pdfs to text using the ocr
faxes <- lapply(paste0("./raw_data/",fax_file_paths),
ocr2)
fax_table <- data.frame(file_name= fax_file_paths, file_text= unlist(faxes))
write.csv(fax_table, file = paste0("./finished_data/faxes_",format(Sys.Date(),"%b-%d-%Y"), "_test.csv"),row.names = FALSE)
I'm trying to download data from Yahoo using this code:
library(quantmod)
getSymbols("WOW", auto.assign=F)
This has worked for me on every occasion in the past, but now, five days before my group assignment is due, I receive this error:
Error in download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, : cannot download all files
In addition: Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, :
URL 'https://ichart.finance.yahoo.com/table.csv?
s=WOW&a=0&b=01&c=2007&d=4&e=17&f=2017&g=d&q=q&y=0&z=WOW&x=.csv': status was
'502 Bad Gateway'
The price-history CSV URLs appear to have changed.
Old:
https://chart.finance.yahoo.com/table.csv?s=AAPL&a=2&b=17&c=2017&d=3&e=17&f=2017&g=d&ignore=.csv
New:
https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1492438581&period2=1495030581&interval=1d&events=history&crumb=XXXXXXX
The new version appends a "crumb" field which appears to reflect cookie information in the user's browser. It seems they are intentionally blocking automated downloads of price histories and forcing queries to provide information that validates cookies in a web browser.
The fix is detailed at https://github.com/joshuaulrich/quantmod/issues/157
Essentially:
remotes::install_github("joshuaulrich/quantmod", ref="157_yahoo_502")
# or
devtools::install_github("joshuaulrich/quantmod", ref="157_yahoo_502")
Version 0.4-9 of quantmod fixes this issue, and is now available on CRAN.
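So the simplest route now is to update from CRAN and retry the original call, for example:
install.packages("quantmod")   # 0.4-9 or later
library(quantmod)
getSymbols("WOW", auto.assign = FALSE)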
I've always wondered why Yahoo was so nice as to provide data downloads, and how screwed I would be if they stopped doing it. Fortunately, help is on the way courtesy of Joshua Ulrich.
Superfluous as it now may be, I coded a fix that shows one approach to getting around the download problem.
library(xts)
getSymbols.yahoo.fix <- function (symbol,
from = "2007-01-01",
to = Sys.Date(),
period = c("daily","weekly","monthly"),
envir = globalenv(),
crumb = "YourCrumb",
DLdir = "~/Downloads/") { #1
# build yahoo query
query1 <- paste("https://query1.finance.yahoo.com/v7/finance/download/",symbol,"?",sep="")
fromPosix <- as.numeric(as.POSIXlt(from))
toPosix <- as.numeric(as.POSIXlt(to))
query2 <- paste("period1=", fromPosix, "&period2=", toPosix, sep = "")
interval <- switch(period[1], daily = "1d", weekly = "1wk", monthly = "1mo")
query3 <- paste("&interval=", interval, "&events=history&crumb=", crumb, sep = "")
yahooURL <- paste(query1, query2, query3, sep = "")
#' requires browser to be open
utils::browseURL("https://www.google.com")
#' run the query - downloads the security as a csv file
#' DLdir defaults to download directory in browser preferences
utils::browseURL(yahooURL)
#' wait 500 msec for download to complete - mileage may vary
Sys.sleep(time = 0.5)
yahooCSV <- paste(DLdir, symbol, ".csv", sep = "")
yahooDF <- utils::read.csv(yahooCSV, header = TRUE)
#' -------
#' if you get: Error in file(file, "rt") : cannot open the connection
#' it's because the csv file has not completed downloading
#' try increasing the time for Sys.sleep(time = x)
#' -------
#' delete the csv file
file.remove(yahooCSV)
# convert date as character to date format
yahooDF$Date <- as.Date(yahooDF$Date)
# convert to xts
yahoo.xts <- xts(yahooDF[,-1],order.by=yahooDF$Date)
# assign the xts file to the specified environment
# default is globalenv()
assign(symbol, yahoo.xts, envir = as.environment(envir))
print(symbol)
} #1
It works like this:
Go to https://finance.yahoo.com/quote/AAPL/history?p=AAPL
Right click on "download data" and copy the link
Copy the crumb after "&crumb=" and use it in the function call
Set DLdir to the default download directory in your browser preferences
Set envir = as.environment("yourEnvir") - defaults to globalenv()
After downloading, the csv file is removed from your download directory to avoid clutter
Note that this will leave an "untitled" window open in the browser
As a simple test: getSymbols.yahoo.fix("AAPL")
You can also use getSymbols.yahoo.fix with lapply to get a list of asset data
from <- "2016-04-01"
to <- Sys.Date()
period <- "daily"
envir <- globalenv()
crumb <- "yourCrumb"
DLdir <- "~/Downloads/"
assetList <- c("AAPL", "ADBE", "AMAT")
lapply(assetList, getSymbols.yahoo.fix, from = from, to = to, envir = globalenv(), crumb = crumb, DLdir = DLdir)
Coded in RStudio on Mac OS X 10.11 using Safari as my default browser. It also appears to work with Chrome, but you will need to use the cookie crumb for Chrome. I use a cookie blocker but had to whitelist finance.yahoo.com to retain the cookie across browser sessions.
getSymbols.yahoo.fix might be useful. quantmod::getSymbols, of necessity, has more code built in for options and exception handling. I'm coding for personal work, so I often lift the pieces of code I need from package functions. I haven't benchmarked getSymbols.yahoo.fix because, of course, I don't have a working version of getSymbols for comparison. Besides, I couldn't pass up the opportunity to post my first Stack Overflow answer.
I too am encountering this error. A user on the MrExcel forum (jonathanwang003) explains that the new URL uses Unix timestamps for dates. The updated VBA code would look something like this:
qurl = "https://query1.finance.yahoo.com/v7/finance/download/" & Symbol
qurl = qurl & "?period1=" & (StartDate - DateSerial(1970, 1, 1)) * 86400 & _
"&period2=" & (EndDate - DateSerial(1970, 1, 1)) * 86400 & _
"&interval=1d&events=history&crumb=" & **Crumb**
QueryQuote:
With Sheets(Symbol).QueryTables.Add(Connection:="URL;" & qurl, Destination:=Sheets(Symbol).Range("a1"))
.BackgroundQuery = True
.TablesOnlyFromHTML = False
.Refresh BackgroundQuery:=False
.SaveData = True
End With
The missing piece here is how to retrieve the "Crumb" field that contains cookie information from the browser. Anyone have any ideas? I found this post, which may help: https://www.mrexcel.com/forum/excel-questions/1001259-when-using-querytables-what-posttext-syntax-click-button-webpage.html (look at the last post by john_w).
Try Google. The CSV is just a little different (it does not have the adjusted price, and the date has a different format).
http://www.google.com/finance/historical?q=NASDAQ:ADBE&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv
http://www.google.com/finance/historical?q=BVMF:PETR4&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv
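A minimal sketch of reading one of those CSVs into R; the date format ("%d-%b-%y") is an assumption and may need adjusting:
gurl <- "http://www.google.com/finance/historical?q=NASDAQ:ADBE&startdate=Jan+01%2C+2009&enddate=Aug+2%2C+2012&output=csv"
adbe <- read.csv(gurl, stringsAsFactors = FALSE)
adbe[[1]] <- as.Date(adbe[[1]], format = "%d-%b-%y")  # first column assumed to hold the date
head(adbe)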
I have a question about downloading files. I know how to download a file using the download.file function. I need to download multiple files from a particular site, each file corresponding to a different date. I have a series of dates with which I can build the URL for each file. I know for a fact that for some particular dates the files are missing on the website, so my code stops at that point. I then have to manually reset the date index (increment it by 1) and re-run the code. Since I have to download more than 1,500 files, I was wondering if I can somehow capture the 'absence of the file' so that, instead of the code stopping, it continues with the next date in the array.
Below is the dput of a part of the date array:
dput(head(fnames,10))
c("20060102.trd", "20060103.trd", "20060104.trd", "20060105.trd",
"20060106.trd", "20060109.trd", "20060110.trd", "20060112.trd",
"20060113.trd", "20060116.trd")
This file has 1723 dates. Below is the code that I am using:
for (i in 1:length(fnames)){
file <- paste(substr(fnames[i],7,8), substr(fnames[i],5,6), substr(fnames[i],1,4), sep = "")
URL <- paste("http://xxxxx_",file,".zip",sep="")
download.file(URL, paste(file, "zip", sep = "."))
unzip(paste(file, "zip", sep = "."))}
The program works fine until it encounters a particular date for which the file is missing, and then it stops. Is there a way to capture this, print the missing file name (the variable 'file'), and move on to the next date in the array?
Please help.
I apologize that I have not shared the exact URL. If that makes it difficult to reproduce the issue, please let me know.
Edit: trying to incorporate @Paul's suggestion.
I worked on a smaller dataset.
dput(testnames) is
c("20120214.trd", "20120215.trd", "20120216.trd", "20120217.trd",
"20120221.trd")
I know that file corresponding to the date '20120216' is missing from the website. I altered my code to incorporate the tryCatch function. Below it is:
tryCatch({for (i in 1:length(testnames)){
file <- paste(substr(testnames[i],7,8), substr(testnames[i],5,6), substr(testnames[i],1,4), sep = "")
URL <- paste("http://xxxx_",file,".zip",sep="")
download.file(URL, paste(file, "zip", sep = "."))
unzip(paste(file, "zip", sep = "."))}
},
error = function(e) {cat(file, '\n')
i=i+1},
warning = function(w) {message('cannot unzip')
i=i+1}
)
It runs fine for the first two dates and, as expected, throws an error for the third one. I am facing two issues:
When I exclude the warning block, it gives me the missing file name, as coded in the error block. But when I include the warning block, it only issues the warning and somehow doesn't execute the error block. Why is that?
In either case, the code stops after reading "20120216.trd" and doesn't move on to the next file, which is what I actually want it to do. Is incrementing the variable i not sufficient for that?
Please advise.
You can do this using tryCatch. This function will try the operation you feed it and provide you with a way of dealing with errors. In your case, an error could simply lead to skipping the file and ignoring it. For example:
skip_with_message = simpleError('Did not work out')
tryCatch(print(bla), error = function(e) skip_with_message)
# <simpleError: Did not work out>
Notice that the error here is that the bla object does not exist.
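Applied to the download loop from the question, the tryCatch goes inside the for loop, so a missing file is just reported and the loop moves on to the next date (a sketch using the question's placeholder URL):
for (i in seq_along(fnames)) {
  file <- paste0(substr(fnames[i], 7, 8), substr(fnames[i], 5, 6), substr(fnames[i], 1, 4))
  URL  <- paste0("http://xxxxx_", file, ".zip")
  tryCatch({
    download.file(URL, paste(file, "zip", sep = "."))
    unzip(paste(file, "zip", sep = "."))
  },
  # either handler just reports the file and lets the loop continue
  error   = function(e) cat("skipping missing file:", file, "\n"),
  warning = function(w) cat("problem with:", file, "\n"))
}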