Timeout while reading csv file from url in R - r

I currently have a script in R that loops around 2000 times (for loop), and on each loop it queries data from a database using a url link and the read.csv function to put the data into a variable.
My problem is: when I query low amounts of data (around 10000 rows) it takes around 12 seconds per loop and its fine. But now I need to query around 50000 rows of data per loop and the query time increases quite a lot, to 50 seconds or so per loop. And this is fine for me but sometimes I notice it takes longer for the server to send the data (≈75-90 seconds) and APPARENTLY the connection times out and I get these errors:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open: HTTP status was '0 (nil)'
or this one:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : InternetOpenUrl failed: 'The operation timed
out'
I don't get the same warning every time, it changes between those two.
Now, what I want is to avoid my program to stop when this happens, or to simply prevent this timeout error and tell R to wait more time for the data. I have tried these settings at the start of my script as a possible solution but it keeps happening.
options(timeout=190)
setInternet2(use=NA)
setInternet2(use=FALSE)
setInternet2(use=NA)
Any other suggestions or workarounds? Maybe to skip to the next loop when this happens and store in a variable the loop number of the times this error occurred so it can be queried again in the end but only for those i's in the loop that were skipped due to the connection error? The ideal solution would be, of course, to avoid having this error.

A solution using the RCurl package:
You can change the timeout option using
curlSetOpt(timeout = 200)
or by passing it into the call to getURL
getURL(url_vect[i], timeout = 200)
A solution using base R:
Simply download each file using download.file, and then worry about manipulating those file later.

I see this is an older post, but it still comes up early in the list of Google results, so...
If you are downloading via WinInet (rather than curl, internal, wget, etc.) options, including timeout, are inherited from the system. Thus, you cannot set the timeout in R. You must change the Internet Explorer settings. See Microsoft references for details:
https://support.microsoft.com/en-us/kb/181050
https://support.microsoft.com/en-us/kb/193625

This is partial code that I show you, but you can modify to you're needs :
# connect to website
withRestarts(
tryCatch(
webpage <- getURL(url_vect[i]),
finally = print(" Succes.")
),
abort = function(){},
error = function(e) {
i<-i+1
}
)
In my case the url_vect[i] was one of the url's I copied information. This will increase the time you need to wait for the program to finish sadly.
UPDATED
tryCatch how to example

Related

RODBC connection issue

I am trying to use RODBC to connect to an access database. I have used the same structure several times in this project with success. However, in this instance it is now failing and I cannot figure out why. The code is not really reprex as I can't provide the DB, but...
This works for a single table:
library(magrittr);library(RODBC)
#xWalk_path is simply the path to the accdb
#xtabs generated by querying the available tables
x=1
tab=xtabs$TABLE_NAME[x]
temp<-RODBC::odbcConnectAccess2007(xWalk_path)%>%
RODBC::sqlFetch(., tab, stringsAsFactors = FALSE)
odbcCloseAll()
#that worked perfectly
However, I really want to use this in a a function so I can read several similar tables into a list. As a function it does not work:
xWalk_ls<- lapply(seq_along(xtabs$TABLE_NAME), function(x, xWalk_path=xWalk_path, tab=xtabs$TABLE_NAME[x]){
#print(tab) #debug code
temp<-RODBC::odbcConnectAccess2007(xWalk_path)%>%
RODBC::sqlFetch(., tab, stringsAsFactors = FALSE)
return(temp)
odbcCloseAll()
})
#error every time
The above code will return the error:
Warning in odbcDriverConnect(con, ...) :
[RODBC] ERROR: Could not SQLDriverConnect
Warning in odbcDriverConnect(con, ...) : ODBC connection failed
Error in RODBC::sqlFetch(., tab, stringsAsFactors = FALSE) :
first argument is not an open RODBC channel
I am baffled. I accessed the db to pull table names and generate the xtabs variable using sql Tables. Also, earlier in my code I used a similar code structure (not identical, but same core: sqlFetch to retrieve a table into a list) nd it worked without a problem. Only difference between then and now is that: Then I was opening and closing different .accdb files, but pulling the same table name from each. Now, I am opening and closing the same .accdb file but pulling different sheet names each time.
Am I somehow opening and closing this too fast and it is getting irritated with me? That seems unlikely, because if I force it to print(tab) as the first line of the function it will only print the first table name. If it was getting annoyed about the speed of opening an closing I would expect it to print 2 table names before throwing the error.
return returns its argument and exits, so the remaining code (odbcCloseAll()) won't be executed and the opened file (AccessDB) remains locked as you supposed.

How do I close unused connections after read_html in R

I am quite new to R and am trying to access some information on the internet, but am having problems with connections that don't seem to be closing. I would really appreciate it if someone here could give me some advice...
Originally I wanted to use the WebChem package, which theoretically delivers everything I want, but when some of the output data is missing from the webpage, WebChem doesn't return any data from that page. To get around this, I have taken most of the code from the package but altered it slightly to fit my needs. This worked fine, for about the first 150 usages, but now, although I have changed nothing, when I use the command read_html, I get the warning message " closing unused connection 4 (http:....." Although this is only a warning message, read_html doesn't return anything after this warning is generated.
I have written a simplified code, given below. This has the same problem
Closing R completely (or even rebooting my PC) doesn't seem to make a difference - the warning message now appears the second time I use the code. I can run the querys one at a time, outside of the loop with no problems, but as soon as I try to use the loop, the error occurs again on the 2nd iteration.
I have tried to vectorise the code, and again it returned the same error message.
I tried showConnections(all=TRUE), but only got connections 0-2 for stdin, stdout, stderr.
I have tried searching for ways to close the html connection, but I can't define the url as a con, and close(qurl) and close(ttt) also don't work. (Return errors of no applicable method for 'close' applied to an object of class "character and no applicable method for 'close' applied to an object of class "c('xml_document', 'xml_node')", repectively)
Does anybody know a way to close these connections so that they don't break my routine? Any suggestions would be very welcome. Thanks!
PS: I am using R version 3.3.0 with RStudio Version 0.99.902.
CasNrs <- c("630-08-0","463-49-0","194-59-2","86-74-8","148-79-8")
tit = character()
for (i in 1:length(CasNrs)){
CurrCasNr <- as.character(CasNrs[i])
baseurl <- 'http://chem.sis.nlm.nih.gov/chemidplus/rn/'
qurl <- paste0(baseurl, CurrCasNr, '?DT_START_ROW=0&DT_ROWS_PER_PAGE=50')
ttt <- try(read_html(qurl), silent = TRUE)
tit[i] <- xml_text(xml_find_all(ttt, "//head/title"))
}
After researching the topic I came up with the following solution:
url <- "https://website_example.com"
url = url(url, "rb")
html <- read_html(url)
close(url)
# + Whatever you wanna do with the html since it's already saved!
I haven't found a good answer for this problem. The best work-around that I came up with is to include the function below, with Secs = 3 or 4. I still don't know why the problem occurs or how to stop it without building in a large delay.
CatchupPause <- function(Secs){
Sys.sleep(Secs) #pause to let connection work
closeAllConnections()
gc()
}
I found this post as I was running into the same problems when I tried to scrape multiple datasets in the same script. The script would get progressively slower and I feel it was due to the connections. Here is a simple loop that closes out all of the connections.
for (i in seq_along(df$URLs)){function(i)
closeAllConnections(i)
}

parallel r, multiple errors

I am running a foreach loop and am not sure how to interpret some strange behaviour. The functions are tested individually, the loop roughly looks like this:
cl<-makeCluster(2, outfile = "")
registerDoParallel(cl)
outputs <- foreach(k = 1:x, .packages = "various packages") %dopar% {
do things #contains a for-loop
return(stuff for output)
}
stopCluster(c1)
I get these errors
Error in unserialize(socklist[[n]]) : error reading from connection
In addition: Warning message:
closing unused connection 6 (<-computer:11531)
and
Error in serialize(data, node$con) : error writing to connection
The red dot in rstudio disappears and the console cursor is active again, suggesting it has quit the program.
Now, I know these are pretty non-specific errors, but the really weird thing is that the cluster still runs after the errors. The processes spawn, they take up the expected amount of CPU and memory for as long as I would expect the job to take, but I get no return data.
I thought maybe the weird behaviour might be useful to pin down the problem.

closing unused RODBC handle

I have been receiving a Warning Message:
`historicalHourly <- importHistoricalHourly(startDatePast,endDatePast,Markets,location)
[1] "Importing Hourly Data"
[1] "Flag - Moving from importHistoricalHourly to CleaningUpHourly"
[1] "Flag - Moving to importHistoricalDaily from CleaningUpHourly"Warning messages:
1: closing unused RODBC handle 41
2: closing unused RODBC handle 40
3: closing unused RODBC handle 36`
In the function, everything checks out as far as return values, print statements.
I have an idea that it is definitely a warning due to this function:
hHourly.df <- retrievelim(PowerCodeID,columns,startDatePast,endDatePast,unitstr="Hours")
which is accessing a separate database in another program. This function is returning a dataframe of dateTime Values by the hour with different numeric values in the next column
If anyone could give me an idea about why it is closing the database and what is happening, I would greatly appreciate it.
It's because that function contains odbcConnect(...) without odbcClose(...) as joran suggests. Since the odbcConnect object is created within the function, it is pending deletion the next time there's a garbage collection (?gc). Sometimes that happens when you call the function, sometimes it happens later.
When an odbcConnect object gets deleted by gc(), it closes the database connection and displays a message. Nothing to worry about.

Using R package BerkeleyEarth

I'm working for the first time with the R package BerkeleyEarth, and attempting to use its convenience functions to access the BEST data. I think maybe it's just a problem with their servers (a matter I've separately addressed to the package's maintainer) but I wanted to know if it's instead something silly I'm doing.
To reproduce my fault
library(BerkeleyEarth)
downloadBerkeley()
which provides the following error message
trying URL 'http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip'
Error in download.file(urls$Url[thisUrl], destfile = file.path(destDir, :
cannot open URL 'http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip'
In addition: Warning message:
In download.file(urls$Url[thisUrl], destfile = file.path(destDir, :
InternetOpenUrl failed: 'A connection with the server could not be established'
Has anyone had a better experience using this package?
The error message is pointing to a different URL than one should get judging what URLs are listed at http://berkeleyearth.org/data/ that point to the zip formatted files. There are another set of .nc files that appear to be more recent. I would replace the entries in the BerkeleyUrls dataframe with the ones that match your analysis strategy:
This is the current URL that should be in position 1,1:
http://berkeleyearth.lbl.gov/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip
And this is the one that is in the package dataframe:
> BerkeleyUrls[1,1]
[1] "http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip"
I suppose you could try:
BerkeleyUrls[, 1] <- sub( "download\\.berkeleyearth\\.org", "berkeleyearth.lbl.gov", BerkeleyUrls[, 1])

Resources