R - `try` in conjunction with capturing ALL console output?

Here's a piece of code I'm working with:
install.packages('BiocManager'); BiocManager::install('UniProt.ws')
requireNamespace('UniProt.ws')
uniprot_object <- UniProt.ws::UniProt.ws(
  UniProt.ws::availableUniprotSpecies(
    pattern = '^Homo sapiens$')$`taxon ID`)
query_results <- try(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')))
This particular key/keytype combination is non-productive and produces the following output:
Getting mapping data for BAA08084.1 ... and ACC
error while trying to retrieve data in chunk 1:
no lines available in input
continuing to try
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
Of the two errors reported ("error" and "Error"), only the second is a 'proper' R error object and, given the use of try, is accordingly captured in the variable query_results.
I am, however, desperate to capture the other error bit (no lines available in input) to inform downstream programmatic processes.
After playing with a plethora of options found by startpaging (googling), including capture.output, sink, and purrr::quietly, I still fail to capture that bit. How can I do that?
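For context, a minimal sketch (base R only, not part of the original question) of the distinction between the two condition types involved here:

r1 <- try(stop("a real error"), silent = TRUE)
class(r1)  # "try-error": errors raised by stop() are caught by try()

r2 <- try(message("just a message"), silent = TRUE)
class(r2)  # "NULL": messages pass through try() uncaught; message() returns invisible NULL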

As #Csd suggested, you could use tryCatch. The message that you are after is printed by the message() function in R, not stop(), so try() will ignore it. To capture output from message(), use code like this:
query_results <- tryCatch(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')),
  message = function(e) conditionMessage(e))
This will abort evaluation when it gets any message, and return the message in query_results. If you are doing more than debugging, you probably want the message saved, but evaluation to continue. In that case, use withCallingHandlers instead. For example,
saveMessages <- c()
query_results <- withCallingHandlers(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')),
  message = function(e)
    saveMessages <<- c(saveMessages, conditionMessage(e)))
When I run this version, query_results is unchanged (because the later error aborted execution), but the messages are saved:
saveMessages
[1] "Getting mapping data for BAA08084.1 ... and ACC\n"
[2] "error while trying to retrieve data in chunk 1:\n no lines available in input\ncontinuing to try\n"

Based on #user2554330's most excellent answer, I constructed an ugly thing that does exactly what I want:
- try to execute the statement
- don't fail fatally
- leave no ugly messages
- allow me access to errors and messages
So here it is, in all its despicable glory:
saveMessages <- c()
query_results <- suppressMessages(
  withCallingHandlers(
    try(
      UniProt.ws::select(
        x = uniprot_object,
        keys = 'BAA08084.1',
        keytype = 'EMBL/GENBANK/DDBJ',
        columns = c('ENSEMBL','UNIPROTKB')),
      silent = TRUE),
    message = function(e)
      saveMessages <<- c(saveMessages, conditionMessage(e))))
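To inform downstream processes, a possible follow-up check (a sketch, not from the original post; the string tested for is the message shown earlier):

# query_results is a "try-error" on the hard failure; saveMessages holds the rest
if (inherits(query_results, "try-error")) {
  retrieval_failed <- any(grepl("no lines available in input", saveMessages))
}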

Related

How to use Trycatch to skip errors in data downloading in R

I am trying to download data from the USGS website using the dataRetrieval package of R.
For that purpose, I have generated a function called getstreamflow in R that works fine when I run it, for example:
siteNumber <- c("094985005","09498501","09489500","09489499","09498502")
Streamflow = getstreamflow(siteNumber)
The output of the function is a list of data frames.
I could run the function when there was no issue downloading the data, but for some stations I got the following error:
Request failed [404]. Retrying in 1.1 seconds...
Request failed [404]. Retrying in 3.3 seconds...
For: https://waterservices.usgs.gov/nwis/site/?siteOutput=Expanded&format=rdb&site=0946666666
To prevent the function from stopping when it encounters an error, I am trying to use tryCatch, as in the following code:
Streamflow = tryCatch(
  expr = {
    getstreamflow(siteNumber)
  },
  error = function(e) {
    message(paste(siteNumber, " there was an error"))
  })
I want the function to skip the station and go to the next one when it encounters an error. Currently, the output I get is the one presented below, which is obviously wrong, because it says that there was an error for every station:
094985005 there was an error09498501 there was an error09489500 there was an error09489499 there was an error09498502 there was an error09511300 there was an error... <truncated>
What am I doing wrong in my use of tryCatch?
Answer
You wrote the tryCatch outside of getstreamflow. Hence, if one site fails, getstreamflow returns an error and nothing else. You should either supply one site at a time, or put the tryCatch inside getstreamflow.
Example
x <- 1:5
fun <- function(x) {
  for (i in x) if (i == 5) stop("ERROR")
  return(x^2)
}
tryCatch(fun(x), error = function(e) paste0("wrong", x))
This returns:
[1] "wrong1" "wrong2" "wrong3" "wrong4" "wrong5"
Multiple arguments
You indicated that you have both siteNumber and datatype to iterate over.
Using Map, we can define a function that takes two inputs:
Map(function(x, y) tryCatch(fun(x, y),
                            error = function(e) message(paste(x, " there was an error"))),
    x = siteNumber,
    y = datatype)
Using a for-loop, we can just iterate over them:
Streamflow <- vector(mode = "list", length = length(siteNumber))
for (i in seq_along(siteNumber)) {
  Streamflow[[i]] <- tryCatch(getstreamflow(siteNumber[i], datatype),
                              error = function(e) message(paste(siteNumber[i], " there was an error")))
}
Or, as suggested, just modify getstreamflow.
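For illustration, a hypothetical sketch of that last option. The real body of getstreamflow is not shown in the question, so dataRetrieval::readNWISdv with the streamflow parameter code is only a stand-in:

getstreamflow <- function(siteNumber) {
  lapply(siteNumber, function(site) {
    tryCatch(
      dataRetrieval::readNWISdv(site, parameterCd = "00060"),  # stand-in retrieval call
      error = function(e) {
        message(site, ": there was an error")
        NULL  # failed sites yield NULL entries and the loop keeps going
      }
    )
  })
}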

Meaning of this warning: "Warning message: In get(object, envir = currentEnv, inherits = TRUE) : restarting interrupted promise evaluation"

I have written a function in R which extracts data from a database and builds a new table.
My new table is labeled with the date of the extract (build_date_0).
When I'm debugging my function I get the following warning when I look at my date string:
Browse[2]> build_date_0
[1] "2019-05-01"
Warning message:
In get(object, envir = currentEnv, inherits = TRUE) :
restarting interrupted promise evaluation
Questions:
What does this warning mean / what is happening (step-by-step/basics)?
Should I care?
In general how can I find out more about this error?
This is my code:
build_account_db = function(conn = connection_object
                            ,various_inputs = 24){
  browser()
  # create connection objects
  tabs_1 = dplyr::tbl(conn, in_schema("DB_1","VIEW_W")) # some table
  # create date string
  build_date_0 = lubridate::today() %>% as.character()
  build_date = str_replace_all(build_date_0,"-+","_")
  db_name_1 = paste0('DATABASE.tab_1_', build_date)
  db_name_2 = paste0('DATABASE.tab_2_', build_date)
  # build query
  query_text_1 = tabs_1 %>% select(COL_1) # some query
  query_text_2 = tabs_1 %>% select(COL_2)
  # build new tables
  create_db  = DBI::dbSendQuery(conn, paste('CREATE TABLE', db_name_1, 'AS (', query_text_1, ') WITH DATA PRIMARY INDEX (ID_1)'))
  create_db2 = DBI::dbSendQuery(conn, paste('CREATE TABLE', db_name_2, 'AS (', query_text_2, ') WITH DATA PRIMARY INDEX (ID_1)'))
}
When I check a variable, I may or may not get this warning (it varies, even if I restart R and run my code again with a cleared environment):
Browse[2]> build_date
[1] "2019-02-28 11:00:00 AEDT"
Warning message:
In get(object, envir = currentEnv, inherits = TRUE) :
restarting interrupted promise evaluation
What I've tried: I read this question, but it's more about suppressing the error. Also, google.
I found this link on promises and lazy evaluation in R helpful for a related problem: https://mailund.dk/posts/promises-and-lazy-evaluation/. I wonder whether adding a bare reference to build_date_0 right after build_date_0 = lubridate::today() %>% as.character() would force the promise and make the warning go away? Good luck!
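A minimal sketch of that suggestion (an untested guess, as the answer itself notes), keeping the function's own assignment style:

build_date_0 = lubridate::today() %>% as.character()
build_date_0  # a bare reference (or force(build_date_0)) evaluates the value
              # right away, so there is no half-evaluated promise left for
              # browser() to restart when you inspect it later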

R - Trycatch is saving warning instead of returning function output

I am trying to download records from twitter using rtweet. One issue with this is that the twitter server needs to wait 15 minutes every 18000 records. So, after record number 18000, I receive a data frame with all the records and a nice warning telling me to wait for a bit. search_tweets has a function argument called retryonratelimit to download more than 18000 records. However, this isn't working, so I am exploring other options.
I have produced a function, incorporating tryCatch, to address this. However, when the warning pops up at 18000 records, tryCatch saves the warning rather than the data frame that should be spit out before the warning, something it would not do if 17999 records were downloaded.
library(rtweet)
library(RDCOMClient)
library(profvis)

TwitScrape = function(SearchTerm){
  ReturnDF = tryCatch({
    TempList = NULL
    Temp = search_tweets(SearchTerm, n = 18000)
    TempList = list(as.data.frame(Temp), SearchTerm)
    return(TempList)
  },
  warning = function(TempList){
    Comb = NULL
    MAXID = min(TempList[[1]]$status_id)
    message("Delay for 15 minutes to accommodate server download limits")
    pause(901)
    TempWarn = search_tweets(TempList[[2]], n = 18000, max_id = MAXID)
    TempWarn = as.data.frame(TempWarn)
    Comb = rbind(TempList[[1]], TempWarn)
    CombList = list(Comb, TempList[[2]])
    return(CombList)
  }
  )
}
Searches = c("#MUFC", "#LFC", "#MCFC")
TestExpandList = NULL
TestExpand = NULL
TestExpand2 = NULL
for (i in seq_along(Searches)){
  TestExpandList = TwitScrape(SearchTerm = Searches[i])
  TestExpand = TestExpandList[[1]]
  TestExpand$Cat = Searches[i]
  TestExpand$DownloadDate = Sys.Date()
  TestExpand2 = rbind(TestExpand2, TestExpand)
}
I hope this makes sense. If I can offer any more information please let me know. In summary, why is tryCatch saving my warning rather than the data frame I want?
I am not 100% sure what you would like to achieve, but it seems you are using tryCatch with a mistaken understanding of how it works.
The argument of the warning handler, warning = function(TempList), is the warning condition itself; you have named it TempList, but that doesn't mean it becomes your TempList variable: the handler is simply passed the warning.
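A quick demonstration (a minimal sketch): whatever name you give the handler argument, it receives the warning condition object, not your variable:

tryCatch(warning("rate limit hit"), warning = function(TempList) class(TempList))
# [1] "simpleWarning" "warning"       "condition"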
Your function TwitScrape returns ReturnDF implicitly (as the value of its last expression), since you are not explicitly returning anything; I guess that is still what you want and is OK.
I would try to re-structure your solution without tryCatch.
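One possible restructuring (a sketch only, untested against the twitter API): withCallingHandlers lets search_tweets return its data frame while the warning is merely recorded, instead of the handler's return value replacing the result:

TwitScrape <- function(SearchTerm) {
  warn_log <- character(0)
  Temp <- withCallingHandlers(
    search_tweets(SearchTerm, n = 18000),
    warning = function(w) {
      warn_log <<- c(warn_log, conditionMessage(w))
      invokeRestart("muffleWarning")  # suppress the warning but keep the result
    }
  )
  list(as.data.frame(Temp), SearchTerm, warnings = warn_log)
}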
Thanks for your comments. RolandASc, you were right. I went back to the drawing board. See the working TwitScrape function below:
TwitScrape = function(SearchTerm){
  DF = NULL
  DF = search_tweets(SearchTerm, n = 18001)
  Warn = warnings()
  if (names(Warn[1]) == "Rate limit exceeded - 88"){
    message("paused")
    pause(910)
    DF2 = search_tweets(SearchTerm, n = 18000, max_id = min(DF$status_id))
    DF3 = rbind(DF, DF2)
    return(DF3)
  }
  else {
    return(DF)
  }
}

R - Parallel Processing and ldply error

I am trying to use the code below to make API calls in a parallel process to speed them up. (I know this isn't the best way to speed up API calls, but it works.)
It only fails when I try to use parallel processing; otherwise it works. In the ldply function I am getting the error below:
Error in do.ply(i) :
task 1 failed - "object of type 'closure' is not subsettable"
In addition:
Warning messages:
1: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
Any help would be appreciated!
library(httr)
library(jsonlite)
library(dplyr)
library(plyr)
library(doSNOW)

One <- 26
cl <- makeCluster(4)
registerDoSNOW(cl)
func.time <- Sys.time()

## API CALL ONE FOR "kline"
url <- "https://api.binance.com"
path <- paste("/api/v1/klines?symbol=", pairs[1], "&interval=1m&limit=1", sep = "")
raw.results <- GET(url = url, path = path)
text_content <- content(raw.results, as = "text", encoding = "UTF-8")
kline <- data.frame(text_content %>% fromJSON())
kline$symbol <- pairs[1]

## API FUNCTION TO BE APPLIED FOR REST
loopfunction <- function(i){
  url <- "https://api.binance.com"
  path <- paste("/api/v1/klines?symbol=", pairs[i], "&interval=1m&limit=1", sep = "")
  raw.results <- GET(url = url, path = path)
  text_content <- content(raw.results, as = "text", encoding = "UTF-8")
  kline_temp <- data.frame(text_content %>% fromJSON())
  kline_temp$symbol <- pairs[i]
  kline <- rbind(kline, kline_temp)
  return(kline)
}

## DPLY PARALLEL FUNCTION
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, .parallel = T,
                           .paropts = c("httr", "jsonlite", "dplyr"))) ## "One" is a variable created earlier
stopCluster(cl)

func.end.time <- Sys.time()
func.tot.time <- func.end.time - func.time
Your question isn't fully reproducible, so the following is an educated guess.
Your loopfunction() references an object called pairs. It seems from your script that a variable called pairs is defined somewhere in your local environment. However, when loopfunction() is passed to ldply(), it no longer has access to that variable (ordinarily it would, but parallelization requires fresh R environments to be created). Having failed to find an object called pairs in the environment, R continues searching and finds a match in graphics::pairs(). This is a plotting function, not a subsettable object like a vector or data frame. Hence the error message, "object of type 'closure' is not subsettable".
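You can reproduce that error message directly by subsetting any function ("closure"):

pairs[1]
# Error in pairs[1] : object of type 'closure' is not subsettable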
I'm not especially familiar with how ldply implements parallel processing, but you could probably modify your function definition like this:
loopfunction <- function(i, pairs) {
  ...[body of function]...
}
And pass pairs as an extra parameter in your ldply call:
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, pairs = pairs,
                           .parallel = T,
                           .paropts = list(.packages = c("httr", "jsonlite", "dplyr"))))

slurm_apply a RefClass method from within a RefClass method

EDIT: New version of rslurm makes the solution very easy. See my answer below.
Apologies for the somewhat longer-than-desired MWE, and for a title that, I realize after submitting the question, may be needlessly complicated. I believe the real issue is getting the environment of a RefClass object into rslurm::slurm_apply.
MWE
Here I define a toy reference class called BankAccount. It has two fields and two methods.
The fields are transactions, a list of all transactions associated with the account, and suspicion_threshold, the value above which the bank will investigate the transaction.
The two methods are is_suspicious, which compares the transactions with the suspicion_threshold on the local machine, and is_suspicious_slurm, which uses rslurm::slurm_apply to spread many calls to is_suspicious over a cluster of computers managed by SLURM. You can imagine this might be necessary if there were many transactions or if the is_suspicious function were more complex.
So, here's the setup
BankAccount <- setRefClass(
  Class = 'BankAccount',
  fields = list(
    transactions = 'numeric',
    suspicion_threshold = 'numeric'
  )
)

BankAccount$methods(
  is_suspicious = function(start_idx = 1, stop_idx = length(transactions)) {
    return(start_idx + which(transactions[start_idx:stop_idx] > suspicion_threshold) - 1)
  }
)

BankAccount$methods(
  is_suspicious_slurm = function(num_nodes) {
    usingMethods(is_suspicious)
    t <- length(transactions)
    t_per_n <- floor(t/num_nodes)
    starts <- seq(from = 1, length.out = num_nodes, by = t_per_n)
    stops <- seq(from = t_per_n, length.out = num_nodes, by = t_per_n)
    stops[num_nodes] <- t
    sjob <- rslurm::slurm_apply(f = is_suspicious,
                                params = data.frame(start_idx = starts,
                                                    stop_idx = stops),
                                nodes = num_nodes,
                                add_objects = .self)
    results_list <- rslurm::get_slurm_out(slr_job = sjob,
                                          outtype = "raw",
                                          wait = TRUE)
    return(unlist(results_list))
  }
)
Now, on my local machine I can run:
library(RCexampleforSE)
set.seed(27599)
b <- BankAccount$new()
b$transactions <- rnorm(n = 500)
b$suspicion_threshold <- 2
b$is_suspicious()
and it works as expected:
62 103 155 171 182 188 297 398 493 499
If I run:
b$is_suspicious_slurm(num_nodes = 3)
I get an error, since my personal computer is not connected to a SLURM cluster.
sh: squeue: command not found
Cannot submit; no SLURM workload manager on path
Submission scripts output in directory _rslurm_13ba46e3c70b0
Error in rslurm::get_slurm_out(slr_job = sjob, outtype = "raw", wait = TRUE):
slr_job has not been submitted
If I log on to my university cluster, which uses SLURM, and run the same script, the setup and local methods work just as they did on my personal computer. When I run:
b$is_suspicious_slurm(num_nodes = 3)
it sends jobs to the cluster, as hoped for:
Submitted batch job 6363868
But these jobs error immediately with the following error message in slurm_0.out, slurm_1.out, and slurm_2.out:
Error in attr(, "mayCall") : argument 1 is empty
Execution halted
Thoughts and Attempts
I figure the job probably needs the BankAccount object but doesn't have it available. So I tried passing it in as the add_objects parameter to rslurm::slurm_apply:
sjob <- rslurm::slurm_apply(f = is_suspicious,
                            params = data.frame(start_idx = starts,
                                                stop_idx = stops),
                            nodes = num_nodes,
                            add_objects = .self)
I also tried it in quotes and inside eval(), neither of which worked.
How can I make the object accessible to the worker jobs created with rslurm::slurm_apply?
Version 0.4.0 of rslurm completely solved this problem.
Define is_suspicious_slurm() as:
BankAccount$methods(
  is_suspicious_slurm = function(num_nodes) {
    usingMethods(is_suspicious)
    t <- length(transactions)
    t_per_n <- floor(t/num_nodes)
    starts <- seq(from = 1, length.out = num_nodes, by = t_per_n)
    stops <- seq(from = t_per_n, length.out = num_nodes, by = t_per_n)
    stops[num_nodes] <- t
    sjob <- rslurm::slurm_apply(f = is_suspicious,
                                params = data.frame(start_idx = starts,
                                                    stop_idx = stops),
                                nodes = num_nodes)
    results_list <- rslurm::get_slurm_out(slr_job = sjob,
                                          outtype = "raw",
                                          wait = TRUE)
    return(unlist(results_list))
  }
)
The only change is that the add_objects parameter is no longer specified in the call to rslurm::slurm_apply. It does not need to be specified because, as #Ian pointed out:
"...you don't need to pass self at all when slurm_apply sends the serialized function, which appears to include both ".self" and "transactions" in the enclosing environment."
EDIT: OP's answer is all you need to know.
The add_objects parameter is used for passing a character vector, not the objects themselves. All the objects are then saved in one RData file, assuming they can be found by name. In theory, you should be able to use add_objects = c('.self') within your method definition.
The key here is, "assuming they can be found". I will edit this post once a pending update to the rslurm package (which should make that finding more successful) is released.
Be very careful passing objects to cluster nodes: they do not come back. Not only will any side effects be lost, there's no inter-node communication implemented by rslurm.
Also be careful with which :) Your is_suspicious method will be wrong for arguments that don't start at 1. Try this version:
BankAccount$methods(
  is_suspicious = function(i = 1:length(transactions)) {
    idx <- which(transactions[i] > suspicion_threshold)
    i[idx]
  }
)
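A quick check of this version, using the example object built in the question (the expected indices follow from the output listed above):

b$is_suspicious(i = 100:200)
# [1] 103 155 171 182 188  (absolute indices, even though i does not start at 1)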
