R - Parallel Processing and ldply error

R - Parallel Processing and ldply error - r

I am trying to use the below code to make API calls in a parallel process to speed up the API calls. (I know this isn't the best way to speed up API calls but it works)
It only fails when I try to use parallel, otherwise it works. In the ldply function I am getting the below error:
Error in do.ply(i) :
task 1 failed - "object of type 'closure' is not subsettable"
In addition:
Warning messages:
1: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
any help would be appreciated!
One <- 26
cl<-makeCluster(4)
registerDoSNOW(cl)
func.time <- Sys.time()
## API CALL ONE FOR "kline"
url <- "https://api.binance.com"
path <- paste("/api/v1/klines?symbol=",pairs[1],"&interval=1m&limit=1", sep = "")
raw.results <- GET(url = url, path = path)
text_content <- content(raw.results, as = "text", encoding = "UTF-8")
kline <- data.frame(text_content %>% fromJSON())
kline$symbol <- pairs[1]
## API FUNCTION TO BE APPLIED FOR REST
loopfunction <- function(i){
url <- "https://api.binance.com"
path <- paste("/api/v1/klines?symbol=",pairs[i],"&interval=1m&limit=1", sep = "")
raw.results <- GET(url = url, path = path)
text_content <- content(raw.results, as = "text", encoding = "UTF-8")
kline_temp <- data.frame(text_content %>% fromJSON())
kline_temp$symbol <- pairs[i]
kline <- rbind(kline,kline_temp)
return(kline)
}
## DPLY PARALLEL FUNCTION
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, .parallel = T, .paropts = c("httr", "jsonlite", "dplyr"))) ##"ONE" is a list varriable created earlier
stopCluster(cl)
func.end.time <- Sys.time()
func.tot.time <- func.end.time - func.time

Your question isn't fully reproducible, so the following is an educated guess.
Your loopfunction() references an object called pairs. It seems from your script that a variable called pairs is defined somewhere in your local environment. However, when loopfunction() is passed to ldply(), it no longer has access to that variable (ordinarily, it would, but parallelization requires fresh R environments to be created). Having failed to find an object called pairs in the environment, R continues searching, and finds a match in stats::pairs(). This is a plotting function, not a subsettable object like a vector or data frame. Hence the error message, "object of type 'closure' is not subsettable".
I'm not especially familiar with how ldply implements parallel processing, but you could probably modify your function definition like this:
loopfunction <- function(i, pairs) {
...[body of function]...
}
And pass pairs as an extra parameter in your ldply call:
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, pairs = pairs, .parallel = T, .paropts = list(.packages = c("httr", "jsonlite", "dplyr"))))

Related

How to use Trycatch to skip errors in data downloading in R

I am trying to download data from the USGS website using the dataRetrieval package of R.
For that purpose, I have generated a function called getstreamflow in R that works fine when I ran for example.
siteNumber <- c("094985005","09498501","09489500","09489499","09498502")
Streamflow = getstreamflow(siteNumber)
The output of the function is a list of data frames
I could run the function when there is no issue downloading the data, but for some stations, I got the following error:
Request failed [404]. Retrying in 1.1 seconds...
Request failed [404]. Retrying in 3.3 seconds...
For: https://waterservices.usgs.gov/nwis/site/?siteOutput=Expanded&format=rdb&site=0946666666
To avoid that the function stops when encounters an error, I am trying to use tryCatch as in the following code:
Streamflow = tryCatch(
expr = {
getstreamflow(siteNumber)
},
error = function(e) {
message(paste(siteNumber," there was an error"))
})
I want the function to skip the station and go to the next when encountering an error. Currently, the output I got is the one presented below, that obviously is wrong, because it says that for all the stations there was an error:
094985005 there was an error09498501 there was an error09489500 there was an error09489499 there was an error09498502 there was an error09511300 there was an error09498400 there was an error09498500 there was an error09489700 there was an error09500500 there was an error09489082 there was an error09510200 there was an error09489100 there was an error09490500 there was an error09510180 there was an error09494000 there was an error09490000 there was an error09489086 there was an error09489089 there was an error09489200 there was an error09489078 there was an error09510170 there was an error09493500 there was an error09493000 there was an error09498503 there was an error09497500 there was an error09510000 there was an error09509502 there was an error09509500 there was an error09492400 there was an error09492500 there was an error09497980 there was an error09497850 there was an error09492000 there was an error09497800 there was an error09510150 there was an error09499500 there was an error... <truncated>
What I am doing wrong using the tryCatch?

Answer
You wrote the tryCatch outside of getstreamflow. Hence, if one site fails, then getstreamflow will return an error and nothing else. You should either supply 1 site at a time, or put the tryCatch inside getstreamflow.
Example
x <- 1:5
fun <- function(x) {
for (i in x) if (i == 5) stop("ERROR")
return(x^2)
}
tryCatch(fun(x), error = function(e) paste0("wrong", x))
This returns:
[1] "wrong1" "wrong2" "wrong3" "wrong4" "wrong5"
Multiple arguments
You indicated that you have both siteNumber and datatype to iterate over.
Using Map, we can define a function that takes two inputs:
Map(function(x, y) tryCatch(fun(x, y),
error = function(e) message(paste(x, " there was an error"))),
x = siteNumber,
y = datatype)
Using a for-loop, we can just iterate over them:
Streamflow <- vector(mode = "list", length = length(siteNumber))
for (i in seq_along(siteNumber)) {
Streamflow[[i]] <- tryCatch(getstreamflow(siteNumber[i], datatype), error = function(e) message(paste(x, " there was an error")))
}
Or, as suggested, just modify getstreamflow.

How to read in parallel multiple chunks from the same connection in R?

I have a .bz2 file and I want to read it and do some processing. The file cannot be loaded in memory. I want to do some computations on the chunks I read and I they can be performed independently of one another and therefore I thought I would try to do it in parallel.
I tried the following:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
con = file("myfile.bz2", "r")
parLapply(cl, con,
function(x)
print(head(read.csv(x, nrows = 100, stringsAsFactors = F, header = F, colClasses = "character", fill = F), 16)))
## Doesn't work Error in checkForRemoteErrors(val) :
one node produced an error: 'file' must be a character string or connection
parLapply(cl, list(con, con, con),
function(x)
print(head(read.csv(x, nrows = 100, stringsAsFactors = F, header = F, colClasses = "character", fill = F), 16)))
## Doesn't work Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: invalid connection
Can this somehow work?
Also any other recommendation as to how to go about it would be really helpful, as I am new to the world of parallel processing.

You cannot and must not use connections from one R process in another R process - connections are unique to the R session where they are created.
Internally, they are just integer indices and there is very little in R that protects you from mistakenly trying to use them in other R processes. If your want to know for details, see https://github.com/HenrikBengtsson/Wishlist-for-R/issues/81.
FWIW, if you use the future framework for parallelization and set R option future.globals.onReference to "error", then it protects you against this mistake (https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html). For example,
library(future.apply)
options(future.globals.onReference = "error")
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
cat("Hello world\n", file = bzfile("myfile.bz2", open="wb"))
con <- file("myfile.bz2", "r")
y <- future_lapply(list(con, con, con), FUN = function(x) {
data <- read.csv(x, nrows = 100, stringsAsFactors = FALSE, header = FALSE, colClasses = "character", fill = FALSE)
print(head(data), 16)
})
Error: Detected a non-exportable reference ('externalptr') in one of the globals (<unknown>) used in the future expression

How to reuse sparklyr context with mclapply?

I have a R code that does some distributed data preprocessing in sparklyr, and then collects the data to R local dataframe to finally save the result in the CSV. Everything works as expected and now I plan to re-use the spark context across multiple input files processing.
My code looks similar to this reproducible example:
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
# Generate random input
matrix(rbinom(1000, 1, .5), ncol=1) %>% write.csv('/tmp/input/df0.csv')
matrix(rbinom(1000, 1, .5), ncol=1) %>% write.csv('/tmp/input/df1.csv')
# Multi-job input
input = list(
list(name="df0", path="/tmp/input/df0.csv"),
list(name="df1", path="/tmp/input/df1.csv")
)
global_parallelism = 2
results_dir = "/tmp/results2"
# Function executed on each file
f <- function (job) {
spark_df <- spark_read_csv(sc, "df_tbl", job$path)
local_df <- spark_df %>%
group_by(V1) %>%
summarise(n=n()) %>%
sdf_collect
output_path <- paste(results_dir, "/", job$name, ".csv", sep="")
local_df %>% write.csv(output_path)
return (output_path)
}
If I execute the function of a job inputs in sequential way with lapply everything works as expected:
> lapply(input, f)
[[1]]
[1] "/tmp/results2/df0.csv"
[[2]]
[1] "/tmp/results2/df1.csv"
However, if I plan to run it in parallel to maximize usage of spark context (if df0 spark processing is done and the local R is working on it, df1 can be already processed by spark):
> library(parallel)
> library(MASS)
> mclapply(input, f, mc.cores = global_parallelism)
*** caught segfault ***
address 0x560b2c134003, cause 'memory not mapped'
[[1]]
[1] "Error in as.vector(x, \"list\") : \n cannot coerce type 'environment' to vector of type 'list'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in as.vector(x, "list"): cannot coerce type 'environment' to vector of type 'list'>
[[2]]
NULL
Warning messages:
1: In mclapply(input, f, mc.cores = global_parallelism) :
scheduled core 2 did not deliver a result, all values of the job will be affected
2: In mclapply(input, f, mc.cores = global_parallelism) :
scheduled core 1 encountered error in user code, all values of the job will be affected
When I'm doing similar with Python and ThreadPoolExcutor, the spark context is shared across threads, same for Scala and Java.
Is this possible to reuse sparklyr context in parallel execution in R?

Yeah, unfortunately, the sc object, which is of class spark_connection, cannot be exported to another R process (even if forked processing is used). If you use the future.apply package, part of the future ecosystem, you can see this if you use:
library(future.apply)
plan(multicore)
## Look for non-exportable objects and given an error if found
options(future.globals.onReference = "error")
y <- future_lapply(input, f)
That will throw:
Error: Detected a non-exportable reference (‘externalptr’) in one of the
globals (‘sc’ of class ‘spark_connection’) used in the future expression

R - `try` in conjunction with capturing ALL console output?

Here's a piece of code I'm working with:
install.package('BiocManager');BiocManager::install('UniProt.ws')
requireNamespace('UniProt.ws')
uniprot_object <- UniProt.ws::UniProt.ws(
UniProt.ws::availableUniprotSpecies(
pattern = '^Homo sapiens$')$`taxon ID`)
query_results <- try(
UniProt.ws::select(
x = uniprot_object,
keys = 'BAA08084.1',
keytype = 'EMBL/GENBANK/DDBJ',
columns = c('ENSEMBL','UNIPROTKB')))
This particular key/keytype combination is non-productive and produces the following output:
Getting mapping data for BAA08084.1 ... and ACC
error while trying to retrieve data in chunk 1:
no lines available in input
continuing to try
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
Of the two [eE]rrors reported only the second is a 'proper' R error object and given the use of try accordingly captured in the variable query_result.
I am, however, desperate to capture the other error bit (no lines available in input) to inform downstream programmatic processes.
After playing with a plethora of capture.output, sink, purrr::quietly, etc. options found by startpaging (googling), I continue to fail capturing that bit. How can I do that?

As #Csd suggested, you could use tryCatch. The message that you are after is printed by the message() function in R, not stop(), so try() will ignore it. To capture output from message(), use code like this:
query_results <- tryCatch(
UniProt.ws::select(
x = uniprot_object,
keys = 'BAA08084.1',
keytype = 'EMBL/GENBANK/DDBJ',
columns = c('ENSEMBL','UNIPROTKB')),
message = function(e) conditionMessage(e))
This will abort evaluation when it gets any message, and return the message in query_results. If you are doing more than debugging, you probably want the message saved, but evaluation to continue. In that case, use withCallingHandlers instead. For example,
saveMessages <- c()
query_results <- withCallingHandlers(
UniProt.ws::select(
x = uniprot_object,
keys = 'BAA08084.1',
keytype = 'EMBL/GENBANK/DDBJ',
columns = c('ENSEMBL','UNIPROTKB')),
message = function(e)
saveMessages <<- c(saveMessages, conditionMessage(e)))
When I run this version, query_results is unchanged (because the later error aborted execution), but the messages are saved:
saveMessages
[1] "Getting mapping data for BAA08084.1 ... and ACC\n"
[2] "error while trying to retrieve data in chunk 1:\n no lines available in input\ncontinuing to try\n"

Based on #user2554330 s most excellent answer, I constructed an ugly thing that does exactly what I want:
try to execute the statement
don't fail fatally
leave no ugly messages
allow me access to errors and messages
So here it is in all it's despicable glory:
saveMessages <- c()
query_results <- suppressMessages(
withCallingHandlers(
try(
UniProt.ws::select(
x = uniprot_object,
keys = 'BAA08084.1',
keytype = 'EMBL/GENBANK/DDBJ',
columns = c('ENSEMBL','UNIPROTKB')),
silent = TRUE),
message = function(e)
saveMessages <<- c(saveMessages, conditionMessage(e))))

using package snow's parRapply: argument missing error

I want to find documents whose similarity between other doucuments are larger than a given value(0.1) by cutting documents into blocks.
library(tm)
data("crude")
sample.dtm <- DocumentTermMatrix(
crude, control=list(
weighting=function(x) weightTfIdf(x, normalize=FALSE),
stopwords=TRUE
)
)
step = 5
n = nrow(sample.dtm)
block = n %/% step
start = (c(1:block)-1)*step+1
end = start+step-1
j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))
ij <- cbind(i,j)
library(skmeans)
getdocs <- function(k){
ci <- c(start[k[[1]]]:end[k[[1]]])
cj <- c(start[k[[2]]]:end[k[[2]]])
combi <- sample.dtm[ci]
combj < -sample.dtm[cj]
rownames(combi)<-ci
rownames(combj)<-cj
comb<-c(combi,combj)
sim<-1-skmeans_xdist(comb)
cat("Block", k[[1]], "with Block", k[[2]], "\n")
flush.console()
tri.sim<-upper.tri(sim,diag=F)
results<-tri.sim & sim>0.1
docs<-apply(results,1,function(x) length(x[x==TRUE]))
docnames<-names(docs)[docs>0]
gc()
return (docnames)
}
It works well when using apply
system.time(rmdocs<-apply(ij,1,getdocs))
When using parRapply
library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))
Error:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
Timing stopped at: 0.01 0 0.04
It seems that sample.dtm coundn't be used in parRapply. I'm confused. Can anyone help me? Thanks!

In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, the result of not doing so is that there isn't a dimnames method defined for "DocumentTermMatrix" objects, causing rownames<- to fail.
You can load packages on the cluster workers with the clusterEvalQ function:
clusterEvalQ(cl, { library(tm); library(skmeans) })
After doing that, rownames(combi)<-ci will work correctly.
Also, if you want to see the output from cat, you should use the makeCluster outfile argument:
cl <- makeCluster(2, outfile='')