R - db connection function

I'd like to create a function for a dbplyr connection, which I currently call this way:
tbl("connection", in_schema("schema", "table")) %>%
  collect()
I created this function:
loadData <- function(tbl_name) {
  result <- tbl("connection"), in_schema("usa", tbl_name)
  return(result)
}
I call it like this:
test <- loadData(tbl_name = "sales") %>% collect()
But I'm getting this error:
Error in return(tbl("connection"), in_schema("usa", tbl_name)) :
multi-argument returns are not permitted
In addition: Warning message:
In mget(objectNames, envir = ns, inherits = TRUE) :
strings not representable in native encoding will be translated to UTF-8
Is there a way to call all my connections via a similar function?
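A minimal sketch of a fix, assuming connection is a live DBI/dbplyr connection object (the quoted "connection" above is treated as a placeholder for it): return() accepts only a single value, and the schema-qualified table should be built inside one tbl() call.
library(dplyr)
library(dbplyr)

loadData <- function(tbl_name) {
  # build the lazy table reference in a single tbl() call;
  # collect() is left to the caller
  tbl(connection, in_schema("usa", tbl_name))
}

test <- loadData(tbl_name = "sales") %>% collect()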

Related

How to skip some lines in R

I have many URLs whose text I import into R.
I use this code:
setNames(lapply(1:1000, function(x) gettxt(get(paste0("url", x)))), paste0("url", 1:1000, "_txt")) %>%
list2env(envir = globalenv())
However, some URLs cannot be imported and throw this error:
Error in file(con, "r") : cannot open the connection In addition:
Warning message: In file(con, "r") : InternetOpenUrl failed: 'A
connection with the server could not be established'
So, my code doesn't run and doesn't import any text from any URL.
How can I recognize the bad URLs and skip them in order to import the correct ones?
One possible approach, besides the tryCatch mentioned by #tester, is the purrr package:
library(purrr)
# declare the function
my_gettxt <- function(x) {
  gettxt(get(paste0("url", x)))
}
# make the function failure-tolerant by defining the `otherwise` value
# (could be an empty df with column definitions, etc.) used as output if the function fails
my_gettxt <- purrr::possibly(my_gettxt, otherwise = NA)
# use map from purrr instead of an apply function
my_data <- purrr::map(1:1000, ~ my_gettxt(.x))
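If you also want the url1_txt, url2_txt, ... names from the original code and want to drop the failed downloads, a small follow-up along these lines might help (assuming the otherwise = NA sentinel above):
# keep the naming scheme from the original code
my_data <- setNames(my_data, paste0("url", 1:1000, "_txt"))
# drop entries where possibly() fell back to NA because the URL failed
my_data <- purrr::discard(my_data, ~ all(is.na(.x)))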

How to reuse sparklyr context with mclapply?

I have R code that does some distributed data preprocessing in sparklyr, then collects the data into a local R data frame, and finally saves the result to CSV. Everything works as expected, and now I plan to reuse the Spark context across the processing of multiple input files.
My code looks similar to this reproducible example:
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
# Generate random input
matrix(rbinom(1000, 1, .5), ncol=1) %>% write.csv('/tmp/input/df0.csv')
matrix(rbinom(1000, 1, .5), ncol=1) %>% write.csv('/tmp/input/df1.csv')
# Multi-job input
input = list(
  list(name = "df0", path = "/tmp/input/df0.csv"),
  list(name = "df1", path = "/tmp/input/df1.csv")
)
global_parallelism = 2
results_dir = "/tmp/results2"
# Function executed on each file
f <- function(job) {
  spark_df <- spark_read_csv(sc, "df_tbl", job$path)
  local_df <- spark_df %>%
    group_by(V1) %>%
    summarise(n = n()) %>%
    sdf_collect()
  output_path <- paste(results_dir, "/", job$name, ".csv", sep = "")
  local_df %>% write.csv(output_path)
  return(output_path)
}
If I execute the function over the job inputs sequentially with lapply, everything works as expected:
> lapply(input, f)
[[1]]
[1] "/tmp/results2/df0.csv"
[[2]]
[1] "/tmp/results2/df1.csv"
However, when I try to run it in parallel to maximize usage of the Spark context (so that once Spark is done with df0 and local R is working on it, Spark can already be processing df1):
> library(parallel)
> library(MASS)
> mclapply(input, f, mc.cores = global_parallelism)
*** caught segfault ***
address 0x560b2c134003, cause 'memory not mapped'
[[1]]
[1] "Error in as.vector(x, \"list\") : \n cannot coerce type 'environment' to vector of type 'list'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in as.vector(x, "list"): cannot coerce type 'environment' to vector of type 'list'>
[[2]]
NULL
Warning messages:
1: In mclapply(input, f, mc.cores = global_parallelism) :
scheduled core 2 did not deliver a result, all values of the job will be affected
2: In mclapply(input, f, mc.cores = global_parallelism) :
scheduled core 1 encountered error in user code, all values of the job will be affected
When I do something similar in Python with ThreadPoolExecutor, the Spark context is shared across threads; the same holds for Scala and Java.
Is it possible to reuse the sparklyr context in parallel execution in R?
Yeah, unfortunately the sc object, which is of class spark_connection, cannot be exported to another R process (even when forked processing is used). You can see this with the future.apply package, part of the future ecosystem:
library(future.apply)
plan(multicore)
## Look for non-exportable objects and give an error if any are found
options(future.globals.onReference = "error")
y <- future_lapply(input, f)
That will throw:
Error: Detected a non-exportable reference (‘externalptr’) in one of the
globals (‘sc’ of class ‘spark_connection’) used in the future expression

Meaning of this warning: "Warning message: In get(object, envir = currentEnv, inherits = TRUE) : restarting interrupted promise evaluation"

I have written a function in R which extracts data from a database and builds a new table.
My new table is labeled with the date of the extract (build_date_0).
When I'm debugging my function I get the following warning when I look at my date string:
Browse[2]> build_date_0
[1] "2019-05-01"
Warning message:
In get(object, envir = currentEnv, inherits = TRUE) :
restarting interrupted promise evaluation
Questions:
What does this warning mean / what is happening (step-by-step/basics)?
Should I care?
In general how can I find out more about this error?
This is my code:
build_account_db = function(conn = connection_object,
                            various_inputs = 24) {
  browser()
  # create connection objects
  tabs_1 = dplyr::tbl(conn, in_schema("DB_1", "VIEW_W")) # some table
  # create date string
  build_date_0 = lubridate::today() %>% as.character()
  build_date = str_replace_all(build_date_0, "-+", "_")
  db_name_1 = paste0('DATABASE.tab_1_', build_date)
  db_name_2 = paste0('DATABASE.tab_2_', build_date)
  # build queries
  query_text_1 = tabs_1 %>% select(COL_1) # some query
  query_text_2 = tabs_1 %>% select(COL_2)
  # build new tables
  create_db = DBI::dbSendQuery(conn_t, paste('CREATE TABLE', db_name_1, 'AS (', query_text_1, ') WITH DATA PRIMARY INDEX (ID_1)'))
  create_db2 = DBI::dbSendQuery(conn_t, paste('CREATE TABLE', db_name_2, 'AS (', query_text_2, ') WITH DATA PRIMARY INDEX (ID_1)'))
}
When I check a variable, I may or may not get this warning (it varies, even if I restart R and run my code again with a cleared environment):
Browse[2]> build_date
[1] "2019-02-28 11:00:00 AEDT"
Warning message:
In get(object, envir = currentEnv, inherits = TRUE) :
restarting interrupted promise evaluation
What I've tried: I read this question, but it's more about suppressing the error. Also, google.
I found this link on promises and evaluation in R helpful for a related problem: https://mailund.dk/posts/promises-and-lazy-evaluation/. I wonder whether adding a call to just build_date_0 right after build_date_0 = lubridate::today() %>% as.character() would resolve the promise. Good luck!
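Building on that idea, here is a hedged sketch (an assumption, not a confirmed fix): the warning is usually about lazily evaluated function arguments whose evaluation was interrupted, plausibly by browser() here, so forcing the arguments before browser() may make it disappear.
build_account_db = function(conn = connection_object,
                            various_inputs = 24) {
  # force the argument promises up front so that browser() cannot
  # interrupt their lazy evaluation later in the function
  force(conn)
  force(various_inputs)
  browser()
  # ... rest of the function as above ...
}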

R - `try` in conjunction with capturing ALL console output?

Here's a piece of code I'm working with:
install.packages('BiocManager'); BiocManager::install('UniProt.ws')
requireNamespace('UniProt.ws')
uniprot_object <- UniProt.ws::UniProt.ws(
  UniProt.ws::availableUniprotSpecies(
    pattern = '^Homo sapiens$')$`taxon ID`)
query_results <- try(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')))
This particular key/keytype combination is non-productive and produces the following output:
Getting mapping data for BAA08084.1 ... and ACC
error while trying to retrieve data in chunk 1:
no lines available in input
continuing to try
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
Of the two [eE]rrors reported, only the second is a 'proper' R error object and, given the use of try, is accordingly captured in the variable query_results.
I am, however, desperate to capture the other error bit (no lines available in input) to inform downstream programmatic processes.
After playing with a plethora of capture.output, sink, purrr::quietly, etc. options found by startpaging (googling), I continue to fail to capture that bit. How can I do that?
As #Csd suggested, you could use tryCatch. The message that you are after is printed by the message() function in R, not stop(), so try() will ignore it. To capture output from message(), use code like this:
query_results <- tryCatch(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')),
  message = function(e) conditionMessage(e))
This will abort evaluation when it gets any message, and return the message in query_results. If you are doing more than debugging, you probably want the message saved, but evaluation to continue. In that case, use withCallingHandlers instead. For example,
saveMessages <- c()
query_results <- withCallingHandlers(
  UniProt.ws::select(
    x = uniprot_object,
    keys = 'BAA08084.1',
    keytype = 'EMBL/GENBANK/DDBJ',
    columns = c('ENSEMBL','UNIPROTKB')),
  message = function(e)
    saveMessages <<- c(saveMessages, conditionMessage(e)))
When I run this version, query_results is unchanged (because the later error aborted execution), but the messages are saved:
saveMessages
[1] "Getting mapping data for BAA08084.1 ... and ACC\n"
[2] "error while trying to retrieve data in chunk 1:\n no lines available in input\ncontinuing to try\n"
Based on #user2554330's most excellent answer, I constructed an ugly thing that does exactly what I want:
try to execute the statement
don't fail fatally
leave no ugly messages
allow me access to errors and messages
So here it is, in all its despicable glory:
saveMessages <- c()
query_results <- suppressMessages(
  withCallingHandlers(
    try(
      UniProt.ws::select(
        x = uniprot_object,
        keys = 'BAA08084.1',
        keytype = 'EMBL/GENBANK/DDBJ',
        columns = c('ENSEMBL','UNIPROTKB')),
      silent = TRUE),
    message = function(e)
      saveMessages <<- c(saveMessages, conditionMessage(e))))
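With the messages captured, a simple (hypothetical) downstream check on saveMessages could then drive the rest of the pipeline, for example:
# hypothetical downstream gate: was the chunk retrieval non-productive?
chunk_failed <- any(grepl("no lines available in input", saveMessages))
if (chunk_failed) {
  # e.g. skip this key, log it, or queue it for a retry
  message("Non-productive key: BAA08084.1")
}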

R - Parallel Processing and ldply error

I am trying to use the code below to make API calls in parallel in order to speed them up. (I know this isn't the best way to speed up API calls, but it works.)
It only fails when I try to run it in parallel; otherwise it works. In the ldply call I am getting the error below:
Error in do.ply(i) :
task 1 failed - "object of type 'closure' is not subsettable"
In addition:
Warning messages:
1: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
Any help would be appreciated!
library(httr)
library(jsonlite)
library(dplyr)
library(plyr)
library(doSNOW)

One <- 26
cl <- makeCluster(4)
registerDoSNOW(cl)
func.time <- Sys.time()
## API CALL ONE FOR "kline"
url <- "https://api.binance.com"
path <- paste("/api/v1/klines?symbol=", pairs[1], "&interval=1m&limit=1", sep = "")
raw.results <- GET(url = url, path = path)
text_content <- content(raw.results, as = "text", encoding = "UTF-8")
kline <- data.frame(text_content %>% fromJSON())
kline$symbol <- pairs[1]
## API FUNCTION TO BE APPLIED FOR REST
loopfunction <- function(i) {
  url <- "https://api.binance.com"
  path <- paste("/api/v1/klines?symbol=", pairs[i], "&interval=1m&limit=1", sep = "")
  raw.results <- GET(url = url, path = path)
  text_content <- content(raw.results, as = "text", encoding = "UTF-8")
  kline_temp <- data.frame(text_content %>% fromJSON())
  kline_temp$symbol <- pairs[i]
  kline <- rbind(kline, kline_temp)
  return(kline)
}
## LDPLY PARALLEL CALL
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, .parallel = T,
                           .paropts = c("httr", "jsonlite", "dplyr"))) ## "One" is a variable created earlier
stopCluster(cl)
func.end.time <- Sys.time()
func.tot.time <- func.end.time - func.time
Your question isn't fully reproducible, so the following is an educated guess.
Your loopfunction() references an object called pairs. It seems from your script that a variable called pairs is defined somewhere in your local environment. However, when loopfunction() is passed to ldply(), it no longer has access to that variable (ordinarily, it would, but parallelization requires fresh R environments to be created). Having failed to find an object called pairs in the environment, R continues searching, and finds a match in stats::pairs(). This is a plotting function, not a subsettable object like a vector or data frame. Hence the error message, "object of type 'closure' is not subsettable".
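A quick way to see this for yourself (a toy illustration, not part of the original code): subsetting the stats::pairs function reproduces the message.
# with no data object called `pairs` in scope, R finds the plotting
# function stats::pairs; subsetting a function then fails
stats::pairs[1]
#> Error in stats::pairs[1] : object of type 'closure' is not subsettable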
I'm not especially familiar with how ldply implements parallel processing, but you could probably modify your function definition like this:
loopfunction <- function(i, pairs) {
  ...[body of function]...
}
And pass pairs as an extra parameter in your ldply call:
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, pairs = pairs, .parallel = T, .paropts = list(.packages = c("httr", "jsonlite", "dplyr"))))
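Putting that together with the function body from the question, a sketch of the modified function might look like the following. Note this is an assumption about intent: the global kline object from the question would likewise not be visible to the workers, so the sketch returns only kline_temp and lets ldply do the row-binding.
loopfunction <- function(i, pairs) {
  url <- "https://api.binance.com"
  path <- paste("/api/v1/klines?symbol=", pairs[i], "&interval=1m&limit=1", sep = "")
  raw.results <- GET(url = url, path = path)
  text_content <- content(raw.results, as = "text", encoding = "UTF-8")
  kline_temp <- data.frame(text_content %>% fromJSON())
  kline_temp$symbol <- pairs[i]
  kline_temp
}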
