R slow assignment in setRefClass - r

Perhaps this question should be in some programming forum, but I thought I would ask it in the statistics community. The following code illustrates the problem when performing global assignment in R's setRefClass:
class <- setRefClass("class",
fields = list(
params = "numeric"
),
methods = list(
initialize = function() {
params <<- 5
},
do.stuff = function() {
for (i in 1:1e5)
params <<- 2
}
))
# FAST:
params <- 5
time <- Sys.time(); for (i in 1:1e5) params <- 2; time <- Sys.time() - time
print(time)
# SLOW:
newclass <- class$new()
time <- Sys.time(); newclass$do.stuff(); time <- Sys.time() - time
print(time)
And pqR shows a slight improvement in runtime, but nothing drastic.
I would like to know why this is happening... in my mind, assigning a variable should be fast. Maybe this has something to do with locating an object "slot" (variable location), similar to S3/S4 classes. I bet I can only observe such behavior with R, and not C++.

As defined, an error check will be done on each assignment to "params" to ensure that only "numeric" data is stored there. It goes faster if the definition is changed from fields = list(params = "numeric") to just fields="params".

Related

How to update data.table variable within a function?

Sorry if this is a duplicate. I am very new to data.table. Basically, I am able to get my code to work outside of functions, but when I pack the operations inside of a function, they breakdown. Ultimately, I had hoped to make the functions age.inds and m.inds internal functions in a package.
# required functions ------------------------------------------------------
# create object
create.obj <- function(n = 100){
obj = list()
obj$inds <- data.table(age = rep(0.1, n), m = NA)
obj$m$model <- function(age, a){return(age^a)}
obj$m$params <- list(a = 2)
return(obj)
}
# calculate new 'age' of inds
age.inds <- function(obj){
obj$inds[, age := age + 1]
return(obj)
}
# calculate new 'm' of inds
m.inds <- function(obj){
ARGS <- list()
args.incl <- which(names(obj$m$params) %in% names(formals(obj$m$model)))
ARGS <- c(ARGS, obj$m$params[args.incl])
args.incl <- names(obj$inds)[names(obj$inds) %in% names(formals(obj$m$model))]
ARGS <- c(ARGS, obj$inds[, ..args.incl]) # double dot '..' version
# ARGS <- c(ARGS, inds[, args.incl, with = FALSE]) # 'with' version
obj$inds[, m := do.call(obj$m$model, ARGS)]
return(obj)
}
# advance object
adv.obj <- function(obj, times = 1){
for(i in seq(times)){
obj <- age.inds(obj)
obj <- m.inds(obj)
}
return(obj)
}
# Example ----------------------------------------------------------------
# this doesn't work
obj <- create.obj(n = 10)
obj # so far so good
obj <- age.inds(obj)
obj # 'inds' gone
# would ultimately like to call adv.obj
obj <- adv.obj(obj, times = 5)
Also (as a side note), most of what I would like to do in my code would be vectorized calculations (i.e. updating variables in obj$inds), so I don't even know if going to data.tables makes too much sense for me (i.e. no by grouping operations as of yet). I am dealing with large objects and wondered if switching from data.frame objects would speed things up (I can get my code to work using data.frames).
Thanks
Update
OK, the issue with the printing has been solved thanks to #eddi. I am however unable to use these "inds" functions when they are located internally within a package (i.e not exported). I made a small package (DTtester), that has this example in the help file for adv.obj:
obj <- create.obj(n=10)
obj <- adv.obj(obj, times = 5)
# Error in `:=`(age, new.age) :
# Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are
# defined for use in j, once only and in particular ways. See help(":=").
Any idea why the functions would fail in this way?

How to create a data frame with Rblpapi subscribe function

I'm sorry this example won't be reproducible by those who aren't Bloomberg users.
For the others, I'm using Rblpapi and its subscribe function. I would like to create something like a data frame, a matrix or an array and fill it with values that are streamed by the subscription.
Assuming your BBComm component is up and running, my example says:
require(Rblpapi)
con <- blpConnect()
securities <- c('SX5E 07/20/18 C3400 Index',
'SX5E 07/20/18 C3450 Index',
'SX5E 07/20/18 C3500 Index')
I would like to fill a 3 x 2 matrix with these fields:
fields <- c('BID', 'ASK')
I guess I can create a matrix like this with almost no performance overhead:
mat <- matrix(data = NA,
nrow = 3,
ncol = 2)
Now I use subscribe and its argument fun for filling purposes, so something like this (albeit ugly to see and likely inefficient):
i <- 1
subscribe(securities = securities,
fields = fields,
fun = function(x){
if (i > length(securities))
i <<- 1
tryCatch(
expr = {
mat[i, 1] <<- x$data$BID
mat[i, 2] <<- x$data$ASK
i <<- i + 1
},
error = function(e){
message(e)
},
finally = {}
)
})
Result:
Error in subscribe_Impl(con, securities, fields, fun, options, identity) :
Evaluation error: number of items to replace is not a multiple of replacement length.
Of course, this doesn't work because I don't really know how to use indexing on streamed data. $ operator seems fine to retrieve data points by name - like I did with BID and ASK - but I cannot find a way to figure out which values are referring to, say, securities[1] or to securities[2]. It seems that I get a stream of numeric values that are indistinguishable one from each other because I cannot retrieve the ownership of the value among the securities.
Using an index on x$data$BID[1] throws the same error.
Ok your code looks fine, the only thing that does not work is x$data$BID, change to x$data["BID"] and then you can store it, Im working with your code and this is my result.
fields=c("TIME","LAST_PRICE", "BID", "ASK")
blpConnect()
blpConnect()
i <- 1
subscribe(securities = securities,
fields = fields,"interval=60",
fun = function(x){
if (i > length(securities))
i <<- 1
tryCatch(
expr = {
tim <- x$data["TIME"]
last <<- x$data["LAST_PRICE"]
ask <<- x$data["ASK"]
bid <<- x$data["BID"]
i <<- i + 1
},
error = function(e){
message(e)
},
finally = {}
)
print(cbind(tim$TIME,last$LAST_PRICE,ask$ASK, bid$BID))
})
result
A good way to take a look at the result object from the subscribe function is:
subscribe(securities=c("AAPL US Equity"),
fields=c("LAST_PRICE"),
fun=function(x) print(str(x)))
From there you can work your way into the data:
subscribe(securities=c("AAPL US Equity", "INTC US Equity"),
fields=c("LAST_PRICE","BID","ASK"),
fun=function(x) {
if (!is.null(x$data$MKTDATA_EVENT_TYPE) && x$data$MKTDATA_EVENT_TYPE == "TRADE" && exists("LAST_PRICE", where = x$data)) {
print(data.frame(Ticker = x$topic, DateTime = x$data$TRADE_UPDATE_STAMP_RT, Trade = x$data$LAST_PRICE))
}
})
I only printed the data.frame here. The data can be processed or stored directly using the FUN argument of subscribe.

Parallelized `Find` loop in R

There are several packages in R to simplify running code in parallel, like foreach and future. Most of these have constructs which are like lapply or a for loop: they carry on until all the tasks have finished.
Is there a simple parallel version of Find? That is, I would like to run several tasks in parallel. I don't need all of them to finish, I just need to get the first one that finishes (maybe with a particular result). After that the other tasks can be killed, or left to finish on their own.
Conceptual code:
hunt_needle <- function (x, y) x %in% (y-1000):y
x <- sample.int(1000000, 1)
result <- parallel_find(seq(1000, 1000000, 1000), hunt_needle)
# should return the first value for which hunt_needle is true
You can use shared memory so that processes can communicate with one another.
For that, you can use package bigstatsr (disclaimer: I'm the author).
Choose a block size and do:
# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)
# Data example
cond <- logical(1e6)
cond[sample(length(cond), size = 1)] <- TRUE
ind.block <- bigstatsr:::CutBySize(length(cond), block.size = 1000)
cl <- parallel::makeCluster(nb_cores())
doParallel::registerDoParallel(cl)
# This value (in an on-disk matrix) is shared by processes
found_it <- FBM(1, 1, type = "integer", init = 0L)
library(foreach)
res <- foreach(ic = sample(rows_along(ind.block)), .combine = 'c') %dopar% {
if (found_it[1]) return(NULL)
ind <- bigstatsr:::seq2(ind.block[ic, ])
find <- which(cond[ind])
if (length(find)) {
found_it[1] <- 1L
return(ind[find[1]])
} else {
return(NULL)
}
}
parallel::stopCluster(cl)
# Verification
all.equal(res, which(cond))
Basically, when a solution is found, you don't need to do some computations anymore, and others know it because you put a 1 in found_it which is shared between all processes.
As your question is not reproducible and I don't understand everything you need, you may have to adapt this solution a little bit.

Loops and variable scope in R

I have following for loop in R:
v = c(1,2,3,4)
s = create.some.complex.object()
for (i in v){
print(i)
s = some.complex.function.that.updates.s(s)
}
# s here has the right content.
Needless to say, this loop is horribly slow in R.
I tried to write it in functional style:
lapply(v, function(i){
print(i)
s = some.complex.function.that.updates.s(s)
})
# s wasn't updated.
But this doesn't work, because s is passed by value and not by reference.
I only need the result of the last iteration, not all of the intermediate steps.
How do I formulate the first loop in R-style?
Mulone
lapply(v, function(i){
print(i)
s = some.complex.function.that.updates.s(s)
return(s)
})
the result will be a list of object s created for each value of v. Even if it should have passed the value of v anyway cause it was the last operation performed by the function.
If you can't afford to create it many times then there are not a lot of options. It is hard to say as well without seeing the object that you are operating on. If the object is growing/appending you could collect the intermediate results and do the appending at the end. If it is actually mutating you should try to get away from the pass value and use reference classes (http://www.inside-r.org/r-doc/methods/ReferenceClasses). Then the function that modifies it will actually be a method you just call n times.
Is the loop itself really the problem? Or is it rather the time the execution of some.complex.function.that.updates.s needs?
Some R programers will jump through hoops to avoid loops but have a look at this example:
f <- function(a) a/1.001
loop <- function(n) { s = (1/f(1)^n); for (i in 1:n) s <- f(s); s}
system.time(loop(1E7))
user system elapsed
7.011 0.030 7.008
This is 0.7 micro seconds (on a MacBook Pro) per call of a very trivial function in a loop.
v = c(1,2,3,4)
s = create.some.complex.object()
lapply(v, function(i){
print(i)
s <<- some.complex.function.that.updates.s(s)
}) |> invisible()
Use of the <<- operator can sometimes get you into trouble and is (somewhat) discouraged, but when I want to mimic a for loop with side-effects this is a pattern I have found useful.
v = c(1,2,3,4)
s = create.some.complex.object()
lapply(v, function(i){
print(i)
assign('s', some.complex.function.that.updates.s(s), envir = .GlobalEnv)
}) |> invisible()
Using assign allows you to avoid the use of <<- operator. Using <<- is significantly faster than invoking the assign function. For performance reasons in more intensive applications it is very much worth it to replace sequential for loops with vectorized operations as the median execution time of lapply can be several orders of magnitude faster! Here are some toy benchmarks to support this assertion:
v <- c(1, 2, 3, 4)
microbenchmark::microbenchmark({
s <- 1
lapply(v, function(i) {
s <<- s + i
})
}, times = 1e4, unit = 'microseconds')
Median: ~ 4 microseconds
v <- c(1, 2, 3, 4)
microbenchmark::microbenchmark({
s <- 1
for(i in v) {
s <- s + i
}
}, times = 1e4, unit = 'microseconds')
Median: ~ 1488 microseconds

Reference Classes, tab completion and forced method definition

I am currently writing a package using reference classes. I have come across
an issue which from reading various sources:
Method initialisation in R reference classes
Can't reliably use RefClass methods in Snowfall
I gather is caused because reference methods are not all copied to every object
in the class rather they are copied when first accessed.
https://stat.ethz.ch/pipermail/r-devel/2011-June/061261.html
As an example define:
test <- setRefClass("TEST",
fields = list( a = "numeric"),
methods = list(
addone = function(){
a <<- a+1
},
initialize = function(){
a <<- 1
}
)
)
example <- test$new()
So example is a new object of class TEST. Typing example$ and tabbing in the
console gives
> example$
# example$.->a example$.refClassDef example$.self
# example$a example$initialize
so the method addone is not presented as an option. It is available to
call however:
example$addone()
Now tabbing again reveals
# >
# > example
# Reference class object of class "TEST"
# Field "a":
# [1] 2
# > example$
# example$.->a example$.refClassDef example$.self
# example$a example$addone example$field
# example$initialize example$show
so now addone and field and show are presented as options.
Martin Morgan advises to force definition of the methods in one of the above links. This
works well
test <- setRefClass("TEST",
fields = list( a = "numeric"),
methods = list(
addone = function(){
a <<- a+1
},
initialize = function(){
a <<- 1
.self$addone #force definition
}
)
)
example <- test$new()
so now tabbing gives:
# > example$
# example$.->a example$.refClassDef example$.self
# example$a example$addone example$initialize
Some of my classes have over 30 methods so I would like to do this as succintly as possible.
I have defined:
test <- setRefClass("TEST",
fields = list( a = "numeric"),
methods = list(
addone = function(){
a <<- a+1
},
initialize = function(){
a <<- 1
eval(parse(text=paste0('.self$',ls(test$def#refMethods))))
}
)
)
example <- test$new()
tabbing now gives:
# > example$
# example$.->a example$.refClassDef example$.self
# example$a example$addone example$callSuper
# example$copy example$export example$field
# example$getClass example$getRefClass example$import
# example$initFields example$initialize example$show
# example$trace example$untrace
Whilst this works it feels a bit clumsy. Also test$def#refMethods is used rather then getRefClass("TEST")$def#refMethods so that
feels a bit wrong. Has anyone dealt with this issue before.
Is there a better way to approach a solution? Thanks for any advice and apologies if the question is overly drawn out.
I wonder what your objective is? Function names showing up with tab completion? Then it's worth a post to the R-devel mailing list with a feature request. The original scenario is more elegantly handled with usingMethods as documented on ?setRefClass. A continued hack might be
initialize = function(...) {
methods <- getRefClass(class(.self))$methods()
eval(parse(text=paste0(".self$", methods)))
callSuper(...)
}
Tab completions can be customized via .DollarNames in the utils package, so
.DollarNames.TEST <- function(x, pattern)
grep(pattern, getRefClass(class(x))$methods(), value=TRUE)
Maybe an S3 method could be written at the base of your class hierarchy for this?
I know this is an old question but it is still the top entry when searching for refClass tab completion on google, so I'll just add an update:
Instead of using grep in the .DollarNames function as suggested by Martin, use findMatches from the utils package as it plays better with the different Rgui's around (grep will delete your partially typed name upon hitting tab)
.DollarNames.TEST <- function(x, pattern){
utils:::findMatches(pattern, getRefClass(class(x))$methods())
}
This is also how tab completion is handled internally for lists and data.frames
#Martin Morgan noted that this was termed tab completion. The package rcompletion and later rcompgen were tasked with achieving this. They have been now moved to utils.
rcompletion update
I looked thru the code for completion.R and from what I could determine utils:::.DollarNames.environment was handling tab completion for reference classes.
completion.R
Redefining the function seemed to achieve tab completion:
assignInNamespace( x = ".DollarNames.environment",
function(x, pattern = "") {
y <- NULL
if(isS4(x) && !is.null(x[['.refClassDef']])){
if(.hasSlot(x$.refClassDef,'refMethods')){
y<-x$.refClassDef#refMethods
y<-ls(y, all.names = TRUE, pattern = pattern)
}
}
x<-ls(x, all.names = TRUE, pattern = pattern)
unique(c(x,y))
}
,ns = "utils")
Some things to note:
I would only use this for my own use. Currently I am debugging and documenting a package. I had some longish method names and couldnt remember exactly what they were so tab completion will help greatly.
Usage of assignInNamespace in a package is frowned upon (if not banned) see ?assignInNamespace.
Forced definition of methods is more advisable.

Resources