I want to run a moving average (preferably with custom weights) on a data.table object. frollmean is very fast compared to frollapply (especially) and even TTR::EMA (which does not allow custom weights, only wilder=T/F) -- performance comparison at the end. I tried to use data.table:::froll (which frollmean uses and which is not exported). Its first argument must be the (character) name of the function (for frollmean it is mean).
data.table:::froll('f.roll', x$c, 10)
Error in data.table:::froll("f.roll", x$c, 10) :
Internal error: invalid fun argument in rolling function, should have been caught before. please report to data.table issue tracker.
I tried to access the data.table package as an environment:
e.dt <- as.environment('package:data.table')
e.dt$froll <- data.table:::froll
Error in e.dt$froll <- data.table:::froll (from c.ta.R#287) :
cannot add bindings to a locked environment
e.dt2 <- new.env(parent=e.dt)
e.dt2$froll <- data.table:::froll
e.dt2$froll('f.roll', top.., n, na.rm=T)
Error in e.dt2$froll("f.roll", x$c, n, na.rm = T) :
Internal error: invalid fun argument in rolling function, should have been caught before. please report to data.table issue tracker.
I also tried attaching my custom function f.roll to e.dt2:
e.dt2$f.roll <- f.roll
e.dt2$froll('f.roll', top.., n, na.rm=T)
Error in e.dt2$froll("f.roll", top.., n, na.rm = T) :
Internal error: invalid fun argument in rolling function, should have been caught before. please report to data.table issue tracker.
froll calls a C function (CfrollfunR):
data.table:::froll
function (fun, x, n, fill = NA, algo = c("fast", "exact"), align = c("right",
"left", "center"), na.rm = FALSE, hasNA = NA, adaptive = FALSE)
{
stopifnot(!missing(fun), is.character(fun), length(fun) ==
1L, !is.na(fun))
algo = match.arg(algo)
align = match.arg(align)
ans = .Call(CfrollfunR, fun, x, n, fill, algo, align, na.rm,
hasNA, adaptive)
ans
}
<bytecode: 0x0000000013738b40>
<environment: namespace:data.table>
data.table:::CfrollfunR
$name
[1] "CfrollfunR"
$address
<pointer: 0x000000001ef93e80>
attr(,"class")
[1] "RegisteredNativeSymbol"
$dll
DLL name: datatable
Filename: C:/bin/cygwin/cygwin64/home/Adi/R/win-library/4.0/data.table/libs/x64/datatable.dll
Dynamic lookup: FALSE
$numParameters
[1] -1
attr(,"class")
[1] "CallRoutine" "NativeSymbolInfo"
However, while my custom function f.roll doesn't work, mean as the first parameter works:
tail(e.dt2$froll('mean', x$c, 10, na.rm=T))
[1] 43.506 43.148 42.855 42.548 42.331 42.200
I inspected the data.table DLL C:\bin\cygwin\cygwin64\home\Adi\R\win-library\4.0\data.table\libs\x64\datatable.dll (using DLL Export Viewer), but it doesn't show any function called mean. The data.table package also doesn't have a mean function (either exported or unexported).
data.table:::mean
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object 'mean' not found
data.table::mean
Error: 'mean' is not an exported object from 'namespace:data.table'
Any suggestion would be greatly appreciated.
PS Here's the performance comparison from microbenchmark for reference:
## Unit: milliseconds
## expr min lq mean median uq max neval
## frollapply 131736.6469 131736.6469 131736.6469 131736.6469 131736.6469 131736.6469 1
## EMA 262.9931 262.9931 262.9931 262.9931 262.9931 262.9931 1
## frollmean 97.0388 97.0388 97.0388 97.0388 97.0388 97.0388 1
PS Since froll isn't a solution to my problem (as @Waldi points out), I'm looking for an alternative solution, but I left this question as is and asked a separate, new question: fast way to calculate moving average/rolling function which allows custom weights
The C source code of froll shows that this internal function is only designed for mean or sum; the error isn't linked to the environment:
enum {MEAN, SUM} sfun;
if (!strcmp(CHAR(STRING_ELT(fun, 0)), "mean")) {
sfun = MEAN;
} else if (!strcmp(CHAR(STRING_ELT(fun, 0)), "sum")) {
sfun = SUM;
} else {
error(_("Internal error: invalid %s argument in %s function should have been caught earlier. Please report to the data.table issue tracker."), "fun", "rolling"); // # nocov
}
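Since froll only dispatches on "mean" or "sum", the weighted window has to come from elsewhere. As a minimal sketch (not part of the original question or answer), a custom-weight rolling mean can be written with stats::filter(); the data and weights below are hypothetical, and sides = 1 gives a right-aligned window with w[1] applied to the most recent value:
x <- cumsum(rnorm(1000))                 # hypothetical series
w <- c(4, 3, 2, 1) / 10                  # hypothetical weights, newest to oldest
wma <- stats::filter(x, w, method = "convolution", sides = 1)
tail(as.numeric(wma))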
I am working with R. I found this previous post on stackoverflow which shows how to get a "list" of all functions that belong to a given library:
How to find all functions in an R package?
For example:
#load desired library
library(ParBayesianOptimization)
#find out all functions from this library
getNamespaceExports("ParBayesianOptimization")
[1] "addIterations" "getLocalOptimums" "bayesOpt" "getBestPars" "changeSaveFile" "updateGP"
The above code tells me the names of all the functions exported by the "ParBayesianOptimization" library. From here, I could manually inspect each one of these functions, for example:
# manually inspect any one of these functions
getAnywhere(bayesOpt)
A single object matching ‘bayesOpt’ was found
It was found in the following places
package:ParBayesianOptimization
namespace:ParBayesianOptimization
with value
#function starts here
function (FUN, bounds, saveFile = NULL, initGrid, initPoints = 4,
iters.n = 3, iters.k = 1, otherHalting = list(timeLimit = Inf,
minUtility = 0), acq = "ucb", kappa = 2.576, eps = 0,
parallel = FALSE, gsPoints = pmax(100, length(bounds)^3),
convThresh = 1e+08, acqThresh = 1, errorHandling = "stop",
plotProgress = FALSE, verbose = 1, ...)
{
startT <- Sys.time()
optObj <- list()
etc etc etc ...
saveFile = saveFile, verbose = verbose, ...)
return(optObj)
}
#function ends here
<bytecode: 0x000001cbb4145db0>
<environment: namespace:ParBayesianOptimization>
Goal : Is it possible to take each one of these functions and create a notepad file with their full definitions?
My attempt:
I thought I could first make an "object" in R that contained all the functions found in this library:
library(plyr)
a = getNamespaceExports("ParBayesianOptimization")
my_list = do.call("rbind.fill", lapply(a, as.data.frame))
X[[i]]
1 addIterations
2 getLocalOptimums
3 bayesOpt
4 getBestPars
5 changeSaveFile
6 updateGP
Then, I could manually create an "assignment arrow":
header_text <- rep("<-")
Then, "paste" this to each function name:
combined_list <- as.character(paste(my_list, header_text, sep = ""))
But this is not looking correct:
combined_list
[1] "c(\"addIterations\", \"getLocalOptimums\", \"bayesOpt\", \"getBestPars\", \"changeSaveFile\", \"updateGP\")<- "
The goal is to automate the process of manually copying/pasting:
function_1 = getAnywhere("first function ParBayesianOptimization library")
function_2 = getAnywhere("second function ParBayesianOptimization library")
etc
final_list = c(function_1, function_2 ...)
And removing the generic description from each function:
A single object matching ‘bayesOpt’ was found
It was found in the following places
package:ParBayesianOptimization
namespace:ParBayesianOptimization
with value
In the end, if I were to "call" the final_list object, all the functions from this library should get recreated and reassigned.
Can someone please show me how to do this?
Thanks
You can use the dump function for this:
pkg <- "ParBayesianOptimization"
dump(getNamespaceExports(pkg), file="funs.R", envir = asNamespace(pkg))
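A variation (my own sketch, not part of the original answer): if you also want the package's unexported functions, list everything in the namespace and keep only the functions before dumping.
pkg <- "ParBayesianOptimization"
ns  <- asNamespace(pkg)
all_names <- ls(ns, all.names = TRUE)
fun_names <- all_names[vapply(all_names, function(nm) is.function(get(nm, envir = ns)), logical(1))]
dump(fun_names, file = "all_funs.R", envir = ns)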
This code will help you write the function definitions of all the functions in a library to a text file.
fn_list <- getNamespaceExports("ParBayesianOptimization")
for(i in seq_along(fn_list)) {
header <- paste('\n\n####Function', i, '\n\n\n')
cat(paste0(header, paste0(getAnywhere(fn_list[i]), collapse = '\n'), '\n\n'),
file = 'function.txt', append = TRUE)
}
I have a specific performance issue that I wish to extend more generally, if possible.
Context:
I've been playing around on Google Colab with a Python code sample for a Q-learning agent, which associates a state and an action with a value using a defaultdict:
self._qvalues = defaultdict(lambda: defaultdict(lambda: 0))
return self._qvalues[state][action]
I'm not an expert, but my understanding is that it returns the value, or adds the key with a value of 0 and returns that if the key is not found.
I'm adapting part of this in R.
The problem is that I don't know how many state/action combinations I have, and technically I guess I shouldn't need to know how many states there are.
At first I went the wrong way, with rbind of data.frames, and that was very slow.
I then replaced my R object with a data.frame(state, action, value = NA_real_).
It works, but it's still very slow. Another problem is that my data.frame object has a fixed maximum size, which might be problematic in the future.
I then changed my data.frame to a data.table, which gave me worse performance, so I finally indexed it by (state, action).
qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
qaction = rep(seq(nbactions), times = nbstates),
qvalue = NA_real_,
stringsAsFactors = FALSE)
setkey(qvalues, "qstate", "qaction")
Problem:
Comparing Google Colab/Python vs my local R implementation, Google performs 1000x10e4 accesses to the object in, let's say, 15s, while my code performs 100x100 accesses in 28s. I got a 2s improvement by byte compiling, but that's still far too slow.
Using profvis, I see most of the time is spent accessing the data.table on these two calls:
qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
self$qvalues[J(state, action)]$qvalue <- value
I don't really know what Google has, but my desktop is a beast. I also saw some benchmarks stating that data.table was faster than pandas, so I suppose the problem lies in my choice of container.
Questions:
Is my use of a data.table wrong, and can it be fixed to improve performance and match the Python implementation?
Is another design possible that avoids declaring all the state/action combinations, which could be a problem if the dimensions become too large?
I've seen the hash package mentioned; is it the way to go?
Thanks a lot for any pointer!
UPDATE:
Thanks for all the input.
So what I did was replace 3 accesses to my data.table using your suggestions:
#self$qvalues[J(state, action)]$qvalue <- value
self$qvalues[J(state, action), qvalue := value]
#self$qvalues[J(state, action),]$qvalue <- 0
self$qvalues[J(state, action), qvalue := 0]
#qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
This dropped the runtime from 33s to 21s.
That's a massive improvement, but it's still extremely slow compared to the Python defaultdict implementation.
I noted the following:
Working in batch: I don't think I can, as each call to the function depends on the previous call.
peudospin> I see you are surprised that the get is time-consuming. So am I, but that's what profvis states.
And here is the code of the function for reference:
QAgent$set("public", "get_qvalue", function( state, action) {
#qval <- self$qvalues[J(state, action), nomatch = NA_real_]$qvalue
qval <- self$qvalues[J(state, action), nomatch = NA_real_, qvalue]
if (is.na(qval)) {
#self$qvalues[self$qvalues$qstate == state & self$qvalues$qaction == action,]$qvalue <- 0
#self$qvalues[J(state, action),]$qvalue <- 0
self$qvalues[J(state, action), qvalue := 0]
return(0)
}
return(qval)
})
At this point, unless there are more suggestions, I will conclude that data.table is just too slow for this kind of task and that I should look into using an env or collections (as suggested here: R fast single item lookup from list vs data.table vs hash).
CONCLUSION:
I replaced the data.table with a collections::dict and the bottleneck completely disappeared.
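Roughly, the replacement looks like this (a simplified sketch rather than my exact code; building the key by pasting state and action together is just one possible choice):
library(collections)
qvalues <- dict()
qkey <- function(state, action) paste(state, action, sep = "|")
qvalues$set(qkey(1, 2), 0.5)
qvalues$get(qkey(1, 2), default = 0)   # 0 when the key is absent, like defaultdict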
data.table is fast for doing lookups and manipulations in very large tables of data, but it's not going to be fast at adding rows one by one the way Python dictionaries are. I'd expect it to copy the whole table each time you add a row, which is clearly not what you want.
You can either try to use environments (which are something like a hashmap), or, if you really want to do this in R, you may need a specialist package; here's a link to an answer with a few options.
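As a rough sketch of the environment idea (my own illustration, assuming a string key pasted from state and action), something like this mimics the Python defaultdict:
qvalues <- new.env(hash = TRUE, parent = emptyenv())
get_qvalue <- function(state, action) {
  key <- paste(state, action, sep = "|")
  if (!exists(key, envir = qvalues, inherits = FALSE)) {
    assign(key, 0, envir = qvalues)      # default value, like defaultdict(lambda: 0)
  }
  get(key, envir = qvalues, inherits = FALSE)
}
set_qvalue <- function(state, action, value) {
  assign(paste(state, action, sep = "|"), value, envir = qvalues)
}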
Benchmark
library(data.table)
Sys.setenv('R_MAX_VSIZE'=32000000000) # raise the RAM limit
setDTthreads(threads=0) # use maximum threads possible
nbstates <- 1e3
nbactions <- 1e5
cartesian <- function(nbstates, nbactions){
  # cross join via a constant key column k, then drop k and add an empty qvalue column
  x <- data.table(qstate = 1:nbactions)
  y <- data.table(qaction = 1:nbstates)
  k <- NULL
  x <- x[, c(k = 1, .SD)]
  setkey(x, k)
  y <- y[, c(k = 1, .SD)]
  setkey(y, NULL)
  x[y, allow.cartesian = TRUE][, c("k", "qvalue") := list(NULL, NA_real_)][]
}
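(A usage sketch, not part of the benchmark itself: build the table, key it as in the question, then look values up by (state, action). With the sizes above this is a very large table, so smaller values are advisable for a quick check.)
qvalues <- cartesian(nbstates, nbactions)
setkey(qvalues, qstate, qaction)
qvalues[J(3L, 42L), qvalue]   # hypothetical lookup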
#comparing seq with `:`
(bench = microbenchmark::microbenchmark(
1:1e9,
seq(1e9),
times=1000L
))
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> 1:1e+09 120 143 176.264 156.0 201 5097 1000
#> seq(1e+09) 3039 3165 3333.339 3242.5 3371 21648 1000
ggplot2::autoplot(bench)
(bench = microbenchmark::microbenchmark(
"Cartesian product" = cartesian(nbstates,nbactions),
"data.table assignment"=qvalues <- data.table(qstate = rep(seq(nbstates), each = nbactions),
qaction = rep(seq(nbactions), times = nbstates),
qvalue = NA_real_,
stringsAsFactors = FALSE),
times=100L))
#> Unit: seconds
#> expr min lq mean median uq max neval
#> Cartesian product 3.181805 3.690535 4.093756 3.992223 4.306766 7.662306 100
#> data.table assignment 5.207858 5.554164 5.965930 5.895183 6.279175 7.670521 100
#> data.table (1:nb) 5.006773 5.609738 5.828659 5.80034 5.979303 6.727074 100
#>
#>
ggplot2::autoplot(bench)
It is clear that using seq consumes more time than calling 1:nb. Plus, using a Cartesian product makes the code faster even when 1:nb is used.
I have R code that does some distributed data preprocessing in sparklyr, then collects the data into a local R data frame and finally saves the result to a CSV. Everything works as expected, and now I plan to reuse the Spark context across the processing of multiple input files.
My code looks similar to this reproducible example:
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
# Generate random input
matrix(rbinom(1000, 1, .5), ncol=1) %>% write.csv('/tmp/input/df0.csv')
matrix(rbinom(1000, 1, .5), ncol=1) %>% write.csv('/tmp/input/df1.csv')
# Multi-job input
input = list(
list(name="df0", path="/tmp/input/df0.csv"),
list(name="df1", path="/tmp/input/df1.csv")
)
global_parallelism = 2
results_dir = "/tmp/results2"
# Function executed on each file
f <- function (job) {
spark_df <- spark_read_csv(sc, "df_tbl", job$path)
local_df <- spark_df %>%
group_by(V1) %>%
summarise(n=n()) %>%
sdf_collect
output_path <- paste(results_dir, "/", job$name, ".csv", sep="")
local_df %>% write.csv(output_path)
return (output_path)
}
If I execute the function over the job inputs sequentially with lapply, everything works as expected:
> lapply(input, f)
[[1]]
[1] "/tmp/results2/df0.csv"
[[2]]
[1] "/tmp/results2/df1.csv"
However, if I run it in parallel to maximize usage of the Spark context (if df0's Spark processing is done and the local R session is working on it, df1 can already be processed by Spark):
> library(parallel)
> library(MASS)
> mclapply(input, f, mc.cores = global_parallelism)
*** caught segfault ***
address 0x560b2c134003, cause 'memory not mapped'
[[1]]
[1] "Error in as.vector(x, \"list\") : \n cannot coerce type 'environment' to vector of type 'list'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in as.vector(x, "list"): cannot coerce type 'environment' to vector of type 'list'>
[[2]]
NULL
Warning messages:
1: In mclapply(input, f, mc.cores = global_parallelism) :
scheduled core 2 did not deliver a result, all values of the job will be affected
2: In mclapply(input, f, mc.cores = global_parallelism) :
scheduled core 1 encountered error in user code, all values of the job will be affected
When I do something similar in Python with ThreadPoolExecutor, the Spark context is shared across threads; the same holds for Scala and Java.
Is it possible to reuse the sparklyr context in parallel execution in R?
Yeah, unfortunately, the sc object, which is of class spark_connection, cannot be exported to another R process (even if forked processing is used). If you use the future.apply package, part of the future ecosystem, you can see this if you use:
library(future.apply)
plan(multicore)
## Look for non-exportable objects and give an error if found
options(future.globals.onReference = "error")
y <- future_lapply(input, f)
That will throw:
Error: Detected a non-exportable reference (‘externalptr’) in one of the
globals (‘sc’ of class ‘spark_connection’) used in the future expression
I am trying to use the code below to make API calls in a parallel process to speed them up. (I know this isn't the best way to speed up API calls, but it works.)
It only fails when I try to use parallel; otherwise it works. In the ldply function I am getting the error below:
Error in do.ply(i) :
task 1 failed - "object of type 'closure' is not subsettable"
In addition:
Warning messages:
1: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: : ... may be used in an incorrect context: ‘.fun(piece, ...)’
Any help would be appreciated!
One <- 26
cl<-makeCluster(4)
registerDoSNOW(cl)
func.time <- Sys.time()
## API CALL ONE FOR "kline"
url <- "https://api.binance.com"
path <- paste("/api/v1/klines?symbol=",pairs[1],"&interval=1m&limit=1", sep = "")
raw.results <- GET(url = url, path = path)
text_content <- content(raw.results, as = "text", encoding = "UTF-8")
kline <- data.frame(text_content %>% fromJSON())
kline$symbol <- pairs[1]
## API FUNCTION TO BE APPLIED FOR REST
loopfunction <- function(i){
url <- "https://api.binance.com"
path <- paste("/api/v1/klines?symbol=",pairs[i],"&interval=1m&limit=1", sep = "")
raw.results <- GET(url = url, path = path)
text_content <- content(raw.results, as = "text", encoding = "UTF-8")
kline_temp <- data.frame(text_content %>% fromJSON())
kline_temp$symbol <- pairs[i]
kline <- rbind(kline,kline_temp)
return(kline)
}
## DPLY PARALLEL FUNCTION
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, .parallel = T, .paropts = c("httr", "jsonlite", "dplyr"))) ## "One" is a variable created earlier
stopCluster(cl)
func.end.time <- Sys.time()
func.tot.time <- func.end.time - func.time
Your question isn't fully reproducible, so the following is an educated guess.
Your loopfunction() references an object called pairs. It seems from your script that a variable called pairs is defined somewhere in your local environment. However, when loopfunction() is passed to ldply(), it no longer has access to that variable (ordinarily it would, but parallelization requires fresh R environments to be created). Having failed to find an object called pairs in the environment, R continues searching and finds a match in graphics::pairs(). This is a plotting function, not a subsettable object like a vector or data frame. Hence the error message, "object of type 'closure' is not subsettable".
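A one-line way to see the same message (my own illustration, not from the original post) is to subset the plotting function directly:
graphics::pairs[1]
# Error in graphics::pairs[1] : object of type 'closure' is not subsettable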
I'm not especially familiar with how ldply implements parallel processing, but you could probably modify your function definition like this:
loopfunction <- function(i, pairs) {
...[body of function]...
}
And pass pairs as an extra parameter in your ldply call:
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, pairs = pairs, .parallel = T, .paropts = list(.packages = c("httr", "jsonlite", "dplyr"))))
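Alternatively (a sketch on my part, not something shown above), since plyr passes .paropts through to foreach, you could probably leave loopfunction() unchanged and export pairs to the workers with foreach's .export option:
kline2 <- data.frame(ldply(2:(One - 1), .fun = loopfunction, .parallel = T,
                           .paropts = list(.packages = c("httr", "jsonlite", "dplyr"),
                                           .export = "pairs")))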
I am new to R, so forgive me if the question is a little silly.
I am trying to write a simple while loop for a value function iteration. My function (optim.routine) uses the solver ipoptr. Here is my code:
d<-1
old1<-0
old2<-0
num.iter<-0
i.eps<-1e-05
i.T<-100
lb<-0
ub<-10
while (d>i.eps & num.iter<i.T){
new1 <- optim.routine(old1, old2, eval_f=eval_f, eval_grad_f=eval_grad_f, lb=lb, ub=ub, update=FALSE)
d<-dist(c(old1, new1), method="euclidean")
num.iter<-num.iter+1
old1<-new1
}
where optim.routine is the following function:
optim.routine<-function(old1, old2, eval_f=obj, eval_grad_f=obj.deriv, lb=lb, ub=ub, update){
if (isTRUE(update)){
var2<-old2
var1<-old1
var1.deriv<-deriv(var1)
optimize <- ipoptr(x0 = old2, eval_f = eval_f, eval_grad_f = eval_grad_f, lb = lb,
ub = ub)
new1<- optimize$objective
new2<- optimize$solution
old2<-new2
old1<-new1
}else{
var2<-old2
var1<-old1
var1.deriv<-vf.deriv(var1)
optimize <- ipoptr(x0 = old2, eval_f = eval_f, eval_grad_f = eval_grad_f, lb = lb,
ub = ub)
new1<- optimize$objective
new2<- optimize$solution
old1<-new1
}
}
and deriv is a function that computes derivatives.
I get the following error if I try to run the code:
source('/mnt/ide0/home/myname/Documents/optim.R')
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'fn' of mode 'function' was not found
and if I debug the function:
Browse[2]> n
Error in isTRUE(update) : argument "update" is missing, with no default
If I only source the function without the while loop, no error is displayed. Honestly, I have no clue. Any help is greatly appreciated. Thanks!
Claudia
I had exactly the same error message when I named a variable with the same name as an existing function in R. I found this tip here: http://notepad.patheticcockroach.com/2565/a-bad-idea-in-r-using-variables-with-the-same-name-as-existing-functions/ Hope it helps you too.
– FraNut Oct 12 at 11:26
He's right: refrain from using variable names that are also function names.
e.g.
z1 <- aggregate(steps ~ interval, data_df, mean)
mean <- mean(z1[,2], na.rm = TRUE)
Here mean is both a variable and a function name passed as an argument to the aggregate function, causing a conflict.
Many times that error will appear when you previously created an object called "mean" in the R environment. This creates a conflict when calling the function "mean". To stop this error use:
rm(mean)
This removes the object "mean" from the environment and allows R to call the function "mean".
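A quick way to check for this kind of masking before removing anything (a small addition of mine, not part of the original answer):
"mean" %in% ls(globalenv())          # TRUE if you created an object named mean
conflicts(detail = TRUE)$.GlobalEnv  # workspace names that mask other objects
rm(list = intersect("mean", ls(globalenv())))   # remove it only if it exists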