I work a lot with time series. Most of my manipulation is done via data.table, but I often have to pull data for a specific time range, and for that I use the xts subsetting method:
> timedata['2014-01-02/2014-01-03']
My data.table is basically an exact copy of the xts object, with the first column, index, containing the time:
> dt_timedata <- data.table(index=index(timedata), coredata(timedata))
At some point the data became way too large, so keeping both copies or converting between them all the time is not a good option (it never really was), so I am thinking about implementing the same method for data.table. However, I couldn't find any reasonably easy examples of modifying a generic method.
Is what I want even possible, and if so, where could I read about it?
PS Even though I can obviously use something like
> from <- '2014-01-02'
> to <- '2014-01-03'
> period <- as.POSIXct(c(from, to))
> dt_timedata[index %between% period]
it is far less intuitive and elegant, so I would rather write a new method.
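(As a hedged aside, that workaround is easy to wrap into a small helper; the function name subset_period and the assumption that the time column is called index are mine, not from the original post:)
# hypothetical helper: accepts an xts-style 'from/to' string
subset_period <- function(dt, spec, col = "index") {
  rng <- as.POSIXct(strsplit(spec, "/", fixed = TRUE)[[1L]])
  # note: unlike xts, the upper bound here is midnight at the start of 'to'
  dt[get(col) %between% rng]
}
subset_period(dt_timedata, '2014-01-02/2014-01-03')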
Edit1 (example by request)
require(xts)
require(data.table)
days <- as.POSIXct(c('2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04'), origin='1970-01-01')
timedata <- xts(1:4, days)
dt_timedata <- data.table(index=index(timedata), coredata(timedata))
What I can do in xts:
> timedata['2014-01-01/2014-01-02']
[,1]
2014-01-01 1
2014-01-02 2
I want exactly the same behavior for [.data.table.
Edit2 (to illustrate what I do)
'[.data.table' = function(x, i, ...) {
  if (!missing(i)) {
    if (is.character(i)) {
      # do some weird stuff
      return(x[ *some indices subset I just created* ])
    }
  }
  data.table:::'[.data.table'(x, i, ...)
}
So basically, if i is a character vector and fits my format (the checks happen in the "weird stuff" section), I return a subset and the function never reaches the last line. Otherwise nothing special happens and I just call
data.table:::'[.data.table'(x, i, ...)
And the thing is, this breaks expressions like
ind <- as.POSIXct('2014-01-01', origin='1970-01-01')
dt_timedata[index==ind]
Here's a trivial example for you to try:
require(data.table)
days <- as.POSIXct(c('2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04'), origin='1970-01-01')
dt_timedata <- data.table(index=days, value=1:4)
ind <- as.POSIXct('2014-01-01', origin='1970-01-01')
# now it works
dt_timedata[index==ind]
# new trivial [.data.table
'[.data.table' = function(x, i, ...) {
  data.table:::`[.data.table`(x, i, ...)
}
# and now it doesn't work: the wrapper forces i to be evaluated in the
# calling frame, where no column 'index' exists, instead of letting
# data.table evaluate the expression over the table's columns
dt_timedata[index==ind]
Modifying the method to add your own custom behavior is very simple:
`[.data.table` = function(...) {
  print("I'm doing smth custom")
  data.table:::`[.data.table`(...)
}
dt = data.table(a = 1:5)
dt[, sum(a)]
#[1] "I'm doing smth custom"
#[1] 15
So just process the first argument however you like and feed it to the standard function.
Here's an example to handle your edit:
`[.data.table` = function(...) {
  # ..2 is the second element of ..., i.e. whatever was passed as i
  if (try(class(..2), silent = TRUE) == 'character')
    print("boo")
  else
    data.table:::`[.data.table`(...)
}
dt = data.table(a = 1:10)
dt[a == 4]
# a
#1: 4
dt['sdf']
#[1] "boo"
#[1] "boo"
Sorry if this is a duplicate. I am very new to data.table. Basically, I am able to get my code to work outside of functions, but when I pack the operations inside a function, they break down. Ultimately, I had hoped to make the functions age.inds and m.inds internal functions in a package.
# required functions ------------------------------------------------------
# create object
create.obj <- function(n = 100){
  obj = list()
  obj$inds <- data.table(age = rep(0.1, n), m = NA)
  obj$m$model <- function(age, a){return(age^a)}
  obj$m$params <- list(a = 2)
  return(obj)
}
# calculate new 'age' of inds
age.inds <- function(obj){
  obj$inds[, age := age + 1]
  return(obj)
}
# calculate new 'm' of inds
m.inds <- function(obj){
  ARGS <- list()
  args.incl <- which(names(obj$m$params) %in% names(formals(obj$m$model)))
  ARGS <- c(ARGS, obj$m$params[args.incl])
  args.incl <- names(obj$inds)[names(obj$inds) %in% names(formals(obj$m$model))]
  ARGS <- c(ARGS, obj$inds[, ..args.incl]) # double dot '..' version
  # ARGS <- c(ARGS, inds[, args.incl, with = FALSE]) # 'with' version
  obj$inds[, m := do.call(obj$m$model, ARGS)]
  return(obj)
}
# advance object
adv.obj <- function(obj, times = 1){
  for(i in seq(times)){
    obj <- age.inds(obj)
    obj <- m.inds(obj)
  }
  return(obj)
}
# Example ----------------------------------------------------------------
# this doesn't work
obj <- create.obj(n = 10)
obj # so far so good
obj <- age.inds(obj)
obj # 'inds' gone
# would ultimately like to call adv.obj
obj <- adv.obj(obj, times = 5)
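(As an aside for future readers, a hedged sketch, not from the original post: the usual workaround for the suppressed printing after := is to append an empty [] to the call, which re-enables auto-printing of the data.table:)
# variant of age.inds with the trailing [] workaround
age.inds <- function(obj){
  obj$inds[, age := age + 1][]  # the trailing [] re-enables auto-printing
  return(obj)
}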
Also, as a side note: most of what I would like to do in my code is vectorized calculations (i.e. updating variables in obj$inds), so I don't even know if moving to data.table makes much sense for me (there are no by-group operations as of yet). I am dealing with large objects and wondered if switching from data.frame objects would speed things up (I can get my code to work using data.frames).
Thanks
Update
OK, the issue with the printing has been solved thanks to @eddi. I am, however, unable to use these "inds" functions when they are located internally within a package (i.e. not exported). I made a small package (DTtester) that has this example in the help file for adv.obj:
obj <- create.obj(n=10)
obj <- adv.obj(obj, times = 5)
# Error in `:=`(age, new.age) :
# Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are
# defined for use in j, once only and in particular ways. See help(":=").
Any idea why the functions would fail in this way?
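(Not part of the original question, but for anyone hitting the same error: this case is covered in data.table's own documentation on importing. A package that uses := internally must declare data.table as a dependency and either import it or declare itself data.table-aware; otherwise data.table's internals treat the calling package as not data.table-aware, and [ and := fall back to data.frame behavior, failing with exactly this message. A hedged sketch of the relevant package files:)
# DESCRIPTION
Imports: data.table
# NAMESPACE
import(data.table)
# alternatively, if you only use data.table via ::, put this in any R/ file:
# .datatable.aware <- TRUE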
I am wondering if it is possible to set vector names by reference in R.
I often use data.table::fread to read text files, and then I clean up the variable names by combining setnames (which also works on a plain data.frame) with a string-cleanup function similar to:
clean_var_name <- function(s) {
  gsub("^_+|_+$", "", gsub("(\\s|\\-|[[:punct:]])+", "_", tolower(s)))
}
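As a quick sanity check (mine, not from the original post), the cleaner behaves like this:
clean_var_name(c("thIs.Is.A.Terrible-Name", "this One is TOO"))
# [1] "this_is_a_terrible_name" "this_one_is_too"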
so my function looks like:
clean_names <- function(x){
  require(data.table)
  if(is.data.frame(x)){ setnames(x, names(x), clean_var_name(names(x))) } # this part works
  else if(is.vector(x)){ do_something_here } # this is the question
}
I'm wondering if there is a way to include the case of vectors in the same function in a way that performs names(x) <- clean_var_name(names(x)) by reference.
v <- c(`thIs.Is.A.Terrible-Name`=1, `this One is TOO`=2)
dt <- data.table(t(v))
clean_names(dt)
dt
# this_is_a_terrible_name this_one_is_too
# 1: 1 2
# would like to be able to do the same for clean_names(v)
I'm also open to explanations of why this is a bad idea (side effects, functional programming, etc.)
Use the setattr function:
library(data.table)
x <- 1:10
address(x)
# [1] "0x713cfd0"
setattr(x,"names",letters[1:10])
address(x)
# [1] "0x713cfd0"
I have the function below. I cannot alter the function in any way except for the first block of code in it.
In this simple example I want to apply some function to the object being returned and display the result.
The point is that the name of the variable returned by the function may vary, so I'm not able to guess it.
Obviously I also cannot wrap the f function into { x <- f(); myfun(x); x }.
The .Last.value below in my on.exit call is meant to represent the value about to be returned by the f function.
f <- function(param){
  # the only code I know - start
  on.exit(if("character" %in% class(.Last.value)) message(print(.Last.value)) else message(class(.Last.value)))
  # the only code I know - end
  # real processing of f()
  a <- "aaa"
  "somethiiiing"
  if(param==1L) return(a)
  b <- 5L
  "somethiiiing"
  if(param==2L) return(b)
  "somethiiiing"
  return(32)
}
f(1L)
# function
# [1] "aaa"
f(2L)
# aaa
# [1] 5
f(3L)
# integer
# [1] 32
The above code with .Last.value seems to work with a lag (so in fact it does not work): .Last.value is only updated after a top-level expression completes, so inside on.exit it still refers to the result of the previous top-level call. Also, .Last.value is probably not the way to go anyway, since I want to use the value a few times, as in if(fun0(x)) fun1(x) else fun2(x), and because the returned value might be a big object, copying it on the side is also a bad approach.
Is there any way to use on.exit, or any other function, that would let me run my function on the result of f without knowing the name of the result variable?
In a similar way to how you are modifying the function, you could easily wrap it as well. Here's a reproducible example.
library(data.table)
append.log <- function(x) {
  cat(paste("value:", x, "\n"))
}
# take a copy of the original method, with the data.table namespace as its
# environment so its internal helpers still resolve
idx.dt <- data.table:::`[.data.table`
environment(idx.dt) <- asNamespace("data.table")
idx.wrap <- function(...) {
  # substitute(...()) captures the arguments unevaluated, so data.table's
  # non-standard evaluation of i and j still happens in the caller's frame
  x <- do.call(idx.dt, as.list(substitute(...())), envir = parent.frame())
  append.log(if (is(x, "data.table")) {
    nrow(x)
  } else { NA })
  x
}
environment(idx.wrap) <- asNamespace("data.table")
# swap the wrapper into the (locked) data.table namespace
(unlockBinding)("[.data.table", asNamespace("data.table"))
assign("[.data.table", idx.wrap, envir = asNamespace("data.table"), inherits = FALSE)
dt<-data.table(a=1:10, b=seq(2, 20, by=2), c=letters[1:10])
dt[a%%2==0]
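If everything is wired up correctly, that last call is routed through idx.wrap, so it should log the subset's row count via append.log (value: 5, since five values of a are even) before the five-row subset prints as usual.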
Since R 3.2.0 this is fully possible, thanks to the new function returnValue().
Working example below.
f <- function(x, err = FALSE){
  pt <- proc.time()[[3L]]
  on.exit(message(paste(
    "proc.time:", round(proc.time()[[3L]] - pt, 4),
    "\nnrow:", as.integer(nrow(returnValue()))[1L]
  )))
  Sys.sleep(0.001)
  if(err) stop("some error")
  return(x)
}
dt <- data.frame(a = 1:5, b = letters[1:5])
f(dt)
f(dt, err=T)
f(dt)
f(dt[dt$a %in% 2:3 & dt$b %in% c("c","d"),])
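Since the questioner wanted to use the value several times without copying it around, it may help that returnValue() can be called once inside the on.exit expression and bound to a local name; a minimal sketch (f2 and ret are my names, not from the original answer):
f2 <- function(x){
  on.exit({
    ret <- returnValue()  # grab the pending return value once;
                          # binding it does not copy the object
    if (is.data.frame(ret)) message("nrow: ", nrow(ret))
    else message(class(ret))
  })
  x
}
f2(data.frame(a = 1:3))  # message: nrow: 3
f2("abc")                # message: character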
I want to write a function that creates a time series, but I'd like it to generate the name of the time series as part of the call.
Sort of
makeTS <- function(my.data.frame, dateName, varName) {
  # dateName and varName are strings
  # - create time series tsAux from my.data.frame, dateName and varName
  # - create string tsName
  # (the creation of tsAux is not a problem)
  assign(tsName, tsAux)
  return(tsName)
}
This, perhaps not surprisingly, returns the string tsName, but is there any way that I can make it return a named object?
I've tried with
do.call('<-', list(tsName, tsAux))
and I've also tried using
as.name(tsName) <- tsAux
but nothing seems to work.
I know that
tsName <- makeTS2(my.data.frame, dateName, varName)
would do the trick (where makeTS2() just generates the time series tsAux and returns it), but is there any way to make it work with one function call?
Thanks!
Can you? Sure:
makeTS <- function(dat, varName) {
  result <- NA
  assign(varName, result, envir = .GlobalEnv)
  result
}
> makeTS(NA, "test")
[1] NA
> test
[1] NA
Should you? Almost surely not.
Ari B.'s answer is good. You could also use assign() with a variable.
> makeTS <- function(dat) {
+ return(666)
+ }
> varName <- "tmp"
> tmp
Error: object 'tmp' not found
> assign(varName, makeTS(1))
> tmp
[1] 666
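(One caveat worth noting, not in the original answer: assign() writes into the current environment by default, so the example above works at the top level, but inside a function you would need to pass envir explicitly, e.g. assign(varName, makeTS(1), envir = globalenv()).)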
I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. But the JSON files are not a simple key-value parse; they require a lot of pre-processing. The pre-processing is coded and performs fairly well. But the challenge I am having is loading each record into a single object (data.table or dplyr object) efficiently.
This is very sparse data: I'll have over 2000 variables that will mostly be missing. Each record will have maybe a hundred variables set. The variables will be a mix of character, logical and numeric, and I do know the mode of each variable.
I thought the best way to avoid R copying the object on every update (or adding one row at a time) would be to create an empty data frame and then update specific fields after they are pulled from the JSON file. But doing this with a data frame is extremely slow; moving to a data.table is much better, but I'm still hoping to reduce the time to minutes instead of hours. See my example below:
timeMe <- function() {
  set.seed(1)
  names = paste0("A", seq(1:1200))
  # try with a data frame
  # outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  # try with data table
  outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  for(i in seq(100)) {
    # generate 100 columns (real data is in json)
    sparse.cols <- sample(1200, 100)
    # Each record is coming in as a list
    # Each column is either a character, logical, or numeric
    sparse.val <- lapply(sparse.cols, function(i) {
      if(i < 401) { # logical
        sample(c(TRUE, FALSE), 1)
      } else if (i < 801) { # numeric
        sample(seq(10), 1)
      } else { # character
        sample(LETTERS, 1)
      }
    }) # now we have a list with values to populate
    names(sparse.val) <- paste0("A", sparse.cols)
    # and here is the challenge and what takes a long time.
    # want to assign the ith row and the named column with each value
    for(x in names(sparse.val)) {
      val = sparse.val[[x]]
      # this is where the bottleneck is.
      # for data frame
      # outdf[i, x] <- val
      # for data table
      outdf[i, x := val]
    }
  }
  outdf
}
I thought the mode of each column might be getting set and reset with each update, but I have also tried pre-setting each column type and that didn't help.
For me, running this example with a data.frame (commented out above) takes around 22 seconds, while converting to a data.table brings it down to 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data.table here.
I follow your code except for the part where you construct sparse.val. There are minor errors in the way you assign columns. Don't forget to check that the answer is right while trying to optimise :).
First, the creation of data.table:
Since you say that you already know the types of the columns, it's important to generate the correct types up front. Otherwise, when you do DT[, LHS := RHS] and the RHS type is not equal to that of the LHS, the RHS will be coerced to the type of the LHS. In your case, all your numeric and character values will be converted to logical, as all columns are of logical type. This is not what you want.
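(A quick illustration of this coercion pitfall, not from the original answer; the exact behavior and warning text depend on the data.table version:)
DT <- data.table(x = rep(NA, 3))  # column x is logical
DT[2L, x := "hello"]  # sub-assign with a character RHS: older data.table
                      # versions warn and coerce the RHS to logical, so the
                      # value is silently lost (as.logical("hello") is NA);
                      # newer versions may promote the column instead, but
                      # the warning still signals a wrong upfront type
DT$x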
Creating a matrix therefore won't help (all columns will be of the same type), and it's also slow. Instead, I'd do it like this:
rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {  # seq_len: cols is a single number, not a vector
  if (i < 401L) rep(NA, rows)
  else if (i >= 402L & i < 801L) rep(NA_real_, rows)
  else rep(NA_character_, rows)
}))
Now we have the right types set. Next, I think the condition should be i >= 402L & i < 801L. Otherwise, you're assigning the first 401 columns as logical and then the first 801 columns as numeric, which, given that you know the types of the columns upfront, doesn't make much sense, right?
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time-consuming for-loop:
for(x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, x := val]
}
is not actually doing what you think it's doing. It's not assigning the values from val to the column whose name is stored in x. Instead, it's (over)writing, each time, to a column literally named x. Check your output.
This is not part of the optimisation; it's just to let you know what you actually want to do here.
for(x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, (x) := val]
}
Note the ( around x. Now it will be evaluated, and the value contained in x will be the name of the column to which val is assigned. It's a bit subtle, I understand. But this is necessary because it preserves the possibility of creating a column literally named x with DT[, x := val], for the cases where that is what you actually want.
Coming back to the optimisation: the good news is that your time-consuming for-loop reduces to a single call:
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
Putting it all together:
Your final function looks like this:
timeMe2 <- function() {
  set.seed(1L)
  rows = 100L
  cols = 1200L
  outdf <- as.data.table(lapply(seq_len(cols), function(i) {
    if (i < 401L) rep(NA, rows)
    else if (i >= 402L & i < 801L) rep(NA_real_, rows)
    else rep(NA_character_, rows)
  }))
  setnames(outdf, paste0("A", seq_len(cols)))
  for(i in seq(100)) {
    sparse.cols <- sample(1200L, 100L)
    sparse.val <- lapply(sparse.cols, function(i) {
      if(i < 401L) sample(c(TRUE, FALSE), 1)
      else if (i >= 402L & i < 801L) sample(seq(10), 1)
      else sample(LETTERS, 1)
    })
    set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
  }
  outdf
}
With these changes, your solution takes 9.84 seconds on my system, whereas the function above takes 0.34 seconds, a ~29x improvement. I think this is the result you're looking for. Please verify it.
HTH