R Dynamically build "list" in data.table (or ddply)

My aggregation needs vary among columns / data.frames. I would like to pass the "list" argument to the data.table dynamically.
As a minimal example:
require(data.table)
type <- c(rep("hello", 3), rep("bye", 3), rep("ok",3))
a <- (rep(1:3, 3))
b <- runif(9)
c <- runif(9)
df <- data.frame(cbind(type, a, b, c), stringsAsFactors=F)
DT <- data.table(df)
This call:
DT[, list(suma = sum(as.numeric(a)), meanb = mean(as.numeric(b)), minc = min(as.numeric(c))), by= type]
will have result similar to this:
    type suma     meanb      minc
1: hello    6 0.1332210 0.4265579
2:   bye    6 0.5680839 0.2993667
3:    ok    6 0.5694532 0.2069026
Future data.frames will have more columns that I will want to summarize differently. But for the sake of working with this small example: is there a way to pass the list programmatically?
I naïvely tried:
# create a different list
mylist <- "list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))"
# new call
DT[, mylist, by=type]
This does not throw an error, but instead of evaluating the expression it just returns the string once per group:
    type   mylist
1: hello   list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
2:   bye   list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
3:    ok   list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
Any hints appreciated! Best regards!
PS: sorry about these as.numeric() calls; I could not quite figure out why, but I needed them for the example to run.
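The reason, for what it's worth: cbind(type, a, b, c) mixes character and numeric inputs, so it produces a character matrix, and data.frame() then makes every column character. A sketch of a fix that keeps the numeric types:
# build the data.frame directly instead of via cbind(), which coerces
# everything to character
df <- data.frame(type, a, b, c, stringsAsFactors = FALSE)
sapply(df, class)
#        type           a           b           c
# "character"   "integer"   "numeric"   "numeric"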

This is explained in data.table FAQ 1.6; what you are looking for is quote and eval. Something like:
mycall <- quote(list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c))))
DT[, eval(mycall), by = type]
After a bit of head-banging, here is a very ugly way of constructing the call for ddply using .():
library(plyr)
myplyrcall <- .(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
do.call(ddply, c(.data = quote(df), .variables = 'type', .fun = quote(summarise), myplyrcall))
You could also use as.quoted, which has an as.quoted.character method, to construct the call using paste0:
myplc <-as.quoted(c("lengtha" = "length(as.numeric(a))", "maxb" = "max(as.numeric(b))", "meanc" = "mean(as.numeric(c))"))
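For example, the same named quoted list can be assembled with paste0 (a sketch; the funs and vars vectors are illustrative names, not from the answer):
funs <- c("length", "max", "mean")
vars <- c("a", "b", "c")
# produces lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), ...
myplc <- as.quoted(setNames(paste0(funs, "(as.numeric(", vars, "))"),
                            paste0(funs, vars)))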
This can be used with data.table as well!
dtcall <- as.quoted(mylist)[[1]]
DT[,eval(dtcall), by = type]
data.table all the way.

Another way is to use .SDcols to group together the columns on which you'd like to perform the same operation. Say you require columns a, d, e to be summed by type, whereas b, g should have their mean taken and c, f their median; then:
# constructing an example data.table:
set.seed(45)
dt <- data.table(type=rep(c("hello","bye","ok"), each=3), a=sample(9),
                 b=rnorm(9), c=runif(9), d=sample(9), e=sample(9),
                 f=runif(9), g=rnorm(9))
# type a b c d e f g
# 1: hello 6 -2.5566166 0.7485015 9 6 0.5661358 -2.2066521
# 2: hello 3 1.1773119 0.6559926 3 3 0.4586280 -0.8376586
# 3: hello 2 -0.1015588 0.2164430 1 7 0.9299597 1.7216593
# 4: bye 8 -0.2260640 0.3924327 8 2 0.1271187 0.4360063
# 5: bye 7 -1.0720503 0.3256450 7 8 0.5774691 0.7571990
# 6: bye 5 -0.7131021 0.4855804 6 9 0.2687791 1.5398858
# 7: ok 1 -0.4680549 0.8476840 2 4 0.5633317 1.5393945
# 8: ok 4 0.4183264 0.4402595 4 1 0.7592801 2.1829996
# 9: ok 9 -1.4817436 0.5080116 5 5 0.2357030 -0.9953758
# 1) set key
setkey(dt, "type")
# 2) group col-ids by similar operations
id1 <- which(names(dt) %in% c("a", "d", "e"))
id2 <- which(names(dt) %in% c("b","g"))
id3 <- which(names(dt) %in% c("c","f"))
# 3) now use these ids with the .SDcols parameter
dt1 <- dt[, lapply(.SD, sum), by="type", .SDcols=id1]
dt2 <- dt[, lapply(.SD, mean), by="type", .SDcols=id2]
dt3 <- dt[, lapply(.SD, median), by="type", .SDcols=id3]
# 4) merge them (grouping by the key column keeps each result keyed by "type", so the joins chain)
dt1[dt2[dt3]]
# type a d e b g c f
# 1: bye 20 21 19 -0.6704055 0.9110304 0.3924327 0.2687791
# 2: hello 11 13 16 -0.4936211 -0.4408838 0.6559926 0.5661358
# 3: ok 14 11 10 -0.5104907 0.9090061 0.5080116 0.5633317
If/when you have very many columns, writing out a list like the one you have might be cumbersome.

Another method (supporting the use of paste or paste0 to build the expression):
expr <- parse(text=mylist)
DT[, eval( expr ), by=type]
#-------
type lengtha maxb meanc
1: hello 3 0.8265407 0.5244094
2: bye 3 0.4955301 0.6289475
3: ok 3 0.9527455 0.5600915
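Since mylist is just a string, it can also be assembled programmatically with paste0; a sketch (the aggs vector is an illustrative name, not from the answer):
aggs <- c(lengtha = "length(as.numeric(a))",
          maxb    = "max(as.numeric(b))",
          meanc   = "mean(as.numeric(c))")
# yields "list(lengtha = length(as.numeric(a)), maxb = ..., meanc = ...)"
mylist <- paste0("list(", paste(names(aggs), aggs, sep = " = ", collapse = ", "), ")")
DT[, eval(parse(text = mylist)), by = type]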

I find it worrisome that eval is apparently part of the answer. From your question it is not clear to me whether and why you really want to do what you claim to want. Thus I demonstrate here that you can also use a function:
fun <- function(a, b, c) {
  list(lengtha = length(as.numeric(a)),
       maxb = max(as.numeric(b)),
       meanc = mean(as.numeric(c)))
}
DT[, fun(a,b,c), by=type]
type lengtha maxb meanc
1: hello 3 0.8792184 0.3745643
2: bye 3 0.8718397 0.4519999
3: ok 3 0.8900764 0.4511536
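If the column names need to vary as well, the same function can still be used without eval by passing the columns through .SD; a sketch (cols is an assumed character vector, not from the answer; the unnamed list matches fun's arguments positionally):
cols <- c("a", "b", "c")  # columns to pass to fun, in argument order
DT[, do.call(fun, unname(as.list(.SD))), by = type, .SDcols = cols]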

Related

Lapply function to list of data.tables by reference silently

I have a list of data.tables and I want to apply a function to each data.table. I have things set up to use := inside an lapply function. Everything works fine and my outputs are updated by reference, but my function also prints to the console. This is part of a much larger project, and printing this step to the console is not ideal.
How do I run this 'silently' without printing? Is there a better way to structure the workflow / code?
dt1 <- data.table(a = rnorm(1:10),
b = rnorm(1:10))
dt2 <- data.table(a = rnorm(1:10),
b = rnorm(1:10))
dts <- list(dt1, dt2)
lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
})
dts
dts now has a c column, but the outputs were displayed in the console. This code chunk is called from another function.
You can use a for loop
for(dt in dts) dt[, ':='(c = a + b)]
You can assign the result of the lapply call, which will suppress the output (the := call itself returns invisibly; it is the list returned by lapply that gets auto-printed when not assigned):
dts <- lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
})
We can use walk, which will not print anything to the console:
library(purrr)
walk(dts, ~ .x[, `:=`(c = a + b)])
dts
#[[1]]
# a b c
# 1: -0.1069952 0.1115983 0.004603111
# 2: 0.3228771 -0.8400846 -0.517207530
# 3: -1.6072728 -0.2727947 -1.880067477
# 4: 0.1715614 -0.3864995 -0.214938065
# 5: 1.8233350 -1.0786569 0.744678084
# 6: 0.2366026 -0.6166318 -0.380029253
# 7: 0.2373992 0.2251559 0.462555116
# 8: -0.1075611 -1.0418174 -1.149378504
# 9: 1.6742520 -0.5635583 1.110693774
#10: 2.4733842 2.1091365 4.582520731
#[[2]]
# a b c
# 1: -0.8332617 1.67201117 0.83874947
# 2: 1.3688393 1.12168046 2.49051974
# 3: 1.0208642 -1.18482073 -0.16395650
# 4: 0.6784662 2.15979872 2.83826493
# 5: -0.4351644 -0.04629453 -0.48145894
# 6: 1.3133550 -1.03423308 0.27912197
# 7: 1.0143396 -0.84787780 0.16646185
# 8: -0.9622108 0.92338456 -0.03882627
# 9: -0.3106202 1.08886031 0.77824008
#10: 0.7602507 -0.08996701 0.67028370
Or wrap the lapply call in invisible:
invisible(lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
}))
Using set, which updates by reference and returns its result invisibly, so nothing is printed:
for (i in seq_along(dts)) set(dts[[i]], j = "c", value = dts[[i]]$a + dts[[i]]$b)

How to make a fuzzy join in R using more than one variable on each side

I would like to join the two data frames:
a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))
with a condition like (x>start)&(x<end) in order to get such a result:
# x y
#1 1 a
#2 3 <NA>
#3 5 b
I don't want to make a potentially large cartesian product and then select only the few rows matching the condition, and I'd like a solution using the tidyverse (I am not interested in a solution using SQL, which would be a confession of failure). I thought of the fuzzyjoin package, but I cannot find examples fitting my need: the function to apply for the condition has only two arguments. I also tried to put start and end into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y)
# z y
#1 0, 2 a
#2 4, 6 b
but although the data looks fine, fuzzy_left_join doesn't accept it.
I search for solutions working in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric with arbitrary conditions).
UPDATE
I also want to be able to express conditions like (x=start+1)|(x=end+1) giving here:
# x y
#1 1 a
#2 3 a
#3 5 b
For this case you don't need multi_by or multi_match_fun; this works:
library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
# x start end y
# 1 1 0 2 a
# 2 3 NA NA <NA>
# 3 5 4 6 b
I eventually went into the code of fuzzy_join and found a way to do what I want, even without proper documentation. fuzzy_left_join doesn't work here, but there is the following way (not really pretty, and it actually does a cartesian product):
g <- function(x, y) (x > y[,"start"]) & (x < y[,"end"])
fuzzy_join(a, b, multi_by = list(x = "x", y = c("start", "end")),
           multi_match_fun = g, mode = "left") %>% select(x, y)
A data.table approach could be:
library(data.table)
name1 <- setdiff(names(setDT(b)), names(setDT(a)))
# update join with left-join semantics: add b's columns to a where the non-equi condition holds, then select the required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]
which gives
x y
1: 1 a
2: 3 <NA>
3: 5 b
Sample data:
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))
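For comparison, a plain non-equi left join returns the same matches in one step; note that in j the unprefixed x resolves to i's column (a sketch of an alternative, not from the answer):
setDT(a)
setDT(b)
b[a, .(x, y), on = .(start < x, end > x)]
#    x    y
# 1: 1    a
# 2: 3 <NA>
# 3: 5    b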
Update: in case you want to join both data frames on the (x == start+1) | (x == end+1) condition, you can try:
library(data.table)
DT1 <- as.data.table(a)
DT2 <- as.data.table(b)
#Perform 1st join on "x = start+1" and then another on "x = end+1". Finally row-bind both results.
DT <- rbindlist(list(DT1[DT2[, start_temp := start + 1], on = c(x = "start_temp"), .(x, y), nomatch = 0],
                     DT1[DT2[, end_temp := end + 1], on = c(x = "end_temp"), .(x, y), nomatch = 0]))
DT
# x y
#1: 1 a
#2: 5 b
#3: 3 a
A possible answer to explain what I am trying to do: extending dplyr in some way. I will be happy to know if there are ways to improve this solution or about problems I didn't see.
The solution avoids the cartesian product, but it duplicates one of the input data frames and the result into lists of data frames. I didn't include the final selection of the x and y columns, which is easy to code.
# needs dplyr (filter, mutate, left_join) and replyr
my_left_join <- function(.DATA1, .DATA2, .WHERE)
{
  call = as.list(match.call())
  df1 <- .DATA1
  df1$._row_ <- 1:nrow(df1)
  dfl1 <- replyr::replyr_split(df1, "._row_")
  eval(substitute(
    dfl2 <- mapply(function(.x)
      {filter(.DATA2, with(.x, WHERE)) %>%
          mutate(._row_ = .x$._row_)}
      , dfl1, SIMPLIFY = FALSE)
    , list(WHERE = call$.WHERE)))
  df2 <- replyr::replyr_bind_rows(dfl2)
  left_join(df1, df2, by = "._row_") %>% select(-._row_)
}
my_left_join(a,b,(x>start)&(x<end))
# x start end y
#1 1 0 2 a
#2 3 NA NA <NA>
#3 5 4 6 b
my_left_join(a,b,(x==(start+1))|(x==(end+1)))
# x start end y
#1 1 0 2 a
#2 3 0 2 a
#3 5 4 6 b
You can try a GenomicRanges solution. Note that findOverlaps treats the ranges as closed intervals, so boundary cases where x equals start or end also count as matches, unlike the strict inequalities in the question (it makes no difference for this data):
library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
a
x y
1 1 a
2 3 <NA>
3 5 b

R data.table dot product with matching column names (for each group)

I have a datatable of data and a datatable of fitted coefficients. I want to calculate the fitted value for each row.
dt = data.table(a = rep(c("x","y"), each = 5), b = rnorm(10), c = rnorm(10), d = rnorm(10))
coefs = data.table(a = c("x","y"), b = c(0, 1), d = c(2,3))
dt
# a b c d
# 1: x -0.25174915 -0.2130797 -0.67909764
# 2: x -0.35569766 0.6014930 0.35201386
# 3: x -0.31600957 0.4398968 -1.15475814
# 4: x -0.54113762 -2.3497952 0.64503654
# 5: x 0.11227873 0.0233775 -0.96891456
# 6: y 1.24077566 -1.2843439 1.98883516
# 7: y -0.23819626 0.9950835 -0.17279980
# 8: y 1.49353589 0.3067897 -0.02592004
# 9: y 0.01033722 -0.5967766 -0.28536224
#10: y 0.69882444 0.8702424 1.24131062
coefs # NB no "c" column
# a b d
#1: x 0 2
#2: y 1 3
For each a=="x" row in dt, I want 0*b+2*d; and for each a=="y" row in dt, I want 1*b+3*d.
Is there a data.table way to do this without hardcoding the column names? I'm happy to put the column names in a variable cols = colnames(coefs)[-1].
It's easy to loop over groups and rbind together, so if the grouping is causing trouble, please ignore that part.
Join the data.tables:
dt[coefs, res := b * i.b + d * i.d, on = "a"]
# a b c d res
#1: x 0.09901786 -0.362080111 -0.5108862 -1.0217723
#2: x -0.16128422 0.169655945 0.3199648 0.6399295
#3: x -0.79648896 -0.502279345 1.3828633 2.7657266
#4: x -0.26121421 0.480548972 -1.1559392 -2.3118783
#5: x 0.54085591 -0.601323442 1.3833795 2.7667590
#6: y 0.83662761 0.607666970 0.6320762 2.7328562
#7: y -1.92510391 -0.050515610 -0.3176544 -2.8780671
#8: y 1.65639926 -0.167090105 0.6830158 3.7054466
#9: y 1.48772354 -0.349713539 -1.2736467 -2.3332166
#10: y 1.49065993 0.008198885 -0.1923361 0.9136516
Usually you would use the matrix product here, but that would mean you would have to coerce the respective subset to a matrix. That would result in a copy being made, and since data.tables are mainly used for larger data, you want to avoid copies.
If you need dynamic column names, the most simple solution that comes to mind is actually an eval/parse construct:
cols = colnames(coefs)[-1]
expr <- parse(text = paste(paste(cols, paste0("i.", cols), sep = "*"), collapse = "+"))
#expression(b*i.b+d*i.d)
dt[coefs, res := eval(expr), on = "a"]
Maybe someone else can suggest a better solution.
Here is a solution using matrix multiplication:
dt[, res := as.matrix(.SD) %*% unlist(coefs[a == .BY, .SD, .SDcols = cols]),
by = "a", .SDcols = cols]
Of course this makes copies, which is potentially less efficient than the eval solution.
I found out that a data.table whose columns are all numeric supports elementwise arithmetic operations (+, -, *, /), but with no name matching, only positional matching.
> coefs
a b d
1: x 0 2
2: y 1 3
> coefs[, .(b,d)] * coefs[, .(b,d)]
b d
1: 0 4
2: 1 9
> coefs[, .(b,d)] * coefs[, .(d,b)]
b d
1: 0 0
2: 3 3
so here is a solution based on this (note that the matching is positional, so cols must select the columns in the same order from both tables):
> cols = colnames(coefs)[-1]
> zz = rowSums(coefs[dt[,.(a)], .SD, on = 'a', .SDcols = cols] * dt[, .SD, .SDcols = cols])
> dt[, newcol := zz]
Another alternative (but slower) approach is:
dt$res <- unsplit(Map(function(x, y) {x$b*y$b + x$d*y$d},
                      split(dt, dt$a == "x"),
                      split(coefs, coefs$a == "x")),
                  dt$a == "x")
dt
a b c d res
1: x 0.47859729 1.3479271 0.5691897 1.1383794
2: x 0.28491505 -0.3291934 1.8621365 3.7242730
3: x -1.43894695 1.5555413 0.3685772 0.7371544
4: x 0.04360066 0.1358920 0.5240700 1.0481400
5: x -1.39897890 -0.0175886 -0.6876451 -1.3752901
6: y -0.60952146 1.2331907 -0.3582176 -1.6841742
7: y 0.31777772 1.4090295 -0.4053615 -0.8983067
8: y 0.42758431 -0.3746061 2.1208417 6.7901094
9: y -0.60701063 -0.9232092 1.9386482 5.2089341
10: y -1.52042316 -0.8871454 -0.9314232 -4.3146927
This same code would work in base R as well if your data were already data.frames.

R - How to run average & max on different data.table columns based on multiple factors & return original colnames

I am changing my R code from data.frame + plyr to data.tables as I need a faster and more memory-efficient way to handle a big data set. Unfortunately, my R skills are woefully limited and I've hit a wall for the whole day. Would appreciate it if the SO experts here could enlighten me.
My Goals
Aggregate rows in my data.table based on 2 functions - average and max - run on selected columns (with column names passed via vector) while grouping by columns also passed via vector.
The resulting DT should contain the original column names.
There should not be unnecessary copying of the DT in order to conserve memory
My Test Code
DT = data.table( a=LETTERS[c(1,1,1:4)],b=4:9, c=3:8, d = rnorm(6),
e=LETTERS[c(rep(25,3),rep(26,3))], key="a" )
GrpVar1 <- "a"
GrpVar2 <- "e"
VarToMax <- "b"
VarToAve <- c( "c", "d")
What I tried but didn't work for me
DT[, list( b=max( b ), c=mean(c), d=mean(d) ), by=c( GrpVar1, GrpVar2 ) ]
# Hard-code col name - not what I want
DT[, list( max( get(VarToMax) ), mean( get(VarToAve) )), by=c( GrpVar1, GrpVar2 ) ]
# Col names become 'V1', 'V2', worse, 1 column goes missing - Not what I want either
DT[, list( get(VarToMax)=max( get(VarToMax) ),
get(VarToAve)=mean( get(VarToAve) ) ), by=c( GrpVar1, GrpVar2 ) ]
# Above code gave Error!
Additional Question
Based on my very limited understanding of DTs, the with = F argument should instruct R to parse the values of VarToMax and VarToAve, but running the code below leads to error.
DT[, list( max(VarToMax), mean(VarToAve) ), by=c( GrpVar1, GrpVar2 ), with=F ]
# Error in `[.data.table`(DT, , list(max(VarToMax), mean(VarToAve)), by = c(GrpVar1, :
# object 'ansvals' not found
# In addition: Warning message:
# In mean.default(VarToAve) :
# argument is not numeric or logical: returning NA
Existing SO solutions can't help
Arun's solution was how I got to this point, but I am very stuck. His other solution using lapply and .SDcols involves creating 2 extra data.tables, which does not meet my memory-conserving requirement.
dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]
I am SO confused over data.table! Any help would be most appreciated!
In a similar fashion to #David Arenburg, but using .SDcols in order to simplify the notation. I also show the code up to the merge.
DTaves <- DT[, lapply(.SD, mean), .SDcols = VarToAve, by = c(GrpVar1, GrpVar2)]
DTmaxs <- DT[, lapply(.SD, max), .SDcols = VarToMax, by = c(GrpVar1, GrpVar2)]
merge(DTmaxs, DTaves)
## a e b c d
## 1: A Y 6 4 0.2230091
## 2: B Z 7 6 0.5909434
## 3: C Z 8 7 -0.4828223
## 4: D Z 9 8 -1.3591240
Alternatively, you can do this in one go by subsetting the .SD using the .. notation to look for VarToAve in the parent frame of .SD (as opposed to a column named VarToAve)
DT[, c(lapply(.SD[, ..VarToAve], mean),
lapply(.SD[, ..VarToMax], max)),
by = c(GrpVar1, GrpVar2)]
## a e c d b
## 1: A Y 4 0.2230091 6
## 2: B Z 6 0.5909434 7
## 3: C Z 7 -0.4828223 8
## 4: D Z 8 -1.3591240 9
Here's my humble attempt
DT[, as.list(c(setNames(max(get(VarToMax)), VarToMax),
lapply(.SD[, ..VarToAve], mean))),
c(GrpVar1, GrpVar2)]
# a e b c d
# 1: A Y 6 4 -0.8000173
# 2: B Z 7 6 0.2508633
# 3: C Z 8 7 1.1966517
# 4: D Z 9 8 1.7291615
Or, for maximum efficiency, you could use a colMeans and eval(as.name()) combination instead of lapply and get:
DT[, as.list(c(setNames(max(eval(as.name(VarToMax))), VarToMax),
colMeans(.SD[, ..VarToAve]))),
c(GrpVar1, GrpVar2)]
# a e b c d
# 1: A Y 6 4 -0.8000173
# 2: B Z 7 6 0.2508633
# 3: C Z 8 7 1.1966517
# 4: D Z 9 8 1.7291615
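If this pattern recurs, it can be wrapped in a small helper; a sketch (the agg name and signature are illustrative, not from either answer):
agg <- function(dt, by, maxcols, avecols) {
  # max over one set of columns, mean over another, in a single grouping pass
  dt[, c(lapply(.SD[, maxcols, with = FALSE], max),
         lapply(.SD[, avecols, with = FALSE], mean)),
     by = by]
}
agg(DT, by = c(GrpVar1, GrpVar2), maxcols = VarToMax, avecols = VarToAve)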

Apply different functions to different sets of columns by group

I have a data.table with the following features:
bycols: columns that divide the data into groups
nonvaryingcols: columns that are constant within each group (so that taking the first item from within each group and carrying that through would be sufficient)
datacols: columns to be aggregated / summarized (e.g. sum them within group)
I'm curious what the most efficient way to do what you might call a mixed collapse, taking all three of the above inputs as character vectors. It doesn't have to be the absolute fastest, but fast enough with reasonable syntax would be ideal.
Example data, where the different sets of columns are stored in character vectors.
require(data.table)
set.seed(1)
bycols <- c("g1","g2")
datacols <- c("dat1","dat2")
nonvaryingcols <- c("nv1","nv2")
test <- data.table(
  g1 = rep( letters, 10 ),
  g2 = rep( c(LETTERS,LETTERS), each = 5 ),
  dat1 = runif( 260 ),
  dat2 = runif( 260 ),
  nv1 = rep( seq(130), 2),
  nv2 = rep( seq(130), 2)
)
Final data should look like:
g1 g2 dat1 dat2 nv1 nv2
1: a A 0.8403809 0.6713090 1 1
2: b A 0.4491883 0.4607716 2 2
3: c A 0.6083939 1.2031960 3 3
4: d A 1.5510033 1.2945761 4 4
5: e A 1.1302971 0.8573135 5 5
6: f B 1.4964821 0.5133297 6 6
I have worked out two different ways of doing it, but one is horridly inflexible and unwieldy, and one is horridly slow. Will post tomorrow if no one has come up with something better by then.
As always with this sort of programmatic use of [.data.table, the general strategy is to construct an expression e that can be evaluated in the j argument. Once you understand that (as I'm sure you do), it just becomes a game of computing on the language to get a j-slot expression that looks like what you'd write at the command line.
Here, for instance, and given the particular values in your example, you'd like a call that looks like:
test[, list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1]),
     by=c("g1", "g2")]
so the expression you'd like evaluated in the j-slot is
list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1])
Most of the following function is taken up with constructing just that expression:
f <- function(dt, bycols, datacols, nvcols) {
  e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
         sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
  e <- as.call(c(as.symbol("list"), e))
  dt[, eval(e), by = bycols]
}
f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
## g1 g2 dat1 dat2 nv1 nv2
## 1: a A 0.8403809 0.6713090 1 1
## 2: b A 0.4491883 0.4607716 2 2
## 3: c A 0.6083939 1.2031960 3 3
## 4: d A 1.5510033 1.2945761 4 4
## 5: e A 1.1302971 0.8573135 5 5
## ---
## 126: v Z 0.5627018 0.4282380 126 126
## 127: w Z 0.7588966 1.4429034 127 127
## 128: x Z 0.7060596 1.3736510 128 128
## 129: y Z 0.6015249 0.4488285 129 129
## 130: z Z 1.5304034 1.6012207 130 130
Here's what I had come up with. It works, but very slowly.
test[, {
  cbind(
    as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
    .SD[, ..nonvaryingcols][1]
  )
}, by = bycols ]
Benchmarks
FunJosh <- function() {
  f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
}
FunAri <- function() {
  test[, {
    cbind(
      as.data.frame( t( sapply( .SD[, ..datacols], sum ) ) ),
      .SD[, ..nonvaryingcols][1]
    )
  }, by = bycols ]
}
FunEddi <- function() {
  cbind(
    test[, lapply(.SD, sum), by = bycols, .SDcols = datacols],
    test[, lapply(.SD, "[", 1), by = bycols, .SDcols = nonvaryingcols][, ..nonvaryingcols]
  )
}
library(microbenchmark)
identical(FunJosh(), FunAri())
# [1] TRUE
microbenchmark(FunJosh(), FunAri(), FunEddi())
#Unit: milliseconds
# expr min lq median uq max neval
# FunJosh() 2.749164 2.958478 3.098998 3.470937 6.863933 100
# FunAri() 246.082760 255.273839 284.485654 360.471469 509.740240 100
# FunEddi() 5.877494 6.229739 6.528205 7.375939 112.895573 100
At least two orders of magnitude slower than #joshobrien's solution. Edit: #Eddi's solution is much faster as well, and shows that cbind wasn't optimal but could be fairly fast in the right hands. Might be all the transforming and sapplying I was doing rather than just directly using lapply.
Just for a bit of variety, here is a variant of #Josh O'brien's solution that uses the bquote operator instead of call. I did try to replace the final as.call with a bquote, but because bquote doesn't support list splicing (e.g., see this question), I couldn't get that to work.
f <- function(dt, bycols, datacols, nvcols) {
  datacols = sapply(datacols, as.symbol)
  nvcols = sapply(nvcols, as.symbol)
  e = c(lapply(datacols, function(x) bquote(sum(.(x)))),
        lapply(nvcols, function(x) bquote(.(x)[1])))
  e = as.call(c(as.symbol("list"), e))
  dt[, eval(e), by = bycols]
}
> f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
g1 g2 dat1 dat2 nv1 nv2
1: a A 0.8404 0.6713 1 1
2: b A 0.4492 0.4608 2 2
3: c A 0.6084 1.2032 3 3
4: d A 1.5510 1.2946 4 4
5: e A 1.1303 0.8573 5 5
---
126: v Z 0.5627 0.4282 126 126
127: w Z 0.7589 1.4429 127 127
128: x Z 0.7061 1.3737 128 128
129: y Z 0.6015 0.4488 129 129
130: z Z 1.5304 1.6012 130 130
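An aside: since R 4.0.0, bquote() has gained a splice argument, so the final as.call can be replaced by a spliced ..() call. A sketch, assuming R >= 4.0.0 (the f2 name is illustrative; names(e) supplies the column names):
f2 <- function(dt, bycols, datacols, nvcols) {
  e <- c(lapply(datacols, function(x) bquote(sum(.(as.symbol(x))))),
         lapply(nvcols, function(x) bquote(.(as.symbol(x))[1])))
  names(e) <- c(datacols, nvcols)
  # ..(e) splices the computed list of calls into the list() expression
  dt[, eval(bquote(list(..(e)), splice = TRUE)), by = bycols]
}
f2(test, bycols = bycols, datacols = datacols, nvcols = nonvaryingcols)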
