I'm working with a fairly large graph in igraph for R (~5 million vertices, 40 million edges).
I want to create a new attribute for each vertex that holds the average value of an attribute (X) across that vertex's connections.
For example:
Person A has an X value of 10 and is connected to persons B, C and D, who have X values of 20, 50 and 65 respectively. I want to assign a new value of 45 to Person A (the average of 20, 50 and 65).
I'm currently using the following method (taken from another Stack Overflow answer), running on 10 cores:
adjcency_list <- get.adjlist(g)
avg_contact_val <- ldply(adjcency_list, function(neis){ mean(V(g)[neis]$X, na.rm = TRUE) },
  .parallel = TRUE   # requires a registered parallel backend (e.g. doMC/doParallel)
)
V(g)$avg_contact_val <- avg_contact_val
This works exactly as I need it to, but it doesn't scale very well and would take a (very!) long time to do on the full graph.
Is there a more efficient method of doing this?
Could this fall under a PageRank-type algorithm, using the X value instead of degree?
Would it be possible to use a GPU somehow?
Would this be quicker in igraph Python?
EDIT:
Here's some sample data and an attempt at the approaches suggested:
set.seed(12345)
g <- erdos.renyi.game(10000, .0005)
V(g)$NAME <- c(1:10000)
V(g)$X <- round(runif(10000,0,30))
adjcency_list <- get.adjlist(g)
sub_ages <- data.frame(NAME = V(g)$NAME, X = V(g)$X)
dta.table <- data.table(sub_ages, key = "NAME")
DATA TABLE APPROACH
system.time(
avg_contact_ages <- ldply(adjcency_list,
function(neis){
mean(dta.table[neis,mean(X)], na.rm = T)
}, .progress = "tk"
)
)
user system elapsed
38.87 1.50 40.37
DATA FRAME APPROACH
sub_ages2 <- data.frame(row.names = V(g)$NAME, X = V(g)$X)
system.time(
avg_contact_ages <- ldply(adjcency_list,
function(neis){
mean(sub_ages2[neis, "X"], na.rm = T)
}, .progress = "tk"
)
)
user system elapsed
8.69 1.28 9.99
ORIGINAL APPROACH
system.time(
avg_contact_ages <- ldply(adjcency_list,
function(neis){
mean(V(g)[neis]$X, na.rm = T)
} , .progress = "tk"
)
)
user system elapsed
16.74 2.35 19.14
Shadow's approach
system.time(
avg_nei <- ldply(V(g), function(vert){
mean(get.vertex.attribute(g, "X", index=neighbors(g,vert)), na.rm=TRUE)
}, .progress = "tk")
)
user system elapsed
8.80 1.42 10.23
Is there a more efficient method of doing this?
I think so. Do not call V(g) all the time, but put the attribute in a vector, and index it. If you include some example data, then I'll also include some code.
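In outline, a minimal (untimed) sketch of that idea, reusing the adjcency_list and X attribute from the question:
xvec <- V(g)$X                                  # pull the attribute out of the graph once
avg_contact_val <- vapply(adjcency_list, function(neis) {
  mean(xvec[as.integer(neis)], na.rm = TRUE)    # index the plain vector, not V(g)
}, numeric(1))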
Could this fall under a page rank type algorithm using the x value instead of degree
No. PageRank is recursive: your rank depends on the whole network, not only on the scores of your neighbors.
Would it be possible to use a GPU somehow?
Not with igraph. But you can certainly make this fast enough without a GPU, so I would not go that way.
Would this be quicker in igraph Python?
That depends on how you write it. If you write it the right way in R, then it will not be faster in Python either, in my opinion.
EDIT:
I left out the progress bar, because that is actually the slowest part.
Fastest solution above with data frame
system.time({
sub_ages2 <- data.frame(row.names = V(g)$NAME, X = V(g)$X);
avg_contact_ages <- ldply(adjcency_list, function(neis) {
mean(sub_ages2[neis, "X"], na.rm = T)
})
})
# user system elapsed
# 0.368 0.020 0.386
Slightly faster with sapply
system.time({
sub_ages2 <- data.frame(row.names = V(g)$NAME, X = V(g)$X);
avg_contact_ages <- sapply(adjcency_list, function(neis) {
mean(sub_ages2[neis, "X"], na.rm = TRUE)
})
})
# user system elapsed
# 0.340 0.017 0.356
Using factors
system.time({
adj_vec <- unlist(adjcency_list)
adj_fac <- factor(rep(seq_along(adjcency_list),
sapply(adjcency_list, length)),levels=seq_len(vcount(g)))
avg_contact_ages <- tapply(V(g)$X[adj_vec], adj_fac, mean, na.rm=TRUE)
})
# user system elapsed
# 0.131 0.008 0.138
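To write the result back onto the graph as in the original question (not shown in the timing above): the tapply result is already in vertex order, one entry per factor level, and vertices with no neighbours come out as NA.
V(g)$avg_contact_val <- as.numeric(avg_contact_ages)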
If you need more speedup, you'll probably need to go to C/C++; Rcpp would be a relatively easy solution.
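If it comes to that, a rough sketch of the Rcpp route could look like the following. This is only an illustration, not benchmarked; neigh_mean_cpp is a made-up helper (not part of igraph), and the adjacency list is assumed to hold 1-based vertex ids.
library(Rcpp)
cppFunction('
NumericVector neigh_mean_cpp(List adj, NumericVector x) {
  int n = adj.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    IntegerVector neis = adj[i];
    double s = 0.0; int cnt = 0;
    for (int j = 0; j < neis.size(); ++j) {
      double v = x[neis[j] - 1];                 // adjacency ids are 1-based
      if (!NumericVector::is_na(v)) { s += v; ++cnt; }
    }
    out[i] = cnt > 0 ? s / cnt : NA_REAL;
  }
  return out;
}')
adj_int <- lapply(adjcency_list, as.integer)     # plain integer vectors for Rcpp
avg_contact_ages <- neigh_mean_cpp(adj_int, V(g)$X)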
The function get.vertex.attribute adds some speed. But for the size of your graph, that probably won't be enough. Anyway, here's my slightly faster version (in my benchmark tests on much smaller graphs, it's about 2.5 times faster than your version):
avg_nei <- ldply(V(g), function(vert){
mean(get.vertex.attribute(g, "X", index=neighbors(g,vert)), na.rm=TRUE)
}, .parallel = TRUE)
V(g)$avg_contact_val <- avg_nei
Related
I'm using the R package data.table to read large amounts of data and analyse them. What I was wondering is why selecting rows from a data.table is so much slower than from a matrix.
require(data.table)
## create some random data
n = 1000
p = 1000
set.seed(1)
data.raw <- matrix(rnorm(n*p), nrow = n, ncol = p)
rownames(data.raw) <- lapply(1:n, FUN = function(x, length)paste(sample(c(letters, LETTERS), length, replace=TRUE), collapse=""), length = 10)
colnames(data.raw) <- paste0("X", 1:p)
#do the same thing as data.table
data.t <- data.table(data.raw)
data.t[, id := rownames(data.raw)]
setkey(data.t, id)
## now select one row after the other in both matrix and data.table
system.time(for(r in rownames(data.raw)) y <- data.raw[r, ])
# user system elapsed
# 0.016 0.000 0.017
system.time(for(r in data.t$id) y <- data.t[r])
# user system elapsed
# 30.580 0.000 30.608
Even for this relatively small example, data.table is extremely slow, even though I'm using setkey. Is there any way to improve this performance?
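For reference, most of the gap in a loop like this is the per-call overhead of [.data.table rather than the lookup itself; a single keyed lookup for all ids at once should be much closer to the matrix timing. A sketch of my own, not a measured result:
system.time(y.all <- data.t[rownames(data.raw)])   # one join on the key instead of 1000 separate [.data.table calls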
I am trying to develop a function to "synchronise" NAs among the layers of a raster stack, i.e. to make sure that for any given pixel, if one layer has an NA, then all layers are set to NA for that pixel.
This is particularly useful when combining rasters from varying sources for species distribution modelling, because some models do not handle NAs properly.
I have found two ways to do this, but I find neither of them satisfactory. One requires the function getValues and thus is not usable for very large stacks or computers with little RAM. The other is more memory-safe but is much slower. So I am here to ask whether anyone has an idea to improve my attempts.
Here are the two possibilities:
Using getValues()
syncNA1 <- function (x)
{
val <- getValues(x)
NA.pos <- unique(which(is.na(val), arr.ind = T)[, 1])
val[NA.pos, ] <- NA
x <- setValues(x, val)
return(x)
}
Using calc()
syncNA2 <- function(y)
{
calc(y, na.rm = T, fun = function(x, na.rm = na.rm)
{
if(any(is.na(x)))
{
rep(NA, length(x))
} else
{
x
}
})
}
Now a demonstration of their respective computing times for the same stack:
> system.time(
+ b1 <- syncNA1(a1)
+ )
user system elapsed
3.04 0.15 3.20
> system.time(
+ b2 <- syncNA2(a1)
+ )
user system elapsed
5.89 0.19 6.08
Many thanks for your help,
Boris
With a stack named "s", I would first use calc(s, fun = sum) to compute a mask layer that records the location of all cells with an NA value in at least one of the stack's layers. mask() will then allow you to apply that mask to every layer in the stack.
Here's an example:
library(raster)
## Construct reproducible data! (Here a raster stack with NA values in each layer)
m <- raster(ncol=10, nrow=10)
n <- raster(ncol=10, nrow=10)
m[] <- runif(ncell(m))
n[] <- runif(ncell(n)) * 10
m[m < 0.5] <- NA
n[n < 5] <- NA
s <- stack(m,n)
## Synchronize the NA values
s2 <- mask(s, calc(s,fun = sum))
## Check that it worked
plot(s2)
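If you want a programmatic check in addition to the plot (my own addition, not part of the original answer), you can compare the NA positions across layers:
identical(which(is.na(values(s2[[1]]))), which(is.na(values(s2[[2]]))))
# should be TRUE: both layers now have NA in exactly the same cells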
I don't know about speed, but you might try converting to an array, loading up the NA, and converting back. pseudocode:
xarray <- as.array(xstack)                      # rows x cols x layers
ind.na <- which(is.na(xarray), arr.ind = TRUE)  # array indices of every NA
for (j in seq_len(nrow(ind.na))) {
  xarray[ind.na[j, 1], ind.na[j, 2], ] <- NA    # blank that cell in all layers
}
nastack <- brick(xarray)                        # note: extent/CRS would need to be restored
I haven't verified the correct choice of indices there, nor have I verified I converted back to raster stack correctly, but I hope you get the idea.
EDIT: I ran a timing test with 1000x1000 rasters, but otherwise as Josh created them.
microbenchmark(josh(s),syncNA1(s),syncNA2(s),times=5)
Unit: milliseconds
       expr       min        lq    median        uq        max neval
    josh(s)  774.2363  789.1653  800.2511  806.5364   809.9087     5
 syncNA1(s)  652.3928  659.8327  692.3578  695.8057   743.9123     5
 syncNA2(s) 7951.3918 8291.7917 8604.2226 8606.3432 10254.4739     5
I ended up building a hybrid of syncNA1 and Josh's solution.
This function uses the faster in-memory approach when the computer has enough RAM, and falls back to the memory-safe approach otherwise:
synchroniseNA <- function(x)
{
if(canProcessInMemory(x, n = 2))
{
val <- getValues(x)
NA.pos <- unique(which(is.na(val), arr.ind = T)[, 1])
val[NA.pos, ] <- NA
x <- setValues(x, val)
return(x)
} else
{
x <- mask(x, calc(x, fun = sum))
return(x)
}
}
However, I only determined empirically that the amount of RAM used by the values is about twice the size of the raster file (hence n = 2 in canProcessInMemory()), and I am not entirely sure that is right.
I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.
I am initializing an empty 2-column data frame, as follows.
df = data.frame(x = numeric(), y = character())
Then, my goal is to iterate through a list of values and, in each iteration, append the values to the end of the data frame. I started with the following code.
for (i in 1:10) {
df$x = rbind(df$x, i)
df$y = rbind(df$y, toString(i))
}
I also attempted the functions c, append, and merge without success. Please let me know if you have any suggestions.
Update from comment:
I don't presume to know how R was meant to be used, but I wanted to avoid the extra line of code that would be required to update the indices on every iteration, and I cannot easily preallocate the size of the data frame because I don't know how many rows it will ultimately need. Remember that the above is merely a toy example meant to be reproducible. Either way, thanks for your suggestion!
Update
Not knowing what you are trying to do, I'll share one more suggestion: Preallocate vectors of the type you want for each column, insert values into those vectors, and then, at the end, create your data.frame.
Continuing with Julian's f3 (a preallocated data.frame) as the fastest option so far, defined as:
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
Here's a similar approach, but one where the data.frame is created as the last step.
# Use preallocated vectors
f4 <- function(n) {
x <- numeric(n)
y <- character(n)
for (i in 1:n) {
x[i] <- i
y[i] <- i
}
data.frame(x, y, stringsAsFactors=FALSE)
}
microbenchmark from the "microbenchmark" package will give us more comprehensive insight than system.time:
library(microbenchmark)
microbenchmark(f1(1000), f3(1000), f4(1000), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# f1(1000) 1024.539618 1029.693877 1045.972666 1055.25931 1112.769176 5
# f3(1000) 149.417636 150.529011 150.827393 151.02230 160.637845 5
# f4(1000) 7.872647 7.892395 7.901151 7.95077 8.049581 5
f1() (the rbind-in-a-loop approach benchmarked further down) is incredibly inefficient because of how often it calls data.frame and because growing an object that way is generally slow in R. f3() is much improved due to preallocation, but the data.frame structure itself might be part of the bottleneck here. f4() tries to bypass that bottleneck without compromising the approach you want to take.
Original answer
This is really not a good idea, but if you wanted to do it this way, I guess you can try:
for (i in 1:10) {
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
Note that there is one other problem in your code: you should set stringsAsFactors = FALSE if you don't want the character values converted to factors. Use: df = data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
Let's benchmark the three solutions proposed:
# use rbind
f1 <- function(n){
df <- data.frame(x = numeric(), y = character())
for(i in 1:n){
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
df
}
# use list
f2 <- function(n){
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
for(i in 1:n){
df[i,] <- list(i, toString(i))
}
df
}
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
system.time(f1(1000))
# user system elapsed
# 1.33 0.00 1.32
system.time(f2(1000))
# user system elapsed
# 0.19 0.00 0.19
system.time(f3(1000))
# user system elapsed
# 0.14 0.00 0.14
The best solution is to pre-allocate space (as intended in R). The next-best solution is to use a list, and the worst (at least based on these timing results) appears to be rbind.
Suppose you simply don't know the size of the data.frame in advance. It could be a few rows or a few million. You need some sort of container that grows dynamically. Taking into consideration my experience and all the related answers on SO, I came up with 4 distinct solutions:
rbindlist to the data.frame
Use data.table's fast set operation and couple it with manually doubling the table when needed.
Use RSQLite and append to the table held in memory.
Use data.frame's own ability to grow, with a custom environment (which has reference semantics) to store the data.frame so that it is not copied on return.
Here is a test of all the methods, for both small and large numbers of appended rows. Each method has 3 functions associated with it:
create(first_element) that returns the appropriate backing object with first_element put in.
append(object, element) that appends the element to the end of the table (represented by object).
access(object) gets the data.frame with all the inserted elements.
rbindlist to the data.frame
That one is quite easy and straightforward:
create.1<-function(elems)
{
return(as.data.table(elems))
}
append.1<-function(dt, elems)
{
return(rbindlist(list(dt, elems),use.names = TRUE))
}
access.1<-function(dt)
{
return(dt)
}
data.table::set + manually doubling the table when needed.
I will store the true length of the table in a rowcount attribute.
create.2<-function(elems)
{
return(as.data.table(elems))
}
append.2<-function(dt, elems)
{
n<-attr(dt, 'rowcount')
if (is.null(n))
n<-nrow(dt)
if (n==nrow(dt))
{
tmp<-elems[1]
tmp[[1]]<-rep(NA,n)
dt<-rbindlist(list(dt, tmp), fill=TRUE, use.names=TRUE)
setattr(dt,'rowcount', n)
}
pos<-as.integer(match(names(elems), colnames(dt)))
for (j in seq_along(pos))
{
set(dt, i=as.integer(n+1), pos[[j]], elems[[j]])
}
setattr(dt,'rowcount',n+1)
return(dt)
}
access.2<-function(elems)
{
n<-attr(elems, 'rowcount')
return(as.data.table(elems[1:n,]))
}
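For clarity, using method 2 directly (outside the benchmark harness further down) would look roughly like this, given the definitions above:
dt <- create.2(list(a = 1, b = 2))        # backing data.table
dt <- append.2(dt, list(a = 3, b = 4))    # doubles the table, then fills the new row by reference
access.2(dt)                              # data.table with the rows (1, 2) and (3, 4)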
SQL should be optimized for fast record insertion, so I initially had high hopes for the RSQLite solution.
This is basically a copy & paste of Karsten W.'s answer on a similar thread.
create.3<-function(elems)
{
con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
return(con)
}
append.3<-function(con, elems)
{
RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append=TRUE)
return(con)
}
access.3<-function(con)
{
return(RSQLite::dbReadTable(con, "t", row.names=NULL))
}
data.frame's own row-appending + custom environment.
create.4<-function(elems)
{
env<-new.env()
env$dt<-as.data.frame(elems)
return(env)
}
append.4<-function(env, elems)
{
env$dt[nrow(env$dt)+1,]<-elems
return(env)
}
access.4<-function(env)
{
return(env$dt)
}
The test suite:
For convenience I will use one test function to cover them all via indirect calling. (I checked: using do.call instead of calling the functions directly doesn't make the code run measurably longer.)
test<-function(id, n=1000)
{
n<-n-1
el<-list(a=1,b=2,c=3,d=4)
o<-do.call(paste0('create.',id),list(el))
s<-paste0('append.',id)
for (i in 1:n)
{
o<-do.call(s,list(o,el))
}
return(do.call(paste0('access.', id), list(o)))
}
Let's see the performance for n = 10 insertions.
I also added 'placebo' functions (with suffix 0) that don't do anything, just to measure the overhead of the test setup.
library(microbenchmark)
library(ggplot2)   # needed for autoplot()
r <- microbenchmark(test(0, n=10), test(1, n=10), test(2, n=10), test(3, n=10), test(4, n=10))
autoplot(r)   # plot not reproduced here
For 1E5 rows (measurements done on an Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz):
nr  function     time
 4  data.frame  228.251
 3  sqlite      133.716
 2  data.table    3.059
 1  rbindlist   169.998
 0  placebo       0.202
It looks like the SQLite-based solution, although it regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!
Summary
If you know that you will append a rather small number of rows (n <= 100), go ahead and use the simplest possible solution: just assign the rows to the data.frame using bracket notation and ignore the fact that the data.frame is not pre-populated (see the sketch below).
For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
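The "bracket notation" mentioned in the first recommendation is simply row assignment past the current end of the data.frame; a small sketch of my own:
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
df[nrow(df) + 1, ] <- list(10, "ten")     # grows df by one row; fine for small n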
Update with purrr, tidyr & dplyr
As the question is already dated (6 years), the existing answers are missing a solution based on the newer packages tidyr and purrr. So for people working with these packages, I want to add a solution to the previous answers, which are all quite interesting.
The biggest advantage of purrr and tidyr is better readability, IMHO.
purrr replaces lapply with the more flexible map() family,
tibble (rather than tidyr itself) provides the super-intuitive add_row() - it just does what it says :)
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
This solution is short and intuitive to read, and it's relatively fast:
system.time(
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
0.756 0.006 0.766
It scales almost linearly, so for 1e5 rows, the performance is:
system.time(
map_df(1:100000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
76.035 0.259 76.489
which would rank it second, right after data.table (if you ignore the placebo), in the benchmark by @Adam Ryczkowski above:
nr  function     time
 4  data.frame  228.251
 3  sqlite      133.716
 2  data.table    3.059
 1  rbindlist   169.998
 0  placebo       0.202
A more generic solution might be the following.
extendDf <- function (df, n) {
withFactors <- sum(sapply (df, function(X) (is.factor(X)) )) > 0
nr <- nrow (df)
colNames <- names(df)
for (c in 1:length(colNames)) {
if (is.factor(df[,c])) {
col <- vector (mode='character', length = nr+n)
col[1:nr] <- as.character(df[,c])
col[(nr+1):(n+nr)]<- rep(col[1], n) # to avoid extra levels
col <- as.factor(col)
} else {
col <- vector (mode=mode(df[1,c]), length = nr+n)
class(col) <- class (df[1,c])
col[1:nr] <- df[,c]
}
if (c==1) {
newDf <- data.frame (col ,stringsAsFactors=withFactors)
} else {
newDf[,c] <- col
}
}
names(newDf) <- colNames
newDf
}
The function extendDf() extends a data frame with n rows.
As an example:
aDf <- data.frame (l=TRUE, i=1L, n=1, c='a', t=Sys.time(), stringsAsFactors = TRUE)
extendDf (aDf, 2)
# l i n c t
# 1 TRUE 1 1 a 2016-07-06 17:12:30
# 2 FALSE 0 0 a 1970-01-01 01:00:00
# 3 FALSE 0 0 a 1970-01-01 01:00:00
system.time (eDf <- extendDf (aDf, 100000))
# user system elapsed
# 0.009 0.002 0.010
system.time (eDf <- extendDf (eDf, 100000))
# user system elapsed
# 0.068 0.002 0.070
Let's take a vector 'point' which has the numbers 1 to 5:
point = c(1,2,3,4,5)
If we want to append the number 6 anywhere inside the vector, the command below may come in handy:
i) Vectors
new_var = append(point, 6 ,after = length(point))
ii) columns of a table
new_var = append(point, 6 ,after = length(mtcars$mpg))
The append command takes three arguments:
the vector/column to be modified,
the value(s) to be included in the modified vector, and
a subscript after which the values are to be appended.
My solution is almost the same as the original answer, but it didn't work for me. So I gave names to the columns, and it works:
painel <- rbind(painel, data.frame("col1" = xtweets$created_at,
"col2" = xtweets$text))
I would expect cbind.xts(...) and do.call(cbind.xts, ...) to run in a similar elapsed time.
That was true for R 2.11 and R 2.14.
For R 2.15.2 and xts 0.8-8, the do.call(cbind.xts, ...) variant performs drastically slower, which effectively breaks my previous code.
As Josh Ulrich notes in a comment below, the xts package maintainers are aware of this problem. In the meantime, is there a convenient workaround?
Reproducible example:
library(xts)
secs <- function (rows, from = as.character(Sys.time()), cols = 1, by = 1)
{
deltas <- seq(from = 0, by = by, length.out = rows)
nacol <- matrix(data = NA, ncol = cols, nrow = rows)
xts(x = nacol, order.by = strptime(from, format = "%Y-%m-%d %X") +
deltas)
}
n <- 20
d1 <- secs(rows=n*100,cols=n)
d2 <- secs(rows=n*100,cols=n)
system.time(cbind.xts(d1,d2))
versus
system.time(do.call(cbind.xts, list(d1,d2)))
One workaround is to set quote=TRUE in do.call:
R> system.time(cb <- cbind.xts(d1,d2))
user system elapsed
0.004 0.000 0.004
R> system.time(dc <- do.call(cbind.xts, list(d1,d2), quote=TRUE))
user system elapsed
0.000 0.004 0.004
R> identical(cb,dc)
[1] TRUE
The slowness is caused by do.call evaluating the arguments before evaluating the function call by default, which causes the call to be much larger. For example, compare these two calls:
call("cbind", d1, d2) # huge
call("cbind", quote(d1), quote(d2)) # dainty
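A quick way to see the size difference being described (my own illustration, assuming d1 and d2 from the question):
print(object.size(call("cbind", d1, d2)), units = "Kb")                # carries the full xts objects
print(object.size(call("cbind", quote(d1), quote(d2))), units = "Kb")  # just two symbols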
I am trying to take a very large set of records with multiple indices, calculate an aggregate statistic on groups determined by a subset of the indices, and then insert that into every row in the table. The issue here is that these are very large tables - over 10M rows each.
Code for reproducing the data is below.
The basic idea is that there are a set of indices, say ix1, ix2, ix3, ..., ixK. Generally, I am choosing only a couple of them, say ix1 and ix2. Then, I calculate an aggregation of all the rows with matching ix1 and ix2 values (over all combinations that appear), for a column called val. To keep it simple, I'll focus on a sum.
I have tried the following methods
Via sparse matrices: convert the values to a coordinate list, i.e. (ix1, ix2, val), then create a sparseMatrix - this nicely sums up everything, and then I need only convert back from the sparse matrix representation to the coordinate list. Speed: good, but it is doing more than is necessary and it doesn't generalize to higher dimensions (e.g. ix1, ix2, ix3) or more general functions than a sum.
Use of lapply and split: by creating a new index that is unique for all (ix1, ix2, ...) n-tuples, I can then use split and apply. The bad thing here is that the unique index is converted by split into a factor, and that conversion is terribly time consuming. Try system.time(zz <- as.factor(1:10^7)).
I'm now trying data.table, via a command like sumDT <- DT[,sum(val),by = c("ix1","ix2")]. However, I don't yet see how I can merge sumDT with DT, other than via something like DT2 <- merge(DT, sumDT, by = c("ix1","ix2"))
Is there a faster method for this data.table join than via the merge operation I've described?
[I've also tried bigsplit from the bigtabulate package, and some other methods. Anything that converts to a factor is pretty much out - as far as I can tell, that conversion process is very slow.]
Code to generate data. Naturally, it's better to try a smaller N to see that something works, but not all methods scale very well for N >> 1000.
N <- 10^7
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DF <- data.frame(ix1 = ix1, ix2 = ix2, ix3 = ix3, val = val)
DF <- DF[order(DF[,1],DF[,2],DF[,3]),]
DT <- as.data.table(DF)
Well, it's possible you'll find that doing the merge isn't so bad as long as your keys are properly set.
Let's setup the problem again:
N <- 10^6 ## not 10^7 because RAM is tight right now
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DT <- data.table(ix1=ix1, ix2=ix2, ix3=ix3, val=val, key=c("ix1", "ix2"))
Now you can calculate your summary stats
info <- DT[, list(summary=sum(val)), by=key(DT)]
And merge the columns "the data.table way", or just with merge
m1 <- DT[info] ## the data.table way
m2 <- merge(DT, info) ## if you're just used to merge
identical(m1, m2)
[1] TRUE
If either of those ways of merging is too slow, you can try a tricky way to build info at the cost of memory:
info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)]
m3 <- transform(DT, summary=info2$summary)
identical(m1, m3)
[1] TRUE
Now let's see the timing:
#######################################################################
## Using data.table[ ... ] or merge
system.time(info <- DT[, list(summary=sum(val)), by=key(DT)])
user system elapsed
0.203 0.024 0.232
system.time(DT[info])
user system elapsed
0.217 0.078 0.296
system.time(merge(DT, info))
user system elapsed
0.981 0.202 1.185
########################################################################
## Now the two parts of the last version done separately:
system.time(info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)])
user system elapsed
0.574 0.040 0.616
system.time(transform(DT, summary=info2$summary))
user system elapsed
0.173 0.093 0.267
Or you can skip the intermediate info table building if the following doesn't seem too inscrutable for your tastes:
system.time(m5 <- DT[ DT[, list(summary=sum(val)), by=key(DT)] ])
user system elapsed
0.424 0.101 0.525
identical(m5, m1)
# [1] TRUE
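For completeness (not part of the answer above): more recent data.table versions can also attach the group sum to every row by reference, skipping the join/merge entirely:
DT[, summary := sum(val), by = key(DT)]   # adds the per-group sum as a column, in place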