loop over subsets of datatable - r

Assume I have a random data table and I want to loop over its subsets.
e.g.
DT <- data.table(date = rep(c(1979,1980,1981,1982),3),
                 Id = rep(c(1,2,3),each = 4),
                 x1 = c(10, 40, 80,12,13,19,9,5,22,13,49,110),
                 x2 = sample(100,12,replace=T),
                 x3 = sample(100,12,replace=T))
I also have the following function:
test <- function(x){x[,3:5]/100}
Assume I loop over id, apply the function 'test' to the subsets of the datatable and save everything in a list:
resultinglist <- vector("list",3)
for (i in 1:3){resultinglist[[i]] <- test(DT[Id == i])}
This, so far, is straightforward. My question: with very large datasets, this can take a while. Can this code be optimized in any way, ideally so that no copies of the data.table subsets are made?
In particular, I wonder what happens when I pass DT[Id == i] to the function test. Is this the right approach? For example, I could also loop and filter at every iteration, then apply some code to the filtered data.table.
Thanks for any hints.

I would go with split(test(DT), DT$Id).
> system.time(resultinglist1<- split(test(DT), DT$Id))
user system elapsed
0.002 0.000 0.002
> resultinglist <- vector("list",3)
> system.time(for (i in 1:3){resultinglist[[i]] <- test(DT[Id == i])})
user system elapsed
0.015 0.000 0.016
Even with so few data points, it takes about 1/8th of the time (on my machine).
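On the "no copies of the subsets" part of the question, data.table's own grouping may also be worth a look. The following is only a sketch, assuming the per-group work can be written as an expression over .SD; the seed, the .SDcols selection and the name scaled are illustrative choices, not part of the original post.

```r
library(data.table)

set.seed(42)  # only so the sampled columns are reproducible
DT <- data.table(date = rep(c(1979, 1980, 1981, 1982), 3),
                 Id   = rep(c(1, 2, 3), each = 4),
                 x1   = c(10, 40, 80, 12, 13, 19, 9, 5, 22, 13, 49, 110),
                 x2   = sample(100, 12, replace = TRUE),
                 x3   = sample(100, 12, replace = TRUE))

# Grouped evaluation: .SD holds the x1..x3 columns of one Id at a time,
# so no DT[Id == i] subset has to be written by hand.
scaled <- DT[, lapply(.SD, function(v) v / 100),
             by = Id, .SDcols = c("x1", "x2", "x3")]

# If a list with one element per Id is really needed, split the result once.
resultinglist <- split(scaled, by = "Id", keep.by = FALSE)
```

Whether this is faster than the loop on a given dataset is something to measure rather than assume.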

There is a split.data.table method: see ?split.data.table so try:
> split(DT, by=c("Id"), flatten=FALSE)
$`1`
date Id x1 x2 x3
1: 1979 1 10 26 74
2: 1980 1 40 17 5
3: 1981 1 80 43 51
4: 1982 1 12 35 96
$`2`
date Id x1 x2 x3
1: 1979 2 13 8 65
2: 1980 2 19 66 69
3: 1981 2 9 69 27
4: 1982 2 5 4 80
$`3`
date Id x1 x2 x3
1: 1979 3 22 100 29
2: 1980 3 13 28 83
3: 1981 3 49 53 55
4: 1982 3 110 89 7
If you wanted to extract the 3rd to 5th columns it might be:
lapply( split(DT, by=c("Id"), flatten=FALSE), subset, select=3:5)
$`1`
x1 x2 x3
1: 10 26 74
2: 40 17 5
3: 80 43 51
4: 12 35 96
$`2`
x1 x2 x3
1: 13 8 65
2: 19 66 69
3: 9 69 27
4: 5 4 80
$`3`
x1 x2 x3
1: 22 100 29
2: 13 28 83
3: 49 53 55
4: 110 89 7
See also ?subset.data.table

Related

find duplicated rows of a data frame in R [duplicate]

I have the following data:
x1 x2 x3 x4
34 14 45 53
2 8 18 17
34 14 45 20
19 78 21 48
2 8 18 5
In rows 1 and 3, and in rows 2 and 5, the values in columns x1, x2 and x3 are equal. How can I output only those 4 rows with equal values? The output should be in the following format:
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
Please ask if something is unclear.
ADDITIONAL QUESTION: in the output
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
find the sum of values in last column:
x1 x2 x3 x4
34 14 45 73
2 8 18 22
You can do this with duplicated, which checks for rows being duplicated when passed a matrix. Since you're only checking the first three columns, you should pass dat[,-4] to the function.
dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast=T),]
# x1 x2 x3 x4
# 1 34 14 45 53
# 2 2 8 18 17
# 3 34 14 45 20
# 5 2 8 18 5
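For completeness, here is a self-contained sketch of the same idea that also groups the matching rows together and answers the follow-up question about summing x4. The data frame is retyped from the question; key_cols, is_dup, dups and sums are names made up for this example.

```r
dat <- data.frame(x1 = c(34, 2, 34, 19, 2),
                  x2 = c(14, 8, 14, 78, 8),
                  x3 = c(45, 18, 45, 21, 18),
                  x4 = c(53, 17, 20, 48, 5))

key_cols <- c("x1", "x2", "x3")  # columns that define "the same row"

# TRUE for every row whose key occurs more than once (first and later copies)
is_dup <- duplicated(dat[key_cols]) | duplicated(dat[key_cols], fromLast = TRUE)
dups <- dat[is_dup, ]

# Put identical keys next to each other, as in the requested output
dups <- dups[order(-dups$x1, dups$x2, dups$x3), ]

# Follow-up question: sum x4 within each duplicated key
sums <- aggregate(x4 ~ x1 + x2 + x3, data = dups, FUN = sum)
```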
An alternative using ave:
dat[ave(dat[,1], dat[-4], FUN=length) > 1,]
# x1 x2 x3 x4
#1 34 14 45 53
#2 2 8 18 17
#3 34 14 45 20
#5 2 8 18 5
Learned this one the other day. You won't need to re-order the output.
s <- split(dat, do.call(paste, dat[-4]))
Reduce(rbind, Filter(function(x) nrow(x) > 1, s))
# x1 x2 x3 x4
# 2 2 8 18 17
# 5 2 8 18 5
# 1 34 14 45 53
# 3 34 14 45 20
There is another way to solve both questions using two packages.
library(DescTools)
library(dplyr)
dat[AllDuplicated(dat[1:3]), ] %>% # this line is to find duplicates
group_by(x1, x2) %>% # the lines followed are to sum up
mutate(x4 = sum(x4)) %>%
unique()
# Source: local data frame [2 x 4]
# Groups: x1, x2
#
# x1 x2 x3 x4
# 1 34 14 45 73
# 2 2 8 18 22
Can also use table command:
> d1 = ddf[ddf$x1 %in% ddf$x1[which(table(ddf$x1)>1)],]
> d2 = ddf[ddf$x2 %in% ddf$x2[which(table(ddf$x2)>1)],]
> rr = rbind(d1, d2)
> rr[!duplicated(rbind(d1, d2)),]
x1 x2 x3 x4
1 34 14 45 53
3 34 14 45 20
2 2 8 18 17
5 2 8 18 5
For sum in last column:
> rr2 = rr[!duplicated(rr),]
> rrt = data.table(rr2)
> rrt[,x4:=sum(x4),by=x1]
> rrt[rrt[,!duplicated(x1),]]
x1 x2 x3 x4
1: 34 14 45 73
2: 2 8 18 22
The first step is similar to the above; let z be your data.frame:
library(DescTools)
(zz <- Sort(z[AllDuplicated(z[, -4]), ], decreasing=TRUE) )
# now aggregate
aggregate(zz[, 4], zz[, -4], FUN=sum)
# use Sort again, if needed...

R: Calculate differences between rows in data.table

Rprof revealed that the following operation I perform is rather slow:
stockHistory[.(p), stock:=stockHistory[.(p), stock] - (backorderedDemands[.(p-1),backlog] - backorderedDemands[.(p),backlog])]
I suppose this is because of the subtraction
backorderedDemands[.(p-1),backlog] - backorderedDemands[.(p),backlog]
Is there any way to speed up this operation?
.(p) subsets the data.table for a period p, .(p-1) subsets the previous period (see example data below). Would it maybe be faster to apply some kind of diff() here? I do not know how to do this, though.
Example data:
backorderedDemands<-CJ(period=1:1000, articleID=letters[1:10], backlog=0)[,backlog:=round(runif(10000)*42,0)]
setkey(backorderedDemands,period, articleID)
stockHistory<-CJ(period=1:1000, articleID=letters[1:10], stock=0)[,stock:=round(runif(10000)*42+66,0)]
setkey(stockHistory,period, articleID)
You can first calculate a difference column in backorderedDemands.
backorderedDemands[, diff := c(NA, -diff(backlog)), by=articleID]
Also, it is not necessary to use stockHistory[.(p), stock]. Inside the update, plain stock is enough.
stockHistory[.(p), stock:=stock - backorderedDemands[.(p), diff]]
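If every period has to be processed anyway, the per-period lookup might be replaced by a single update join. The sketch below uses a tiny made-up version of the two tables; the column names backlog_drop and new_stock are hypothetical, and shift() is data.table's lag helper, used here in place of c(NA, -diff(backlog)).

```r
library(data.table)

# Small, made-up stand-ins for the two tables from the question
backorderedDemands <- data.table(
  period    = rep(1:3, each = 2),
  articleID = rep(c("a", "b"), times = 3),
  backlog   = c(5, 7, 3, 10, 8, 4)
)
stockHistory <- data.table(
  period    = rep(1:3, each = 2),
  articleID = rep(c("a", "b"), times = 3),
  stock     = c(100, 100, 90, 95, 80, 85)
)

# Previous period's backlog minus the current one, per article
# (NA in the first period, where there is no previous value).
backorderedDemands[, backlog_drop := shift(backlog) - backlog, by = articleID]

# Update join: match rows on period and articleID, then subtract in one pass.
# i.backlog_drop refers to the column coming from backorderedDemands.
stockHistory[backorderedDemands, on = c("period", "articleID"),
             new_stock := stock - i.backlog_drop]
```

The result is written to a separate column here so the original stock values stay available for comparison.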
If you want to compute first differences of your data, you can do it as shown below. It is fast, and I have included the step-by-step computation.
library(data.table)
library(dplyr)
Data
set.seed(1)
backorderedDemands <-
CJ(period = 1:1000,
articleID = letters[1:10],
backlog = 0)[,backlog:= round(runif(10000) * 42, 0)]
stockHistory <-
CJ(period = 1:1000,
articleID = letters[1:10],
stock = 0)[, stock:= round(runif(10000) * 42 + 66, 0)]
Solution
merge(stockHistory, backorderedDemands,
by = c("period", "articleID")) %>%
group_by(articleID) %>%
mutate(lag_backlog = lag(backlog, 1),
my_backlog_diff = backlog - lag_backlog,
my_diff = stock + my_backlog_diff) %>%
as.data.frame(.) %>%
head(., 20)
period articleID stock backlog lag_backlog my_backlog_diff my_diff
1 1 a 69 11 NA NA NA
2 1 b 94 16 NA NA NA
3 1 c 97 24 NA NA NA
4 1 d 71 38 NA NA NA
5 1 e 68 8 NA NA NA
6 1 f 71 38 NA NA NA
7 1 g 103 40 NA NA NA
8 1 h 101 28 NA NA NA
9 1 i 102 26 NA NA NA
10 1 j 67 3 NA NA NA
11 2 a 71 9 11 -2 69
12 2 b 89 7 16 -9 80
13 2 c 71 29 24 5 76
14 2 d 96 16 38 -22 74
15 2 e 96 32 8 24 120
16 2 f 99 21 38 -17 82
17 2 g 92 30 40 -10 82
18 2 h 87 42 28 14 101
19 2 i 85 16 26 -10 75
20 2 j 67 33 3 30 97

Subsetting top 4 observations of each unique ID

I have a dataframe of 4 columns and a few thousand rows. I am ordering the dataframe by its 4th column, which is the ID (descending), and then by the second column (ascending). Here's what my data looks like:
X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
24 19 16 23
19 20 7 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
16 20 6 21
I am trying to keep the highest 4 values of each ID (if available), my desired output would be
X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
# note that 2 of the 23 ID observations and one of the 21 ID observations were removed.
I was wondering if there is some short command that can do the job for me. The only approach I can think of is about a page long: subset the data by the 4th column, take the top 4 of each subset, then rbind them together again. But that seems so unprofessional!
Here's a command to generate similar example:
m0 <- matrix(0, 100, 4)
df <- data.frame(apply(m0, c(1,2), function(x) sample(c(0:25),1)))
##fix(df)
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
Thanks all.
maybe data.table:
require(data.table)
df<-read.table(header=T,text=" X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
24 19 16 23
19 20 7 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
16 20 6 21")
data.table(df)[,.SD[order(X2)][1:4,],by="X4"][!is.na(X3)][,list(X1,X2,X3,X4)]
X1 X2 X3 X4
1: 24 1 23 25
2: 21 3 19 25
3: 19 6 20 25
4: 11 12 14 25
5: 14 9 21 24
6: 3 12 25 24
7: 24 15 23 24
8: 8 1 4 23
9: 17 4 12 23
10: 16 11 23 23
11: 20 19 21 23
12: 19 22 22 22
13: 11 2 18 21
14: 15 9 19 21
15: 10 14 9 21
16: 17 15 19 21
Here's what's happening in the data.table call:
data.table(df)[             # data.table of df
  , .SD[                    # for each by=X4, .SD is the sub-table
    order(X2)][1:4, ],      # first four entries ordered by X2
  by = "X4"][               # X4 is the grouping variable
  !is.na(X3)][              # filter out NA rows (padding for groups with fewer than 4 rows)
  , list(X1, X2, X3, X4)]   # order the columns
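A variant that avoids the NA padding step altogether is to take head() of each group, since head() simply returns fewer rows for small groups. This is a sketch on made-up data, not a benchmarked replacement for the call above; top4 is an illustrative name.

```r
library(data.table)

# Made-up data: ID 25 has two rows, ID 23 has six
DT <- data.table(X1 = 1:8,
                 X2 = c(3, 1, 20, 4, 19, 1, 11, 19),
                 X3 = 11:18,
                 X4 = c(25, 25, 23, 23, 23, 23, 23, 23))

# Sort once (ID descending, X2 ascending), then keep at most four rows per ID.
# Groups with fewer than four rows are returned as they are.
top4 <- DT[order(-X4, X2), head(.SD, 4), by = X4]
```

Note that the grouping column X4 comes first in the result, so a final column reorder may still be wanted.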
I think that Thomas's solution is fine, but can be improved. I would guess that the splitting, recombining, and reordering might be time consuming.
Instead, I would create a vector from which we can subset.
This is easily done with ave and should work since the data are already ordered.
Continuing from:
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
we can do:
out <- odf[ave(odf$X4, odf$X4, FUN = seq_along) <= 4, ]
head(out)
# X1 X2 X3 X4
# 24 3 4 13 25
# 6 23 5 13 25
# 19 9 11 24 25
# 40 10 13 11 25
# 93 16 2 25 24
# 26 10 11 13 24
tail(out)
# X1 X2 X3 X4
# 61 23 7 13 2
# 2 9 9 5 2
# 17 18 18 16 2
# 67 12 1 1 1
# 52 22 14 24 1
# 9 16 24 6 1
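To make the ave() step easier to follow, here is the same idea on a tiny made-up data frame where the intermediate vector can be inspected by eye. The name pos is only for illustration.

```r
# Made-up data, already sorted by ID (descending) and then by X2
odf <- data.frame(X2 = c(1, 3, 1, 4, 11, 19, 20),
                  X4 = c(25, 25, 23, 23, 23, 23, 23))

# ave() applies seq_along within each X4 group and returns the result
# in the original row order, i.e. a running position per ID.
pos <- ave(odf$X4, odf$X4, FUN = seq_along)

# Keep the first four rows of every ID; the fifth row of ID 23 is dropped.
out <- odf[pos <= 4, ]
```

Because the positions are computed on the sorted data, "first four" here means the four smallest X2 values within each ID.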
Update: New alternatives and benchmarks
The "dplyr" package would be great for this, and the syntax is pretty compact. But first, let's set some things up to see how fast these options are:
Functions to benchmark
fun1 <- function() {
  odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
  out <- do.call(rbind, lapply(split(odf, odf$X4), function(z) head(z[order(z$X2), ], 4)))
  out[order(out$X4, decreasing = TRUE), ]
}
fun2 <- function() {
  odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
  odf[ave(odf$X4, odf$X4, FUN = seq_along) <= 4, ]
}
fun3 <- function() {
  DT <- data.table(df)
  DT[, X := -X4]
  setkey(DT, X, X2)
  DT[, .SD[sequence(min(.N, 4))], by = X][, X := NULL][]
}
fun4 <- function() {
  group_by(arrange(df, desc(X4), X2), X4) %>%
    mutate(vals = seq_along(X4)) %>%
    filter(vals <= 4)
}
A bigger version of your sample data
set.seed(1)
df <- data.frame(matrix(sample(0:1000, 1000000 * 4, replace = TRUE), ncol = 4))
The necessary packages
library(data.table)
library(dplyr)
library(microbenchmark)
The first two approaches (Thomas's and my first approach) take a fair amount of time, so instead of benchmarking, I'll just time them once.
system.time(fun1())
# user system elapsed
# 6.645 0.007 6.670
system.time(fun2())
# user system elapsed
# 4.053 0.004 4.186
Here are the "dplyr" and "data.table" results.
microbenchmark(fun3(), fun4(), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# fun3() 2.157956 2.221746 2.303286 2.343951 2.392391 20
# fun4() 1.169212 1.180780 1.194994 1.206651 1.369922 20
Compare the output of the "dplyr" and "data.table" approaches:
out_DT <- fun3()
out_DP <- fun4()
out_DT
# X1 X2 X3 X4
# 1: 340 0 708 1000
# 2: 144 1 667 1000
# 3: 73 2 142 1000
# 4: 79 2 826 1000
# 5: 169 0 870 999
# ---
# 4000: 46 4 2 1
# 4001: 88 0 809 0
# 4002: 535 0 522 0
# 4003: 75 3 234 0
# 4004: 983 3 492 0
head(out_DP, 5)
# Source: local data frame [5 x 5]
# Groups: X4
#
# X1 X2 X3 X4 vals
# 1 340 0 708 1000 1
# 2 144 1 667 1000 2
# 3 73 2 142 1000 3
# 4 79 2 826 1000 4
# 5 169 0 870 999 1
tail(out_DP, 5)
# Source: local data frame [5 x 5]
# Groups: X4
#
# X1 X2 X3 X4 vals
# 4000 46 4 2 1 4
# 4001 88 0 809 0 1
# 4002 535 0 522 0 2
# 4003 75 3 234 0 3
# 4004 983 3 492 0 4
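With a current dplyr version, the same idea can be written without the helper column by filtering on row_number() inside each group. This is a sketch on small made-up data, and it has not been benchmarked against the timings above.

```r
library(dplyr)

# Made-up data: ID 25 has two rows, ID 23 has six
df_small <- data.frame(X1 = 1:8,
                       X2 = c(3, 1, 20, 4, 19, 1, 11, 19),
                       X4 = c(25, 25, 23, 23, 23, 23, 23, 23))

top4 <- df_small %>%
  arrange(desc(X4), X2) %>%       # ID descending, X2 ascending
  group_by(X4) %>%
  filter(row_number() <= 4) %>%   # position of the row within its group
  ungroup()
```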
I include your code again with a set.seed call, so that this is exactly reproducible.
set.seed(1)
m0 <- matrix(0, 100, 4)
df <- data.frame(apply(m0, c(1,2), function(x) sample(c(0:25),1)))
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
Here's the code you need using a split-apply-combine strategy:
out <- do.call(rbind, lapply(split(odf, odf$X4), function(z) head(z[order(z$X2),],4) ))
out <- out[order(out$X4, decreasing=TRUE),]
Result:
> dim(out)
[1] 79 4
> head(out)
X1 X2 X3 X4
25.24 3 4 13 25
25.6 23 5 13 25
25.19 9 11 24 25
25.40 10 13 11 25
24.93 16 2 25 24
24.26 10 11 13 24

Enumerate instances of a factor level

I have a data frame with 150000 rows in long format with multiple occurrences of the same id variable. I'm using reshape (from stats, rather than package=reshape(2)) to convert this to wide format. I am generating a variable that counts each occurrence of a given level of id to use as an index.
I've got this working with a small data frame using plyr, but it is far too slow for my full df. Can I programme this more efficiently?
I've struggled doing this with the reshape package as I have around 30 other variables. It may be best to reshape only what I'm looking at (rather than the whole df) for each individual analysis.
> # u=id variable with three value variables
> u<-c(rep("a",4), rep("b", 3),rep("c", 6), rep("d", 5))
> u<-factor(u)
> v<-1:18
> w<-20:37
> x<-40:57
> df<-data.frame(u,v,w,x)
> df
u v w x
1 a 1 20 40
2 a 2 21 41
3 a 3 22 42
4 a 4 23 43
5 b 5 24 44
6 b 6 25 45
7 b 7 26 46
8 c 8 27 47
9 c 9 28 48
10 c 10 29 49
11 c 11 30 50
12 c 12 31 51
13 c 13 32 52
14 d 14 33 53
15 d 15 34 54
16 d 16 35 55
17 d 17 36 56
18 d 18 37 57
>
> library(plyr)
> df2<-ddply(df, .(u), transform, count=rank(u, ties.method="first"))
> df2
u v w x count
1 a 1 20 40 1
2 a 2 21 41 2
3 a 3 22 42 3
4 a 4 23 43 4
5 b 5 24 44 1
6 b 6 25 45 2
7 b 7 26 46 3
8 c 8 27 47 1
9 c 9 28 48 2
10 c 10 29 49 3
11 c 11 30 50 4
12 c 12 31 51 5
13 c 13 32 52 6
14 d 14 33 53 1
15 d 15 34 54 2
16 d 16 35 55 3
17 d 17 36 56 4
18 d 18 37 57 5
> reshape(df2, idvar="u", timevar="count", direction="wide")
u v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
1 a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
5 b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
8 c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
14 d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA
I still can't quite figure out why you would ultimately want to convert your dataset from long to wide, because to me, that seems like it would be an extremely unwieldy dataset to work with.
If you're looking to speed up the enumeration of your factor levels, you can consider using ave() in base R, or .N from the "data.table" package. Considering that you are working with a lot of rows, you might want to consider the latter.
First, let's make up some data:
set.seed(1)
df <- data.frame(u = sample(letters[1:6], 150000, replace = TRUE),
v = runif(150000, 0, 10),
w = runif(150000, 0, 100),
x = runif(150000, 0, 1000))
list(head(df), tail(df))
# [[1]]
# u v w x
# 1 b 6.368412 10.52822 223.6556
# 2 c 6.579344 75.28534 450.7643
# 3 d 6.573822 36.87630 283.3083
# 4 f 9.711164 66.99525 681.0157
# 5 b 5.337487 54.30291 137.0383
# 6 f 9.587560 44.81581 831.4087
#
# [[2]]
# u v w x
# 149995 b 4.614894 52.77121 509.0054
# 149996 f 5.104273 87.43799 391.6819
# 149997 f 2.425936 60.06982 160.2324
# 149998 a 1.592130 66.76113 118.4327
# 149999 b 5.157081 36.90400 511.6446
# 150000 a 3.565323 92.33530 252.4982
table(df$u)
#
# a b c d e f
# 25332 24691 24993 24975 25114 24895
Load our required packages:
library(plyr)
library(data.table)
Create a "data.table" version of our dataset
DT <- data.table(df, key = "u")
DT # Notice that the data are now automatically sorted
# u v w x
# 1: a 6.2378578 96.098294 643.2433
# 2: a 5.0322400 46.806132 544.6883
# 3: a 9.6289786 87.915303 334.6726
# 4: a 4.3393403 1.994383 753.0628
# 5: a 6.2300123 72.810359 579.7548
# ---
# 149996: f 0.6268414 15.608049 669.3838
# 149997: f 2.3588955 40.380824 658.8667
# 149998: f 1.6383619 77.210309 250.7117
# 149999: f 5.1042725 87.437989 391.6819
# 150000: f 2.4259363 60.069820 160.2324
DT[, .N, by = key(DT)] # Like "table"
# u N
# 1: a 25332
# 2: b 24691
# 3: c 24993
# 4: d 24975
# 5: e 25114
# 6: f 24895
Now let's run a few basic tests. The results from ave() aren't sorted, but they are in "data.table" and "plyr", so we should also test the timing for sorting when using ave().
system.time(AVE <- within(df, {
count <- ave(as.numeric(u), u, FUN = seq_along)
}))
# user system elapsed
# 0.024 0.000 0.027
# Now time the sorting
system.time(AVE2 <- AVE[order(AVE$u, AVE$count), ])
# user system elapsed
# 0.264 0.000 0.262
system.time(DDPLY <- ddply(df, .(u), transform,
count=rank(u, ties.method="first")))
# user system elapsed
# 0.944 0.000 0.984
system.time(DT[, count := 1:.N, by = key(DT)])
# user system elapsed
# 0.008 0.000 0.004
all(DDPLY == AVE2)
# [1] TRUE
all(data.frame(DT) == AVE2)
# [1] TRUE
That syntax for "data.table" sure is compact, and its speed is blazing!
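As a small end-to-end illustration, the base R route can be run on the example data frame from the question and fed straight into reshape(). This is a sketch only; wide is an illustrative name, and ave() is used on row indices so that the id column itself is left untouched.

```r
u  <- factor(c(rep("a", 4), rep("b", 3), rep("c", 6), rep("d", 5)))
df <- data.frame(u, v = 1:18, w = 20:37, x = 40:57)

# Running count of each id: 1, 2, 3, ... restarting for every level of u
df$count <- ave(seq_along(df$u), df$u, FUN = seq_along)

# Long to wide, using the count as the "time" index
wide <- reshape(df, idvar = "u", timevar = "count", direction = "wide")
```

The result has one row per level of u and columns v.1, w.1, x.1 and so on, padded with NA for ids that occur fewer times than the most frequent one.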
Using base R to create an empty matrix and then fill it in appropriately can often be significantly faster. In the code below I suspect the slow part would be converting the data frame to a matrix and transposing, as in the first two lines; if so, that could perhaps be avoided if it could be stored differently to start with.
g <- df$a
x <- t(as.matrix(df[,-1]))
k <- split(seq_along(g), g)
n <- max(sapply(k, length))
out <- matrix(ncol=n*nrow(x), nrow=length(k))
for(idx in seq_along(k)) {
out[idx, seq_len(length(k[[idx]])*nrow(x))] <- x[,k[[idx]]]
}
rownames(out) <- names(k)
colnames(out) <- paste(rep(rownames(x), n), rep(seq_len(n), each=nrow(x)), sep=".")
out
# b.1 c.1 d.1 b.2 c.2 d.2 b.3 c.3 d.3 b.4 c.4 d.4 b.5 c.5 d.5 b.6 c.6 d.6
# a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
# b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
# c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
# d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA
