data.table within group id [duplicate] - r

I have a data.table with n grouping variables (in this case 2). I want to add an identifier column for each group, as seen in the desired output below. I tried z := 1:.N by group; I understand why that doesn't work (it numbers the rows within each group rather than the groups themselves), but I don't know how to get the group number instead:
library(data.table)
dat <- data.table::data.table(
w = 1:16,
x = LETTERS[1:2],
y = 1:4
)[, w := NULL][order(x, y)]
## x y
## 1: A 1
## 2: A 1
## 3: A 1
## 4: A 1
## 5: A 3
## 6: A 3
## 7: A 3
## 8: A 3
## 9: B 2
## 10: B 2
## 11: B 2
## 12: B 2
## 13: B 4
## 14: B 4
## 15: B 4
## 16: B 4
dat[, z := 1:.N, by = list(x, y)]
dat
Desired Output
## x y z
## 1: A 1 1
## 2: A 1 1
## 3: A 1 1
## 4: A 1 1
## 5: A 3 2
## 6: A 3 2
## 7: A 3 2
## 8: A 3 2
## 9: B 2 3
## 10: B 2 3
## 11: B 2 3
## 12: B 2 3
## 13: B 4 4
## 14: B 4 4
## 15: B 4 4
## 16: B 4 4

dat[, z:=.GRP,by=list(x,y)]
dat
# x y z
# 1: A 1 1
# 2: A 1 1
# 3: A 1 1
# 4: A 1 1
# 5: A 3 2
# 6: A 3 2
# 7: A 3 2
# 8: A 3 2
# 9: B 2 3
# 10: B 2 3
# ...
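To make the difference between the two attempts concrete, here is a minimal sketch that puts both counters side by side. The table is rebuilt directly in sorted form, and the column name row_in_group is made up for this illustration.

```r
library(data.table)

# Same shape as the question's data: 4 groups of 4 rows, already sorted.
dat <- data.table(
  x = rep(c("A", "B"), each = 8),
  y = rep(c(1L, 3L, 2L, 4L), each = 4)
)

# seq_len(.N) (equivalent to 1:.N here) restarts at 1 inside every group.
dat[, row_in_group := seq_len(.N), by = .(x, y)]

# .GRP is a single integer per group: 1 for the first group, 2 for the next, ...
dat[, z := .GRP, by = .(x, y)]

print(dat)
```

Note that by= numbers groups in order of first appearance, so on unsorted data the ids follow row order rather than the sort order of x and y.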

Related

Group variable by "n" consecutive integers in data.table

library(data.table)
DT <- data.table(var = 1:100)
I want to create a second variable, group that groups the values in var by n consecutive integers. So if n is equal to 1, it would return the same column as var. If n=2, it would return me:
var group
1: 1 1
2: 2 1
3: 3 2
4: 4 2
5: 5 3
6: 6 3
If n=3, it would return me:
var group
1: 1 1
2: 2 1
3: 3 1
4: 4 2
5: 5 2
6: 6 2
and so on. I would like to do this as flexibly as possible.
Note that there could be repeated values:
var group
1: 1 1
2: 1 1
3: 2 1
4: 3 2
5: 3 2
6: 4 2
Here, group corresponds to n=2. Thank you!
I think we can use findInterval for this:
DT <- data.table(var = c(1L, 1:10))
n <- 2
DT[, group := findInterval(var, seq(min(var), max(var) + n, by = n))]
# var group
# <int> <int>
# 1: 1 1
# 2: 1 1
# 3: 2 1
# 4: 3 2
# 5: 4 2
# 6: 5 3
# 7: 6 3
# 8: 7 4
# 9: 8 4
# 10: 9 5
# 11: 10 5
n <- 3
DT[, group := findInterval(var, seq(min(var), max(var) + n, by = n))]
# var group
# <int> <int>
# 1: 1 1
# 2: 1 1
# 3: 2 1
# 4: 3 1
# 5: 4 2
# 6: 5 2
# 7: 6 2
# 8: 7 3
# 9: 8 3
# 10: 9 3
# 11: 10 4
(The +n in the call to seq is so that we always have a little more than we need; if we did just seq(min(.),max(.),by=n), it would be possible the highest values of var would be outside of the sequence. One could also do c(seq(min(.), max(.), by=n), Inf) for the same effect.)
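For integer-valued var, the same grouping can also be sketched with integer division instead of findInterval. This is a different technique from the answer above, not a restatement of it, and the helper name group_by_n is hypothetical.

```r
# Hypothetical helper: bucket integer values into runs of n consecutive integers,
# counting from the smallest value present.
group_by_n <- function(var, n) {
  as.integer((var - min(var)) %/% n + 1L)
}

var <- c(1L, 1:10)

g2 <- group_by_n(var, 2)
g3 <- group_by_n(var, 3)

# Reference results from the findInterval approach shown above.
f2 <- findInterval(var, seq(min(var), max(var) + 2, by = 2))
f3 <- findInterval(var, seq(min(var), max(var) + 3, by = 3))

print(rbind(var = var, n2 = g2, n3 = g3))
```

Inside a data.table this would be used as DT[, group := group_by_n(var, n)].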

Get the group index in a data.table

I have the following data.table:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5,6,7,8,9,10), b = c('A','A','A','B','B', 'C', 'C', 'C', 'D', 'D'), c = c(1,1,1,1,1,2,2,2,2,2))
> DT
a b c
1: 1 A 1
2: 2 A 1
3: 3 A 1
4: 4 B 1
5: 5 B 1
6: 6 C 2
7: 7 C 2
8: 8 C 2
9: 9 D 2
10: 10 D 2
I want to add a column that shows the index grouped by c (starts from 1 from each group in column c), but that only changes when the value of b is changed. The result wanted is shown below:
Here are two ways to do this:
Using rleid:
library(data.table)
DT[, col := rleid(b), c]
With match + unique:
DT[, col := match(b, unique(b)), c]
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
We can use factor with levels specified and coerce it to integer
library(data.table)
DT[, col := as.integer(factor(b, levels = unique(b))), c]
Output:
DT
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
Or using base R with rle
with(DT, as.integer(ave(b, c, FUN = function(x)
with(rle(x), rep(seq_along(values), lengths)))))
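The approaches above agree on this data because each value of b forms one contiguous run within its c group. They can differ when a value reappears after a gap, which the following minimal sketch illustrates. The table DT2 and the column names run_id and val_id are made up for the example.

```r
library(data.table)

# "A" appears, is interrupted by "B", then appears again within the same c group.
DT2 <- data.table(b = c("A", "B", "A"), c = 1)

# rleid (like the rle-based version) counts runs: every change of b starts a new id.
DT2[, run_id := rleid(b), by = c]

# match + unique (like the factor version) counts distinct values:
# the second "A" gets the same id as the first.
DT2[, val_id := match(b, unique(b)), by = c]

print(DT2)
```

Which one is wanted depends on whether "changes when the value of b is changed" should treat a returning value as a new block or as the old one.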

from two lists to one by binding elements

I have two lists with two elements each,
l1 <- list(data.table(id=1:5, group=1), data.table(id=1:5, group=1))
l2 <- list(data.table(id=1:5, group=2), data.table(id=1:5, group=2))
and I would like to rbind(.) both elements, resulting in a new list with two elements.
> l
[[1]]
id group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 1 2
7: 2 2
8: 3 2
9: 4 2
10: 5 2
[[2]]
id group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 1 2
7: 2 2
8: 3 2
9: 4 2
10: 5 2
However, I only find examples where rbind(.) is applied to bind across elements. I suspect that the solution lies somewhere in lapply(.) but lapply(c(l1,l2),rbind) appears to bind the lists, producing a list of four elements.
You can use mapply or Map. mapply (which stands for multivariate apply) applies the supplied function to the first elements of each argument, then to the second elements, then to the third, and so on. Map is a thin wrapper around mapply that does not try to simplify the result (try running mapply with and without SIMPLIFY = TRUE). Shorter arguments are recycled as necessary.
mapply(function(x, y) rbind(x, y), l1, l2, SIMPLIFY = FALSE)
#[[1]]
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 1 2
# 7: 2 2
# 8: 3 2
# 9: 4 2
#10: 5 2
#
#[[2]]
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 1 2
# 7: 2 2
# 8: 3 2
# 9: 4 2
#10: 5 2
As @Parfait pointed out, you can do this with Map:
Map(rbind, l1, l2)
#[[1]]
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 1 2
# 7: 2 2
# 8: 3 2
# 9: 4 2
#10: 5 2
#
#[[2]]
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 1 2
# 7: 2 2
# 8: 3 2
# 9: 4 2
#10: 5 2
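The element-wise pairing, the simplification step, and the recycling rule are easier to see on plain vectors than on data.tables. A minimal sketch, with variable names chosen only for this illustration:

```r
# mapply pairs up elements position by position and, by default,
# simplifies the result to a vector when it can.
simplified <- mapply(function(x, y) x + y, 1:3, 4:6)

# Map does the same pairing but always returns a list.
as_list <- Map(function(x, y) x + y, 1:3, 4:6)

# The shorter argument (1:2) is recycled to match the longer one (1:4).
recycled <- mapply(function(x, y) x + y, 1:4, 1:2)

str(simplified)
str(as_list)
str(recycled)
```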
Using tidyverse
library(tidyverse)
map2(l1, l2, bind_rows)

When grouped, how to sort globally by sum of group's elements?

Consider a data.table with the following structure:
DT = data.table(x=rep(c("b","a","c"),each=3), y=1:9)
DT
# x y
# 1: b 1
# 2: b 2
# 3: b 3
# 4: a 4
# 5: a 5
# 6: a 6
# 7: c 7
# 8: c 8
# 9: c 9
I want to order decreasing with respect to the sum of the column y while grouping by x, i.e., I expect:
# x y
# 1: c 7
# 2: c 8
# 3: c 9
# 4: a 4
# 5: a 5
# 6: a 6
# 7: b 1
# 8: b 2
# 9: b 3
The only way that I've found is to create a new column with the 'intragroup sum' when grouping, and then ordering using that column:
DT[, s:=sum(y), by=x][order(s,decreasing=TRUE), .(x,y)]
# x y
# 1: c 7
# 2: c 8
# 3: c 9
# 4: a 4
# 5: a 5
# 6: a 6
# 7: b 1
# 8: b 2
# 9: b 3
But I guess there has to be a better way. Any ideas?
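One possible alternative, offered as a sketch rather than as the canonical answer: aggregate the per-group sums into a small lookup table, order that, and join it back, so that no helper column is added to DT. The names sums and res are made up here.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = 1:9)

# One row per group, sorted by decreasing group sum.
sums <- DT[, .(s = sum(y)), by = x][order(-s)]

# Joining on x returns DT's rows in the order of the groups in sums;
# rows within each group keep their original order.
res <- DT[sums, .(x, y), on = "x"]

print(res)
```

DT itself is left unchanged, which may or may not matter depending on how the table is used afterwards.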

dcast long to wide by transforming unique RHS

Let's say I have a table like the following:
DT <- data.table(ID1= rep(c("a","b","c"),3),ID2=rnorm(9,4),var = 1:9)
## > DT
## ID1 ID2 var
## 1: a 2.630392 1
## 2: b 3.966620 2
## 3: c 4.002776 3
## 4: a 3.188372 4
## 5: b 4.735084 5
## 6: c 4.307198 6
## 7: a 2.830868 7
## 8: b 4.892684 8
## 9: c 3.429826 9
and I would like to perform a dcast that takes into consideration only the number of times each ID1 appears, i.e. one output column per occurrence (first, second, third) rather than one per ID2 value.
undesired output:
dcast(DT,ID1~ID2)
desired output:
## ID1 1 2 3
## 1: a 1 4 7
## 2: b 2 5 8
## 3: c 3 6 9
Try
dcast(DT[, N := seq_len(.N), by = ID1], ID1 ~ N, value.var = "var")
# ID1 1 2 3
#1: a 1 4 7
#2: b 2 5 8
#3: c 3 6 9
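The helper column N is a within-group row counter, which data.table also provides as rowid(). A minimal sketch of the same reshape without adding a column to DT, assuming a data.table version that ships rowid() (ID2 is left out here because it plays no part in the result):

```r
library(data.table)

DT <- data.table(ID1 = rep(c("a", "b", "c"), 3), var = 1:9)

# rowid(ID1) gives 1 for the first occurrence of each ID1 value, 2 for the second, ...
# Used in the formula, it becomes the column dimension of the wide table.
wide <- dcast(DT, ID1 ~ rowid(ID1), value.var = "var")

print(wide)
```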
