Enumerate groups within groups in a data.table [duplicate] - r

This is related to multiple duplicates (1, 2, 3), but it is a slightly different problem that I'm stuck on. So far, I've only seen a pandas solution.
In this data table:
dt = data.table(gr = rep(letters[1:2], each = 6),
cl = rep(letters[1:4], each = 3))
gr cl
1: a a
2: a a
3: a a
4: a b
5: a b
6: a b
7: b c
8: b c
9: b c
10: b d
11: b d
12: b d
I'd like to enumerate unique classes per group to obtain this:
gr cl id
1: a a 1
2: a a 1
3: a a 1
4: a b 2
5: a b 2
6: a b 2
7: b c 1
8: b c 1
9: b c 1
10: b d 2
11: b d 2
12: b d 2

Try
library(data.table)
dt[, id := rleid(cl), by=gr]
dt
# gr cl id
# 1: a a 1
# 2: a a 1
# 3: a a 1
# 4: a b 2
# 5: a b 2
# 6: a b 2
# 7: b c 1
# 8: b c 1
# 9: b c 1
#10: b d 2
#11: b d 2
#12: b d 2

You can do the following (the data may need to be sorted by cl within each gr first, because a class that reappears later is not counted again):
dt[, id := cumsum(!duplicated(cl)), by = gr]
gr cl id
1: a a 1
2: a a 1
3: a a 1
4: a b 2
5: a b 2
6: a b 2
7: b c 1
8: b c 1
9: b c 1
10: b d 2
11: b d 2
12: b d 2
The same with dplyr:
dt %>%
group_by(gr) %>%
mutate(id = cumsum(!duplicated(cl)))
Or a rleid()-like possibility:
dt %>%
group_by(gr) %>%
mutate(id = with(rle(cl), rep(seq_along(lengths), lengths)))

An alternative solution uses factor, which does not require ordering the data first:
dt %>%
group_by(gr) %>%
mutate(id = as.numeric(factor(cl))) %>%
ungroup()
# # A tibble: 12 x 3
# gr cl id
# <chr> <chr> <dbl>
# 1 a a 1
# 2 a a 1
# 3 a a 1
# 4 a b 2
# 5 a b 2
# 6 a b 2
# 7 b c 1
# 8 b c 1
# 9 b c 1
#10 b d 2
#11 b d 2
#12 b d 2
Note that this will automatically assign a number / id based on the alphabetical order of the cl values, within each gr group.
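The approaches above agree on the example because cl is already sorted and each class forms one contiguous run. Here is a minimal sketch of how they differ otherwise, using a made-up table dt2 (not from the question) in which cl is unsorted and "b" recurs. match(cl, unique(cl)) is included as one more option that numbers classes by first appearance:

```r
library(data.table)

# Hypothetical unsorted input: "b" comes before "a" and shows up again later
dt2 <- data.table(gr = rep("g", 5), cl = c("b", "a", "a", "b", "c"))

dt2[, id_rleid  := rleid(cl), by = gr]                # new id for every run
dt2[, id_cumsum := cumsum(!duplicated(cl)), by = gr]  # counts first occurrences
dt2[, id_factor := as.integer(factor(cl)), by = gr]   # sorted order of cl
dt2[, id_match  := match(cl, unique(cl)), by = gr]    # order of first appearance

dt2
# cl:        b a a b c
# id_rleid:  1 2 2 3 4   (second run of "b" gets a new id)
# id_cumsum: 1 2 2 2 3   (second "b" inherits the id of "a")
# id_factor: 2 1 1 2 3
# id_match:  1 2 2 1 3
```

So if the same class can occur in several separate runs within a group and should keep one id, factor() or match() is probably the safer choice.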

Related

R mutate a column by group in ifelse

I'd like to mutate a column in an R data.table.
Here's an example of my data.
df <- data.table(id=c(1,1,1,2,2,2,3,3,3),
stopId=c("a","b","c","a","b","c","a","b","c"),
category=c(1,1,1,NA,NA,NA,2,2,2),
result = c('a','a','a','b','b','b','c','c','c'))
My goal is to create a column using an if-else condition.
The column should hold the first value of groupId within each id.
The point is that, after mutating, the value should be the same for every row of a group.
If the category is NA, then the result should be the last value of groupId instead.
This is the result I'm looking for.
id groupId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA b
5: 2 c NA b
6: 2 b NA b
7: 3 c 2 c
8: 3 b 2 c
9: 3 a 2 c
with data.table:
df[,result:=fifelse(is.na(category),last(stopId),first(stopId)),by=id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
As the function names suggest, you can use first and last:
df %>%
group_by(id) %>%
mutate(resultt = ifelse(is.na(category), last(stopId), first(stopId)))
id stopId category result resultt
<dbl> <chr> <dbl> <chr> <chr>
1 1 a 1 a a
2 1 b 1 a a
3 1 c 1 a a
4 2 a NA b b
5 2 c NA b b
6 2 b NA b b
7 3 c 2 c c
8 3 b 2 c c
9 3 a 2 c c
Note that the data you provided differs from the expected output shown above.
We can use .N or 1 to index stopId per group
> df[, result := stopId[ifelse(is.na(category), .N, 1)], id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
or shorter
> df[, result := stopId[c(1, .N)[is.na(category) + 1]], id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
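The outputs above differ from the expected table in the question only because the sample data has stopId in a different row order. As a rough check, here is the fifelse approach applied to a table typed in from the expected output (df2 is my own reconstruction, with the groupId column named stopId):

```r
library(data.table)

# Reconstructed from the expected-output table in the question (an assumption)
df2 <- data.table(id       = rep(1:3, each = 3),
                  stopId   = c("a", "b", "c", "a", "c", "b", "c", "b", "a"),
                  category = c(1, 1, 1, NA, NA, NA, 2, 2, 2))

# Within each id: last stopId if category is NA, otherwise the first one
df2[, result := fifelse(is.na(category), last(stopId), first(stopId)), by = id]

df2$result
# "a" "a" "a" "b" "b" "b" "c" "c" "c"
```

With that row order the result column matches what the question asked for.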

Get the group index in a data.table

I have the following data.table:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5,6,7,8,9,10), b = c('A','A','A','B','B', 'C', 'C', 'C', 'D', 'D'), c = c(1,1,1,1,1,2,2,2,2,2))
> DT
a b c
1: 1 A 1
2: 2 A 1
3: 3 A 1
4: 4 B 1
5: 5 B 1
6: 6 C 2
7: 7 C 2
8: 8 C 2
9: 9 D 2
10: 10 D 2
I want to add a column that shows the index grouped by c (restarting from 1 in each group of column c), but that only changes when the value of b changes. The result wanted is shown below:
Here are two ways to do this :
Using rleid :
library(data.table)
DT[, col := rleid(b), c]
With match + unique :
DT[, col := match(b, unique(b)), c]
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
We can use factor with levels specified and coerce it to integer
library(data.table)
DT[, col := as.integer(factor(b, levels = unique(b))), c]
-output
DT
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
Or using base R with rle
with(DT, as.integer(ave(b, c, FUN = function(x)
with(rle(x), rep(seq_along(values), lengths)))))

R find intervals in data.table

I want to add a new column with intervals or breakpoints by group. As an example:
This is my data.table:
x <- data.table(a = c(1:8,1:8), b = c(rep("A",8),rep("B",8)))
I already have the breakpoints or row indices:
pos <- data.table(b = c("A","A","B","B"), bp = c(3,5,2,4))
Here I can find the interval for group "A" with:
findInterval(1:nrow(x[b=="A"]), pos[b=="A"]$bp)
How can I do this for each group, in this case "A" and "B"?
An option is to split the datasets by 'b' column, use Map to loop over the corresponding lists, and apply findInterval
Map(function(u, v) findInterval(seq_len(nrow(u)), v$bp),
split(x, x$b), split(pos, pos$b))
#$A
#[1] 0 0 1 1 2 2 2 2
#$B
#[1] 0 1 1 2 2 2 2 2
or another option is to group by 'b' from 'x', then use findInterval by subsetting the 'bp' from 'pos' by filtering with a logical condition created based on .BY
x[, findInterval(seq_len(.N), pos$bp[pos$b==.BY]), b]
# b V1
# 1: A 0
# 2: A 0
# 3: A 1
# 4: A 1
# 5: A 2
# 6: A 2
# 7: A 2
# 8: A 2
# 9: B 0
#10: B 1
#11: B 1
#12: B 2
#13: B 2
#14: B 2
#15: B 2
#16: B 2
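The grouped call above returns a new table with an unnamed V1 column. If the goal is a new column on x itself, the same expression can probably be assigned by reference with :=. This is only a sketch; the column name intvl is my own choice, and it assumes the breakpoints in pos are sorted in increasing order within each group, which findInterval requires:

```r
library(data.table)

x   <- data.table(a = c(1:8, 1:8), b = c(rep("A", 8), rep("B", 8)))
pos <- data.table(b = c("A", "A", "B", "B"), bp = c(3, 5, 2, 4))

# For each group of b, look up the matching breakpoints in pos and
# count how many breakpoints each row index has reached
x[, intvl := findInterval(seq_len(.N), pos$bp[pos$b == .BY$b]), by = b]

x$intvl
# 0 0 1 1 2 2 2 2 0 1 1 2 2 2 2 2
```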
Another option using rolling join in data.table:
pos[, ri := rowid(b)]
x[, intvl := fcoalesce(pos[x, on=.(b, bp=a), roll=Inf, ri], 0L)]
output:
a b intvl
1: 1 A 0
2: 2 A 0
3: 3 A 1
4: 4 A 1
5: 5 A 2
6: 6 A 2
7: 7 A 2
8: 8 A 2
9: 1 B 0
10: 2 B 1
11: 3 B 1
12: 4 B 2
13: 5 B 2
14: 6 B 2
15: 7 B 2
16: 8 B 2
We can nest the pos data into list by b and join with x and use findInterval to get corresponding groups.
library(dplyr)
pos %>%
tidyr::nest(data = bp) %>%
right_join(x, by = 'b') %>%
group_by(b) %>%
mutate(interval = findInterval(a, data[[1]][[1]])) %>%
select(-data)
# b a interval
# <chr> <int> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 A 4 1
# 5 A 5 2
# 6 A 6 2
# 7 A 7 2
# 8 A 8 2
# 9 B 1 0
#10 B 2 1
#11 B 3 1
#12 B 4 2
#13 B 5 2
#14 B 6 2
#15 B 7 2
#16 B 8 2

Keep all the data.table when aggregating a data.table

I would like to aggregate a data.table by a list of columns and keep all the columns at the end.
A <- c(1,2,3,4,4,6,4)
B <- c("a","b","c","d","e","f","g")
C <- c(10,11,23,8,8,1,3)
D <- c(2,3,5,9,7,8,4)
dt <- data.table(A,B,C,D)
Now I want to aggregate column B, with something like paste(B, collapse=";"), by A and C, and keep column D at the end too. Do you know a way to do this, please?
EDIT
This is what I obtained using dt[, newCol := toString(B), .(A, C)]
A B C D newCol
1: 1 a 10 2 a
2: 2 b 11 3 b
3: 3 c 23 5 c
4: 4 d 8 9 d, e
5: 4 e 8 7 d, e
6: 6 f 1 8 f
7: 4 g 3 4 g
But I would like to obtain
A B C D newCol
1: 1 a 10 2 a
2: 2 b 11 3 b
3: 3 c 23 5 c
4: 4 d 8 9 d, e
6: 6 f 1 8 f
7: 4 g 3 4 g
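One possible way to get from the first table to the second, assuming "keep" means keeping the first row of each (A, C) group: add newCol by reference as in the edit, then drop the repeated rows with unique() and its by argument. A minimal sketch:

```r
library(data.table)

A <- c(1, 2, 3, 4, 4, 6, 4)
B <- c("a", "b", "c", "d", "e", "f", "g")
C <- c(10, 11, 23, 8, 8, 1, 3)
D <- c(2, 3, 5, 9, 7, 8, 4)
dt <- data.table(A, B, C, D)

# Collapse B within each (A, C) group; every row of the group gets the same string
dt[, newCol := toString(B), by = .(A, C)]

# Keep only the first row of each (A, C) group, with all columns intact
res <- unique(dt, by = c("A", "C"))
res
```

Here res has six rows, and the row for A = 4, C = 8 keeps B = "d" and D = 9 with newCol "d, e". If a different row of each group should survive, unique() also has a fromLast argument.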

Alternative to expand.grid for data.frames

I have a data.frame df and I want that every row in this df is duplicated lengthTime times and that a new column is added that counts from 1 to lengthTime for each row in df.
I know, it sounds pretty complicated, but what I basically want is to apply expand.grid to df. Here is an ugly workaround and I have the feeling that there must be an easier solution (maybe even a base-R function?):
df <- data.frame(ID = rep(letters[1:3], each=3),
CatA = rep(1:3, times = 3),
CatB = letters[1:9])
lengthTime <- 3
nrRow <- nrow(df)
intDF <- df
for (i in 1:(lengthTime - 1)) {
df <- rbind(df, intDF)
}
df$Time <- rep(1:lengthTime, each=nrRow)
I thought that I could just use expand.grid(df, 1:lengthTime), but that does not work. outer did not bring any luck either. So does anyone know a good solution?
It's been a while since this question was posted, but I recently came across it looking for just the thing in the title, namely, an expand.grid that works for data frames. The posted answers address the OP's more specific question, so in case anyone is looking for a more general solution for data frames, here's a slightly more general approach:
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
# For the example in the OP
expand.grid.df(df, data.frame(1:lengthTime))
# More generally
df1 <- data.frame(A=1:3, B=11:13)
df2 <- data.frame(C=51:52, D=c("Y", "N"))
df3 <- data.frame(E=c("+", "-"))
expand.grid.df(df1, df2, df3)
You can also just do a simple merge by NULL (which will cause merge to do simple combinatorial data replication):
merge(data.frame(time=1:lengthTime), iris, by=NULL)
Why not just something like df[rep(1:nrow(df),times = 3),] to extend the data frame, and then add the extra column just as you have above, with df$Time <- rep(1:lengthTime, each=nrRow)?
Quick update
There is now also the crossing() function in package tidyr which can be used instead of merge, is somewhat faster, and returns a tbl_df / tibble.
data.frame(time=1:10) %>% merge(iris, by=NULL)
data.frame(time=1:10) %>% tidyr::crossing(iris)
This works:
REP <- rep(1:nrow(df), 3)
df2 <- data.frame(df[REP, ], Time = rep(1:3, each = 9))
rownames(df2) <- NULL
df2
A data.table solution:
> library(data.table)
> ( df <- data.frame(ID = rep(letters[1:3], each=3),
+ CatA = rep(1:3, times = 3),
+ CatB = letters[1:9]) )
ID CatA CatB
1 a 1 a
2 a 2 b
3 a 3 c
4 b 1 d
5 b 2 e
6 b 3 f
7 c 1 g
8 c 2 h
9 c 3 i
> ( DT <- data.table(df)[, lapply(.SD, function(x) rep(x,3))][, Time:=rep(1:3, each=nrow(df))] )
ID CatA CatB Time
1: a 1 a 1
2: a 2 b 1
3: a 3 c 1
4: b 1 d 1
5: b 2 e 1
6: b 3 f 1
7: c 1 g 1
8: c 2 h 1
9: c 3 i 1
10: a 1 a 2
11: a 2 b 2
12: a 3 c 2
13: b 1 d 2
14: b 2 e 2
15: b 3 f 2
16: c 1 g 2
17: c 2 h 2
18: c 3 i 2
19: a 1 a 3
20: a 2 b 3
21: a 3 c 3
22: b 1 d 3
23: b 2 e 3
24: b 3 f 3
25: c 1 g 3
26: c 2 h 3
27: c 3 i 3
Another one :
> library(data.table)
> ( df <- data.frame(ID = rep(letters[1:3], each=3),
+ CatA = rep(1:3, times = 3),
+ CatB = letters[1:9]) )
> DT <- data.table(df)
> rbindlist(lapply(1:3, function(i) cbind(DT, Time=i)))
ID CatA CatB Time
1: a 1 a 1
2: a 2 b 1
3: a 3 c 1
4: b 1 d 1
5: b 2 e 1
6: b 3 f 1
7: c 1 g 1
8: c 2 h 1
9: c 3 i 1
10: a 1 a 2
11: a 2 b 2
12: a 3 c 2
13: b 1 d 2
14: b 2 e 2
15: b 3 f 2
16: c 1 g 2
17: c 2 h 2
18: c 3 i 2
19: a 1 a 3
20: a 2 b 3
21: a 3 c 3
22: b 1 d 3
23: b 2 e 3
24: b 3 f 3
25: c 1 g 3
26: c 2 h 3
27: c 3 i 3
