Keep empty groups when grouping with data.table in R

I want to keep empty groups (with a default value like NA or 0) when grouping by multiple conditions.
dt = data.table(user = c("A", "A", "B"), date = c("t1", "t2", "t1"), duration = c(1, 2, 1))
dt[, .("total" = sum(duration)), by = .(date, user)]
Result:
date user total
1: t1 A 1
2: t2 A 2
3: t1 B 1
Desired result:
date user total
1: t1 A 1
2: t2 A 2
3: t1 B 1
4: t2 B NA
One solution could be to add rows with 0 values before grouping, but that would require creating the Cartesian product of many columns and manually checking whether a value already exists for each combination. I would prefer a built-in / simpler approach.

You can try:
dt[CJ(user = user, date = date, unique = TRUE), on = .(user, date)]
user date duration
1: A t1 1
2: A t2 2
3: B t1 1
4: B t2 NA
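If you also want the aggregated total per combination (with NA for empty groups, as in the desired result), the cross join can be combined with by = .EACHI so that j is evaluated once per row of CJ(); a sketch of the same approach:
# unmatched combinations carry duration = NA from the join,
# so sum(duration) returns NA for the empty groups
dt[CJ(user = user, date = date, unique = TRUE), on = .(user, date),
   .(total = sum(duration)), by = .EACHI]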

Here is an option with complete from tidyr
library(tidyr)
library(dplyr)
dt1 <- dt[, .("total" = sum(duration)), by = .(date, user)]
dt1 %>%
complete(user, date)
# user date total
# <chr> <chr> <dbl>
#1 A t1 1
#2 A t2 2
#3 B t1 1
#4 B t2 NA
Or using dcast/melt
melt(dcast(dt, user ~ date, value.var = 'duration', fun.aggregate = sum),
     id.vars = 'user', variable.name = 'date', value.name = 'total')
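Note that with fun.aggregate = sum, dcast fills missing combinations by applying sum to a zero-length vector, which gives 0 rather than NA. If you want NA for empty groups as in the desired result, pass fill = NA; a small variant of the line above:
melt(dcast(dt, user ~ date, value.var = 'duration', fun.aggregate = sum, fill = NA),
     id.vars = 'user', variable.name = 'date', value.name = 'total')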

Related

Grouped recurrence by periods over a data.table

I have a dataset with names, dates, and several categorical columns. Let's say
data <- data.table(name = c('Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Ben', 'Ben', 'Cal'),
period = c(1,1,1,1,1,1,2,2,2,3,3),
category = c("A","A","A","B","B","B","A","B","A","A","B"))
Which looks like this:
name period category
Anne 1 A
Ben 1 A
Cal 1 A
Anne 1 B
Ben 1 B
Cal 1 B
Anne 2 A
Ben 2 B
Ben 2 A
Ben 3 A
Cal 3 B
I want to compute, for each period, how many names were present in the past period, for every group of my categorical variables. The output should be as follows:
period category recurrence_count
2 A 2 # due to Anne and Ben being on A, period 1
2 B 1 # due to Ben being on B, period 1
3 A 1 # due to Ben being on A, period 2
3 B 0 # no match from B, period 2
I am aware of the .I and .GRP operators in data.table, but I have no idea how to write the notion of 'next group' in the j entry of my statement. I imagine something like this might be a reasonable path, but I can't figure out the correct syntax:
data[, .(recurrence_count = length(intersect(name, name[last(.GRP)]))), by = .(category, period)]
You can first summarize your data by category and period.
previous_period_names <- data[, .(names = list(name)), .(category, period)]
previous_period_names[, next_period := period + 1]
Join your summary with your original data.
data[previous_period_names, names := i.names, on = c('category', 'period==next_period')]
Now count, for each period and category, how many of the names appear in the summarized names of the previous period.
data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]
Another data.table alternative. For rows that can have a previous period (period != 1), create such a variable (prev_period := period - 1).
Join original data with a subset that has values for 'prev_period' (data[data[!is.na(prev_period)]]). Join on 'category', 'period = prev_period' and 'name'.
In the resulting data set, for each 'period' and 'category' (by = .(period = i.period, category)), count the number of names from original data (x.name) that had a match with previous period (length(na.omit(x.name))).
data[period != 1, prev_period := period - 1]
data[data[!is.na(prev_period)], on = c("category", period = "prev_period", "name"),
.(category, i.period, x.name)][
, .(n = length(na.omit(x.name))), by = .(period = i.period, category)]
# period category n
# 1: 2 A 2
# 2: 2 B 1
# 3: 3 A 1
# 4: 3 B 0
One option in base R is to split the 'data' by 'category' and then loop over the list (lapply). For each list element, apply Reduce with intersect on the 'name' values split by 'period', with accumulate = TRUE; take the lengths of that list and build a data.frame together with the unique elements of 'period'. Then use Map to add the 'category' from the names of the list output, and rbind the list of data.frames into a single dataset.
library(data.table)
lst1 <- lapply(split(data, data$category), function(x)
data.frame(period = unique(x$period)[-1],
recurrence_count = lengths(Reduce(intersect,
split(x$name, x$period), accumulate = TRUE)[-1])))
rbindlist(Map(cbind, category = names(lst1), lst1))[
order(period), .(period, category, recurrence_count)]
# period category recurrence_count
#1: 2 A 2
#2: 2 B 1
#3: 3 A 1
#4: 3 B 0
Or, using the same logic within data.table: grouped by 'category', split 'name' by 'period' and apply Reduce with intersect.
setDT(data)[, .(period = unique(period),
recurrence_count = lengths(Reduce(intersect,
split(name, period), accumulate = TRUE))), .(category)][duplicated(category)]
# category period recurrence_count
#1: A 2 2
#2: A 3 1
#3: B 2 1
#4: B 3 0
Or a similar option in the tidyverse
library(dplyr)
library(purrr)
data %>%
  group_by(category) %>%
  summarise(recurrence_count = lengths(accumulate(split(name, period),
            intersect)), period = unique(period), .groups = 'drop') %>%
  filter(duplicated(category))
# A tibble: 4 x 3
# category recurrence_count period
# <chr> <int> <int>
#1 A 2 2
#2 A 1 3
#3 B 1 2
#4 B 0 3
data
data <- structure(list(name = c("Anne", "Ben", "Cal", "Anne", "Ben",
"Cal", "Anne", "Ben", "Ben", "Ben", "Cal"), period = c(1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), category = c("A", "A", "A",
"B", "B", "B", "A", "B", "A", "A", "B")), class = "data.frame",
row.names = c(NA,
-11L))
A data.table option
setDT(data)[
  ,
  {
    # names present in each period, within this category
    u <- split(name, period)
    data.table(
      period = unique(period)[-1],
      # intersect each period's names with the following period's names
      recurrence_count = lengths(
        Map(
          intersect,
          head(u, -1),
          tail(u, -1)
        )
      )
    )
  },
  category
]
gives
category period recurrence_count
1: A 2 2
2: A 3 1
3: B 2 1
4: B 3 0
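To see what Map(intersect, head(u, -1), tail(u, -1)) computes, here it is spelled out for category A with the example data:
u <- list(`1` = c("Anne", "Ben", "Cal"), `2` = c("Anne", "Ben"), `3` = "Ben")
Map(intersect, head(u, -1), tail(u, -1))
# $`1`
# [1] "Anne" "Ben"  # period 1 vs period 2 -> count 2
# $`2`
# [1] "Ben"         # period 2 vs period 3 -> count 1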

Multiple first and last non-NA values by group

I have the following data.table:
require(data.table)
dt = data.table(
id = c(rep('Grp 1', 31), rep('Grp 2', 31)),
date = rep(as.IDate(as.IDate('2020-01-01') : as.IDate('2020-01-31')), 2),
change = c(rep(NA, 5), rep('yes', 5), rep(NA, 10), rep('yes', 3), rep(NA, 8),
rep(NA, 2), rep('yes', 8), rep(NA, 8), rep('yes', 5), rep(NA, 8))
)
For every group id I'd like to filter the first and last dates of a series, which is defined by a second column change being yes (i.e. non-NA). I can do the following, which would provide me with the first and last non-NA row by group. However, the problem is that the series occurs more than once per group.
dt[ !is.na(change),
.(head(date, 1),
tail(date, 1)),
.(id) ]
These are the row indices I'd like to have filtered:
dt[c(6,10,21,23,34,41,50,54)]
One way is to give a unique group id to each streak identified by an id and change combination. We can use rleid to generate such run-length type ids. Consider something like this
dt[,
gid := rleid(id, change)
][!is.na(change),
as.list(range(date)),
by = .(id, gid)
][,
gid := NULL
]
Note that I also assume that you want the range of dates, not really the first and last elements. Your method will fail if the dates are not in chronological order. Output looks like this
id V1 V2
1: Grp 1 2020-01-06 2020-01-10
2: Grp 1 2020-01-21 2020-01-23
3: Grp 2 2020-01-03 2020-01-10
4: Grp 2 2020-01-19 2020-01-23
rleid works like this
> rleid(c(1, 1, 2, 3, 3), c("a", "b", "b", "d", "d"))
[1] 1 2 3 4 4
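If you literally want the first and last dates of each streak in stored order (rather than the min/max that range gives), a sketch of the same rleid idea using data.table's first/last helpers; with the example data both versions agree because the dates are sorted:
dt[, gid := rleid(id, change)][
  !is.na(change),
  .(first = first(date), last = last(date)),
  by = .(id, gid)
][, gid := NULL][]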
Here is an option with dplyr
library(dplyr)
library(data.table)
dt %>%
group_by(grp = rleid(id, change), id) %>%
filter(!is.na(change)) %>%
summarise(V1 = min(date, na.rm = TRUE),
V2 = max(date, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 4 x 4
# grp id V1 V2
# <int> <chr> <date> <date>
#1 2 Grp 1 2020-01-06 2020-01-10
#2 4 Grp 1 2020-01-21 2020-01-23
#3 7 Grp 2 2020-01-03 2020-01-10
#4 9 Grp 2 2020-01-19 2020-01-23

Convert to wide format and set 0 if value does not exist

I have following dataset:
dataset1 <- data.frame(
bnames = c("T1", "T1", "T2", "T3", "T3"),
events = c("I", "O", "I", "I", "O"),
freq = c(1,2,3,4,5))
I want to convert this dataset to wide format, my approach (using reshape package):
dataset2 <- melt(dataset1, id.vars = c("bnames", "events"))
dataset2 <- dataset2[c("bnames", "events", "value")]
names(dataset2) <- c("bnames", "events", "freq")
content of dataset2:
bnames events freq
1 T1 I 1
2 T1 O 2
3 T2 I 3
4 T3 I 4
5 T3 O 5
But there should always be two rows with the same name under the bnames column: one row with I and another with O under the events column. If the corresponding value does not exist in the original dataset (dataset1), then the value under freq should be 0. So my desired result in this case should be:
bnames events freq
1 T1 I 1
2 T1 O 2
3 T2 I 3
4 T2 O 0
5 T3 I 4
6 T3 O 5
How to do this? Thanks
Here's one way in base R:
left_hand <- expand.grid(
bnames = unique(dataset1$bnames),
events = c("I", "O"),
stringsAsFactors = FALSE
)
dataset2 <- merge(left_hand, dataset2, all.x = TRUE)
dataset2[is.na(dataset2)] <- 0
Alternatively, there is a one-liner in tidyr package:
tidyr::complete(dataset2, bnames, events, fill = list(freq = 0))
Here is a data.table solution. Generate all possible combinations of bnames and events, left join this set with the original dataset, and return the frequency if available, else 0.
library(data.table)
setDT(dataset1)[CJ(bnames=bnames, events=events, unique=TRUE),
.(freq=ifelse(is.na(freq), 0, freq)),
by=.EACHI,
on=.(bnames, events)]
# bnames events freq
#1: T1 I 1
#2: T1 O 2
#3: T2 I 3
#4: T2 O 0
#5: T3 I 4
#6: T3 O 5
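The dcast/melt round trip from the first question applies here as well; with fun.aggregate = sum, missing combinations are filled by sum on a zero-length vector, i.e. 0, which is exactly the desired default here (a sketch, assuming data.table is loaded):
melt(dcast(setDT(dataset1), bnames ~ events, value.var = "freq", fun.aggregate = sum),
     id.vars = "bnames", variable.name = "events", value.name = "freq")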

Data.Table rolling join by group

How can I find the last value, prior to test.day, for each (loc.x, loc.y) pair?
dt <- data.table(
loc.x = as.integer(c(1, 1, 3, 1, 3, 1)),
loc.y = as.integer(c(1, 2, 1, 2, 1, 2)),
time = as.IDate(c("2015-03-11", "2015-05-10", "2015-09-27",
"2015-11-25", "2014-09-13", "2015-08-19")),
value = letters[1:6]
)
setkey(dt, loc.x, loc.y, time)
test.day <- as.IDate("2015-10-01")
Required output:
loc.x loc.y value
1: 1 1 a
2: 1 2 f
3: 3 1 c
Another option is to use the last function:
dt[, last(value[time < test.day]), by = .(loc.x, loc.y)]
which gives:
loc.x loc.y V1
1: 1 1 a
2: 1 2 f
3: 3 1 c
You can first subset the rows where time < test.day (which should be quite efficient because it is not done by group) and then select the last value per group. To do that you can either use tail(value, 1L) or, as suggested by Floo0, value[.N], resulting in:
dt[time < test.day, tail(value, 1L), by = .(loc.x, loc.y)]
# loc.x loc.y V1
#1: 1 1 a
#2: 1 2 f
#3: 3 1 c
or
dt[time < test.day, value[.N], by = .(loc.x, loc.y)]
Note that this works because the data is sorted due to setkey(dt, loc.x, loc.y, time).
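If the data were not sorted by time, a variant that does not rely on row order is to select the value at the latest qualifying time explicitly:
dt[time < test.day, .(value = value[which.max(time)]), by = .(loc.x, loc.y)]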
Here's another option using a rolling join after creating a lookup table; with roll = TRUE, each lookup row is matched to the last observation at or before its time (last observation carried forward):
indx <- data.table(unique(dt[ ,.(loc.x, loc.y)]), time = test.day)
dt[indx, roll = TRUE, on = names(indx)]
# loc.x loc.y time value
# 1: 1 1 2015-10-01 a
# 2: 1 2 2015-10-01 f
# 3: 3 1 2015-10-01 c
Or a very similar option suggested by @eddi
dt[dt[, .(time = test.day), by = .(loc.x, loc.y)], roll = TRUE, on = c('loc.x', 'loc.y', 'time')]
Or a one-liner, which will be less efficient because it calls [.data.table once per group:
dt[,
.SD[data.table(test.day), value, roll = TRUE, on = c(time = "test.day")],
by = .(loc.x, loc.y)
]
# loc.x loc.y V1
# 1: 1 1 a
# 2: 1 2 f
# 3: 3 1 c

Get number of same individuals for different groups

I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know the number of individuals shared by each pair of groups.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanation
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join the dataset with itself on 'ID' and subset the rows where the 'Group' columns differ, then get the number of rows (.N) grouped by the 'Group' columns. Finally, we sort the 'Group.1' and 'Group.2' columns row-wise using pmin/pmax and take the unique value of 'N'.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or, as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group' and 'i.Group', and get the number of rows (.N):
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above. Just an alternative using dplyr, in case you or someone else are interested.
library(dplyr)
cmb = combn(unique(DT$Group), 2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
yet another solution (base R):
tmp <- split(DT$ID, DT$Group)  # one vector of IDs per group
ans <- apply(combn(sort(unique(DT$Group)), 2), 2, FUN = function(ind){
  out <- length(intersect(tmp[[ind[1]]], tmp[[ind[2]]]))
  c(group1 = ind[1], group2 = ind[2], sum_ = out)
}
)
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
First split the IDs into a list by group; then, for each pairwise combination of two groups, count how many individuals they have in common using length(intersect(...)).
