Group observations into specified number of groups according to id with data.table solution

I have the following data.table:
dt <- data.table(id = rep(1:5, 5), obs = rnorm(1, n = 25))[order(id)]
dt
id obs
1: 1 0.1470735
2: 1 1.6954685
3: 1 2.3947260
4: 1 2.1782338
5: 1 0.5168873
6: 2 -0.8879545
7: 2 1.9320034
8: 2 2.6269272
9: 2 1.5212627
10: 2 -0.1581711
The table has 5 distinct ids (numbers 1 through 5) and 5 observations (obs) for each id. I want to randomly group the ids into groups of X ids and create a new column holding the group label. For this example, let's say I want to end up with a data.table like this:
id obs group
1: 1 0.1470735 A
2: 1 1.6954685 A
3: 1 2.3947260 A
4: 1 2.1782338 A
5: 1 0.5168873 A
6: 2 -0.8879545 A
7: 2 1.9320034 A
8: 2 2.6269272 A
9: 2 1.5212627 A
10: 2 -0.1581711 A
Where ids 1 and 2 are assigned to group A, ids 3 and 4 are assigned to group B, and id 5 is assigned to group C.
My actual dataset is much larger and will not necessarily group evenly, but I do not need the groups to contain the same number of ids. I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine).
Could someone please help me with an elegant data.table way to accomplish this?

This is the same as @Shree's answer, just using length.out in rep and no dplyr.
I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine).
You can make an id table; assign groups there; and if necessary merge back:
# bigger, reproducible example
library(data.table)
max_per_group = 5
n_ids = 1e5+1
DT = data.table(id = rep(1:n_ids, each = max_per_group), obs = 1)
# make an id table
idDT = unique(DT[, "id"])
# randomly assign groups
idDT[, g := sample(rep(.I, each = max_per_group, length.out = .N))]
# merge back if needed
DT[idDT, on=.(id), g := i.g]
You refer to "my actual dataset" -- but R allows you to juggle multiple tables. Trying to do everything in one is almost always counterproductive.
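As a minimal sketch of the id-table approach on the question's 5-id example, the following assigns letter labels and checks how many ids land in each group. The names ids_per_group, lookup, block and sizes are hypothetical, and the seed is set only so the sketch is reproducible.

```r
library(data.table)

set.seed(1)            # assumption: seed only so the sketch is reproducible
ids_per_group <- 2L    # hypothetical name for the X in the question

dt <- data.table(id = rep(1:5, each = 5), obs = rnorm(25, mean = 1))

# one row per id, with block numbers 1,1,2,2,3 shuffled across the ids
lookup <- data.table(id = unique(dt$id))
lookup[, block := sample(rep(seq_len(.N), each = ids_per_group, length.out = .N))]
lookup[, group := LETTERS[block]]   # letter labels only cover up to 26 groups

# copy the label onto every observation of each id
dt[lookup, on = .(id), group := i.group]

# number of ids that ended up in each group
sizes <- dt[, .(n_ids = uniqueN(id)), by = group]
```

Which ids share a group changes with the seed, but no group receives more than ids_per_group ids, and the last block is simply smaller when the ids do not divide evenly.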

EDIT: Didn't notice that you needed this with data.table. I'll leave this here as an alternative.
I am creating a dataframe with id and randomly assigned group. This will be joined with your data to get groups for each record by id -
library(dplyr)
library(data.table)
dt <- data.table(id = rep(1:5, 5), obs = rnorm(1, n = 25))[order(id)]
max_per_group <- 5
n_ids <- length(unique(dt$id))
data.frame(id = unique(dt$id), grp = sample(rep(LETTERS, max_per_group), n_ids)) %>%
left_join(dt, ., by = "id")
id obs grp
1 1 1.28879713 S
2 1 1.04471197 S
3 1 0.36470847 S
4 1 0.46741567 S
5 1 1.07749891 S
6 2 1.73640785 K
7 2 1.61144042 K
8 2 2.85196859 K
9 2 1.84848117 K
10 2 2.11395863 K
11 3 0.88623462 S
12 3 2.11706351 S
13 3 1.29225433 S
14 3 0.30458037 S
15 3 -1.72070005 S
16 4 2.24593162 U
17 4 2.10346287 U
18 4 2.28724412 U
19 4 0.02978044 U
20 4 0.56234660 U
21 5 2.92050008 F
22 5 1.08048974 F
23 5 0.58885261 F
24 5 1.53299092 F
25 5 1.47271123 F

Related

Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
library(data.table)
example <- data.table(
DebtorId = c(1,1,1,5,5,5,6,7),
DupDebtorId = c(2,3,4,1,2,3,7,6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
a <- newdt[i]$DebtorId
b <- min(which(groups[,-1] == a, arr.ind=TRUE)[,1])
newdt[i]$group <- b
}
Output:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are 2 problems in my approach:
1. From the output, you can see that it fails to recognize that case 5 belongs to group 1.
2. The final loop is agonizingly slow, which would render it useless for my use case of 1M rows in my original data, and going the traditional := way does not work with which().
I'm not sure whether my approach could be optimized, or there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, we can build a graph from your data frame and then extract cluster membership. stack() is just an easy way to convert a named vector to data frame.
library(igraph)
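# df is the data frame of pairs from the question (named `example` there)
# graph_from_data_frame() and components() are the current names of
# graph.data.frame() and clusters()
g <- graph_from_data_frame(df)
df_membership <- components(g)$membership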
stack(df_membership)
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
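To get from the stacked membership vector to the table layout the question asks for, something like the following sketch may help. It assumes the question's example table, treats the pairs as undirected edges, and uses result as a hypothetical name.

```r
library(igraph)
library(data.table)

example <- data.table(
  DebtorId    = c(1, 1, 1, 5, 5, 5, 6, 7),
  DupDebtorId = c(2, 3, 4, 1, 2, 3, 7, 6)
)

# each row is an edge between two cases; direction does not matter here
g <- graph_from_data_frame(example, directed = FALSE)

# membership is a vector of component numbers named by vertex (the case id)
memb <- components(g)$membership

# one row per case, sorted by id
result <- data.table(
  DebtorId = as.numeric(names(memb)),
  group    = as.integer(memb)
)[order(DebtorId)]
```

Here cases 1 through 5 fall into one component and cases 6 and 7 into another, which matches the desired output in the question.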

Repeat sequence by group

I have the following dataframe:
a <- data.frame(
group1=factor(rep(c("a","b"),each=6,times=1)),
time=rep(1:6,each=1,times=2),
newcolumn = c(1,1,2,2,3,3,1,1,2,2,3,3)
)
I'm looking to replicate the output of newcolumn with a rep by group function (the time variable is there for ordering purposes). In other words, for each group, ordered by time, how can I assign a sequence 1,1,2,2,n,n? I also need a general solution (in the case that groups are of differing number of rows, or I want to repeat values 3,10,n times).
For instance, I can generate that sequence with this:
newcolumn=rep(1:3,each=2,times=2)
But that wouldn't work in a group by statement where group1 has differing rows.
We specify the length.out in the rep after grouping by 'group1'
library(dplyr)
a %>%
group_by(group1) %>%
mutate(new = rep(seq_len(ceiling(n()/2)), each = 2, length.out = n())) # ceiling() keeps the sequence from recycling in odd-sized groups
NOTE: times is not needed here. each repeats every element, and length.out trims the result to the group size (when length.out is given, it takes priority over times).
EDIT: Based on comments from @r2evans
A data.table alternative:
library(data.table)
DT <- as.data.table(a[1:2])
DT[order(time), newcolumn := rep(seq_len(ceiling(.N/2)), each = 2, length.out = .N), by = c("group1")]
DT
# group1 time newcolumn
# 1: a 1 1
# 2: a 2 1
# 3: a 3 2
# 4: a 4 2
# 5: a 5 3
# 6: a 6 3
# 7: b 1 1
# 8: b 2 1
# 9: b 3 2
# 10: b 4 2
# 11: b 5 3
# 12: b 6 3
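For the general case the question mentions (groups of differing size, or repeating each value k times), one alternative sketch is to derive the block number from the row position with integer division. The helper block_index and the uneven sample data are assumptions made for illustration.

```r
library(data.table)

# hypothetical helper: positions 1..n mapped to blocks of k consecutive rows
block_index <- function(n, k) (seq_len(n) - 1L) %/% k + 1L

# groups of unequal size: 5 rows in "a", 6 rows in "b"
DT <- data.table(
  group1 = c(rep("a", 5), rep("b", 6)),
  time   = c(1:5, 1:6)
)

setorder(DT, group1, time)                            # order within each group by time
DT[, newcolumn := block_index(.N, 2L), by = group1]   # repeat each value twice
```

Group "a" gets 1,1,2,2,3 and group "b" gets 1,1,2,2,3,3; changing the 2L changes how many times each value repeats.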

Row-wise difference in two list using Data.Table in R

I want to use data.table to incrementally find out new elements i.e. for every row, I'd see whether values in list have been seen before. If they are, we will ignore them. If not, we will select them.
I was able to wrap elements by group in a list, but I am unsure how I can find incremental differences.
Here's my attempt:
df = data.table::data.table(id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
df_wrapped=df[,.(Values=(list(unique(Value)))), by=id]
expected_output = data.table::data.table(id = c("A","B","C","D","E"),
Value = list(c(1,4,5,2,3),c(2,3),c(3),c(7),c(2,3,9)),
Diff=list(c(1,4,5,2,3),c(NA),c(NA),c(7),c(9)),
Count = c(5,0,0,1,1))
Thoughts about expected output:
For the first row, all elements are unique. So, we will include them in Diff column.
In the second row, 2,3 have occurred in row 1. So, we will ignore them. Ditto for row 3.
Similarly, 7 and 9 are seen for the first time in row 4 and 5, so we will include them.
Here's visual representation:
expected_output
id Value Diff Count
A 1,4,5,2,3 1,4,5,2,3 5
B 2,3 NA 0
C 3 NA 0
D 7 7 1
E 2,3,9 9 1
I'd appreciate any thoughts. I am only looking for data.table based solutions because of performance issues in my original dataset.
I am not sure why you specifically need to put them in a list, but otherwise I wrote a small piece that could help you.
df = data.table::data.table(id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
df = df[order(id, Value)]
df = df[duplicated(Value) == FALSE, diff := Value][]
df = df[, count := uniqueN(diff, na.rm = TRUE), by = id]
The outcome would be:
> df
id Value diff count
1: A 1 1 5
2: A 2 2 5
3: A 3 3 5
4: A 4 4 5
5: A 5 5 5
6: B 2 NA 0
7: B 3 NA 0
8: C 3 NA 0
9: D 7 7 1
10: E 2 NA 1
11: E 3 NA 1
12: E 9 9 1
Hope this helps, or at least gets you started.
Here is another possible approach:
library(data.table)
df = data.table(
id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
valset <- c()
df[, {
d <- setdiff(Value, valset)
valset <- unique(c(valset, Value))
.(Values=.(Value), Diff=.(d), Count=length(d))
},
by=.(id)]
output:
id Values Diff Count
1: A 1,4,5,2,3 1,4,5,2,3 5
2: B 2,3 0
3: C 3 0
4: D 7 7 1
5: E 2,3,9 9 1
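Another way to sketch the same idea without carrying state between groups: put the rows in order of each id's first appearance, flag the first occurrence of every value, and then summarise per id. The column names grp_order and first_seen and the result name res are hypothetical.

```r
library(data.table)

df <- data.table(
  id    = c('A','B','C','A','B','A','A','A','D','E','E','E'),
  Value = c(1,2,3,4,3,5,2,3,7,2,3,9)
)

# number the ids by first appearance and bring each id's rows together,
# keeping the original order of rows within an id
df[, grp_order := .GRP, by = id]
setorder(df, grp_order)

# a value counts as new only the first time it shows up in this ordering
df[, first_seen := !duplicated(Value)]

res <- df[, .(Diff = list(Value[first_seen]), Count = sum(first_seen)), by = id]
```

For this data the counts come out as 5, 0, 0, 1, 1 for A through E, and ids with nothing new get an empty vector in Diff rather than NA.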

Count number of shared observations between samples using dplyr

I have a list of observations grouped by samples. I want to find the samples that share the most identical observations. An identical observation is where the start and end number are both matching between two samples. I'd like to use R and preferably dplyr to do this if possible.
I've been getting used to using dplyr for simpler data handling but this task is beyond what I am currently able to do. I've been thinking the solution would involve grouping the start and end into a single variable: group_by(start,end) but I also need to keep the information about which sample each observation belongs to and compare between samples.
example:
sample start end
a 2 4
a 3 6
a 4 8
b 2 4
b 3 6
b 10 12
c 10 12
c 0 4
c 2 4
Here samples a, b and c share 1 observation (2, 4)
sample a and b share 2 observations (2 4, 3 6)
sample b and c share 2 observations (2 4, 10 12)
sample a and c share 1 observation (2 4)
I'd like an output like:
abc 1
ab 2
bc 2
ac 1
and also to see what the shared observations are if possible:
abc 2 4
ab 2 4
ab 3 6
etc
Thanks in advance
Here's something that should get you going:
library(dplyr)
# df is the data frame from the question (columns sample, start, end)
df %>%
group_by(start, end) %>%
summarise(
samples = paste(unique(sample), collapse = ""),
n = length(unique(sample)))
# Source: local data frame [5 x 4]
# Groups: start [?]
#
# start end samples n
# <int> <int> <chr> <int>
# 1 0 4 c 1
# 2 2 4 abc 3
# 3 3 6 ab 2
# 4 4 8 a 1
# 5 10 12 bc 2
Here is an idea via base R,
final_d <- data.frame(count1 = sapply(Filter(nrow, split(df, list(df$start, df$end))), nrow),
pairs1 = sapply(Filter(nrow, split(df, list(df$start, df$end))), function(i) paste(i[[1]], collapse = '')))
# count1 pairs1
#0.4 1 c
#2.4 3 abc
#3.6 2 ab
#4.8 1 a
#10.12 2 bc
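Both answers above list, for each observation, the samples that contain it. To count shared observations for every pair of samples, one possible sketch is a self-join on start and end. It assumes the question's data entered as a data frame, and the names matched and pairs are hypothetical.

```r
library(dplyr)

df <- data.frame(
  sample = rep(c("a", "b", "c"), each = 3),
  start  = c(2, 3, 4, 2, 3, 10, 10, 0, 2),
  end    = c(4, 6, 8, 4, 6, 12, 12, 4, 4),
  stringsAsFactors = FALSE
)

# pair up every two rows that have the same (start, end)
matched <- merge(df, df, by = c("start", "end"), suffixes = c("_x", "_y"))

# keep each unordered pair of samples once, then count shared observations
pairs <- matched %>%
  filter(sample_x < sample_y) %>%
  count(sample_x, sample_y, name = "shared")
```

This gives 2 for a and b, 1 for a and c, and 2 for b and c. The rows of matched left after the filter step also show which observations each pair shares.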

Get top k records per group, where k differs by group, in R data.table

I have two data.tables:
1. Values to extract the top k from, per group.
2. A mapping from group to the k values to select for that group.
The question "How to find the top N values by group or within category (groupwise) in an R data.frame" addresses this when k does not vary by group. How can I do this when k differs by group? Here's sample data and the desired result:
Values:
(dt <- data.table(id=1:10,
group=c(rep(1, 5), rep(2, 5))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
Mapping from group to k:
(group.k <- data.table(group=1:2,
k=2:3))
# group k
# 1: 1 2
# 2: 2 3
Desired result, which should include the first two records from group 1 and the first three records from group 2:
(result <- data.table(id=c(1:2, 6:8),
group=c(rep(1, 2), rep(2, 3))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 6 2
# 4: 7 2
# 5: 8 2
Applying the solution to the above-linked question after merging returns this error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE
I'd rather do it as:
dt[group.k, head(.SD, k), by=.EACHI, on="group"]
because it's quite clear to see what the intended operation is. j can be .SD[1:k] of course. Both these expressions will very likely be (further) optimised (for speed) in the next release.
See this post for a detailed explanation of by=.EACHI until we wrap those vignettes.
After merging in the k by group, a similar approach to https://stackoverflow.com/a/14800271/1840471's solution can be applied, you just need a unique to avoid the length(n) error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
# group id k
# 1: 1 1 2
# 2: 1 2 2
# 3: 2 6 3
# 4: 2 7 3
# 5: 2 8 3
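A further alternative, sketched here under the assumption that "top k" means the first k rows of each group in their current order: add k to dt with an update join, then filter on the within-group row counter from rowid().

```r
library(data.table)

dt      <- data.table(id = 1:10, group = rep(1:2, each = 5))
group.k <- data.table(group = 1:2, k = 2:3)

# bring k onto each row of dt (this adds a column to dt by reference)
dt[group.k, on = "group", k := i.k]

# rowid(group) counts 1, 2, 3, ... within each group
result <- dt[rowid(group) <= k, .(id, group)]
```

This keeps ids 1 and 2 from group 1 and ids 6, 7 and 8 from group 2, without needing a by= step.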
