My question is similar to this one: R count combinations of elements in groups. However, I additionally want, first, to list all potential id combinations per group in a column Comb and, second, to count the occurrences of each combination per year in a column n.
Using the same mock dataset:
> dat = data.table(group = c(1,1,1,2,2,2,3,3),
+                  id = c(10,11,12,10,11,13,11,13),
+                  year = c(2010,2010,2010,2011,2011,2011,2012,2012))
> dat
   group id year
1:     1 10 2010
2:     1 11 2010
3:     1 12 2010
4:     2 10 2011
5:     2 11 2011
6:     2 13 2011
7:     3 11 2012
8:     3 13 2012
The desired outcome:
> dat
   group  Comb year n
1:     1 10 11 2010 1
2:     1 11 12 2010 1
3:     1 12 10 2010 1
4:     2 10 11 2011 2
5:     2 11 13 2011 1
6:     2 13 10 2011 1
7:     3 11 13 2012 2
I would much appreciate a possible solution with dplyr.
Thanks.
Here's a solution, presented first with data.table and then with dplyr. The process is the same in both: self-join on group, keep only rows where the id pair is in a consistent order (any order would work; we pick first id < second id) so each combination appears once, group by the combination to number its occurrences, and drop the unused columns.
dat = data.table(group = c(1,1,1,2,2,2,3,3), id=c(10,11,12,10,11,13,11,13))
## with data.table
merge(dat, dat, by = "group", allow.cartesian = TRUE)[
  id.x < id.y][
  , Comb := paste(id.x, id.y)][
  , n := 1:.N, by = .(Comb)][
  , .(group, Comb, n)]
#    group  Comb n
# 1:     1 10 11 1
# 2:     1 10 12 1
# 3:     1 11 12 1
# 4:     2 10 11 2
# 5:     2 10 13 1
# 6:     2 11 13 1
# 7:     3 11 13 2
## with dplyr
library(dplyr)

dat %>%
  full_join(dat, by = "group") %>%
  filter(id.x < id.y) %>%
  group_by(Comb = paste(id.x, id.y)) %>%
  mutate(n = row_number()) %>%
  select(group, Comb, n)
# # A tibble: 7 x 3
# # Groups:   Comb [5]
#   group Comb      n
#   <dbl> <chr> <int>
# 1     1 10 11     1
# 2     1 10 12     1
# 3     1 11 12     1
# 4     2 10 11     2
# 5     2 10 13     1
# 6     2 11 13     1
# 7     3 11 13     2
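The question's desired output also keeps year, which the solution above drops. A minimal tweak (not from the original answer), assuming year is constant within each group so it can simply be joined on and carried through:

library(data.table)
library(dplyr)

## the question's data, including year
dat <- data.table(group = c(1,1,1,2,2,2,3,3),
                  id    = c(10,11,12,10,11,13,11,13),
                  year  = c(2010,2010,2010,2011,2011,2011,2012,2012))

dat %>%
  full_join(dat, by = c("group", "year")) %>%  # join on year too, so it survives
  filter(id.x < id.y) %>%
  group_by(Comb = paste(id.x, id.y)) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  select(group, Comb, year, n)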
I need some help with grouping data by runs of consecutive values.
If I have this data.table
dt <- data.table::data.table(a = c(1,1,1,2,2,2,2,1,1,2), b = 1:10, c = 1:10 + 1)
     a  b  c
 1:  1  1  2
 2:  1  2  3
 3:  1  3  4
 4:  2  4  5
 5:  2  5  6
 6:  2  6  7
 7:  2  7  8
 8:  1  8  9
 9:  1  9 10
10:  2 10 11
I need a group for every run of consecutive equal values in column a. For each such group I need the first (also the minimum possible) value of column b and the last (also the maximum possible) value of column c.
Like this:
   a  b  c
1: 1  1  4
2: 2  4  8
3: 1  8 10
4: 2 10 11
Thank you very much for your help. I can't manage to solve this on my own.
We can use rleid() to assign an id to each run of consecutive equal values in a, then take the first b and the last c within each run:
> dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
   a  b  c
1: 1  1  4
2: 2  4  8
3: 1  8 10
4: 2 10 11
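For reference, rleid() increments its counter every time the value changes, which is exactly what delimits the runs; a quick illustration on the values of column a:

library(data.table)
rleid(c(1,1,1,2,2,2,2,1,1,2))
#> [1] 1 1 1 2 2 2 2 3 3 4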
An option with dplyr
library(dplyr)
dt %>%
  group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
  summarise(across(a:b, first), c = last(c)) %>%
  select(-grp)
Output:
# A tibble: 4 × 3
      a     b     c
  <dbl> <int> <dbl>
1     1     1     4
2     2     4     8
3     1     8    10
4     2    10    11
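The cumsum(c(TRUE, diff(a) != 0)) expression is a base-R stand-in for data.table::rleid(); if data.table is available, rleid() can be used directly inside the dplyr pipeline:

dt %>%
  group_by(grp = data.table::rleid(a)) %>%
  summarise(across(a:b, first), c = last(c)) %>%
  select(-grp)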
I have a dataset (dt) like this in R:
n id   val
1 1&&2 10
2 3    20
3 4&&5 30
And what I want to get is
n id val
1 1  10
2 2  10
3 3  20
4 4  30
5 5  30
I know that to split ids I need to do something like this:
id_split <- strsplit(dt$id,"&&")
But how do I create new rows with the same val for ids which were initially together in a row?
You can cbind each split into a one-column matrix, build a small data frame per row with the corresponding val (which is recycled), and rbind the pieces together:
res <- do.call(rbind, Map(data.frame,
                          id = lapply(strsplit(dt$id, "&&"), cbind),
                          val = dt$val))
res <- cbind(n = 1:nrow(res), res)
res
#   n id val
# 1 1  1  10
# 2 2  2  10
# 3 3  3  20
# 4 4  4  30
# 5 5  5  30
You can use the lengths of the split id values to expand your rows, then set n to be a sequence along the expanded data frame, i.e.
l1 <- strsplit(as.character(dt$id), '&&')
res_df <- transform(dt[rep(seq_len(nrow(dt)), lengths(l1)), ],
                    id = unlist(l1),
                    n = seq_along(unlist(l1)))
which gives,
    n id val
1   1  1  10
1.1 2  2  10
2   3  3  20
3   4  4  30
3.1 5  5  30
You can remove the rownames with rownames(res_df) <- NULL
A data.table solution.
library(data.table)
DT <- fread('n id val
1 1&&2 10
2 3 20
3 4&&5 30')
DT[, .(id = unlist(strsplit(id, split = "&&"))), by = .(n, val)][, n := .I][]
#>    n val id
#> 1: 1  10  1
#> 2: 2  10  2
#> 3: 3  20  3
#> 4: 4  30  4
#> 5: 5  30  5
Created on 2020-05-08 by the reprex package (v0.3.0)
Note:
A more robust alternative is to group with by = 1:nrow(DT), so that each row is split independently; you then need to reassemble the other columns yourself, as sketched below.
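A minimal sketch of that variant (not the original answer's code): the grouping column created by by = 1:nrow(DT) is named nrow, and the final step renumbers n while dropping it:

DT[, .(val = val, id = unlist(strsplit(id, "&&"))),
   by = 1:nrow(DT)][, .(n = .I, id, val)]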
If anyone is looking for a tidyverse solution:
library(dplyr)
library(tidyr)

dt %>%
  separate(id, into = paste0("id", 1:2), sep = "&&") %>%
  pivot_longer(cols = c(id1, id2), names_to = "id_name", values_to = "id") %>%
  drop_na(id) %>%
  select(n, id, val)
Output:
# A tibble: 5 x 3
      n id      val
  <dbl> <chr> <dbl>
1     1 1        10
2     1 2        10
3     2 3        20
4     3 4        30
5     3 5        30
Edit:
As suggested by @sotos (and completely missed by me), there is a one-liner solution:
dt %>% separate_rows(id, sep = "&&")
which gives the same output:
# A tibble: 5 x 3
      n id      val
  <dbl> <chr> <dbl>
1     1 1        10
2     1 2        10
3     2 3        20
4     3 4        30
5     3 5        30
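Note that separate_rows() returns id as character (the <chr> column above); if a numeric id is needed, a follow-up mutate() converts it:

dt %>%
  separate_rows(id, sep = "&&") %>%
  mutate(id = as.integer(id))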
tstrsplit() on id from data.table can do the job:
library(data.table)
dt <- setDT(dt)[, .(id = tstrsplit(id, "&&")), by = c('n', 'val')]
dt[, n := seq(.N)]
dt
   n val id
1: 1  10  1
2: 2  10  2
3: 3  20  3
4: 4  30  4
5: 5  30  5
I hope someone can help with this. I have a data frame similar to this:
test <- data.frame(ID = c(1:24),
                   group = rep(c(1,1,1,1,1,1,2,2,2,2,2,2), 2),
                   year1 = rep(c(2018,2018,2018,2019,2019,2019), 4),
                   month1 = rep(c(1,2,3), 8))
Now I want to compute a cumulative sum per group, but with the following code the cumsum 'restarts' each year.
test2 <- test %>%
  group_by(group, year1, month1) %>%
  summarise(a = length(unique(ID))) %>%
  mutate(a = cumsum(a))
My desired output is:
   group year1 month1  a
1      1  2018      1  2
2      1  2018      2  4
3      1  2018      3  6
4      1  2019      1  8
5      1  2019      2 10
6      1  2019      3 12
7      2  2018      1  2
8      2  2018      2  4
9      2  2018      3  6
10     2  2019      1  8
11     2  2019      2 10
12     2  2019      3 12
You could first count the unique IDs for each group, year and month, and then take the cumulative sum within each group only:
library(dplyr)
test %>%
  group_by(group, year1, month1) %>%
  summarise(a = n_distinct(ID)) %>%
  group_by(group) %>%
  mutate(a = cumsum(a))
#    group year1 month1     a
#    <dbl> <dbl>  <dbl> <int>
#  1     1  2018      1     2
#  2     1  2018      2     4
#  3     1  2018      3     6
#  4     1  2019      1     8
#  5     1  2019      2    10
#  6     1  2019      3    12
#  7     2  2018      1     2
#  8     2  2018      2     4
#  9     2  2018      3     6
# 10     2  2019      1     8
# 11     2  2019      2    10
# 12     2  2019      3    12
With data.table, this can be done with
library(data.table)
setDT(test)[, .(a = uniqueN(ID)), by = .(group, year1, month1)
            ][, a := cumsum(a), by = group]
Let's say I have this kind of data:
> head(data)
  year type
1 1999    A
2 2018    B
3 2002    A
4 2001    B
5 2017    B
6 2017    A
How do I group the column 'year' by an interval defined by the user, say 2?
So the resulting data would look like this:
> head(data)
       Year Type Freq
1 1999-2000    A   12
2 1999-2000    B    5
3 2001-2002    A   23
4 2001-2002    B    6
5 2003-2004    A   30
6 2003-2004    B   15
I'm using this inside a Shiny app, and I've got this far, but it only works for one column:
period <- 1999:2004
n <- 2
interval <- split(period, ceiling(seq_along(period) / n))
year_interval <- unlist(lapply(interval, function(x) {
  paste(min(x), max(x), sep = " - ")
}))
Create data
library(tidyverse)
set.seed(10)
df <- tibble(year = sample(1999:2020, 30, T), type = sample(LETTERS[1:3], 30, T))
Group by intervals of N = 2 years. Integer division does the binning: year %/% N maps each year to a bin index (for example, 2004 %/% 2 and 2005 %/% 2 both equal 1002), and the interval label is rebuilt from that index.
N <- 2
df %>%
  mutate(g = year %/% N,
         years = paste0(g*N, '-', g*N + N - 1)) %>%
  count(years, type)
# # A tibble: 20 x 3
# # Groups:   years [10]
#    years     type      n
#    <chr>     <chr> <int>
#  1 2000-2001 A         2
#  2 2000-2001 B         1
#  3 2002-2003 C         1
#  4 2004-2005 A         1
#  5 2004-2005 B         2
#  6 2004-2005 C         2
#  7 2006-2007 A         3
#  8 2006-2007 B         2
#  9 2008-2009 A         2
# 10 2008-2009 B         1
# 11 2010-2011 A         1
# 12 2010-2011 B         1
# 13 2012-2013 A         1
# 14 2012-2013 C         3
# 15 2014-2015 B         2
# 16 2014-2015 C         1
# 17 2016-2017 A         1
# 18 2016-2017 B         1
# 19 2016-2017 C         1
# 20 2018-2019 B         1
For N = 3
N <- 3
df %>%
  mutate(g = year %/% N,
         years = paste0(g*N, '-', g*N + N - 1)) %>%
  count(years, type)
# # A tibble: 18 x 3
# # Groups:   years [7]
#    years     type      n
#    <chr>     <chr> <int>
#  1 1998-2000 A         1
#  2 1998-2000 B         1
#  3 2001-2003 A         1
#  4 2001-2003 C         1
#  5 2004-2006 A         2
#  6 2004-2006 B         4
#  7 2004-2006 C         2
#  8 2007-2009 A         4
#  9 2007-2009 B         1
# 10 2010-2012 A         1
# 11 2010-2012 B         1
# 12 2010-2012 C         3
# 13 2013-2015 A         1
# 14 2013-2015 B         2
# 15 2013-2015 C         1
# 16 2016-2018 A         1
# 17 2016-2018 B         2
# 18 2016-2018 C         1
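Since the interval comes from user input in the Shiny app, it may help to wrap the binning in a small helper function; bin_years is a hypothetical name, and the body is just the code above parameterized on N:

bin_years <- function(df, N) {
  df %>%
    mutate(g = year %/% N,                            # integer bin index
           years = paste0(g * N, '-', g * N + N - 1)) %>%
    count(years, type)
}

bin_years(df, 2)  # same result as the N = 2 example above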
I wish to count the number of times each combination of two elements appears in the same group.
For example, with:
> dat = data.table(group = c(1,1,1,2,2,2,3,3), id=c(10,11,12,10,11,13,11,13))
> dat
   group id
1:     1 10
2:     1 11
3:     1 12
4:     2 10
5:     2 11
6:     2 13
7:     3 11
8:     3 13
The expected result would be:
id.1 id.2 nb_common_appearances
  10   11 2 (in groups 1 and 2)
  10   12 1 (in group 1)
  11   12 1 (in group 1)
  10   13 1 (in group 2)
  11   13 2 (in groups 2 and 3)
Here is a data.table approach (roughly the same as @josilber's plyr answer). To unpack the j expression below: combn(id, 2) returns a two-row matrix of pairs, split(..., 1:2) separates its first and second rows into a two-element list, and wrapping that in c(id = ...) names the resulting columns id.1 and id.2:
pairs <- dat[, c(id = split(combn(id, 2), 1:2)), by = group]
pairs[, .N, by = .(id.1, id.2)]
#    id.1 id.2 N
# 1:   10   11 2
# 2:   10   12 1
# 3:   11   12 1
# 4:   10   13 1
# 5:   11   13 2
You might also consider viewing the results in a table:
pairs[, table(id.1, id.2)]
#      id.2
# id.1 11 12 13
#   10  2  1  1
#   11  0  1  2
You can use merges instead of combn: the self-join pairs each id with every id in the same group, and the id < i.id filter keeps each unordered pair exactly once:
setkey(dat, group)
dat[dat, allow.cartesian = TRUE][id < i.id, .N, by = .(id, i.id)]
Benchmarks. For large data, the merges can be a little faster (as hypothesized by @DavidArenburg). @Arun's answer is faster still; note that the indices() helper used in the third timing is defined in @Arun's answer below:
DT <- data.table(g = 1, id = 1:(1.5e3), key = "id")

system.time({a <- combn(DT$id, 2)})
#    user  system elapsed
#    0.81    0.00    0.81
system.time({b <- DT[DT, allow.cartesian = TRUE][id < i.id]})
#    user  system elapsed
#    0.13    0.00    0.12
system.time({d <- DT[, .(rep(id, (.N-1L):0L), id[indices(.N-1L)])]})
#    user  system elapsed
#    0.01    0.00    0.02
(I left out the group-by operation as I don't think it will be important to the timings.)
In defense of combn: the combn approach extends nicely to larger tuples, while the merge and @Arun's answer, though much faster for pairs, do not (as far as I can see):
DT2 <- data.table(g = rep(1:2, each = 5), id = 1:5)
tuple_size <- 4
tuples <- DT2[, c(id = split(combn(id, tuple_size), 1:tuple_size)), by = g]
tuples[, .N, by = setdiff(names(tuples), "g")]
#    id.1 id.2 id.3 id.4 N
# 1:    1    2    3    4 2
# 2:    1    2    3    5 2
# 3:    1    2    4    5 2
# 4:    1    3    4    5 2
# 5:    2    3    4    5 2
Another way using data.table:
require(data.table)
indices <- function(n) sequence(n:1L) + rep(1:n, n:1)
dat[, .(id1 = rep(id, (.N-1L):0L),
        id2 = id[indices(.N-1L)]),
    by = group
    ][, .N, by = .(id1, id2)]
#    id1 id2 N
# 1:  10  11 2
# 2:  10  12 1
# 3:  11  12 1
# 4:  10  13 1
# 5:  11  13 2
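For intuition about the helper: indices(n) returns, for each position i in 1..n, the indices i+1 through n+1 concatenated, so within each group rep(id, (.N-1L):0L) supplies the first member of every pair and id[indices(.N-1L)] the matching second member. A quick check with n = 2 (i.e. a group of three ids):

indices(2)  # using the indices() helper defined above
#> [1] 2 3 3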
You could reshape your data to have each pair in each group in a separate row (I've used split-apply-combine for that step) and then use count from the plyr package to count the frequency of unique rows:
library(plyr)
count(do.call(rbind, lapply(split(dat, dat$group),
                            function(x) t(combn(x$id, 2)))))
#   x.1 x.2 freq
# 1  10  11    2
# 2  10  12    1
# 3  10  13    1
# 4  11  12    1
# 5  11  13    2
Here is a dplyr approach, using combn to make the combinations.
dat %>%
  group_by(group) %>%
  do(as.data.frame(t(combn(.[["id"]], 2)))) %>%
  group_by(V1, V2) %>%
  summarise(n())
Source: local data frame [5 x 3]
Groups: V1

     V1    V2   n()
1    10    11     2
2    10    12     1
3    10    13     1
4    11    12     1
5    11    13     2
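As a side note, do() is superseded in recent dplyr. Assuming dplyr >= 1.1.0, reframe() expresses the same idea more directly (a sketch, not part of the original answer):

library(dplyr)

dat %>%
  group_by(group) %>%
  reframe(as.data.frame(t(combn(id, 2)))) %>%  # one row per pair within each group
  count(V1, V2)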