count number of combinations by group - r

I am struggling to count the number of unique combinations in my data. I would like to first group the rows by id and then count how many times each combination of values occurs. It does not matter whether the elements are combined as 'd-f' or 'f-d'; they still belong to the same category, as they have the same elements:
combinations:
n
c-f: 2 # also f-c
c-d-f: 1 # also cfd or fdc
d-f: 2 # also f-d. The dash is only for visualization purposes
Dummy example:
# my data
dd <- data.frame(id = c(1,1,2,2,2,3,3,4, 4, 5,5),
cat = c('c','f','c','d','f','c','f', 'd', 'f', 'f', 'd'))
> dd
id cat
1 1 c
2 1 f
3 2 c
4 2 d
5 2 f
6 3 c
7 3 f
8 4 d
9 4 f
10 5 f
11 5 d
Using paste is a great solution provided by @benson23, but it treats f-d and d-f as distinct categories. I would like the order not to matter. Thank you!

Create a "combination" column in summarise; we can then count this column.
An easy way to make the order irrelevant is to sort the rows by cat first, so that the categories within every id are always pasted in the same order.
library(dplyr)
dd %>%
group_by(id) %>%
arrange(id, cat) %>%
summarize(combination = paste0(cat, collapse = "-"), .groups = "drop") %>%
count(combination)
# A tibble: 3 x 2
combination n
<chr> <int>
1 c-d-f 1
2 c-f 2
3 d-f 2
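As a variation on the answer above, the sorting can also happen inside summarise, so the result does not depend on how the rows are arranged beforehand. This is only a sketch: the use of unique() is an assumption on my part (it collapses a category repeated within one id, which the example data never exercises), and `res` is just an illustrative name.

```r
library(dplyr)

dd <- data.frame(
  id  = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5),
  cat = c("c", "f", "c", "d", "f", "c", "f", "d", "f", "f", "d")
)

res <- dd %>%
  group_by(id) %>%
  # sort() fixes the order of the categories, so f-d and d-f both become "d-f"
  summarise(combination = paste(sort(unique(cat)), collapse = "-"),
            .groups = "drop") %>%
  count(combination)

res
```

For the example data this should give the same three rows as the output shown above.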

Related

How to order grouped rows while keeping duplicates together [duplicate]

I have a dataframe with several "people".
There are repeat instances for "people", however, the measured "value" is different in each instance.
Here is an example of dataframe.
df2 <- data.frame(
value = c(1, 2, 3, 4, 5),
people = c("d", "c", "b", "d", "b")
)
which looks like:
value people
1 d
2 c
3 b
4 d
5 b
I would like to keep the rows for each person together, order the groups by their first (smallest) "value", and, within the groups, sort by "value".
That is, I want to keep duplicates together while sorting by value.
Here is how I would like the data to look:
value people
1 d
4 d
2 c
3 b
5 b
I have made multiple attempts with group_by and arrange from {dplyr}, but it seems I am missing something.
Thanks for the help.
I have made a change - for clarity, I do not want "people" sorted alphabetically - this is a schedule in reality - person D has the first appointment (1), and his second appointment is 4. I want them to appear first and together. Person C has a 2nd appointment. Person B has a 3rd appointment, his other appointment is 5. I hope this makes it more clear. Thanks again
You can use arrange in this form :
library(dplyr)
df2 %>%
arrange(value) %>%
arrange(match(people, unique(people)))
# value people
#1 1 d
#2 4 d
#3 2 c
#4 3 b
#5 5 b
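To see why match(people, unique(people)) keeps duplicates together, here is a minimal base R sketch of the same idea, broken into steps. The names `sorted`, `first_seen` and `res` are only illustrative.

```r
df2 <- data.frame(
  value  = c(1, 2, 3, 4, 5),
  people = c("d", "c", "b", "d", "b"),
  stringsAsFactors = FALSE
)

# Sort by value first, so each person's first row carries their earliest value
sorted <- df2[order(df2$value), ]

# Position of each person's first appearance: d -> 1, c -> 2, b -> 3
first_seen <- match(sorted$people, unique(sorted$people))

# Rows that tie on first_seen keep their current (value-sorted) order
res <- sorted[order(first_seen), ]
res
```

Every row of a person gets the same rank as that person's first row, which is what pulls the later appointments up next to the first one.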
This is longer, but it also works:
df2 %>% group_by(people) %>% arrange(value) %>%
mutate(d = first(value)) %>% arrange(d) %>% ungroup() %>% select(-d)
# A tibble: 5 x 2
value people
<dbl> <chr>
1 1 d
2 4 d
3 2 c
4 3 b
5 5 b
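The following base R one-liner reproduces your result, but only because the names in this example happen to appear in reverse alphabetical order (d, c, b); it sorts people alphabetically, which the question says is not wanted in general: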
df2[order(df2$people, decreasing = TRUE),]
# value people
# 1 1 d
# 4 4 d
# 2 2 c
# 3 3 b
# 5 5 b
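A quick check of how this one-liner behaves when the order of first appearance is not reverse alphabetical. The data frame `df3` is a hypothetical variation of the question's data (person "a" now has the first appointment), and the second expression is a base R version of the first-appearance idea, included here only for comparison.

```r
# Hypothetical variation of df2: "a" has appointments 1 and 4
df3 <- data.frame(
  value  = c(1, 2, 3, 4, 5),
  people = c("a", "c", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Alphabetical (decreasing) sort: "c" now comes first, not "a"
alphabetical <- df3[order(df3$people, decreasing = TRUE), ]

# Ordering by first appearance, then by value, keeps the schedule order
schedule <- df3[order(match(df3$people, unique(df3$people)), df3$value), ]

alphabetical$people
schedule$people
```

Here the alphabetical version starts with person "c", while the schedule-based version starts with person "a", as the question's edit asks for.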

R dplyr: filter common values by group

I need to find common values between different groups ideally using dplyr and R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
the expected output is
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not work:
# Filter the data
dd %>%
group_by(group) %>%
filter(all(val)) # does not work
The example here solves a similar issue but relies on a predefined vector of shared values. What if I do not know which ones are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
"b", "b", "b",
"c", "c")
val = c(1,2,3,
3,4,5,
1,3)
dd <- data.frame(group,
val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values have all the groups:
dd %>%
group_by(val) %>%
filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb - n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed.) Whereas n_distinct(group) is using the grouped data piped in to filter, thus it gives the number of distinct groups for each value (because we group_by(val)).
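The pre-computed variant mentioned above might look like the following sketch. The variable name `n_groups` is my own, and the trailing ungroup() is an optional addition that drops the grouping by val from the result.

```r
library(dplyr)

dd <- data.frame(
  group = c("a", "a", "a", "b", "b", "b", "c", "c"),
  val   = c(1, 2, 3, 3, 4, 5, 1, 3),
  stringsAsFactors = FALSE
)

# Total number of groups, computed once on the ungrouped data
n_groups <- n_distinct(dd$group)

res <- dd %>%
  group_by(val) %>%
  # keep a value only if it occurs in every group
  filter(n_distinct(group) == n_groups) %>%
  ungroup()

res
```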
A base R approach can be:
#Code
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
group val
3 a 3
4 b 3
8 c 3
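The base R one-liner above can be unpacked into its steps, which may make it easier to follow. This is the same computation with intermediate names (`by_group`, `common`) added purely for illustration.

```r
dd <- data.frame(
  group = c("a", "a", "a", "b", "b", "b", "c", "c"),
  val   = c(1, 2, 3, 3, 4, 5, 1, 3),
  stringsAsFactors = FALSE
)

# One vector of values per group: a = 1 2 3, b = 3 4 5, c = 1 3
by_group <- split(dd$val, dd$group)

# Fold intersect() over the list: intersect(intersect(a, b), c)
common <- Reduce(intersect, by_group)

# Keep only the rows whose value is shared by all groups
newd <- dd[dd$val %in% common, ]
newd
```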
A similar option in data.table, along the lines of @GregorThomas's solution, is
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]

Rolling sum of one variable in data.frame in number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.
I have this:
df <- data.frame(
a = rep(1:2, each = 5),
b = 1:10,
step_window = c(2,3,1,2,4, 1,2,3,2,1)
)
I'm trying to sum up the values of b, within the groups a. The trick is that, for each row, I want the sum of the b values over a window that starts at the current row and spans the number of rows given by step_window.
This is the output I'm looking for:
data.frame(
a = rep(1:2, each = 5),
step_window = c(2,3,1,2,4,
1,2,3,2,1),
b = 1:10,
sum_b_step_window = c(3, 9, 3, 9, 5,
6, 15, 27, 19, 10)
)
I tried to do this using RcppRoll, but I get the error "Expecting a single value":
df %>%
group_by(a) %>%
mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))
I'm not sure whether a variable window size is possible in any of the rolling functions. Here is one way to do this using map2_dbl:
library(dplyr)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window,
~sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))
# a b step_window sum_b_step_window
# <int> <int> <dbl> <dbl>
# 1 1 1 2 3
# 2 1 2 3 9
# 3 1 3 1 3
# 4 1 4 2 9
# 5 1 5 4 5
# 6 2 6 1 6
# 7 2 7 2 15
# 8 2 8 3 27
# 9 2 9 2 19
#10 2 10 1 10
1) rollapply
rollapply in zoo supports vector widths. partial=TRUE says that if the width goes past the end of the data then just the values within the data are used. (Another possibility would be to use fill=NA instead, in which case it would fill with NAs where there is not enough data left.) align="left" specifies that the current value at each step is the left end of the range to sum.
library(dplyr)
library(zoo)
df %>%
group_by(a) %>%
mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
ungroup
2) SQL
This can also be done in SQL by left joining df to itself on the indicated condition and then for each row summing over all rows for which the condition matches.
library(sqldf)
sqldf("select A.*, sum(B.b) as sum
from df A
left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
and A.a = B.a
group by A.rowid")
Here is a solution with the package slider.
library(dplyr)
library(slider)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = hop_vec(b, row_number(), step_window+row_number()-1, sum)) %>%
ungroup()
It is flexible on different window sizes.
Output:
# A tibble: 10 x 4
a b step_window sum_b_step_window
<int> <int> <dbl> <int>
1 1 1 2 3
2 1 2 3 9
3 1 3 1 3
4 1 4 2 9
5 1 5 4 5
6 2 6 1 6
7 2 7 2 15
8 2 8 3 27
9 2 9 2 19
10 2 10 1 10
slider is a relatively new package (at the time of writing) from the tidyverse ecosystem, dedicated to sliding window functions. Have a look at its documentation page and vignette for more info.
hop is the engine of slider. With this solution we are passing different .starts and .stops to sum the values of b within each a group.
With _vec you're asking hop to return a vector: an integer vector in this case, as the output above shows.
row_number() is a dplyr function that returns the row number within each group, thus allowing you to slide along the rows.
A data.table solution using cumulative sums:
library(data.table)
setDT(df)
df[, sum_b_step_window := {
cs <- c(0,cumsum(b))
cs[pmin(.N+1, 1:.N+step_window)]-cs[pmax(1, (1:.N))]
},by = a]
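The cumulative-sum trick rests on the identity that the sum of b[i] through b[j] equals cs[j + 1] - cs[i] once a leading 0 is prepended to cumsum(b). Below is a base R sketch of that idea, assuming the same df as in the question; the helper `window_sum` is a hypothetical name, not part of any package.

```r
df <- data.frame(
  a = rep(1:2, each = 5),
  b = 1:10,
  step_window = c(2, 3, 1, 2, 4, 1, 2, 3, 2, 1)
)

# Sum of b over a window starting at each row and spanning w rows,
# truncated at the end of the vector
window_sum <- function(b, w) {
  n   <- length(b)
  cs  <- c(0, cumsum(b))            # cs[k + 1] is the sum of b[1:k]
  end <- pmin(n + 1, seq_len(n) + w) # clamp windows that run past the end
  cs[end] - cs[seq_len(n)]
}

# Apply the helper within each group of a and put the pieces back in order
parts <- lapply(split(df, df$a), function(g) window_sum(g$b, g$step_window))
df$sum_b_step_window <- unsplit(parts, df$a)
df
```

Because each window costs only two lookups into the cumulative sum, this avoids re-adding overlapping values for every row.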

n objects, a value for each combination of two objects, find minimum value for each object in R

I want to find the minimum value associated with an object out of a dataframe. The dataframe contains two columns representing all combinations of the objects and a value-column for each combination. It looks like this:
id_A id_B dist
206 208 2385.5096
207 208 467.8890
207 209 576.4631
...
208 209 1081.539
208 210 8214.439
...
I tried the following recommended dplyr functions:
df %>%
group_by(id_A) %>%
slice(which.min(dist))
But it does not produce the desired output:
id_A id_B dist
...
207 208 467.8890
208 209 1081.5393
...
Note that for id 208 the combination with id 207 has the lowest value, but it is not associated with id 208, because 208 is only picked up when it appears in the grouped-by column id_A.
I wrote a function that does this correctly, but since I have many entries it is way too slow. It is a loop that subsets the data to all entries containing a specific id, finds the minimum within that subset, and associates that value with that id.
Do you have an idea how to make this fast, e.g. using dplyr?
The issue boils down to needing a long (rather than wide) data format. First, here are some reproducible data (using the pipe from dplyr):
df <-
LETTERS[1:4] %>%
combn(2) %>%
t %>%
data.frame() %>%
mutate(val = 1:n()) %>%
setNames( c("id_A", "id_B", "dist") )
gives:
id_A id_B dist
1 A B 1
2 A C 2
3 A D 3
4 B C 4
5 B D 5
6 C D 6
What we want is a pair of columns matching each category with the distance from its row. For this, I am using gather from tidyr. It creates new columns telling us which column the data came from and what value it held. Here, we are telling it to pull from columns id_A and id_B to give us the category for each ID entry (it then duplicates the dist column as necessary):
df %>%
gather(whichID, Category, id_A, id_B)
Gives
dist whichID Category
1 1 id_A A
2 2 id_A A
3 3 id_A A
4 4 id_A B
5 5 id_A B
6 6 id_A C
7 1 id_B B
8 2 id_B C
9 3 id_B D
10 4 id_B C
11 5 id_B D
12 6 id_B D
We can then pass that data.frame to group_by and then use summarise to give us whatever information we wanted. I know that you didn't ask for the max, but I am including it just to show the general syntax you can use to get whatever type of result you want:
df %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, maxDist = max(dist))
Returns:
Category minDist maxDist
<chr> <int> <int>
1 A 1 3
2 B 1 5
3 C 2 6
4 D 3 6
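In current tidyr, gather is superseded by pivot_longer. A sketch of the same reshaping and summary with pivot_longer follows, assuming the small example df above (written out literally here so the block runs on its own); `res` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id_A = c("A", "A", "A", "B", "B", "C"),
  id_B = c("B", "C", "D", "C", "D", "D"),
  dist = 1:6,
  stringsAsFactors = FALSE
)

res <- df %>%
  # stack id_A and id_B into one Category column; dist is repeated as needed
  pivot_longer(c(id_A, id_B), names_to = "whichID", values_to = "Category") %>%
  group_by(Category) %>%
  summarise(minDist = min(dist), maxDist = max(dist))

res
```

The row order of the long table differs from what gather produces, but the grouped summary is unaffected by that.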
I just looked at the question and realized that you wanted to also display which comparison had the minimum value. Here is an approach that does that by tracking an index of the match (so that it is replicated when gathering) and then pulls the correct row from the original df and pastes together the two comparison values:
df %>%
mutate(whichComparison = 1:n()) %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, whichMin = whichComparison[which.min(dist)]
, maxDist = max(dist)
, whichMax = whichComparison[which.max(dist)]) %>%
mutate(
minComp = sapply(whichMin, function(x){
paste(df[x, "id_A"], df[x, "id_B"], sep = " vs " )})
, maxComp = sapply(whichMax, function(x){
paste(df[x, "id_A"], df[x, "id_B"], sep = " vs " )})
)
returns
Category minDist whichMin maxDist whichMax minComp maxComp
<chr> <int> <int> <int> <int> <chr> <chr>
1 A 1 1 3 3 A vs B A vs D
2 B 1 1 5 5 A vs B B vs D
3 C 2 2 6 6 A vs C C vs D
4 D 3 3 6 6 A vs D C vs D
If you really want a single column giving which comparison gave the min value (and the max, in my output), you can instead use the index to pull both the id_A and id_B from the original df, knock out the one that matches the Category of interest, then use use_first_valid_of from the package janitor to grab just the one you are interested in. Because this generated a large number of intermediate columns, I am using select to clean things back up:
df %>%
mutate(whichComparison = 1:n()) %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, maxDist = max(dist)
, whichMin = whichComparison[which.min(dist)]
, whichMax = whichComparison[which.max(dist)]) %>%
mutate(
minA = df$id_A[whichMin]
, minB = df$id_B[whichMin]
, maxA = df$id_A[whichMax]
, maxB = df$id_B[whichMax]
) %>%
mutate_each(funs(ifelse(. == Category, NA, as.character(.)) )
, minA:maxB) %>%
mutate(minComp = use_first_valid_of(minA, minB)
, maxComp = use_first_valid_of(maxA, maxB)) %>%
select(-(whichMin:maxB))
returns:
Category minDist maxDist minComp maxComp
<chr> <int> <int> <chr> <chr>
1 A 1 3 B D
2 B 1 5 A D
3 C 2 6 A D
4 D 3 6 A C
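As far as I know, mutate_each, funs and use_first_valid_of have since been deprecated in newer releases of dplyr and janitor, so the code above may not run on a current installation. A different route to the same minimum-partner column, sketched here under the assumption of dplyr 1.0.0 or later (for slice_min), is to stack the pairs in both directions and keep the closest row per category. The names `both`, `partner` and `res` are illustrative only.

```r
library(dplyr)

df <- data.frame(
  id_A = c("A", "A", "A", "B", "B", "C"),
  id_B = c("B", "C", "D", "C", "D", "D"),
  dist = 1:6,
  stringsAsFactors = FALSE
)

# Every pair appears twice: once from each member's point of view
both <- bind_rows(
  transmute(df, Category = id_A, partner = id_B, dist),
  transmute(df, Category = id_B, partner = id_A, dist)
)

res <- both %>%
  group_by(Category) %>%
  # closest partner per category; with_ties = FALSE keeps exactly one row
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()

res
```

Using slice_max in the same way would give the farthest partner instead.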
An alternative approach is to first convert the distance pairs to a matrix. Here, I first duplicate the comparisons in the reverse order to ensure that the matrix is complete (using tidyr to spread):
bind_rows(
df
, rename(df, id_A = id_B, id_B = id_A)
) %>%
spread(id_B, dist)
returns:
id_A A B C D
1 A NA 1 2 3
2 B 1 NA 4 5
3 C 2 4 NA 6
4 D 3 5 6 NA
Then, we just apply across rows much like we would if we were working from a distance matrix (which may be where your data actually started):
bind_rows(
df
, rename(df, id_A = id_B, id_B = id_A)
) %>%
spread(id_B, dist) %>%
mutate(
minDist = apply(as.matrix(.[, -1]), 1, min, na.rm = TRUE)
, minComp = names(.)[apply(as.matrix(.[, -1]), 1, which.min) + 1]
, maxDist = apply(as.matrix(.[, -1]), 1, max, na.rm = TRUE)
, maxComp = names(.)[apply(as.matrix(.[, -1]), 1, which.max) + 1]
) %>%
select(Category = `id_A`
, minDist:maxComp)
returns:
Category minDist minComp maxDist maxComp
1 A 1 B 3 D
2 B 1 A 5 D
3 C 2 A 6 D
4 D 3 A 6 C

Summarise multiple variables to strings in dplyr

I wish to summarize two variables as strings. Let's say this is my first dataset:
#visit
id source1 source2
1 a t
2 c l
3 c z
1 b x
second dataset:
#transaction
id transactions
1 1
3 2
1 2
I'd like to join these data together but convert them to string at the same time:
I can do this for one variable (let's say source1):
library(dplyr)
visit %>%
left_join(transaction, by = "id") %>%
group_by(id) %>%
summarise(Source = toString(unique(source1)), transactions = toString(unique(transactions)))
This gives me the following output:
id source transactions
1 a,b 1,2
2 c NA
3 c 2
But I wish to summarize two variables, so my desired output would be something like this:
id source transactions
1 a,t > b,x 1,2
2 c,l NA
3 c,z 2
You can paste the two variables together, using both sep and collapse to combine:
visit %>% left_join(transaction) %>%
group_by(id) %>%
summarise(source = paste(unique(source1), unique(source2), sep = ', ', collapse = ' > '),
transaction = na_if(toString(unique(na.omit(transactions))), ''))
## # A tibble: 3 × 3
## id source transaction
## <int> <chr> <chr>
## 1 1 a, t > b, x 1, 2
## 2 2 c, l <NA>
## 3 3 c, z 2
Beware, though; paste and toString stupidly coerce NAs to strings. You may want to wrap in na.omit or use na_if.
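A small base R illustration of that coercion and of the na.omit workaround. The variable names are arbitrary, and the final step mimics what na_if does in the answer's code without depending on dplyr.

```r
# NA is turned into the literal string "NA"
pasted   <- paste(NA, "x")       # "NA x"
as_text  <- toString(c(1, NA))   # "1, NA"

# Dropping the NAs first avoids that
cleaned  <- toString(na.omit(c(1, NA, 2)))   # "1, 2"

# If everything was NA, the result is an empty string ...
all_missing <- toString(na.omit(c(NA, NA)))  # ""

# ... which can be mapped back to a real missing value
final <- if (all_missing == "") NA_character_ else all_missing
```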
