Computing minimum distance between observations within groups - r

In the dataset below, how could I create a new column min.diff that reports, for a given observation x, the minimum distance between x and any other observation y within its group (identified by the group column)? I would like to measure the distance between x and y by abs(x-y).
set.seed(1)
df <- data.frame(
  group = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
  value = sample(1:10, 8, replace = T)
)
Expected output:
group value min.diff
1 A 9 2
2 A 4 3
3 A 7 2
4 B 1 1
5 B 2 1
6 C 7 4
7 C 2 1
8 C 3 1
I prefer a solution using dplyr.
The only way that I have in my mind is to extend the dataframe by adding more rows to get each possible pair within groups, calculating distances and then filtering out the smallest value in each group. Is there a more compact way?
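For reference, that pairwise-expansion idea could be sketched roughly as below (the self-join and the helper column row are illustrative, not part of the original post):
library(dplyr)
# Expand to every within-group pair via a self-join, drop the self-pairs,
# then keep the smallest absolute difference per original row.
# (recent dplyr versions may warn that this join is many-to-many)
df %>%
  mutate(row = row_number()) %>%
  inner_join(df %>% mutate(row = row_number()),
             by = "group", suffix = c("", ".other")) %>%
  filter(row != row.other) %>%
  group_by(group, row, value) %>%
  summarise(min.diff = min(abs(value - value.other)), .groups = "drop") %>%
  arrange(row) %>%
  select(group, value, min.diff)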

We can use map_dbl to subtract the current value from all the other values in its group and take the minimum of the absolute differences for each row.
library(dplyr)
library(purrr)
df %>%
  group_by(group) %>%
  mutate(min.diff = map_dbl(row_number(), ~ min(abs(value[-.x] - value[.x]))))
# group value min.diff
# <chr> <int> <dbl>
#1 A 9 2
#2 A 4 3
#3 A 7 2
#4 B 1 1
#5 B 2 1
#6 C 7 4
#7 C 2 1
#8 C 3 1

We can use combn to take the pairwise differences of 'value' within each group and get the minimum of the absolute values:
library(dplyr)
df1 <- df %>%
  group_by(group) %>%
  mutate(new = min(abs(combn(value, 2, FUN = function(x) x[1] - x[2]))))
If we want the minimum distance between a given element, i.e. the first, and the rest:
df1 <- df %>%
  group_by(group) %>%
  mutate(new = min(abs(value[-1] - first(value))))

If the order doesn't matter...
library(dplyr)
df %>%
  arrange(group, value) %>% #Order ascending by value, within each group
  group_by(group) %>%
  mutate(min.diff = case_when(
    #If the "group" of both the previous and next entry matches the current group, take the smaller of the two differences
    lag(group) == group & lead(group) == group ~ pmin(abs(value - lag(value)), abs(value - lead(value)), na.rm = TRUE),
    #Otherwise, if only the previous entry's group is the same as the current one, take the difference from the previous
    lag(group) == group ~ abs(value - lag(value)),
    #Otherwise, if only the next entry's group is the same as the current one, take the difference from the next
    lead(group) == group ~ abs(value - lead(value))
  )) %>%
  ungroup()
# group value min.diff
# <chr> <int> <int>
# 1 A 4 3
# 2 A 7 2
# 3 A 9 2
# 4 B 1 1
# 5 B 2 1
# 6 C 2 1
# 7 C 3 1
# 8 C 7 4
If the order is important, you could add in an index and rearrange it after, like so:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(index = row_number()) %>% #create the index
  arrange(group, value) %>%
  mutate(min.diff = case_when(
    lag(group) == group & lead(group) == group ~ pmin(abs(value - lag(value)), abs(value - lead(value)), na.rm = TRUE),
    lag(group) == group ~ abs(value - lag(value)),
    lead(group) == group ~ abs(value - lead(value))
  )) %>%
  ungroup() %>%
  arrange(group, index) %>% #rearrange by the index
  select(-index) #remove the index
# group value min.diff
# <chr> <int> <int>
# 1 A 9 2
# 2 A 4 3
# 3 A 7 2
# 4 B 1 1
# 5 B 2 1
# 6 C 7 4
# 7 C 2 1
# 8 C 3 1
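Another compact option (a sketch, not from the original answers) is to build the full within-group distance matrix with outer() and take the row minima, ignoring the zero self-distances:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(min.diff = {
    d <- abs(outer(value, value, "-")) # all pairwise absolute distances in the group
    diag(d) <- NA                      # drop the distance of each value to itself
    apply(d, 1, min, na.rm = TRUE)     # row-wise minimum = nearest other observation
  }) %>%
  ungroup()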

Related

R: Slicing a grouped data frame conditional on a column

I have a data frame with a group, a condition that differs by group, and an index within each group:
df <- data.frame(group = rep(c("A", "B", "C"), each = 3),
                 condition = rep(c(0, 1, 1), each = 3),
                 index = c(1:3, 1:3, 2:4))
> df
group condition index
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 B 1 2
6 B 1 3
7 C 1 2
8 C 1 3
9 C 1 4
I would like to slice the data within each group, filtering out all but the row with the lowest index. However, this filter should only be applied when the condition applies, i.e., condition == 1. My solution was to compute a ranking on the index within each group and filter on the combination of condition and rank:
df %>%
  group_by(group) %>%
  mutate(rank = order(index)) %>%
  filter(case_when(condition == 0 ~ TRUE,
                   condition == 1 & rank == 1 ~ TRUE))
# A tibble: 5 x 4
# Groups: group [3]
group condition index rank
<chr> <dbl> <int> <int>
1 A 0 1 1
2 A 0 2 2
3 A 0 3 3
4 B 1 1 1
5 C 1 2 1
This left me wondering whether there is a faster solution that does not require a separate ranking variable, and potentially uses slice_min() instead.
You can use filter() to keep all cases where the condition is zero or the index equals the minimum index.
library(dplyr)
df %>%
  group_by(group) %>%
  filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups: group [3]
group condition index
<chr> <dbl> <int>
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 C 1 2
An option with slice
library(dplyr)
df %>%
  group_by(group) %>%
  slice(unique(c(which(condition == 0), which.min(index))))
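Since the question also asks about slice_min(), here is one way to lean on it (a sketch assuming dplyr >= 1.0.0 and that condition is constant within each group): when condition == 1 we order by index so only the minimum survives, and when condition == 0 every row ties for the minimum and is therefore kept.
library(dplyr)
df %>%
  group_by(group) %>%
  # condition == 1: order by index, keep only the smallest
  # condition == 0: all rows share the same order value, so all are kept as ties
  slice_min(if_else(condition == 1, index, min(index)), with_ties = TRUE) %>%
  ungroup()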

Cumulative sum for each row of data for the same ID

I have this data frame:
df = data.frame(id = c(1, 1, 2, 2, 2, 5, NA),
                var = c("a", "a", "b", "b", "b", "e", "f"),
                value = c(1, 1, 0, 1, 0, 0, 1),
                cs = c(2, 2, 3, 3, 3, 3, NA))
I want to calculate the sum of value for each group (id, var) and then the cumulative sum, but I would like the cumulative sum to be displayed on each row of the data, i.e., I don't want a summarized view of the data. I have included what my output should look like (the cs column above). This is what I have tried so far:
df %>% arrange(id, var) %>% group_by(id, var) %>% mutate(cs = cumsum(value))
Any suggestions?
Here is an approach that I think meets your expectations.
First, group by id and calculate the sum of value for each id via summarise.
You can then add your cumulative sum column with mutate. Based on your comments, I included an ifelse so that if id is NA, it does not receive a cumulative sum but is instead given NA.
Finally, to combine the cumulative sum data with your original dataset, join the two tables.
library(tidyverse)
df %>%
  arrange(id) %>%
  group_by(id) %>%
  summarise(sum = sum(value)) %>%
  mutate(cs = ifelse(is.na(id), NA, cumsum(sum))) %>%
  left_join(df)
Output
# A tibble: 7 x 5
id sum cs var value
<dbl> <dbl> <dbl> <fct> <dbl>
1 1 2 2 a 1
2 1 2 2 a 1
3 2 1 3 b 0
4 2 1 3 b 1
5 2 1 3 b 0
6 5 0 3 e 0
7 NA 1 NA f 1
Calculate the cumulative sum over all values, even if id is NA, then set the final cs to NA where id is NA:
df %>%
  arrange(id, var) %>%
  mutate(cs = cumsum(value)) %>%
  group_by(id, var) %>%
  mutate(cs = max(ifelse(!is.na(id), cs, NA))) %>%
  ungroup()
Or, exclude rows where id is NA when calculating the cumulative sum:
df %>%
  arrange(id, var) %>%
  mutate(cs = cumsum(ifelse(!is.na(id), value, 0))) %>%
  group_by(id, var) %>%
  mutate(cs = max(ifelse(!is.na(id), cs, NA))) %>%
  ungroup()
For your data, both return the same result:
# A tibble: 7 x 4
# id var value cs
# <dbl> <fct> <dbl> <dbl>
# 1 1 a 1 2
# 2 1 a 1 2
# 3 2 b 0 3
# 4 2 b 1 3
# 5 2 b 0 3
# 6 5 e 0 3
# 7 NA f 1 NA

Getting observations until and including first different value (groups with "no switch" are allowed)

I have a slightly convoluted way to slice a data frame by group from the first row (it always starts with the same value) till (and including) the first different value.
I thought about using slice(1:min(which(value == new.value))), but there are groups where this switch does not happen, and this is what causes me headaches. I could split the data into groups with and without a switch and do the calculation on only those with a switch, but I would love to know if there are more elegant options out there. I am open to any package.
library(dplyr)
mydf <- data.frame(group = rep(letters[1:3], each = 4),
                   value = c(1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2))
The following does not work, because there are groups without "switch"
mydf %>% group_by(group) %>% slice(1: min(which(value == 2)))
#> Warning in min(which(value == 2)): no non-missing arguments to min; returning
#> Inf
#> Error in 1:min(which(value == 2)): result would be too long a vector
Doing the slice operation on only the groups with a switch and binding with the "no-switchers" works:
mydf_grouped <- mydf %>% group_by(group)
mydf_grouped %>%
  filter(any(value == 2)) %>%
  slice(1:min(which(value == 2))) %>%
  bind_rows(filter(mydf_grouped, !any(value == 2)))
#> # A tibble: 9 x 2
#> # Groups: group [3]
#> group value
#> <fct> <dbl>
#> 1 a 1
#> 2 a 2
#> 3 c 1
#> 4 c 1
#> 5 c 2
#> 6 b 1
#> 7 b 1
#> 8 b 1
#> 9 b 1
Created on 2019-12-22 by the reprex package (v0.3.0)
Here, one option is to pass the if/else condition
library(dplyr)
mydf %>%
  group_by(group) %>%
  slice(if (!2 %in% value) row_number() else seq_len(match(2, value)))
Or more compactly, change the nomatch in match to n()
mydf %>%
  group_by(group) %>%
  slice(seq_len(match(2, value, nomatch = n())))
# A tibble: 9 x 2
# Groups: group [3]
# group value
# <fct> <dbl>
#1 a 1
#2 a 2
#3 b 1
#4 b 1
#5 b 1
#6 b 1
#7 c 1
#8 c 1
#9 c 2
We want all rows having a value of 1 as well as the row with the first 2 in each group:
mydf %>%
  group_by(group) %>%
  filter(value == 1 | cumsum(value == 2) == 1) %>%
  ungroup
We can use rleid to create an index of change in value, shift it by one position, and select all the rows up to and including the first change.
library(data.table)
setDT(mydf)
mydf[, .SD[shift(rleid(value), fill = 1) == 1], group]
# group value
#1: a 1
#2: a 2
#3: b 1
#4: b 1
#5: b 1
#6: b 1
#7: c 1
#8: c 1
#9: c 2
The same logic in dplyr can be implemented by
library(dplyr)
mydf %>%
  group_by(group) %>%
  filter(lag(cumsum(value != lag(value, default = 1)), default = 0) == 0)
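Another way to express "keep rows up to and including the first 2" (a sketch using dplyr's cumany(), not from the original answers): a row is kept as long as no 2 has appeared on any earlier row of its group.
library(dplyr)
mydf %>%
  group_by(group) %>%
  filter(!lag(cumany(value == 2), default = FALSE)) %>%
  ungroup()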

Count distinct levels of a data frame for groups based on a condition

I have the following DF
x = data.frame(grp = c(1, 1, 1, 2, 2, 2), a = c(1, 2, 1, 1, 2, 1),
               b = c(6, 5, 6, 6, 2, 6), c = c(0.1, 0.2, 0.4, -1, 0.9, 0.7))
grp a b c
1 1 1 6 0.1
2 1 2 5 0.2
3 1 1 6 0.4
4 2 1 6 -1.0
5 2 2 2 0.9
6 2 1 6 0.7
I want to count distinct levels of (a,b) for each group where c >= 0.1
I have tried using dplyr for this with group_by and summarise, but I am not getting the desired result:
x %>% group_by(grp) %>% summarise(count = n_distinct(c(a, b)[c >= 0.1]))
For the above case I would expect the following result
grp count
<dbl> <int>
1 1 2
2 2 2
However using the above query I am getting the following result
grp count
<dbl> <int>
1 1 4
2 2 3
Logically, the above output seems to be counting all the unique values of the concatenated vector c(a, b), which is not what I require.
Any pointers? I really appreciate any help.
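To see what the attempt above actually counts: c(a, b) builds one concatenated vector per group, and the length-3 logical index c >= 0.1 is recycled over its 6 elements, so n_distinct() counts distinct elements rather than distinct (a, b) pairs. For grp == 1, for example (an illustration, not from the original post):
with(subset(x, grp == 1), c(a, b)[c >= 0.1])
#> [1] 1 2 1 6 5 6
# n_distinct() of this vector is 4, which is the unwanted count reported above.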
Here's another way using dplyr. It sounds like you want to filter based on c, so we do that. Instead of using c(a, b) in n_distinct, we can write it as n_distinct(a, b).
x %>%
  filter(c >= 0.1) %>%
  group_by(grp) %>%
  summarise(cnt_d = n_distinct(a, b))
# grp cnt_d
# <dbl> <int>
# 1 1 2
# 2 2 2
We can paste a and b columns and count distinct values in each group.
library(dplyr)
x %>%
  mutate(col = paste(a, b, sep = "_")) %>%
  group_by(grp) %>%
  summarise(count = n_distinct(col[c >= 0.1]))
# grp count
# <dbl> <int>
#1 1 2
#2 2 2
An option using data.table
library(data.table)
setDT(x)[c >= 0.1, .(cnt_d = uniqueN(paste(a, b))), .(grp)]
# grp cnt_d
#1: 1 2
#2: 2 2

Operations between groups with dplyr

I have a data frame as follows, where I would like to group the data by grp and index and use group a as a reference to perform some simple calculations. I would like to subtract the values of group a from the value variable of the other groups.
df <- data.frame(grp = rep(letters[1:3], each = 2),
                 index = rep(1:2, times = 3),
                 value = seq(10, 60, length.out = 6))
df
## grp index value
## 1 a 1 10
## 2 a 2 20
## 3 b 1 30
## 4 b 2 40
## 5 c 1 50
## 6 c 2 60
The desired output would be like:
## grp index value
## 1 b 1 20
## 2 b 2 20
## 3 c 1 40
## 4 c 2 40
My guess is it will be something close to:
group_by(df, grp, index) %>%
  mutate(diff = value - value[grp == "a"])
Ideally I would like to do it using dplyr.
Regards, Philippe
We can filter for 'grp' that are not 'a' and then do the difference within mutate.
df %>%
  filter(grp != "a") %>%
  mutate(value = value - df$value[df$grp == "a"])
Or another option would be a join:
df %>%
  filter(grp != "a") %>%
  left_join(., subset(df, grp == "a", select = -1), by = "index") %>%
  mutate(value = value.x - value.y) %>%
  select(1, 2, 5)
# grp index value
#1 b 1 20
#2 b 2 20
#3 c 1 40
#4 c 2 40
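As a side note, the OP's own guess is close: grouping by index alone (instead of grp and index) keeps the group-a reference row inside each group, so the subtraction works row by row (a sketch assuming exactly one 'a' row per index, not from the original answers):
library(dplyr)
df %>%
  group_by(index) %>%
  mutate(value = value - value[grp == "a"]) %>% # subtract the group-a value for this index
  ungroup() %>%
  filter(grp != "a")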
