R - (Tidyverse) Compress multiple observations into one

I have a dataset with multiple variables, two of which are dates (start date, end date). Sometimes a date interval has been split into a sequence of shorter intervals, so for example you would have:
Start: 1990-12-12, Stop: 1990-12-13
Start: 1990-12-13, Stop: 1990-12-14
Rather than
Start: 1990-12-12, Stop: 1990-12-14
What I want to do is isolate these chains and collapse each one into a single observation, such that the values from the last row of the chain are kept and the earlier rows are discarded, except for the start date, which should come from the first row. Below is a basic example:
library(tidyverse)
library(lubridate)
tib_ex <- tibble(
id = rep(1,5),
date1 = ymd(c('1990-11-05', '1990-12-01',
'1990-12-05', '1990-12-08',
'1990-12-15')),
date2 = ymd(c('1990-11-28', '1990-12-05',
'1990-12-08', '1990-12-12',
'1990-12-31')),
var1 = 2:6,
var2 = 7:11,
var3 = 12:16,
var4 = c(0, 1, 0 ,0, 1)
)
This yields the following tibble:
# A tibble: 5 x 7
id date1 date2 var1 var2 var3 var4
<dbl> <date> <date> <int> <int> <int> <dbl>
1 1 1990-11-05 1990-11-28 2 7 12 0
2 1 1990-12-01 1990-12-05 3 8 13 1
3 1 1990-12-05 1990-12-08 4 9 14 0
4 1 1990-12-08 1990-12-12 5 10 15 0
5 1 1990-12-15 1990-12-31 6 11 16 1
Which I want to transform into the following tibble:
# A tibble: 3 x 7
id date1 date2 var1 var2 var3 var4
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1990-11-05 1990-11-28 2 7 12 0
2 1 1990-12-01 1990-12-12 5 10 15 0
3 1 1990-12-15 1990-12-31 6 11 16 1
I thought about nesting by id, date1 and date2, which packs the rest of the variables into a tibble per row and makes them easy to overwrite. I just don't know how to efficiently collapse the dates from row 2 to row 4.
I've tried creating a binary variable which tracks if the end date of one observation matches the start date of the following observation but I'm running into difficulties there as well.

Find the rows with start and end dates by comparing with the next / previous row and combine the result in a suitable way:
date_info <-
tib_ex %>%
## find indices of start and end dates by comparing with date in next / previous row
mutate(is_startdate = date1 != lag(date2),
is_enddate = date2 != lead(date1)) %>%
## NAs appear in the first row (is_startdate) and the last row (is_enddate) and should thus be interpreted as TRUE
replace_na(list(is_startdate = TRUE, is_enddate = TRUE))
## combine the start- and end-dates
date_info %>%
filter(is_enddate) %>%
mutate(date1 = date_info$date1[date_info$is_startdate]) %>%
select(-starts_with("is_"))
-------
# A tibble: 3 x 7
id date1 date2 var1 var2 var3 var4
<dbl> <date> <date> <int> <int> <int> <dbl>
1 1.00 1990-11-05 1990-11-28 2 7 12 0
2 1.00 1990-12-01 1990-12-12 5 10 15 0
3 1.00 1990-12-15 1990-12-31 6 11 16 1.00

Here is a different approach which will also work if the dataset contains more than one id. In line with the OP's expected result, the additional variables var1 to var4 are summarized by picking the value at the end of each collapsed period.
The approach below
uses cumsum() and lag() to identify rows which belong to one period,
uses summarize() to collapse the start and end dates,
and joins with the original dataset to pick the values at the end of each collapsed period.
The last step avoids having to list all the additional variables in the call to summarize().
tib_ex %>%
arrange(id, date1, date2) %>% # this is important!
group_by(id) %>%
mutate(period = cumsum(lag(date2, default = date1[1]) < date1)) %>%
right_join(
(.) %>% group_by(id, period) %>%
summarize(date1 = first(date1), date2 = last(date2)),
by = c("id", "period", "date2"), suffix = c("", ".y")) %>%
mutate(date1 = date1.y) %>% # use the start date of the collapsed period, not of its last row
select(-period, -date1.y)
# A tibble: 3 x 7
# Groups: id [1]
id date1 date2 var1 var2 var3 var4
<dbl> <date> <date> <int> <int> <int> <dbl>
1 1 1990-11-05 1990-11-28 2 7 12 0
2 1 1990-12-01 1990-12-12 5 10 15 0
3 1 1990-12-15 1990-12-31 6 11 16 1
Here is a test that the approach works for multiple ids:
tib_ex %>%
bind_rows(
(.) %>% mutate(id = 2))
This duplicates the OP's dataset for id = 2:
# A tibble: 10 x 7
id date1 date2 var1 var2 var3 var4
<dbl> <date> <date> <int> <int> <int> <dbl>
1 1 1990-11-05 1990-11-28 2 7 12 0
2 1 1990-12-01 1990-12-05 3 8 13 1
3 1 1990-12-05 1990-12-08 4 9 14 0
4 1 1990-12-08 1990-12-12 5 10 15 0
5 1 1990-12-15 1990-12-31 6 11 16 1
6 2 1990-11-05 1990-11-28 2 7 12 0
7 2 1990-12-01 1990-12-05 3 8 13 1
8 2 1990-12-05 1990-12-08 4 9 14 0
9 2 1990-12-08 1990-12-12 5 10 15 0
10 2 1990-12-15 1990-12-31 6 11 16 1
tib_ex %>%
bind_rows(
(.) %>% mutate(id = 2)) %>%
arrange(id, date1, date2) %>% # this is important!
group_by(id) %>%
mutate(period = cumsum(lag(date2, default = date1[1]) < date1)) %>%
right_join(
(.) %>% group_by(id, period) %>%
summarize(date1 = first(date1), date2 = last(date2)),
by = c("id", "period", "date2"), suffix = c("", ".y")) %>%
mutate(date1 = date1.y) %>% # use the start date of the collapsed period, not of its last row
select(-period, -date1.y)
# A tibble: 6 x 7
# Groups: id [2]
id date1 date2 var1 var2 var3 var4
<dbl> <date> <date> <int> <int> <int> <dbl>
1 1 1990-11-05 1990-11-28 2 7 12 0
2 1 1990-12-01 1990-12-12 5 10 15 0
3 1 1990-12-15 1990-12-31 6 11 16 1
4 2 1990-11-05 1990-11-28 2 7 12 0
5 2 1990-12-01 1990-12-12 5 10 15 0
6 2 1990-12-15 1990-12-31 6 11 16 1
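If the extra variables share a naming pattern, the join can be avoided altogether. The following is only a minimal sketch, not part of the answer above: it assumes dplyr 1.0 or later (for across() and the .groups argument) and that every column to carry over starts with "var". The name collapsed is illustrative.

```r
library(dplyr)

# Same data as in the question, built with base as.Date() instead of lubridate
tib_ex <- tibble(
  id = rep(1, 5),
  date1 = as.Date(c("1990-11-05", "1990-12-01", "1990-12-05",
                    "1990-12-08", "1990-12-15")),
  date2 = as.Date(c("1990-11-28", "1990-12-05", "1990-12-08",
                    "1990-12-12", "1990-12-31")),
  var1 = 2:6,
  var2 = 7:11,
  var3 = 12:16,
  var4 = c(0, 1, 0, 0, 1)
)

collapsed <- tib_ex %>%
  arrange(id, date1, date2) %>%
  group_by(id) %>%
  # a new period starts whenever the previous end date is before this start date
  mutate(period = cumsum(lag(date2, default = first(date1)) < date1)) %>%
  group_by(id, period) %>%
  summarise(
    date1 = first(date1),             # start of the chain
    date2 = last(date2),              # end of the chain
    across(starts_with("var"), last), # values from the last row of the chain
    .groups = "drop"
  ) %>%
  select(-period)

collapsed
```

Because the period counter is computed within each id, the same pipeline should also handle several ids, under the same sorting assumption as the answer above.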

Related

grouping to aggregate values, but tripping up on NA's

I have long data, and I am trying to make a new variable (consistent) that holds, for each person (ID), the value of a given column (VALUE) at TIME = 2. I used the code below to do this, but I am getting tripped up on NAs. If the VALUE for TIME = 2 is NA, then I want it to grab the VALUE at TIME = 1 instead. That part I'm not sure how to do. So, in the example below, I want the new variable (consistent) to be 10 for ID B instead of NA.
ID = c("A", "A", "B", "B", "C", "C", "D", "D")
TIME = c(1, 2, 1, 2, 1, 2, 1, 2)
VALUE = c(8, 9, 10, NA, 12, 13, 14, 9)
df = data.frame(ID, TIME, VALUE)
df <- df %>%
group_by(ID) %>%
mutate(consistent = VALUE[TIME == 2]) %>% ungroup
df
If we want to use the same code, then coalesce with the 'VALUE' where 'TIME' is 1 (assuming there is a single observation of each 'TIME' for each 'ID'):
library(dplyr)
df %>%
group_by(ID) %>%
mutate(consistent = coalesce(VALUE[TIME == 2], VALUE[TIME == 1])) %>%
ungroup
-output
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 9
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 13
6 C 2 13 13
7 D 1 14 9
8 D 2 9 9
Or another option is to arrange before doing the group_by and get the first element of 'VALUE' (assuming no replicating for 'TIME')
df %>%
arrange(ID, is.na(VALUE), desc(TIME)) %>%
group_by(ID) %>%
mutate(consistent = first(VALUE)) %>%
ungroup
-output
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 2 9 9
2 A 1 8 9
3 B 1 10 10
4 B 2 NA 10
5 C 2 13 13
6 C 1 12 13
7 D 2 9 9
8 D 1 14 9
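Another possible solution, using tidyr::fill. Note that this only fills in the NA, so rows at TIME = 1 keep their own VALUE rather than the TIME = 2 value: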
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(consistent = VALUE) %>% fill(consistent) %>% ungroup
#> # A tibble: 8 × 4
#> ID TIME VALUE consistent
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 8 8
#> 2 A 2 9 9
#> 3 B 1 10 10
#> 4 B 2 NA 10
#> 5 C 1 12 12
#> 6 C 2 13 13
#> 7 D 1 14 14
#> 8 D 2 9 9
You can also use ifelse with your condition. After sorting by TIME within each group, lag(VALUE) is guaranteed to be the TIME = 1 value, provided each group has exactly two rows, one with TIME 1 and one with TIME 2.
df %>%
group_by(ID) %>%
arrange(TIME, .by_group=T) %>%
mutate(consistent=ifelse(is.na(VALUE)&TIME==2, lag(VALUE), VALUE)) %>%
ungroup()
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 8
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 12
6 C 2 13 13
7 D 1 14 14
8 D 2 9 9
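The answers above rely on each ID having exactly one row per TIME value. As a hedged generalisation, assuming the goal is "the latest non-missing VALUE per ID" and that there may be more than two time points, a minimal sketch could look like this (the name out is illustrative):

```r
library(dplyr)

df <- data.frame(
  ID = c("A", "A", "B", "B", "C", "C", "D", "D"),
  TIME = c(1, 2, 1, 2, 1, 2, 1, 2),
  VALUE = c(8, 9, 10, NA, 12, 13, 14, 9)
)

out <- df %>%
  group_by(ID) %>%
  arrange(TIME, .by_group = TRUE) %>%
  # drop the NAs, then take the value with the highest TIME that remains;
  # a group with only NAs gets NA
  mutate(consistent = last(VALUE[!is.na(VALUE)])) %>%
  ungroup()

out
```

With the example data this gives 9, 10, 13 and 9 for IDs A to D, the same as the coalesce approach.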

How to flag the last row of a data frame group?

Suppose we start with the below dataframe df:
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
ID Period Value
1 1 1 10
2 1 2 12
3 1 3 11
4 5 1 4
5 5 2 6
Now using dplyr I add a "Calculate" column that multiplies Period and Value of each row, giving me the following:
> df %>% mutate(Calculate = Period * Value)
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 33
4 5 1 4 4
5 5 2 6 12
I'd like to modify the above "Calculate" to give me a value of 0, when reaching the last row for a given ID, so that the data frame output looks like:
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 0
4 5 1 4 4
5 5 2 6 0
I was going to use the lead() function to peek at the next row to see if the ID changes, but wasn't sure what happens when reaching the end of the data frame.
How could this be accomplished using dplyr?
You can group_by ID and replace the last row for each ID with 0.
library(dplyr)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = replace(Calculate, n(), 0)) %>%
ungroup
# ID Period Value Calculate
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 10
#2 1 2 12 24
#3 1 3 11 0
#4 5 1 4 4
#5 5 2 6 0
Yet another possibility:
library(tidyverse)
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = if_else(row_number() == n(), 0, Calculate)) %>%
ungroup
#> # A tibble: 5 × 4
#> ID Period Value Calculate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
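Another option multiplies by duplicated(ID, fromLast = TRUE), which is FALSE (i.e. 0) only for the last row of each ID:
ID <- c(1, 1, 1, 5, 5)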
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
library(tidyverse)
df %>%
mutate(Calculate = Period * Value * duplicated(ID, fromLast = TRUE))
#> ID Period Value Calculate
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
Created on 2022-01-09 by the reprex package (v2.0.1)
This should work. You can also replace rownum with Period (most likely)
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df = df %>% mutate(Calculate = Period * Value)
df$rownum = seq_len(nrow(df)) # numeric row index; rownames() are character and would compare as strings
df = df %>%
group_by(ID) %>%
mutate(Calculate = ifelse(rownum == max(rownum), 0, Calculate)) %>%
ungroup()
# A tibble: 5 × 5
ID Period Value Calculate rownum
<dbl> <dbl> <dbl> <dbl> <int>
1 1 1 10 10 1
2 1 2 12 24 2
3 1 3 11 0 3
4 5 1 4 4 4
5 5 2 6 0 5
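On the lead() idea from the question: dplyr's lead() returns NA for the last row by default, so the end of the data frame can be treated like a change of ID. A minimal sketch without grouping, assuming the rows of each ID are contiguous (next_id and out are illustrative names):

```r
library(dplyr)

df <- data.frame(
  ID = c(1, 1, 1, 5, 5),
  Period = c(1, 2, 3, 1, 2),
  Value = c(10, 12, 11, 4, 6)
)

out <- df %>%
  mutate(
    next_id = lead(ID), # NA on the very last row
    # a row is the last of its ID if there is no next row or the next ID differs
    Calculate = if_else(is.na(next_id) | next_id != ID, 0, Period * Value)
  ) %>%
  select(-next_id)

out
```

The grouped versions above are more robust, because they do not depend on the rows being sorted by ID.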

Finding the differences of paired-columns using dplyr

set.seed(3)
library(dplyr)
dat <- tibble(Measure = c("Height","Weight","Width","Length"),
AD1_1= rpois(4,10),
AD1_2= rpois(4,9),
AD2_1= rpois(4,10),
AD2_2= rpois(4,9),
AD3_1= rpois(4,10),
AD3_2= rpois(4,9),
AD4_1= rpois(4,10),
AD4_2= rpois(4,9),
AD5_1= rpois(4,10),
AD5_2= rpois(4,9),
AD6_1= rpois(4,10),
AD6_2= rpois(4,9))
Suppose I have data that looks like this. I wish to calculate the difference within each AD pair (the columns ending in _1 and _2), i.e., AD1diff, AD2diff, AD3diff, and so on.
Instead of writing
dat %>%
mutate(AD1diff = AD1_1 - AD1_2,
AD2diff = AD2_1 - AD2_2,
...)
what would be an efficient way to write this?
One dplyr option could be:
dat %>%
mutate(across(ends_with("_1"), .names = "{col}_diff") - across(ends_with("_2"))) %>%
rename_with(~ sub("_\\d+", "", .), ends_with("_diff"))
Measure AD1_1 AD1_2 AD2_1 AD2_2 AD3_1 AD3_2 AD4_1 AD4_2 AD5_1 AD5_2 AD6_1 AD6_2 AD1_diff AD2_diff AD3_diff AD4_diff AD5_diff AD6_diff
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 Height 6 10 10 3 12 8 7 5 7 5 8 9 -4 7 4 2 2 -1
2 Weight 8 9 13 6 14 7 8 7 13 11 10 9 -1 7 7 1 2 1
3 Width 10 9 11 5 12 8 7 11 9 5 5 6 1 6 4 -4 4 -1
4 Length 8 9 8 7 8 13 8 7 6 11 14 6 -1 1 -5 1 -5 8
The "tidy" way to do this would be to convert your data from wide to long, do a grouped subtraction, and then go back to wide format:
library(tidyr)
dat_long = dat %>% pivot_longer(
cols = starts_with("AD"),
names_sep = "_",
names_to = c("group", "obs")
)
dat_long %>% head
# # A tibble: 48 x 4
# Measure group obs value
# <chr> <chr> <chr> <int>
# 1 Height AD1 1 6
# 2 Height AD1 2 10
# 3 Height AD2 1 10
# 4 Height AD2 2 3
# 5 Height AD3 1 12
# 6 Height AD3 2 8
dat_long %>%
group_by(Measure, group) %>%
summarize(diff = value[obs == 1] - value[obs == 2]) %>%
pivot_wider(names_from = "group", values_from = "diff") %>%
rename_with(.fn = ~ paste0(., "diff"), .cols = starts_with("AD"))
# # A tibble: 4 x 7
# # Groups: Measure [4]
# Measure AD1diff AD2diff AD3diff AD4diff AD5diff AD6diff
# <chr> <int> <int> <int> <int> <int> <int>
# 1 Height -4 7 4 2 2 -1
# 2 Length -1 1 -5 1 -5 8
# 3 Weight -1 7 7 1 2 1
# 4 Width 1 6 4 -4 4 -1
Here is a data.table option
library(data.table)
setDT(dat)[
,
paste0(
unique(gsub("_\\d+", "", names(dat)[-1])),
"diff"
) := lapply(
split.default(.SD[, -1], gsub("_\\d+", "", names(dat)[-1])),
function(x) do.call("-", x)
)
]
which gives
> dat
Measure AD1_1 AD1_2 AD2_1 AD2_2 AD3_1 AD3_2 AD4_1 AD4_2 AD5_1 AD5_2 AD6_1
1: Height 6 10 10 3 12 8 7 5 7 5 8
2: Weight 8 9 13 6 14 7 8 7 13 11 10
3: Width 10 9 11 5 12 8 7 11 9 5 5
4: Length 8 9 8 7 8 13 8 7 6 11 14
AD6_2 AD1diff AD2diff AD3diff AD4diff AD5diff AD6diff
1: 9 -4 7 4 2 2 -1
2: 9 -1 7 7 1 2 1
3: 6 1 6 4 -4 4 -1
4: 6 -1 1 -5 1 -5 8
or
setDT(dat)[
,
c(.(Measure = Measure), setNames(lapply(
split.default(.SD[, -1], gsub("_\\d+", "", names(dat)[-1])),
function(x) do.call("-", x)
), paste0(
unique(gsub("_\\d+", "", names(dat)[-1])),
"diff"
)))
]
gives
Measure AD1diff AD2diff AD3diff AD4diff AD5diff AD6diff
1: Height -4 7 4 2 2 -1
2: Weight -1 7 7 1 2 1
3: Width 1 6 4 -4 4 -1
4: Length -1 1 -5 1 -5 8
Use the tidyverse package tidyr to rearrange your data before mutating:
require(dplyr)
require(tidyr)
#> Loading required package: tidyr
First, tidyr::pivot_longer the data frame so that there's a separate row for every column:
new_dat <-
pivot_longer(dat, cols = starts_with("AD"), # For columns whose names start with 'AD'...
names_sep = "_", # separate columns using '_' in colname
names_to = c("AD_number", "observation")) %>%
arrange(AD_number, Measure, observation)
head(new_dat, 9)
#> # A tibble: 9 x 4
#> Measure AD_number observation value
#> <chr> <chr> <chr> <int>
#> 1 Height AD1 1 6
#> 2 Height AD1 2 10
#> 3 Length AD1 1 8
#> 4 Length AD1 2 9
#> 5 Weight AD1 1 8
#> 6 Weight AD1 2 9
#> 7 Width AD1 1 10
#> 8 Width AD1 2 9
#> 9 Height AD2 1 10
Then, use tidyr::pivot_wider (the functional opposite of pivot_longer) to make a separate column for each value in observation. This will be very compatible with the upcoming mutate operation.
new_dat <-
pivot_wider(new_dat,
names_from = observation,
values_from = value,
names_prefix = "value_")
head(new_dat, 5)
#> # A tibble: 5 x 4
#> Measure AD_number value_1 value_2
#> <chr> <chr> <int> <int>
#> 1 Height AD1 6 10
#> 2 Length AD1 8 9
#> 3 Weight AD1 8 9
#> 4 Width AD1 10 9
#> 5 Height AD2 10 3
Finally, mutate the data:
new_dat <-
mutate(new_dat, diff = value_1 - value_2)
head(new_dat, 4)
#> # A tibble: 4 x 5
#> Measure AD_number value_1 value_2 diff
#> <chr> <chr> <int> <int> <int>
#> 1 Height AD1 6 10 -4
#> 2 Length AD1 8 9 -1
#> 3 Weight AD1 8 9 -1
#> 4 Width AD1 10 9 1
Created on 2021-01-22 by the reprex package (v0.3.0)
Getting back to your original data format is possible, but it might not make the data any easier to work with:
rename(new_dat,
c(`1` = "value_1", `2` = "value_2")) %>%
pivot_wider(names_from = AD_number,
values_from = c(`1`, `2`, diff),
names_glue = "{AD_number}_{.value}") %>%
{.[,order(names(.))]} %>%
relocate(Measure)
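For comparison, the pairing can also be made explicit by column name, without reshaping. This is only a sketch under the assumption that every `_1` column has a matching `_2` column; it uses a small hand-written subset of the data (first two rows and first two AD pairs of the output above) so that it does not depend on the random seed. The names ones, twos, diffs and out are illustrative.

```r
library(dplyr)

dat <- tibble(
  Measure = c("Height", "Weight"),
  AD1_1 = c(6, 8),
  AD1_2 = c(10, 9),
  AD2_1 = c(10, 13),
  AD2_2 = c(3, 6)
)

ones <- select(dat, ends_with("_1"))
# look up each partner column by name, so the pairing does not rely on column order
twos <- dat[sub("_1$", "_2", names(ones))]

# subtract column by column; the result keeps the names of `ones`
diffs <- Map("-", ones, twos)
names(diffs) <- sub("_1$", "diff", names(diffs))

out <- bind_cols(dat, as_tibble(diffs))
out
```

Looking the `_2` columns up by name means a missing partner raises an error instead of silently pairing the wrong columns.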

How to call columns implicitly in certain R functions

There are some functions in R where you have to call columns explicitly by name, such as pmin. My question is how to get around this, preferably using tidyverse.
Here's some sample data.
library(tidyverse)
df <- tibble(a = c(1:5),
b = c(6:10),
d = c(11:15),
e = c(16:20))
# A tibble: 5 x 4
a b d e
<int> <int> <int> <int>
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
Now I'd like to find the minimum of all the columns except for "e". I can do this:
df %>%
mutate(min = pmin(a, b, d))
# A tibble: 5 x 5
a b d e min
<int> <int> <int> <int> <int>
1 1 6 11 16 1
2 2 7 12 17 2
3 3 8 13 18 3
4 4 9 14 19 4
5 5 10 15 20 5
But what if I have many columns and would like to call every column except "e" without having to type out each column's name? I've made several attempts but none successful. I used the column index in my examples but I'd prefer excluding "e" by name. See below.
df %>%
mutate(min = pmin(-e))
df %>%
mutate(min = pmin(names(. %>% select(.))[-4]))
df %>%
mutate(min = pmin(names(.)[-4]))
df %>%
mutate(min = pmin(noquote(paste0(names(.)[-4], collapse = ","))))
df %>%
mutate(min = pmin(!!ensyms(names(.)[-4])))
None of these worked and I'm a bit at a loss.
One option using dplyr and purrr could be:
df %>%
mutate(min = exec(pmin, !!!select(., -e)))
a b d e min
<int> <int> <int> <int> <int>
1 1 6 11 16 1
2 2 7 12 17 2
3 3 8 13 18 3
4 4 9 14 19 4
5 5 10 15 20 5
For those not reading the comments, a nice option proposed by @IceCreamToucan and involving only dplyr could be:
df %>%
mutate(min = do.call(pmin, select(., -e)))
We can also use reduce with pmin
library(dplyr)
library(purrr)
df %>%
mutate(min = select(., -e) %>%
reduce(pmin))
# A tibble: 5 x 5
# a b d e min
# <int> <int> <int> <int> <int>
#1 1 6 11 16 1
#2 2 7 12 17 2
#3 3 8 13 18 3
#4 4 9 14 19 4
#5 5 10 15 20 5
Or with syms and !!!. Note that the en- prefixed variants (such as ensyms) are meant for use inside a function, where they capture the function's arguments; at the top level, use syms:
df %>%
mutate(min = pmin(!!! syms(names(.)[-4])))
# A tibble: 5 x 5
# a b d e min
# <int> <int> <int> <int> <int>
#1 1 6 11 16 1
#2 2 7 12 17 2
#3 3 8 13 18 3
#4 4 9 14 19 4
#5 5 10 15 20 5
I would do this by reshaping long, calculating the min by group, and then reshaping back to wide:
df %>%
rowid_to_column() %>%
pivot_longer(cols = -c(e, rowid)) %>%
group_by(rowid) %>%
mutate(min = min(value)) %>%
ungroup() %>%
pivot_wider() %>%
select(-rowid, -min, min)
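Another way to exclude "e" by name is a row-wise computation. The sketch below assumes dplyr 1.0 or later, where rowwise() and c_across() are available; it is usually slower than the vectorised pmin() variants above on large data, so it is shown mainly for readability.

```r
library(dplyr)

df <- tibble(a = 1:5,
             b = 6:10,
             d = 11:15,
             e = 16:20)

out <- df %>%
  rowwise() %>%
  # c_across(-e) collects the values of every column except e for the current row
  mutate(min = min(c_across(-e))) %>%
  ungroup()

out
```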

Collapsing a data.frame by group and interval coordinates

I have a data.frame which specifies linear intervals (along chromosomes), where each interval is assigned to a group:
df <- data.frame(chr = c(rep("1",5),rep("2",4),rep("3",5)),
start = c(seq(1,50,10),seq(1,40,10),seq(1,50,10)),
end = c(seq(10,50,10),seq(10,40,10),seq(10,50,10)),
group = c(c("g1.1","g1.1","g1.2","g1.3","g1.1"),c("g2.1","g2.2","g2.3","g2.2"),c("g3.1","g3.2","g3.2","g3.2","g3.3")),
stringsAsFactors = F)
I'm looking for a fast way to collapse df by chr and by group such that consecutive intervals along a chr that are assigned to the same group are combined and their start and end coordinates are modified accordingly.
Here's the desired outcome for this example:
res.df <- data.frame(chr = c(rep("1",4),rep("2",4),rep("3",3)),
start = c(c(1,21,31,41),c(1,11,21,31),c(1,11,41)),
end = c(c(20,30,40,50),c(10,20,30,40),c(10,40,50)),
group = c("g1.1","g1.2","g1.3","g1.1","g2.1","g2.2","g2.3","g2.2","g3.1","g3.2","g3.3"),
stringsAsFactors = F)
Edit: To account for the consecutive requirement you can use the same approach as earlier but add an extra grouping variable based on consecutive values.
library(dplyr)
df %>%
group_by(chr, group, temp.grp = with(rle(group), rep(seq_along(lengths), lengths))) %>%
summarise(start = min(start),
end = max(end)) %>%
arrange(chr, start) %>%
select(chr, start, end, group)
# A tibble: 11 x 4
# Groups: chr, group [9]
chr start end group
<chr> <dbl> <dbl> <chr>
1 1 1 20 g1.1
2 1 21 30 g1.2
3 1 31 40 g1.3
4 1 41 50 g1.1
5 2 1 10 g2.1
6 2 11 20 g2.2
7 2 21 30 g2.3
8 2 31 40 g2.2
9 3 1 10 g3.1
10 3 11 40 g3.2
11 3 41 50 g3.3
A different tidyverse approach (which does not handle the consecutive-interval requirement added in the updated post) could be:
df %>%
gather(var, val, -c(chr, group)) %>%
group_by(chr, group) %>%
filter(val == min(val) | val == max(val)) %>%
spread(var, val)
chr group end start
<chr> <chr> <dbl> <dbl>
1 1 g1.1 20 1
2 1 g1.2 30 21
3 1 g1.3 50 31
4 2 g2.1 10 1
5 2 g2.2 20 11
6 2 g2.3 40 21
7 3 g3.1 10 1
8 3 g3.2 40 11
9 3 g3.3 50 41
Or:
df %>%
group_by(chr, group) %>%
summarise_all(list(min = min, max = max)) %>% # funs() is deprecated; a named list gives the same column names
select(-end_min, -start_max)
chr group start_min end_max
<chr> <chr> <dbl> <dbl>
1 1 g1.1 1 20
2 1 g1.2 21 30
3 1 g1.3 31 50
4 2 g2.1 1 10
5 2 g2.2 11 20
6 2 g2.3 21 40
7 3 g3.1 1 10
8 3 g3.2 11 40
9 3 g3.3 41 50
A solution, using also rleid() from data.table, to the updated post could be:
df %>%
group_by(chr, group, group2 = data.table::rleid(group)) %>%
summarise_all(list(min = min, max = max)) %>% # funs() is deprecated; a named list gives the same column names
select(-end_min, -start_max)
chr group group2 start_min end_max
<chr> <chr> <int> <dbl> <dbl>
1 1 g1.1 1 1 20
2 1 g1.1 4 41 50
3 1 g1.2 2 21 30
4 1 g1.3 3 31 40
5 2 g2.1 5 1 10
6 2 g2.2 6 11 20
7 2 g2.2 8 31 40
8 2 g2.3 7 21 30
9 3 g3.1 9 1 10
10 3 g3.2 10 11 40
11 3 g3.3 11 41 50
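The run identifier can also be built with dplyr alone. The following sketch is an assumption-laden alternative rather than one of the answers above: it assumes the rows are already sorted by chr and start, and it starts a new run whenever either chr or group changes. The names run and res are illustrative.

```r
library(dplyr)

df <- data.frame(
  chr = c(rep("1", 5), rep("2", 4), rep("3", 5)),
  start = c(seq(1, 50, 10), seq(1, 40, 10), seq(1, 50, 10)),
  end = c(seq(10, 50, 10), seq(10, 40, 10), seq(10, 50, 10)),
  group = c("g1.1", "g1.1", "g1.2", "g1.3", "g1.1",
            "g2.1", "g2.2", "g2.3", "g2.2",
            "g3.1", "g3.2", "g3.2", "g3.2", "g3.3"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # increment the run counter each time chr or group differs from the previous row
  mutate(run = cumsum(chr != lag(chr, default = first(chr)) |
                        group != lag(group, default = first(group)))) %>%
  group_by(chr, group, run) %>%
  summarise(start = min(start), end = max(end), .groups = "drop") %>%
  arrange(chr, start) %>%
  select(chr, start, end, group)

res
```

On the example data this reproduces the 11 rows of the desired res.df.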
