Group data frame row by consecutive value in R [duplicate]

I need to group a data frame by consecutive value in a row.
So for example, given this data frame:
tibble( time = c(1,2,3,4,5,10,11,20,30,31,32,40) )
I want to have a grouping column like:
tibble( time = c(1,2,3,4,5,10,11,20,30,31,32,40), group=c(1,1,1,1,1,2,2,3,4,4,4,5) )
What's a tidyverse (or base R) way to get the column group as explained?

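We could do it this way. Note that this bins the values by tens, so it matches the desired grouping only when each run of consecutive values falls in its own block of ten, as in the example:
library(dplyr)
df <- tibble(time = c(1,2,3,4,5,10,11,20,30,31,32,40))
df %>%
  arrange(time) %>%
  mutate(group = (time %/% 10) + 1)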
time group
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5
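The integer-division approach above depends on where the values happen to sit relative to multiples of 10. As a minimal sketch, using a made-up vector that is not part of the question, the following compares it with grouping on the actual gaps between neighbours. The names by_decade and by_run are illustrative only.

```r
# Hypothetical data: one run (8, 9, 10, 11) crosses a multiple of 10,
# and 25 and 27 are not consecutive but share the same block of ten.
time <- c(8, 9, 10, 11, 25, 27)

# Binning by tens splits the run 8:11 and merges 25 with 27.
by_decade <- time %/% 10 + 1

# Starting a new group whenever the step to the previous value is not 1.
by_run <- cumsum(c(TRUE, diff(time) != 1))

print(data.frame(time, by_decade, by_run))
```

On data like this the two columns disagree, so the gap-based grouping is the safer choice when "consecutive" means "differs by exactly 1".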

We could use diff on adjacent values of 'time' and check whether each difference is not equal to 1. Taking the cumulative sum (cumsum) of that logical vector turns it into a numeric index that increments by 1 at each TRUE value. Here df1 is the tibble from the question.
library(dplyr)
df1 %>%
mutate(grp = cumsum(c(TRUE, diff(time) != 1)))
Output:
# A tibble: 12 x 2
time grp
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5
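The same idea can be traced step by step in base R. This is a minimal sketch on the question's vector; the intermediate names jumps, new_run and runs are illustrative and do not come from the answer.

```r
time <- c(1, 2, 3, 4, 5, 10, 11, 20, 30, 31, 32, 40)

# Differences between neighbours; one element shorter than `time`.
jumps <- diff(time)

# A run starts at the first element and wherever the step is not exactly 1.
new_run <- c(TRUE, jumps != 1)

# Counting the run starts so far gives the group id of each element.
grp <- cumsum(new_run)

# Optionally split the vector into one piece per run.
runs <- split(time, grp)

print(grp)
```

Prepending TRUE makes the result the same length as the input and numbers the first run as group 1.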

You can use the following solution:
library(dplyr)
library(purrr)
df %>%
mutate(grp = accumulate(2:nrow(df), .init = 1,
~ if(time[.y] - time[.y - 1] == 1) {
.x
} else {
.x + 1
}))
# A tibble: 12 x 2
time grp
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5

Related

Converting time-dependent variable to long format using one variable indicating day of update

I am trying to convert my data to a long format using one variable that indicates a day of the update.
I have the following variables:
baseline temperature variable "temp_b";
time-varying temperature variable "temp_v" and
the number of days "n_days" when the varying variable is updated.
I want to create a long format using the carried forward approach and a max follow-up time of 5 days.
Example of data
df <- structure(list(id=1:3, temp_b=c(20L, 7L, 7L), temp_v=c(30L, 10L, NA), n_days=c(2L, 4L, NA)), class="data.frame", row.names=c(NA, -3L))
# id temp_b temp_v n_days
# 1 1 20 30 2
# 2 2 7 10 4
# 3 3 7 NA NA
df_long <- structure(list(id=c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
days_cont=c(1,2,3,4,5, 1,2,3,4,5, 1,2,3,4,5),
long_format=c(20,30,30,30,30,7,7,7,10,10,7,7,7,7,7)),
class="data.frame", row.names=c(NA, -15L))
# id days_cont long_format
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
You could repeat each row 5 times with tidyr::uncount():
library(dplyr)
df %>%
tidyr::uncount(5) %>%
group_by(id) %>%
transmute(days_cont = 1:n(),
temp = ifelse(row_number() < n_days | is.na(n_days), temp_b, temp_v)) %>%
ungroup()
# # A tibble: 15 × 3
# id days_cont temp
# <int> <int> <int>
# 1 1 1 20
# 2 1 2 30
# 3 1 3 30
# 4 1 4 30
# 5 1 5 30
# 6 2 1 7
# 7 2 2 7
# 8 2 3 7
# 9 2 4 10
# 10 2 5 10
# 11 3 1 7
# 12 3 2 7
# 13 3 3 7
# 14 3 4 7
# 15 3 5 7
Here's a possibility using tidyverse functions. First, pivot_longer and get rid of unwanted values (that will not appear in the final df, i.e. values with temp_v == NA), then group_by id, and mutate the n_days variable to match the number of rows it will have in the final df. Finally, uncount the dataframe.
library(tidyverse)
df %>%
replace_na(list(n_days = 6)) %>%
pivot_longer(-c(id, n_days)) %>%
filter(!is.na(value)) %>%
group_by(id) %>%
mutate(n_days = case_when(name == "temp_b" ~ n_days - 1,
name == "temp_v" ~ 5 - (n_days - 1))) %>%
uncount(n_days) %>%
mutate(days_cont = row_number()) %>%
select(id, days_cont, long_format = value)
id days_cont long_format
<int> <int> <int>
1 1 1 20
2 1 2 30
3 1 3 30
4 1 4 30
5 1 5 30
6 2 1 7
7 2 2 7
8 2 3 7
9 2 4 10
10 2 5 10
11 3 1 7
12 3 2 7
13 3 3 7
14 3 4 7
15 3 5 7
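For comparison, the carry-forward logic can also be sketched in base R without tidyr. This is only an illustration under the question's assumptions (one update per id, follow-up capped at 5 days); max_days, long and updated are names introduced here, not part of either answer.

```r
df <- data.frame(id = 1:3,
                 temp_b = c(20L, 7L, 7L),
                 temp_v = c(30L, 10L, NA),
                 n_days = c(2L, 4L, NA))
max_days <- 5L

# Repeat every row once per follow-up day and number the days within each id.
long <- df[rep(seq_len(nrow(df)), each = max_days), ]
long$days_cont <- rep(seq_len(max_days), times = nrow(df))

# From the update day onwards use temp_v; before it, or if there is
# no update at all (n_days is NA), keep the baseline temp_b.
updated <- !is.na(long$n_days) & long$days_cont >= long$n_days
long$long_format <- ifelse(updated, long$temp_v, long$temp_b)

long <- long[, c("id", "days_cont", "long_format")]
rownames(long) <- NULL
print(long)
```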

How do I find the largest range in a dataset, and filter out the other data?

Competitor Laps
1 1 1
2 1 2
3 1 3
4 1 4
5 1 1
6 1 2
7 1 3
8 1 4
9 1 5
10 1 6
11 1 7
12 1 8
I need to identify the longest range in laps. Here, that is rows 5 to 12, with a range of 7, as opposed to rows 1 to 4, with a range of 3. After identifying the largest range, I should keep only the values that contribute to that range. So, my final dataset should look like:
Competitor Laps
5 1 1
6 1 2
7 1 3
8 1 4
9 1 5
10 1 6
11 1 7
12 1 8
How should I go about this?
Potential solution with dplyr:
library(dplyr)
dat <- tibble(
Competitor = 1,
Laps = c(seq(1,4), seq(1,8))
)
dat |>
mutate(StintId = cumsum(if_else(Laps == 1, 1, 0))) |>
group_by(StintId) |>
mutate(range = max(Laps) - min(Laps)) |>
ungroup() |>
filter(range == max(range)) |>
select(-StintId, -range)
Output:
# A tibble: 8 x 2
Competitor Laps
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
This returns the largest range for each competitor. It assumes that each stint starts at lap 1 and that laps are sequential.
library(dplyr)
df<-data.frame(Competitor=c(rep(1,12), rep(2,16)),
Laps=c(1:4, 1:8, 1:9, 1:7))
df %>%
group_by(Competitor) %>%
mutate(LapGroup=cumsum(if_else(Laps==1,1,0))) %>%
group_by(Competitor, LapGroup) %>%
mutate(MaxLaps=max(Laps)) %>%
group_by(Competitor) %>%
filter(MaxLaps==max(Laps))
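Both answers detect a new stint by looking for Laps == 1. A minimal base R sketch of a variant that instead starts a stint wherever the lap count fails to increase by exactly 1 is shown below. It assumes a single competitor, as in the question's data, and the names stint, stint_range and longest are illustrative.

```r
dat <- data.frame(Competitor = 1, Laps = c(1:4, 1:8))

# New stint wherever the lap number does not follow on from the previous row.
stint <- cumsum(c(TRUE, diff(dat$Laps) != 1))

# Range of laps within each stint, repeated for every row of that stint.
stint_range <- ave(dat$Laps, stint, FUN = function(x) max(x) - min(x))

# Keep only the rows belonging to the stint with the largest range.
longest <- dat[stint_range == max(stint_range), ]
print(longest)
```

With several competitors, the stint id would also need to change wherever Competitor changes; that case is not handled here.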

If 1 appears, all subsequent elements of the variable must be 1, grouped by subject

I want to go from:
test <- data.frame(subject=c(rep(1,10),rep(2,10)),x=1:10,y=0:1)
to a data frame in which, as the title says, once the first 1 appears, all subsequent values of "y" for a given "subject" change to 1, and likewise for the next "subject".
I tried something like that:
test <- test%>%
group_nest(subject) %>%
mutate(XD = map(data,function(x){
ifelse(x$y[which(grepl(1, x$y))[1]:nrow(x)]==TRUE , 1,0)})) %>% unnest(cols = c(data,XD))
It didn't work :(
Try this:
library(dplyr)
#Code
new <- test %>%
group_by(subject) %>%
mutate(y=ifelse(row_number()<min(which(y==1)),y,1))
Output:
# A tibble: 20 x 3
# Groups: subject [2]
subject x y
<dbl> <int> <dbl>
1 1 1 0
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 1 7 1
8 1 8 1
9 1 9 1
10 1 10 1
11 2 1 0
12 2 2 1
13 2 3 1
14 2 4 1
15 2 5 1
16 2 6 1
17 2 7 1
18 2 8 1
19 2 9 1
20 2 10 1
Since you appear to just have 0's and 1's, a straightforward approach would be to take a cumulative maximum via the cummax function:
library(dplyr)
test %>%
group_by(subject) %>%
mutate(y = cummax(y))
Duck's answer is considerably more robust if values other than 0 and 1 may appear before or after the first 1.
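The difference between the two approaches can be seen in a small base R sketch. The first part reproduces the cummax idea with ave instead of dplyr. The second part uses a made-up vector v (not from the question) that contains a value other than 0 and 1. The names y_cummax, by_cummax and by_first_one are illustrative.

```r
test <- data.frame(subject = c(rep(1, 10), rep(2, 10)), x = 1:10, y = 0:1)

# Base R counterpart of group_by(subject) %>% mutate(y = cummax(y)).
test$y_cummax <- ave(test$y, test$subject, FUN = cummax)

# Hypothetical vector with a 2 before the first 1.
v <- c(0, 2, 1, 0)

# cummax carries the 2 forward, which is not "1 after the first 1".
by_cummax <- cummax(v)

# Keeping values before the first 1 and writing 1 from there on.
by_first_one <- ifelse(seq_along(v) < min(which(v == 1)), v, 1)

print(rbind(v, by_cummax, by_first_one))
```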

R Tidy : Dynamic Sequential Threshold

I'm trying to find a tidy way to dynamically adjust a threshold as I "move" through a tibble using library(tidyverse). For example, imagine a tibble containing sequential observations:
example <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,3))
example
# A tibble: 11 x 1
observed
<dbl>
1 2
2 1
3 1
4 2
5 2
6 4
7 10
8 4
9 2
10 2
11 3
I'm trying to calculate a threshold that starts with the initial value (2) and increments by a prespecified amount (in this case, 1) unless the current observation is greater than that threshold in which case the current observation becomes the reference threshold and further thresholds increment from it. Here is what the final tibble would look like:
answer <-
example %>%
mutate(threshold = c(2,3,4,5,6,7,10,11,12,13,14))
answer
# A tibble: 11 x 2
observed threshold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 7
7 10 10
8 4 11
9 2 12
10 2 13
11 3 14
I'm looking for the best way to do this using dplyr/tidy. All help is appreciated!
EDIT:
The answers so far are very close, but miss in the case that the observed values drop and increase again. For example consider the same tibble as example above, but with a 4 instead of a 3 for the final observation:
example <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,4))
example
# A tibble: 11 x 1
observed
<dbl>
1 2
2 1
3 1
4 2
5 2
6 4
7 10
8 4
9 2
10 2
11 4
The diff & cumsum method (with thresh <- 1) then gives:
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh))) %>%
mutate(thresold = first(observed) + row_number() - 1) %>%
ungroup %>%
select(-gr)
# A tibble: 11 x 2
observed thresold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 4
7 10 10
8 4 11
9 2 12
10 2 13
11 4 4
Where the final threshold value is incorrect.
You could use diff to create groups and add row number in the group to the first value.
library(dplyr)
thresh <- 1
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh))) %>%
mutate(thresold = first(observed) + row_number() - 1) %>%
ungroup %>%
select(-gr)
# A tibble: 11 x 2
# observed thresold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 4
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 3 14
To understand how the groups are created here are the detailed steps :
We first calculate the difference between consecutive values
diff(example$observed)
#[1] -1 0 1 0 2 6 -6 -2 0 1
Note that diff gives output of length 1 less than the actual length.
We compare it with thresh which gives TRUE for every time we have value greater than the threshold
diff(example$observed) > thresh
#[1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
Now since output of diff has one value less we add one value as TRUE
c(TRUE, diff(example$observed) > thresh)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
and then finally take cumsum to create groups which is used in group_by.
cumsum(c(TRUE, diff(example$observed) > thresh))
# [1] 1 1 1 1 1 2 3 3 3 3 3
EDIT
For the updated question we can add another condition that checks whether the observed value is greater than the current count, and update the values accordingly.
example %>%
group_by(gr = cumsum(c(TRUE, diff(observed) > thresh) &
observed > first(observed) + row_number())) %>%
mutate(thresold = first(observed) + row_number() - 1) %>%
ungroup() %>%
select(-gr)
# A tibble: 11 x 2
# observed thresold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 7
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 4 14
We can create the grouping variable with lag of the column difference
library(dplyr)
thresh <- 1
example %>%
group_by(grp = cumsum((observed - lag(observed, default = first(observed)) >
thresh))) %>%
mutate(threshold = observed[1] + row_number() - 1) %>%
ungroup %>%
mutate(new = row_number() + 1,
threshold = pmax(threshold, new)) %>%
select(-grp, -new)
# A tibble: 11 x 2
# observed threshold
# <dbl> <dbl>
# 1 2 2
# 2 1 3
# 3 1 4
# 4 2 5
# 5 2 6
# 6 4 7
# 7 10 10
# 8 4 11
# 9 2 12
#10 2 13
#11 3 14
I think I've figured out a way to do this by using zoo::na.locf (although I'm not sure this part is really necessary).
First create the harder of the two examples I've listed in my description:
example2 <-
tibble(observed = c(2,1,1,2,2,4,10,4,2,2,4))
example2 %>%
mutate(def = first(observed) + row_number() - 1) %>%
mutate(t1 = pmax(observed,def)) %>%
mutate(local_maxima = ifelse(observed == t1,t1,NA)) %>%
mutate(groupings = zoo::na.locf(local_maxima)) %>%
group_by(groupings) %>%
mutate(threshold = groupings + row_number() - 1) %>%
ungroup() %>%
select(-def,-t1,-local_maxima,-groupings)
Result:
# A tibble: 11 x 2
observed threshold
<dbl> <dbl>
1 2 2
2 1 3
3 1 4
4 2 5
5 2 6
6 4 7
7 10 10
8 4 11
9 2 12
10 2 13
11 4 14
I'd definitely prefer a more elegant solution if anyone finds one.
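One way to read the rule in the question is as a recurrence: each threshold is the larger of the current observation and the previous threshold plus the increment. The sketch below is an interpretation rather than any of the posted answers. It writes the recurrence with base R's Reduce and then gives an equivalent vectorised form based on cummax. The names step, offset, threshold_rec and threshold_vec are illustrative.

```r
observed <- c(2, 1, 1, 2, 2, 4, 10, 4, 2, 2, 4)
step <- 1

# Recurrence: threshold[i] = max(observed[i], threshold[i - 1] + step),
# starting from threshold[1] = observed[1].
threshold_rec <- Reduce(
  function(prev, obs) max(obs, prev + step),
  observed[-1],
  init = observed[1],
  accumulate = TRUE
)

# Vectorised form: subtract the steady increment, take the running
# maximum, then add the increment back.
offset <- step * (seq_along(observed) - 1)
threshold_vec <- cummax(observed - offset) + offset

print(data.frame(observed, threshold_rec, threshold_vec))
```

Subtracting the offset turns "previous threshold plus step" into "previous value unchanged", which is why a plain running maximum reproduces the recurrence.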

Running totals calculation by factor [duplicate]

I have the same question as this post, but I want to use dplyr:
With an R dataframe, eg:
df <- data.frame(id = rep(1:3, each = 5)
, hour = rep(1:5, 3)
, value = sample(1:15))
how do I add a cumulative sum column that matches the id?
Without dplyr the accepted solution of the previous post is:
df$csum <- ave(df$value, df$id, FUN=cumsum)
Like this?
df <- data.frame(id = rep(1:3, each = 5),
hour = rep(1:5, 3),
value = sample(1:15))
mutate(group_by(df,id), csum=cumsum(value))
Or if you use the dplyr's piping operator:
df %>% group_by(id) %>% mutate(csum = cumsum(value))
Result in both cases (the values depend on the random draw from sample()):
Source: local data frame [15 x 4]
Groups: id
id hour value csum
1 1 1 4 4
2 1 2 14 18
3 1 3 8 26
4 1 4 2 28
5 1 5 3 31
6 2 1 10 10
7 2 2 7 17
8 2 3 5 22
9 2 4 12 34
10 2 5 9 43
11 3 1 6 6
12 3 2 15 21
13 3 3 1 22
14 3 4 13 35
15 3 5 11 46
