Number of switches in a vector by group - r

Given the vector
vec <- c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0)
the number of switches can simply be calculated with
cumsum(diff(vec) == 1)
However, given the df
group <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
df_test <- data.frame(vec, group)
and using
df_test <- df_test %>%
  group_by(group) %>%
  mutate(n_switches = cumsum(diff(vec) == 1)) %>%
  ungroup()
the calculation fails because n_switches is 2 rows (1 per group) shorter than vec.
How can I overcome this problem and generate n_switches per group? Any help would be much appreciated!

With diff, the output is always one element shorter than the input. We can prepend an element to make the lengths match, since mutate requires the returned column to have the same length as the group:
library(dplyr)
df_test %>%
  group_by(group) %>%
  mutate(n_switches = cumsum(c(FALSE, diff(vec) == 1))) %>%
  ungroup()
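For comparison, the same padded-diff logic can be written in base R with ave; a sketch, not part of the original answers:
df_test$n_switches <- ave(
  df_test$vec, df_test$group,
  # pad with FALSE so the cumsum has the same length as each group
  FUN = function(v) cumsum(c(FALSE, diff(v) == 1))
)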

Here is another option besides diff, which uses gregexpr to locate the positions of the shifts:
df_test %>%
  group_by(group) %>%
  mutate(n_switches = cumsum(replace(0 * vec, gregexpr("(?<=0)1", paste0(vec, collapse = ""), perl = TRUE)[[1]], 1))) %>%
  ungroup()
# A tibble: 20 x 3
     vec group n_switches
   <dbl> <dbl>      <dbl>
 1     1     1          0
 2     0     1          0
 3     1     1          1
 4     1     1          1
 5     1     1          1
 6     0     1          1
 7     0     1          1
 8     1     1          2
 9     1     1          2
10     0     1          2
11     1     2          0
12     0     2          0
13     1     2          1
14     0     2          1
15     1     2          2
16     1     2          2
17     1     2          2
18     0     2          2
19     1     2          3
20     0     2          3
A similar implementation with data.table:
setDT(df_test)[
  ,
  n_switches := cumsum(replace(0 * vec, gregexpr("(?<=0)1", paste0(vec, collapse = ""), perl = TRUE)[[1]], 1)),
  group
]
> df_test
    vec group n_switches
 1:   1     1          0
 2:   0     1          0
 3:   1     1          1
 4:   1     1          1
 5:   1     1          1
 6:   0     1          1
 7:   0     1          1
 8:   1     1          2
 9:   1     1          2
10:   0     1          2
11:   1     2          0
12:   0     2          0
13:   1     2          1
14:   0     2          1
15:   1     2          2
16:   1     2          2
17:   1     2          2
18:   0     2          2
19:   1     2          3
20:   0     2          3

You can define a switch as a position where the current value is 1 and the previous value was 0, and then create the n_switches variable with cumsum.
Using dplyr you can do this as:
library(dplyr)
df_test %>%
  group_by(group) %>%
  mutate(n_switches = cumsum(vec == 1 & lag(vec, default = 1) == 0))
#      vec group n_switches
#    <dbl> <dbl>      <int>
#  1     1     1          0
#  2     0     1          0
#  3     1     1          1
#  4     1     1          1
#  5     1     1          1
#  6     0     1          1
#  7     0     1          1
#  8     1     1          2
#  9     1     1          2
# 10     0     1          2
# 11     1     2          0
# 12     0     2          0
# 13     1     2          1
# 14     0     2          1
# 15     1     2          2
# 16     1     2          2
# 17     1     2          2
# 18     0     2          2
# 19     1     2          3
# 20     0     2          3
and the same logic with data.table:
library(data.table)
setDT(df_test)[, n_switches := cumsum(vec == 1 & shift(vec, fill = 1) == 0), group]

Related

How to loop ifelse function through a grouped variable with dplyr

I'm trying to apply a rule within each group of IDs: after the first row where the value of one variable equals 1, the value of another variable should equal 1 in all subsequent rows of that group.
Essentially, here is what I am trying to do:
I have:
ID D
 1 1
 1 0
 1 0
 2 0
 2 0
 3 1
 3 0
 3 0
 4 1
 4 0
 4 1
 4 1
 4 1
 4 0
I want:
ID D PREV
 1 1    0
 1 0    1
 1 0    1
 2 0    0
 2 0    0
 3 1    0
 3 0    1
 3 0    1
 4 1    0
 4 0    1
 4 1    1
 4 1    1
 4 1    1
 4 0    1
I'm trying to use dplyr to iterate through the grouped rows, applying an ifelse function within each group. My code looks like this:
data$prev <- 0
data <- data %>%
  group_by(ID) %>%
  mutate(prev = if_else(lag(prev) == 1 | lag(D) == 1, 1, 0))
But for some reason this is not being applied across the whole group, resulting in data that looks something like this:
ID D PREV
 1 1    0
 1 0    1
 1 0    0
 2 0    0
 2 0    0
 3 1    0
 3 0    1
 3 0    0
 4 1    0
 4 0    1
 4 1    0
 4 1    1
 4 1    1
 4 0    1
Can anyone help me with this?
What about this:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(prev = +(cumsum(c(0, D[-length(D)])) > 0)) %>%
  ungroup()
#> # A tibble: 14 x 3
#>       ID     D  prev
#>    <int> <int> <int>
#>  1     1     1     0
#>  2     1     0     1
#>  3     1     0     1
#>  4     2     0     0
#>  5     2     0     0
#>  6     3     1     0
#>  7     3     0     1
#>  8     3     0     1
#>  9     4     1     0
#> 10     4     0     1
#> 11     4     1     1
#> 12     4     1     1
#> 13     4     1     1
#> 14     4     0     1
To explain what it does, let's take a simple vector; the calculation is the same for each group.
Let x be our vector:
x <- c(0,0,0,1,1,0,0,2,3,4)
Do the cumulative sum over x
cumsum(x)
#> [1] 0 0 0 1 2 2 2 4 7 11
You are interested only in the values above zero, therefore:
cumsum(x)>0
#> [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
You don't want a logical but a numeric vector; a simple + does the trick:
+(cumsum(x)>0)
#> [1] 0 0 0 1 1 1 1 1 1 1
However, you want the 1s delayed by one position. Thus, we add a zero on top of x:
+(cumsum(c(0,x))>0)
#> [1] 0 0 0 0 1 1 1 1 1 1 1
We need to keep the same length, so we remove the last value of x.
+(cumsum(c(0, x[-length(x)])) > 0)
#> [1] 0 0 0 0 1 1 1 1 1 1
And that does the trick.
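Wrapped up as a small helper (a sketch; the function name lagged_flag is made up here):
# 1 from the position after the first nonzero value onward, else 0
lagged_flag <- function(x) +(cumsum(c(0, x[-length(x)])) > 0)
lagged_flag(c(0, 0, 0, 1, 1, 0, 0, 2, 3, 4))
#> [1] 0 0 0 0 1 1 1 1 1 1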
We can use lag
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(prev = lag(cumsum(D) > 0, default = 0))
-output
# A tibble: 14 x 3
# Groups:   ID [4]
#      ID     D  prev
#   <dbl> <dbl> <dbl>
# 1     1     1     0
# 2     1     0     1
# 3     1     0     1
# 4     2     0     0
# 5     2     0     0
# 6     3     1     0
# 7     3     0     1
# 8     3     0     1
# 9     4     1     0
#10     4     0     1
#11     4     1     1
#12     4     1     1
#13     4     1     1
#14     4     0     1
data
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
  D = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
)
You can use a newer dplyr function, dplyr::group_modify, to apply a function over groups:
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
  D = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
)
df %>%
  group_by(ID) %>%
  group_modify(function(x, y) {
    # if the group's first D is 1, every later row gets prev = 1
    boo <- x$D[1] == 1
    if (boo) {
      x$prev <- 1
      x$prev[1] <- 0
    } else {
      x$prev <- 0
    }
    x
  })
# A tibble: 14 x 3
# Groups:   ID [4]
      ID     D  prev
   <dbl> <dbl> <dbl>
 1     1     1     0
 2     1     0     1
 3     1     0     1
 4     2     0     0
 5     2     0     0
 6     3     1     0
 7     3     0     1
 8     3     0     1
 9     4     1     0
10     4     0     1
11     4     1     1
12     4     1     1
13     4     1     1
14     4     0     1

Set value to 0 if any of the remaining values is 0

I have a data.frame like this:
dat <- data.frame(
  ID = c(rep(1, 13), rep(2, 5)),
  time = c(seq(1, 13), seq(1, 5)),
  value = c(rep(0, 5), rep(1, 3), 2, 0, 1, 5, 20, rep(0, 2), 1:3)
)
   ID time value
1   1    1     0
2   1    2     0
3   1    3     0
4   1    4     0
5   1    5     0
6   1    6     1
7   1    7     1
8   1    8     1
9   1    9     2
10  1   10     0
11  1   11     1
12  1   12     5
13  1   13    20
14  2    1     0
15  2    2     0
16  2    3     1
17  2    4     2
18  2    5     3
My goal is to set all values to 0 if, among the remaining (later) values for that ID (sorted by time), there is any other 0. In the example data, that means rows 6:9 should become 0.
I tried dat %>% group_by(ID) %>% mutate(value2 = ifelse(lead(value, order_by = time) == 0, 0, value)), but I would have to run this several times, since it only changes one row at a time (i.e. row 9 first, then row 8, etc.).
A dplyr solution would be preferred, but I'd take anything that works :)
Short explanation: value is the size of a tumor. If the tumor does not grow large, but actually vanishes completely at a later time, it was most likely an irrelevant encapsulation, hence should be coded as "zero tumor".
I am not sure whether this is your desired output, but maybe it can be useful to you:
dat %>%
  group_by(ID) %>%
  arrange(-time) %>%
  mutate(value = if_else(cumsum(value == 0) > 0, 0, value)) %>%
  arrange(ID, time)
      ID  time value
   <dbl> <int> <dbl>
 1     1     1     0
 2     1     2     0
 3     1     3     0
 4     1     4     0
 5     1     5     0
 6     1     6     0
 7     1     7     0
 8     1     8     0
 9     1     9     0
10     1    10     0
11     1    11     1
12     1    12     5
13     1    13    20
14     2     1     0
15     2     2     0
16     2     3     1
17     2     4     2
18     2     5     3
Basically, I first put the observations in descending order. Then I check whether there has been a zero in value (cumsum(value == 0) > 0). If yes, I set all remaining values to zero.
Finally, I put the observations in correct order again.
If you do not want to order and reorder the data you can use the following code, which relies on the same logic but is a bit more difficult to read:
dat %>%
  group_by(ID) %>%
  arrange(ID, time) %>%
  mutate(value = if_else(cumsum(value == 0) < sum(value == 0), 0, value))
Or a bit more efficiently, without if_else:
dat %>%
  group_by(ID) %>%
  arrange(ID, time) %>%
  mutate(value = value * (cumsum(value == 0) >= sum(value == 0)))
One way could be to find the indices of the first and last occurrences of 0 and replace everything in between.
library(dplyr)
dat %>%
  group_by(ID) %>%
  mutate(value = replace(value, between(row_number(), which.max(value == 0), tail(which(value == 0), 1)), 0))
# A tibble: 18 x 3
# Groups:   ID [2]
      ID  time value
   <dbl> <int> <dbl>
 1     1     1     0
 2     1     2     0
 3     1     3     0
 4     1     4     0
 5     1     5     0
 6     1     6     0
 7     1     7     0
 8     1     8     0
 9     1     9     0
10     1    10     0
11     1    11     1
12     1    12     5
13     1    13    20
14     2     1     0
15     2     2     0
16     2     3     1
17     2     4     2
18     2     5     3
With data.table you can calculate fields with the data in a certain order, without actually reordering the data frame, which is useful here.
library(data.table)
setDT(dat)
dat[order(-time), value := fifelse(cumsum(value == 0) > 0, 0, value), ID]
dat
#     ID time value
#  1:  1    1     0
#  2:  1    2     0
#  3:  1    3     0
#  4:  1    4     0
#  5:  1    5     0
#  6:  1    6     0
#  7:  1    7     0
#  8:  1    8     0
#  9:  1    9     0
# 10:  1   10     0
# 11:  1   11     1
# 12:  1   12     5
# 13:  1   13    20
# 14:  2    1     0
# 15:  2    2     0
# 16:  2    3     1
# 17:  2    4     2
# 18:  2    5     3
You can use accumulate(..., .dir = "backward") in purrr
library(dplyr)
library(purrr)
dat %>%
  group_by(ID) %>%
  arrange(time, .by_group = TRUE) %>%
  mutate(value2 = accumulate(value, ~ if (.y == 0) 0 else .x, .dir = "backward")) %>%
  ungroup()
# A tibble: 18 x 4
      ID  time value value2
   <dbl> <int> <dbl>  <dbl>
 1     1     1     0      0
 2     1     2     0      0
 3     1     3     0      0
 4     1     4     0      0
 5     1     5     0      0
 6     1     6     1      0
 7     1     7     1      0
 8     1     8     1      0
 9     1     9     2      0
10     1    10     0      0
11     1    11     1      1
12     1    12     5      5
13     1    13    20     20
14     2     1     0      0
15     2     2     0      0
16     2     3     1      1
17     2     4     2      2
18     2     5     3      3
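The same backward scan can also be done in base R by counting, for each position, how many zeros remain at or after it; a sketch (not from the original answers), assuming dat is already sorted by ID and time:
# rev(cumsum(rev(v == 0))) counts the zeros at or after each position;
# a value is kept only when that count is zero
dat$value2 <- with(dat, ave(value, ID, FUN = function(v) v * (rev(cumsum(rev(v == 0))) == 0)))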

Sum certain values when the first and last observations equal a certain value

I have something like:
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  have = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0),
  wantsum = c(2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 0, 0, 0)
)
I want to count the number of 0's per group, but only when the first and last observations equal 1. Something like:
#   group have wantsum
#1      1    1       2
#2      1    0       2
#3      1    1       2
#4      1    0       2
#5      1    1       2
#6      2    1       3
#7      2    0       3
#8      2    0       3
#9      2    0       3
#10     2    1       3
#11     3    1       0
#12     3    0       0
#13     3    0       0
Thanks
We can create a condition after grouping by 'group': check whether both the first and last observations in 'have' are 1; if so, return the count of 0 values, otherwise return 0.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(wantsum2 = if (all(c(first(have), last(have)) == 1)) sum(have == 0) else 0)
# A tibble: 13 x 4
# Groups:   group [3]
#   group  have wantsum wantsum2
#   <dbl> <dbl>   <dbl>    <dbl>
# 1     1     1       2        2
# 2     1     0       2        2
# 3     1     1       2        2
# 4     1     0       2        2
# 5     1     1       2        2
# 6     2     1       3        3
# 7     2     0       3        3
# 8     2     0       3        3
# 9     2     0       3        3
#10     2     1       3        3
#11     3     1       0        0
#12     3     0       0        0
#13     3     0       0        0
There are many ways to make this happen. The if/else condition can also be replaced by multiplying by the logical value that all generates: since TRUE/FALSE coerce to 1/0, any number multiplied by 0 gives 0 and multiplied by 1 gives the number.
df %>%
  group_by(group) %>%
  mutate(wantsum2 = sum(have == 0) * all(c(first(have), last(have)) == 1))
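The same logic carries over to data.table; a sketch assuming the df above (first and last are data.table's own helpers):
library(data.table)
setDT(df)[, wantsum2 := sum(have == 0) * (first(have) == 1 & last(have) == 1), by = group]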

Identify and label repeated data in a series

I'm trying to identify cases in a dataset where a value occurs multiple times in a row; once this is picked up, a cell in a new column next to the nth occurrence confirms this with a '1'.
df <- data.frame(
  user = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4),
  week = c(1, 2, 3, 4, 1, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8),
  updated = c(1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1)
)
In this case, users are performing a task. If the task is performed, '1' appears for that week, if not '0' appears.
Is it possible, in the event that four or more 0s occur in a row, to mutate an indicator into a new column identifying that this sequence has occurred? Something like this:
   user week updated warning
1     1    1       1       0
2     1    2       0       0
3     1    3       1       0
4     1    4       1       0
5     2    1       1       0
6     3    1       1       0
7     3    2       1       0
8     3    3       1       0
9     4    1       1       0
10    4    2       1       0
11    4    3       0       0
12    4    4       0       0
13    4    5       0       0
14    4    6       0       1
15    4    7       1       0
16    4    8       1       0
Thanks!
Edit:
Apologies, and thanks to @akrun for helping with this.
An additional example is below, where on the 4th missed entry in a row (missed equal to '1') the warning column is updated to flag the event, so that a trigger can run off that data.
df <- data.frame(
  user = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7),
  week = c(1, 2, 3, 4, 1, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8),
  missed = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1)
)
   user week missed warning
1     1    1      0       0
2     1    2      1       0
3     1    3      0       0
4     1    4      0       0
5     2    1      0       0
6     3    1      0       0
7     3    2      0       0
8     3    3      0       0
9     4    1      0       0
10    4    2      0       0
11    4    3      1       0
12    4    4      1       0
13    4    5      1       0
14    4    6      1       1
15    4    7      0       0
16    4    8      0       0
17    5    1      0       0
18    5    2      1       0
19    5    3      0       0
20    5    4      1       0
21    5    5      0       0
22    5    6      0       0
23    5    7      0       0
24    5    8      0       0
25    6    1      0       0
26    6    2      1       0
27    6    3      1       0
28    6    4      1       0
29    6    5      1       1
30    6    6      1       0
31    6    7      0       0
32    7    1      0       0
33    7    2      0       0
34    7    3      0       0
35    7    4      0       0
36    7    5      1       0
37    7    6      1       0
38    7    7      0       0
39    7    8      1       0
An option would be to use rle to create the warning. Grouped by 'user', create 'warning' by checking the run-length encoding (rle) of 'updated': it gives the adjacent repeated 'values' and their 'lengths' as a list, from which we create a logical condition where values is 0 and lengths is greater than or equal to 4.
library(dplyr)
library(data.table)
df %>%
  group_by(user) %>%
  mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths))) %>%
  group_by(grp = rleid(warning), add = TRUE) %>%
  mutate(warning = if (all(warning)) rep(c(0, 1), c(n() - 1, 1)) else 0) %>%
  ungroup %>%
  select(-grp)
# A tibble: 16 x 4
#    user  week updated warning
#   <dbl> <dbl>   <dbl>   <dbl>
# 1     1     1       1       0
# 2     1     2       0       0
# 3     1     3       1       0
# 4     1     4       1       0
# 5     2     1       1       0
# 6     3     1       1       0
# 7     3     2       1       0
# 8     3     3       1       0
# 9     4     1       1       0
#10     4     2       1       0
#11     4     3       0       0
#12     4     4       0       0
#13     4     5       0       0
#14     4     6       0       1
#15     4     7       1       0
#16     4     8       1       0
If we need to flag the whole group wherever a run of four or more 0's occurs, then:
df %>%
  group_by(user) %>%
  mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths)),
         warning = as.integer(any(warning)))
# A tibble: 16 x 4
# Groups:   user [4]
#    user  week updated warning
#   <dbl> <dbl>   <dbl>   <int>
# 1     1     1       1       0
# 2     1     2       0       0
# 3     1     3       1       0
# 4     1     4       1       0
# 5     2     1       1       0
# 6     3     1       1       0
# 7     3     2       1       0
# 8     3     3       1       0
# 9     4     1       1       1
#10     4     2       1       1
#11     4     3       0       1
#12     4     4       0       1
#13     4     5       0       1
#14     4     6       0       1
#15     4     7       1       1
#16     4     8       1       1
I followed a different approach. I numbered sequentially the cases where updated was 0, for each user and rleid(updated). If a 4 appears, that means there are 4 consecutive tasks not done. The warning is thus created where this new vector equals 4.
library(data.table)
df[, warning := {
     id <- 1:.N
     warning <- as.numeric(id == 4)
   },
   by = .(user, rleid(updated))][
     , warning := ifelse(warning == 1 & updated == 0, 1, 0)][
     is.na(warning), warning := 0]
What happens there:
warning := assigns the result of the expression between the {} to warning.
Now, inside the expression:
id <- 1:.N creates a temporary id variable with consecutive numbers for each user and run-length group of updated values.
warning <- as.numeric(id == 4) creates a temporary variable with 1 where id equals 4 and zero otherwise.
The by = .(user, rleid(updated)) groups by both user and the run-length values of updated. Of course there are also run-length groups for updated == 1, so we get rid of those with the ifelse clause. The final [is.na(warning), warning := 0] (notice the chaining) just gets rid of any NA values in the resulting variable.
Data used
> dput(df2)
structure(list(user = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 4,
4, 4, 4, 5, 5, 5, 5, 5), week = c(1, 2, 3, 4, 1, 1, 2, 3, 1,
2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5), updated = c(1, 0, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)), row.names = c(NA,
-21L), class = c("data.table", "data.frame"))
Speed comparison
I just compared with @akrun's answer:
library(microbenchmark)
set.seed(1)
df <- data.table(user = sample(1:10, 100, TRUE), updated = sample(c(1, 0), 100, TRUE), key = "user")
df[, week := 1:.N, by = user]

akrun <- function(df4) {
  df4 %>%
    group_by(user) %>%
    mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths))) %>%
    group_by(grp = rleid(warning), add = TRUE) %>%
    mutate(warning = if (all(warning)) rep(c(0, 1), c(n() - 1, 1)) else 0) %>%
    ungroup %>%
    select(-grp)
}

pavo <- function(df4) {
  df4[, warning := {id <- 1:.N; warning <- as.numeric(id == 4)}, by = .(user, rleid(updated))][
    , warning := ifelse(warning == 1 & updated == 0, 1, 0)][is.na(warning), warning := 0]
}

microbenchmark(akrun(df), pavo(df), times = 100)
Unit: microseconds
      expr      min       lq      mean    median        uq      max neval
 akrun(df) 1920.278 2144.049 2405.0332 2245.1735 2308.0145 6901.939   100
  pavo(df)  823.193  877.061  978.7166  928.0695  991.5365 4905.450   100
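For completeness, the "flag the 4th zero in a run" idea also works in base R with ave plus rle; a sketch (not part of the benchmark above), assuming the original df with user and updated columns:
df$warning <- ave(df$updated, df$user, FUN = function(u) {
  r <- rle(u)
  pos <- sequence(r$lengths)                # position within each run
  zero_run <- rep(r$values == 0, r$lengths) # TRUE inside a run of zeros
  as.numeric(zero_run & pos == 4)           # flag the 4th consecutive zero
})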

R select sequence of certain length

I am trying to figure out how to select sequences of length 3.
Consider the following binary sequence.
   sq
1   0
2   0
3   0
4   1
5   1
6   0
7   0
8   1
9   1
10  1
11  1
12  0
13  0
14  0
15  1
16  1
17  0
18  1
19  1
20  1
21  1
What I would like first is to identify the sequence of length 3.
I tried to use:
new <- sqd %>% group_by(sq) %>% mutate(sq_cum = cumsum(sq)) %>% as.data.frame()
But it sums all the 1s in the sequence, not the consecutive 1s.
What I want is this vector seq_of_three.
   sq sq_cum seq_of_three
1   0      0            0
2   0      0            0
3   0      0            0
4   1      1            0
5   1      2            0
6   0      0            0
7   0      0            0
8   1      3            1
9   1      4            1
10  1      5            1
11  1      6            1
12  0      0            0
13  0      0            0
14  0      0            0
15  1      7            0
16  1      8            0
17  0      0            0
18  1      9            1
19  1     10            1
20  1     11            1
21  1     12            1
Once I get that, I would like to subset the first 3 rows of each such sequence.
   sq sq_cum seq_of_three
8   1      3            1
9   1      4            1
10  1      5            1
18  1      9            1
19  1     10            1
20  1     11            1
data
structure(list(sq = c(0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
0, 1, 1, 0, 1, 1, 1, 1), sq_cum = c(0, 0, 0, 1, 2, 0, 0, 3, 4,
5, 6, 0, 0, 0, 7, 8, 0, 9, 10, 11, 12), seq_of_three = c(0, 0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)), row.names = c(NA,
-21L), class = "data.frame")
We can use rleid to create a grouping variable, then build the binary 'seq_of_three' column by checking the number of rows and the values of 'sq'; filter the rows having 'seq_of_three' as 1 and then slice the first 3 rows. If needed, remove the 'grp' column.
library(dplyr)
library(data.table)
sqd %>%
  group_by(grp = rleid(sq)) %>%
  mutate(seq_of_three = +(n() > 3 & all(sq == 1))) %>%
  filter(seq_of_three == 1) %>%
  slice(1:3) %>%
  ungroup %>%
  select(-grp)
# A tibble: 6 x 3
#     sq sq_cum seq_of_three
#  <dbl>  <dbl>        <int>
#1     1      3            1
#2     1      4            1
#3     1      5            1
#4     1      9            1
#5     1     10            1
#6     1     11            1
NOTE: It is not clear whether the seq_of_three column needs to be created or not. If not, the steps can be made more compact.
Another option is with slice:
sqd %>%
  group_by(grp = rleid(sq)) %>%
  mutate(seq_of_three = +(n() > 3 & all(sq == 1))) %>%
  slice(head(row_number()[seq_of_three == 1], 3)) %>%
  ungroup %>%
  select(-grp)
A different dplyr possibility could be:
library(tibble)  # rowid_to_column() comes from tibble
df %>%
  rowid_to_column() %>%
  group_by(grp = with(rle(sq), rep(seq_along(lengths), lengths))) %>%
  mutate(grp_seq = seq_along(grp)) %>%
  filter(sq == 1 & grp_seq %in% 1:3 & length(grp) >= 3)
  rowid    sq   grp grp_seq
  <int> <int> <int>   <int>
1     8     1     4       1
2     9     1     4       2
3    10     1     4       3
4    18     1     8       1
5    19     1     8       2
6    20     1     8       3
Here it first uses an rleid()-like construct to create a grouping variable. Second, it creates a sequence along this grouping variable. Finally, it keeps the cases where sq == 1, the length of the grouping variable is three or more, and the sequence along the grouping variable has values from one to three.
A base R approach works too. First, sq_cum is a running index of the 1s, built with ave:
replace(ave(df1$sq, df1$sq, FUN = seq_along), df1$sq == 0, 0)
# [1] 0 0 0 1 2 0 0 3 4 5 6 0 0 0 7 8 0 9 10 11 12
The seq_of_three flag comes from rle, marking runs of 1s whose length is at least 3:
with(rle(df1$sq), {
  rep(replace(rep(0, length(values)), lengths >= 3 & values == 1, 1), lengths)
})
# [1] 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1
And to subset the first three rows of each qualifying run:
df1[with(rle(df1$sq), {
  temp <- rep(replace(rep(0, length(values)),
                      lengths >= 3 & values == 1,
                      seq(sum(lengths >= 3 & values == 1))),
              lengths)
  ave(temp, temp, FUN = seq_along) <= 3 & temp > 0
}), ]
#    sq sq_cum seq_of_three
# 8   1      3            1
# 9   1      4            1
# 10  1      5            1
# 18  1      9            1
# 19  1     10            1
# 20  1     11            1
