R: select sequences of a certain length

I am trying to figure out how to select sequences of length 3.
Consider the following binary sequence.
sq
1 0
2 0
3 0
4 1
5 1
6 0
7 0
8 1
9 1
10 1
11 1
12 0
13 0
14 0
15 1
16 1
17 0
18 1
19 1
20 1
21 1
What I would like first is to identify runs of at least three consecutive 1s.
I tried to use:
new = sqd %>% group_by(sq) %>% mutate(sq_cum = cumsum(sq)) %>% as.data.frame()
But it sums all the 1s in the column, not just the consecutive 1s.
What I want is this vector seq_of_three.
sq sq_cum seq_of_three
1 0 0 0
2 0 0 0
3 0 0 0
4 1 1 0
5 1 2 0
6 0 0 0
7 0 0 0
8 1 3 1
9 1 4 1
10 1 5 1
11 1 6 1
12 0 0 0
13 0 0 0
14 0 0 0
15 1 7 0
16 1 8 0
17 0 0 0
18 1 9 1
19 1 10 1
20 1 11 1
21 1 12 1
Once I get that, I would like to subset the first 3 rows of each such run.
sq sq_cum seq_of_three
8 1 3 1
9 1 4 1
10 1 5 1
18 1 9 1
19 1 10 1
20 1 11 1
data
structure(list(sq = c(0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
0, 1, 1, 0, 1, 1, 1, 1), sq_cum = c(0, 0, 0, 1, 2, 0, 0, 3, 4,
5, 6, 0, 0, 0, 7, 8, 0, 9, 10, 11, 12), seq_of_three = c(0, 0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)), row.names = c(NA,
-21L), class = "data.frame")

We can use rleid to create a grouping variable, then create the binary seq_of_three column by checking the number of rows and the values of 'sq' in each run, filter the rows where 'seq_of_three' is 1, and slice the first 3 rows. If needed, remove the 'grp' column.
library(dplyr)
library(data.table)
sqd %>%
group_by(grp = rleid(sq)) %>%
mutate(seq_of_three = +(n() > 3 & all(sq == 1))) %>%
filter(seq_of_three == 1) %>%
slice(1:3) %>%
ungroup %>%
select(-grp)
# A tibble: 6 x 3
# sq sq_cum seq_of_three
# <dbl> <dbl> <int>
#1 1 3 1
#2 1 4 1
#3 1 5 1
#4 1 9 1
#5 1 10 1
#6 1 11 1
NOTE: It is not clear whether the seq_of_three column needs to be created or not. If not, the steps can be made more compact.
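If the column is indeed not needed, one possible compact sketch of the same rleid idea (assuming the sqd data from the question) could be:

```r
library(dplyr)
library(data.table)

sqd <- data.frame(sq = c(0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
                         0, 1, 1, 0, 1, 1, 1, 1))

res <- sqd %>%
  group_by(grp = rleid(sq)) %>%      # one group per run of equal values
  filter(n() > 3 & all(sq == 1)) %>% # keep only runs of more than three 1s
  slice(1:3) %>%                     # first three rows of each such run
  ungroup() %>%
  select(-grp)
```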
Another option with slice
sqd %>%
group_by(grp = rleid(sq)) %>%
mutate(seq_of_three = +(n() > 3 & all(sq == 1))) %>%
slice(head(row_number()[seq_of_three == 1], 3)) %>%
ungroup %>%
select(-grp)

A different dplyr possibility could be:
df %>%
rowid_to_column() %>%
group_by(grp = with(rle(sq), rep(seq_along(lengths), lengths))) %>%
mutate(grp_seq = seq_along(grp)) %>%
filter(sq == 1 & grp_seq %in% 1:3 & length(grp) >= 3)
rowid sq grp grp_seq
<int> <int> <int> <int>
1 8 1 4 1
2 9 1 4 2
3 10 1 4 3
4 18 1 8 1
5 19 1 8 2
6 20 1 8 3
Here it first uses an rleid()-like function to create a grouping variable. Second, it creates a sequence along this grouping variable. Finally, it keeps the cases where sq == 1, the length of the grouping variable is three or more, and the sequence along the grouping variable has values from one to three.

# number the 1s sequentially (the sq_cum column)
replace(ave(df1$sq, df1$sq, FUN = seq_along), df1$sq == 0, 0)
# [1] 0 0 0 1 2 0 0 3 4 5 6 0 0 0 7 8 0 9 10 11 12
# flag runs of at least three 1s (the seq_of_three column)
with(rle(df1$sq), {
rep(replace(rep(0, length(values)), lengths >= 3 & values == 1, 1), lengths)
})
# [1] 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1
# subset the first 3 rows of each flagged run
df1[with(rle(df1$sq), {
temp = rep(replace(rep(0, length(values)),
lengths >= 3 & values == 1,
seq(sum(lengths >= 3 & values == 1))),
lengths)
ave(temp, temp, FUN = seq_along) <= 3 & temp > 0
}),]
# sq sq_cum seq_of_three
#8 1 3 1
#9 1 4 1
#10 1 5 1
#18 1 9 1
#19 1 10 1
#20 1 11 1

Related

Number of switches in a vector by group

Given the vector
vec <- c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0)
the number of switches can simply be calculated with
cumsum(diff(vec) == 1)
However given the df
group <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
df_test <- data.frame(vec,group)
and using
df_test <- df_test %>%
group_by(group) %>%
mutate(n_switches = cumsum(diff(vec) == 1))%>%
ungroup()
the calculation fails because n_switches is 2 rows (1 per group) shorter than vec.
How can I overcome this problem and generate n_switches per group? Any help would be much appreciated!
With diff, the output is always one element shorter than the input. We can prepend an element to make the lengths match, since mutate must return a column of the same length as the data.
library(dplyr)
df_test %>%
group_by(group) %>%
mutate(n_switches = cumsum(c(FALSE, diff(vec) == 1))) %>%
ungroup
Here is another option besides diff, which applies gregexpr to locate the positions of the shifts
df_test %>%
group_by(group) %>%
mutate(n_switches = cumsum(replace(0 * vec, gregexpr("(?<=0)1", paste0(vec, collapse = ""), perl = TRUE)[[1]], 1))) %>%
ungroup()
# A tibble: 20 x 3
vec group n_switches
<dbl> <dbl> <dbl>
1 1 1 0
2 0 1 0
3 1 1 1
4 1 1 1
5 1 1 1
6 0 1 1
7 0 1 1
8 1 1 2
9 1 1 2
10 0 1 2
11 1 2 0
12 0 2 0
13 1 2 1
14 0 2 1
15 1 2 2
16 1 2 2
17 1 2 2
18 0 2 2
19 1 2 3
20 0 2 3
A similar approach, but with data.table
setDT(df_test)[
,
n_switches := cumsum(replace(0 * vec, gregexpr("(?<=0)1", paste0(vec, collapse = ""), perl = TRUE)[[1]], 1)),
group
]
> df_test
vec group n_switches
1: 1 1 0
2: 0 1 0
3: 1 1 1
4: 1 1 1
5: 1 1 1
6: 0 1 1
7: 0 1 1
8: 1 1 2
9: 1 1 2
10: 0 1 2
11: 1 2 0
12: 0 2 0
13: 1 2 1
14: 0 2 1
15: 1 2 2
16: 1 2 2
17: 1 2 2
18: 0 2 2
19: 1 2 3
20: 0 2 3
You can define a switch as a position where the current value is 1 and previous value was 0 and then using cumsum create n_switches variable.
Using dplyr you can do this as :
library(dplyr)
df_test %>%
group_by(group) %>%
mutate(n_switches = cumsum(vec == 1 & lag(vec, default = 1) == 0))
# vec group n_switches
# <dbl> <dbl> <int>
# 1 1 1 0
# 2 0 1 0
# 3 1 1 1
# 4 1 1 1
# 5 1 1 1
# 6 0 1 1
# 7 0 1 1
# 8 1 1 2
# 9 1 1 2
#10 0 1 2
#11 1 2 0
#12 0 2 0
#13 1 2 1
#14 0 2 1
#15 1 2 2
#16 1 2 2
#17 1 2 2
#18 0 2 2
#19 1 2 3
#20 0 2 3
and same logic with data.table :
library(data.table)
setDT(df_test)[, n_switches:=cumsum(vec == 1 & shift(vec, fill = 1) == 0), group]
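For completeness, the same lag logic can be sketched in base R with ave, with no packages needed (assuming the df_test from the question):

```r
vec   <- c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0)
group <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
df_test <- data.frame(vec, group)

# within each group, a switch is a 1 preceded by a 0
# (the first value is treated as preceded by 1, matching the lag default)
df_test$n_switches <- ave(df_test$vec, df_test$group,
                          FUN = function(v) cumsum(v == 1 & c(1, head(v, -1)) == 0))
```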

How to loop ifelse function through a grouped variable with dplyr

I'm trying to apply a rule within each group of IDs: once one variable equals 1 in a row, another variable should equal 1 in all subsequent rows of that group.
Essentially, here is what I am trying to do:
I have:
ID D
1 1
1 0
1 0
2 0
2 0
3 1
3 0
3 0
4 1
4 0
4 1
4 1
4 1
4 0
I want:
ID D PREV
1 1 0
1 0 1
1 0 1
2 0 0
2 0 0
3 1 0
3 0 1
3 0 1
4 1 0
4 0 1
4 1 1
4 1 1
4 1 1
4 0 1
I'm trying to use dplyr to iterate through a series of grouped rows, in each one applying an ifelse function. My code looks like this:
data$prev = 0
data <-
data %>%
group_by(id)%>%
mutate(prev = if_else(lag(prev) == 1 | lag(d) == 1, 1, 0))
But for some reason, this is not applying the ifelse function over the whole group, resulting in data that looks something like this:
ID D PREV
1 1 0
1 0 1
1 0 0
2 0 0
2 0 0
3 1 0
3 0 1
3 0 0
4 1 0
4 0 1
4 1 0
4 1 1
4 0 1
Can anyone help me with this?
What about this:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(prev = +(cumsum(c(0, D[-length(D)])) > 0)) %>%
ungroup()
#> # A tibble: 14 x 3
#> ID D prev
#> <int> <int> <int>
#> 1 1 1 0
#> 2 1 0 1
#> 3 1 0 1
#> 4 2 0 0
#> 5 2 0 0
#> 6 3 1 0
#> 7 3 0 1
#> 8 3 0 1
#> 9 4 1 0
#> 10 4 0 1
#> 11 4 1 1
#> 12 4 1 1
#> 13 4 1 1
#> 14 4 0 1
To explain what it does, let's take a simple vector.
The calculation will be the same for each group.
Let x be our vector:
x <- c(0,0,0,1,1,0,0,2,3,4)
Do the cumulative sum over x
cumsum(x)
#> [1] 0 0 0 1 2 2 2 4 7 11
You are interested only in values above zero, therefore:
cumsum(x)>0
#> [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
You don't want a logical, but a numeric. A simple + does the trick:
+(cumsum(x)>0)
#> [1] 0 0 0 1 1 1 1 1 1 1
However, you want the 1s delayed by 1. Thus, we add a zero on top of x:
+(cumsum(c(0,x))>0)
#> [1] 0 0 0 0 1 1 1 1 1 1 1
We need to keep the same length, so we remove the last value of x.
+(cumsum(c(0, x[-length(x)])) > 0)
#> [1] 0 0 0 0 1 1 1 1 1 1
And that does the trick.
We can use lag
library(dplyr)
df %>%
group_by(ID) %>%
mutate(prev = lag(cumsum(D) > 0, default = 0))
-output
# A tibble: 14 x 3
# Groups: ID [4]
# ID D prev
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 0 1
# 3 1 0 1
# 4 2 0 0
# 5 2 0 0
# 6 3 1 0
# 7 3 0 1
# 8 3 0 1
# 9 4 1 0
#10 4 0 1
#11 4 1 1
#12 4 1 1
#13 4 1 1
#14 4 0 1
data
df <- data.frame(
ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
D = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
)
You can use a newer dplyr function, dplyr::group_modify, to apply a function over groups:
df <- data.frame(
ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
D = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
)
df %>% group_by(ID) %>% group_modify(
  function(x, y){
    # if the group's first D is 1, prev is 1 everywhere except the first row
    if (x$D[1] == 1) {
      x$prev <- 1
      x$prev[1] <- 0
    } else {
      x$prev <- 0
    }
    x
  }
)
# A tibble: 14 x 3
# Groups: ID [4]
ID D prev
<dbl> <dbl> <dbl>
1 1 1 0
2 1 0 1
3 1 0 1
4 2 0 0
5 2 0 0
6 3 1 0
7 3 0 1
8 3 0 1
9 4 1 0
10 4 0 1
11 4 1 1
12 4 1 1
13 4 1 1
14 4 0 1
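As a final base-R sketch of the same shifted-cumsum idea (assuming the df defined in the answers above):

```r
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
  D  = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
)

# prev becomes 1 once any earlier row in the ID group had D == 1
df$prev <- ave(df$D, df$ID,
               FUN = function(d) +(cumsum(c(0, head(d, -1))) > 0))
```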

How can I count values in one dataframe and transfer the results to a second dataframe under the corresponding column in R?

I'm trying to extract and organize the values from the first data frame into the second. In the first you have cbn which is a factor that lists combinations of variables 1 to 31 (example dataframe shows a portion of all my data). For each of these combinations A, B, and C have values 1 or 2.
cbn A B C
1 1, 2, 3, 4 1 2 1
2 1, 2, 3, 5 1 1 1
3 1, 2, 3, 7 1 1 1
4 1, 2, 3, 8 1 2 1
5 1, 2, 3, 9 1 1 1
6 1, 2, 3, 10 1 1 1
7 1, 2, 3, 12 1 2 1
8 1, 2, 3, 13 1 2 1
9 1, 2, 3, 17 1 2 1
10 1, 2, 3, 18 1 2 1
11 1, 2, 3, 20 2 2 2
12 1, 2, 3, 22 1 2 1
13 1, 2, 3, 23 1 2 1
14 1, 2, 3, 25 1 2 1
15 1, 2, 3, 26 1 2 1
16 1, 2, 3, 28 1 2 1
17 1, 2, 3, 29 1 2 1
18 1, 2, 3, 30 1 2 1
19 1, 2, 3, 31 1 2 1
I'm trying to get all that data into a new data frame. The rows become the 31 variables, and the columns are split into values 1 and 2 for each of A, B, and C. For every row of df1, the variables used in the combination are separated, and a count of 1 is added to the corresponding variable row of df2, under the column for the letter and value indicated in df1. For example, the first line of df1 has variables 1, 2, 3, and 4, and A is 1; so in df2, 1 is added under the A1 column for each of those variable rows. I have added the first two rows of df1 to df2 below.
Variable A1 A2 B1 B2 C1 C2
1 1 2 0 1 1 2 0
2 2 2 0 1 1 2 0
3 3 2 0 1 1 2 0
4 4 1 0 0 1 1 0
5 5 1 0 1 0 1 0
6 6 0 0 0 0 0 0
7 7 0 0 0 0 0 0
8 8 0 0 0 0 0 0
9 9 0 0 0 0 0 0
10 10 0 0 0 0 0 0
11 11 0 0 0 0 0 0
12 12 0 0 0 0 0 0
13 13 0 0 0 0 0 0
14 14 0 0 0 0 0 0
15 15 0 0 0 0 0 0
16 16 0 0 0 0 0 0
... ... ... ... ... ... ... ...
31 31 0 0 0 0 0 0
How can I transfer this data into df2?
Using your first two rows of data:
df1 <- data.frame(cbn = c("1, 2, 3, 4", "1, 2, 3, 5" ),
A = c(1,1),
B = c(2,1),
C = c(1,1))
First, prefix each entry with its column letter:
for(x in c("A","B","C")){
df1[,x] <- paste0(x, df1[,x])
}
Then use separate to split the cbn column into multiple columns, followed by gather, summarize, and spread:
library(tidyverse)
df2 <- df1 %>%
separate(cbn , paste("V",1:4), sep = ",") %>%
gather("dummy", "Variable", starts_with("V")) %>%
mutate(Variable = as.numeric(Variable))%>%
select(-dummy) %>%
gather("dummy", "value", -Variable) %>%
select(-dummy) %>%
mutate(value = factor(value, levels = c("A1","A2","B1","B2","C1","C2"))) %>%
group_by(Variable, value) %>%
summarize(n = n()) %>%
spread("value", "n", fill = 0, drop = F) %>%
as.data.frame()
results in:
> df2
Variable A1 A2 B1 B2 C1 C2
1 1 2 0 1 1 2 0
2 2 2 0 1 1 2 0
3 3 2 0 1 1 2 0
4 4 1 0 0 1 1 0
5 5 1 0 1 0 1 0
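A base-R sketch of the same counting, using strsplit and table (this assumes the two-row df1 above; the 1:31 level range comes from the question):

```r
df1 <- data.frame(cbn = c("1, 2, 3, 4", "1, 2, 3, 5"),
                  A = c(1, 1), B = c(2, 1), C = c(1, 1))

vars <- lapply(strsplit(as.character(df1$cbn), ",\\s*"), as.integer)

# each combination row contributes its variables once to A<a>, B<b>, and C<c>
lab <- factor(c(paste0("A", rep(df1$A, lengths(vars))),
                paste0("B", rep(df1$B, lengths(vars))),
                paste0("C", rep(df1$C, lengths(vars)))),
              levels = c("A1", "A2", "B1", "B2", "C1", "C2"))
v   <- factor(rep(unlist(vars), 3), levels = 1:31)

df2 <- as.data.frame.matrix(table(Variable = v, lab))
```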

Identify and label repeated data in a series

I'm trying to identify cases in a dataset where a value occurs multiple times in a row; once this is picked up, a new column flags the nth occurrence with a '1'.
df<-data.frame(user=c(1,1,1,1,2,3,3,3,4,4,4,4,4,4,4,4),
week=c(1,2,3,4,1,1,2,3,1,2,3,4,5,6,7,8),
updated=c(1,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1))
In this case, users are performing a task. If the task is performed, '1' appears for that week, if not '0' appears.
Is it possible, in the event that four or more 0s occur in a row, to mutate a new column with an indicator identifying that this sequence has occurred? Something like this:
user week updated warning
1 1 1 1 0
2 1 2 0 0
3 1 3 1 0
4 1 4 1 0
5 2 1 1 0
6 3 1 1 0
7 3 2 1 0
8 3 3 1 0
9 4 1 1 0
10 4 2 1 0
11 4 3 0 0
12 4 4 0 0
13 4 5 0 0
14 4 6 0 1
15 4 7 1 0
16 4 8 1 0
Thanks!
Edit:
Apologies, and thanks to @akrun for helping with this.
Additional example below, where on the 4th consecutive missed entry (equal to '1'), the warning column is updated to flag the event, so that a trigger can run off that data.
df<-data.frame(user=c(1,1,1,1,2,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7),
week=c(1,2,3,4,1,1,2,3,1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,1,2,3,4,5,6,7,8),
missed=c(0,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,1,1,0,1))
user week missed warning
1 1 1 0 0
2 1 2 1 0
3 1 3 0 0
4 1 4 0 0
5 2 1 0 0
6 3 1 0 0
7 3 2 0 0
8 3 3 0 0
9 4 1 0 0
10 4 2 0 0
11 4 3 1 0
12 4 4 1 0
13 4 5 1 0
14 4 6 1 1
15 4 7 0 0
16 4 8 0 0
17 5 1 0 0
18 5 2 1 0
19 5 3 0 0
20 5 4 1 0
21 5 5 0 0
22 5 6 0 0
23 5 7 0 0
24 5 8 0 0
25 6 1 0 0
26 6 2 1 0
27 6 3 1 0
28 6 4 1 0
29 6 5 1 1
30 6 6 1 0
31 6 7 0 0
32 7 1 0 0
33 7 2 0 0
34 7 3 0 0
35 7 4 0 0
36 7 5 1 0
37 7 6 1 0
38 7 7 0 0
39 7 8 1 0
An option would be to use rle to create the warning. Grouped by 'user', create 'warning' by checking the run-length encoding (rle) of 'updated': it returns the 'values' and 'lengths' of the adjacent runs as a list, from which we create a logical condition where the value is 0 and the length is greater than or equal to 4.
library(dplyr)
library(data.table)
df %>%
group_by(user) %>%
mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths))) %>%
group_by(grp = rleid(warning), add = TRUE) %>%
mutate(warning = if(all(warning)) rep(c(0, 1), c(n()-1, 1)) else 0) %>%
ungroup %>%
select(-grp)
# A tibble: 16 x 4
# user week updated warning
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 0
# 2 1 2 0 0
# 3 1 3 1 0
# 4 1 4 1 0
# 5 2 1 1 0
# 6 3 1 1 0
# 7 3 2 1 0
# 8 3 3 1 0
# 9 4 1 1 0
#10 4 2 1 0
#11 4 3 0 0
#12 4 4 0 0
#13 4 5 0 0
#14 4 6 0 1
#15 4 7 1 0
#16 4 8 1 0
If we instead need to flag the whole group whenever it contains 4 or more consecutive 0s, then
df %>%
group_by(user) %>%
mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths)),
warning = as.integer(any(warning)))
# A tibble: 16 x 4
# Groups: user [4]
# user week updated warning
# <dbl> <dbl> <dbl> <int>
# 1 1 1 1 0
# 2 1 2 0 0
# 3 1 3 1 0
# 4 1 4 1 0
# 5 2 1 1 0
# 6 3 1 1 0
# 7 3 2 1 0
# 8 3 3 1 0
# 9 4 1 1 1
#10 4 2 1 1
#11 4 3 0 1
#12 4 4 0 1
#13 4 5 0 1
#14 4 6 0 1
#15 4 7 1 1
#16 4 8 1 1
I followed a different approach. I numbered sequentially the cases where updated was 0, for each user and rleid(updated). If there's a 4, that means there are 4 consecutive homeworks not done. The warning is thus created where the new vector is equal to 4.
library(data.table)
df[,
warning := {id <- 1:.N;
warning <- as.numeric(id == 4)},
by = .(user,
rleid(updated))][,
warning := ifelse(warning == 1 & updated == 0, 1, 0)][is.na(warning),
warning := 0]
What has been done there:
warning := assigns the result of the expression between the {} to warning.
Now, inside the sequence:
id <- 1:.N creates a temporary variable id with consecutive numbers for each user and run-length group of updated values.
warning <- as.numeric(id == 4) creates a temporary variable with 1 where id equals 4 and zero otherwise.
The by = .(user, rleid(updated)) groups by both user and the run-length values of updated. Of course there are also run-length groups for updated == 1, so we get rid of those with the ifelse clause. The final [is.na(warning), warning := 0] (notice the chaining) just gets rid of the NA values in the resulting variable.
Data used
> dput(df2)
structure(list(user = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 4,
4, 4, 4, 5, 5, 5, 5, 5), week = c(1, 2, 3, 4, 1, 1, 2, 3, 1,
2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5), updated = c(1, 0, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)), row.names = c(NA,
-21L), class = c("data.table", "data.frame"))
Speed comparison
I just compared with @akrun's answer:
set.seed(1)
df <- data.table(user = sample(1:10, 100, TRUE), updated = sample(c(1, 0), 100, TRUE), key = "user")
df[, week := 1:.N, by = user]
akrun <- function(df4){
df4 %>%
group_by(user) %>%
mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths))) %>%
group_by(grp = rleid(warning), add = TRUE) %>%
mutate(warning = if(all(warning)) rep(c(0, 1), c(n()-1, 1)) else 0) %>%
ungroup %>%
select(-grp)
}
pavo <- function(df4){
df4[, warning := {id <- 1:.N; warning <- as.numeric(id == 4)}, by = .(user, rleid(updated))][, warning := ifelse(warning == 1 & updated == 0, 1, 0)][is.na(warning), warning := 0]
}
library(microbenchmark)
microbenchmark(akrun(df), pavo(df), times = 100)
Unit: microseconds
expr min lq mean median uq max neval
akrun(df) 1920.278 2144.049 2405.0332 2245.1735 2308.0145 6901.939 100
pavo(df) 823.193 877.061 978.7166 928.0695 991.5365 4905.450 100

Cumulative sum of a subset of data based on condition

I have what I think is a simple R task, but I'm having trouble. Basically, I need to do a cumulative sum of values based on the criteria of another column.
Here's the catch: it should accumulate the values of the previous rows until it hits another condition. In the example I'm providing, it accumulates the values of the duration column up to each 1 or 2 in the condition column. The example is shown below.
duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)
df <- data.frame(duration,condition,accum_sum)
df
row duration condition accum_sum
1 2 0 0
2 3 1 5
3 2 0 0
4 4 0 0
5 5 0 0
6 10 0 0
7 2 0 0
8 9 2 32
9 7 0 0
10 5 0 0
11 8 0 0
12 9 0 0
13 10 1 39
14 12 0 0
15 4 0 0
16 5 0 0
17 6 2 27
Using data.table:
setDT(df)
df[, accum_sum := cumsum(duration), by = rev(cumsum(rev(condition)))]
df[condition == 0, accum_sum := 0]
# duration condition accum_sum
# 1: 2 0 0
# 2: 3 1 5
# 3: 2 0 0
# 4: 4 0 0
# 5: 5 0 0
# 6: 10 0 0
# 7: 2 0 0
# 8: 9 2 32
# 9: 7 0 0
#10: 5 0 0
#11: 8 0 0
#12: 9 0 0
#13: 10 1 39
#14: 12 0 0
#15: 4 0 0
#16: 5 0 0
#17: 6 2 27
We create runs by filling the zeros backwards with rev(cumsum(rev(condition))) and then group by this "filled" condition.
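To see the grouping vector on the sample data (just an illustration of the backward fill):

```r
condition <- c(0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 2)

# reversing, cumulating, and reversing again fills every zero with the
# value of the next nonzero condition, so each run ending in a 1 or 2
# shares one group label
grp <- rev(cumsum(rev(condition)))
grp
#> [1] 6 6 5 5 5 5 5 5 3 3 3 3 3 2 2 2 2
```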
#cumulative sum
df$cum_sum <- ave(df$duration, c(0, cumsum(df$condition[-nrow(df)])), FUN = cumsum)
#replace all zero condition row with zero value in cumulative sum column
df$cum_sum <- ifelse(df$condition == 0, 0, df$cum_sum)
which gives
duration condition cum_sum
1 2 0 0
2 3 1 5
3 2 0 0
4 4 0 0
5 5 0 0
6 10 0 0
7 2 0 0
8 9 2 32
9 7 0 0
10 5 0 0
11 8 0 0
12 9 0 0
13 10 1 39
14 12 0 0
15 4 0 0
16 5 0 0
17 6 2 27
Sample data:
df <- structure(list(duration = c(2, 3, 2, 4, 5, 10, 2, 9, 7, 5, 8,
9, 10, 12, 4, 5, 6), condition = c(0, 1, 0, 0, 0, 0, 0, 2, 0,
0, 0, 0, 1, 0, 0, 0, 2), cum_sum = c(0, 5, 0, 0, 0, 0, 0, 32,
0, 0, 0, 0, 39, 0, 0, 0, 27)), .Names = c("duration", "condition",
"cum_sum"), row.names = c(NA, -17L), class = "data.frame")
Using dplyr, we can use cumsum() on condition to keep track of how many conditions have been seen, then sum within those subsets:
library(dplyr)
df %>%
mutate(condition_group = cumsum(lag(condition, default = 0) != 0) + 1) %>%
group_by(condition_group) %>%
mutate(accum_sum = ifelse(condition != 0,
sum(duration),
0))
Output:
# A tibble: 17 x 4
# Groups: condition_group [4]
duration condition accum_sum condition_group
<dbl> <dbl> <dbl> <dbl>
1 2 0 0 1
2 3 1 5 1
3 2 0 0 2
4 4 0 0 2
5 5 0 0 2
6 10 0 0 2
7 2 0 0 2
8 9 2 32 2
9 7 0 0 3
10 5 0 0 3
11 8 0 0 3
12 9 0 0 3
13 10 1 39 3
14 12 0 0 4
15 4 0 0 4
16 5 0 0 4
17 6 2 27 4
If you shift condition by 1, you can simply use tapply.
duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)
df <- data.frame(duration,condition,accum_sum)
df$want <- unlist(tapply(df$duration,
INDEX = cumsum(c(df$condition[1], head(df$condition, -1))),
cumsum)) * ifelse(df$condition == 0, 0, 1)
df
