I need to compute letter frequencies for a large list of words. For each position in the word (first, second, ...), I need to count how many times each letter (a-z) appears across the list and then tabulate the counts by position.
For example, if my word list is: words <- c("swims", "seems", "gills", "draws", "which", "water")
then the result table should look like this:
letter  first position  second position  third position  fourth position  fifth position
a       0               1                1               0                0
b       0               0                0               0                0
c       0               0                0               1                0
d       1               0                0               0                0
e       0               1                1               1                0
f       0               0                0               0                0
...continued until z
All words have the same length (5).
What I have so far is:
library(dplyr)
alphabet <- letters[1:26]
words.df <- data.frame("Words" = words)
# the column is named "Words" (capital W), so refer to it by that name
words.df <- words.df %>% mutate("First_place" = substr(Words, 1, 1))
words.df <- words.df %>% mutate("Second_place" = substr(Words, 2, 2))
words.df <- words.df %>% mutate("Third_place" = substr(Words, 3, 3))
words.df <- words.df %>% mutate("Fourth_place" = substr(Words, 4, 4))
words.df <- words.df %>% mutate("Fifth_place" = substr(Words, 5, 5))
x1 <- words.df$First_place
x1 <- table(factor(x1,alphabet))
x2 <- words.df$Second_place
x2 <- table(factor(x2,alphabet))
x3 <- words.df$Third_place
x3 <- table(factor(x3,alphabet))
x4 <- words.df$Fourth_place
x4 <- table(factor(x4,alphabet))
x5 <- words.df$Fifth_place
x5 <- table(factor(x5,alphabet))
My code is not efficient and gives a separate table for each letter position. All help will be appreciated.
In base R, use table():
table(let = unlist(strsplit(words,'')),pos = sequence(nchar(words)))
pos
let 1 2 3 4 5
a 0 1 1 0 0
c 0 0 0 1 0
d 1 0 0 0 0
e 0 1 1 1 0
g 1 0 0 0 0
h 0 1 0 0 1
i 0 1 2 0 0
l 0 0 1 1 0
m 0 0 0 2 0
r 0 1 0 0 1
s 2 0 0 0 4
t 0 0 1 0 0
w 2 1 0 1 0
Note that if you need all the values from a-z then use
table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
Also to get a dataframe you could do:
d <- table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
cbind(letters = rownames(d), as.data.frame.matrix(d))
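To see why the one-liner works, it may help to look at the two vectors that table() cross-tabulates. The sketch below uses the six example words from the question; the intermediate names chars, pos, and tab are only illustrative.

```r
words <- c("swims", "seems", "gills", "draws", "which", "water")

# every letter of every word, in order: "s" "w" "i" "m" "s" "s" "e" ...
chars <- unlist(strsplit(words, ""))

# the position of each letter within its word: 1 2 3 4 5 1 2 3 4 5 ...
pos <- sequence(nchar(words))

# both vectors have one entry per letter, so they can be cross-tabulated;
# factor(..., levels = letters) keeps rows for letters that never occur
tab <- table(factor(chars, levels = letters), pos)

tab["s", ]  # counts of "s" at positions 1 to 5
```

Because sequence(nchar(words)) restarts at 1 for every word, the same call should also handle words of different lengths.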
Here is a tidyverse solution using dplyr, purrr, and tidyr:
library(dplyr)
library(purrr)
library(tidyr)
strsplit(words.df$Words, "") %>%
  map_dfr(~setNames(.x, seq_along(.x))) %>%
  pivot_longer(everything(),
               values_drop_na = TRUE,
               names_to = "pos",
               values_to = "letter") %>%
  count(pos, letter) %>%
  pivot_wider(names_from = pos,
              names_glue = "pos{pos}",
              id_cols = letter,
              values_from = n,
              values_fill = 0L)
Output (from a longer word list with words of varying length, not the six-word example above):
letter pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 pos11
1 a 65 127 88 38 28 17 14 5 3 0 0
2 b 58 4 7 9 2 4 2 0 1 0 0
3 c 83 14 45 37 20 19 8 3 2 0 0
4 C 2 0 0 0 0 0 0 0 0 0 0
5 d 43 8 33 47 21 22 9 3 1 1 0
6 e 45 156 81 132 114 69 48 23 14 2 2
7 f 54 11 18 10 5 2 1 0 0 0 0
8 g 23 7 27 21 15 8 7 1 0 0 0
9 h 38 56 6 28 21 10 3 3 1 1 0
10 i 25 106 51 58 38 28 8 4 1 0 0
11 j 6 0 2 2 0 0 0 0 0 0 0
12 k 9 1 6 22 12 0 0 0 0 0 0
13 l 45 41 54 54 36 9 7 6 0 2 0
14 m 45 8 31 19 8 8 4 2 0 0 0
15 n 23 42 75 53 34 41 16 16 4 2 0
16 o 28 167 76 41 38 9 11 2 1 0 0
17 p 72 20 34 30 8 3 1 1 1 0 0
18 q 7 2 1 0 0 0 0 0 0 0 0
19 r 46 74 92 59 56 45 12 9 1 1 0
20 s 119 8 67 35 31 22 18 4 1 0 0
21 t 65 30 73 83 57 42 31 9 6 3 1
22 u 12 66 39 36 20 7 7 2 0 0 0
23 v 8 7 20 12 5 5 1 0 0 0 0
24 w 53 8 13 10 2 3 0 1 0 0 0
25 y 6 4 16 15 17 15 10 5 6 1 1
26 x 0 12 5 0 0 0 0 0 0 0 0
27 z 0 0 1 0 0 0 1 1 0 0 0
I'm trying to perform calculations on different elements in a matrix in R. My Matrix is 18x18 and I would like to get e.g. the mean of each 6x6 array (which makes 9 arrays in total). My desired arrays would be:
A1 <- df[1:6,1:6]
A2 <- df[1:6,7:12]
A3 <- df[1:6,13:18]
B1 <- df[7:12,1:6]
B2 <- df[7:12,7:12]
B3 <- df[7:12,13:18]
C1 <- df[13:18,1:6]
C2 <- df[13:18,7:12]
C3 <- df[13:18,13:18]
The matrix looks like this:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
5 14 17 9 10 8 4 10 12 18 9 13 14 NA NA 19 15 10 10
10 30 32 23 27 17 28 25 12 28 29 28 26 19 25 34 24 11 17
15 16 16 16 9 17 27 17 16 30 13 18 13 15 13 19 8 7 9
20 15 12 18 18 18 6 4 6 9 11 10 10 13 11 8 10 15 15
25 7 13 21 7 3 5 2 5 5 4 3 2 3 5 2 1 5 6
30 5 9 1 7 7 4 4 12 8 9 2 0 5 2 1 0 2 6
35 3 0 2 0 0 4 4 7 4 4 5 2 0 0 1 0 0 0
40 0 4 0 0 0 1 3 9 10 10 1 0 0 0 1 0 1 0
45 0 0 0 0 0 3 10 9 17 9 1 0 0 0 0 0 0 0
50 0 0 2 0 0 0 2 8 20 0 0 0 0 0 1 0 0 0
55 0 0 0 0 0 0 7 3 21 0 0 0 0 0 0 0 0 0
60 0 0 0 0 3 4 10 2 2 0 0 1 0 0 0 0 0 0
65 0 0 0 0 0 4 8 4 8 11 0 0 0 0 0 0 0 0
70 0 0 0 0 0 6 2 5 14 0 0 0 0 0 0 0 0 0
75 0 0 0 0 0 4 0 5 9 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 4 4 0 4 2 0 0 0 0 0 0 0 0
85 0 0 0 0 0 0 0 4 1 1 0 0 0 0 0 0 0 0
90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Is there a clean way to solve this issue with a loop?
Thanks a lot in advance,
Paul
Given your matrix, e.g.
x <- matrix(1:(18*18), ncol=18)
Try the following, for example with 6x6 submatrices:
step <- 6
nx <- nrow(x)
if((nx %% step) != 0) stop("nx %% step should be 0")
indI <- seq(1, nx, by=step)
nbStep <- length(indI)
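for (Row in 1:nbStep) {
  for (Col in 1:nbStep) {
    # letter = row block, number = column block (A2 is rows 1:6, columns 7:12)
    name <- paste0(LETTERS[Row], Col)
    theRow <- indI[Row]:(indI[Row] + step - 1)
    theCol <- indI[Col]:(indI[Col] + step - 1)
    # swap sum() for mean(..., na.rm = TRUE) to get block means
    assign(name, sum(x[theRow, theCol]))
  }
}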
You'll get your results in A1, A2, A3...
This is the idea. Adapt the code for non-square matrices, different submatrix sizes, and so on.
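As a variation on the loop above, the results can be collected in a single 3x3 matrix instead of nine separate variables created with assign(). This is only a sketch under the same assumptions (a square matrix whose size is a multiple of step); the names blocks and res are illustrative, and mean(..., na.rm = TRUE) is used because the question asks for means and its data contains NA.

```r
x <- matrix(1:(18 * 18), ncol = 18)
step <- 6
blocks <- seq(1, nrow(x), by = step)  # first index of each block: 1, 7, 13

# one cell per block; rows are labelled A, B, C and columns 1, 2, 3
res <- matrix(NA_real_, nrow = length(blocks), ncol = length(blocks),
              dimnames = list(LETTERS[seq_along(blocks)],
                              as.character(seq_along(blocks))))

for (i in seq_along(blocks)) {
  for (j in seq_along(blocks)) {
    rows <- blocks[i] + 0:(step - 1)
    cols <- blocks[j] + 0:(step - 1)
    res[i, j] <- mean(x[rows, cols], na.rm = TRUE)
  }
}

res["A", "2"]  # mean of x[1:6, 7:12]
```

Keeping the results in one object makes it easier to loop over them afterwards than looking up variables named A1, A2, ... by name.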
Here's one way:
# generate fake data
set.seed(47)
n = 18
m = matrix(rpois(n * n, lambda = 5), nrow = n)
# generate starting indices
n_array = 6
start_i = seq(1, n, by = n_array)
arr_starts = expand.grid(row = start_i, col = start_i)
# calculate sums
with(arr_starts, mapply(function(x, y) sum(m[(x + 1:n_array) - 1, (y + 1:n_array) - 1]), row, col))
# [1] 158 188 176 201 188 201 197 206 204
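A loop-free alternative is to label every cell with the block it belongs to and let tapply() do the aggregation. This is a sketch that reuses the fake data from above; row_block, col_block, and block_sums are illustrative names, and the integer division assumes equally sized blocks.

```r
set.seed(47)
n <- 18
m <- matrix(rpois(n * n, lambda = 5), nrow = n)
n_array <- 6

# block index (1, 2 or 3) of every cell, by row and by column
row_block <- (row(m) - 1) %/% n_array + 1
col_block <- (col(m) - 1) %/% n_array + 1

# aggregate all cells that share the same (row block, column block) pair
block_sums <- tapply(m, list(row_block, col_block), sum)
block_sums
```

The result is a 3x3 matrix; as.vector(block_sums) should list the blocks in the same order as the mapply() call, since expand.grid() varies the row index fastest. Replacing sum with a function such as function(v) mean(v, na.rm = TRUE) would give block means.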
Hi stack overflow community,
I'm relatively new to R (9 months) and this is my first Stack Overflow question with a reprex, so I would really appreciate any help. I mainly use the tidyverse, although I am open to base R solutions.
Problem:
I have ~21,000 rows of symptom data with >10 variables per day. I would like to classify "exacerbations" of a disease (in this case chest infections in lung disease) by using rules that define the start and end of an episode, so that I can later calculate the duration of episodes, the type of episode (this depends on the combination of symptoms) and the treatment received. As with any data set involving patients, there are missing values. I have imputed from the most recent day if fewer than 2 days of data are missing.
The below code is a simplified, made up example involving only 1 symptom.
Exacerbation Rule:
Start of exacerbation = 2 days of worse symptoms (>= 3)
Resolution of exacerbation = 5 days with normal breathing (<=2)
I would ideally want to be able to identify all days when an exacerbation is happening too.
Here is the data:
#load packages
library(tidyverse)
#load data
id <- "A"
day <- c(1:50)
symptom <- c(2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2)
df <- data.frame(id,day,symptom)
#Data Dictionary
#Symptom: 1 = Better than usual, 2 = Normal/usual, 3 = Worse than usual, 4 = Much worse than usual
What I have tried:
I have tried to approach this by using a combination of lag() and lead() with conditional statements case_when() and ifelse().
df %>%
mutate_at(vars("symptom"), #used for more variables within vars() argument
.funs = list(lead1 = ~ lead(., n = 1),
lead2 = ~ lead(., n = 2),
lead3 = ~ lead(., n = 3),
lead4 = ~ lead(., n = 4),
lead5 = ~ lead(., n = 5),
lag1 = ~ lag(., n = 1),
lag2 = ~ lag(., n = 2),
lag3 = ~ lag(., n = 3))) %>%
mutate(start = case_when(symptom <= 2 ~ 0,
symptom >= 3 ~
ifelse(symptom >= lag2 & symptom <= lag1,1,0)),
end = case_when(symptom >=3 ~
ifelse(lead1 <=2 &
lead2 <=2 &
lead3 <=2 &
lead4 <=2 &
lead5 <=2,1,0)))
My main issue is that of complexity. As I build in more symptoms and rules I have to refer to different variables that have ifelse()/case_when() statements within it. I am sure there is a more elegant solution to my problem.
The other issue is that during an "exacerbation" the exacerbation_start variable should only be used at the start and not during the episode. Similarly for exacerbation_end it would only be applicable when an exacerbation is already happening. I have tried using ifelse() statements to refer to when an exacerbation is happening but not been able to get this to work and obey the rule I desire.
The output I would like is:
id day symptom start end exacerbation
1 A 1 2 0 0 0
2 A 2 2 0 0 0
3 A 3 2 0 0 0
4 A 4 2 0 0 0
5 A 5 2 0 0 0
6 A 6 2 0 0 0
7 A 7 2 0 0 0
8 A 8 2 0 0 0
9 A 9 2 0 0 0
10 A 10 2 0 0 0
11 A 11 2 0 0 0
12 A 12 3 0 0 0
13 A 13 2 0 0 0
14 A 14 2 0 0 0
15 A 15 2 0 0 0
16 A 16 2 0 0 0
17 A 17 NA 0 0 0
18 A 18 NA 0 0 0
19 A 19 NA 0 0 0
20 A 20 2 0 0 0
21 A 21 2 0 0 0
22 A 22 2 0 0 0
23 A 23 3 0 0 0
24 A 24 3 1 0 1
25 A 25 3 0 0 1
26 A 26 4 0 0 1
27 A 27 4 0 0 1
28 A 28 3 0 0 1
29 A 29 3 0 0 1
30 A 30 2 0 0 1
31 A 31 3 0 0 1
32 A 32 2 0 0 1
33 A 33 2 0 0 1
34 A 34 3 0 0 1
35 A 35 3 0 1 1
36 A 36 2 0 0 0
37 A 37 2 0 0 0
38 A 38 2 0 0 0
39 A 39 2 0 0 0
40 A 40 2 0 0 0
41 A 41 2 0 0 0
42 A 42 3 0 0 0
43 A 43 2 0 0 0
44 A 44 2 0 0 0
45 A 45 2 0 0 0
46 A 46 2 0 0 0
47 A 47 2 0 0 0
48 A 48 3 0 0 0
49 A 49 2 0 0 0
50 A 50 2 0 0 0
I look forward to your replies!
EDIT
I have added 50 more rows of data to simulate multiple exacerbations and the issue with right censoring and NAs. I have also included a second participant "B" to see if this is a reason for issues.
id <- rep(c("A", "B"), each = 50)
day <- c(1:50,1:50)
symptom <- c(2,3,3,3,3,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2, 2,2,2,2,2,2,3,2,3,3,2,3,2,3,2,2,2,2,2,2,3,3,3,3,NA,NA,NA,2,2,2,3,2,2,2,2,2,3,2,2,3,NA,NA,NA,3,3,3,3,3,3,2)
df <- data.frame(id,day,symptom)
id day symptom start end exacerbation censor
1 A 1 2 0 0 0 0
2 A 2 3 1 0 1 0
3 A 3 3 0 0 1 0
4 A 4 3 0 0 1 0
5 A 5 3 0 1 1 0
6 A 6 2 0 0 0 0
7 A 7 2 0 0 0 0
8 A 8 2 0 0 0 0
9 A 9 2 0 0 0 0
10 A 10 2 0 0 0 0
11 A 11 2 0 0 0 0
12 A 12 3 0 0 0 0
13 A 13 2 0 0 0 0
14 A 14 2 0 0 0 0
15 A 15 2 0 0 0 0
16 A 16 2 0 0 0 0
17 A 17 NA 0 0 0 0
18 A 18 NA 0 0 0 0
19 A 19 NA 0 0 0 0
20 A 20 2 0 0 0 0
21 A 21 2 0 0 0 0
22 A 22 2 0 0 0 0
23 A 23 3 1 0 1 0
24 A 24 3 0 0 1 0
25 A 25 3 0 0 1 0
26 A 26 4 0 0 1 0
27 A 27 4 0 0 1 0
28 A 28 3 0 0 1 0
29 A 29 3 0 0 1 0
30 A 30 2 0 0 1 0
31 A 31 3 0 0 1 0
32 A 32 2 0 0 1 0
33 A 33 2 0 0 1 0
34 A 34 3 0 0 1 0
35 A 35 3 0 0 1 0
36 A 36 2 0 0 1 0
37 A 37 2 0 0 1 0
38 A 38 2 0 0 1 0
39 A 39 2 0 0 1 0
40 A 40 2 0 0 1 0
41 A 41 2 0 1 1 0
42 A 42 3 0 0 0 0
43 A 43 2 0 0 0 0
44 A 44 2 0 0 0 0
45 A 45 2 0 0 0 0
46 A 46 2 0 0 0 0
47 A 47 2 0 0 0 0
48 A 48 3 0 0 0 0
49 A 49 2 0 0 0 0
50 A 50 2 0 0 0 0
51 B 1 2 0 0 0 0
52 B 2 2 0 0 0 0
53 B 3 2 0 0 0 0
54 B 4 2 0 0 0 0
55 B 5 2 0 0 0 0
56 B 6 2 0 0 0 0
57 B 7 3 0 0 0 0
58 B 8 2 0 0 0 0
59 B 9 3 0 0 0 0
60 B 10 3 1 0 1 0
61 B 11 2 0 0 1 0
62 B 12 3 0 0 1 0
63 B 13 2 0 0 1 0
64 B 14 3 0 0 1 0
65 B 15 2 0 0 1 0
66 B 16 2 0 0 1 0
67 B 17 2 0 0 1 0
68 B 18 2 0 0 1 0
69 B 19 2 0 1 1 0
70 B 20 2 0 0 0 0
71 B 21 3 1 0 1 0
72 B 22 3 0 0 1 0
73 B 23 3 0 0 1 0
74 B 24 3 0 0 1 0
75 B 25 NA 0 0 0 1
76 B 26 NA 0 0 0 1
77 B 27 NA 0 0 0 1
78 B 28 2 0 0 0 1
79 B 29 2 0 0 0 1
80 B 30 2 0 0 0 1
81 B 31 3 0 0 0 1
82 B 32 2 0 0 0 1
83 B 33 2 0 0 0 1
84 B 34 2 0 0 0 1
85 B 35 2 0 0 0 1
86 B 36 2 0 0 0 1
87 B 37 3 0 0 0 0
88 B 38 2 0 0 0 0
89 B 39 2 0 0 0 0
90 B 40 3 0 0 0 0
91 B 41 NA 0 0 0 0
92 B 42 NA 0 0 0 0
93 B 43 NA 0 0 0 0
94 B 44 3 1 0 1 0
95 B 45 3 0 0 1 0
96 B 46 3 0 0 1 0
97 B 47 3 0 0 1 0
98 B 48 3 0 0 1 0
99 B 49 3 0 0 1 0
100 B 50 2 0 0 1 0
Here is an attempt at a more elegant and scalable way to write your algorithm:
First, you do not have to compute the lead and lag calls before you can use case_when. Of note, I find it good practice to explicitly write the TRUE option of case_when. Here is some code.
df2=df %>%
mutate(
exacerbation_start = case_when(
is.na(symptom) ~ NA_real_,
symptom <= 2 ~ 0,
symptom >= 3 & symptom >= lag(symptom, n=2) & symptom <= lag(symptom, n=1) ~ 1,
TRUE ~ 0
),
exacerbation_end = case_when(
symptom >=3 ~ ifelse(lead(symptom, n=1) <=2 &
lead(symptom, n=2) <=2 & lead(symptom, n=3) <=2 &
lead(symptom, n=4) <=2 & lead(symptom, n=5) <=2,
1,0),
TRUE ~ NA_real_
)
)
all.equal(df1, df2) # TRUE (df1 is not defined here; presumably it holds the result of the question's code)
Alternatively, if your algorithm is the same for all symptoms, you might want to use custom functions:
get_exacerbation_start = function(x){
case_when(
is.na(x) ~ NA_real_,
x <= 2 ~ 0,
x >= 3 & x >= lag(x, n=2) & x <= lag(x, n=1) ~ 1,
TRUE ~ 0
)
}
get_exacerbation_end = function(x){
case_when(
x >=3 ~ ifelse(x >=3 & lead(x, n=1) <=2 &
lead(x, n=2) <=2 & lead(x, n=3) <=2 &
lead(x, n=4) <=2 & lead(x, n=5) <=2,
1,0),
TRUE ~ NA_real_
)
}
df3=df %>%
mutate(
exacerbation_start = get_exacerbation_start(symptom),
exacerbation_end = get_exacerbation_end(symptom)
)
all.equal(df1,df3) #also TRUE
This latter way might be even more powerful with some mutate_at calls.
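As a rough illustration of that idea, the sketch below applies get_exacerbation_start() to two columns at once with across(), which is the current replacement for mutate_at() in dplyr 1.0 and later. The second symptom column cough and the six-day data frame are made up for the example; only the function body comes from the code above.

```r
library(dplyr)

get_exacerbation_start <- function(x) {
  case_when(
    is.na(x) ~ NA_real_,
    x <= 2 ~ 0,
    x >= 3 & x >= lag(x, n = 2) & x <= lag(x, n = 1) ~ 1,
    TRUE ~ 0
  )
}

# hypothetical data: "cough" is an invented second symptom
toy <- data.frame(
  day     = 1:6,
  symptom = c(2, 3, 3, 2, 2, 2),
  cough   = c(2, 2, 3, 3, 4, 2)
)

# one new "<column>_start" column per symptom column
out <- toy %>%
  mutate(across(c(symptom, cough), get_exacerbation_start,
                .names = "{.col}_start"))
out
```

With real data this would normally follow a group_by(id), so that lag() does not look across participants.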
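EDIT: After seeing your edit, here is an attempt to get the exacerbation period. The code is quite ugly in my opinion, and I'm not sure row_number() was meant to be used this way. Note that which(...)[1] picks only the first start and the first end, so this flags a single episode and does not group by id.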
df_final=df %>%
transmute(
id,day,symptom,
start = get_exacerbation_start(symptom),
end = get_exacerbation_end(symptom),
exacerbation = row_number()>=which(start==1)[1] & row_number()<=which(end==1)[1]
)
I may come back with a less convoluted approach, but try this:
library(dplyr)
library(tidyr)
df %>%
group_by(id,
idx = with(
rle(
case_when(symptom <= 2 ~ 'normal',
symptom >= 3 ~ 'worse',
TRUE ~ symptom %>% as.character)),
rep(seq_along(lengths), lengths)
)
) %>%
mutate(
trajectory = case_when(cumsum(symptom <= 2) == 5 ~ 2, cumsum(symptom >= 3) == 2 ~ 1)
) %>%
group_by(id) %>% fill(trajectory) %>%
mutate(
trajectory = replace_na(trajectory, 0),
start = +(trajectory == 1 & lag(trajectory) == 2),
end = +(trajectory == 2 & lag(trajectory) == 1),
exacerbation = +(trajectory == 1 | start == 1 | end == 1)
) %>%
select(-idx, -trajectory) %>% as.data.frame
Output:
id day symptom start end exacerbation
1 A 1 2 0 0 0
2 A 2 2 0 0 0
3 A 3 2 0 0 0
4 A 4 2 0 0 0
5 A 5 2 0 0 0
6 A 6 2 0 0 0
7 A 7 2 0 0 0
8 A 8 2 0 0 0
9 A 9 2 0 0 0
10 A 10 2 0 0 0
11 A 11 2 0 0 0
12 A 12 3 0 0 0
13 A 13 2 0 0 0
14 A 14 2 0 0 0
15 A 15 2 0 0 0
16 A 16 2 0 0 0
17 A 17 NA 0 0 0
18 A 18 NA 0 0 0
19 A 19 NA 0 0 0
20 A 20 2 0 0 0
21 A 21 2 0 0 0
22 A 22 2 0 0 0
23 A 23 3 0 0 0
24 A 24 3 1 0 1
25 A 25 3 0 0 1
26 A 26 4 0 0 1
27 A 27 4 0 0 1
28 A 28 3 0 0 1
29 A 29 3 0 0 1
30 A 30 2 0 0 1
31 A 31 3 0 0 1
32 A 32 2 0 0 1
33 A 33 2 0 0 1
34 A 34 3 0 0 1
35 A 35 3 0 0 1
36 A 36 2 0 0 1
37 A 37 2 0 0 1
38 A 38 2 0 0 1
39 A 39 2 0 0 1
40 A 40 2 0 1 1
41 A 41 2 0 0 0
42 A 42 3 0 0 0
43 A 43 2 0 0 0
44 A 44 2 0 0 0
45 A 45 2 0 0 0
46 A 46 2 0 0 0
47 A 47 2 0 0 0
48 A 48 3 0 0 0
49 A 49 2 0 0 0
50 A 50 2 0 0 0
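For comparison, the same rule can be written as an explicit state machine in base R, which may be easier to extend when more rules are added. This is only a sketch: flag_exacerbation() and its arguments are illustrative names, and it assumes that an episode is flagged from the second consecutive worse day, that it ends on the fifth consecutive normal day (inclusive), and that an NA breaks any run in progress.

```r
flag_exacerbation <- function(symptom, n_start = 2, n_end = 5) {
  in_episode <- FALSE
  worse_run <- 0   # consecutive days with symptom >= 3
  normal_run <- 0  # consecutive days with symptom <= 2
  out <- integer(length(symptom))
  for (i in seq_along(symptom)) {
    s <- symptom[i]
    if (is.na(s)) {            # assumption: a missing day resets both runs
      worse_run <- 0
      normal_run <- 0
    } else if (s >= 3) {
      worse_run <- worse_run + 1
      normal_run <- 0
    } else {
      normal_run <- normal_run + 1
      worse_run <- 0
    }
    if (!in_episode && worse_run >= n_start) in_episode <- TRUE
    out[i] <- as.integer(in_episode)
    # the day that completes the normal run still counts as part of the episode
    if (in_episode && normal_run >= n_end) in_episode <- FALSE
  }
  out
}

# the single-participant example from the question
id <- "A"
day <- 1:50
symptom <- c(2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,
             3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2)
df <- data.frame(id, day, symptom)

# ave() applies the function separately within each id
df$exacerbation <- ave(df$symptom, df$id, FUN = flag_exacerbation)
df$day[df$exacerbation == 1]
```

On this data the flagged days are 24 to 40, which agrees with the exacerbation column in the output above.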
I have what I think is a simple R task, but I'm having trouble. Basically, I need to compute a cumulative sum of values based on the criteria in another column.
Here's the catch: the sum should accumulate over the previous rows until it hits another condition. In the example below, each row where condition is 1 or 2 gets the sum of duration over that row and all rows since the previous non-zero condition; every other row gets 0.
duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)
df <- data.frame(duration,condition,accum_sum)
df
row duration condition accum_sum
1 2 0 0
2 3 1 5
3 2 0 0
4 4 0 0
5 5 0 0
6 10 0 0
7 2 0 0
8 9 2 32
9 7 0 0
10 5 0 0
11 8 0 0
12 9 0 0
13 10 1 39
14 12 0 0
15 4 0 0
16 5 0 0
17 6 2 27
Using data.table:
library(data.table)
setDT(df)
df[, accum_sum := cumsum(duration), by = rev(cumsum(rev(condition)))]
df[condition == 0, accum_sum := 0]
# duration condition accum_sum
# 1: 2 0 0
# 2: 3 1 5
# 3: 2 0 0
# 4: 4 0 0
# 5: 5 0 0
# 6: 10 0 0
# 7: 2 0 0
# 8: 9 2 32
# 9: 7 0 0
#10: 5 0 0
#11: 8 0 0
#12: 9 0 0
#13: 10 1 39
#14: 12 0 0
#15: 4 0 0
#16: 5 0 0
#17: 6 2 27
We create runs by filling the zeros backwards with rev(cumsum(rev(condition))) and then group by this "filled" condition.
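To make the grouping key visible, here is a small base R sketch of the same idea, with ave() standing in for the data.table grouping. The names grp and running are illustrative.

```r
duration  <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)

# reversed cumulative sum: every zero row gets the same key as the
# next non-zero condition below it
grp <- rev(cumsum(rev(condition)))
grp
# [1] 6 6 5 5 5 5 5 5 3 3 3 3 3 2 2 2 2

# running total within each key, then keep it only on the condition rows
running <- ave(duration, grp, FUN = cumsum)
accum_sum <- ifelse(condition == 0, 0, running)
accum_sum
```

Any rows after the last non-zero condition would share the key 0 and end up with accum_sum 0, because their condition is 0.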
#cumulative sum
df$cum_sum <- ave(df$duration, c(0, cumsum(df$condition[-nrow(df)])), FUN = cumsum)
#replace all zero condition row with zero value in cumulative sum column
df$cum_sum <- ifelse(df$condition == 0, 0, df$cum_sum)
which gives
duration condition cum_sum
1 2 0 0
2 3 1 5
3 2 0 0
4 4 0 0
5 5 0 0
6 10 0 0
7 2 0 0
8 9 2 32
9 7 0 0
10 5 0 0
11 8 0 0
12 9 0 0
13 10 1 39
14 12 0 0
15 4 0 0
16 5 0 0
17 6 2 27
Sample data:
df <- structure(list(duration = c(2, 3, 2, 4, 5, 10, 2, 9, 7, 5, 8,
9, 10, 12, 4, 5, 6), condition = c(0, 1, 0, 0, 0, 0, 0, 2, 0,
0, 0, 0, 1, 0, 0, 0, 2), cum_sum = c(0, 5, 0, 0, 0, 0, 0, 32,
0, 0, 0, 0, 39, 0, 0, 0, 27)), .Names = c("duration", "condition",
"cum_sum"), row.names = c(NA, -17L), class = "data.frame")
Using dplyr, we can use cumsum() on condition to keep track of how many conditions have been seen. Then add within those subsets:
library(dplyr)
df %>%
mutate(condition_group = cumsum(lag(condition, default = 0) != 0) + 1) %>%
group_by(condition_group) %>%
mutate(accum_sum = ifelse(condition != 0,
sum(duration),
0))
Output:
# A tibble: 17 x 4
# Groups: condition_group [4]
duration condition accum_sum condition_group
<dbl> <dbl> <dbl> <dbl>
1 2 0 0 1
2 3 1 5 1
3 2 0 0 2
4 4 0 0 2
5 5 0 0 2
6 10 0 0 2
7 2 0 0 2
8 9 2 32 2
9 7 0 0 3
10 5 0 0 3
11 8 0 0 3
12 9 0 0 3
13 10 1 39 3
14 12 0 0 4
15 4 0 0 4
16 5 0 0 4
17 6 2 27 4
If you shift condition by 1, you can simply use tapply.
duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)
df <- data.frame(duration,condition,accum_sum)
df$want <- unlist(tapply(df$duration,
INDEX = cumsum(c(df$condition[1], head(df$condition, -1))),
cumsum)) * ifelse(df$condition == 0, 0, 1)
df