Get Max Value per Run or Series in Sequence - r

I am trying to get the max value per stretch of an indicator, or repeating value.
Here is an example:
A = c(28, 20, 23, 30, 26, 23, 25, 26, 27, 25, 30, 26, 25, 22, 24, 25, 24, 27, 29)
B = c(0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1)
df <- as.data.frame(cbind(A, B))
df
A B
28 0
20 1
23 1
30 0
26 0
23 1
25 1
26 1
27 0
25 0
30 1
26 1
25 1
22 0
24 1
25 0
24 0
27 0
29 1
For each group or stretch of 1's in column B I want to find the max in column A. The max column could be an indicator that A is a max, or the actual value in A, and be NA or 0 for other values of B.
The output I am hoping for looks something like this:
A B max
28 0 0
20 1 0
23 1 1
30 0 0
26 0 0
23 1 0
25 1 0
26 1 1
27 0 0
25 0 0
30 1 1
26 1 0
25 1 0
22 0 0
24 1 1
25 0 0
24 0 0
27 0 0
29 1 1
I've tried to generate groups per section of column B that = 1 but I did not get very far because most grouping functions require unique values between groups.
Also, please let me know if there are any improvements to the title for this problem.

One option would be data.table
library(data.table)
setDT(df)[, Max := +((A== max(A)) & B), rleid(B) ]
df
# A B Max
# 1: 28 0 0
# 2: 20 1 0
# 3: 23 1 1
# 4: 30 0 0
# 5: 26 0 0
# 6: 23 1 0
# 7: 25 1 0
# 8: 26 1 1
# 9: 27 0 0
#10: 25 0 0
#11: 30 1 1
#12: 26 1 0
#13: 25 1 0
#14: 22 0 0
#15: 24 1 1
#16: 25 0 0
#17: 24 0 0
#18: 27 0 0
#19: 29 1 1
Or, as @Frank mentioned, for better efficiency we can make use of gmax by first assigning the column and then replacing:
# DT is assumed to be the data.table version of df, e.g. DT <- as.data.table(df)
DT[, MA := max(A), by=rleid(B)][A == MA & B, Max := 1L][]

Solution using dplyr
library(dplyr)
df %>%
group_by(with(rle(B), rep(seq_along(lengths), lengths))) %>%
mutate(MAX = ifelse(B == 0, 0, as.numeric(A == max(A)))) %>%
.[, c(1, 2, 4)]
A B MAX
<dbl> <dbl> <dbl>
1 28 0 0
2 20 1 0
3 23 1 1
4 30 0 0
5 26 0 0
6 23 1 0
7 25 1 0
8 26 1 1
9 27 0 0
10 25 0 0
11 30 1 1
12 26 1 0
13 25 1 0
14 22 0 0
15 24 1 1
16 25 0 0
17 24 0 0
18 27 0 0
19 29 1 1
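For reference, the same run-based grouping can be sketched in base R without extra packages. This is a minimal sketch and not one of the answers above: run_id is a name introduced here for illustration, built from rle() the same way the dplyr answer builds its grouping, and ave() supplies the per-run maximum.

```r
A <- c(28, 20, 23, 30, 26, 23, 25, 26, 27, 25, 30, 26, 25, 22, 24, 25, 24, 27, 29)
B <- c(0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1)
df <- data.frame(A, B)

# run_id (hypothetical helper name): one integer per consecutive run of equal B values
run_id <- with(rle(df$B), rep(seq_along(lengths), lengths))

# flag rows that sit inside a run of 1s and equal that run's maximum of A
df$max <- as.integer(df$B == 1 & df$A == ave(df$A, run_id, FUN = max))
df$max
# 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1
```

If a run contains tied maxima, every tied row is flagged, which is also how the data.table and dplyr versions behave.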

Related

R: table frequencies of letters in string based on Alphabet

I need to compute letter frequencies for a large list of words. For each position in the word (first, second, ...), I need to find how many times each letter (a-z) appears in the list and then tabulate the data by word position.
For example, if my word list is: words <- c("swims", "seems", "gills", "draws", "which", "water")
then the result table should look like this:
letter  first position  second position  third position  fourth position  fifth position
a       0               1                1               0                0
b       0               0                0               0                0
c       0               0                0               1                0
d       1               0                0               0                0
e       0               1                1               1                0
f       0               0                0               0                0
...continued until z
All words are of same length (5).
What I have so far is:
alphabet <- letters[1:26]
words.df <- data.frame("Words" = words)
# needs library(dplyr); the column is named "Words" (capital W)
words.df <- words.df %>% mutate("First_place" = substr(Words,1,1))
words.df <- words.df %>% mutate("Second_place" = substr(Words,2,2))
words.df <- words.df %>% mutate("Third_place" = substr(Words,3,3))
words.df <- words.df %>% mutate("Fourth_place" = substr(Words,4,4))
words.df <- words.df %>% mutate("Fifth_place" = substr(Words,5,5))
x1 <- words.df$First_place
x1 <- table(factor(x1,alphabet))
x2 <- words.df$Second_place
x2 <- table(factor(x2,alphabet))
x3 <- words.df$Third_place
x3 <- table(factor(x3,alphabet))
x4 <- words.df$Fourth_place
x4 <- table(factor(x4,alphabet))
x5 <- words.df$Fifth_place
x5 <- table(factor(x5,alphabet))
My code is not efficient and gives a table for each letter position separately. All help will be appreciated.
In base R, use table:
table(let = unlist(strsplit(words,'')),pos = sequence(nchar(words)))
pos
let 1 2 3 4 5
a 0 1 1 0 0
c 0 0 0 1 0
d 1 0 0 0 0
e 0 1 1 1 0
g 1 0 0 0 0
h 0 1 0 0 1
i 0 1 2 0 0
l 0 0 1 1 0
m 0 0 0 2 0
r 0 1 0 0 1
s 2 0 0 0 4
t 0 0 1 0 0
w 2 1 0 1 0
Note that if you need all the values from a-z then use
table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
Also to get a dataframe you could do:
d <- table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
cbind(letters = rownames(d), as.data.frame.matrix(d))
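As a cross-check on the table() approach, the counts can also be built position by position. This is only a sketch under the assumption that all words are lower case; counts and n_pos are names introduced here for illustration, and tabulate() with match() stands in for table(factor(...)).

```r
words <- c("swims", "seems", "gills", "draws", "which", "water")
n_pos <- max(nchar(words))

# one column per position; tabulate() counts how often each of the 26 letters occurs there
counts <- vapply(
  seq_len(n_pos),
  function(p) tabulate(match(substr(words, p, p), letters), nbins = 26L),
  integer(26)
)
dimnames(counts) <- list(letter = letters, position = paste0("pos", seq_len(n_pos)))

# a few rows to compare with the table() output above
counts[c("a", "s", "w"), ]
```

The result is a 26 x 5 integer matrix, so as.data.frame(counts) gives one row per letter if a data frame is preferred.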
Here is a tidyverse solution using dplyr, purrr, and tidyr:
strsplit(words.df$Words, "") %>%
map_dfr(~setNames(.x, seq_along(.x))) %>%
pivot_longer(everything(),
values_drop_na = TRUE,
names_to = "pos",
values_to = "letter") %>%
count(pos, letter) %>%
pivot_wider(names_from = pos,
names_glue = "pos{pos}",
id_cols = letter,
values_from = n,
values_fill = 0L)
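Output (this appears to come from a larger word list than the six example words, since it spans 11 positions):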
letter pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 pos11
1 a 65 127 88 38 28 17 14 5 3 0 0
2 b 58 4 7 9 2 4 2 0 1 0 0
3 c 83 14 45 37 20 19 8 3 2 0 0
4 C 2 0 0 0 0 0 0 0 0 0 0
5 d 43 8 33 47 21 22 9 3 1 1 0
6 e 45 156 81 132 114 69 48 23 14 2 2
7 f 54 11 18 10 5 2 1 0 0 0 0
8 g 23 7 27 21 15 8 7 1 0 0 0
9 h 38 56 6 28 21 10 3 3 1 1 0
10 i 25 106 51 58 38 28 8 4 1 0 0
11 j 6 0 2 2 0 0 0 0 0 0 0
12 k 9 1 6 22 12 0 0 0 0 0 0
13 l 45 41 54 54 36 9 7 6 0 2 0
14 m 45 8 31 19 8 8 4 2 0 0 0
15 n 23 42 75 53 34 41 16 16 4 2 0
16 o 28 167 76 41 38 9 11 2 1 0 0
17 p 72 20 34 30 8 3 1 1 1 0 0
18 q 7 2 1 0 0 0 0 0 0 0 0
19 r 46 74 92 59 56 45 12 9 1 1 0
20 s 119 8 67 35 31 22 18 4 1 0 0
21 t 65 30 73 83 57 42 31 9 6 3 1
22 u 12 66 39 36 20 7 7 2 0 0 0
23 v 8 7 20 12 5 5 1 0 0 0 0
24 w 53 8 13 10 2 3 0 1 0 0 0
25 y 6 4 16 15 17 15 10 5 6 1 1
26 x 0 12 5 0 0 0 0 0 0 0 0
27 z 0 0 1 0 0 0 1 1 0 0 0

Summing up different elements in a matrix in R

I'm trying to perform calculations on different elements in a matrix in R. My matrix is 18x18 and I would like to get, e.g., the mean of each 6x6 array (which makes 9 arrays in total). My desired arrays would be:
A1 <- df[1:6,1:6]
A2 <- df[1:6,7:12]
A3 <- df[1:6,13:18]
B1 <- df[7:12,1:6]
B2 <- df[7:12,7:12]
B3 <- df[7:12,13:18]
C1 <- df[13:18,1:6]
C2 <- df[13:18,7:12]
C3 <- df[13:18,13:18]
The matrix looks like this:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
5 14 17 9 10 8 4 10 12 18 9 13 14 NA NA 19 15 10 10
10 30 32 23 27 17 28 25 12 28 29 28 26 19 25 34 24 11 17
15 16 16 16 9 17 27 17 16 30 13 18 13 15 13 19 8 7 9
20 15 12 18 18 18 6 4 6 9 11 10 10 13 11 8 10 15 15
25 7 13 21 7 3 5 2 5 5 4 3 2 3 5 2 1 5 6
30 5 9 1 7 7 4 4 12 8 9 2 0 5 2 1 0 2 6
35 3 0 2 0 0 4 4 7 4 4 5 2 0 0 1 0 0 0
40 0 4 0 0 0 1 3 9 10 10 1 0 0 0 1 0 1 0
45 0 0 0 0 0 3 10 9 17 9 1 0 0 0 0 0 0 0
50 0 0 2 0 0 0 2 8 20 0 0 0 0 0 1 0 0 0
55 0 0 0 0 0 0 7 3 21 0 0 0 0 0 0 0 0 0
60 0 0 0 0 3 4 10 2 2 0 0 1 0 0 0 0 0 0
65 0 0 0 0 0 4 8 4 8 11 0 0 0 0 0 0 0 0
70 0 0 0 0 0 6 2 5 14 0 0 0 0 0 0 0 0 0
75 0 0 0 0 0 4 0 5 9 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 4 4 0 4 2 0 0 0 0 0 0 0 0
85 0 0 0 0 0 0 0 4 1 1 0 0 0 0 0 0 0 0
90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Is there a clean way to solve this issue with a loop?
Thanks a lot in advance,
Paul
Given your matrix, e.g.
x <- matrix(1:(18*18), ncol=18)
Try, for example, with submatrices of size 6:
step <- 6
nx <- nrow(x)
if((nx %% step) != 0) stop("nx %% step should be 0")
indI <- seq(1, nx, by=step)
nbStep <- length(indI)
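for(RowBlock in 1:nbStep){
for(ColBlock in 1:nbStep){
# letter = row block, digit = column block (A1, A2, ..., C3)
name <- paste0(LETTERS[RowBlock], ColBlock)
theRows <- indI[RowBlock]:(indI[RowBlock]+step-1)
theCols <- indI[ColBlock]:(indI[ColBlock]+step-1)
# sum of the block; swap in mean(..., na.rm = TRUE) for block means
assign(name, sum(x[theRows, theCols]))
}
}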
You'll get your results in A1, A2, A3, ...
This is the idea. Tweak the code for non-square matrices, different sizes of submatrices, and so on.
Here's one way:
# generate fake data
set.seed(47)
n = 18
m = matrix(rpois(n * n, lambda = 5), nrow = n)
# generate starting indices
n_array = 6
start_i = seq(1, n, by = n_array)
arr_starts = expand.grid(row = start_i, col = start_i)
# calculate sums
with(arr_starts, mapply(function(x, y) sum(m[(x + 1:n_array) - 1, (y + 1:n_array) - 1]), row, col))
# [1] 158 188 176 201 188 201 197 206 204
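Since the question asks for means and its matrix contains NA cells, a variant that returns all nine block means at once may be useful. This is a hedged sketch rather than part of either answer: row_block, col_block and block_means are names chosen here, the NA is planted artificially, and tapply() over block labels replaces the explicit loop.

```r
x <- matrix(1:(18 * 18), ncol = 18)
x[1, 13] <- NA   # mimic a missing cell like the ones in the question's first row
step <- 6

# block label for every cell: letters for row blocks, digits for column blocks
row_block <- LETTERS[(as.vector(row(x)) - 1) %/% step + 1]
col_block <- (as.vector(col(x)) - 1) %/% step + 1

# 3 x 3 matrix of block means, ignoring missing cells
block_means <- tapply(as.vector(x), list(row_block, col_block), mean, na.rm = TRUE)
block_means
```

block_means["A", "1"] then corresponds to mean(x[1:6, 1:6]), matching the A1 ... C3 naming used in the question.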

How to deal with longitudinal symptom data in R using lag/lead and ifelse/case_when (or other solution)?

Hi stack overflow community,
I'm relatively new to R (9 months) and this is my first stack overflow question with reprex and would really appreciate any help. I mainly use tidyverse although I am open to base R solutions.
Problem:
I have ~21,000 rows of symptom data with >10 variables per day. I would like to be able to classify "exacerbations" of a disease (in this case chest infections in lung disease) by using rules to define the start and end of the episode so that I can later calculate duration of episodes, type of episode (this depends on the combination of symptoms) and treatment received. As with any data set involving patients there are missing values. I have imputed from the most recent day if less than 2 days of data is missing.
The below code is a simplified, made up example involving only 1 symptom.
Exacerbation Rule:
Start of exacerbation = 2 days of worse symptoms (>= 3)
Resolution of exacerbation = 5 days with normal breathing (<=2)
I would ideally want to be able to identify all days when an exacerbation is happening too.
Here is the data:
#load packages
library(tidyverse)
#load data
id <- "A"
day <- c(1:50)
symptom <- c(2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2)
df <- data.frame(id,day,symptom)
#Data Dictionary
#Symptom: 1 = Better than usual, 2 = Normal/usual, 3 = Worse than usual, 4 = Much worse than usual
What I have tried:
I have tried to approach this by using a combination of lag() and lead() with conditional statements case_when() and ifelse().
df %>%
mutate_at(vars("symptom"), #used for more variables within vars() argument
.funs = list(lead1 = ~ lead(., n = 1),
lead2 = ~ lead(., n = 2),
lead3 = ~ lead(., n = 3),
lead4 = ~ lead(., n = 4),
lead5 = ~ lead(., n = 5),
lag1 = ~ lag(., n = 1),
lag2 = ~ lag(., n = 2),
lag3 = ~ lag(., n = 3))) %>%
mutate(start = case_when(symptom <= 2 ~ 0,
symptom >= 3 ~
ifelse(symptom >= lag2 & symptom <= lag1,1,0)),
end = case_when(symptom >=3 ~
ifelse(lead1 <=2 &
lead2 <=2 &
lead3 <=2 &
lead4 <=2 &
lead5 <=2,1,0)))
My main issue is that of complexity. As I build in more symptoms and rules I have to refer to different variables that have ifelse()/case_when() statements within it. I am sure there is a more elegant solution to my problem.
The other issue is that during an "exacerbation" the exacerbation_start variable should only be used at the start and not during the episode. Similarly for exacerbation_end it would only be applicable when an exacerbation is already happening. I have tried using ifelse() statements to refer to when an exacerbation is happening but not been able to get this to work and obey the rule I desire.
The output I would like is:
id day symptom start end exacerbation
1 A 1 2 0 0 0
2 A 2 2 0 0 0
3 A 3 2 0 0 0
4 A 4 2 0 0 0
5 A 5 2 0 0 0
6 A 6 2 0 0 0
7 A 7 2 0 0 0
8 A 8 2 0 0 0
9 A 9 2 0 0 0
10 A 10 2 0 0 0
11 A 11 2 0 0 0
12 A 12 3 0 0 0
13 A 13 2 0 0 0
14 A 14 2 0 0 0
15 A 15 2 0 0 0
16 A 16 2 0 0 0
17 A 17 NA 0 0 0
18 A 18 NA 0 0 0
19 A 19 NA 0 0 0
20 A 20 2 0 0 0
21 A 21 2 0 0 0
22 A 22 2 0 0 0
23 A 23 3 0 0 0
24 A 24 3 1 0 1
25 A 25 3 0 0 1
26 A 26 4 0 0 1
27 A 27 4 0 0 1
28 A 28 3 0 0 1
29 A 29 3 0 0 1
30 A 30 2 0 0 1
31 A 31 3 0 0 1
32 A 32 2 0 0 1
33 A 33 2 0 0 1
34 A 34 3 0 0 1
35 A 35 3 0 1 1
36 A 36 2 0 0 0
37 A 37 2 0 0 0
38 A 38 2 0 0 0
39 A 39 2 0 0 0
40 A 40 2 0 0 0
41 A 41 2 0 0 0
42 A 42 3 0 0 0
43 A 43 2 0 0 0
44 A 44 2 0 0 0
45 A 45 2 0 0 0
46 A 46 2 0 0 0
47 A 47 2 0 0 0
48 A 48 3 0 0 0
49 A 49 2 0 0 0
50 A 50 2 0 0 0
I look forward to your replies!
EDIT
I have added 50 more rows of data to simulate multiple exacerbations and the issue with right censoring and NAs. I have also included a second participant "B" to see if this is a reason for issues.
id <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B")
day <- c(1:50,1:50)
symptom <- c(2,3,3,3,3,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2, 2,2,2,2,2,2,3,2,3,3,2,3,2,3,2,2,2,2,2,2,3,3,3,3,NA,NA,NA,2,2,2,3,2,2,2,2,2,3,2,2,3,NA,NA,NA,3,3,3,3,3,3,2)
df <- data.frame(id,day,symptom)
id day symptom start end exacerbation censor
1 A 1 2 0 0 0 0
2 A 2 3 1 0 1 0
3 A 3 3 0 0 1 0
4 A 4 3 0 0 1 0
5 A 5 3 0 1 1 0
6 A 6 2 0 0 0 0
7 A 7 2 0 0 0 0
8 A 8 2 0 0 0 0
9 A 9 2 0 0 0 0
10 A 10 2 0 0 0 0
11 A 11 2 0 0 0 0
12 A 12 3 0 0 0 0
13 A 13 2 0 0 0 0
14 A 14 2 0 0 0 0
15 A 15 2 0 0 0 0
16 A 16 2 0 0 0 0
17 A 17 NA 0 0 0 0
18 A 18 NA 0 0 0 0
19 A 19 NA 0 0 0 0
20 A 20 2 0 0 0 0
21 A 21 2 0 0 0 0
22 A 22 2 0 0 0 0
23 A 23 3 1 0 1 0
24 A 24 3 0 0 1 0
25 A 25 3 0 0 1 0
26 A 26 4 0 0 1 0
27 A 27 4 0 0 1 0
28 A 28 3 0 0 1 0
29 A 29 3 0 0 1 0
30 A 30 2 0 0 1 0
31 A 31 3 0 0 1 0
32 A 32 2 0 0 1 0
33 A 33 2 0 0 1 0
34 A 34 3 0 0 1 0
35 A 35 3 0 0 1 0
36 A 36 2 0 0 1 0
37 A 37 2 0 0 1 0
38 A 38 2 0 0 1 0
39 A 39 2 0 0 1 0
40 A 40 2 0 0 1 0
41 A 41 2 0 1 1 0
42 A 42 3 0 0 0 0
43 A 43 2 0 0 0 0
44 A 44 2 0 0 0 0
45 A 45 2 0 0 0 0
46 A 46 2 0 0 0 0
47 A 47 2 0 0 0 0
48 A 48 3 0 0 0 0
49 A 49 2 0 0 0 0
50 A 50 2 0 0 0 0
51 B 1 2 0 0 0 0
52 B 2 2 0 0 0 0
53 B 3 2 0 0 0 0
54 B 4 2 0 0 0 0
55 B 5 2 0 0 0 0
56 B 6 2 0 0 0 0
57 B 7 3 0 0 0 0
58 B 8 2 0 0 0 0
59 B 9 3 0 0 0 0
60 B 10 3 1 0 1 0
61 B 11 2 0 0 1 0
62 B 12 3 0 0 1 0
63 B 13 2 0 0 1 0
64 B 14 3 0 0 1 0
65 B 15 2 0 0 1 0
66 B 16 2 0 0 1 0
67 B 17 2 0 0 1 0
68 B 18 2 0 0 1 0
69 B 19 2 0 1 1 0
70 B 20 2 0 0 0 0
71 B 21 3 1 0 1 0
72 B 22 3 0 0 1 0
73 B 23 3 0 0 1 0
74 B 24 3 0 0 1 0
75 B 25 NA 0 0 0 1
76 B 26 NA 0 0 0 1
77 B 27 NA 0 0 0 1
78 B 28 2 0 0 0 1
79 B 29 2 0 0 0 1
80 B 30 2 0 0 0 1
81 B 31 3 0 0 0 1
82 B 32 2 0 0 0 1
83 B 33 2 0 0 0 1
84 B 34 2 0 0 0 1
85 B 35 2 0 0 0 1
86 B 36 2 0 0 0 1
87 B 37 3 0 0 0 0
88 B 38 2 0 0 0 0
89 B 39 2 0 0 0 0
90 B 40 3 0 0 0 0
91 B 41 NA 0 0 0 0
92 B 42 NA 0 0 0 0
93 B 43 NA 0 0 0 0
94 B 44 3 1 0 1 0
95 B 45 3 0 0 1 0
96 B 46 3 0 0 1 0
97 B 47 3 0 0 1 0
98 B 48 3 0 0 1 0
99 B 49 3 0 0 1 0
100 B 50 2 0 0 1 0
Here is a try for a more elegant and scalable way to write your algorithm:
First, you do not have to compute the lead and lag calls before you can use case_when. Of note, I find it good practice to explicitly write the TRUE option of case_when. Here is some code.
df2=df %>%
mutate(
exacerbation_start = case_when(
is.na(symptom) ~ NA_real_,
symptom <= 2 ~ 0,
symptom >= 3 & symptom >= lag(symptom, n=2) & symptom <= lag(symptom, n=1) ~ 1,
TRUE ~ 0
),
exacerbation_end = case_when(
symptom >=3 ~ ifelse(lead(symptom, n=1) <=2 &
lead(symptom, n=2) <=2 & lead(symptom, n=3) <=2 &
lead(symptom, n=4) <=2 & lead(symptom, n=5) <=2,
1,0),
TRUE ~ NA_real_
)
)
all.equal(df1,df2) #TRUE
Alternatively, if your algorithm is the same for all symptoms, you might want to use custom functions:
get_exacerbation_start = function(x){
case_when(
is.na(x) ~ NA_real_,
x <= 2 ~ 0,
x >= 3 & x >= lag(x, n=2) & x <= lag(x, n=1) ~ 1,
TRUE ~ 0
)
}
get_exacerbation_end = function(x){
case_when(
x >=3 ~ ifelse(x >=3 & lead(x, n=1) <=2 &
lead(x, n=2) <=2 & lead(x, n=3) <=2 &
lead(x, n=4) <=2 & lead(x, n=5) <=2,
1,0),
TRUE ~ NA_real_
)
}
df3=df %>%
mutate(
exacerbation_start = get_exacerbation_start(symptom),
exacerbation_end = get_exacerbation_end(symptom)
)
all.equal(df1,df3) #also TRUE
This latter way might be even more powerful with some mutate_at calls.
EDIT: after seeing your edit, here is a try to get the exacerbation period. The code is quite ugly in my opinion, I'm not sure that row_number was supposed to be used this way.
df_final=df %>%
transmute(
id,day,symptom,
start = get_exacerbation_start(symptom),
end = get_exacerbation_end(symptom),
exacerbation = row_number()>=which(start==1)[1] & row_number()<=which(end==1)[1]
)
I may come back with a less convoluted approach, but try this:
library(dplyr)
library(tidyr)
df %>%
group_by(id,
idx = with(
rle(
case_when(symptom <= 2 ~ 'normal',
symptom >= 3 ~ 'worse',
TRUE ~ symptom %>% as.character)),
rep(seq_along(lengths), lengths)
)
) %>%
mutate(
trajectory = case_when(cumsum(symptom <= 2) == 5 ~ 2, cumsum(symptom >= 3) == 2 ~ 1)
) %>%
group_by(id) %>% fill(trajectory) %>%
mutate(
trajectory = replace_na(trajectory, 0),
start = +(trajectory == 1 & lag(trajectory) == 2),
end = +(trajectory == 2 & lag(trajectory) == 1),
exacerbation = +(trajectory == 1 | start == 1 | end == 1)
) %>%
select(-idx, -trajectory) %>% as.data.frame
Output:
id day symptom start end exacerbation
1 A 1 2 0 0 0
2 A 2 2 0 0 0
3 A 3 2 0 0 0
4 A 4 2 0 0 0
5 A 5 2 0 0 0
6 A 6 2 0 0 0
7 A 7 2 0 0 0
8 A 8 2 0 0 0
9 A 9 2 0 0 0
10 A 10 2 0 0 0
11 A 11 2 0 0 0
12 A 12 3 0 0 0
13 A 13 2 0 0 0
14 A 14 2 0 0 0
15 A 15 2 0 0 0
16 A 16 2 0 0 0
17 A 17 NA 0 0 0
18 A 18 NA 0 0 0
19 A 19 NA 0 0 0
20 A 20 2 0 0 0
21 A 21 2 0 0 0
22 A 22 2 0 0 0
23 A 23 3 0 0 0
24 A 24 3 1 0 1
25 A 25 3 0 0 1
26 A 26 4 0 0 1
27 A 27 4 0 0 1
28 A 28 3 0 0 1
29 A 29 3 0 0 1
30 A 30 2 0 0 1
31 A 31 3 0 0 1
32 A 32 2 0 0 1
33 A 33 2 0 0 1
34 A 34 3 0 0 1
35 A 35 3 0 0 1
36 A 36 2 0 0 1
37 A 37 2 0 0 1
38 A 38 2 0 0 1
39 A 39 2 0 0 1
40 A 40 2 0 1 1
41 A 41 2 0 0 0
42 A 42 3 0 0 0
43 A 43 2 0 0 0
44 A 44 2 0 0 0
45 A 45 2 0 0 0
46 A 46 2 0 0 0
47 A 47 2 0 0 0
48 A 48 3 0 0 0
49 A 49 2 0 0 0
50 A 50 2 0 0 0
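The rule from the question (start after 2 consecutive days of symptom >= 3, resolution after 5 consecutive days of symptom <= 2) can also be written as an explicit state machine, which may be easier to extend as more symptoms and rules are added. The following is a sketch under stated assumptions, not a definitive implementation: flag_exacerbations and its arguments are hypothetical names, an NA is assumed to break any run, start is flagged on the second worse day, and end is flagged on the fifth normal day (the convention of the output just above).

```r
flag_exacerbations <- function(symptom, n_start = 2, n_end = 5) {
  n <- length(symptom)
  start <- end <- exacerbation <- integer(n)
  worse_run <- 0L    # consecutive days with symptom >= 3
  normal_run <- 0L   # consecutive days with symptom <= 2
  active <- FALSE    # currently inside an exacerbation?
  for (i in seq_len(n)) {
    s <- symptom[i]
    if (is.na(s)) {            # assumption: a missing day resets both runs
      worse_run <- 0L; normal_run <- 0L
    } else if (s >= 3) {
      worse_run <- worse_run + 1L; normal_run <- 0L
    } else {
      normal_run <- normal_run + 1L; worse_run <- 0L
    }
    if (!active && worse_run == n_start) {
      active <- TRUE; start[i] <- 1L
    }
    exacerbation[i] <- as.integer(active)
    if (active && normal_run == n_end) {
      end[i] <- 1L; active <- FALSE   # episode closes after this day
    }
  }
  data.frame(start, end, exacerbation)
}

symptom <- c(2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,2,NA,NA,NA,2,2,2,3,3,3,4,4,3,3,2,
             3,2,2,3,3,2,2,2,2,2,2,3,2,2,2,2,2,3,2,2)
res <- flag_exacerbations(symptom)
which(res$start == 1)  # 24
which(res$end == 1)    # 40
```

For several participants the function would need to be applied per id, for example by splitting the data on id first, so that runs do not carry over from one participant to the next.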

Cumulative sum of a subset of data based on condition

I have what I think is a simple R task, but I'm having trouble. Basically I need to do a cumulative sum of values based on the criteria of another column.
Here's the catch: it should accumulate the previous rows until it hits another condition. In the example I'm providing, it sums the duration column up to and including each row where condition is 1 or 2, then starts over. The example is shown below.
duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)
df <- data.frame(duration,condition,accum_sum)
df
row duration condition accum_sum
1 2 0 0
2 3 1 5
3 2 0 0
4 4 0 0
5 5 0 0
6 10 0 0
7 2 0 0
8 9 2 32
9 7 0 0
10 5 0 0
11 8 0 0
12 9 0 0
13 10 1 39
14 12 0 0
15 4 0 0
16 5 0 0
17 6 2 27
Using data.table:
setDT(df)
df[, accum_sum := cumsum(duration), by = rev(cumsum(rev(condition)))]
df[condition == 0, accum_sum := 0]
# duration condition accum_sum
# 1: 2 0 0
# 2: 3 1 5
# 3: 2 0 0
# 4: 4 0 0
# 5: 5 0 0
# 6: 10 0 0
# 7: 2 0 0
# 8: 9 2 32
# 9: 7 0 0
#10: 5 0 0
#11: 8 0 0
#12: 9 0 0
#13: 10 1 39
#14: 12 0 0
#15: 4 0 0
#16: 5 0 0
#17: 6 2 27
We create runs by filling the zeros backwards with rev(cumsum(rev(condition))) and then group by this "filled" condition.
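To see what that grouping looks like, here is the intermediate vector for the example data, followed by a base R equivalent of the two data.table steps. This is a small illustrative sketch; grp is a name introduced here, and it relies on the non-zero condition values being positive so that the reversed cumulative sum never repeats a group label.

```r
duration  <- c(2, 3, 2, 4, 5, 10, 2, 9, 7, 5, 8, 9, 10, 12, 4, 5, 6)
condition <- c(0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 2)

# each zero takes the label of the next non-zero row below it
grp <- rev(cumsum(rev(condition)))
grp
# 6 6 5 5 5 5 5 5 3 3 3 3 3 2 2 2 2

# running total within each group, kept only on the non-zero rows
accum_sum <- ave(duration, grp, FUN = cumsum) * (condition != 0)
accum_sum
# 0 5 0 0 0 0 0 32 0 0 0 0 39 0 0 0 27
```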
Base R, using ave():
#cumulative sum within each run
df$cum_sum <- ave(df$duration, c(0, cumsum(df$condition[-nrow(df)])), FUN = cumsum)
#replace all zero condition row with zero value in cumulative sum column
df$cum_sum <- ifelse(df$condition == 0, 0, df$cum_sum)
which gives
duration condition cum_sum
1 2 0 0
2 3 1 5
3 2 0 0
4 4 0 0
5 5 0 0
6 10 0 0
7 2 0 0
8 9 2 32
9 7 0 0
10 5 0 0
11 8 0 0
12 9 0 0
13 10 1 39
14 12 0 0
15 4 0 0
16 5 0 0
17 6 2 27
Sample data:
df <- structure(list(duration = c(2, 3, 2, 4, 5, 10, 2, 9, 7, 5, 8,
9, 10, 12, 4, 5, 6), condition = c(0, 1, 0, 0, 0, 0, 0, 2, 0,
0, 0, 0, 1, 0, 0, 0, 2), cum_sum = c(0, 5, 0, 0, 0, 0, 0, 32,
0, 0, 0, 0, 39, 0, 0, 0, 27)), .Names = c("duration", "condition",
"cum_sum"), row.names = c(NA, -17L), class = "data.frame")
Using dplyr, we can use cumsum() on condition to keep track of how many conditions have been seen. Then add within those subsets:
library(dplyr)
df %>%
mutate(condition_group = cumsum(lag(condition, default = 0) != 0) + 1) %>%
group_by(condition_group) %>%
mutate(accum_sum = ifelse(condition != 0,
sum(duration),
0))
Output:
# A tibble: 17 x 4
# Groups: condition_group [4]
duration condition accum_sum condition_group
<dbl> <dbl> <dbl> <dbl>
1 2 0 0 1
2 3 1 5 1
3 2 0 0 2
4 4 0 0 2
5 5 0 0 2
6 10 0 0 2
7 2 0 0 2
8 9 2 32 2
9 7 0 0 3
10 5 0 0 3
11 8 0 0 3
12 9 0 0 3
13 10 1 39 3
14 12 0 0 4
15 4 0 0 4
16 5 0 0 4
17 6 2 27 4
If you shift condition by 1, you can simply use tapply.
duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)
df <- data.frame(duration,condition,accum_sum)
df$want <- unlist(tapply(df$duration,
INDEX = cumsum(c(df$condition[1], head(df$condition, -1))),
cumsum)) * ifelse(df$condition == 0, 0, 1)
df

Print 1 below specific value until meets a higher value

I have data where I wish to print 1s when below a certain value until we meet a higher value.
Take this data for example:
data long_entry long_exit
1 80.000000 0 1
2 7.692308 1 0
3 7.692308 1 0
4 8.333333 1 0
5 9.090909 1 0
6 20.000000 1 0
7 27.272727 0 0
8 50.000000 0 0
9 50.000000 0 0
10 21.428571 1 0
11 58.333333 0 0
12 46.666667 0 0
13 78.064516 0 1
14 86.153846 0 1
15 42.857143 0 0
16 44.186047 0 0
17 20.000000 1 0
18 25.000000 0 0
19 40.000000 0 0
20 45.000000 0 0
21 78.000000 0 1
22 55.000000 0 0
My goal is to print 1s when the data column is below 25 and continue to print 1 until we meet a data value over 70 (first instance).
Code used to make long / exit signals:
df$long_entry = ifelse(df$data < 25,1,0 )
df$long_exit = ifelse(df$data >= 70,1,0)
I have tried writing a few for loops using base and dplyr:
df$final.signal[[1]] = ifelse(df$long_entry[[1]] == 1, 1, 0)
for (i in 2:nrow(df)){
df$final.signal[i] = ifelse(df$long_entry[i] ==1, 1, 0,
ifelse(df$long_exit[i] == 1, 0,
df$long_exit[i-1]))
}
df <- df %>%
dplyr::mutate(final.signal = ifelse(long_entry == 1, 1,
ifelse(long_exit ==1, 0, 0)))
This however does not do as intended. The desired output is to be like this:
data desired.output
1 80.000000 0
2 7.692308 1
3 7.692308 1
4 8.333333 1
5 9.090909 1
6 20.000000 1
7 27.272727 1
8 50.000000 1
9 50.000000 1
10 21.428571 1
11 58.333333 1
12 46.666667 1
13 78.064516 1 (1 on first instance over 70)
14 86.153846 0
15 42.857143 0
16 44.186047 0
17 20.000000 1 (back to 1 when under 25)
18 25.000000 1
19 40.000000 1
20 45.000000 1
21 78.000000 1 ( stay 1 until first instance over 70)
22 85.000000 0
That is, we print 1 from the first value below 25 up to and including the first value over 70.
Which is the best method to approach this task?
Maybe this could help you:
library(dplyr)
dataa <- data.frame(abs(rnorm(mean = 30, sd = 40, n= 100)))
names(dataa) <- c("v1")
dataa %>% mutate(v2 = as.numeric( (cumsum(as.numeric(v1>70)) <= 0) & (cumsum(as.numeric(v1<25)) >= 1)))
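Note that the cumsum expression above stays at 0 permanently once the first value over 70 has been seen, so it does not switch back on at the next value under 25. A way to get the repeating behaviour in the desired output is a small loop that carries the state forward. This is a sketch under assumptions: hold_signal, entry and exit are names introduced here, and the thresholds follow the question's long_entry (< 25) and long_exit (>= 70) definitions.

```r
hold_signal <- function(x, entry = 25, exit = 70) {
  out <- integer(length(x))
  holding <- FALSE
  for (i in seq_along(x)) {
    # switch on at the first value below the entry threshold
    if (!holding && x[i] < entry) holding <- TRUE
    out[i] <- as.integer(holding)
    # the exit row itself still gets a 1, then the signal switches off
    if (holding && x[i] >= exit) holding <- FALSE
  }
  out
}

data <- c(80, 7.692308, 7.692308, 8.333333, 9.090909, 20, 27.272727, 50, 50,
          21.428571, 58.333333, 46.666667, 78.064516, 86.153846, 42.857143,
          44.186047, 20, 25, 40, 45, 78, 55)
hold_signal(data)
# 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0
```

On a data frame this would be used as df$final.signal <- hold_signal(df$data).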
