I would like to create some new variables in R based on multiple (>100) other variables.
My dataset looks like this
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br
1 1 1 1 2 2 1 6 0 1 6 1
2 2 2 3 2 5 1 3
3 3 0 0 <NA> 4 1 0 0 <NA> 2 2
4 4 NA 1 2 2 NA 1 1 4
5 5 NA 4 2 3 5 NA 4 3 4 3
The variables diag_x_ais can take integers from 0-6, and diag_x_br can take integers between 1-6.
I would like to make 6 new variables corresponding to the 6 possible diag_x_br values, i.e. the new variables would be called br_1, br_2 ... br_6. These new variables shall then be filled with the maximum value of the corresponding diag_x_ais variables, i.e.
if diag_1_br, diag_2_br, and diag_4_br are all 3, then br_3 should take the maximum value of diag_1_ais, diag_2_ais, and diag_4_ais.
Please also see the example dataset below:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6
1 1 1 1 2 2 1 6 0 1 6 1 2 NA NA NA NA 2
2 2 2 1 4 3 5 5 2 2 1 3 3 4 5 NA 2 NA
3 3 0 0 NA 4 1 0 0 NA 2 2 NA 4 NA NA NA NA
4 4 NA 1 2 2 NA 1 1 4 2 NA NA 2 NA NA
5 5 NA 4 2 3 5 NA 4 3 4 3 NA NA 5 4 NA NA
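The rule can be checked by hand on a single row. The following is only a minimal base R sketch, using the values of sub_id 2 from the table above; the vector names ais, br and br_max are made up for illustration.

```r
# Values of sub_id 2: diag_1..5_ais and diag_1..5_br
ais <- c(2, 1, 4, 3, 5)
br  <- c(5, 2, 2, 1, 3)

# For each possible br value k, take the maximum ais among the
# positions where br equals k; NA if k does not occur in this row.
br_max <- sapply(1:6, function(k) {
  hit <- which(br == k)  # which() ignores NA comparisons
  if (length(hit) == 0) NA_real_ else max(ais[hit])
})
names(br_max) <- paste0("br_", 1:6)
br_max
```

For this row, br_2 is the larger of diag_2_ais and diag_3_ais, because diag_2_br and diag_3_br are both 2.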
Hereafter, I would like a final variable which calculates the sum of the up to three largest br_x variables, example displayed below:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 sum3
1 1 1 1 2 2 1 6 0 1 6 1 2 NA NA NA NA 2 4
2 2 2 1 4 3 5 5 2 2 1 3 3 4 5 NA 2 NA 12
3 3 0 0 NA 4 1 0 0 NA 2 2 NA 4 NA NA NA NA 4
4 4 NA 1 2 2 NA 1 1 4 2 NA NA 2 NA NA 4
5 5 NA 4 2 3 5 NA 4 3 4 3 NA NA 5 4 NA NA 9
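As a rough illustration of the "up to three largest" rule, here is a small base R sketch. The helper name sum_top3 is hypothetical, and returning NA when every br_x is missing is an assumption, since the example data contains no such row.

```r
# Sum of the (up to) three largest non-missing values of a vector.
sum_top3 <- function(x) {
  x <- sort(x, decreasing = TRUE)  # sort() drops NA by default
  if (length(x) == 0) NA_real_ else sum(head(x, 3))
}

sum_top3(c(3, 4, 5, NA, 2, NA))   # br_1..br_6 of sub_id 2: 5 + 4 + 3
sum_top3(c(2, NA, NA, NA, NA, 2)) # br_1..br_6 of sub_id 1: only two values
```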
My actual dataset has 60 diag_x_ais variables and 60 diag_x_br variables and 4000 rows.
I hope that someone can help me do this in R. Thank you!
I think you could use the following solution. I made a slight modification so that we only sum the first 3 max values:
library(dplyr)
library(purrr)
df %>%
bind_cols(as.data.frame(t(map_dfr(1:6, function(a) pmap_dfc(df, ~ {x <- c(...)[grepl("br", names(df))]
inds <- which(x == a)
if(length(inds) != 0) {
y <- c(...)[grepl("ais", names(df))]
max(y[inds])
} else {
NA
}})))) %>%
setNames(paste0("br", 1:6))) %>%
rowwise() %>%
mutate(sum = sum(sort(as.numeric(c_across(starts_with("br"))), decreasing = TRUE)[1:3], na.rm = TRUE)) %>%
select(starts_with("br"), sum)
Resulting output
# A tibble: 5 x 7
# Rowwise:
br1 br2 br3 br4 br5 br6 sum
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2 NA NA NA NA 2 4
2 3 4 5 NA 2 NA 12
3 NA 4 NA NA NA NA 4
4 2 NA NA 2 NA NA 4
5 NA NA 5 4 NA NA 9
You could use some heavy data transformation, although it is most likely not very efficient on large datasets. Your dataset contains some empty values, NA and 0. I didn't handle them specially (and replaced the empty values with NA to make importing easier).
library(tidyr)
library(dplyr)
data %>%
pivot_longer(-sub_id,
names_to = c("name", "cat"),
names_pattern = ".*_(\\d+)_(.*)") %>%
pivot_wider(names_from = "cat") %>%
group_by(sub_id, br) %>%
summarise(value = max(ais), .groups = "drop") %>%
filter(br %in% 1:6) %>%
group_by(sub_id) %>%
mutate(sum = sum(tail(sort(value), 3))) %>%
pivot_wider(names_from = br,
names_glue = "br_{br}") %>%
select(sub_id, paste0("br_", 1:6), sum)
This returns
# A tibble: 5 x 8
sub_id br_1 br_2 br_3 br_4 br_5 br_6 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA NA NA 2 4
2 2 3 4 5 NA 2 NA 12
3 3 NA 4 NA NA NA NA 4
4 4 2 NA NA 2 NA NA 4
5 5 NA NA 5 4 NA NA 9
Piping an additional right_join(data, by = "sub_id") gives you your example output (apart from the order of your columns).
I took an idea from this answer.
Data
data <- structure(list(sub_id = c(1, 2, 3, 4, 5), diag_1_ais = c(1, 2,
0, NA, NA), diag_2_ais = c(1, 1, 0, NA, 4), diag_3_ais = c(2,
4, NA, 1, 2), diag_4_ais = c(2, 3, 4, 2, 3), diag_5_ais = c(1,
5, 1, 2, 5), diag_1_br = c(6, 5, 0, NA, NA), diag_2_br = c(0,
2, 0, NA, 4), diag_3_br = c(1, 2, NA, 1, 3), diag_4_br = c(6,
1, 2, 1, 4), diag_5_br = c(1, 3, 2, 4, 3)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
For the first part:
data <- data.frame(sub_id = c(1,2,3,4,5),
diag_1_ais = c(1,2,0,NA,NA),
diag_2_ais = c(1,1,0,NA,4),
diag_3_ais = c(2,4,NA,1,2),
diag_4_ais = c(2,3,4,2,3),
diag_5_ais = c(1,5,1,2,5),
diag_1_br = c(6,5,0,NA,NA),
diag_2_br = c(0,2,0,NA,4),
diag_3_br = c(1,2,NA,1,3),
diag_4_br = c(6,1,2,1,4),
diag_5_br = c(1,3,2,4,3))
calc_br <- function(data, value, firstBr, lastBr) {
br <- c()
for (i in 1:nrow(data)){
# positions (within the br columns) where this row's diag_x_br equals value
hits <- which(data[i,c(firstBr:lastBr)] %in% value)
if (length(hits)!=0){
# + 1 maps br position x to the diag_x_ais column, because sub_id is column 1
br <- c(br, max(data[i,hits+1]))
}
else {
br <- c(br, NA)
}
}
br
}
firstBr = 7
lastBr = 11
data$br_1 <- calc_br(data,1,firstBr,lastBr)
data$br_2 <- calc_br(data,2,firstBr,lastBr)
data$br_3 <- calc_br(data,3,firstBr,lastBr)
data$br_4 <- calc_br(data,4,firstBr,lastBr)
data$br_5 <- calc_br(data,5,firstBr,lastBr)
data$br_6 <- calc_br(data,6,firstBr,lastBr)
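This should yield the same results as in your example. For your full dataset you should only have to change firstBr and lastBr (to 62 and 121, I would guess, if sub_id is column 1 and is followed by the 60 diag_x_ais columns and then the 60 diag_x_br columns).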
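If you would rather not count columns by hand, the positions can be looked up by name. This is a minimal sketch under an assumed layout (sub_id first, then 60 diag_x_ais columns, then 60 diag_x_br columns); col_names stands in for names(data) of the real dataset and br_idx is a made-up helper name.

```r
# Hypothetical column layout of the full dataset
col_names <- c("sub_id",
               paste0("diag_", 1:60, "_ais"),
               paste0("diag_", 1:60, "_br"))

# Positions of all diag_x_br columns, found by name
br_idx  <- grep("^diag_[0-9]+_br$", col_names)
firstBr <- min(br_idx)
lastBr  <- max(br_idx)
c(firstBr, lastBr)
```

Note that calc_br still assumes the diag_x_ais columns start directly after sub_id, so this only removes the manual counting, not the dependence on column order.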
For the second part this should do the trick:
br_sum <- c()
for (i in 1:nrow(data)){
br_sum <- c(br_sum, sum(data[i,lastBr+tail(order(data[i,c((lastBr+1):(lastBr+6))], na.last = NA), 3)]))
}
data$br_sum <- br_sum
For completeness, here are my results:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br
1 1 1 1 2 2 1 6
2 2 2 1 4 3 5 5
3 3 0 0 NA 4 1 0
4 4 NA NA 1 2 2 NA
5 5 NA 4 2 3 5 NA
diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 br_sum
1 0 1 6 1 2 NA NA NA NA 2 4
2 2 2 1 3 3 4 5 NA 2 NA 12
3 0 NA 2 2 NA 4 NA NA NA NA 4
4 NA 1 1 4 2 NA NA 2 NA NA 4
5 4 3 4 3 NA NA 5 4 NA NA 9
Related
I want to count how many times a specific value occurs across multiple columns and put the number of occurrences in a new column. My dataset has a lot of missing values but only if the entire row consists solely of NA's, it should return NA. If possible, I would prefer something that works with dplyr pipelines.
Example dataset:
df <- data.frame(c1 = sample(1:4, 20, replace = TRUE),
c2 = sample(1:4, 20, replace = TRUE),
c3 = sample(1:4, 20, replace = TRUE),
c4 = sample(1:4, 20, replace = TRUE),
c5 = sample(1:4, 20, replace = TRUE))
for (i in 1:5) {
df[sample(1:20, 1), sample(1:5, 1)] <- NA
df[sample(1:20, 1), ] <- NA
}
c1 c2 c3 c4 c5
1 1 2 4 4 1
2 2 2 1 3 4
3 2 4 4 3 3
4 4 2 3 2 1
5 4 2 4 1 3
6 NA 1 2 4 4
7 3 NA 4 NA 4
8 NA NA NA NA NA
9 1 3 3 2 2
10 NA NA NA NA NA
I have tried with rowwise() and rowSums. Some non-working examples here:
# First attempt
df <- df %>%
rowwise() %>%
mutate(count2 = sum(c_across(c1:c5, ~.x %in% 2)))
# Second attempt
df <- df %>%
rowwise() %>%
mutate(count2 = sum(c_across(select(where(c1:c5 %in% 2)))))
# With rowSums
df <- df %>%
rowwise() %>%
mutate(count4 = rowSums(select(c1:c5 %in% 4), na.rm = TRUE))
How about this:
library(dplyr)
df <- data.frame(c1 = sample(1:4, 20, replace = TRUE),
c2 = sample(1:4, 20, replace = TRUE),
c3 = sample(1:4, 20, replace = TRUE),
c4 = sample(1:4, 20, replace = TRUE),
c5 = sample(1:4, 20, replace = TRUE))
for (i in 1:5) {
df[sample(1:20, 1), sample(1:5, 1)] <- NA
df[sample(1:20, 1), ] <- NA
}
df %>%
rowwise() %>%
mutate(count2 = sum(na.omit(c_across(c1:c5)) == 2),
count2 = ifelse(all(is.na(c_across(c1:c5))), NA, count2))
#> # A tibble: 20 × 6
#> # Rowwise:
#> c1 c2 c3 c4 c5 count2
#> <int> <int> <int> <int> <int> <int>
#> 1 NA NA NA NA NA NA
#> 2 2 2 3 4 2 3
#> 3 1 1 1 4 4 0
#> 4 2 3 3 2 4 2
#> 5 NA NA NA NA NA NA
#> 6 1 1 1 2 1 1
#> 7 3 3 2 3 4 1
#> 8 1 1 4 3 4 0
#> 9 NA NA NA NA NA NA
#> 10 NA NA NA NA NA NA
#> 11 2 3 3 4 1 1
#> 12 2 1 4 2 NA 2
#> 13 4 4 2 NA 2 2
#> 14 4 2 3 3 2 2
#> 15 1 3 4 2 2 2
#> 16 1 1 3 3 2 1
#> 17 1 1 1 4 4 0
#> 18 2 4 4 NA 1 1
#> 19 NA NA NA NA NA NA
#> 20 4 1 1 NA 4 0
Created on 2022-12-08 by the reprex package (v2.0.1)
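For comparison, the same count can be sketched without rowwise() in base R. This is an illustration on a small made-up data frame, not the asker's data; cols and count2 are just example names.

```r
# Small hypothetical example with one all-NA row
df <- data.frame(c1 = c(1, 2, NA, NA),
                 c2 = c(2, 2, 1, NA),
                 c3 = c(4, NA, 2, NA))
cols <- c("c1", "c2", "c3")

# Count the 2s per row, treating NA as "not a 2"
count2 <- rowSums(df[cols] == 2, na.rm = TRUE)

# Rows that consist solely of NA get NA instead of 0
count2[rowSums(!is.na(df[cols])) == 0] <- NA
df$count2 <- unname(count2)
df
```

Because rowSums works on the whole matrix at once, this avoids evaluating an expression per row, which may matter on larger data.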
I have a data frame like this:
dt <- tibble(
TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C","D","D","D"),
RL = c(1, NA, 3, 1, 6, 3, 2, 3, 1, 0, 1.5, NA),
SL = c(6, 1.5, 1, 0, 0, 1, 1, 2, 0, 1, 1.5, NA),
HC = c(0, 1, 5, 6,7, 8, 9, 3, 4, 5, 4, 2)
)
# A tibble: 12 x 4
TRIAL RL SL HC
<chr> <dbl> <dbl> <dbl>
1 A 1 6 0
2 A NA 1.5 1
3 A 3 1 5
4 B 1 0 6
5 B 6 0 7
6 B 3 1 8
7 C 2 1 9
8 C 3 2 3
9 C 1 0 4
10 D 0 1 5
11 D 1.5 1.5 4
12 D NA NA 2
I want to group the data frame by TRIAL and check the values in RL and SL within each group: if any value in either column is greater than 5, all RL and SL values for that group should be moved to RLCT and SLCT, respectively.
# A tibble: 12 x 6
TRIAL HC RLCT SLCT SL RL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 1 6 NA NA
2 A 1 NA 1.5 NA NA
3 A 5 3 1 NA NA
4 B 6 1 0 NA NA
5 B 7 6 0 NA NA
6 B 8 3 1 NA NA
7 C 9 NA NA 1 2
8 C 3 NA NA 2 3
9 C 4 NA NA 0 1
10 D 5 NA NA 1 0
11 D 4 NA NA 1.5 1.5
12 D 2 NA NA NA NA
When I run the code below, I do not get the expected output:
dt0 <- dt %>%
mutate(RLCT = NA,
SLCT = NA) %>%
group_by(TRIAL) %>%
filter(!any(RL > 5.0 | SL > 5.0))
dt1 <- dt %>%
group_by(TRIAL) %>%
filter(any(RL > 5.0 | SL > 5.0)) %>%
mutate(RLCT = RL,
SLCT = SL) %>%
rbind(dt0, .) %>%
mutate(RL = ifelse(!is.na(RLCT), NA, RL),
SL = ifelse(!is.na(SLCT), NA, SL)) %>% arrange(TRIAL)
This is what I get
# A tibble: 9 x 6
# Groups: TRIAL [3]
TRIAL RL SL HC RLCT SLCT
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 0 1 6
2 A NA NA 1 NA 1.5
3 A NA NA 5 3 1
4 B NA NA 6 1 0
5 B NA NA 7 6 0
6 B NA NA 8 3 1
7 C 2 1 9 NA NA
8 C 3 2 3 NA NA
9 C 1 0 4 NA NA
You can define a column to store the condition, and change RL and SL with ifelse inside across.
dt %>%
group_by(TRIAL) %>%
mutate(cond = any(RL > 5.0 | SL > 5.0, na.rm = TRUE),
across(c(RL, SL), ~ ifelse(cond, ., NA), .names = "{.col}CT"),
across(c(RL, SL), ~ ifelse(!cond, ., NA)),
cond = NULL)
Result:
# A tibble: 12 x 6
# Groups: TRIAL [4]
TRIAL RL SL HC RLCT SLCT
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 0 1 6
2 A NA NA 1 NA 1.5
3 A NA NA 5 3 1
4 B NA NA 6 1 0
5 B NA NA 7 6 0
6 B NA NA 8 3 1
7 C 2 1 9 NA NA
8 C 3 2 3 NA NA
9 C 1 0 4 NA NA
10 D 0 1 5 NA NA
11 D 1.5 1.5 4 NA NA
12 D NA NA 2 NA NA
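The na.rm = TRUE in the condition above is what keeps group D. A short base R sketch of the difference, using the RL and SL values of groups D and A from the question (the vector names are made up for illustration):

```r
# Group D: no value above 5, but one row is NA in both columns
rl_d <- c(0, 1.5, NA)
sl_d <- c(1, 1.5, NA)
any(rl_d > 5 | sl_d > 5)                # NA: the missing row could be > 5
any(rl_d > 5 | sl_d > 5, na.rm = TRUE)  # FALSE

# Group A: one SL value is above 5, so the NA in RL does not matter
rl_a <- c(1, NA, 3)
sl_a <- c(6, 1.5, 1)
any(rl_a > 5 | sl_a > 5)                # TRUE
```

Since filter() only keeps rows where the condition is TRUE, a group whose condition is NA is dropped by both filter(any(...)) and filter(!any(...)), which is consistent with group D missing from the question's output.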
With dplyr, you could use group_modify():
library(dplyr)
dt %>%
group_by(TRIAL) %>%
group_modify(~ {
if(any(select(.x, c(RL, SL)) > 5, na.rm = TRUE)) {
rename_with(.x, ~ paste0(.x, 'CT'), c(RL, SL))
} else {
.x
}
})
Output
# A tibble: 12 × 6
# Groups: TRIAL [4]
TRIAL RLCT SLCT HC RL SL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 6 0 NA NA
2 A NA 1.5 1 NA NA
3 A 3 1 5 NA NA
4 B 1 0 6 NA NA
5 B 6 0 7 NA NA
6 B 3 1 8 NA NA
7 C NA NA 9 2 1
8 C NA NA 3 3 2
9 C NA NA 4 1 0
10 D NA NA 5 0 1
11 D NA NA 4 1.5 1.5
12 D NA NA 2 NA NA
I am trying to compute the row sum whenever at least one column has a value, but my attempt below is not working:
df=data.frame(
x3=c(2,NA,3,5,4,6,NA,NA,3,3),
x4=c(0,NA,NA,6,5,6,NA,0,4,2))
df$summ <- ifelse(is.na(c(df[,"x3"] & df[,"x4"])),NA,rowSums(df[,c("x3","x4")], na.rm=TRUE))
the output should be like
An alternative solution:
library(data.table)
setDT(df)[!( is.na(x3) & is.na(x4)),summ:=rowSums(.SD, na.rm = T)]
You can do :
df <- transform(df, summ = ifelse(is.na(x3) & is.na(x4), NA,
rowSums(df, na.rm = TRUE)))
df
# x3 x4 summ
#1 2 0 2
#2 NA NA NA
#3 3 NA 3
#4 5 6 11
#5 4 5 9
#6 6 6 12
#7 NA NA NA
#8 NA 0 0
#9 3 4 7
#10 3 2 5
In general for any number of columns :
cols <- c('x3', 'x4')
df <- transform(df, summ = ifelse(rowSums(is.na(df[cols])) == length(cols),
NA, rowSums(df[cols], na.rm = TRUE)))
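The general case can also be wrapped in a small helper. This is only a sketch; the function name row_sum_or_na is hypothetical. It restricts both the NA check and the sum to the selected columns, so an already existing summ column is not added into the total when the code is run a second time.

```r
df <- data.frame(x3 = c(2, NA, 3, 5, 4, 6, NA, NA, 3, 3),
                 x4 = c(0, NA, NA, 6, 5, 6, NA, 0, 4, 2))

# Row sum over `cols`, NA only where every selected column is NA
row_sum_or_na <- function(d, cols) {
  s <- rowSums(d[cols], na.rm = TRUE)
  s[rowSums(!is.na(d[cols])) == 0] <- NA
  unname(s)
}

df$summ <- row_sum_or_na(df, c("x3", "x4"))
df$summ <- row_sum_or_na(df, c("x3", "x4"))  # rerunning does not change the result
df
```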
Try the code below with rowSums + replace
df$summ <- replace(rowSums(df, na.rm = TRUE), rowSums(is.na(df)) == 2, NA)
which gives
> df
x3 x4 summ
1 2 0 2
2 NA NA NA
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
7 NA NA NA
8 NA 0 0
9 3 4 7
10 3 2 5
This is not much different from already posted answers, however, it contains some useful functions:
library(dplyr)
df %>%
rowwise() %>%
mutate(Count = ifelse(all(is.na(cur_data())), NA,
sum(c_across(everything()), na.rm = TRUE)))
# A tibble: 10 x 3
# Rowwise:
x3 x4 Count
<dbl> <dbl> <dbl>
1 2 0 2
2 NA NA NA
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
7 NA NA NA
8 NA 0 0
9 3 4 7
10 3 2 5
Say I have the following dataframe:
ABC1_old <- c(1, 5, 3, 4, 3, NA, NA, NA, NA, NA)
ABC2_old <- c(4, 2, 1, 1, 5, NA, NA, NA, NA, NA)
ABC1_adj <- c(NA, NA, NA, NA, NA, 5, 5, 1, 2, 4)
ABC2_adj <- c(NA, NA, NA, NA, NA, 3, 2, 1, 4, 2)
df <- data.frame(ABC1_old, ABC2_old, ABC1_adj, ABC2_adj)
I want to create a column that compares each ABCn_old with its corresponding ABCn_adj. (So ABC1_old would be compared against ABC1_adj, etc.) The resulting column would be called ABCn_new. The rule is: if ABCn_old is NA, fill in the corresponding value from ABCn_adj; otherwise use ABCn_old's value. The new columns would look like this:
df$ABC1_new <- c(1, 5, 3, 4, 3, 5, 5, 1, 2, 4)
df$ABC2_new <- c(4, 2, 1, 1, 5, 3, 2, 1, 4, 2)
I know a simple mutate could work here, but I would like to use some kind of tidyverse looping via purrr if possible since the dataset is much larger in reality. Any ideas for the best way to achieve this?
library(tidyverse) # for purrr::map_dfc, stringr::str_remove and dplyr::coalesce
map_dfc(split.default(df, str_remove(names(df), "_.*")), ~coalesce(!!!.x))
# A tibble: 10 x 2
ABC1 ABC2
<dbl> <dbl>
1 1 4
2 5 2
3 3 1
4 4 1
5 3 5
6 5 3
7 5 2
8 1 1
9 2 4
10 4 2
Putting it together:
df %>%
split.default(str_replace(names(.), "_.*", "_new")) %>%
map_dfc(~coalesce(!!!.x))%>%
cbind(df, .)
ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
1 1 4 NA NA 1 4
2 5 2 NA NA 5 2
3 3 1 NA NA 3 1
4 4 1 NA NA 4 1
5 3 5 NA NA 3 5
6 NA NA 5 3 5 3
7 NA NA 5 2 5 2
8 NA NA 1 1 1 1
9 NA NA 2 4 2 4
10 NA NA 4 2 4 2
Using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = c(".value", 'grp'),
names_sep = '_', values_drop_na = TRUE) %>%
select(-grp, -rn) %>%
rename_all(~ str_c(., '_new')) %>% bind_cols(df, .)
# ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
#1 1 4 NA NA 1 4
#2 5 2 NA NA 5 2
#3 3 1 NA NA 3 1
#4 4 1 NA NA 4 1
#5 3 5 NA NA 3 5
#6 NA NA 5 3 5 3
#7 NA NA 5 2 5 2
#8 NA NA 1 1 1 1
#9 NA NA 2 4 2 4
#10 NA NA 4 2 4 2
Or using dplyr
df %>%
mutate(across(ends_with('old'),
~ coalesce(., get(str_replace(cur_column(),
'old', 'adj'))), .names = "{str_remove(.col, '_old')}_new"))
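The pairing by name can also be sketched in plain base R, which may help to see what the coalesce-based versions do. This is just an illustration using the question's data; the loop variables old, adj and new are made up, and it assumes every ABCn_old has a matching ABCn_adj.

```r
df <- data.frame(ABC1_old = c(1, 5, 3, 4, 3, NA, NA, NA, NA, NA),
                 ABC2_old = c(4, 2, 1, 1, 5, NA, NA, NA, NA, NA),
                 ABC1_adj = c(NA, NA, NA, NA, NA, 5, 5, 1, 2, 4),
                 ABC2_adj = c(NA, NA, NA, NA, NA, 3, 2, 1, 4, 2))

for (old in grep("_old$", names(df), value = TRUE)) {
  adj <- sub("_old$", "_adj", old)  # name of the matching _adj column
  new <- sub("_old$", "_new", old)  # name of the column to create
  # keep the old value unless it is NA, then fall back to the adj value
  df[[new]] <- ifelse(is.na(df[[old]]), df[[adj]], df[[old]])
}
df
```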
I have a package on GitHub to solve this and similar problems. In this case we could use dplyover::across2 to apply one (or more) functions to two sets of columns, which can be selected with tidyselect. In the .names argument we can specify "{pre}" to refer to the common prefix of both sets of columns.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
mutate(across2(ends_with("_old"),
ends_with("_adj"),
~ coalesce(.x, .y),
.names = "{pre}_new"))
#> ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
#> 1 1 4 NA NA 1 4
#> 2 5 2 NA NA 5 2
#> 3 3 1 NA NA 3 1
#> 4 4 1 NA NA 4 1
#> 5 3 5 NA NA 3 5
#> 6 NA NA 5 3 5 3
#> 7 NA NA 5 2 5 2
#> 8 NA NA 1 1 1 1
#> 9 NA NA 2 4 2 4
#> 10 NA NA 4 2 4 2
Created on 2021-05-16 by the reprex package (v0.3.0)
I have a dataframe of a service. Now I need to add a column "order" and group them with the following rule:
Rule for grouping services into orders: if another service value "A" occurs within the next 5 rows after a service value "A", all rows in between get the same order ID, including the rows that have no service value. If there is no service value within the next 5 rows, the order ends and the next "A" starts a new order group.
dput(data)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
14, 15, 16), time = structure(1:15, .Label = c("13:20:01", "13:20:02",
"13:20:03", "13:20:04", "13:20:05", "13:20:06", "13:20:07", "13:20:08",
"13:20:09", "13:20:10", "13:20:11", "13:20:12", "13:20:13", "13:20:14",
"13:20:15"), class = "factor"), apples = c(2, 2, 2, 3, 3, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2), service = structure(c(NA, 1L, 1L,
NA, 1L, NA, 1L, NA, NA, NA, NA, NA, 1L, NA, 1L), .Label = "A", class = "factor")), class = "data.frame", row.names = c(NA,
-15L))
overview
id time apples service
1 13:20:01 2
2 13:20:02 2 A
3 13:20:03 2 A
4 13:20:04 3
5 13:20:05 3 A
6 13:20:06 2
7 13:20:07 2 A
8 13:20:08 2
9 13:20:09 2
10 13:20:10 2
11 13:20:11 2
12 13:20:12 2
14 13:20:13 2 A
15 13:20:14 2
16 13:20:15 2 A
This is the format I'm looking for. ID 2 to ID 8 is an order and ID 14 to ID 16.
id time apples service Order
1 13:20:01 2
2 13:20:02 2 A 1
3 13:20:03 2 A 1
4 13:20:04 3 1
5 13:20:05 3 A 1
6 13:20:06 2 1
7 13:20:07 2 A 1
8 13:20:08 2
9 13:20:09 2
10 13:20:10 2
11 13:20:11 2
12 13:20:12 2
14 13:20:13 2 A 2
15 13:20:14 2 2
16 13:20:15 2 A 2
I tried it with a for loop. I suspect there is a way to use mutate and add the "range" condition.
Thx for your help!
This is my output that is produced by the code of tspano
# A tibble: 15 x 11
id time apples service start end g0 g1 g2 g3 order
<dbl> <fct> <dbl> <fct> <dbl> <dbl> <chr> <int> <chr> <int> <int>
1 1 13:20:01 2 NA 0 3 NA 0 NA 0 NA
2 2 13:20:02 2 A 1 3 start 1 NA 0 NA
3 3 13:20:03 2 A 2 3 NA 1 NA 0 NA
4 4 13:20:04 3 NA 2 2 NA 1 NA 0 NA
5 5 13:20:05 3 A 3 2 NA 1 NA 0 NA
6 6 13:20:06 2 NA 3 1 NA 1 NA 0 NA
7 7 13:20:07 2 A 3 1 NA 1 NA 0 NA
8 8 13:20:08 2 NA 2 0 end 2 NA 0 NA
9 9 13:20:09 2 NA 2 1 NA 2 NA 0 NA
10 10 13:20:10 2 NA 1 1 NA 2 NA 0 NA
11 11 13:20:11 2 NA 1 2 NA 2 NA 0 NA
12 12 13:20:12 2 NA 0 2 NA 2 NA 0 NA
13 14 13:20:13 2 A 1 2 start 3 NA 0 NA
14 15 13:20:14 2 NA 1 1 NA 3 NA 0 NA
15 16 13:20:15 2 A 2 1 NA 3 NA 0 NA
Here is a solution using RcppRoll, which should be faster than an R for loop:
library(dplyr)
data %>%
mutate(start = RcppRoll::roll_sum(c(rep(F,4),(service=="A") %in% T), n = 5, align = "right"),
end = RcppRoll::roll_sum(c((service=="A") %in% T, rep(F,4)), n = 5, align = "left"),
g0 = case_when(start>0 & (lag(start)==0) %in% c(T,NA) ~ "start",
end ==0 ~ "end",
T ~ NA_character_)
) %>%
group_by(g1 = cumsum(!is.na(g0))) %>%
mutate(g2 = if_else(first(g0)=="end", NA_character_, "order")) %>%
ungroup() %>%
group_by(g3 = cumsum(!is.na(g2) & is.na(lag(g2))) ) %>%
mutate(order = if_else(is.na(g2), NA_integer_, g3)) %>%
ungroup() %>%
select(id, time, apples, service, order)
If you remove the last select you can see I have several intermediate results that should make the logic clear.
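For reference, the grouping rule itself can be sketched in a few lines of base R without rolling windows. This is a different technique from the RcppRoll pipeline above, not a rewrite of it: it looks at the gaps between consecutive "A" rows. The function name assign_orders is hypothetical, and it assumes that a lone "A" with no neighbour within 5 rows still forms its own one-row order.

```r
# service column from the question, as a character vector
service <- c(NA, "A", "A", NA, "A", NA, "A", NA, NA, NA, NA, NA, "A", NA, "A")

assign_orders <- function(service, max_gap = 5) {
  order_id <- rep(NA_integer_, length(service))
  pos <- which(service %in% "A")       # rows that contain a service
  if (length(pos) == 0) return(order_id)
  # a new order starts whenever the gap to the previous "A" exceeds max_gap
  cluster <- cumsum(c(TRUE, diff(pos) > max_gap))
  for (k in unique(cluster)) {
    rng <- range(pos[cluster == k])    # first and last "A" of this order
    order_id[rng[1]:rng[2]] <- k       # fill the rows in between as well
  }
  order_id
}

assign_orders(service)
```

On the example data this labels rows 2 to 7 as order 1 and rows 13 to 15 as order 2, which corresponds to the Order column in the question's desired table.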