Incremental counter within a data frame only when a condition is met in R

I would like to create a cumulative counter that increases only when a condition is met.
library(data.table)
DT <- data.table(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 b = c(10L, 5L, 3L, 4L, 2L, 6L, 1L, 3L, 5L, 7L))
I don't get the desired result with rleid because, when the condition is met in two consecutive rows, the counter is not incremented for the second row.
> DT[,count := rleid(b>=5),id]
> DT
id b count
1: 1 10 1
2: 1 5 1
3: 1 3 2
4: 1 4 2
5: 1 2 2
6: 1 6 3
7: 1 1 4
8: 2 3 1
9: 2 5 2
10: 2 7 2
The expected result is
> DT
id b count
1: 1 10 1
2: 1 5 2
3: 1 3 2
4: 1 4 2
5: 1 2 2
6: 1 6 3
7: 1 1 3
8: 2 3 1
9: 2 5 2
10: 2 7 3

Here is an option with cumsum. Grouped by 'id', take the cumulative sum of the logical expression (b >= 5). For 'id' 2, the first element that is greater than or equal to 5 is at position 2 within the group, so the first row gets 0. To make this 1, one option is to convert the cumulative sum to a factor and then coerce it to integer, which returns the integer codes of the factor levels (these start at 1).
DT[, count := as.integer(factor(cumsum(b >= 5))), id]
-output
DT
id b count
1: 1 10 1
2: 1 5 2
3: 1 3 2
4: 1 4 2
5: 1 2 2
6: 1 6 3
7: 1 1 3
8: 2 3 1
9: 2 5 2
10: 2 7 3
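To see why the renumbering step is needed, it can help to look at the raw cumulative sum per group. This is a minimal base R sketch rather than the data.table code above: it assumes plain vectors and uses ave() in place of the grouped `:=`. The names `raw` and `count` are only illustrative.

```r
# Same data as in the question, as plain vectors
id <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2)
b  <- c(10L, 5L, 3L, 4L, 2L, 6L, 1L, 3L, 5L, 7L)

# Raw cumulative count of rows meeting the condition, within each id.
# id 1 starts at 1 (its first row meets the condition), id 2 starts at 0.
raw <- ave(as.integer(b >= 5), id, FUN = cumsum)

# Renumber the distinct values within each id so that every group starts at 1
count <- ave(raw, id, FUN = function(v) as.integer(factor(v)))

print(data.frame(id, b, raw, count))
```

The factor step maps the smallest value in each group to 1, whatever that value is, which is why both groups end up starting at 1.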

Another data.table option with cumsum
> DT[, count := (v <- cumsum(b >= 5)) - v[1] + 1, id][]
id b count
1: 1 10 1
2: 1 5 2
3: 1 3 2
4: 1 4 2
5: 1 2 2
6: 1 6 3
7: 1 1 3
8: 2 3 1
9: 2 5 2
10: 2 7 3

We can also use the accumulate function from purrr for this purpose. Some notes on this solution:
accumulate takes a two-argument function as its .f argument, where .x is the previous (accumulated) value and .y is the current value in the sequence of values of vector b.
I set the initial value of count to 1 via .init, so I drop the first value of b because it is no longer needed. Each following value is checked through .y: if the condition is met, the accumulated value is increased by one; otherwise it stays as is.
library(dplyr)
library(purrr)
DT %>%
  group_by(id) %>%
  mutate(count = accumulate(b[-1], .init = 1,
                            ~ if (.y >= 5) {
                              .x + 1
                            } else {
                              .x
                            }))
# A tibble: 10 x 3
# Groups: id [2]
id b count
<dbl> <int> <dbl>
1 1 10 1
2 1 5 2
3 1 3 2
4 1 4 2
5 1 2 2
6 1 6 3
7 1 1 3
8 2 3 1
9 2 5 2
10 2 7 3
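For readers without purrr, the same fold can be sketched in base R with Reduce(..., accumulate = TRUE). This is only an illustration on the three values of id 2, and the helper name `step` is made up for the example.

```r
b <- c(3L, 5L, 7L)  # values of b for id 2 in the question

# Two-argument step function: prev plays the role of .x, cur the role of .y
step <- function(prev, cur) if (cur >= 5) prev + 1 else prev

# Start at 1 and fold over the remaining values, keeping every intermediate result
count <- Reduce(step, b[-1], init = 1, accumulate = TRUE)
print(count)
```

As in the purrr version, the first row always gets 1 whether or not it meets the condition, which matches the expected output in the question.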

Related

R data.table group by continuous values

I need some help with grouping data by continuous values.
If I have this data.table
dt <- data.table::data.table( a = c(1,1,1,2,2,2,2,1,1,2), b = seq(1:10), c = seq(1:10)+1 )
a b c
1: 1 1 2
2: 1 2 3
3: 1 3 4
4: 2 4 5
5: 2 5 6
6: 2 6 7
7: 2 7 8
8: 1 8 9
9: 1 9 10
10: 2 10 11
I need a group for each run of consecutive equal values in column a. For each group I need the first (also the minimum possible) value of column b and the last (also the maximum possible) value of column c.
Like this:
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
Thank you very much for your help. I cannot solve it on my own.
Probably we can try
> dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
An option with dplyr
library(dplyr)
dt %>%
group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
summarise(across(a:b, first), c = last(c)) %>%
select(-grp)
-output
# A tibble: 4 × 3
a b c
<dbl> <int> <dbl>
1 1 1 4
2 2 4 8
3 1 8 10
4 2 10 11
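Both answers rely on a run id that changes whenever the value of a changes. A small base R sketch of two equivalent ways to build it, assuming plain vectors instead of a data.table (the names `run_rle`, `run_cumsum` and `c_col` are illustrative only):

```r
a     <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 2)
b     <- 1:10
c_col <- 1:10 + 1   # column c from the question, renamed to avoid confusion with c()

# Run id via run-length encoding: one id per run, repeated by the run length
r <- rle(a)
run_rle <- rep(seq_along(r$lengths), r$lengths)

# Run id via change points: a new run starts wherever the value changes
run_cumsum <- cumsum(c(TRUE, diff(a) != 0))

# First b and last c within each run
first_b <- tapply(b, run_rle, function(x) x[1])
last_c  <- tapply(c_col, run_rle, function(x) x[length(x)])
print(cbind(first_b, last_c))
```

Grouping by a alone would merge the two separate runs of 1 (and of 2), which is why a run id is needed here.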

Find number of observations until a specific word is found

Say I have the following data.table:
library(data.table)
DT <- data.table(
ID = rep(c(1,2,3),4),
day = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3)),
Status = c(rep('A',3),'A','B','B','A','C','B','A','D','C')
)
What I would like to achieve is, for each ID, to find the number of observations (in this case, if sorted by day, the number of days) it takes to hit a specific Status. So if I need to do this for Status C, the result would be:
0 for ID 1 (since doesn't contain status C), 3 for ID 2, and 4 for ID 3.
The only way that came to my mind was to write a function with nested for loops, but I am sure there are much better, faster and more efficient ways.
Appreciate any help.
A possible data.table approach adding one column for the number of days to reach each status (0 if never reached):
library(data.table)
## status id's
status_ids <- unique(DT$Status)
status_cols <- paste("status", status_ids, sep = "_")
## add one column for each status id
setorder(DT, ID, day)
DT[, (status_cols) := lapply(status_ids, \(s) ifelse(any(Status == s), min(day[Status == s]), 0)), by = "ID"]
DT
#> ID day Status status_A status_B status_C status_D
#> 1: 1 1 A 1 0 0 0
#> 2: 1 2 A 1 0 0 0
#> 3: 1 3 A 1 0 0 0
#> 4: 1 4 A 1 0 0 0
#> 5: 2 1 A 1 2 3 4
#> 6: 2 2 B 1 2 3 4
#> 7: 2 3 C 1 2 3 4
#> 8: 2 4 D 1 2 3 4
#> 9: 3 1 A 1 2 4 0
#> 10: 3 2 B 1 2 4 0
#> 11: 3 3 B 1 2 4 0
#> 12: 3 4 C 1 2 4 0
You can split by ID and return the first match of day.
sapply(split(DT[,2:3], DT$ID), \(x) x$day[match("C", x$Status)])
# 1 2 3
#NA 3 4
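The split/match idea returns NA for an ID that never reaches the status, while the question asks for 0. A hedged base R sketch that sorts first and then replaces the NA, using plain vectors (the names `o` and `first_c` are illustrative):

```r
ID     <- rep(c(1, 2, 3), 4)
day    <- rep(1:4, each = 3)
Status <- c(rep("A", 3), "A", "B", "B", "A", "C", "B", "A", "D", "C")

# Order by ID and day so that the first match is the earliest day
o <- order(ID, day)

# For each ID, the day of the first row with Status "C" (NA if there is none)
first_c <- sapply(split(data.frame(day, Status)[o, ], ID[o]),
                  function(x) x$day[match("C", x$Status)])

# The question asks for 0 when the status never occurs
first_c[is.na(first_c)] <- 0
print(first_c)
```

match() returns only the position of the first hit, so repeated occurrences of the status later in the group do not affect the result.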
Does this work:
library(dplyr)
library(tidyr)  # provides replace_na()
DT %>% left_join(
DT %>% group_by(ID) %>% summarise(col = row_number()[Status == 'C'])
) %>% replace_na(list(col= 0))
`summarise()` has grouped output by 'ID'. You can override using the
`.groups` argument.
Joining, by = "ID"
ID day Status col
1: 1 1 A 0
2: 2 1 A 3
3: 3 1 A 4
4: 1 2 A 0
5: 2 2 B 3
6: 3 2 B 4
7: 1 3 A 0
8: 2 3 C 3
9: 3 3 B 4
10: 1 4 A 0
11: 2 4 D 3
12: 3 4 C 4

Count length of sequential consecutive values per group in R

I have a dataset with consecutive values and I would like to know the count of how many times each length occurs.
More specifically, I want to find out how many id's have a sequence running from 1:2, from 1:3, from 1:4 etc.
Only sequences starting from 1 are of interest.
In this example, id1 would have a "full" sequence running from 1:3 (as the number 4 is missing), id2 has a sequence running from 1:5, id3 has a sequence running from 1:6, id4 is not counted since it does not start with a value of 1 and id 5 has a sequence running from 1:3.
So we end up with two sequences until 3, one until 5 and one until 6.
Is there a clever way to calculate this, without resorting to inefficient loops?
Example data:
data <- data.table( id = c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5),
value = c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7))
> data
id value
1: 1 1
2: 1 2
3: 1 3
4: 1 5
5: 2 1
6: 2 2
7: 2 3
8: 2 4
9: 2 5
10: 2 10
11: 2 11
12: 3 1
13: 3 2
14: 3 3
15: 3 4
16: 3 5
17: 3 6
18: 4 2
19: 4 3
20: 4 4
21: 4 5
22: 4 6
23: 4 7
24: 4 8
25: 5 1
26: 5 2
27: 5 3
28: 5 7
id value
out <- data[, len0 := rleid(c(TRUE, diff(value) == 1L)), by = .(id) ][
, .(value1 = first(value), len = .N), by = .(id, len0) ]
out
# id len0 value1 len
# <num> <int> <num> <int>
# 1: 1 1 1 3
# 2: 1 2 5 1
# 3: 2 1 1 5
# 4: 2 2 10 1
# 5: 2 3 11 1
# 6: 3 1 1 6
# 7: 4 1 2 7
# 8: 5 1 1 3
# 9: 5 2 7 1
Walk-through:
within each id, the len0 is created to identify the increase-by-1 steps
within id,len0, summarize with the first value (in case you only want those starting at 1, see below) and the length of the run
If you just want to know those whose sequences begin at one, filter on value1:
out[ value1 == 1L, ]
# id len0 value1 len
# <num> <int> <num> <int>
# 1: 1 1 1 3
# 2: 2 1 1 5
# 3: 3 1 1 6
# 4: 5 1 1 3
(I think you only need id and len at this point.)
Here is another option:
data[rowid(id)==value, max(value), id]
output:
id V1
1: 1 3
2: 2 5
3: 3 6
4: 5 3
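The rowid(id) == value filter works because, in an unbroken 1, 2, 3, ... prefix, each value equals its position within the group. A base R sketch of that idea, which also produces the per-length counts the question asks for. It assumes the values are strictly increasing positive integers within each id, and the names `pos`, `keep` and `len` are illustrative.

```r
id    <- c(1,1,1,1, 2,2,2,2,2,2,2, 3,3,3,3,3,3, 4,4,4,4,4,4,4, 5,5,5,5)
value <- c(1,2,3,5, 1,2,3,4,5,10,11, 1,2,3,4,5,6, 2,3,4,5,6,7,8, 1,2,3,7)

# Position of each row within its id (what rowid(id) gives in data.table)
pos <- ave(value, id, FUN = seq_along)

# A row belongs to an unbroken 1, 2, 3, ... prefix only if value equals position
keep <- pos == value

# Longest such prefix per id; ids that never match (here id 4) drop out
len <- tapply(value[keep], id[keep], max)

# How many ids reach each length
tab <- table(len)
print(tab)
```

If the values can repeat or are not sorted within an id, the position test is no longer sufficient and the run-based approach shown earlier is the safer choice.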
library(data.table)
dt <- data.table( id = c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5),
value = c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7))
dt[, n := seq_len(.N) - value, by = id]
res <- dt[n == 0, .SD[value == max(value)], by = id][, n := NULL]
head(res)
#> id value
#> 1: 1 3
#> 2: 2 5
#> 3: 3 6
#> 4: 5 3
Created on 2021-02-04 by the reprex package (v1.0.0)
One option utilizing dplyr might be:
data %>%
group_by(id) %>%
mutate(rleid = with(rle(c(0, diff(value)) <= 1), rep(seq_along(values), lengths))) %>%
filter(rleid == 1 & min(value) == 1) %>%
summarise(value = paste(value, collapse = "")) %>%
group_by(value) %>%
summarise(n = n(),
ids = toString(id))
value n ids
<chr> <int> <chr>
1 123 2 1, 5
2 12345 1 2
3 123456 1 3

Lag R data.table with a group condition

I have data like this:
test <- data.frame(id = c(1,2,1,5,5,5,6),
time = c(0,1,4,5,6,7,9),
cond = c("a","a","b","a","b","b","b"),
value = c(5,3,2,4,0,3,1),
stringsAsFactors=F)
setDT(test)[,order := order(time),id][order(id,order)]
id time cond value order
1 0 a 5 1
2 1 a 3 1
1 4 b 2 2
5 5 a 4 1
5 6 b 0 2
5 7 b 3 3
6 9 b 1 1
The data.table function creates a column "order" which is the order of time based on the group id.
I would like to create a column which returns the previous value, but only where the condition is "b". If the condition is "a", return the current value; if the condition is "b" and the previous condition is also "b", skip back to the nearest preceding non-"b" row. If the first condition of a group is "b", return NA.
Desired output would be like this:
id time cond value order prev
1 0 a 5 1 5
2 1 a 3 1 3
1 4 b 2 2 5
5 5 a 4 1 4
5 6 b 0 2 4
5 7 b 3 3 4
6 9 b 1 1 NA
I've tried some functions like this but only returned NAs.
test[, prev := shift(value[cond == 'b']), .(id,order)]
If I understood the problem correctly, one option could be:
library(data.table)
setDT(test)[, order := order(time), id][order(id, order)]
test[, prev := {
frst <- ifelse(cond[1] == "a", value[1],
ifelse(cond[1] == "b", NA, cond[1]))
prev <- as.integer(ifelse(cond == "b" & shift(cond) == "b",
NA,
c(frst, shift(value)[-1])))
}, by = id][cond == "b", prev := zoo::na.locf(prev), by = id]
Output:
id time cond value order prev
1: 1 0 a 5 1 5
2: 1 4 b 2 2 5
3: 2 1 a 3 1 3
4: 5 5 a 4 1 4
5: 5 6 b 0 2 4
6: 5 7 b 3 3 4
7: 6 9 b 1 1 NA
If you assign the non-b values first, zoo::na.locf can do the rest (fill the b (NA) values downwards).
library(zoo)
test[cond != 'b', prev := value]
test[, prev := na.locf(prev, na.rm = FALSE), id]  # na.rm = FALSE keeps leading NAs, so each group keeps its length
test
# id time cond value order prev
# 1: 1 0 a 5 1 5
# 2: 2 1 a 3 1 3
# 3: 1 4 b 2 2 5
# 4: 5 5 a 4 1 4
# 5: 5 6 b 0 2 4
# 6: 5 7 b 3 3 4
# 7: 6 9 b 1 1 NA
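The two steps (assign the non-b values, then carry them forward within each id) can also be sketched without zoo. This is a base R illustration on plain vectors, and `locf` is a hypothetical helper written for this example, not a function from any package.

```r
id    <- c(1, 2, 1, 5, 5, 5, 6)
cond  <- c("a", "a", "b", "a", "b", "b", "b")
value <- c(5, 3, 2, 4, 0, 3, 1)

# Step 1: keep the value on non-"b" rows, NA elsewhere
prev <- ifelse(cond != "b", value, NA)

# Step 2: carry the last non-NA value forward; leading NAs stay NA
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # index of the most recent non-NA value (0 if none yet)
  c(NA, x[!is.na(x)])[idx + 1]    # shift by one so that index 0 maps to NA
}

# Apply within each id (the rows are already in time order here)
prev <- ave(prev, id, FUN = locf)
print(data.frame(id, cond, value, prev))
```

Because the helper always returns a vector of the same length as its input, a group that starts with "b" (id 6) simply keeps NA.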

Inserting a count field for each row by a grouping variable

I have a data set with observations that are both grouped and ordered (by rank). I'd like to add a third variable that is a count of the number of observations for each grouping variable. I'm aware of ways to group and count variables but I can't find a way to re-insert these counts back into the original data set, which has more rows. I'd like to get the variable C in the example table below.
A B C
1 1 3
1 2 3
1 3 3
2 1 4
2 2 4
2 3 4
2 4 4
Here's one way using ave:
DF <- within(DF, {C <- ave(A, A, FUN=length)})
# A B C
# 1 1 1 3
# 2 1 2 3
# 3 1 3 3
# 4 2 1 4
# 5 2 2 4
# 6 2 3 4
# 7 2 4 4
Here is one approach using data.table that makes use of .N, which the data.table help file describes as follows: ".N is an integer, length 1, containing the number of rows in the group."
> library(data.table)
> DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))
> DT
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 2 4
> DT[, C := .N, by = "A"]
> DT
A B C
1: 1 1 3
2: 1 2 3
3: 1 3 3
4: 2 1 4
5: 2 2 4
6: 2 3 4
7: 2 4 4
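Another way to picture what both answers do is a lookup: count the rows per group once, then look up each row's group in that table. A minimal base R sketch, assuming a data frame `DF` built to match the example in the question (the name `n_per_group` is illustrative):

```r
DF <- data.frame(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))

# Count rows per value of A, then look each row's group up in that table
n_per_group <- table(DF$A)
DF$C <- as.integer(n_per_group[as.character(DF$A)])

print(DF)
```

The result has one count per original row, so the data set keeps all of its rows, which is the part the question was stuck on.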
