I have a dataset with consecutive values and I would like to count how many times each sequence length occurs.
More specifically, I want to find out how many ids have a sequence running from 1:2, from 1:3, from 1:4 etc.
Only sequences starting from 1 are of interest.
In this example, id1 has a "full" sequence running from 1:3 (as the number 4 is missing), id2 has a sequence running from 1:5, id3 has a sequence running from 1:6, id4 is not counted since it does not start with a value of 1, and id5 has a sequence running from 1:3.
So we end up with two sequences up to 3, one up to 5 and one up to 6.
Is there a clever way to calculate this, without resorting to inefficient loops?
Example data:
data <- data.table( id = c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5),
value = c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7))
> data
id value
1: 1 1
2: 1 2
3: 1 3
4: 1 5
5: 2 1
6: 2 2
7: 2 3
8: 2 4
9: 2 5
10: 2 10
11: 2 11
12: 3 1
13: 3 2
14: 3 3
15: 3 4
16: 3 5
17: 3 6
18: 4 2
19: 4 3
20: 4 4
21: 4 5
22: 4 6
23: 4 7
24: 4 8
25: 5 1
26: 5 2
27: 5 3
28: 5 7
id value
out <- data[, len0 := rleid(c(TRUE, diff(value) == 1L)), by = .(id) ][
, .(value1 = first(value), len = .N), by = .(id, len0) ]
out
# id len0 value1 len
# <num> <int> <num> <int>
# 1: 1 1 1 3
# 2: 1 2 5 1
# 3: 2 1 1 5
# 4: 2 2 10 1
# 5: 2 3 11 1
# 6: 3 1 1 6
# 7: 4 1 2 7
# 8: 5 1 1 3
# 9: 5 2 7 1
Walk-through:
Within each id, len0 labels the runs of increase-by-1 steps.
Within each id and len0, summarize with the first value (useful if you only want runs starting at 1, see below) and the length of the run.
If you just want to know those whose sequences begin at one, filter on value1:
out[ value1 == 1L, ]
# id len0 value1 len
# <num> <int> <num> <int>
# 1: 1 1 1 3
# 2: 2 1 1 5
# 3: 3 1 1 6
# 4: 5 1 1 3
(I think you only need id and len at this point.)
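To get from there to the tally the question asks for (how many ids per length), grouping the filtered rows by len should do it, e.g. out[value1 == 1L, .N, by = len]. In case it helps to see the logic without data.table, here is a base R sketch of the same idea. The helper name first_run is made up for illustration, and it assumes the values within each id are sorted as in the example data.

```r
id    <- c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5)
value <- c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7)

# Length of the leading increase-by-1 run of one id (hypothetical helper).
# Returns 0 when the id does not start at 1.
first_run <- function(v) {
  if (v[1] != 1) return(0L)
  rle(c(TRUE, diff(v) == 1))$lengths[1]
}

len <- tapply(value, id, first_run)   # one run length per id
len_counts <- table(len[len > 0])     # how many ids share each length
len_counts
```

For the example data this should report two ids with length 3 and one each with lengths 5 and 6.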
Here is another option:
data[rowid(id)==value, max(value), id]
output:
id V1
1: 1 3
2: 2 5
3: 3 6
4: 5 3
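How this works, as far as I can tell: rowid(id) numbers the rows within each id, so the filter keeps the rows whose value equals its position in the group. A base R sketch of that idea follows (variable names are mine, not from the answer). One assumption worth stating: the values within an id are sorted and have no duplicates, as in the example data. With a duplicate, a later row can match its position again even though a number is missing.

```r
id    <- c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5)
value <- c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7)

pos  <- ave(value, id, FUN = seq_along)      # position of each row within its id
keep <- pos == value                         # rows where value equals position
res  <- tapply(value[keep], id[keep], max)   # largest matching value per id
res

# Caveat with duplicates: 3 is missing, yet the 4th value equals its position
dup <- c(1, 2, 4, 4)
max(dup[dup == seq_along(dup)])   # 4, although the run from 1 stops at 2
```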
library(data.table)
dt <- data.table( id = c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5),
value = c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7))
dt[, n := seq_len(.N) - value, by = id]
res <- dt[n == 0, .SD[value == max(value)], by = id][, n := NULL]
head(res)
#> id value
#> 1: 1 3
#> 2: 2 5
#> 3: 3 6
#> 4: 5 3
Created on 2021-02-04 by the reprex package (v1.0.0)
One option utilizing dplyr might be:
data %>%
group_by(id) %>%
mutate(rleid = with(rle(c(0, diff(value)) <= 1), rep(seq_along(values), lengths))) %>%
filter(rleid == 1 & min(value) == 1) %>%
summarise(value = paste(value, collapse = "")) %>%
group_by(value) %>%
summarise(n = n(),
ids = toString(id))
value n ids
<chr> <int> <chr>
1 123 2 1, 5
2 12345 1 2
3 123456 1 3
I need some help with grouping data by continuous values.
If I have this data.table
dt <- data.table::data.table( a = c(1,1,1,2,2,2,2,1,1,2), b = seq(1:10), c = seq(1:10)+1 )
a b c
1: 1 1 2
2: 1 2 3
3: 1 3 4
4: 2 4 5
5: 2 5 6
6: 2 6 7
7: 2 7 8
8: 1 8 9
9: 1 9 10
10: 2 10 11
I need a group for each run of consecutive equal values in column a. For each group I need the first (also the minimum) value of column b and the last (also the maximum) value of column c.
Like this:
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
Thank you very much for your help. I could not solve this on my own.
Probably we can try
> dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
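What rleid(a) contributes here is a run id that increases every time a changes. A base R sketch of that step, in case it helps to see it spelled out (the names grp, first and last are mine, and cc stands in for column c to avoid clashing with the c() function):

```r
a  <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 2)
b  <- 1:10
cc <- 1:10 + 1

# A new group starts on the first row and wherever a differs from the row before
grp <- cumsum(c(TRUE, diff(a) != 0))

first <- !duplicated(grp)                    # first row of each run
last  <- !duplicated(grp, fromLast = TRUE)   # last row of each run

res <- data.frame(a = a[first], b = b[first], c = cc[last])
res
```

Taking the first b and the last c relies on b and c being increasing within each run, which is what the question states.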
An option with dplyr
library(dplyr)
dt %>%
group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
summarise(across(a:b, first), c = last(c)) %>%
select(-grp)
-output
# A tibble: 4 × 3
a b c
<dbl> <int> <dbl>
1 1 1 4
2 2 4 8
3 1 8 10
4 2 10 11
I have a data.table from which I want to remove, per group, all rows up to and including the first negative number in value. However, if there is no negative number in value, I would like to keep all rows of that group.
# Example data
group = rep(1:4,each=3)
value = c(1,2,3,1,-2,3,1,2,-3,-1,2,3)
DT = data.table(group,value)
> DT
group value row_idx
1: 1 1 1
2: 1 2 2
3: 1 3 3
4: 2 1 1
5: 2 -2 2
6: 2 3 3
7: 3 1 1
8: 3 2 2
9: 3 -3 3
10: 4 -1 1
11: 4 2 2
12: 4 3 3
My attempt so far:
DT[,row_idx := seq_len(.N), by = "group"] #append row index per group
DT[,.SD[row_idx > (which(sign(value) == -1))], by = "group"]
group value row_idx
1: 2 3 3
2: 4 2 2
3: 4 3 3
In this example group 1 is being deleted although I would like to keep it as no negative number is present in this group. I can check for the presence/absence of negative signs in value by DT[,(-1) %in% sign(value), by = "group"] but I do not know how to use this to achieve what I want.
We may use an if/else condition
library(data.table)
DT[DT[, if(any(sign(value) < 0))
.I[row_idx > (which(sign(value) == -1))] else .I, by = group]$V1]
-output
group value row_idx
<int> <num> <int>
1: 1 1 1
2: 1 2 2
3: 1 3 3
4: 2 3 3
5: 4 2 2
6: 4 3 3
Or a slightly more compact option
DT[DT[, .I[seq_len(.N) > match(-1, sign(value), nomatch = 0)], group]$V1]
group value
<int> <num>
1: 1 1
2: 1 2
3: 1 3
4: 2 3
5: 4 2
6: 4 3
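# keep the whole group if it has no negative value (>= 0 so that zeros are not treated as negative)
DT[, .SD[if (min(value) >= 0) TRUE else -(1:which.max(value < 0))], by = group]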
# group value
# <int> <num>
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 3
# 5: 4 2
# 6: 4 3
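For anyone without data.table at hand, the same rule can be sketched in base R. It uses the match(..., nomatch = 0) trick from the compact option above: with no negative in the group the cut-off is 0, so every row passes. The helper name after_first_neg is made up for illustration.

```r
group <- rep(1:4, each = 3)
value <- c(1, 2, 3, 1, -2, 3, 1, 2, -3, -1, 2, 3)

# Per group: TRUE for rows strictly after the first negative value.
# match() returns 0 when the group has no negative, so all rows are kept.
after_first_neg <- function(v) seq_along(v) > match(TRUE, v < 0, nomatch = 0)

# ave() writes the logical result back into a numeric vector, hence the == 1
keep <- ave(value, group, FUN = after_first_neg) == 1
res  <- data.frame(group = group[keep], value = value[keep])
res
```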
library(data.table)
DT <- data.table(var = 1:100)
I want to create a second variable, group, that groups the values in var by n consecutive integers. So if n is equal to 1, it would return the same column as var. If n=2, it would return:
var group
1: 1 1
2: 2 1
3: 3 2
4: 4 2
5: 5 3
6: 6 3
If n=3, it would return me:
var group
1: 1 1
2: 2 1
3: 3 1
4: 4 2
5: 5 2
6: 6 2
and so on. I would like to do this as flexibly as possible.
Note that there could be repeated values:
var group
1: 1 1
2: 1 1
3: 2 1
4: 3 2
5: 3 2
6: 4 2
Here, group corresponds to n=2. Thank you!
I think we can use findInterval for this:
DT <- data.table(var = c(1L, 1:10))
n <- 2
DT[, group := findInterval(var, seq(min(var), max(var) + n, by = n))]
# var group
# <int> <int>
# 1: 1 1
# 2: 1 1
# 3: 2 1
# 4: 3 2
# 5: 4 2
# 6: 5 3
# 7: 6 3
# 8: 7 4
# 9: 8 4
# 10: 9 5
# 11: 10 5
n <- 3
DT[, group := findInterval(var, seq(min(var), max(var) + n, by = n))]
# var group
# <int> <int>
# 1: 1 1
# 2: 1 1
# 3: 2 1
# 4: 3 1
# 5: 4 2
# 6: 5 2
# 7: 6 2
# 8: 7 3
# 9: 8 3
# 10: 9 3
# 11: 10 4
(The +n in the call to seq is so that we always have a little more than we need; if we did just seq(min(.),max(.),by=n), it would be possible the highest values of var would be outside of the sequence. One could also do c(seq(min(.), max(.), by=n), Inf) for the same effect.)
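If var is always integer-valued, plain integer division should give the same groups without building the break points. This is only a sketch under that assumption, and bin_by_n is a made-up name:

```r
var <- c(1L, 1:10)

# Offset from the smallest value, cut into blocks of width n, numbered from 1
bin_by_n <- function(x, n) (x - min(x)) %/% n + 1

bin_by_n(var, 2)
bin_by_n(var, 3)

# Cross-check against the findInterval approach for n = 3
fi <- findInterval(var, seq(min(var), max(var) + 3, by = 3))
all(bin_by_n(var, 3) == fi)
```

findInterval stays the more general tool if the bins ever need to be uneven.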
I would like to create a cumulative counter that increases only when a condition is met.
DT <- data.table(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
b = c(10L, 5L, 3L, 4L, 2L, 6L, 1L, 3L, 5L, 7L))
I don't get the desired result with rleid because when the condition is met in two consecutive rows, the counter is not incremented for the second one
> DT[,count := rleid(b>=5),id]
> DT
id b count
1: 1 10 1
2: 1 5 1
3: 1 3 2
4: 1 4 2
5: 1 2 2
6: 1 6 3
7: 1 1 4
8: 2 3 1
9: 2 5 2
10: 2 7 2
The expected result is
> DT
id b count
1: 1 10 1
2: 1 5 2
3: 1 3 2
4: 1 4 2
5: 1 2 2
6: 1 6 3
7: 1 1 3
8: 2 3 1
9: 2 5 2
10: 2 7 3
Here is an option with cumsum. Grouped by 'id', take the cumulative sum of the logical expression (b >= 5). For 'id' 2, the first element that is greater than or equal to 5 is at position 2 within the group, so the cumulative sum for its first row is 0. To make this start at 1, one option is to convert the result to a factor and then coerce it to integer, which returns the factor's integer codes (these start at 1).
DT[, count := as.integer(factor(cumsum(b >= 5))), id]
-output
DT
id b count
1: 1 10 1
2: 1 5 2
3: 1 3 2
4: 1 4 2
5: 1 2 2
6: 1 6 3
7: 1 1 3
8: 2 3 1
9: 2 5 2
10: 2 7 3
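To make the intermediate step visible, here is a base R sketch of the same two stages, using ave() in place of the data.table grouping. The names raw and via_factor are mine.

```r
id <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2)
b  <- c(10L, 5L, 3L, 4L, 2L, 6L, 1L, 3L, 5L, 7L)

# Stage 1: running number of rows with b >= 5, restarted for each id
raw <- ave(as.integer(b >= 5), id, FUN = cumsum)
raw
# id 1 starts at 1 (10 >= 5), id 2 starts at 0 (3 < 5)

# Stage 2: relabel the distinct values within each id as 1, 2, 3, ...
via_factor <- ave(raw, id, FUN = function(v) as.integer(factor(v)))
via_factor
```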
Another data.table option with cumsum
> DT[, count := (v <- cumsum(b >= 5)) - v[1] + 1, id][]
id b count
1: 1 10 1
2: 1 5 2
3: 1 3 2
4: 1 4 2
5: 1 2 2
6: 1 6 3
7: 1 1 3
8: 2 3 1
9: 2 5 2
10: 2 7 3
We can also use the accumulate function from purrr for this purpose. Some notes on this solution:
accumulate takes a two-argument function as its .f argument, where .x is the previous (accumulated) value and .y is the current value in the sequence of values of vector b.
I set the initial value of count to 1 and therefore drop the first value of b, because we don't need it anymore. Each following value is checked through .y: if the condition is met, the count is increased by one, otherwise it stays as is.
library(dplyr)
library(purrr)
DT %>%
group_by(id) %>%
mutate(count = accumulate(b[-1], .init = 1,
~ if(.y >= 5) {
.x + 1
} else {
.x
}))
# A tibble: 10 x 3
# Groups: id [2]
id b count
<dbl> <int> <dbl>
1 1 10 1
2 1 5 2
3 1 3 2
4 1 4 2
5 1 2 2
6 1 6 3
7 1 1 3
8 2 3 1
9 2 5 2
10 2 7 3
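The accumulate idea above has a close base R counterpart in Reduce(..., accumulate = TRUE). A sketch, assuming the same rule (start at 1, add one for every later row with b >= 5); count_one_group and step are made-up names:

```r
id <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2)
b  <- c(10L, 5L, 3L, 4L, 2L, 6L, 1L, 3L, 5L, 7L)

count_one_group <- function(b) {
  # acc is the count so far, cur is the current value of b
  step <- function(acc, cur) if (cur >= 5) acc + 1 else acc
  # Start at 1 for the first row, then walk over the remaining values
  Reduce(step, b[-1], 1, accumulate = TRUE)
}

count <- ave(b, id, FUN = count_one_group)
count
```

For large groups the cumsum versions are likely faster, since this one calls an R function once per row.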
# have
> aDT <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA))
> aDT
colA colB
1: 1 4
2: 1 NA
3: 1 NA
4: 1 1
5: 2 4
6: 2 3
7: 2 NA
8: 2 NA
9: 3 4
10: 3 NA
11: 3 2
12: 3 NA
# want
> bDT <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,1,1,1,4,3,3,3,4,2,2,2))
> bDT
colA colB
1: 1 4
2: 1 1
3: 1 1
4: 1 1
5: 2 4
6: 2 3
7: 2 3
8: 2 3
9: 3 4
10: 3 2
11: 3 2
12: 3 2
I would like to fill missing values according to the algorithm below:
within each group ('colA'),
use the value from one row below; if that is still NA, keep going down until the last row within that group
if all rows below are NA, look at the rows above (going up one row at a time)
if all values are NA, keep NA
Since the dataset is quite large, algorithmic efficiency is a consideration. I'm not sure if there's already a package for this type of operation. How can I do it?
With data.table and zoo:
library(data.table)
library(zoo)
# Next observation carried backward: fill each NA from the rows below it
dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA]
# Last observation carried forward: fill the remaining (trailing) NAs from above;
# na.locf0 keeps the length, so an all-NA group stays NA
dt[, colB := na.locf0(colB), by = colA][]
Or in a single chain:
dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA][
, colB := na.locf0(colB), by = colA][]
Both return:
colA colB
1: 1 4
2: 1 1
3: 1 1
4: 1 1
5: 2 4
6: 2 3
7: 2 3
8: 2 3
9: 3 4
10: 3 2
11: 3 2
12: 3 2
Data:
text <- "colA colB
1 4
1 NA
1 NA
1 1
2 4
2 3
2 NA
2 NA
3 4
3 NA
3 2
3 NA"
dt <- fread(input = text, stringsAsFactors = FALSE)
Here is one way using tidyverse and zoo::na.locf:
library(tidyverse)
library(zoo)
df %>%
group_by(colA) %>%
arrange(colA) %>%
mutate(colB = na.locf(colB, na.rm = FALSE, fromLast = TRUE)) %>%
mutate(colB = na.locf(colB, na.rm = FALSE))
## A tibble: 12 x 2
## Groups: colA [3]
# colA colB
# <dbl> <dbl>
# 1 1.00 4.00
# 2 1.00 1.00
# 3 1.00 1.00
# 4 1.00 1.00
# 5 2.00 4.00
# 6 2.00 3.00
# 7 2.00 3.00
# 8 2.00 3.00
# 9 3.00 4.00
#10 3.00 2.00
#11 3.00 2.00
#12 3.00 2.00
Or the data.table way:
library(data.table)
dt[, .(na.locf(na.locf(colB, na.rm = FALSE, fromLast = TRUE), na.rm = FALSE)), by = .(colA)]
# colA V1
# 1: 1 4
# 2: 1 1
# 3: 1 1
# 4: 1 1
# 5: 2 4
# 6: 2 3
# 7: 2 3
# 8: 2 3
# 9: 3 4
#10: 3 2
#11: 3 2
#12: 3 2
The key in both cases is to apply na.locf twice: First to replace NAs from the bottom, then replace the remaining NAs from the top.
Sample data
# As data.frame
df <- data.frame(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA))
# As data.table
dt <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA))
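The two-pass idea (fill from below first, then from above) can also be sketched in base R without zoo. This is only an illustration of the logic; fill_down and fill_up_then_down are made-up names, and I have not benchmarked it against na.locf on large data.

```r
colA <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
colB <- c(4, NA, NA, 1, 4, 3, NA, NA, 4, NA, 2, NA)

# Carry the last non-NA value forward; leading NAs stay NA
fill_down <- function(v) {
  # Index of the most recent non-NA element so far (0 if none yet)
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  c(NA, v)[idx + 1]
}

fill_up_then_down <- function(v) {
  up <- rev(fill_down(rev(v)))   # pass 1: take the value from the rows below
  fill_down(up)                  # pass 2: remaining NAs take the value from above
}

filled <- ave(colB, colA, FUN = fill_up_then_down)
filled
```

A group that is entirely NA comes back unchanged, which matches the last rule in the question.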
library(tidyverse)
aDT %>% group_by(colA) %>% fill(colB, .direction = "up") %>% fill(colB)
# A tibble: 12 x 2
# Groups: colA [3]
colA colB
<dbl> <dbl>
1 1 4
2 1 1
3 1 1
4 1 1
5 2 4
6 2 3
7 2 3
8 2 3
9 3 4
10 3 2
11 3 2
12 3 2