Group variable by "n" consecutive integers in data.table - r

library(data.table)
DT <- data.table(var = 1:100)
I want to create a second variable, group, that groups the values in var by n consecutive integers. So if n = 1, it would return the same column as var. If n = 2, it would return:
var group
1: 1 1
2: 2 1
3: 3 2
4: 4 2
5: 5 3
6: 6 3
If n = 3, it would return:
var group
1: 1 1
2: 2 1
3: 3 1
4: 4 2
5: 5 2
6: 6 2
and so on. I would like to do this as flexibly as possible.
Note that there could be repeated values:
var group
1: 1 1
2: 1 1
3: 2 1
4: 3 2
5: 3 2
6: 4 2
Here, group corresponds to n=2. Thank you!

I think we can use findInterval for this:
DT <- data.table(var = c(1L, 1:10))
n <- 2
DT[, group := findInterval(var, seq(min(var), max(var) + n, by = n))]
# var group
# <int> <int>
# 1: 1 1
# 2: 1 1
# 3: 2 1
# 4: 3 2
# 5: 4 2
# 6: 5 3
# 7: 6 3
# 8: 7 4
# 9: 8 4
# 10: 9 5
# 11: 10 5
n <- 3
DT[, group := findInterval(var, seq(min(var), max(var) + n, by = n))]
# var group
# <int> <int>
# 1: 1 1
# 2: 1 1
# 3: 2 1
# 4: 3 1
# 5: 4 2
# 6: 5 2
# 7: 6 2
# 8: 7 3
# 9: 8 3
# 10: 9 3
# 11: 10 4
(The + n in the call to seq is so that we always have a little more than we need; if we did just seq(min(.), max(.), by = n), the highest values of var could fall outside the sequence. One could also do c(seq(min(.), max(.), by = n), Inf) for the same effect.)
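If var is integer-valued as in the question, a small arithmetic alternative (a sketch, not part of the original answer) avoids building the lookup sequence altogether: integer division maps each value straight to its bucket.

```r
library(data.table)

DT <- data.table(var = c(1L, 1:10))
n <- 2L
# bucket k covers min(var) + (k-1)*n ... min(var) + k*n - 1, so shifting
# to zero and integer-dividing by n gives the bucket index directly
DT[, group := (var - min(var)) %/% n + 1L]
```

This gives the same grouping as the findInterval call above for the sample data; findInterval remains the more general tool if the breaks are not evenly spaced.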


R Data Table add rows to each group if not existing [duplicate]

This question already has answers here:
data.table equivalent of tidyr::complete()
(3 answers)
Closed 29 days ago.
I have a data table with multiple groups. I'd like to fill each group with rows containing the values in vals if they are not already present. Additional columns should be filled with NAs.
DT = data.table(group = c(1,1,1,2,2,3,3,3,3), val = c(1,2,4,2,3,1,2,3,4), somethingElse = rep(1,9))
vals = data.table(val = c(1,2,3,4))
What I want:
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
The order of val does not necessarily have to be increasing; the values may also be appended at the beginning/end of each group.
I don't know how to approach this problem. I've thought about using rbindlist(...,fill = TRUE), but then the values will be simply appended.
I think some expression with DT[, lapply(...), by = c("group")] might be useful here but I have no idea how to check if a value already exists.
You can use a cross-join:
setDT(DT)[
  CJ(group = group, val = val, unique = TRUE),
  on = .(group, val)
]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
Another way to solve your problem:
DT[, .SD[vals, on="val"], by=group]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
# or
DT[CJ(group, val, unique=TRUE), on=.NATURAL]
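For readers less familiar with CJ: it builds the full cartesian grid of its inputs, and unique = TRUE deduplicates each input first (a small sketch):

```r
library(data.table)

# CJ() crosses every value of group with every value of val;
# unique = TRUE drops duplicate inputs before crossing, and the
# result comes back sorted by default
grid <- CJ(group = c(1, 1, 2), val = c(2, 1), unique = TRUE)
grid
#    group val
# 1:     1   1
# 2:     1   2
# 3:     2   1
# 4:     2   2
```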
I will just add this answer for a slightly more complex case:
#Raw Data
DT = data.table(group = c(1,1,2,2,2,3,3,3,3),
                x = c(1,2,1,3,4,1,2,3,4),
                y = c(2,4,2,6,8,2,4,6,8),
                somethingElse = rep(1,9))
#allowed combinations of x and y
DTxy = data.table(x = c(1,2,3,4), y = c(2,4,6,8))
Here, I want to add all x,y combinations from DTxy to each group from DT, if not already present.
I've written a function that works on subsets.
#function to join subsets on two columns (here: x,y)
DTxyJoin = function(.SD, xy) {
  .SD = .SD[xy, on = .(x, y)]
  return(.SD)
}
I then applied the function to each group:
#add x and y to each group if missing
DTres = DT[, DTxyJoin(.SD, DTxy), by = c("group")]
The Result:
group x y somethingElse
1: 1 1 2 1
2: 1 2 4 1
3: 1 3 6 NA
4: 1 4 8 NA
5: 2 1 2 1
6: 2 2 4 NA
7: 2 3 6 1
8: 2 4 8 1
9: 3 1 2 1
10: 3 2 4 1
11: 3 3 6 1
12: 3 4 8 1
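The same result can be had without the grouped join (a sketch; the grid name is mine): build the full table of groups crossed with the allowed (x, y) pairs once, then do a single join. Unlike CJ(group, x, y, unique = TRUE), this keeps only the pairs actually present in DTxy.

```r
library(data.table)

DT <- data.table(group = c(1,1,2,2,2,3,3,3,3),
                 x = c(1,2,1,3,4,1,2,3,4),
                 y = c(2,4,2,6,8,2,4,6,8),
                 somethingElse = rep(1, 9))
DTxy <- data.table(x = c(1,2,3,4), y = c(2,4,6,8))

# cross the unique groups with the allowed (x, y) pairs only ...
grid <- unique(DT[, .(group)])[, DTxy[], by = group]
# ... then right-join DT onto that grid; missing combinations get NA
res <- DT[grid, on = .(group, x, y)]
```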

R data.table only perform operation on group if condition is met

I have a data.table from which I want to remove, per group, all rows up to and including the first negative number in value. However, if there is no negative number in value, I would like to keep all rows of that group.
# Example data
group = rep(1:4,each=3)
value = c(1,2,3,1,-2,3,1,2,-3,-1,2,3)
DT = data.table(group,value)
> DT # shown with the row_idx column appended in my attempt below
group value row_idx
1: 1 1 1
2: 1 2 2
3: 1 3 3
4: 2 1 1
5: 2 -2 2
6: 2 3 3
7: 3 1 1
8: 3 2 2
9: 3 -3 3
10: 4 -1 1
11: 4 2 2
12: 4 3 3
My attempt so far:
DT[,row_idx := seq_len(.N), by = "group"] #append row index per group
DT[,.SD[row_idx > (which(sign(value) == -1))], by = "group"]
group value row_idx
1: 2 3 3
2: 4 2 2
3: 4 3 3
In this example group 1 is being deleted although I would like to keep it as no negative number is present in this group. I can check for the presence/absence of negative signs in value by DT[,(-1) %in% sign(value), by = "group"] but I do not know how to use this to achieve what I want.
We may use an if/else condition:
library(data.table)
DT[DT[, if (any(sign(value) < 0))
          .I[row_idx > which(sign(value) == -1)] else .I,
        by = group]$V1]
-output
group value row_idx
<int> <num> <int>
1: 1 1 1
2: 1 2 2
3: 1 3 3
4: 2 3 3
5: 4 2 2
6: 4 3 3
Or a slightly more compact option:
DT[DT[, .I[seq_len(.N) > match(-1, sign(value), nomatch = 0)], group]$V1]
group value
<int> <num>
1: 1 1
2: 1 2
3: 1 3
4: 2 3
5: 4 2
6: 4 3
DT[, .SD[if (min(value) > 0) TRUE else -(1:which.max(value < 0))], by = group]
# group value
# <int> <num>
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 3
# 5: 4 2
# 6: 4 3
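One more way to phrase the condition (a sketch, not from the answers above): cummax(value < 0) turns to 1 at the first negative row and stays there, and shifting it down one row keeps only the rows strictly after that point; the any() fallback keeps whole groups that contain no negative.

```r
library(data.table)

group <- rep(1:4, each = 3)
value <- c(1,2,3, 1,-2,3, 1,2,-3, -1,2,3)
DT <- data.table(group, value)

# shift(cummax(...)) is 1 from the row *after* the first negative onward
res <- DT[, .SD[shift(cummax(value < 0), fill = 0) == 1 | !any(value < 0)],
          by = group]
res
#    group value
# 1:     1     1
# 2:     1     2
# 3:     1     3
# 4:     2     3
# 5:     4     2
# 6:     4     3
```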

Count length of sequential consecutive values per group in R

I have a dataset with consecutive values and I would like to know the count of how many times each length occurs.
More specifically, I want to find out how many id's have a sequence running from 1:2, from 1:3, from 1:4 etc.
Only sequences starting from 1 are of interest.
In this example, id1 would have a "full" sequence running from 1:3 (as the number 4 is missing), id2 has a sequence running from 1:5, id3 has a sequence running from 1:6, id4 is not counted since it does not start with a value of 1, and id5 has a sequence running from 1:3.
So we end up with two sequences until 3, one until 5 and one until 6.
Is there a clever way to calculate this, without resorting to inefficient loops?
Example data:
data <- data.table(id = c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5),
                   value = c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7))
> data
id value
1: 1 1
2: 1 2
3: 1 3
4: 1 5
5: 2 1
6: 2 2
7: 2 3
8: 2 4
9: 2 5
10: 2 10
11: 2 11
12: 3 1
13: 3 2
14: 3 3
15: 3 4
16: 3 5
17: 3 6
18: 4 2
19: 4 3
20: 4 4
21: 4 5
22: 4 6
23: 4 7
24: 4 8
25: 5 1
26: 5 2
27: 5 3
28: 5 7
id value
out <- data[, len0 := rleid(c(TRUE, diff(value) == 1L)), by = .(id)][
  , .(value1 = first(value), len = .N), by = .(id, len0)]
out
# id len0 value1 len
# <num> <int> <num> <int>
# 1: 1 1 1 3
# 2: 1 2 5 1
# 3: 2 1 1 5
# 4: 2 2 10 1
# 5: 2 3 11 1
# 6: 3 1 1 6
# 7: 4 1 2 7
# 8: 5 1 1 3
# 9: 5 2 7 1
Walk-through:
within each id, the len0 is created to identify the increase-by-1 steps
within id,len0, summarize with the first value (in case you only want those starting at 1, see below) and the length of the run
If you just want to know those whose sequences begin at one, filter on value1:
out[ value1 == 1L, ]
# id len0 value1 len
# <num> <int> <num> <int>
# 1: 1 1 1 3
# 2: 2 1 1 5
# 3: 3 1 1 6
# 4: 5 1 1 3
(I think you only need id and len at this point.)
Here is another option:
data[rowid(id)==value, max(value), id]
output:
id V1
1: 1 3
2: 2 5
3: 3 6
4: 5 3
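For what it's worth, my reading of why this works (it assumes value is sorted within each id, as in the example): rowid(id) is the running row number inside the group, and it equals value exactly on the unbroken prefix 1, 2, 3, ..., so max(value) over the matching rows is the length of that prefix.

```r
library(data.table)

data <- data.table(id = c(1,1,1,1, 4,4,4),
                   value = c(1,2,3,5, 2,3,4))
# id 1: rowid = 1,2,3,4 vs value = 1,2,3,5 -> matches on the prefix 1,2,3
# id 4: rowid = 1,2,3   vs value = 2,3,4   -> never matches, so id 4 drops out
res <- data[rowid(id) == value, max(value), id]
res
#    id V1
# 1:  1  3
```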
library(data.table)
dt <- data.table(id = c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5),
                 value = c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7))
dt[, n := seq_len(.N) - value, by = id]
res <- dt[n == 0, .SD[value == max(value)], by = id][, n := NULL]
head(res)
#> id value
#> 1: 1 3
#> 2: 2 5
#> 3: 3 6
#> 4: 5 3
Created on 2021-02-04 by the reprex package (v1.0.0)
One option utilizing dplyr might be:
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(rleid = with(rle(c(0, diff(value)) <= 1), rep(seq_along(values), lengths))) %>%
  filter(rleid == 1 & min(value) == 1) %>%
  summarise(value = paste(value, collapse = "")) %>%
  group_by(value) %>%
  summarise(n = n(),
            ids = toString(id))
value n ids
<chr> <int> <chr>
1 123 2 1, 5
2 12345 1 2
3 123456 1 3

R data.table how to create duplicates [duplicate]

This question already has answers here:
Repeat rows of a data.frame N times
(10 answers)
Closed 3 years ago.
I have:
dataDT <- data.table(A = 1:3, B = 1:3)
dataDT
A B
1: 1 1
2: 2 2
3: 3 3
I want:
dataDT <- data.table(A = c(1:3, 1:3), B = c(1:3, 1:3))
dataDT
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
i.e. create x duplicate copies and append them after the bottom row.
I've tried (results aren't what I need):
dataDT1 <- splitstackshape::expandRows(dataset = dataDT, count = 2, count.is.col = FALSE) # order not correct
dataDT1
A B
1: 1 1
2: 1 1
3: 2 2
4: 2 2
5: 3 3
6: 3 3
Also (results aren't what I need):
dataDT2 <- rbindlist(list(rep(dataDT, 2))) # duplicates the columns instead of the rows
dataDT2
A B A B
1: 1 1 1 1
2: 2 2 2 2
3: 3 3 3 3
Can anyone recommend a correct and efficient way of doing it?
You can do it with rep:
> x = 2; dataDT[rep(seq_len(nrow(dataDT)), x), ]
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
or with rbindlist and replicate:
> x = 2; rbindlist(replicate(x, dataDT, simplify = F))
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
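The difference between the rep() answer and the expandRows attempt above comes down to rep()'s times versus each arguments (a small sketch):

```r
library(data.table)

dataDT <- data.table(A = 1:3, B = 1:3)
x <- 2
# times = x repeats the whole index vector: 1,2,3,1,2,3 (copies appended below)
stacked <- dataDT[rep(seq_len(.N), times = x)]
# each = x repeats every index in place: 1,1,2,2,3,3 (the expandRows order)
interleaved <- dataDT[rep(seq_len(.N), each = x)]
```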

from two lists to one by binding elements

I have two lists with two elements each,
l1 <- list(data.table(id=1:5, group=1), data.table(id=1:5, group=1))
l2 <- list(data.table(id=1:5, group=2), data.table(id=1:5, group=2))
and I would like to rbind(.) both elements, resulting in a new list with two elements.
> l
[[1]]
id group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 1 2
7: 2 2
8: 3 2
9: 4 2
10: 5 2
[[2]]
id group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 1 2
7: 2 2
8: 3 2
9: 4 2
10: 5 2
However, I only find examples where rbind(.) is applied to bind across elements. I suspect that the solution lies somewhere in lapply(.) but lapply(c(l1,l2),rbind) appears to bind the lists, producing a list of four elements.
You can use mapply or Map. mapply (which stands for multivariate apply) applies the supplied function to the first elements of the arguments, then the second, then the third, and so on. Map is quite literally a wrapper around mapply that does not try to simplify the result (try running mapply with and without SIMPLIFY = T). Shorter arguments are recycled as necessary.
mapply(x=l1, y=l2, function(x,y) rbind(x,y), SIMPLIFY = F)
#[[1]]
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 1 2
# 7: 2 2
# 8: 3 2
# 9: 4 2
#10: 5 2
#
#[[2]]
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 1 2
# 7: 2 2
# 8: 3 2
# 9: 4 2
#10: 5 2
As @Parfait pointed out, you can do this with Map:
Map(rbind, l1, l2)
#[[1]]
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 1 2
# 7: 2 2
# 8: 3 2
# 9: 4 2
#10: 5 2
#
#[[2]]
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 1 2
# 7: 2 2
# 8: 3 2
# 9: 4 2
#10: 5 2
Using tidyverse
library(tidyverse)
map2(l1, l2, bind_rows)
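One detail of Map/mapply worth knowing here (a sketch): shorter arguments are recycled, so a one-element list is reused against every element of the longer list.

```r
library(data.table)

l1 <- list(data.table(id = 1:2, group = 1))
l2 <- list(data.table(id = 1:2, group = 2),
           data.table(id = 1:2, group = 3))
# l1 has length 1, l2 has length 2: the single element of l1 is recycled,
# so the result is a list of two stacked tables
res <- Map(rbind, l1, l2)
length(res)
# [1] 2
```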