R find intervals in data.table

I want to add a new column with intervals or breakpoints by group. As an example, this is my data.table:
library(data.table)
x <- data.table(a = c(1:8, 1:8), b = c(rep("A", 8), rep("B", 8)))
I already have the breakpoints (row indices) per group:
pos <- data.table(b = c("A", "A", "B", "B"), bp = c(3, 5, 2, 4))
For group "A" alone I can find the intervals with:
findInterval(1:nrow(x[b == "A"]), pos[b == "A"]$bp)
How can I do this for each group, in this case "A" and "B"?

One option is to split both datasets by the 'b' column, use Map to loop over the corresponding pairs of subsets, and apply findInterval:
Map(function(u, v) findInterval(seq_len(nrow(u)), v$bp),
    split(x, x$b), split(pos, pos$b))
#$A
#[1] 0 0 1 1 2 2 2 2
#$B
#[1] 0 1 1 2 2 2 2 2
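If the goal is a new column in x rather than a list, one possible follow-up (assuming x is already ordered by b, which matches the alphabetical group order returned by split()):
res <- Map(function(u, v) findInterval(seq_len(nrow(u)), v$bp),
           split(x, x$b), split(pos, pos$b))
x[, intvl := unlist(res)]  # per-group results concatenated in row order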
Another option is to group by 'b' in 'x' and call findInterval, subsetting 'bp' from 'pos' with a logical condition built from .BY:
x[, findInterval(seq_len(.N), pos$bp[pos$b==.BY]), b]
# b V1
# 1: A 0
# 2: A 0
# 3: A 1
# 4: A 1
# 5: A 2
# 6: A 2
# 7: A 2
# 8: A 2
# 9: B 0
#10: B 1
#11: B 1
#12: B 2
#13: B 2
#14: B 2
#15: B 2
#16: B 2
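To store the result as a column rather than return a summary table, the same logic works with := (a minimal variant; .BY$b is the current group's value):
x[, intvl := findInterval(seq_len(.N), pos$bp[pos$b == .BY$b]), by = b]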

Another option is a rolling join in data.table:
pos[, ri := rowid(b)]  # ri numbers the breakpoints within each group
# roll = Inf matches each 'a' to the last breakpoint at or before it;
# fcoalesce() turns the NA before the first breakpoint into 0L
x[, intvl := fcoalesce(pos[x, on=.(b, bp=a), roll=Inf, ri], 0L)]
output:
a b intvl
1: 1 A 0
2: 2 A 0
3: 3 A 1
4: 4 A 1
5: 5 A 2
6: 6 A 2
7: 7 A 2
8: 8 A 2
9: 1 B 0
10: 2 B 1
11: 3 B 1
12: 4 B 2
13: 5 B 2
14: 6 B 2
15: 7 B 2
16: 8 B 2
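As a quick sanity check, this matches the findInterval() result from the question for group "A":
identical(x[b == "A", intvl], findInterval(1:8, pos[b == "A", bp]))
# [1] TRUE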

We can nest the pos data into a list-column by b, join with x, and use findInterval within each group.
library(dplyr)
pos %>%
  tidyr::nest(data = bp) %>%
  right_join(x, by = 'b') %>%
  group_by(b) %>%
  mutate(interval = findInterval(a, data[[1]][[1]])) %>%
  select(-data)
# b a interval
# <chr> <int> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 A 4 1
# 5 A 5 2
# 6 A 6 2
# 7 A 7 2
# 8 A 8 2
# 9 B 1 0
#10 B 2 1
#11 B 3 1
#12 B 4 2
#13 B 5 2
#14 B 6 2
#15 B 7 2
#16 B 8 2
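An alternative sketch that avoids the list-column, assuming dplyr >= 1.0 for cur_group():
x %>%
  group_by(b) %>%
  mutate(interval = findInterval(a, pos$bp[pos$b == cur_group()$b])) %>%
  ungroup()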

Related

R data.table group by continuous values

I need some help with grouping data by runs of consecutive equal values.
If I have this data.table:
dt <- data.table::data.table(a = c(1,1,1,2,2,2,2,1,1,2), b = 1:10, c = 1:10 + 1)
a b c
1: 1 1 2
2: 1 2 3
3: 1 3 4
4: 2 4 5
5: 2 5 6
6: 2 6 7
7: 2 7 8
8: 1 8 9
9: 1 9 10
10: 2 10 11
I need a group for every run of consecutive equal values in column a. For each group I need the first (also the minimum possible) value of column b and the last (also the maximum possible) value of column c.
Like this:
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
Thank you very much for your help. I could not solve it on my own.
We can try rleid to index the runs, keeping the first a and b and the last c per run, then drop the run id:
dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
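For intuition, rleid() gives each run of consecutive equal values its own id:
rleid(dt$a)
# [1] 1 1 1 2 2 2 2 3 3 4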
An option with dplyr:
library(dplyr)
dt %>%
  group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
  summarise(across(a:b, first), c = last(c)) %>%
  select(-grp)
Output:
# A tibble: 4 × 3
a b c
<dbl> <int> <dbl>
1 1 1 4
2 2 4 8
3 1 8 10
4 2 10 11
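The cumsum(c(TRUE, diff(a) != 0)) expression is a base R stand-in for rleid():
cumsum(c(TRUE, diff(dt$a) != 0))
# [1] 1 1 1 2 2 2 2 3 3 4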

Is there some way to keep variable names from .SD + .SDcols together with non-.SD variable names in data.table?

Given a data.table
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
if one does
DT[, .(a, .SD), .SDcols=x:y]
a .SD.x .SD.v .SD.y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the variables from .SDcols become prefixed by .SD. On the other hand, if one tries, as in https://stackoverflow.com/a/62282856/997979,
DT[, c(.(a), .SD), .SDcols=x:y]
V1 x v y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the other variable name (a) becomes lost. (It is for this reason that I re-ask the question, which I had initially marked as a duplicate of the one linked above.)
Is there some way to keep the names of both the .SD variables and the non-.SD variables?
The goal is to be able to use .() to select variables without quotes while also being able to select variables through .SDcols = patterns("...").
Thanks in advance!
Not really sure why, but it works ;-)
DT[, .(a, (.SD)), .SDcols=x:y]
# a x v y
# 1: 1 b 1 1
# 2: 2 b 1 3
# 3: 3 b 1 6
# 4: 4 a 2 1
# 5: 5 a 2 3
# 6: 6 a 1 6
# 7: 7 c 1 1
# 8: 8 c 2 3
# 9: 9 c 2 6
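This appears to carry over to pattern-based selection, the stated goal (the regex here is just an illustration matching the same three columns):
DT[, .(a, (.SD)), .SDcols = patterns("^[xvy]$")]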

Get the group index in a data.table

I have the following data.table:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5,6,7,8,9,10), b = c('A','A','A','B','B', 'C', 'C', 'C', 'D', 'D'), c = c(1,1,1,1,1,2,2,2,2,2))
> DT
a b c
1: 1 A 1
2: 2 A 1
3: 3 A 1
4: 4 B 1
5: 5 B 1
6: 6 C 2
7: 7 C 2
8: 8 C 2
9: 9 D 2
10: 10 D 2
I want to add a column that shows an index grouped by c (restarting from 1 for each group in column c) and that only changes when the value of b changes. The desired result is the col column shown in the output below.
Here are two ways to do this.
Using rleid:
library(data.table)
DT[, col := rleid(b), c]
With match + unique:
DT[, col := match(b, unique(b)), c]
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
We can use factor with the levels specified and coerce it to integer:
library(data.table)
DT[, col := as.integer(factor(b, levels = unique(b))), c]
Output:
DT
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
Or using base R with rle:
with(DT, as.integer(ave(b, c, FUN = function(x)
  with(rle(x), rep(seq_along(values), lengths)))))
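The same rle() logic can also be written as a grouped := assignment (a small sketch, equivalent to the rleid() answer above):
DT[, col := with(rle(b), rep(seq_along(values), lengths)), by = c]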

How to efficiently find last preceding row with nonzero value in R data.table

Introduction
I am trying to find the optimal way to locate, for each row, the last preceding row with a nonzero value in a given column, and to return the value of a different column from that row. I want to do this in R data.table, and I am looking for maximum efficiency.
Example
Let's have a data.table like so:
set.seed(123)
DT = data.table(x = rep(c("b","a","c"), each = 6),
                y = rep(1:6, 3),
                z = rbinom(18, 1, 0.3))
That gives us the following data.table:
x y z
1: b 1 0
2: b 2 1
3: b 3 0
4: b 4 1
5: b 5 1
6: b 6 0
7: a 1 0
8: a 2 1
9: a 3 0
10: a 4 0
11: a 5 1
12: a 6 0
13: c 1 0
14: c 2 0
15: c 3 0
16: c 4 1
17: c 5 0
18: c 6 0
The table is ordered by column y within each value of column x. For each group given by column x, I would like to create a column that gives, for each row, the value of y from the last row (up to and including the current one) with a nonzero z.
Right now I am using lapply over y, grouping by x, which gives the desired result:
DT[, list(y,
          z,
          output = lapply(y, function(x) max(y[z != 0 & y <= x]))),
   by = 'x']
The question
Can I make my code from the example more efficient?
You might try using nafill:
# create a dummy column that is only populated for nonzero z (and hence NA elsewhere)
DT[z != 0, y_copy := y]
# nafill on this column using LOCF strategy by group:
DT[ , ans := nafill(y_copy, type = 'locf'), by = x][]
# x y z y_copy ans
# 1: b 1 0 NA NA
# 2: b 2 1 2 2
# 3: b 3 0 NA 2
# 4: b 4 1 4 4
# 5: b 5 1 5 5
# 6: b 6 0 NA 5
# 7: a 1 0 NA NA
# 8: a 2 1 2 2
# 9: a 3 0 NA 2
# 10: a 4 0 NA 2
# 11: a 5 1 5 5
# 12: a 6 0 NA 5
# 13: c 1 0 NA NA
# 14: c 2 0 NA NA
# 15: c 3 0 NA NA
# 16: c 4 1 4 4
# 17: c 5 0 NA 4
# 18: c 6 0 NA 4
For now, nafill is a development-only feature (data.table 1.12.3+), but 1.12.4 should be on CRAN in the next week or two. For the moment, you can install the development version with install.packages('data.table', type = 'source', repos = 'http://Rdatatable.github.io/data.table').
If you don't want to create y_copy, you could do this inline with is.na<-:
DT[ , ans := nafill(`is.na<-`(y, z == 0), type = 'locf'), by = x]
This is slightly inefficient because z == 0 is calculated repeatedly by group (instead of once as a single vector); you could instead compute it in a first step:
DT[, z_zero := z == 0]
But this again means another dummy column (though with less storage than y_copy if y is numeric, character, or complex).
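Putting those pieces together (a sketch under the same assumptions, with the mask precomputed once):
DT[, z_zero := z == 0]
DT[, ans := nafill(`is.na<-`(y, z_zero), type = 'locf'), by = x]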
Another option using a rolling join:
# join the z == 1 rows back onto DT; roll = Inf carries the last matching y forward within each x
DT[, output := DT[z==1][.SD, on=.(x, y), roll=Inf, x.y]]
output:
x y z output
1: b 1 0 NA
2: b 2 1 2
3: b 3 0 2
4: b 4 1 4
5: b 5 1 5
6: b 6 0 5
7: a 1 0 NA
8: a 2 1 2
9: a 3 0 2
10: a 4 0 2
11: a 5 1 5
12: a 6 0 5
13: c 1 0 NA
14: c 2 0 NA
15: c 3 0 NA
16: c 4 1 4
17: c 5 0 4
18: c 6 0 4
An option with a non-equi join:
library(data.table)
library(zoo)
DT[DT[z != 0, .(y1 = y, x)], output := y1, on = .(x, y <= y1),
   mult = 'last'][, output := na.locf0(output), x]
DT
# x y z output
# 1: b 1 0 NA
# 2: b 2 1 2
# 3: b 3 0 2
# 4: b 4 1 4
# 5: b 5 1 5
# 6: b 6 0 5
# 7: a 1 0 NA
# 8: a 2 1 2
# 9: a 3 0 2
#10: a 4 0 2
#11: a 5 1 5
#12: a 6 0 5
#13: c 1 0 NA
#14: c 2 0 NA
#15: c 3 0 NA
#16: c 4 1 4
#17: c 5 0 4
#18: c 6 0 4

How to drop factors that have fewer than n members

Is there a way to drop groups (factor levels) that have fewer than N rows, like N = 5, from a data.table?
Data:
DT = data.table(x = rep(c("a","b","c"), each = 6), y = c(1,3,6), v = 1:9,
                id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows belonging to any id that occurs fewer than 5 times. The variable "id" is the grouping variable; a group should be deleted when it has fewer than 5 rows. In DT, we need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove their rows. Expected output:
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach.
Get the size of each group, and the groups to keep:
nFactors <- tapply(DT$id, DT$id, length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired result, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data.table answer.
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
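A shorter data.table variant keeps whole groups with if and .SD (a sketch; note it moves id to the first column and can be slower than the .I approach above):
DT[, if (.N >= 5L) .SD, by = id]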
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
  group_by(id) %>%
  filter(n() >= 5)
