How to subset data.table by external function with arbitrary conditions

Suppose I have a data.table like the following.
library(data.table)
a <- seq(2)
b <- seq(3)
c <- seq(4)
dt <- data.table(expand.grid(a, b, c))
> dt
Var1 Var2 Var3
1: 1 1 1
2: 2 1 1
3: 1 2 1
4: 2 2 1
5: 1 3 1
6: 2 3 1
7: 1 1 2
8: 2 1 2
9: 1 2 2
10: 2 2 2
11: 1 3 2
12: 2 3 2
13: 1 1 3
14: 2 1 3
15: 1 2 3
16: 2 2 3
17: 1 3 3
18: 2 3 3
19: 1 1 4
20: 2 1 4
21: 1 2 4
22: 2 2 4
23: 1 3 4
24: 2 3 4
Now I can easily subset by column values using a standard data.table subset call. For example,
dt[Var2==2 & Var3==1]
Var1 Var2 Var3
1: 1 2 1
2: 2 2 1
But now suppose I wanted to create a function outside of the data.table, generically something like
foo <- function(dt, ...) {
  return(dt[Var2 == 2 & Var3 == 1])
}
I have seen some examples that use only one subset column together with globalenv()$val, so that Var2 can be defined outside of the data.table filter.
foo <- function(dt, ...) {
  return(dt[Var2 == globalenv()$Var2])
}
But if I had a large number of columns and wanted to filter by an arbitrary subset of the columns and values, this doesn't offer a simple solution. I can do this a few ways, but they all seem very cumbersome and inefficient. Is there a way to subset by a function, with arbitrary columns selected by the user, that would accomplish this?
Like,
foo(dt,Var2=1,Var3=1)
foo(dt,Var1=2,Var3=1,Var10=2,...)
foo(dt,c(Var1=2,Var3=1,Var10=2))
etc.
I added the extra dots since I want to be able to enter any number of arbitrary selection conditions to the function call.
In case anyone is wondering, my end goal is a much larger function, but the data.table filtering is a critical portion of it.

A slight modification from Christian's answer:
fun <- function(dt, ...) {
  args <- list(...)
  filter <- Reduce(
    function(x, y) call("&", x, y),
    Map(function(val, name) call("==", as.name(name), val), args, names(args)))
  dt[eval(filter)]
}
fun(dt, Var1 = 1, Var3 = 1)
# Var1 Var2 Var3
#1: 1 1 1
#2: 1 2 1
#3: 1 3 1
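To see what this builds: the Map call turns each name/value pair into an unevaluated Var == val call, and Reduce chains those with &. A quick illustration outside the function (my addition, not part of the original answer):
args <- list(Var1 = 1, Var3 = 1)
# each pair becomes an unevaluated call such as Var1 == 1 ...
conds <- Map(function(val, name) call("==", as.name(name), val), args, names(args))
# ... and Reduce chains them into one filter expression
filter <- Reduce(function(x, y) call("&", x, y), conds)
filter
# Var1 == 1 & Var3 == 1
dt[eval(filter)] then evaluates that expression within the data.table's scope, just as dt[Var1 == 1 & Var3 == 1] would.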

One possible solution (note the use of == and not =, as in the post):
foo <- function(dt, ...) {
  eval(substitute(dt[Reduce(`&`, list(...)), ]))
}
foo(dt,Var2==1,Var3==1)
Var1 Var2 Var3
<int> <int> <int>
1: 1 1 1
2: 2 1 1

Related

mutate variable by condition using two variables in long format data.table in r

In this data.table:
dt <- data.table(id=c(1,1,1,2,2,2), time=rep(1:3,2), x=c(1,0,0,0,1,0))
dt
id time x
1: 1 1 1
2: 1 2 0
3: 1 3 0
4: 2 1 0
5: 2 2 1
6: 2 3 0
I need the following:
id time x
1: 1 1 1
2: 1 2 1
3: 1 3 1
4: 2 1 0
5: 2 2 1
6: 2 3 1
that is
if x==1 at time==1 then x=1 at times 2 and 3, by id
if x==1 at time==2 then x=1 at time 3, by id
For the first point (I guess the second one will be similar), I have tried approaches mentioned in similar questions I posted before (here and here), but none work:
dt[x==1[time == 1], x := x[time == 1], id] gives an error
setDT(dt)[, x2 := ifelse(x==1 & time==1, x[time==1], x), by=id] changes x only at time 1 (so, no real change observed)
It would be much easier to work with the data in wide format, but I keep facing this kind of problem in long format, and I don't want to reshape my data all the time.
Thank you!
EDIT:
The answer provided by @GregorThomas, dt[, x := cummax(x), by = id], works for the problem as I presented it.
Now I ask the same question for a character variable:
dt2 <- data.table(id=c(1,1,1,2,2,2), time=rep(1:3,2), x=c('a','b','b','b','a','b'))
dt2
id time x
1: 1 1 a
2: 1 2 b
3: 1 3 b
4: 2 1 b
5: 2 2 a
6: 2 3 b
In the table above, how could the following be done:
if x=='a' at time==1 then x='a' at times 2 and 3, by id
if x=='a' at time==2 then x='a' at time 3, by id
Using the cumulative maximum function cummax:
dt[, x := cummax(x), by = id]
dt
# id time x
# 1: 1 1 1
# 2: 1 2 1
# 3: 1 3 1
# 4: 2 1 0
# 5: 2 2 1
# 6: 2 3 1
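The character follow-up from the edit is not covered above. Since cummax() needs numbers, one adaptation (a sketch of the same carry-forward idea, applied to the logical flag x == 'a') would be:
# carry 'a' forward once seen, by id, via cummax on a logical flag
dt2[, x := fifelse(cummax(x == 'a') == 1L, 'a', x), by = id]
dt2
#    id time x
# 1:  1    1 a
# 2:  1    2 a
# 3:  1    3 a
# 4:  2    1 b
# 5:  2    2 a
# 6:  2    3 a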

create list from columns of data table expression

Consider the following dt:
dt <- data.table(a=c(1,1,2,3),b=c(4,5,6,4))
It looks like this:
> dt
a b
1: 1 4
2: 1 5
3: 2 6
4: 3 4
Here I'm aggregating each column by its unique values and then counting how many times each unique value appears:
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 2
2: 2 1 5 1
3: 3 1 6 1
So 1 appears twice in dt and thus a.N is 2; the same logic applies to the other values.
But the problem is that if these transformations of the original data.table end up with different dimensions, things will get recycled.
For example this dt:
dt <- data.table(a=c(1,1,2,3,7),b=c(4,5,6,4,4))
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 3
2: 2 1 5 1
3: 3 1 6 1
4: 7 1 4 3
Warning message:
In as.data.table.list(jval, .named = NULL) :
Item 2 has 3 rows but longest item has 4; recycled with remainder.
That is no longer the right answer, because b.N should now have only 3 rows, but the vectors got recycled.
This is why I would like to turn the expression dt[,lapply(.SD,function(agg) dt[,.N,by=agg])] into a list whose items can have different lengths, with the names of the list items being the names of the columns in the new transformed dt.
A sketch of what I mean is:
newlist
$a.agg
1 2 3 7
$a.N
2 1 1 1
$b.agg
4 5 6
$b.N
3 1 1
Or, an even better solution would be to get a data.table that tracks the column names in another column:
dt_final
agg N column
1 2 a
2 1 a
3 1 a
7 1 a
4 3 b
5 1 b
6 1 b
Get the data in long format and then aggregate by group.
library(data.table)
dt_long <- melt(dt, measure.vars = c('a', 'b'))
dt_long[, .N, .(variable, value)]
# variable value N
#1: a 1 2
#2: a 2 1
#3: a 3 1
#4: a 7 1
#5: b 4 3
#6: b 5 1
#7: b 6 1
In tidyverse:
library(dplyr)
library(tidyr)
dt %>%
  pivot_longer(cols = everything()) %>%
  count(name, value)
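For the list-shaped output sketched in the question (newlist with elements a.agg, a.N, b.agg, b.N), a minimal sketch is to run one .N aggregation per column and flatten the result, letting unlist build the names:
# one .N aggregation per column, flattened into a named list of vectors
newlist <- unlist(
  lapply(dt, function(col) as.list(dt[, .N, by = .(agg = col)])),
  recursive = FALSE)
newlist$b.N
# [1] 3 1 1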

Select value from previous group based on condition

I have the following df
df <- data.frame(value   = c(1,1,1,2,1,1,2,2,1,2),
                 group   = c(5,5,5,6,7,7,8,8,9,10),
                 no_rows = c(3,3,3,1,2,2,2,2,1,1))
where identical consecutive values form a group, i.e., values in rows 1:3 fall under group 5. Column "no_rows" tells us how many rows/entries each group has, i.e., group 5 has 3 rows/entries.
I am trying to substitute all values where no_rows < 2 with the value from the previous group. I expect my end df to look like this:
df_end <- data.frame(value   = c(1,1,1,1,1,1,2,2,2,2),
                     group   = c(5,5,5,6,7,7,8,8,9,10),
                     no_rows = c(3,3,3,1,2,2,2,2,1,1))
I came up with this if statement inside a for loop, which gives me the desired output; however, it is very slow, and I am looking for a way to optimise it.
for (i in 2:length(df$group)) {
  if (df$no_rows[i] < 2) {
    df$value[i] <- df$value[i - 1]
  }
}
I have also tried dplyr::mutate with lag(), but it does not give me the desired output (it only removes the first value per group instead of taking the value of the previous group).
df <- df %>%
  group_by(group) %>%
  mutate(value = ifelse(no_rows < 2, lag(value), value))
I have been looking for a solution for a few days now, but I could not find anything that fits my problem completely. Any ideas?
A data.table approach: first, get the values of the groups with no_rows >= 2, then fill in the missing values (NA) by last observation carried forward.
library(data.table)
# make it a data.table
setDT(df, key = "group")
# get values for groups of no_rows >= 2
df[no_rows >= 2, new_value := value][]
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 NA
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 NA
#10: 2 10 1 NA
# fill down missing values in new_value
setnafill(df, "locf", cols = c("new_value"))
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 1
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 2
#10: 2 10 1 2
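A more compact variant of the same idea (a sketch that overwrites value in place instead of adding a new_value column):
# blank out the values of the short groups, then fill down in place
setDT(df)[no_rows < 2, value := NA][, value := nafill(value, type = "locf")]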

Keep only 'by' variables when collapsing data.table

I have a very large data.table:
DT <- data.table(a = c(1,1,1,1,2,2,2,2,3,3,3,3), b = c(1,1,2,2), c = 1:12)
And I need to collapse it by several variables, e.g. list(a,b). Easy:
DT[,sum(c),by=list(a,b)]
a b V1
1: 1 1 3
2: 1 2 7
3: 2 1 11
4: 2 2 15
5: 3 1 19
6: 3 2 23
However, I don't want to perform any operation on c; I just want to drop it:
DT[,,by=list(a,b)]       # includes a, b, c, thus does not collapse
DT[,list(),by=list(a,b)] # zero rows
DT[,a,by=list(a,b)]      # what I want, but adds an extraneous column a after the 'by' columns
How can I specify X below to get the indicated result?
DT[,X,by=list(a,b)]
a b
1: 1 1
2: 1 2
3: 2 1
4: 2 2
5: 3 1
6: 3 2
unique.data.table has a by argument; you can then subset the result to get the columns you want, e.g.
unique(DT, by = c('a', 'b'))[, c('a','b')]
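A near-equivalent sketch is to drop the other columns before deduplicating, so they are never carried along at all:
# keep only the grouping columns, then take unique rows
unique(DT[, .(a, b)])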

R data.table with rollapply

Is there an existing idiom for computing rolling statistics using data.table grouping?
For example, given the following code:
library(data.table)
library(zoo)  # rollapply comes from zoo
DT <- data.table(x = rep(c("a","b","c"), each = 2), y = c(1,3), v = 1:6)
setkey(DT, y)
stat.ror <- DT[, rollapply(v, width = 1, by = 1, mean, na.rm = TRUE), by = y]
If there isn't one yet, what would be the best way to do it?
In fact, I am trying to solve this very problem right now. Here is a partial solution which works for grouping by a single column.
Edit: got it with RcppRoll, I think:
windowed.average <- function(input.table,
                             window.width = 2,
                             id.cols = names(input.table)[3],
                             index.col = names(input.table)[1],
                             val.col = names(input.table)[2]) {
  require(RcppRoll)
  avg.with.group <-
    input.table[, roll_mean(get(val.col), n = window.width), by = c(id.cols)]
  avg.index <-
    input.table[, roll_mean(get(index.col), n = window.width), by = c(id.cols)]$V1
  output.table <- data.table(
    Group = avg.with.group,
    Index = avg.index)
  # rename columns to (sensibly) match inputs
  setnames(output.table, old = colnames(output.table),
           new = c(id.cols, val.col, index.col))
  return(output.table)
}
A (badly written) unit test that the above will pass:
require(testthat)
require(zoo)
test.datatable <- data.table(Time = rep(seq_len(10), times = 2),
                             Voltage = runif(20),
                             Channel = rep(seq_len(2), each = 10))
test.width <- 8
# first test: single id column
test.avgtable <- data.table(
  test.datatable[, rollapply(Voltage, width = test.width, mean, na.rm = TRUE),
                 by = c("Channel")],
  Time = test.datatable[, rollapply(Time, width = test.width, mean, na.rm = TRUE),
                        by = c("Channel")]$V1)
setnames(test.avgtable, old = names(test.avgtable),
         new = c("Channel", "Voltage", "Time"))
expect_that(test.avgtable,
            is_identical_to(windowed.average(test.datatable, test.width)))
How it looks:
> test.datatable
Time Voltage Channel Class
1: 1 0.310935570 1 1
2: 2 0.565257533 1 2
3: 3 0.577278573 1 1
4: 4 0.152315111 1 2
5: 5 0.836052122 1 1
6: 6 0.655417230 1 2
7: 7 0.034859642 1 1
8: 8 0.572040136 1 2
9: 9 0.268105436 1 1
10: 10 0.126484340 1 2
11: 1 0.139711248 2 1
12: 2 0.336316520 2 2
13: 3 0.413086486 2 1
14: 4 0.304146029 2 2
15: 5 0.399344631 2 1
16: 6 0.581641210 2 2
17: 7 0.183586025 2 1
18: 8 0.009775488 2 2
19: 9 0.449576242 2 1
20: 10 0.938517952 2 2
> test.avgtable
Channel Voltage Time
1: 1 0.4630195 4.5
2: 1 0.4576657 5.5
3: 1 0.4028191 6.5
4: 2 0.2959510 4.5
5: 2 0.3346841 5.5
6: 2 0.4099593 6.5
I initially hadn't managed to make it work with multiple groupings, but it now looks okay for multiple column groups:
# second test: multiple id columns
# Depends on the first test passing to be meaningful.
test.width <- 4
test.datatable[, Class := rep(seq_len(2), times = ceiling(nrow(test.datatable)/2))]
# windowed.average(test.datatable, test.width, id.cols = c("Channel", "Class"))
test.avgtable <- rbind(windowed.average(test.datatable[Class == 1, ], test.width),
                       windowed.average(test.datatable[Class == 2, ], test.width))
# somewhat artificially attaching expected class labels
test.avgtable[, Class := rep(seq_len(2), times = nrow(test.avgtable)/4, each = 2)]
setkey(test.avgtable, Channel)
setcolorder(test.avgtable, c("Channel", "Class", "Voltage", "Time"))
expect_that(test.avgtable,
            is_equivalent_to(windowed.average(test.datatable, test.width,
                                              id.cols = c("Channel", "Class"))))
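As a side note, newer data.table versions (1.12+) ship frollmean, which makes the grouped rolling mean a one-liner; unlike the rollapply calls above, it keeps leading incomplete windows as NA instead of dropping them. A sketch on the test data:
# grouped rolling means with data.table's built-in frollmean
test.datatable[, .(Time    = frollmean(Time, test.width),
                   Voltage = frollmean(Voltage, test.width)),
               by = .(Channel, Class)]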
