R data.table with rollapply

Is there an existing idiom for computing rolling statistics using data.table grouping?
For example, given the following code:
DT = data.table(x = rep(c("a","b","c"), each = 2), y = c(1,3), v = 1:6)
setkey(DT, y)
stat.ror <- DT[, rollapply(v, width = 1, by = 1, mean, na.rm = TRUE), by = y]
If there isn't one yet, what would be the best way to do it?
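For later readers: newer data.table releases ship native fast rolling functions (frollmean()/frollsum(), and frollapply() for arbitrary functions; the versions, 1.12.0 and 1.12.4 respectively, are from memory), which make the idiom straightforward. A minimal sketch on the question's own data:
library(data.table)
DT = data.table(x = rep(c("a","b","c"), each = 2), y = c(1,3), v = 1:6)
# rolling mean of v over windows of 2, computed within each x group;
# the first n-1 positions of each group come back as NA
DT[, rmean := frollmean(v, n = 2, na.rm = TRUE), by = x]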

In fact, I was trying to solve this very problem. Here is a partial solution that works for grouping by a single column:
Edit: got it fully working with RcppRoll, I think:
windowed.average <- function(input.table,
                             window.width = 2,
                             id.cols = names(input.table)[3],
                             index.col = names(input.table)[1],
                             val.col = names(input.table)[2]) {
  require(RcppRoll)
  # rolling mean of the value column, computed within each id group
  avg.with.group <-
    input.table[, roll_mean(get(val.col), n = window.width), by = c(id.cols)]
  # rolling mean of the index column (e.g. time), same grouping
  avg.index <-
    input.table[, roll_mean(get(index.col), n = window.width), by = c(id.cols)]$V1
  output.table <- data.table(Group = avg.with.group,
                             Index = avg.index)
  # rename columns to (sensibly) match inputs
  setnames(output.table, old = colnames(output.table),
           new = c(id.cols, val.col, index.col))
  return(output.table)
}
A (badly written) unit test that the above function passes:
require(testthat)
require(zoo)
test.datatable <- data.table(Time = rep(seq_len(10), times = 2),
                             Voltage = runif(20),
                             Channel = rep(seq_len(2), each = 10))
test.width <- 8
# first test: single id column
test.avgtable <- data.table(
  test.datatable[, rollapply(Voltage, width = test.width, mean, na.rm = TRUE),
                 by = c("Channel")],
  Time = test.datatable[, rollapply(Time, width = test.width, mean, na.rm = TRUE),
                        by = c("Channel")]$V1)
setnames(test.avgtable, old = names(test.avgtable),
         new = c("Channel", "Voltage", "Time"))
expect_that(test.avgtable,
            is_identical_to(windowed.average(test.datatable, test.width)))
How it looks:
> test.datatable
    Time     Voltage Channel Class
 1:    1 0.310935570       1     1
 2:    2 0.565257533       1     2
 3:    3 0.577278573       1     1
 4:    4 0.152315111       1     2
 5:    5 0.836052122       1     1
 6:    6 0.655417230       1     2
 7:    7 0.034859642       1     1
 8:    8 0.572040136       1     2
 9:    9 0.268105436       1     1
10:   10 0.126484340       1     2
11:    1 0.139711248       2     1
12:    2 0.336316520       2     2
13:    3 0.413086486       2     1
14:    4 0.304146029       2     2
15:    5 0.399344631       2     1
16:    6 0.581641210       2     2
17:    7 0.183586025       2     1
18:    8 0.009775488       2     2
19:    9 0.449576242       2     1
20:   10 0.938517952       2     2
> test.avgtable
   Channel   Voltage Time
1:       1 0.4630195  4.5
2:       1 0.4576657  5.5
3:       1 0.4028191  6.5
4:       2 0.2959510  4.5
5:       2 0.3346841  5.5
6:       2 0.4099593  6.5
I initially hadn't managed to make this work with multiple groupings, but after the RcppRoll edit it looks okay for multiple column groups, as this second section shows:
# second test: multiple id columns
# Depends on the first test passing to be meaningful.
test.width <- 4
test.datatable[, Class := rep(seq_len(2), times = ceiling(nrow(test.datatable)/2))]
# windowed.average(test.datatable, test.width, id.cols = c("Channel", "Class"))
test.avgtable <- rbind(windowed.average(test.datatable[Class == 1, ], test.width),
                       windowed.average(test.datatable[Class == 2, ], test.width))
# somewhat artificially attaching expected class labels
test.avgtable[, Class := rep(seq_len(2), times = nrow(test.avgtable)/4, each = 2)]
setkey(test.avgtable, Channel)
setcolorder(test.avgtable, c("Channel", "Class", "Voltage", "Time"))
expect_that(test.avgtable,
            is_equivalent_to(windowed.average(test.datatable, test.width,
                                              id.cols = c("Channel", "Class"))))
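As an aside, with frollmean() from newer data.table versions the multi-column grouping case above collapses into a single expression; a sketch against the test table, with the caveat that frollmean() pads the first n-1 positions of each group with NA instead of shrinking the result the way rollapply()/roll_mean() do:
test.datatable[, lapply(.SD, frollmean, n = test.width),
               by = .(Channel, Class), .SDcols = c("Voltage", "Time")]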

Related

How to subset data.table by external function with arbitrary conditions

Suppose I have a data.table like the following.
a <- seq(2)
b <- seq(3)
c <- seq(4)
dt <- data.table(expand.grid(a,b,c))
> dt
    Var1 Var2 Var3
 1:    1    1    1
 2:    2    1    1
 3:    1    2    1
 4:    2    2    1
 5:    1    3    1
 6:    2    3    1
 7:    1    1    2
 8:    2    1    2
 9:    1    2    2
10:    2    2    2
11:    1    3    2
12:    2    3    2
13:    1    1    3
14:    2    1    3
15:    1    2    3
16:    2    2    3
17:    1    3    3
18:    2    3    3
19:    1    1    4
20:    2    1    4
21:    1    2    4
22:    2    2    4
23:    1    3    4
24:    2    3    4
Now I can easily subset by column values using a standard data.table subset call. For example,
dt[Var2==2 & Var3==1]
   Var1 Var2 Var3
1:    1    2    1
2:    2    2    1
But now suppose I wanted to create a function outside of the data.table, something generic like
foo <- function(dt, ...) {
  return(dt[Var2 == 2 & Var3 == 1])
}
I have seen some examples using only one subset column together with globalenv()$val, where Var2 is defined outside of the data.table filter:
foo <- function(dt, ...) {
  return(dt[, Var2 == globalenv()$Var2])
}
But if I had a large number of columns and wanted to filter by an arbitrary subset of the columns and values, this approach doesn't scale. I can do this a few ways, but they all seem very cumbersome and inefficient. Is there a way to subset via a function that lets the user select arbitrary columns and values?
Like,
foo(dt,Var2=1,Var3=1)
foo(dt,Var1=2,Var3=1,Var10=2,...)
foo(dt,c(Var1=2,Var3=1,Var10=2))
etc
I added the extra dots since I want to be able to enter any number of arbitrary selection conditions to the function call.
In case anyone is wondering, my end goal is a much larger function. But the datatable filtering is a critical portion of it.
A slight modification from Christian's answer:
fun <- function(dt, ...) {
  args <- list(...)
  filter <- Reduce(
    function(x, y) call("&", x, y),
    Map(function(val, name) call("==", as.name(name), val), args, names(args)))
  dt[eval(filter)]
}
fun(dt, Var1 = 1, Var3 = 1)
#    Var1 Var2 Var3
# 1:    1    1    1
# 2:    1    2    1
# 3:    1    3    1
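For what it's worth, the Map() call builds the individual comparisons Var1 == 1 and Var3 == 1 as unevaluated calls, and Reduce() then chains them into the single expression Var1 == 1 & Var3 == 1, which dt[eval(filter)] evaluates with dt's columns in scope.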
One possible solution (note the use of == and not =, as in the post):
foo = function(dt, ...) {
  eval(substitute(dt[Reduce(`&`, list(...)), ]))
}
foo(dt, Var2 == 1, Var3 == 1)
    Var1  Var2  Var3
   <int> <int> <int>
1:     1     1     1
2:     2     1     1
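Since the conditions are substituted unevaluated into the data.table call, this version is not limited to equality tests; any logical expression on the columns should work, e.g.:
# the four rows with Var2 in 2:3 and Var3 == 4
foo(dt, Var2 > 1, Var3 == 4)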

Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
   DebtorId DupDebtorId
1:        1           2
2:        1           3
3:        1           4
4:        5           1
5:        5           2
6:        5           3
7:        6           7
8:        7           6
My goal is to assign a unique group ID to each group of cases. The desired output is:
   DebtorId group
1:        1     1
2:        2     1
3:        3     1
4:        4     1
5:        5     1
6:        6     2
7:        7     2
My train of thought:
library(data.table)
example <- data.table(
  DebtorId = c(1,1,1,5,5,5,6,7),
  DupDebtorId = c(2,3,4,1,2,3,7,6)
)
# get unique pairs of DebtorId and DupDebtorId
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))), ]
# assign a group ID for each DebtorId
unique_pairs[, group := .GRP, by = .(DebtorId)]
unique_pairs[, num := rowid(group)]
# format data to wide for each group ID
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId')
# create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))),
                    group = NA)
newdt$group <- as.numeric(newdt$group)
# loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
  a <- newdt[i]$DebtorId
  b <- min(which(groups[, -1] == a, arr.ind = TRUE)[, 1])
  newdt[i]$group <- b
}
Output:
   DebtorId group
1:        1     1
2:        2     1
3:        3     1
4:        4     1
5:        5     2
6:        6     3
7:        7     3
There are two problems with my approach:
1. From the output, you can see that it fails to recognize that case 5 belongs to group 1.
2. The final loop is agonizingly slow, which renders it useless for my real data of roughly 1M rows, and the idiomatic := approach does not work with which().
I'm not sure whether my approach can be optimized, or whether there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, you can build a graph from your data (the example table above) and then extract cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
g <- graph.data.frame(example)
df_membership <- clusters(g)$membership
stack(df_membership)
#>   values ind
#> 1      1   1
#> 2      1   5
#> 3      2   6
#> 4      2   7
#> 5      1   2
#> 6      1   3
#> 7      1   4
Above, values corresponds to group and ind to DebtorId.
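If you want the exact DebtorId/group table from the question, a small clean-up step on top of stack() might look like the following (graph_from_data_frame() and components() are the newer igraph names for the functions used above):
library(data.table)
# ind comes back as a factor of vertex names, hence the as.character() round-trip
res <- setDT(stack(df_membership))[, .(DebtorId = as.integer(as.character(ind)),
                                       group = values)]
setorder(res, DebtorId)
res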

Select value from previous group based on condition

I have the following df
df <- data.frame(value   = c(1,1,1,2,1,1,2,2,1,2),
                 group   = c(5,5,5,6,7,7,8,8,9,10),
                 no_rows = c(3,3,3,1,2,2,2,2,1,1))
where identical consecutive values form a group, i.e., values in rows 1:3 fall under group 5. Column "no_rows" tells us how many rows/entries each group has, i.e., group 5 has 3 rows/entries.
I am trying to substitute all values where no_rows < 2 with the value from the previous group. I expect my end df to look like this:
df_end <- data.frame(value   = c(1,1,1,1,1,1,2,2,2,2),
                     group   = c(5,5,5,6,7,7,8,8,9,10),
                     no_rows = c(3,3,3,1,2,2,2,2,1,1))
I came up with this if condition inside a for loop, which gives me the desired output; however, it is very slow and I am looking for a way to optimise it.
for (i in 2:length(df$group)) {
  if (df$no_rows[i] < 2) {
    df$value[i] <- df$value[i-1]
  }
}
I have also tried dplyr::mutate with lag(), but it does not give the desired output (it only blanks the first value per group instead of taking the value from the previous group).
df <- df %>%
  group_by(group) %>%
  mutate(value = ifelse(no_rows < 2, lag(value), value))
I have looked for a solution for a few days now, but could not find anything that fits my problem completely. Any ideas?
A data.table approach...
First, get the values of groups with length >= 2, then fill in the missing values (NA) by last observation carried forward.
library(data.table)
# make it a data.table
setDT(df, key = "group")
# get values for groups of no_rows >= 2
df[no_rows >= 2, new_value := value][]
#     value group no_rows new_value
#  1:     1     5       3         1
#  2:     1     5       3         1
#  3:     1     5       3         1
#  4:     2     6       1        NA
#  5:     1     7       2         1
#  6:     1     7       2         1
#  7:     2     8       2         2
#  8:     2     8       2         2
#  9:     1     9       1        NA
# 10:     2    10       1        NA
# fill down missing values in new_value
setnafill(df, "locf", cols = c("new_value"))
#     value group no_rows new_value
#  1:     1     5       3         1
#  2:     1     5       3         1
#  3:     1     5       3         1
#  4:     2     6       1         1
#  5:     1     7       2         1
#  6:     1     7       2         1
#  7:     2     8       2         2
#  8:     2     8       2         2
#  9:     1     9       1         2
# 10:     2    10       1         2
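If the helper column isn't needed, the two steps can be collapsed into one in-place update; a sketch, assuming it is fine to overwrite value directly:
# blank out the values of short groups, then carry the last observation forward
df[, value := nafill(fifelse(no_rows >= 2, value, NA_real_), type = "locf")]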

Cross join in data.table doesn't seem to retain column names

data.table documentation says this, see ?CJ:
x = c(1,1,2)
y = c(4,6,4)
CJ(x, y) # output columns are automatically named 'x' and 'y'
However, when I run the example, the names don't seem to be retained:
x = c(1,1,2)
y = c(4,6,4)
CJ(x, y)
   V1 V2
1:  1  4
2:  1  4
3:  1  4
4:  1  4
5:  1  6
6:  1  6
7:  2  4
8:  2  4
9:  2  6
That names are retained is not mentioned in the main body of the help file ?CJ, i.e. in the Details or Value sections. However, it does appear as a comment in the Examples section of the help file (and it looks like that is where you got your example).
Digging around in the CJ function, which appears to be entirely implemented in R, there is a block near the end,
if (getOption("datatable.CJ.names", FALSE))
  vnames = name_dots(...)$vnames
Running getOption("datatable.CJ.names", FALSE) returns FALSE with data.table version 1.12.0. When we set this to TRUE with
options("datatable.CJ.names"=TRUE)
then the code
x = c(1,1,2)
y = c(4,6,4)
CJ(x, y)
returns
   x y
1: 1 4
2: 1 4
3: 1 4
4: 1 4
5: 1 6
6: 1 6
7: 2 4
8: 2 4
9: 2 6
However, you are also able to directly provide names (which is not mentioned in the help file).
CJ(uu=x, vv=y)
which returns
   uu vv
1:  1  4
2:  1  4
3:  1  4
4:  1  4
5:  1  6
6:  1  6
7:  2  4
8:  2  4
9:  2  6
Note that this overrides the above option.
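For completeness: if I remember the NEWS file correctly, the default of datatable.CJ.names flipped to TRUE in data.table 1.12.2, so newer versions retain names out of the box, and I believe the option itself has since been deprecated.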

Function for returning cases that are unique between data sets

I'm working with data from a longitudinal survey that has a large number of realised waves.
In this phase, I'm checking how many new IDs (cases) appear in each subsequent wave (i.e. wave2 compared with wave1, wave3 with wave2, wave4 with wave3, etc.). Each wave has its own data set.
I created code that works fine for manually specified waves. It looks like this:
# Create example data
wave1 <- data.frame(ID = c(1,2,3,4,5))
wave2 <- data.frame(ID = c(1,2,3,6,7))
wave3 <- data.frame(ID = c(1,2,3,6,8))
# In this step I'm taking out the IDs from the first wave
idwave1 <- as.vector(wave1$ID)
# In this step I exclude the non-unique IDs from the subsequent wave (wave2) so I can get the number of unique IDs
wave2unique <- wave2[! wave2$ID %in% idwave1, ]
# Now I apply the same procedure for the wave2 and wave3
idwave2 <- as.vector(wave2$ID)
wave3unique <- wave3[! wave3$ID %in% idwave2, ]
However, after this I am stuck on creating a function that will do it for all waves automatically, because I don't know how to handle each wave being a separate data set.
If your data is in a single table (as it probably should be)...
library(data.table)
wDT = rbindlist(list(wave1, wave2, wave3), idcol = "wave")
    wave ID
 1:    1  1
 2:    1  2
 3:    1  3
 4:    1  4
 5:    1  5
 6:    2  1
 7:    2  2
 8:    2  3
 9:    2  6
10:    2  7
11:    3  1
12:    3  2
13:    3  3
14:    3  6
15:    3  8
Then do an anti-join:
wDT[!.(wave + 1L, ID), on=.(wave, ID)]
   wave ID
1:    1  1
2:    1  2
3:    1  3
4:    1  4
5:    1  5
6:    2  6
7:    2  7
8:    3  8
For dplyr instead of data.table, there's bind_rows (like rbindlist) and anti_join (like the x[!i]).
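A sketch of that dplyr route, under the same single-table assumption:
library(dplyr)
# stack the waves with an id column (bind_rows' .id is like rbindlist's idcol)
wDF <- bind_rows(list(wave1, wave2, wave3), .id = "wave") %>%
  mutate(wave = as.integer(wave))
# anti-join: keep rows whose (wave, ID) has no match among the previous wave's IDs
anti_join(wDF, mutate(wDF, wave = wave + 1L), by = c("wave", "ID"))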
Are you looking for something like the below?
wave_ls = list(wave1, wave2, wave3)
f = function(LIST) {
  out = list()
  for (i in 1:(length(LIST) - 1)) {
    wave1 = LIST[[i]]
    wave2 = LIST[[i + 1]]
    out[[i]] = wave2[!wave2$ID %in% wave1$ID, ]
    names(out)[i] = paste0('wave', i + 1, 'unique')
  }
  out
}
f(wave_ls)
$wave2unique
[1] 6 7

$wave3unique
[1] 8
