I'm trying to apply a rolling function over a data.table by group and am running into problems. I'm not sure whether I should change my function or whether my call is wrong. Here is a simple example:
Data:
library(data.table)
test <- data.table(return = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
                   sec = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
My function:
library(zoo)
zoo_fun <- function(dt, N) {
  rollapply(dt$return + 1, N, FUN = prod, fill = NA, align = "right") - 1
}
Running it (I want to create a new column momentum3: for each security, i.e. grouping by sec, the rolling product of the latest 3 observations of return + 1, minus one):
test[, momentum3 := zoo_fun(test, 3), by=sec]
Warning messages:
1: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
RHS 1 is length 10 (greater than the size (5) of group 1). The last 5 element(s) will be discarded.
2: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
RHS 1 is length 10 (greater than the size (5) of group 2). The last 5 element(s) will be discarded.
I get those warnings and the result is not what I expected:
> test
return sec momentum3
1: 0.1 A NA
2: 0.1 A NA
3: 0.1 A 0.331
4: 0.1 A 0.331
5: 0.1 A 0.331
6: 0.2 B NA
7: 0.2 B NA
8: 0.2 B 0.331
9: 0.2 B 0.331
10: 0.2 B 0.331
I was expecting sec B to be filled with 0.728 ((1.2*1.2*1.2) - 1), with two NAs at the start. What am I doing wrong? Is it that rolling functions won't work with grouping?
This answer suggested using Reduce() and shift() for rolling-window problems with data.table. This benchmark showed that this can be considerably faster than zoo::rollapply().
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
# return sec momentum
# 1: 0.1 A NA
# 2: 0.1 A NA
# 3: 0.1 A 0.331
# 4: 0.1 A 0.331
# 5: 0.1 A 0.331
# 6: 0.2 B NA
# 7: 0.2 B NA
# 8: 0.2 B 0.728
# 9: 0.2 B 0.728
#10: 0.2 B 0.728
Benchmark (10 rows, OP data set)
microbenchmark::microbenchmark(
zoo = test[, momentum := zoo_fun(return, 3), by = sec][],
red = test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][],
times = 100L
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# zoo 2318.209 2389.131 2445.1707 2421.541 2466.1930 3108.382 100 b
# red 562.465 625.413 663.4893 646.880 673.4715 1094.771 100 a
Benchmark (100k rows)
To verify the benchmark results with the small data set, a larger data set is constructed:
n_rows <- 1e4  # repetitions of the 10-row pattern, i.e., 100k rows in total
test0 <- data.table(return = rep(as.vector(outer(1:5/100, 1:2/10, "+")), n_rows),
                    sec = rep(rep(c("A", "B"), each = 5L), n_rows))
test0
# return sec
# 1: 0.11 A
# 2: 0.12 A
# 3: 0.13 A
# 4: 0.14 A
# 5: 0.15 A
# ---
# 99996: 0.21 B
# 99997: 0.22 B
# 99998: 0.23 B
# 99999: 0.24 B
#100000: 0.25 B
As test is being modified in place, each benchmark run is started with a fresh copy of test0.
microbenchmark::microbenchmark(
copy = test <- copy(test0),
zoo = {
test <- copy(test0)
test[, momentum := zoo_fun(return, 3), by = sec][]
},
red = {
test <- copy(test0)
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
},
times = 10L
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# copy 282.619 294.512 325.3261 298.424 350.272 414.983 10 a
# zoo 1129601.974 1144346.463 1188484.0653 1162598.499 1194430.395 1337727.279 10 b
# red 3354.554 3439.095 6135.8794 5002.008 7695.948 11443.595 10 a
For 100k rows, the Reduce()/shift() approach is more than 200 times faster than zoo::rollapply().
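As a side note (not part of the original answers): sufficiently recent data.table versions also ship optimized froll* rolling functions. There is no rolling product, but since a product of positive terms is the exponential of a sum of logarithms, a rolling product can be sketched with frollsum(), assuming a data.table version that provides it (around 1.12.4 or later):
# rolling product over 3 observations via exp(sum(log(...)));
# valid here because return + 1 > 0; frollsum() aligns right and
# fills the first N - 1 positions of each group with NA by default
test[, momentum := exp(frollsum(log(return + 1), 3)) - 1, by = sec][]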
Apparently, there are different interpretations of what the expected result is.
To investigate this, a modified data set is used:
test <- data.table(return = c(0.1, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
                   sec = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
test
# return sec
# 1: 0.10 A
# 2: 0.11 A
# 3: 0.12 A
# 4: 0.13 A
# 5: 0.14 A
# 6: 0.21 B
# 7: 0.22 B
# 8: 0.23 B
# 9: 0.24 B
#10: 0.25 B
Note that the return values within each group vary, unlike the OP's data set where the return values within each sec group are constant.
With this, the accepted answer (rollapply()) returns:
test[, momentum := zoo_fun(return, 3), by = sec][]
# return sec momentum
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.367520
# 4: 0.13 A 0.404816
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.815726
# 9: 0.24 B 0.860744
#10: 0.25 B 0.906500
Henrik's answer returns:
test[test[ , tail(.I, 3), by = sec]$V1, res := prod(return + 1) - 1, by = sec][]
# return sec res
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.442784
# 4: 0.13 A 0.442784
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.906500
# 9: 0.24 B 0.906500
#10: 0.25 B 0.906500
The Reduce()/shift() solution returns:
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
# return sec momentum
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.367520
# 4: 0.13 A 0.404816
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.815726
# 9: 0.24 B 0.860744
#10: 0.25 B 0.906500
When you use dt$return and pass the whole test, the full data.table (all 10 rows) is used within each group, so the RHS is longer than each 5-row group and gets truncated (hence the warnings). Just use the column you need in the function definition and it will work fine:
# use the column instead of the whole data.table
zoo_fun <- function(column, N) {
  rollapply(column + 1, N, FUN = prod, fill = NA, align = "right") - 1
}
# now it works fine
test[, momentum := zoo_fun(return, 3), by = sec]
As a separate note, you should probably not use return as a column or variable name, since it is a base R function.
Out:
> test
return sec momentum
1: 0.1 A NA
2: 0.1 A NA
3: 0.1 A 0.331
4: 0.1 A 0.331
5: 0.1 A 0.331
6: 0.2 B NA
7: 0.2 B NA
8: 0.2 B 0.728
9: 0.2 B 0.728
10: 0.2 B 0.728
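As a side note, the OP's original zoo_fun(dt, N) (the version that indexes dt$return) would also work if you pass .SD instead of test, because .SD is the per-group subset of the data.table:
# with the original data.table-accepting definition of zoo_fun:
# .SD holds only the rows of the current sec group
test[, momentum3 := zoo_fun(.SD, 3), by = sec]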
Related
I have a vector of thresholds that I want to use to create bins of a column in a data.table:
thrshlds <- seq(from = 0, to = 1, by = 0.05)
test <- data.table(
A = rnorm(1000, 0.7, 1),
B = rbinom(1000, 3, 0.6)
)
The logic that I'm looking to implement is:
If the value of column A is equal to or less than a threshold value, then assign it that threshold value (the smallest such threshold). Similar to a SQL CASE WHEN, but without manually writing out each threshold value.
Something like:
test[, new_category := fcase(A <= thrshlds[1], thrshlds[1],
A <= thrshlds[2], thrshlds[2],
.....)]
But I don't know how to do this kind of iteration inside a data.table query.
Thanks!
You can use cut:
library(data.table)
test[, new_category := cut(A, c(-Inf, thrshlds), thrshlds)]
test
# A B new_category
# 1: 0.220744413 3 0.25
# 2: -0.814886795 3 0
# 3: 1.134536656 2 <NA>
# 4: 0.180463333 1 0.2
# 5: -0.134559033 1 0
# ---
# 996: -0.332559649 1 0
# 997: 0.585641110 0 0.6
# 998: 0.765738832 2 0.8
# 999: 2.167632026 2 <NA>
#1000: 0.008935421 2 0.05
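Note that cut() returns a factor (which is also why values of A above the largest threshold show up as <NA>); if a numeric column is wanted, the labels can be converted back, e.g.:
test[, new_category := as.numeric(as.character(new_category))]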
Not sure if this is an appropriate method, but here's a rolling join option that seems to work:
test[, new_category := data.table(thrshlds)[test, on="thrshlds==A", x.thrshlds, roll=-Inf] ]
#test[sample(1000, 12)]
# A B new_category
# 1: -1.1317742 3 0.00
# 2: 0.2926608 2 0.30
# 3: 1.5441214 2 NA
# 4: 0.9249706 1 0.95
# 5: 1.2663975 2 NA
# 6: 0.6472989 0 0.65
# 7: -0.5606153 2 0.00
# 8: 0.4439064 2 0.45
# 9: 0.8182938 1 0.85
#10: 0.8461909 2 0.85
#11: 1.0237554 1 NA
#12: 0.7752323 1 0.80
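For comparison, here is a hedged base R sketch of the same lookup using findInterval() (new_category2 is an illustrative name; this assumes thrshlds is sorted increasingly). With left.open = TRUE, a value exactly equal to a threshold maps to that threshold, and values above the largest threshold index past the end of the vector and therefore yield NA:
# index of the smallest threshold >= A, or NA if none exists
test[, new_category2 := thrshlds[findInterval(A, thrshlds, left.open = TRUE) + 1L]]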
I already asked a similar question, but now I just want to restrict the new values for the NAs.
I have some data like this:
Date 1 Date 2 Date 3 Date 4 Date 5 Date 6
A NA 0.1 0.2 NA 0.3 0.2
B 0.1 NA NA 0.3 0.2 0.1
C NA NA NA NA 0.3 NA
D 0.1 0.2 0.3 NA 0.1 NA
E NA NA 0.1 0.2 0.1 0.3
I would like to change the NA values of my data based on the first date a value is registered. For example, for A the first registration is Date 2. Before that registration I want the NA values of A to be 0, and after the first registration the NA values should become the mean of the nearest registered values (here the mean of Date 3 and Date 5).
In case the last value is an NA, transform it into the last registered value (as in C and D). In the case of E, all NA values will become 0.
Get something like this:
Date 1 Date 2 Date 3 Date 4 Date 5 Date 6
A 0 0.1 0.2 0.25 0.3 0.2
B 0.1 0.2 0.2 0.3 0.2 0.1
C 0 0 0 0 0.3 0.3
D 0.1 0.2 0.3 0.2 0.1 0.1
E 0 0 0.1 0.2 0.1 0.3
Can you help me? I'm not sure how to do it in R.
Here is a way using na.approx from the zoo package and apply with MARGIN = 1 (so this is probably not very efficient, but it gets the job done).
library(zoo)
df1 <- as.data.frame(t(apply(dat, 1, na.approx, method = "constant", f = .5, na.rm = FALSE)))
This results in
df1
# V1 V2 V3 V4 V5
#A NA 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C NA NA NA NA 0.3
#E NA NA 0.1 0.20 0.1
Replace NAs and rename columns.
df1[is.na(df1)] <- 0
names(df1) <- names(dat)
df1
# Date_1 Date_2 Date_3 Date_4 Date_5
#A 0.0 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C 0.0 0.0 0.0 0.00 0.3
#E 0.0 0.0 0.1 0.20 0.1
explanation
Given a vector
x <- c(0.1, NA, NA, 0.3, 0.2)
na.approx(x)
returns x with linearly interpolated values:
#[1] 0.1000000 0.1666667 0.2333333 0.3000000 0.2000000
But the OP asked for constant values, so we need the argument method = "constant" from the approx function:
na.approx(x, method = "constant")
# [1] 0.1 0.1 0.1 0.3 0.2
But this is still not what the OP asked for, because it carries the last observation forward, while we want the mean of the closest non-NA values. Therefore we need the argument f (also from approx):
na.approx(x, method = "constant", f = .5)
# [1] 0.1 0.2 0.2 0.3 0.2 # looks good
From ?approx
f : for method = "constant" a number between 0 and 1 inclusive, indicating a compromise between left- and right-continuous step functions. If y0 and y1 are the values to the left and right of the point then the value is y0 if f == 0, y1 if f == 1, and y0*(1-f)+y1*f for intermediate values. In this way the result is right-continuous for f == 0 and left-continuous for f == 1, even for non-finite y values.
Lastly, if we don't want to replace the NAs at the beginning and end of each row we need na.rm = FALSE.
From ?na.approx
na.rm : logical. If the result of the (spline) interpolation still results in NAs, should these be removed?
data
dat <- structure(list(Date_1 = c(NA, 0.1, NA, NA), Date_2 = c(0.1, NA,
NA, NA), Date_3 = c(0.2, NA, NA, 0.1), Date_4 = c(NA, 0.3, NA,
0.2), Date_5 = c(0.3, 0.2, 0.3, 0.1)), .Names = c("Date_1", "Date_2",
"Date_3", "Date_4", "Date_5"), class = "data.frame", row.names = c("A",
"B", "C", "E"))
EDIT
If there are NAs in the last column, we can replace these with the last non-NA value of each row before we apply na.approx as shown above.
# max.col(!is.na(dat), ties.method = "last") gives, for each row, the
# column index of the last non-NA value; cbind() builds a matrix index
dat$Date_6[is.na(dat$Date_6)] <- dat[cbind(1:nrow(dat),
  max.col(!is.na(dat), ties.method = "last"))][is.na(dat$Date_6)]
This is another possible answer, using na.locf from the zoo package.
Edit: apply is actually not required. This solution fills in the last observed value if that value is missing.
# create the dataframe
Date1 <- c(NA,.1,NA,NA)
Date2 <- c(.1, NA,NA,NA)
Date3 <- c(.2,NA,NA,.1)
Date4 <- c(NA,.3,NA,.2)
Date5 <- c(.3,.2,.3,.1)
Date6 <- c(.1,NA,NA,NA)
df <- as.data.frame(cbind(Date1,Date2,Date3,Date4,Date5,Date6))
rownames(df) <- c('A','B','C','D')
> df
Date1 Date2 Date3 Date4 Date5 Date6
A NA 0.1 0.2 NA 0.3 0.1
B 0.1 NA NA 0.3 0.2 NA
C NA NA NA NA 0.3 NA
D NA NA 0.1 0.2 0.1 NA
# Load library
library(zoo)
df2 <- t(na.locf(t(df),na.rm = F)) # fill last observation carried forward
df3 <- t(na.locf(t(df),na.rm = F, fromLast = T)) # last obs carried backward
df4 <- (df2 + df3)/2 # mean of both dataframes
df4 <- t(na.locf(t(df4),na.rm = F)) # fill last observation carried forward
df4[is.na(df4)] <- 0 # NA values are 0
Date1 Date2 Date3 Date4 Date5 Date6
A 0.0 0.1 0.2 0.25 0.3 0.1
B 0.1 0.2 0.2 0.30 0.2 0.2
C 0.0 0.0 0.0 0.00 0.3 0.3
D 0.0 0.0 0.1 0.20 0.1 0.1
Here's another option with base R + rollmean from zoo (easy to rewrite in pure base R for this case, since the window size is only k = 2).
t(apply(df, 1, function(x) {
means <- c(0, rollmean(na.omit(x), 2), tail(na.omit(x), 1))
replace(x, is.na(x), means[1 + cumsum(!is.na(x))[is.na(x)]])
}))
# Date1 Date2 Date3 Date4 Date5 Date6
# A 0.0 0.1 0.2 0.25 0.3 0.2
# B 0.1 0.2 0.2 0.30 0.2 0.1
# C 0.0 0.0 0.0 0.00 0.3 0.3
# D 0.1 0.2 0.3 0.20 0.1 0.1
# E 0.0 0.0 0.1 0.20 0.1 0.3
Explanation. Suppose that x is the first row of df:
# Date1 Date2 Date3 Date4 Date5 Date6
# A NA 0.1 0.2 NA 0.3 0.2
Then
means
# [1] 0.00 0.15 0.25 0.25 0.20
is a vector containing 0, the rolling means of each pair of consecutive non-NA elements, and the last non-NA element. Then all we need to do is replace those elements of x that are is.na(x). We replace them with the elements of means at indices 1 + cumsum(!is.na(x))[is.na(x)]. That's the trickier part. Here
cumsum(!is.na(x))
# [1] 0 1 2 2 3 4
Meaning that the first element of x has seen 0 non-NA elements, while, say, the last one has seen 4 non-NA elements so far. Then
cumsum(!is.na(x))[is.na(x)]
# [1] 0 2
gives, for each NA element of x, the number of non-NA elements seen before it. Notice that then
1 + cumsum(!is.na(x))[is.na(x)]
# [1] 1 3
corresponds to the elements of means that we want to use for replacement.
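Putting the pieces together for row A reproduces the expected result (a small verification run):
x <- c(NA, 0.1, 0.2, NA, 0.3, 0.2)
means <- c(0, rollmean(na.omit(x), 2), tail(na.omit(x), 1))
replace(x, is.na(x), means[1 + cumsum(!is.na(x))[is.na(x)]])
# [1] 0.00 0.10 0.20 0.25 0.30 0.20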
I find the function below somewhat complicated, but it works, so here it goes. (The gap-filling step only overwrites the positions strictly between two registered values, so the observed values themselves are left untouched.)
fun <- function(x){
  if(anyNA(x)){
    inx <- which(!is.na(x))
    # zeros before the first registered value
    if(inx[1] > 1) x[seq_len(inx[1] - 1)] <- 0
    prev <- inx[1]
    for(i in inx[-1]){
      # fill the gap strictly between two observed values with their mean
      if(i - prev > 1){
        x[(prev + 1):(i - 1)] <- mean(c(x[prev], x[i]))
      }
      prev <- i
    }
  }
  x
}
res <- t(apply(df1, 1, fun))
res <- as.data.frame(res)
res
# Date.1 Date.2 Date.3 Date.4 Date.5
#A 0.0 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C 0.0 0.0 0.0 0.00 0.3
#E 0.0 0.0 0.1 0.20 0.1
Data.
df1 <- read.table(text = "
Date.1 Date.2 Date.3 Date.4 Date.5
A NA 0.1 0.2 NA 0.3
B 0.1 NA NA 0.3 0.2
C NA NA NA NA 0.3
E NA NA 0.1 0.2 0.1
", header = TRUE)
I have the following data:
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
1) I need to repeat the rows for each id according to n. For example, n = 2.63 for id = 1, so I need to replicate the id = 1 row three times. If n = 0.5, then I need to replicate it only once... so n needs to be rounded up.
2) Create a new variable called t, where the sum of t for each id must equal n.
3) Create another new variable called accumulated.t
Here is how the output should look:
id n t accumulated.t
1 2.63 1 1
1 2.63 1 2
1 2.63 0.63 2.63
2 1.5 1 1
2 1.5 0.5 1.5
3 0.5 0.5 0.5
4 3.5 1 1
4 3.5 1 2
4 3.5 1 3
4 3.5 0.5 3.5
5 4 1 1
5 4 1 2
5 4 1 3
5 4 1 4
Get the ceiling of the 'n' column and use that to expand the rows of 'mydata' (rep(1:nrow(mydata), ceiling(mydata$n))).
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(mydata1)). Grouped by the 'id' column, we replicate (rep) 1 with times given by the truncated first value of 'n' (rep(1, trunc(n[1]))). We take the difference between the group's value of 'n' and the sum of 'tmp' (n[1] - sum(tmp)). If the difference is greater than 0, we concatenate 'tmp' and 'tmp2' (c(tmp, tmp2)); if it is 0, we take only 'tmp'. This is placed in a list to create the two columns: 't' (tmp3) and its cumulative sum 'taccum' (cumsum(tmp3)).
library(data.table)
mydata1 <- mydata[rep(1:nrow(mydata),ceiling(mydata$n)),]
setDT(mydata1)[, c('t', 'taccum') := {
  tmp <- rep(1, trunc(n[1]))
  tmp2 <- n[1] - sum(tmp)
  tmp3 <- if(tmp2 == 0) tmp else c(tmp, tmp2)
  list(tmp3, cumsum(tmp3))
}, by = id]
mydata1
# id n t taccum
# 1: 1 2.63 1.00 1.00
# 2: 1 2.63 1.00 2.00
# 3: 1 2.63 0.63 2.63
# 4: 2 1.50 1.00 1.00
# 5: 2 1.50 0.50 1.50
# 6: 3 0.50 0.50 0.50
# 7: 4 3.50 1.00 1.00
# 8: 4 3.50 1.00 2.00
# 9: 4 3.50 1.00 3.00
#10: 4 3.50 0.50 3.50
#11: 5 4.00 1.00 1.00
#12: 5 4.00 1.00 2.00
#13: 5 4.00 1.00 3.00
#14: 5 4.00 1.00 4.00
An alternative that utilizes base R.
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
mynewdata <- data.frame(id = rep(x = mydata$id,times = ceiling(x = mydata$n)),
n = mydata$n[match(x = rep(x = mydata$id,ceiling(mydata$n)),table = mydata$id)],
t = rep(x = mydata$n / ceiling(mydata$n),times = ceiling(mydata$n)))
mynewdata$t.accum <- unlist(x = by(data = mynewdata$t,INDICES = mynewdata$id,FUN = cumsum))
We start by creating a data.frame with three columns: id, n, and t. id is calculated using rep and ceiling to repeat the ID variable the appropriate number of times. n is obtained by using match to look up the right value in mydata$n. t is obtained as the ratio of n to the ceiling of n, repeated the appropriate number of times (in this case, the ceiling of n again). Note that this spreads n evenly over the repeated rows (for id 1, t is 2.63/3, about 0.877, three times) rather than the 1, 1, 0.63 pattern of the desired output, although the per-id sums of t still equal n.
Then we use cumsum to get the cumulative sum, called via by to allow group-wise processing for each id. You could probably use tapply() here as well.
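For comparison, here is a more compact data.table sketch that produces the OP's 1, 1, ..., remainder pattern directly with pmin() (out is an illustrative name; this assumes n > 0):
library(data.table)
# within each id: t is 1 for every full step and the remainder for the
# last partial step; the running total is simply capped at n
out <- setDT(copy(mydata))[, .(t = pmin(1, n - seq_len(ceiling(n)) + 1),
                               accumulated.t = pmin(seq_len(ceiling(n)), n)),
                           by = .(id, n)]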
Recently, I have found that I am using the following pattern over and over again. The process is:
cross-tabulate numeric variable by factor using table
create data frame from created table
add original numeric values to data frame (from row names (!))
remove row names
reorder columns of aggregated data frame
In R, it looks like this:
# Sample data
df <- data.frame(x = round(runif(100), 1),
y = factor(ifelse(runif(100) > .5, 1, 0),
labels = c('failure', 'success'))
)
# Get frequencies
dfSummary <- as.data.frame.matrix(table(df$x, df$y))
# Add column of original values from rownames
dfSummary$x <- as.numeric(rownames(dfSummary))
# Remove rownames
rownames(dfSummary) <- NULL
# Reorder columns
dfSummary <- dfSummary[, c(3, 1, 2)]
Is there anything more elegant in R, preferably using base functions? I know I can use SQL to do this in a single command, so I think it must be possible to achieve similar behavior in R.
sqldf solution:
library(sqldf)
dfSummary <- sqldf("select
x,
sum(y = 'failure') as failure,
sum(y = 'success') as success
from df group by x")
An alternative with base R could be:
aggregate(. ~ x, transform(df, success = y == "success",
          failure = y == "failure", y = NULL), sum)
# x success failure
#1 0.0 2 4
#2 0.1 6 8
#3 0.2 1 7
#4 0.3 5 4
#5 0.4 6 6
#6 0.5 3 3
#7 0.6 4 6
#8 0.7 6 6
#9 0.8 4 5
#10 0.9 6 7
#11 1.0 1 0
Your code, wrapped in a function, is efficient compared to the other base R solutions (so far). If you want the code in one line, a reshape/table combo from base R can be used:
reshape(as.data.frame(table(df)), idvar='x', timevar='y',
direction='wide')
# x Freq.failure Freq.success
#1 0 3 2
#2 0.1 3 9
#3 0.2 5 5
#4 0.3 8 7
#5 0.4 5 3
#6 0.5 9 4
#7 0.6 3 6
#8 0.7 7 6
#9 0.8 3 1
#10 0.9 4 3
#11 1 0 4
In case you want to try data.table:
library(data.table)
dcast.data.table(setDT(df), x~y)
# x failure success
# 1: 0.0 3 2
# 2: 0.1 3 9
# 3: 0.2 5 5
# 4: 0.3 8 7
# 5: 0.4 5 3
# 6: 0.5 9 4
# 7: 0.6 3 6
# 8: 0.7 7 6
# 9: 0.8 3 1
#10: 0.9 4 3
#11: 1.0 0 4
Update
I didn't notice that as.data.frame(table(...)) converts the columns to "factor" (thanks to @Hadley's comment). A workaround is:
res <- transform(reshape(as.data.frame(table(df), stringsAsFactors=FALSE),
idvar='x', timevar='y', direction='wide'), x= as.numeric(x))
data
set.seed(24)
df <- data.frame(x = round(runif(100), 1),
y = factor(ifelse(runif(100) > .5, 1, 0),
labels = c('failure', 'success'))
)
Benchmarks
set.seed(24)
df <- data.frame(x = round(runif(1e6), 1),
y = factor(ifelse(runif(1e6) > .5, 1, 0),
labels = c('failure', 'success'))
)
tomas <- function(){
dfSummary <- as.data.frame.matrix(table(df$x, df$y))
dfSummary$x <- as.numeric(rownames(dfSummary))
dfSummary <- dfSummary[, c(3, 1, 2)]}
doc <- function(){aggregate(. ~ x, transform(df,
success = y == "success", failure = y == "failure",
y = NULL), sum)}
akrun <- function(){reshape(as.data.frame(table(df)),
idvar='x', timevar='y', direction='wide')}
library(microbenchmark)
microbenchmark(tomas(), doc(), akrun(), unit='relative', times=20L)
Unit: relative
#expr min lq mean median uq max neval cld
#tomas() 1.000000 1.0000000 1.000000 1.000000 1.0000000 1.000000 20 a
#doc() 13.451037 11.5050997 13.082074 13.043584 12.8048306 19.715535 20 b
#akrun() 1.019977 0.9522809 1.012332 1.007569 0.9993835 1.533191 20 a
Updated with dcast.data.table
df1 <- copy(df)
akrun2 <- function() {dcast.data.table(setDT(df1), x~y)}
microbenchmark(tomas(), akrun2(), unit='relative', times=20L)
# Unit: relative
# expr min lq mean median uq max neval cld
# tomas() 6.493231 6.345752 6.410853 6.51594 6.502044 5.591753 20 b
# akrun2() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 20 a
This should be relatively efficient. You cannot really suppress row names in a data frame, since they are a requirement of a valid data frame:
X <- table(df$x,df$y)
cbind( data.frame(x=rownames(X)), unclass(X) )
x failure success
0 0 5 3
0.1 0.1 6 1
0.2 0.2 7 8
0.3 0.3 7 3
0.4 0.4 6 6
0.5 0.5 6 4
0.6 0.6 2 5
0.7 0.7 2 7
0.8 0.8 3 7
0.9 0.9 4 6
1 1 2 0
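If the printed row names bother you, a small variation resets them to the default sequence (with x converted back to numeric, matching the OP's original pipeline):
res <- cbind(data.frame(x = as.numeric(rownames(X))), unclass(X))
rownames(res) <- NULL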
This question already has an answer here:
Subset rows corresponding to max value by group using data.table
I have a table where the key is repeated a number of times, and I want to select just one row for each key, using the largest value of another column.
This example demonstrates the solution I have at the moment:
N = 10
k = 2
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
X Y
1: 1 -1.37925206
2: 1 -0.53837461
3: 2 0.26516340
4: 2 -0.04643483
5: 3 0.40331424
6: 3 0.28667275
7: 4 -0.30342327
8: 4 -2.13143267
9: 5 2.11178673
10: 5 -0.98047230
11: 6 -0.27230783
12: 6 -0.79540934
13: 7 1.54264549
14: 7 0.40079650
15: 8 -0.98474297
16: 8 0.73179201
17: 9 -0.34590491
18: 9 -0.55897393
19: 10 0.97523187
20: 10 1.16924293
> DT[, .SD[Y == max(Y)], by = X]
X Y
1: 1 -0.5383746
2: 2 0.2651634
3: 3 0.4033142
4: 4 -0.3034233
5: 5 2.1117867
6: 6 -0.2723078
7: 7 1.5426455
8: 8 0.7317920
9: 9 -0.3459049
10: 10 1.1692429
The problem is that for larger data.tables this takes a very long time:
N = 10000
k = 25
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
system.time(DT[, .SD[Y == max(Y)], by = X])
user system elapsed
9.69 0.00 9.69
My actual table has about 100 million rows...
Can anyone suggest a more efficient solution?
Edit - importance of setkey
The proposed solution works well, but you must use setkey, or have DT ordered by X, for it to work.
See this example without "each" in rep:
N = 10
k = 2
DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
DT[DT[, Y == max(Y), by = X]$V1,]
X Y
1: 1 1.26925708
2: 4 -0.66625732
3: 5 0.41498548
4: 8 0.03531185
5: 9 0.30608380
6: 1 0.50308578
7: 4 0.19848227
8: 6 0.86458423
9: 8 0.69825500
10: 10 -0.38160503
This would be faster compared to the .SD approach:
system.time({setkey(DT, X)
DT[DT[,Y==max(Y), by=X]$V1,]})
# user system elapsed
#0.016 0.000 0.016
Or
system.time(DT[DT[, .I[Y==max(Y)], by=X]$V1])
# user system elapsed
# 0.023 0.000 0.023
If there are only two columns,
system.time(DT[,list(Y=max(Y)), by=X])
# user system elapsed
# 0.006 0.000 0.007
Compared to,
system.time(DT[, .SD[Y == max(Y)], by = X] )
# user system elapsed
# 2.946 0.006 2.962
Based on comments from @Khashaa and @AnandaMahto: the CRAN version (1.9.4) gives a different result for the .SD method than the devel version (1.9.5), which I used. You can get the same result on the CRAN version (per @Arun's comments) by setting the option
options(datatable.auto.index=FALSE)
NOTE: In case of ties, the solutions described here will return multiple rows for each group (as mentioned by @docendo discimus). My solutions are based on the code posted by the OP.
If there are ties, you could use unique with the by option (in case the number of columns is > 2):
setkey(DT,X)
unique(DT[DT[,Y==max(Y), by=X]$V1,], by=c("X", "Y"))
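Alternatively, if exactly one row per group is wanted even in the presence of ties, which.max() picks the index of the first maximum and combines with the .I idiom above (a sketch, not included in the benchmarks below):
DT[DT[, .I[which.max(Y)], by = X]$V1]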
microbenchmarks
library(microbenchmark)
f1 <- function(){setkey(DT,X)[DT[, Y==max(Y), by=X]$V1,]}
f2 <- function(){DT[DT[, .I[Y==max(Y)], by=X]$V1]}
f3 <- function(){DT[, list(Y=max(Y)), by=X]}
f4 <- function(){DT[, .SD[Y==max(Y)], by=X]}
microbenchmark(f1(), f2(), f3(), f4(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval
# f1() 2.794435 2.733706 3.024097 2.756398 2.832654 6.697893 20
# f2() 4.302534 4.291715 4.535051 4.271834 4.342437 8.114811 20
# f3() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
# f4() 533.119480 522.069189 504.739719 507.494095 493.641512 466.862691 20
# cld
# a
# a
# a
# b
data
N = 10000
k = 25
set.seed(25)
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))