substitute into `j` element in data.table[, j, by] - r

I have:
DT = data.table(ID=rep(1:2,each = 2), Index=rep(1:2,times = 2), Close=3:6, Open=7:10)
My algorithm has earlier determined that DT holds the time information in the column named Index, so the algorithm stores the following mapping:
time.col <- "Index"
Now the algorithm wants to perform a calculation that would be equivalent to:
DT[, list(Index, Value=cumsum(Close)),by=ID]
ID Index Value
1: 1 1 3
2: 1 2 7
3: 2 1 5
4: 2 2 11
How can I rewrite this line to plug in the time.col variable?
Neither of the following works:
DT[, list(time.col, Value=cumsum(Close)),by=ID]
DT[, list(substitute(time.col), Value=cumsum(Close)),by=ID]

You can create an expression for all of j in DT:
e <- parse(text = paste0("list(", time.col,",", "Value=cumsum(Close))"))
DT[, eval(e),by=ID]
EDIT
Or, if you store "Index" as a name, you can evaluate time.col within the environment of .SD:
time.col <- as.name("Index")
DT[,list(eval(time.col,envir=.SD), Value=cumsum(Close)),by=ID]
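With the character form of time.col (time.col <- "Index"), a closely related variant, used in the benchmarking below, looks the column up with get(); the resulting column then comes out as V1 and needs renaming:
out <- DT[, list(get(time.col), Value = cumsum(Close)), by = ID] # look the column up by name
setnames(out, "V1", time.col)                                    # restore the original column name
out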
Very similar question here: In R data.table, how do I pass variable parameters to an expression?
Also, this question helps to understand the mystery of non-standard evaluation in data.table:
eval and quote in data.table

It turns out that the fastest of the eval-based solutions mentioned above is
e <- parse(text = paste0("list(", time.col,",", "Value=cumsum(Close))"))
DT[, eval(e),by=ID]
However, the := solution is even faster. See also Arun's note regarding copying.
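A minimal sketch of that := route on the small DT from the question (with time.col as the character string "Index"):
DT[, Value := cumsum(Close), by = ID]                  # add Value by reference, no copy of DT
out <- DT[, c("ID", time.col, "Value"), with = FALSE]  # select columns by name
DT[, Value := NULL]                                    # optional cleanup
out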
Dataset
dim(DT); object.size(DT); DT
[1] 1354402 8
81291568 bytes
Instrument Date Open High Low Close Volume Adjusted Close
1: GOOG/AMEX_ABI 1981-03-11 NA NA 6.56 6.75 217200 NA
2: GOOG/AMEX_ABI 1981-03-12 NA NA 6.66 6.88 616400 NA
3: GOOG/AMEX_ABI 1981-03-13 NA NA 6.81 6.84 462000 NA
4: GOOG/AMEX_ABI 1981-03-16 NA NA 6.81 7.00 306400 NA
5: GOOG/AMEX_ABI 1981-03-17 NA NA 6.88 6.88 925600 NA
---
1354398: YAHOO/TSX_AMM_TO 2014-04-24 1.56 1.58 1.56 1.58 2700 1.58
1354399: YAHOO/TSX_AMM_TO 2014-04-25 1.60 1.62 1.59 1.62 11000 1.62
1354400: YAHOO/TSX_AMM_TO 2014-04-28 1.59 1.61 1.54 1.54 7200 1.54
1354401: YAHOO/TSX_AMM_TO 2014-04-29 1.58 1.60 1.58 1.59 500 1.59
1354402: YAHOO/TSX_AMM_TO 2014-04-30 1.55 1.55 1.50 1.52 36800 1.52
Benchmarking
time.col <- "Date"
fun <- function(){
out <- DT[, list(get(time.col), Value=cumsum(Close)),by=Instrument]
setnames(out, "V1", time.col)
}
fun2 <- function() {
DT[, Value := cumsum(Close), by=Instrument]
out <- DT[ , c("Instrument", ..time.col, "Value")]
DT[, Value:=NULL] # cleanup
out
}
fun2. <- function() {
DT[, Value := cumsum(Close), by=Instrument]
# out <- DT[,c("Instrument", ..time.col, "Value")]
# DT[, Value:=NULL] # cleanup
# out
}
fun3 <- function() {
DT[,list( eval(as.name(time.col),envir=.SD), Value=cumsum(Close)),by=Instrument]
}
fun4 <- function() {
e <- parse(text = paste0("list(", time.col,",", "Value=cumsum(Close))"))
DT[, eval(e),by=Instrument]
}
Result
library(rbenchmark)
benchmark(fun(),
fun2(),
fun3(),
fun4(),
replications=200)
test replications elapsed relative user.self sys.self user.child sys.child
1 fun() 200 5.40 1.327 5.29 0.11 NA NA
2 fun2() 200 5.18 1.273 4.72 0.45 NA NA
3 fun2.() 200 2.70 1.000 2.70 0.00 NA NA
4 fun3() 200 4.12 1.012 3.90 0.22 NA NA
5 fun4() 200 4.07 1.000 3.91 0.16 NA NA

Related

Max-If excel replication in R

I'm trying to replicate the max if function from Excel in R.
ksu price max
9144037 3.11 3.11
8448749 4.19 5.24
9649391 0 8.39
8448749 4.19 5.24
8448749 4.19 5.24
8448749 4.19 5.24
8448749 4.19 5.24
9649391 8.39 8.39
8448749 5.24 5.24
9144037 1.99 3.11
9144037 1.99 3.11
If I were doing it in Excel I'd use MAX(IF()). This code is supposed to look at the max price for each ksu, and return that max value in the last column. I've tried this:
max(price[ksu == ksu])
But it doesn't give me the desired output. It only returns one max value regardless of the ksu.
Assuming you have a data.frame called df, you could easily use the ave function to get what you want. An example:
> df <- data.frame(grp = c('a','a','b','b'), vals = 1:4)
> df
grp vals
1 a 1
2 a 2
3 b 3
4 b 4
> # Returns a vector
> ave(df$vals, df$grp, FUN = max)
[1] 2 2 4 4
> # So we can store it back into the data.frame if we want
> df$max <- ave(df$vals, df$grp, FUN = max)
> df
grp vals max
1 a 1 2
2 a 2 2
3 b 3 4
4 b 4 4
So using your variable names (but still assuming the data.frame is df):
df$max <- ave(df$price, df$ksu, FUN = max)
If your data is in a data.frame called dat, we can use the dplyr package:
library(dplyr)
dat %>%
group_by(ksu) %>%
mutate(max = max(price))
ksu price max
<int> <dbl> <dbl>
1 9144037 3.11 3.11
2 8448749 4.19 5.24
3 9649391 0.00 8.39
...
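For readers who prefer data.table (used heavily elsewhere on this page), the same grouped maximum can be added as a column by reference; a minimal sketch, again assuming the data.frame is called dat:
library(data.table)
setDT(dat)                          # convert dat to a data.table in place
dat[, max := max(price), by = ksu]  # per-ksu maximum, recycled within each group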

nested ifelse in R so close to working

I'm working with the following four columns of raw weight measurement data and a very nearly-functioning nested ifelse statement that results in the 'kg' vector.
Id G4_R_2_4 G4_R_2_5 G4_R_2_5_option2 kg
219 13237 16.0 NA NA 16.0
220 139129 8.50 55.70 47.20 8.50
221 139215 28.9 NA NA 28.9
222 139216 NA 46.70 8.50 46.70
223 139264 12.40 NA NA 12.40
224 139281 13.60 NA NA 13.60
225 139366 16.10 NA NA 16.10
226 139376 61.80 NA NA 61.80
227 140103 NA 48.60 9.10 48.60
The goal is to merge the three 'G4' columns into kg based on the following conditions:
1) If G4_R_2_4 is not NA, print its value
2) If G4_R_2_4 is NA, print the lesser of the values appearing in G4_R_2_5 and G4_R_2_5_option2 (sorry for lame variable names)
I've been working with the following statement (big dataset called 'child'):
> child$kg <- ifelse(child$G4_R_2_4 == 'NA' & child$G4_R_2_5 < child$G4_R_2_5_option2,
child$G4_R_2_5, ifelse(child$G4_R_2_4 == 'NA' & child$G4_R_2_5 > child$G4_R_2_5_option2,
child$G4_R_2_5_option2, child$G4_R_2_4))
Which results in the 'kg' vector I have now. It seems to satisfy the G4_R_2_4 condition (is/is not NA) but always prints the value from G4_R_2_5 for the NA cases. How do I get it to incorporate the greater than/less than condition?
It's not clear from your example, but I think the problem is that you're handling NAs incorrectly and/or using the wrong type for the data.frame's columns. Try rewriting your code like this:
#if your columns are of character type (warnings are ok)
child$G4_R_2_4<-as.numeric(child$G4_R_2_4)
child$G4_R_2_5<-as.numeric(child$G4_R_2_5)
child$G4_R_2_5_option2<-as.numeric(child$G4_R_2_5_option2)
#correct NA handling
child$kg<-ifelse(is.na(child$G4_R_2_4) & child$G4_R_2_5 <
child$G4_R_2_5_option2, child$G4_R_2_5, ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 > child$G4_R_2_5_option2, child$G4_R_2_5_option2, child$G4_R_2_4))
Here is an alternative version that might be interesting, assuming that the values are stored in numerical form (else the column entries should be converted into numerical values, as suggested in the other answers):
get_kg <- function(x){
if(!is.na(x[2])) return (x[2])
return (min(x[3], x[4], na.rm = T))}
child$kg <- apply(child,1,get_kg)
#> child
# Id G4_R_2_4 G4_R_2_5 G4_R_2_5_option2 kg
#219 13237 16.0 NA NA 16.0
#220 139129 8.5 55.7 47.2 8.5
#221 139215 28.9 NA NA 28.9
#222 139216 NA 46.7 8.5 8.5
#223 139264 12.4 NA NA 12.4
#224 139281 13.6 NA NA 13.6
#225 139366 16.1 NA NA 16.1
#226 139376 61.8 NA NA 61.8
#227 140103 NA 48.6 9.1 9.1
We could do this using pmin. Assuming that your 'G4' columns are of 'character' class, we convert those columns to 'numeric' class and use pmin on them.
indx <- grep('^G4', names(child))
child[indx] <- lapply(child[indx], as.numeric)
d1 <- child[indx]
child$kgN <- ifelse(is.na(d1[,1]), do.call(pmin, c(d1[-1], na.rm=TRUE)), d1[,1])
child$kgN
#[1] 16.0 8.5 28.9 8.5 12.4 13.6 16.1 61.8 9.1
Or without using ifelse
cbind(d1[,1], do.call(pmin, c(d1[-1], na.rm=TRUE)))[cbind(1:nrow(d1),
(is.na(d1[,1]))+1L)]
#[1] 16.0 8.5 28.9 8.5 12.4 13.6 16.1 61.8 9.1
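The last expression works because indexing a matrix with a two-column matrix of (row, column) pairs picks one element per row: column 1 (the original value) when it is not NA, column 2 (the pmin result) when it is. A tiny illustration of that indexing trick (toy values, not from the data above):
m <- cbind(c(10, 20, 30), c(1, 2, 3))
m[cbind(1:3, c(1, 2, 1))]  # picks m[1,1], m[2,2], m[3,1]
#[1] 10  2 30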
Benchmarks
set.seed(24)
child1 <- as.data.frame(matrix(sample(c(NA,0:50), 1e6*3, replace=TRUE),
ncol=3, dimnames=list(NULL, c('G4_R_2_4', 'G4_R_2_5',
'G4_R_2_5_option2'))) )
cyberj0g <- function(){
with(child1, ifelse(is.na(G4_R_2_4) & G4_R_2_5 <
G4_R_2_5_option2, G4_R_2_5, ifelse(is.na(G4_R_2_4) &
G4_R_2_5 > G4_R_2_5_option2, G4_R_2_5_option2, G4_R_2_4)))
}
get_kg <- function(x){
if(!is.na(x[2])) return (x[2])
return (min(x[3], x[4], na.rm = T))}
RHertel <- function() apply(child1,1,get_kg)
akrun <- function(){cbind(child1[,1], do.call(pmin, c(child1[-1],
na.rm=TRUE)))[cbind(1:nrow(child1), (is.na(child1[,1]))+1L)]}
system.time(cyberj0g())
# user system elapsed
# 0.451 0.000 0.388
system.time(RHertel())
# user system elapsed
# 11.808 0.000 10.928
system.time(akrun())
# user system elapsed
# 0.000 0.000 0.084
library(microbenchmark)
microbenchmark(cyberj0g(), akrun(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
# cyberj0g() 3.750391 4.137777 3.538063 4.091793 2.895156 3.197511 20 b
# akrun() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
I'm pretty sure the problem is that you're not testing whether the values are NAs, you're testing whether they are equal to the string "NA", which they never are. This should work:
child$kg <- ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 < child$G4_R_2_5_option2,
child$G4_R_2_5,
ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 > child$G4_R_2_5_option2,
child$G4_R_2_5_option2,
child$G4_R_2_4))

Rolling correlation with id and date

I have some data that has a name, date, and two factors (x, y). I would like to calculate a rolling correlation between x and y. Example data:
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
I would like to add a column named "Correl" that for each date, takes the correlation of the previous 5 periods. However, when the name changes, I would like it to give NA's instead.
As you can see below, when the data becomes XYZ instead of ABC, the first 4 periods, the correlation is NA. When there's 5 data points is when the correlation begins again.
name date x y Correl
ABC 1/1/2013 -3.59 -5.13 NA
ABC 1/2/2013 -8.69 4.22 NA
ABC 1/3/2013 2.80 -0.59 NA
ABC 1/4/2013 0.54 5.06 NA
ABC 1/5/2013 1.13 3.49 -0.03
ABC 1/6/2013 0.52 5.16 -0.38
ABC 1/7/2013 -0.24 -5.40 0.08
ABC 1/8/2013 3.26 -2.75 -0.16
ABC 1/9/2013 1.33 5.94 -0.04
ABC 1/10/2013 2.24 1.14 -0.01
ABC 1/11/2013 0.01 9.87 -0.24
ABC 1/12/2013 2.29 1.28 -0.99
ABC 1/13/2013 1.03 -6.30 -0.41
ABC 1/14/2013 0.62 4.82 -0.47
ABC 1/15/2013 1.08 -1.17 -0.50
ABC 1/16/2013 2.43 8.86 0.45
ABC 1/17/2013 -3.43 9.38 -0.35
ABC 1/18/2013 -5.73 7.59 -0.38
ABC 1/19/2013 1.77 3.13 -0.44
ABC 1/20/2013 -0.97 -0.77 -0.24
XYZ 1/1/2013 2.12 10.22 NA
XYZ 1/2/2013 -0.81 0.22 NA
XYZ 1/3/2013 -1.55 -2.25 NA
XYZ 1/4/2013 -4.53 3.63 NA
XYZ 1/5/2013 2.95 -1.51 0.13
XYZ 1/6/2013 6.76 24.16 0.69
XYZ 1/7/2013 3.33 7.31 0.66
XYZ 1/8/2013 -1.47 -4.23 0.67
XYZ 1/9/2013 3.89 -0.43 0.81
XYZ 1/10/2013 5.63 17.95 0.86
XYZ 1/11/2013 3.29 -7.09 0.63
XYZ 1/12/2013 6.03 -9.03 0.29
XYZ 1/13/2013 -5.63 6.96 -0.19
XYZ 1/14/2013 1.70 13.59 -0.18
XYZ 1/15/2013 -1.19 -16.79 -0.29
XYZ 1/16/2013 4.76 4.91 -0.11
XYZ 1/17/2013 9.02 25.16 0.57
XYZ 1/18/2013 4.56 6.48 0.84
XYZ 1/19/2013 5.30 11.81 0.99
XYZ 1/20/2013 -0.60 3.38 0.84
UPDATE: I have tried all of your suggestions and have run into problems using the actual data. I have attached a subset of the data below:
https://www.dropbox.com/s/6k4xhwuinlu0p1f/TEST_SUBSET.csv?dl=0
I cannot get this to work. I've tried removing the NA's, renaming the rows, reading the data in differently, formatting the date differently. Nothing is working for me. Can you see if what you are running is working for this dataset? Thank you very much folks!
Apply ave to the row indexes of df to process by name and use rollapplyr to perform the rolling computations. Note that i is a vector of indexes:
library(zoo)
corx <- function(x) cor(x[, 1], x[, 2])
df$Correl <- ave(1:nrow(df), df$name, FUN = function(i)
rollapplyr(df[i, c("x", "y")], 5, corx, by.column = FALSE, fill = NA))
Update: Changed rollapply to rollapplyr to be consistent with the output shown in the question. If you want centred correlations, change it back to rollapply.
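A quick toy example (not part of the original answer) showing the difference in alignment:
library(zoo)
rollapplyr(1:5, 3, sum, fill = NA)                   # right-aligned: NA NA  6  9 12
rollapply(1:5, 3, sum, fill = NA, align = "center")  # centred:      NA  6  9 12 NA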
This is a little late to the party, but the below is a pretty compact solution with dplyr and rollapply (from the zoo package).
library(dplyr)
library(zoo)
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
df<-df %>%
group_by(name)%>%
arrange(date) %>%
do({
correl <- rollapply(.[-(1:2)],width = 5, function(a) cor(a[,1],a[,2]), by.column = FALSE, align = "right", fill = NA)
data.frame(., correl)
})
which returns...
> df
Source: local data frame [40 x 5]
Groups: name
name date x y correl
1 ABC 2013-01-01 -0.61707785 -0.7299461 NA
2 ABC 2013-01-02 1.35353618 9.1314743 NA
3 ABC 2013-01-03 2.60815932 0.2511828 NA
4 ABC 2013-01-04 -2.89619789 -1.2586655 NA
5 ABC 2013-01-05 2.23750886 4.6616034 0.52013407
6 ABC 2013-01-06 -1.97573999 3.6800832 0.37575664
7 ABC 2013-01-07 1.70360813 2.2621718 0.32390612
8 ABC 2013-01-08 0.02017797 2.5088032 0.64020507
9 ABC 2013-01-09 0.96263256 1.6711756 -0.00557611
10 ABC 2013-01-10 -0.62400803 5.2011656 -0.66040650
.. ... ... ... ... ...
checking that the other group responds correctly...
> df %>%
+ filter(name=="XYZ")
Source: local data frame [20 x 5]
Groups: name
name date x y correl
1 XYZ 2013-01-01 3.4199729 5.0866361 NA
2 XYZ 2013-01-02 4.7326297 -5.4613465 NA
3 XYZ 2013-01-03 3.8983329 11.1635903 NA
4 XYZ 2013-01-04 1.5235936 3.9077184 NA
5 XYZ 2013-01-05 -5.4885373 7.8961020 -0.3755766
6 XYZ 2013-01-06 0.2311371 2.0157046 -0.3754510
7 XYZ 2013-01-07 2.6903306 -3.2940181 -0.1808097
8 XYZ 2013-01-08 -0.2584268 3.6047800 -0.8457930
9 XYZ 2013-01-09 -0.2897795 2.1029431 -0.9526992
10 XYZ 2013-01-10 5.9571558 18.5810947 0.7025559
11 XYZ 2013-01-11 -7.5250647 -8.0858699 0.7949917
12 XYZ 2013-01-12 2.8438336 -8.4072829 0.6563161
13 XYZ 2013-01-13 7.2295030 -0.1236801 0.5383666
14 XYZ 2013-01-14 -0.7579570 -0.2830291 0.5542751
15 XYZ 2013-01-15 4.3116507 -6.5291051 0.3894343
16 XYZ 2013-01-16 1.4334510 0.5957465 -0.1480032
17 XYZ 2013-01-17 -2.6444881 6.1261976 -0.6183805
18 XYZ 2013-01-18 0.8517223 0.5587499 -0.9243050
19 XYZ 2013-01-19 6.2140131 -3.0944259 -0.8939475
20 XYZ 2013-01-20 11.2871086 -0.1187153 -0.6845300
Hope this helps!
FOLLOW UP
I just ran the following on your actual data set:
library(dplyr)
library(zoo)
import <- read.csv("TEST_SUBSET.CSV", header=TRUE, stringsAsFactors = FALSE)
str(head(import))
import_df<-import %>%
group_by(id)%>%
arrange(asof_dt) %>%
do({
correl <- rollapply(.[-(1:2)],width = 5, function(a) cor(a[,1],a[,2]), by.column = FALSE, align = "right", fill = NA)
data.frame(., correl)
})
import_df
and received the following:
> import_df
Source: local data frame [15,365 x 5]
Groups: id
id asof_dt x y correl
1 DC1123 1/10/1990 -0.003773632 NA NA
2 DC1123 1/10/1991 0.014034992 NA NA
3 DC1123 1/10/1992 -0.004109765 NA NA
4 DC1123 1/10/1994 0.006369326 0.012176085 NA
5 DC1123 1/10/1995 0.014900600 0.001241080 NA
6 DC1123 1/10/1996 0.005763689 -0.013112491 NA
7 DC1123 1/10/1997 0.006949765 0.010737034 NA
8 DC1123 1/10/2000 0.044052805 0.003346296 0.02724175
9 DC1123 1/10/2001 0.009452785 0.017582638 0.01362101
10 DC1123 1/10/2002 -0.018876970 0.004346372 0.01343657
.. ... ... ... ... ...
so it feels like it's working.
The cor function is only going to return data when it has 5 input points, which doesn't happen until row 8.
Here is a solution using base R; note that it requires the data set to be sorted by name and date, in that order.
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
rollcorr = function(df, lag = 4) {
out = numeric(nrow(df) - lag)
for( i in seq_along(out) ) {
window = i:(i+lag)
out[i] = cor(df$x[window], df$y[window])
}
out <- c(rep(NA, lag), out)
return(out)
}
df$Correl <- do.call(c, by(df[, -1], df[, 1], rollcorr))

Create lagged vectors based on a different data.frame in a panel in R

I've got two data.frames, one with event data and one with stock data of several companies (here there are only two). I want two additional columns with lagged dates (-1 day and +1 day) for both companies in my event data.frame. The lagged dates should of course come from my stock data.frame (df). How can I do that?
DATE <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000","01.01.2000","02.01.2000","04.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df <- data.frame(DATE, RET, COMP)
df
# DATE RET COMP
# 1 01.01.2000 -2.00 A
# 2 02.01.2000 1.10 A
# 3 03.01.2000 3.00 A
# 4 06.01.2000 1.40 A
# 5 07.01.2000 -0.20 A
# 6 09.01.2000 0.60 A
# 7 10.01.2000 0.10 A
# 8 01.01.2000 -0.21 B
# 9 02.01.2000 -1.20 B
# 10 04.01.2000 0.90 B
# 11 06.01.2000 0.30 B
# 12 07.01.2000 -0.10 B
# 13 09.01.2000 0.30 B
# 14 10.01.2000 -0.12 B
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event <- data.frame(DATE, ARTICLE, COMP)
event
# DATE ARTICLE COMP
# 1 02.01.2000 blabla11 A
# 2 03.01.2000 blabla12 A
# 3 06.01.2000 blabla13 A
# 4 09.01.2000 blabla14 A
# 5 06.01.2000 blabla21 B
# 6 07.01.2000 blabla22 B
# 7 09.01.2000 blabla23 B
The output should be my data.frame event with the two additional columns DATEm1 and DATEp1:
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 02.01.2000 01.01.2000 03.01.2000 blabla11 A
# 2 03.01.2000 02.01.2000 06.01.2000 blabla12 A
# 3 06.01.2000 03.01.2000 07.01.2000 blabla13 A
# 4 09.01.2000 07.01.2000 10.01.2000 blabla14 A
# 5 06.01.2000 04.01.2000 07.01.2000 blabla21 B
# 6 07.01.2000 06.01.2000 09.01.2000 blabla22 B
# 7 09.01.2000 07.01.2000 10.01.2000 blabla23 B
I have tried the approach in G. Grothendieck's answer, which works perfectly for this example.
The problem is that my original data.frame contains far more data than this example, and the sqldf approach is rather slow and uses a lot of memory (too much for my machine). Does anyone have another solution for this?
I tried an approach that uses embed and data.table. Testing with the provided example data, it is competitive with the other data.table approaches (see benchmarking below), but still a bit slower. The embed approach might be faster when extended to additional lags, but I'm not sure if that's relevant.
Anyway, I put the (as of right now) answers together and compared the timings and output. I don't know how much the exact output matters to you (e.g., I lost a bit of time on the benchmarking because I had to dump the RET column), but take note that the different answers vary slightly in output format/content. All approaches provide a result that is similar to your desired output format.
I wonder if the different methods scale differently for different sizes of data.frames ... If you test these, I'd be curious to know which is fastest for you and your data! :)
Data and Libraries
library("data.table")
library("sqldf")
library("microbenchmark")
# ========
# = Data =
# ========
DATE <- c("01.01.2000", "02.01.2000", "03.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000", "01.01.2000", "02.01.2000", "04.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df0 <- data.frame(DATE, RET, COMP)
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event0 <- data.frame(DATE, ARTICLE, COMP)
rbatt (this answer)
# ==================
# = rbatt function =
# ==================
# Deviations from desired format:
# 1) column order (COMP is first instead of last, otherwise correct order)
m2l <- function(x) split(x, rep(1:ncol(x), each = nrow(x))) # Thanks to https://stackoverflow.com/a/6823557/2343633
e2 <- function(x, d=1) m2l(rbind(matrix(NA, ncol=d, nrow=d-1), embed(x,d)))
testRB <- function(df=df0, event=event0){
dt1 <- as.data.table(df)
dt1[,DATE:=as.character(DATE)]
dt1[,c("DATEp1","DATE","DATEm1"):=e2(DATE,3),by=COMP]
dt1[,RET:=NULL]
setkey(dt1, COMP, DATE, DATEp1, DATEm1)
dt2 <- as.data.table(event)
dt2[,DATE:=as.character(DATE)]
setkey(dt2,COMP,DATE)
# below is slightly slower than doing dt1[,RET:=NULL] then dt <- dt1[dt2]
# dt <- dt1[dt2, list(DATEp1, DATEm1, ARTICLE)] # join
dt <- dt1[dt2]
dt
}
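For intuition about the embed() trick used in e2() above: embed() slides a window of width d over a vector and returns one row per position, most recent value first; e2() simply pads the first d-1 rows with NA and splits the columns into a list. A toy example (not from the benchmark):
embed(1:5, 3)
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]    4    3    2
# [3,]    5    4    3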
rbatt output:
# COMP DATE DATEp1 DATEm1 ARTICLE
#1: A 02.01.2000 03.01.2000 01.01.2000 blabla11
#2: A 03.01.2000 06.01.2000 02.01.2000 blabla12
#3: A 06.01.2000 07.01.2000 03.01.2000 blabla13
#4: A 09.01.2000 10.01.2000 07.01.2000 blabla14
#5: B 06.01.2000 07.01.2000 04.01.2000 blabla21
#6: B 07.01.2000 09.01.2000 06.01.2000 blabla22
#7: B 09.01.2000 10.01.2000 07.01.2000 blabla23
DA Answer
edited – DA optimization #1 (old code commented-out)
edited – DA optimization #2 (old code commented, versions labeled)
# ===========================
# = David Arenburg function =
# ===========================
# https://stackoverflow.com/a/23483865/2343633
# Deviations from desired format:
# 1) column order
# 2) format of DATE, DATEm1, DATEp1
testDA <- function(df=df0, event=event0){
# Original DA below:
# df$DATE <- as.Date(strptime(as.character(df$DATE), format = "%m.%d.%Y"))
# event$DATE <- as.Date(strptime(as.character(event$DATE), format = "%m.%d.%Y"))
#
# ## Making sure "df" is sorted. If your data sets are already ordered you can skip the ordering both here and in the `setDT`
# df <- df[order(df$COMP, df$DATE), ]
#
# library(data.table)
# DT <- setDT(event)[order(COMP, DATE), list(
# DATEm1 = df[match(DATE, df$DATE) - 1, "DATE"],
# DATEp1 = df[match(DATE, df$DATE) + 1, "DATE"]
# ), by = c("ARTICLE", "DATE", "COMP")]
# DT
# Optimization #1:
# event$DATE <- as.character(event$DATE) # converting event$DATE to character (if it is already a character, better to skip this part)
# tempdf <- as.character(data.table(df, key = c("COMP", "DATE"))$DATE) # sorting and converting df$DATE to character too so they will match
# setDT(event)[order(COMP, DATE), `:=` (
# DATEm1 = tempdf[match(DATE, tempdf) - 1],
# DATEp1 = tempdf[match(DATE, tempdf) + 1]
# ), by = c("DATE", "COMP")]
# event
# Optimization #2
# library(data.table) # loading data.table pckg
tempdf <- data.table(df, key = c("COMP", "DATE"))$DATE # sorting df and taking only the dates for speed
setDT(event)[order(COMP, DATE), `:=` (
DATEm1 = tempdf[match(DATE, tempdf) - 1],
DATEp1 = tempdf[match(DATE, tempdf) + 1]
)]
event
}
David Arenburg output:
edited for DA optimization #1 (#2 may be bugged)
note wrong content in row 5, column "DATEm1": the month should be 04
# > testDA()
# DATE ARTICLE COMP DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 03.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 07.01.2000 10.01.2000
GG Answer
# ============================
# = G. Grothendieck function =
# ============================
# https://stackoverflow.com/a/23415033/2343633
# Deviations from desired format:
# 1) format of DATE, DATEm1, DATEp1
testGG <- function(df=df0, event=event0){
# ensure that dates sort correctly by converting to yyyy-mm-dd
df2 <- transform(df, DATE = format(as.Date(DATE, "%m.%d.%Y")))
event2 <- transform(event, DATE = format(as.Date(DATE, "%m.%d.%Y")))
result <- sqldf(c("create index i on df2(COMP, DATE)",
"select
event.DATE,
max(A.DATE) DATEm1,
min(B.DATE) DATEp1,
event.ARTICLE,
event.COMP
from event2 event, main.df2 A, main.df2 B
on event.COMP = A.COMP and event.COMP = B.COMP
and event.DATE > A.DATE and event.DATE < B.DATE
group by event.DATE, event.COMP
order by event.COMP, event.DATE"))
result
}
GG output:
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 2000-02-01 2000-01-01 2000-03-01 blabla11 A
# 2 2000-03-01 2000-02-01 2000-06-01 blabla12 A
# 3 2000-06-01 2000-03-01 2000-07-01 blabla13 A
# 4 2000-09-01 2000-07-01 2000-10-01 blabla14 A
# 5 2000-06-01 2000-04-01 2000-07-01 blabla21 B
# 6 2000-07-01 2000-06-01 2000-09-01 blabla22 B
# 7 2000-09-01 2000-07-01 2000-10-01 blabla23 B
Arun answer
# =================
# = Arun function =
# =================
# https://stackoverflow.com/a/23484292/2343633
# Deviations from desired format:
# 1) Column order (COMP first, ARTICLE does not come after DATEm1)
testAR <- function(df=df0, event=event0){
dt1 = as.data.table(df)
dt2 = as.data.table(event)
key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)
idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
idx1[idx1 == 0L] = NA
dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
DATEp1 = dt1$DATE[idx2]
)]
dt2
}
Arun output:
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000
Benchmark
edit – note that this is the original benchmark (original code, original OP data set)
# =============
# = Benchmark =
# =============
microbenchmark(testAR(), testDA(), testRB(), testGG())
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.220278 3.414430 3.509251 3.626438 7.209494 100
# testDA() 4.273542 4.471227 4.569370 4.752857 6.460922 100
# testRB() 5.704559 5.981680 6.135946 6.457392 14.309858 100
# testGG() 22.337065 23.064494 23.964581 24.622467 50.934712 100
EDIT: Benchmark with larger data set
Note that I drop testGG() from this benchmark because it was far slower (I did some tests on a couple of intermediate data sets, and testGG() scaled worse than the other 3 approaches).
# ========
# = Data =
# ========
mos <- c("01","02","03","06","07","09","10", "01", "02", "04", "06", "07", "09", "10")
yrs <- 1920:2020
DATE <- paste(mos, "01", rep(yrs, each=length(mos)), sep=".")
RET <- rep(c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12), length(yrs))
COMP <- rep(c("A","A","A","A","A","A","A","B","B","B","B","B","B","B"), length(yrs))
df0 <- data.frame(DATE, RET, COMP)
mos2 <- c("02","03","06","09","06","07","09")
DATE <- paste(mos2, "01", rep(yrs, each=length(mos2)), sep=".")
ARTICLE <- rep(c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23"), length(yrs))
COMP <- rep(c("A","A","A","A","B","B","B"), length(yrs))
event0 <- data.frame(DATE, ARTICLE, COMP)
edit – original benchmarks for large dataset:
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.458217 3.696698 3.934349 4.697033 6.584214 100
# testDA() 143.180409 148.916461 151.776002 155.219515 237.524369 100
# testRB() 7.279168 7.636102 8.073778 8.828537 11.143111 100
edit – benchmark for large dataset after DA optimization #1:
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.198266 3.440739 3.605723 3.788199 22.52867 100
# testDA() 56.290346 59.528819 60.821921 64.580825 80.99480 100
# testRB() 6.763570 7.200741 7.400343 7.748849 20.97527 100
edit – benchmark for large data set after DA optimization #2:
NOTE – warning resulting from update #2 to testDA()
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.423508 6.055584 6.246517 6.333444 7.653360 100
# testDA() 2.665558 3.961070 4.062354 4.139571 8.427439 100
# testRB() 6.421328 6.669137 6.877517 6.966977 8.271469 100
# There were 50 or more warnings (use warnings() to see the first 50)
# > warnings()[1]
# Warning message:
# In `[.data.table`(dt2, , `:=`(DATEm1 = dt1$DATE[idx1], ... :
# Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.
Memory and time profiling on large data set, 50 iterations each
Profiling code
Rprof("testAR.out", memory.profiling=TRUE)
for(i in 1:50){
arAns <- testAR()
}
Rprof(NULL)
Rprof("testDA.out", memory.profiling=TRUE)
for(i in 1:50){
daAns <- testDA()
}
Rprof(NULL)
Rprof("testRB.out", memory.profiling=TRUE)
for(i in 1:50){
rbAns <- testRB()
}
Rprof(NULL)
testAR() profile results
# > summaryRprof("testAR.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "[[" 0.02 10 0.06 30 8.3
# "head" 0.02 10 0.04 20 12.1
# "nrow" 0.02 10 0.04 20 10.6
# ".Call" 0.02 10 0.02 10 8.2
# ".row_names_info" 0.02 10 0.02 10 8.4
# "<Anonymous>" 0.02 10 0.02 10 8.3
# "key" 0.02 10 0.02 10 0.0
# "levels.default" 0.02 10 0.02 10 0.0
# "match" 0.02 10 0.02 10 0.0
# "stopifnot" 0.02 10 0.02 10 4.2
testDA() profile results
# > summaryRprof("testDA.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "match" 2.04 26.56 2.34 30.47 94.2
# "[.data.frame" 1.78 23.18 6.50 84.64 295.3
# "NextMethod" 0.80 10.42 0.80 10.42 33.9
# "strptime" 0.42 5.47 0.46 5.99 25.9
# "[" 0.34 4.43 7.14 92.97 335.9
# "[.Date" 0.34 4.43 1.14 14.84 49.8
# "names" 0.34 4.43 0.34 4.43 17.9
# "%in%" 0.28 3.65 1.44 18.75 50.3
# "dim" 0.28 3.65 0.30 3.91 13.9
# "order" 0.16 2.08 0.18 2.34 1.7
# "$" 0.16 2.08 0.16 2.08 7.0
# ".Call" 0.14 1.82 6.76 88.02 308.4
# "length" 0.14 1.82 0.14 1.82 4.6
# "sys.call" 0.14 1.82 0.14 1.82 5.6
# "<Anonymous>" 0.04 0.52 0.04 0.52 9.5
# "as.Date.POSIXlt" 0.04 0.52 0.04 0.52 3.4
# "getwd" 0.04 0.52 0.04 0.52 9.5
# "do.call" 0.02 0.26 0.18 2.34 1.7
# "assign" 0.02 0.26 0.04 0.52 0.1
# ".subset2" 0.02 0.26 0.02 0.26 6.1
# "all" 0.02 0.26 0.02 0.26 0.0
# "file.info" 0.02 0.26 0.02 0.26 0.0
# "is.data.table" 0.02 0.26 0.02 0.26 0.0
# "lockBinding" 0.02 0.26 0.02 0.26 0.1
# "parent.frame" 0.02 0.26 0.02 0.26 0.0
# "pmatch" 0.02 0.26 0.02 0.26 0.0
# "which" 0.02 0.26 0.02 0.26 6.5
testRB() profile results
# > summaryRprof("testRB.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "sort.list" 0.04 9.52 0.06 14.29 21.5
# "length" 0.04 9.52 0.04 9.52 0.0
# "pmatch" 0.04 9.52 0.04 9.52 13.9
# "[.data.table" 0.02 4.76 0.42 100.00 71.8
# ".Call" 0.02 4.76 0.12 28.57 39.6
# "split.default" 0.02 4.76 0.10 23.81 32.9
# "alloc.col" 0.02 4.76 0.08 19.05 13.3
# "[[" 0.02 4.76 0.04 9.52 6.9
# "cedta" 0.02 4.76 0.04 9.52 0.0
# "lapply" 0.02 4.76 0.04 9.52 0.0
# "[[.data.frame" 0.02 4.76 0.02 4.76 6.9
# "as.character" 0.02 4.76 0.02 4.76 6.0
# "as.name" 0.02 4.76 0.02 4.76 5.3
# "attr" 0.02 4.76 0.02 4.76 0.0
# "exists" 0.02 4.76 0.02 4.76 0.0
# "FUN" 0.02 4.76 0.02 4.76 0.0
# "intersect" 0.02 4.76 0.02 4.76 6.5
# "is.data.table" 0.02 4.76 0.02 4.76 0.0
Conclusion
As far as I can tell, Arun's answer is the fastest and most memory efficient. rbatt's answer scales better with data set size than DA's answer does; my initial guess was that approaches using POSIX or Date classes might not scale well, but I'm unsure whether this hunch is supported by the profiling results. If someone thinks it would be helpful, I could provide the full profile results instead of just the $by.self portion.
Also worth noting is that time spent and memory used were positively correlated among approaches: the fastest approaches also used the least memory.
Here's another approach using data.table:
First, we convert df and event to data.tables. Here I'll use as.data.table(.), but you can use setDT if you don't want to make a copy. That is, by doing setDT(df), df will be modified by reference into a data.table.
require(data.table) ## >= 1.9.2
dt1 = as.data.table(df)
dt2 = as.data.table(event)
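If you go the by-reference route mentioned above, the equivalent is simply (only do this if you no longer need df and event as plain data.frames):
setDT(df)     # df is now a data.table; no copy is made
setDT(event)  # likewise for event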
Then we'll prepare the data as follows:
key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)
The setcolorder rearranges the columns of your data.tables by reference. setkeyv sorts the data.table by the given columns in ascending order and marks the key columns for dt1.
The column reordering is essential here because we don't set a key on dt2 (that would sort dt2, which may be undesirable for you). And since no key is set on dt2, data.table takes the first 'n' (= 2 here) columns from dt2 to match with the key columns of dt1.
Note: A join x[i] using data.table absolutely requires the key of x to be set. Here x = dt1. Setting a key on i is optional, depending on whether you wish the order to be preserved or not.
Now, we perform two joins and get the corresponding matching indices:
idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
The first join gets, for each row of dt2, the first matching position in dt1. Similarly, the second join gets, for each row of dt2, the last matching position in dt1. We add -1 and +1 to get the previous and next indices, respectively.
Take care of one special case:
idx1[idx1 == 0L] = NA
When the matching index is 1, subtracting 1 results in 0. Because of R's behaviour with a zero index, we have to replace it with NA explicitly here.
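A one-line illustration of why this replacement is needed (not from the original answer): a zero index silently drops the element, whereas an NA index returns NA, which is what we want for "no previous date":
x <- c("a", "b", "c")
x[0]            # character(0): the element is silently dropped
x[NA_integer_]  # NA: propagates as a missing previous date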
Now, we can just subset those dates and add it to dt2 by reference as follows:
dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
DATEp1 = dt1$DATE[idx2]
)]
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000
This can be done with a triple join in sqldf:
library(sqldf)
# ensure that dates sort correctly by converting to yyyy-mm-dd
df2 <- transform(df, DATE = format(as.Date(DATE, "%m.%d.%Y")))
event2 <- transform(event, DATE = format(as.Date(DATE, "%m.%d.%Y")))
result <- sqldf(c("create index i on df2(COMP, DATE)",
"select
event.DATE,
max(A.DATE) DATEm1,
min(B.DATE) DATEp1,
event.ARTICLE,
event.COMP
from event2 event, main.df2 A, main.df2 B
on event.COMP = A.COMP and event.COMP = B.COMP
and event.DATE > A.DATE and event.DATE < B.DATE
group by event.DATE, event.COMP
order by event.COMP, event.DATE"))
giving:
> result
DATE DATEm1 DATEp1 ARTICLE COMP
1 2000-02-01 2000-01-01 2000-03-01 blabla11 A
2 2000-03-01 2000-02-01 2000-06-01 blabla12 A
3 2000-06-01 2000-03-01 2000-07-01 blabla13 A
4 2000-09-01 2000-07-01 2000-10-01 blabla14 A
5 2000-06-01 2000-04-01 2000-07-01 blabla21 B
6 2000-07-01 2000-06-01 2000-09-01 blabla22 B
7 2000-09-01 2000-07-01 2000-10-01 blabla23 B
library(data.table) # loading data.table pckg
tempdf <- data.table(df, key = c("COMP", "DATE")) # Sorting df
DATEVEC <- tempdf$DATE # Creating DATE vector to choose from
Key <- paste(DATEVEC, tempdf$COMP) # Creating Key vector for matching
setDT(event)[order(COMP, DATE), `:=`(
DATEm1 = DATEVEC[match(paste(DATE, COMP), Key) - 1],
DATEp1 = DATEVEC[match(paste(DATE, COMP), Key) + 1]
)]
event
# DATE ARTICLE COMP DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 04.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 07.01.2000 10.01.2000
Another way
tempdf <- data.table(df, key = c("COMP", "DATE")) # Sorting df
DATEVEC <- tempdf$DATE # Creating DATE vector to choose from
Keydf <- paste(DATEVEC, tempdf$COMP) # Creating Key vector for matching
event <- data.table(event, key = c("COMP", "DATE")) # Sorting event
event$Keyev <- paste(event$DATE, event$COMP) # Creating Key vector for matching
event[, `:=`(
DATEm1 = DATEVEC[match(Keyev, Keydf) - 1],
DATEp1 = DATEVEC[match(Keyev, Keydf) + 1]
)]
event
# DATE ARTICLE COMP Keyev DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 02.01.2000 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 03.01.2000 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 06.01.2000 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 09.01.2000 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 06.01.2000 B 04.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 07.01.2000 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 09.01.2000 B 07.01.2000 10.01.2000
Third way
setDT(df) # Making df adata.table
setkey(df, COMP, DATE) # Sorting df
DATEVEC <- df$DATE # Creating DATE vector to choose from
Keydf <- paste(DATEVEC, df$COMP) # Creating Key vector for matching
setDT(event) # Making event a data.table
setkey(event, COMP, DATE) # Sorting event
event$Keyev <- paste(event$DATE, event$COMP) # Creating Key vector for matching
event[, `:=`(
DATEm1 = DATEVEC[match(Keyev, Keydf) - 1],
DATEp1 = DATEVEC[match(Keyev, Keydf) + 1]
)]
event
# DATE ARTICLE COMP Keyev DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 02.01.2000 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 03.01.2000 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 06.01.2000 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 09.01.2000 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 06.01.2000 B 04.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 07.01.2000 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 09.01.2000 B 07.01.2000 10.01.2000

How to efficiently use Rprof in R?

I would like to know if it is possible to get a profile from R-Code in a way that is similar to matlab's Profiler. That is, to get to know which line numbers are the one's that are especially slow.
What I acchieved so far is somehow not satisfactory. I used Rprof to make me a profile file. Using summaryRprof I get something like the following:
$by.self
self.time self.pct total.time total.pct
[.data.frame 0.72 10.1 1.84 25.8
inherits 0.50 7.0 1.10 15.4
data.frame 0.48 6.7 4.86 68.3
unique.default 0.44 6.2 0.48 6.7
deparse 0.36 5.1 1.18 16.6
rbind 0.30 4.2 2.22 31.2
match 0.28 3.9 1.38 19.4
[<-.factor 0.28 3.9 0.56 7.9
levels 0.26 3.7 0.34 4.8
NextMethod 0.22 3.1 0.82 11.5
...
and
$by.total
total.time total.pct self.time self.pct
data.frame 4.86 68.3 0.48 6.7
rbind 2.22 31.2 0.30 4.2
do.call 2.22 31.2 0.00 0.0
[ 1.98 27.8 0.16 2.2
[.data.frame 1.84 25.8 0.72 10.1
match 1.38 19.4 0.28 3.9
%in% 1.26 17.7 0.14 2.0
is.factor 1.20 16.9 0.10 1.4
deparse 1.18 16.6 0.36 5.1
...
To be honest, from this output I don't get where my bottlenecks are because (a) I use data.frame pretty often and (b) I never use e.g., deparse. Furthermore, what is [?
So I tried Hadley Wickham's profr, but it was not any more useful considering the following graph:
Is there a more convenient way to see which line numbers and particular function calls are slow?
Or, is there some literature that I should consult?
Any hints appreciated.
EDIT 1:
Based on Hadley's comment I will paste the code of my script below, plus the base graph version of the plot. But note that my question is not related to this specific script. It is just a random script that I recently wrote. I am looking for a general way to find bottlenecks and speed up R code.
The data (x) looks like this:
type word response N Classification classN
Abstract ANGER bitter 1 3a 3a
Abstract ANGER control 1 1a 1a
Abstract ANGER father 1 3a 3a
Abstract ANGER flushed 1 3a 3a
Abstract ANGER fury 1 1c 1c
Abstract ANGER hat 1 3a 3a
Abstract ANGER help 1 3a 3a
Abstract ANGER mad 13 3a 3a
Abstract ANGER management 2 1a 1a
... until row 1700
The script (with short explanations) is this:
Rprof("profile1.out")
# A new dataset is produced with each line of x repeated x$N times
y <- vector('list',length(x[,1]))
for (i in 1:length(x[,1])) {
y[[i]] <- data.frame(rep(x[i,1],x[i,"N"]),rep(x[i,2],x[i,"N"]),rep(x[i,3],x[i,"N"]),rep(x[i,4],x[i,"N"]),rep(x[i,5],x[i,"N"]),rep(x[i,6],x[i,"N"]))
}
all <- do.call('rbind',y)
colnames(all) <- colnames(x)
# create a dataframe out of a word x class table
table_all <- table(all$word,all$classN)
dataf.all <- as.data.frame(table_all[,1:length(table_all[1,])])
dataf.all$words <- as.factor(rownames(dataf.all))
dataf.all$type <- "no"
# get type of the word.
words <- levels(dataf.all$words)
for (i in 1:length(words)) {
dataf.all$type[i] <- as.character(all[pmatch(words[i],all$word),"type"])
}
dataf.all$type <- as.factor(dataf.all$type)
dataf.all$typeN <- as.numeric(dataf.all$type)
# aggregate response categories
dataf.all$c1 <- apply(dataf.all[,c("1a","1b","1c","1d","1e","1f")],1,sum)
dataf.all$c2 <- apply(dataf.all[,c("2a","2b","2c")],1,sum)
dataf.all$c3 <- apply(dataf.all[,c("3a","3b")],1,sum)
Rprof(NULL)
library(profr)
ggplot.profr(parse_rprof("profile1.out"))
Final data looks like this:
1a 1b 1c 1d 1e 1f 2a 2b 2c 3a 3b pa words type typeN c1 c2 c3 pa
3 0 8 0 0 0 0 0 0 24 0 0 ANGER Abstract 1 11 0 24 0
6 0 4 0 1 0 0 11 0 13 0 0 ANXIETY Abstract 1 11 11 13 0
2 11 1 0 0 0 0 4 0 17 0 0 ATTITUDE Abstract 1 14 4 17 0
9 18 0 0 0 0 0 0 0 0 8 0 BARREL Concrete 2 27 0 8 0
0 1 18 0 0 0 0 4 0 12 0 0 BELIEF Abstract 1 19 4 12 0
The base graph plot:
Running the script today also changed the ggplot2 graph a little (basically only the labels), see here.
Alert readers of yesterday's breaking news (R 3.0.0 is finally out) may have noticed something interesting that is directly relevant to this question:
Profiling via Rprof() now optionally records information at the statement level, not just the function level.
And indeed, this new feature answers my question and I will show how.
Let's say we want to compare whether vectorizing and pre-allocating are really better than good old for-loops and incremental building of data for calculating a summary statistic such as the mean. The (relatively stupid) code is the following:
# create big data frame:
n <- 1000
x <- data.frame(group = sample(letters[1:4], n, replace=TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n))
# reasonable operations:
marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean)
# unreasonable operations:
marginal.means.2 <- marginal.means.1[NULL,]
row.counter <- 1
for (condition in levels(x$condition)) {
for (group in levels(x$group)) {
tmp.value <- 0
tmp.length <- 0
for (c in 1:nrow(x)) {
if ((x[c,"group"] == group) & (x[c,"condition"] == condition)) {
tmp.value <- tmp.value + x[c,"data"]
tmp.length <- tmp.length + 1
}
}
marginal.means.2[row.counter,"group"] <- group
marginal.means.2[row.counter,"condition"] <- condition
marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length
row.counter <- row.counter + 1
}
}
# does it produce the same results?
all.equal(marginal.means.1, marginal.means.2)
To use this code with Rprof, we need to parse it. That is, it needs to be saved in a file and then called from there. Hence, I uploaded it to pastebin, but it works exactly the same with local files.
Now, we
1) simply create a profile file and indicate that we want to save the line numbers,
2) source the code with the incredible combination eval(parse(..., keep.source = TRUE)) (seemingly the infamous fortune(106) does not apply here, as I haven't found another way),
3) stop the profiling and indicate that we want the output based on the line numbers.
The code is:
Rprof("profile1.out", line.profiling=TRUE)
eval(parse(file = "http://pastebin.com/download.php?i=KjdkSVZq", keep.source=TRUE))
Rprof(NULL)
summaryRprof("profile1.out", lines = "show")
Which gives:
$by.self
self.time self.pct total.time total.pct
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
$by.total
total.time total.pct self.time self.pct
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
$by.line
self.time self.pct total.time total.pct
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
$sample.interval
[1] 0.02
$sampling.time
[1] 12.54
Checking the source code tells us that the problematic line (#17) is indeed the stupid if-statement in the for-loop, compared with basically no time spent calculating the same thing using vectorized code (line #6).
I haven't tried it with any graphical output, but I am already very impressed by what I got so far.
Update: This function has been re-written to deal with line numbers. It's on github here.
I wrote this function to parse the file from Rprof and output a table of somewhat clearer results than summaryRprof. It displays the full stack of functions (and line numbers if line.profiling=TRUE), and their relative contribution to run time:
proftable <- function(file, lines=10) {
require(plyr) # needed for the aaply, llply, and ldply calls below
interval <- as.numeric(strsplit(readLines(file, 1), "=")[[1L]][2L])/1e+06
profdata <- read.table(file, header=FALSE, sep=" ", comment.char = "",
colClasses="character", skip=1, fill=TRUE,
na.strings="")
filelines <- grep("#File", profdata[,1])
files <- aaply(as.matrix(profdata[filelines,]), 1, function(x) {
paste(na.omit(x), collapse = " ") })
profdata <- profdata[-filelines,]
total.time <- interval*nrow(profdata)
profdata <- as.matrix(profdata[,ncol(profdata):1])
profdata <- aaply(profdata, 1, function(x) {
c(x[(sum(is.na(x))+1):length(x)],
x[seq(from=1,by=1,length=sum(is.na(x)))])
})
stringtable <- table(apply(profdata, 1, paste, collapse=" "))
uniquerows <- strsplit(names(stringtable), " ")
uniquerows <- llply(uniquerows, function(x) replace(x, which(x=="NA"), NA))
dimnames(stringtable) <- NULL
stacktable <- ldply(uniquerows, function(x) x)
stringtable <- stringtable/sum(stringtable)*100
stacktable <- data.frame(PctTime=stringtable[], stacktable)
stacktable <- stacktable[order(stringtable, decreasing=TRUE),]
rownames(stacktable) <- NULL
stacktable <- head(stacktable, lines)
na.cols <- which(sapply(stacktable, function(x) all(is.na(x))))
stacktable <- stacktable[-na.cols]
parent.cols <- which(sapply(stacktable, function(x) length(unique(x)))==1)
parent.call <- paste0(paste(stacktable[1,parent.cols], collapse = " > ")," >")
stacktable <- stacktable[,-parent.cols]
calls <- aaply(as.matrix(stacktable[2:ncol(stacktable)]), 1, function(x) {
paste(na.omit(x), collapse= " > ")
})
stacktable <- data.frame(PctTime=stacktable$PctTime, Call=calls)
frac <- sum(stacktable$PctTime)
attr(stacktable, "total.time") <- total.time
attr(stacktable, "parent.call") <- parent.call
attr(stacktable, "files") <- files
attr(stacktable, "total.pct.time") <- frac
cat("\n")
print(stacktable, row.names=FALSE, right=FALSE, digits=3)
cat("\n")
cat(paste(files, collapse="\n"))
cat("\n")
cat(paste("\nParent Call:", parent.call))
cat(paste("\n\nTotal Time:", total.time, "seconds\n"))
cat(paste0("Percent of run time represented: ", format(frac, digits=3)), "%")
invisible(stacktable)
}
Running this on Henrik's example file, I get this:
> Rprof("profile1.out", line.profiling=TRUE)
> source("http://pastebin.com/download.php?i=KjdkSVZq")
> Rprof(NULL)
> proftable("profile1.out", lines=10)
PctTime Call
20.47 1#17 > [ > 1#17 > [.data.frame
9.73 1#17 > [ > 1#17 > [.data.frame > [ > [.factor
8.72 1#17 > [ > 1#17 > [.data.frame > [ > [.factor > NextMethod
8.39 == > Ops.factor
5.37 ==
5.03 == > Ops.factor > noNA.levels > levels
4.70 == > Ops.factor > NextMethod
4.03 1#17 > [ > 1#17 > [.data.frame > [ > [.factor > levels
4.03 1#17 > [ > 1#17 > [.data.frame > dim
3.36 1#17 > [ > 1#17 > [.data.frame > length
#File 1: http://pastebin.com/download.php?i=KjdkSVZq
Parent Call: source > withVisible > eval > eval >
Total Time: 5.96 seconds
Percent of run time represented: 73.8 %
Note that the "Parent Call" applies to all the stacks represented on the table. This makes is useful when your IDE or whatever calls your code wraps it in a bunch of functions.
I currently have R uninstalled here, but in SPlus you can interrupt the execution with the Escape key, and then do traceback(), which will show you the call stack. That should enable you to use this handy method.
Here are some reasons why tools built on the same concepts as gprof are not very good at locating performance problems.
A different solution comes from a different question: how to effectively use library(profr) in R:
For example:
install.packages("profr")
devtools::install_github("alexwhitworth/imputation")
x <- matrix(rnorm(1000), 100)
x[x>1] <- NA
library(imputation)
library(profr)
a <- profr(kNN_impute(x, k=5, q=2), interval= 0.005)
It doesn't seem (to me at least) like the plots are at all helpful here (e.g. plot(a)). But the data structure itself does seem to suggest a solution:
R> head(a, 10)
level g_id t_id f start end n leaf time source
9 1 1 1 kNN_impute 0.005 0.190 1 FALSE 0.185 imputation
10 2 1 1 var_tests 0.005 0.010 1 FALSE 0.005 <NA>
11 2 2 1 apply 0.010 0.190 1 FALSE 0.180 base
12 3 1 1 var.test 0.005 0.010 1 FALSE 0.005 stats
13 3 2 1 FUN 0.010 0.110 1 FALSE 0.100 <NA>
14 3 2 2 FUN 0.115 0.190 1 FALSE 0.075 <NA>
15 4 1 1 var.test.default 0.005 0.010 1 FALSE 0.005 <NA>
16 4 2 1 sapply 0.010 0.040 1 FALSE 0.030 base
17 4 3 1 dist_q.matrix 0.040 0.045 1 FALSE 0.005 imputation
18 4 4 1 sapply 0.045 0.075 1 FALSE 0.030 base
Single iteration solution:
That is, the data structure suggests using tapply to summarize the data. This can be done quite simply for a single run of profr::profr:
t <- tapply(a$time, paste(a$source, a$f, sep= "::"), sum)
t[order(t)] # time / function
R> round(t[order(t)] / sum(t), 4) # percentage of total time / function
base::! base::%in% base::| base::anyDuplicated
0.0015 0.0015 0.0015 0.0015
base::c base::deparse base::get base::match
0.0015 0.0015 0.0015 0.0015
base::mget base::min base::t methods::el
0.0015 0.0015 0.0015 0.0015
methods::getGeneric NA::.findMethodInTable NA::.getGeneric NA::.getGenericFromCache
0.0015 0.0015 0.0015 0.0015
NA::.getGenericFromCacheTable NA::.identC NA::.newSignature NA::.quickCoerceSelect
0.0015 0.0015 0.0015 0.0015
NA::.sigLabel NA::var.test.default NA::var_tests stats::var.test
0.0015 0.0015 0.0015 0.0015
base::paste methods::as<- NA::.findInheritedMethods NA::.getClassFromCache
0.0030 0.0030 0.0030 0.0030
NA::doTryCatch NA::tryCatchList NA::tryCatchOne base::crossprod
0.0030 0.0030 0.0030 0.0045
base::try base::tryCatch methods::getClassDef methods::possibleExtends
0.0045 0.0045 0.0045 0.0045
methods::loadMethod methods::is imputation::dist_q.matrix methods::validObject
0.0075 0.0090 0.0120 0.0136
NA::.findNextFromTable methods::addNextMethod NA::.nextMethod base::lapply
0.0166 0.0346 0.0361 0.0392
base::sapply imputation::impute_fn_knn methods::new imputation::kNN_impute
0.0392 0.0392 0.0437 0.0557
methods::callNextMethod kernlab::as.kernelMatrix base::apply kernlab::kernelMatrix
0.0572 0.0633 0.0663 0.0753
methods::initialize NA::FUN base::standardGeneric
0.0798 0.0994 0.1325
From this, I can see that the biggest time users are kernlab::kernelMatrix and the overhead from R for S4 classes and generics.
Preferred:
I note that, given the stochastic nature of the sampling process, I prefer to use averages to get a more robust picture of the time profile:
prof_list <- replicate(100, profr(kNN_impute(x, k=5, q=2),
interval= 0.005), simplify = FALSE)
fun_timing <- vector("list", length= 100)
for (i in 1:100) {
fun_timing[[i]] <- tapply(prof_list[[i]]$time, paste(prof_list[[i]]$source, prof_list[[i]]$f, sep= "::"), sum)
}
# Here is where the stochastic nature of the profiler complicates things.
# Because of randomness, each replication may have slightly different
# functions called during profiling
sapply(fun_timing, function(x) {length(names(x))})
# we can also see some clearly odd replications (at least in my attempt)
> sapply(fun_timing, sum)
[1] 2.820 5.605 2.325 2.895 3.195 2.695 2.495 2.315 2.005 2.475 4.110 2.705 2.180 2.760
[15] 3130.240 3.435 7.675 7.155 5.205 3.760 7.335 7.545 8.155 8.175 6.965 5.820 8.760 7.345
[29] 9.815 7.965 6.370 4.900 5.720 4.530 6.220 3.345 4.055 3.170 3.725 7.780 7.090 7.670
[43] 5.400 7.635 7.125 6.905 6.545 6.855 7.185 7.610 2.965 3.865 3.875 3.480 7.770 7.055
[57] 8.870 8.940 10.130 9.730 5.205 5.645 3.045 2.535 2.675 2.695 2.730 2.555 2.675 2.270
[71] 9.515 4.700 7.270 2.950 6.630 8.370 9.070 7.950 3.250 4.405 3.475 6.420 2948.265 3.470
[85] 3.320 3.640 2.855 3.315 2.560 2.355 2.300 2.685 2.855 2.540 2.480 2.570 3.345 2.145
[99] 2.620 3.650
Removing the unusual replications and converting to data.frames:
fun_timing <- fun_timing[-c(15,83)]
fun_timing2 <- lapply(fun_timing, function(x) {
ret <- data.frame(fun= names(x), time= x)
dimnames(ret)[[1]] <- 1:nrow(ret)
return(ret)
})
Merge replications (almost certainly could be faster) and examine results:
# function for merging DF's in a list
merge_recursive <- function(list, ...) {
n <- length(list)
df <- data.frame(list[[1]])
for (i in 2:n) {
df <- merge(df, list[[i]], ... = ...)
}
return(df)
}
# merge
fun_time <- merge_recursive(fun_timing2, by= "fun", all= FALSE)
# do some munging
fun_time2 <- data.frame(fun=fun_time[,1], avg_time=apply(fun_time[,-1], 1, mean, na.rm=T))
fun_time2$avg_pct <- fun_time2$avg_time / sum(fun_time2$avg_time)
fun_time2 <- fun_time2[order(fun_time2$avg_time, decreasing=TRUE),]
# examine results
R> head(fun_time2, 15)
fun avg_time avg_pct
4 base::standardGeneric 0.6760714 0.14745123
20 NA::FUN 0.4666327 0.10177262
12 methods::initialize 0.4488776 0.09790023
9 kernlab::kernelMatrix 0.3522449 0.07682464
8 kernlab::as.kernelMatrix 0.3215816 0.07013698
11 methods::callNextMethod 0.2986224 0.06512958
1 base::apply 0.2893367 0.06310437
7 imputation::kNN_impute 0.2433163 0.05306731
14 methods::new 0.2309184 0.05036331
10 methods::addNextMethod 0.2012245 0.04388708
3 base::sapply 0.1875000 0.04089377
2 base::lapply 0.1865306 0.04068234
6 imputation::impute_fn_knn 0.1827551 0.03985890
19 NA::.nextMethod 0.1790816 0.03905772
18 NA::.findNextFromTable 0.1003571 0.02188790
Results
From the results, a similar but more robust picture emerges as with the single run: there is a lot of overhead from R, and library(kernlab) is slowing me down. Of note, since kernlab is implemented in S4, the R overhead is related, because S4 classes are substantially slower than S3 classes.
I'd also note that my personal opinion is that a cleaned-up version of this might be a useful pull request as a summary method for profr, although I'd be interested to see others' suggestions!
