R fill vector efficiently

I have a fairly big vector (>500,000 in length). It contains a bunch of NA interspersed with 1 and it is always guaranteed that it begins with 1.
I would like to replace some of the NA in v1 with 1, based on a comparison operation on contiguous indices of another vector v2 (of same length as v1).
Is there an efficient way of doing this in vectorized notation so that looping is done in a low-level implementation? Maybe using ifelse?
Reproducible example below:
v1<-c(1,NA,NA,NA,1,NA,NA,NA,NA,NA,1,NA,NA,1,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1)
v2<-c(10,10,10,9,10,9,9,9,9,9,10,10,10,11,8,12,12,12,12,12,12,12,12,12,12,13)
# goal is to fill through v1 in such a way that whenever
# v1[i] == NA and v1[i-1] == 1 and v2[i] == v2[i-1], then v1[i] == 1
MM<-data.frame(v1,v2)
for (i in 2:length(v1)){
# conditions: v1[i-1] == 1; v1[i]==NA; v2[i]==v2[i-1]
if (!is.na(v1[i-1]) && is.na(v1[i]) && v2[i]==v2[i-1]){
v1[i]<-1
}
}
MM$v1_altered<-v1
MM

There's probably a faster solution, but this is the best I could come up with in a couple of minutes. My solution is slower than the OP's for small vectors, but increasingly faster for larger vectors.
library(zoo) # for na.locf
library(rbenchmark)
v1<-c(1,NA,NA,NA,1,NA,NA,NA,NA,NA,1,NA,NA,1,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1)
v2<-c(10,10,10,9,10,9,9,9,9,9,10,10,10,11,8,12,12,12,12,12,12,12,12,12,12,13)
V1 <- rep(v1, each=20000) # 520,000 observations
V2 <- rep(v2, each=20000) # 520,000 observations
fun1 <- function(v1,v2) {
for (i in 2:length(v1)){
if (!is.na(v1[i-1]) && is.na(v1[i]) && v2[i]==v2[i-1]){
v1[i]<-1
}
}
v1
}
fun2 <- function(v1,v2) {
# create groups in which we need to assess missing values
d <- cumsum(as.logical(c(0,diff(v2))))
# for each group, carry the first obs forward
ave(v1, d, FUN=function(x) na.locf(x, na.rm=FALSE))
}
all.equal(fun1(V1,V2), fun2(V1,V2))
# [1] TRUE
benchmark(fun1(V1,V2), fun2(V1,V2))
# test replications elapsed relative user.self sys.self
# 1 fun1(V1, V2) 100 194.29 6.113593 192.72 0.17
# 2 fun2(V1, V2) 100 31.78 1.000000 30.74 0.95
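For comparison, the same fill can be sketched in base R without zoo and without an explicit loop. This is only a sketch under the question's stated guarantees (v1 starts with 1 and contains only 1 and NA); fill_runs is a hypothetical name, not something from the answers here. It labels runs of equal v2 values, finds the position of the most recent non-NA entry of v1 with cummax, and fills an NA only when that entry lies in the same run.

```r
# Hypothetical helper (not from the answers above): vectorised fill in base R.
# Assumes v1 contains only 1 and NA and that v1[1] is 1, as stated in the question.
fill_runs <- function(v1, v2) {
  idx  <- seq_along(v1)
  # run id: increases by one each time v2 changes value
  run  <- cumsum(c(TRUE, diff(v2) != 0))
  # position of the most recent non-NA element of v1 at or before each index
  last <- cummax(ifelse(is.na(v1), 0L, idx))
  # fill an NA only if the most recent non-NA value sits in the same run of v2
  fill <- is.na(v1) & last > 0 & run[pmax(last, 1L)] == run
  v1[fill] <- 1
  v1
}

v1 <- c(1,NA,NA,NA,1,NA,NA,NA,NA,NA,1,NA,NA,1,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1)
v2 <- c(10,10,10,9,10,9,9,9,9,9,10,10,10,11,8,12,12,12,12,12,12,12,12,12,12,13)
filled <- fill_runs(v1, v2)
```

Whether this beats fun2 on the 520,000-element vectors would need to be benchmarked; it is not timed here.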

A vectorised solution would look like:
v1[-1] <- ifelse(diff(v2), 0, v1[-length(v1)])
But the above won't work, and I don't think you can avoid an explicit loop since, if I understand correctly, you want to propagate new values. So, how about:
cmp <- diff(v2)
for (i in 2:length(v1)){
v1[i] <- if(cmp[i-1]) 0 else v1[i-1]
}

It may not be faster, but v1[i] <- v1[i-1] * (cmp[i-1] == 0) avoids all explicit "if" calls. I can't test it right now, but you might try @James's solution vs. looping over this form for, say, a vector of length 1e4 to see which executes faster.

The function fun1 can be sped up considerably by using the compiler package.
Using the code provided by Joshua and extending it with the compiler package:
library(zoo) # for na.locf
library(rbenchmark)
library(compiler)
v1 <- c(1,NA,NA,NA,1,NA,NA,NA,NA,NA,1,NA,NA,1,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1)
v2 <- c(10,10,10,9,10,9,9,9,9,9,10,10,10,11,8,12,12,12,12,12,12,12,12,12,12,13)
fun1 <- function(v1,v2) {
for (i in 2:length(v1)){
if (!is.na(v1[i-1]) && is.na(v1[i]) && v2[i]==v2[i-1]){
v1[i]<-1
}
}
v1
}
fun2 <- function(v1,v2) {
# create groups in which we need to assess missing values
d <- cumsum(as.logical(c(0,diff(v2))))
# for each group, carry the first obs forward
ave(v1, d, FUN=function(x) na.locf(x, na.rm=FALSE))
}
fun3 <- cmpfun(fun1)
fun1(v1,v2)
fun2(v1,v2)
all.equal(fun1(v1,v2), fun2(v1,v2))
all.equal(fun1(v1,v2), fun3(v1,v2))
Nrep <- 1000
V1 <- rep(v1, each=Nrep)
V2 <- rep(v2, each=Nrep)
all.equal(fun1(V1,V2), fun2(V1,V2))
all.equal(fun1(V1,V2), fun3(V1,V2))
benchmark(fun1(V1,V2), fun2(V1,V2), fun3(V1,V2))
we get the following result
benchmark(fun1(V1,V2), fun2(V1,V2), fun3(V1,V2))
test replications elapsed relative user.self sys.self user.child
1 fun1(V1, V2) 100 12.252 5.706567 12.190 0.045 0
2 fun2(V1, V2) 100 2.147 1.000000 2.133 0.013 0
3 fun3(V1, V2) 100 3.702 1.724266 3.644 0.023 0
So the compiled fun1 is a lot faster than the original fun1 but still slower than fun2.

Related

How to use with() function in R instead of apply()

I am trying to optimise code that I have written using apply() and similar functions (e.g. lapply()). Unfortunately I do not see much improvement, so while searching I came across the post "apply() is slow - how to make it faster or what are my alternatives?", where one suggestion is to use with() instead of apply(), which is certainly much faster.
What I want to do is to apply a user defined function to every row of a matrix. This function takes as input the data from the row, makes some calculations and returns a vector with the results.
A toy example where I use the apply() function, the with() and a vectorized version:
#Generate a matrix 10x3
prbl1=matrix(runif(30),nrow=10)
prbl2=data.frame(prbl1)
prbl3=prbl2
#function for the apply()
fn1=function(row){
x=row[1]
y=row[2]
z=row[3]
k1=2*x+3*y+4*z
k2=2*x*3*y*4*z
k3=2*x*y+3*x*z
return(c(k1,k2,k3))
}
#function for the with()
fn2=function(x,y,z){
k1=2*x+3*y+4*z
k2=2*x*3*y*4*z
k3=2*x*y+3*x*z
return(c(k1,k2,k3))
}
#Vectorise fn2
fn3=Vectorize(fn2)
#apply the functions:
rslt1=t(apply(prbl1,1,fn1))
rslt2=t(with(prbl2,fn2(X1,X2,X3)))
rslt2=cbind(rslt2[1:10],rslt2[11:20],rslt2[21:30])
rslt3=t(with(prbl3,fn3(X1,X2,X3)))
All three produce the same output, a 10x3 matrix, which is what I want. Nevertheless, notice in rslt2 that I need to bind the results, as the output of using with() is a vector of length 30. I suspected that this is because the function is not vectorised (if I understood this correctly). In rslt3 I am using a vectorised version of fn2, which generates the output in the expected way.
When I compare the performance of the three, I get:
library(rbenchmark)
benchmark(rslt1=t(apply(prbl1,1,fn1)),
rslt2=with(prbl2,fn2(X1,X2,X3)),
rslt3=with(prbl3,fn3(X1,X2,X3)),
replications=1000000)
test replications elapsed relative user.self sys.self user.child sys.child
1 rslt1 1000000 103.51 7.129 102.63 0.02 NA NA
2 rslt2 1000000 14.52 1.000 14.41 0.01 NA NA
3 rslt3 1000000 123.44 8.501 122.41 0.05 NA NA
where with() without vectorisation is definitely faster.
My question: Since rslt2 is the most efficient approach, is there a way that I can use this correctly without the need to bind the results afterwards? It does the job, but I feel it is not efficient coding.
The first and third functions you give are being applied 1 row at a time, so are called 10 times in your example. The second function is taking advantage of the fact that multiplication and addition in R are already vectorised and so using any form of loop or ply function is unnecessary. The function is only called once. If you wanted to use your current code, all you'd need to do is change the c to cbind in fn2.
fn2=function(x,y,z){
k1=2*x+3*y+4*z
k2=2*x*3*y*4*z
k3=2*x*y+3*x*z
return(cbind(k1,k2,k3))
}
All that with does is evaluate the expression it's given in the list, data.frame or environment given. So with(prbl2,fn2(X1,X2,X3)) is entirely equivalent to fn2(prbl2$X1, prbl2$X2, prbl2$X3).
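A small check of both points, as a sketch: the data frame df and the seed are illustrative, while X1, X2 and X3 mirror the column names used in the question. It compares the with() call, the equivalent call using $, and a row-wise apply() for reference.

```r
# Illustrative data; X1..X3 mirror the column names used in the question.
set.seed(1)
df <- data.frame(X1 = runif(10), X2 = runif(10), X3 = runif(10))

# fn2 with cbind(), as suggested above: returns one row per input row
fn2 <- function(x, y, z) {
  k1 <- 2*x + 3*y + 4*z
  k2 <- 2*x * 3*y * 4*z
  k3 <- 2*x*y + 3*x*z
  cbind(k1, k2, k3)
}

via_with   <- with(df, fn2(X1, X2, X3))   # evaluated inside df
via_dollar <- fn2(df$X1, df$X2, df$X3)    # the same call written out

# row-wise version for comparison: fn2 is called once per row
via_apply <- t(apply(as.matrix(df), 1, function(r) fn2(r[1], r[2], r[3])))

identical(via_with, via_dollar)
all.equal(via_with, via_apply, check.attributes = FALSE)
```

The result of the single vectorised call is already a 10x3 matrix, so no binding step is needed afterwards.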
Is this your real function? If it is, then problem solved. If not, then it depends on whether your real function consists entirely of operations and functions that already are vectorised or can be replaced with vectorised equivalents.
For the amended function per the comments:
Single row:
fn1 <- function(row){
x <- row[1]
y <- row[2]
z <- row[3]
k1 <- 2*x+3*y+4*z
k2 <- 2*x*3*y*4*z
k3 <- 2*x*y+3*x*z
if (k1>0 && k2>0 && k3>0){
return(cbind(k1,k2,k3))
} else {
k1 <- 5*x+3*y+4*z
k2 <- 5*x*3*y*4*z
k3 <- 5*x*y+3*x*z
if (k1<0 || k2<0 || k3<0) {
return(cbind(0,0,0))
} else {
return(cbind(k1,k2,k3))
}
}
}
Whole matrix:
fn2 <- function(mat) {
x <- mat[, 1]
y <- mat[, 2]
z <- mat[, 3]
k1 <- 2*x+3*y+4*z
k2 <- 2*x*3*y*4*z
k3 <- 2*x*y+3*x*z
l1 <- 5*x+3*y+4*z
l2 <- 5*x*3*y*4*z
l3 <- 5*x*y+3*x*z
out <- array(0, dim = dim(mat))
useK <- k1 > 0 & k2 > 0 & k3 > 0
useL <- !useK & l1 >= 0 & l2 >= 0 & l3 >= 0
out[useK, ] <- cbind(k1, k2, k3)[useK, ]
out[useL, ] <- cbind(l1, l2, l3)[useL, ]
out
}
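To see that the row-wise and whole-matrix forms agree, here is a compact sketch. row_fn and mat_fn are illustrative restatements of the two functions above (returning plain vectors rather than cbind results), and the test matrix is an assumption chosen so that negative inputs occur.

```r
# Compact restatements of the two functions above (row_fn and mat_fn are
# illustrative names), used only to check that they agree.
row_fn <- function(row) {
  x <- row[1]; y <- row[2]; z <- row[3]
  k <- c(2*x + 3*y + 4*z, 2*x * 3*y * 4*z, 2*x*y + 3*x*z)
  if (all(k > 0)) return(k)
  l <- c(5*x + 3*y + 4*z, 5*x * 3*y * 4*z, 5*x*y + 3*x*z)
  if (any(l < 0)) c(0, 0, 0) else l
}

mat_fn <- function(mat) {
  x <- mat[, 1]; y <- mat[, 2]; z <- mat[, 3]
  k <- cbind(2*x + 3*y + 4*z, 2*x * 3*y * 4*z, 2*x*y + 3*x*z)
  l <- cbind(5*x + 3*y + 4*z, 5*x * 3*y * 4*z, 5*x*y + 3*x*z)
  useK <- rowSums(k > 0) == 3           # all three k values positive
  useL <- !useK & rowSums(l >= 0) == 3  # otherwise use l if none is negative
  out <- array(0, dim = dim(mat))       # rows matching neither stay at zero
  out[useK, ] <- k[useK, ]
  out[useL, ] <- l[useL, ]
  out
}

# Inputs in (-1, 1) so that negative intermediate values occur
set.seed(42)
m <- matrix(runif(300, -1, 1), ncol = 3)
by_row <- t(apply(m, 1, row_fn))
by_mat <- mat_fn(m)
all.equal(by_row, by_mat, check.attributes = FALSE)
```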

Speed up R loop [duplicate]

Speeding up loops in R can easily be done using a function from the apply family. How can I use an apply function in the code below to speed it up? Note that within the loop, at each iteration, one column is permuted and a function is applied to the new data frame (i.e., the initial data frame with one column permuted). I cannot seem to get apply to work because the new data frame has to be built within the loop.
#x <- data.frame(a=1:10,b=11:20,c=21:30) #small example
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
y <- rowMeans(x)
start <- Sys.time()
totaldiff <- numeric()
for (i in 1:ncol(x)){
x.after <- x
x.after[,i] <- sample(x[,i])
diff <- abs(y-rowMeans(x.after))
totaldiff[i] <- sum(diff)
}
colnames(x)[which.max(totaldiff)]
Sys.time() - start
After working through this and other replies, the optimization strategies (and approximate speed-up) here seem to be
(30x) Choose an appropriate data representation -- matrix, rather than data.frame
(1.5x) Reduce unnecessary data copies -- difference of columns, rather than of rowMeans
Structure for loops as *apply functions (to emphasize code structure, simplify memory management, and provide type consistency)
(2x) Hoist vector operations outside loops -- abs and sum on columns become abs and colSums on a matrix
for an overall speed-up of about 100x. For this size and complexity of code, the use of the compiler or parallel packages would not be effective.
I put your code into a function
f0 <- function(x) {
y <- rowMeans(x)
totaldiff <- numeric()
for (i in 1:ncol(x)){
x.after <- x
x.after[,i] <- sample(x[,i])
diff <- abs(y-rowMeans(x.after))
totaldiff[i] <- sum(diff)
}
which.max(totaldiff)
}
and here we have
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
set.seed(123)
system.time(res0 <- f0(x))
## user system elapsed
## 1.065 0.000 1.066
Your data can be represented as a matrix, and operations on R matrices are faster than on data.frames.
m <- matrix(runif(50*100),nrow=50,ncol=100)
set.seed(123)
system.time(res0.m <- f0(m))
## user system elapsed
## 0.036 0.000 0.037
identical(res0, res0.m)
##[1] TRUE
That's probably the biggest speed-up. But for the specific operation here we don't need to calculate the row means of the updated matrix, just the change in the mean from shuffling one column
f1 <- function(x) {
y <- rowMeans(x)
totaldiff <- numeric()
for (i in 1:ncol(x)){
diff <- abs(sample(x[,i]) - x[,i]) / ncol(x)
totaldiff[i] <- sum(diff)
}
which.max(totaldiff)
}
The for loop doesn't follow the right pattern for filling up the result vector totaldiff (you want to "pre-allocate and fill", so totaldiff <- numeric(ncol(x))) but we can use an sapply and let R worry about that (this memory management is one of the advantages of using the apply family of functions)
f2 <- function(x) {
totaldiff <- sapply(seq_len(ncol(x)), function(i, x) {
sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
}, x)
which.max(totaldiff)
}
set.seed(123); identical(res0, f1(m))
set.seed(123); identical(res0, f2(m))
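The "pre-allocate and fill" pattern mentioned above can be sketched on a toy computation. The three function names are illustrative, and squaring the index stands in for whatever is computed per iteration.

```r
# Three ways to collect one number per iteration; function names are illustrative.
grow_result <- function(n) {
  out <- numeric()                 # starts empty and is extended on every assignment
  for (i in seq_len(n)) out[i] <- i^2
  out
}

prealloc_result <- function(n) {
  out <- numeric(n)                # allocated once at its final length
  for (i in seq_len(n)) out[i] <- i^2
  out
}

vapply_result <- function(n) {
  # vapply allocates the result itself and checks each value is a single numeric
  vapply(seq_len(n), function(i) i^2, numeric(1))
}

n <- 1000
identical(grow_result(n), prealloc_result(n))
identical(grow_result(n), vapply_result(n))
```

All three return the same vector; they differ only in who manages the result's memory.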
The timings are
> library(microbenchmark)
> microbenchmark(f0(m), f1(m), f2(m))
Unit: milliseconds
expr min lq median uq max neval
f0(m) 32.45073 33.07804 33.16851 33.26364 33.81924 100
f1(m) 22.20913 23.87784 23.96915 24.06216 24.66042 100
f2(m) 21.02474 22.60745 22.70042 22.80080 23.19030 100
@flodel points out that vapply can be faster (and provides type safety)
f3 <- function(x) {
totaldiff <- vapply(seq_len(ncol(x)), function(i, x) {
sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
}, numeric(1), x)
which.max(totaldiff)
}
and that
f4 <- function(x)
which.max(colSums(abs((apply(x, 2, sample) - x))))
is still faster (ncol(x) is a constant factor, so removed) -- The abs and sum are hoisted outside the sapply, maybe at the expense of additional memory use. The advice in the comments to compile functions is good in general; here are some further timings
> microbenchmark(f0(m), f1(m), f1.c(m), f2(m), f2.c(m), f3(m), f4(m))
Unit: milliseconds
expr min lq median uq max neval
f0(m) 32.35600 32.88326 33.12274 33.25946 34.49003 100
f1(m) 22.21964 23.41500 23.96087 24.06587 24.49663 100
f1.c(m) 20.69856 21.20862 22.20771 22.32653 213.26667 100
f2(m) 20.76128 21.52786 22.66352 22.79101 69.49891 100
f2.c(m) 21.16423 21.57205 22.94157 23.06497 23.35764 100
f3(m) 20.17755 21.41369 21.99292 22.10814 22.36987 100
f4(m) 10.10816 10.47535 10.56790 10.61938 10.83338 100
where the ".c" versions are compiled versions.
Compilation is particularly helpful in code written with for loops but doesn't do much for vectorized code; this is shown here, where there's a small but consistent improvement from compiling f1's for loop, but not f2's sapply.
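How the ".c" variants were built is not shown here; a plausible sketch, assuming they come from cmpfun in the compiler package, is given below on an illustrative loop function (loop_sum_sq is not from the answer). Since R 3.4.0 the JIT byte-compiler is enabled by default, so on current versions an explicit cmpfun call typically changes little.

```r
library(compiler)

# An illustrative loop-based function (not the f1 from the answer)
loop_sum_sq <- function(x) {
  total <- 0
  for (v in x) total <- total + v^2
  total
}

# Byte-compile it; the ".c" suffix follows the naming used in the timings above
loop_sum_sq.c <- cmpfun(loop_sum_sq)

x <- runif(1e4)
identical(loop_sum_sq(x), loop_sum_sq.c(x))
```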
Since you are looking at efficiency/optimization, start by using the rbenchmark package for comparison purposes.
Rewriting your given example as a function (so that it can be replicated and compared)
forFirst <- function(x) {
y <- rowMeans(x)
totaldiff <- numeric()
for (i in 1:ncol(x)){
x.after <- x
x.after[,i] <- sample(x[,i])
diff <- abs(y-rowMeans(x.after))
totaldiff[i] <- sum(diff)
}
colnames(x)[which.max(totaldiff)]
}
Applying some standard optimizations (pre-allocating totaldiff to the right size, eliminating intermediate variables that are only used once) gives
forSecond <- function(x) {
y <- rowMeans(x)
totaldiff <- numeric(ncol(x))
for (i in 1:ncol(x)){
x.after <- x
x.after[,i] <- sample(x[,i])
totaldiff[i] <- sum(abs(y-rowMeans(x.after)))
}
colnames(x)[which.max(totaldiff)]
}
Not much more can be done for this that I can see to improve the algorithm itself in the loop. A better algorithm would be the most help, but since this particular problem is just an example, it is not worth spending that time.
The apply version looks very similar.
applyFirst <- function(x) {
y <- rowMeans(x)
totaldiff <- sapply(seq_len(ncol(x)), function(i) {
x[,i] <- sample(x[,i])
sum(abs(y-rowMeans(x)))
})
colnames(x)[which.max(totaldiff)]
}
Benchmarking them gives:
> library("rbenchmark")
> benchmark(forFirst(x),
+ forSecond(x),
+ applyFirst(x),
+ order = "relative")
test replications elapsed relative user.self sys.self user.child
1 forFirst(x) 100 16.92 1.000 16.88 0.00 NA
2 forSecond(x) 100 17.02 1.006 16.96 0.03 NA
3 applyFirst(x) 100 17.05 1.008 17.02 0.01 NA
sys.child
1 NA
2 NA
3 NA
The differences between these are just noise. In fact, running the benchmark again gives a different ordering:
> benchmark(forFirst(x),
+ forSecond(x),
+ applyFirst(x),
+ order = "relative")
test replications elapsed relative user.self sys.self user.child
3 applyFirst(x) 100 17.05 1.000 17.02 0 NA
2 forSecond(x) 100 17.08 1.002 17.05 0 NA
1 forFirst(x) 100 17.44 1.023 17.41 0 NA
sys.child
3 NA
2 NA
1 NA
So these approaches are the same speed. Any real improvement will come from using a better algorithm than just simple looping and copying to create the intermediate results.
Apply functions do not necessarily speed up loops in R. Sometimes they can even slow them down. There's no reason to believe that turning this into an apply family function will speed it up any appreciable amount.
As an aside, this code seems like a relatively pointless endeavour. It's just going to select a random column. I could get the same result by just doing that in the first place. Perhaps this is nested in a larger loop looking for a distribution?

does the by( ) function make growing list

Does the by function make a list that grows one element at a time?
I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:
> # Make 4M rows of data
> x = data.frame(col1=1:4000000, col2=10000001:14000000)
> # Make a factor
> x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
>
> head(x)
col1 col2 f
1 1 10000001 0
2 2 10000002 0
3 3 10000003 0
4 4 10000004 0
5 5 10000005 5
6 6 10000006 5
Now, a tapply on one of the columns takes a reasonable amount of time:
> t1 = Sys.time()
> z = tapply(x[, 1], x[, "f"], mean)
> Sys.time() - t1
Time difference of 22.14491 secs
But if I do this:
z = by(x[, 1], x[, "f"], mean)
That doesn't finish anywhere near the same time (I gave up after a minute).
Of course, in the above example, tapply could be used, but I actually need to process multiple columns together. What is the better way to do this?
by is slower than tapply because it is a wrapper around tapply.
Let's take a look at some benchmarks: tapply in this situation is more than 3x faster than using by
UPDATED to include @Roland's great recommendation:
library(rbenchmark)
library(data.table)
dt <- data.table(x,key="f")
using.tapply <- quote(tapply(x[, 1], x[, "f"], mean))
using.by <- quote(by(x[, 1], x[, "f"], mean))
using.dtable <- quote(dt[,mean(col1),by=key(dt)])
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
#------------------------#
# RESULTS #
#------------------------#
# COMPARING tapply VS by #
#-----------------------------------
# test elapsed relative
# 1 using.tapply 2.453 1.000
# 2 using.by 8.889 3.624
# COMPARING data.table VS tapply VS by #
#------------------------------------------#
# test elapsed relative
# 2 using.dtable 0.168 1.000
# 1 using.tapply 2.396 14.262
# 3 using.by 8.566 50.988
If x$f is a factor, the loss in efficiency between tapply and by is even greater!
Although, notice that they both improve relative to non-factor inputs, while data.table remains approx the same or worse
x[, "f"] <- as.factor(x[, "f"])
dt <- data.table(x,key="f")
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
# test elapsed relative
# 2 using.dtable 0.175 1.000
# 1 using.tapply 1.803 10.303
# 3 using.by 7.854 44.880
As for the why, the short answer is in the documentation itself.
?by :
Description
Function by is an object-oriented wrapper for tapply applied to data frames.
Let's take a look at the source for by (or more specifically, by.data.frame):
by.data.frame
function (data, INDICES, FUN, ..., simplify = TRUE)
{
if (!is.list(INDICES)) {
IND <- vector("list", 1L)
IND[[1L]] <- INDICES
names(IND) <- deparse(substitute(INDICES))[1L]
}
else IND <- INDICES
FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
nd <- nrow(data)
ans <- eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)),
data)
attr(ans, "call") <- match.call()
class(ans) <- "by"
ans
}
We see immediately that there is still a call to tapply plus a lot of extras (including calls to deparse(substitute(.)) and an eval(substitute(.)) both of which are relatively slow). Therefore it makes sense that your tapply will be relatively faster than a similar call to by.
Regarding a better way to do this: With 4M rows you should use data.table.
library(data.table)
dt <- data.table(x,key="f")
dt[,mean(col1),by=key(dt)]
dt[,list(mean1=mean(col1),mean2=mean(col2)),by=key(dt)]
dt[,lapply(.SD,mean),by=key(dt)]
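If adding a package is not an option, group means for several columns at once can also be sketched in base R with rowsum. This is an alternative offered as an assumption-laden sketch, not part of the answer above: it uses a smaller version of the question's data and only covers means (sums divided by group sizes), not arbitrary functions.

```r
# Smaller version of the question's data so the sketch runs quickly
x <- data.frame(col1 = 1:100000, col2 = 1000001:1100000)
x$f <- x$col1 - x$col1 %% 5

# Group sums for several columns at once, then divide by the group sizes
cols   <- c("col1", "col2")
sums   <- rowsum(as.matrix(x[, cols]), x$f)   # one row per group, one column per variable
counts <- rowsum(rep(1, nrow(x)), x$f)        # number of rows in each group
means  <- sums / as.vector(counts)

head(means)
```

The rows of means are ordered by the sorted group values, which matches the ordering tapply uses.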

Efficient subsetting in R using 2 dataframes

I have a big time series full in one dataframe and a list of timestamps in a different dataframe test. I need to subset full with data points surrounding the timestamps in test. My first instinct (as an R noob) was to write the below, which was wrong
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
Looking at the result I realized that R is looping through both the vectors simultaneously giving the wrong result. My option is to write a loop like the below:
subs<-data.frame()
for (j in test$dt)
subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))
I feel that there might be a better way to do loops and this article implores us to avoid R loops as much as possible. The other reason is I might be hitting up against performance issues as this would be at the heart of an optimization algorithm. Any suggestions from gurus would be greatly appreciated.
EDIT:
Here is some reproducible code that shows the wrong approach as well as the approach that works but could be better.
#create a times series
full <- data.frame(seq(1:200),rnorm(200,0,1))
colnames(full)<-c("dt","val")
#my smaller array of points of interest
test <- data.frame(seq(5,200,by=23))
colnames(test)<-c("dt")
# my range around the points of interest
i<-3
#the wrong approach
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
#this works, but not sure this is the best way to go about it
subs<-data.frame()
for (j in test$dt)
subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))
EDIT:
I updated the values to better reflect my use case, and I see @mrdwab's solution pulling ahead unexpectedly and by a wide margin.
I am using benchmark code from @mrdwab and the initialization is as follows:
set.seed(1)
full <- data.frame(
dt = 1:15000000,
val = floor(rnorm(15000000,0,1))
)
test <- data.frame(dt = floor(runif(24,1,15000000)))
i <- 500
The benchmarks are:
test replications elapsed relative
2 mrdwab 2 1.31 1.00000
3 spacedman 2 69.06 52.71756
1 andrie 2 93.68 71.51145
4 original 2 114.24 87.20611
Totally unexpected. Mind = blown. Can someone please shed some light in this dark corner and enlighten as to what is happening.
Important: As @mrdwab notes below, his solution works only if the vectors are integers. If not, @spacedman has the right solution.
Here's a real R way to do it. Functionally. No loops...
Starting with Andrie's example data.
First, an interval comparison function:
> cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
An OR composition function:
> OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
Now there's sort of a loop here, to construct a list of those comparison functions:
> funs = mapply(cf,test$dt-i,test$dt+i)
Now combine all those into one function:
> anyF = Reduce(OR,funs)
And now we apply the OR composition to our interval testing functions:
> head(full[anyF(full$dt),])
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
What you've got now is a function of a single variable that tests if the value is in the ranges you defined.
> anyF(1:10)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
I don't know if this is faster, or better, or what. Someone do some benchmarks!
I don't know if it's any more efficient, but I would think you could also do something like this to get what you want:
subs <- apply(test, 1, function(x) c((x-2):(x+2)))
full[which(full$dt %in% subs), ]
I had to adjust your "3" to "2" since x would be included both ways.
Benchmarking (just for fun)
@Spacedman leads the way!
First, the required data and functions.
## Data
set.seed(1)
full <- data.frame(
dt = 1:200,
val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
## Spacedman's functions
cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
funs = mapply(cf,test$dt-i,test$dt+i)
anyF = Reduce(OR,funs)
Second, the benchmarking.
## Benchmarking
require(rbenchmark)
benchmark(andrie = do.call(rbind,
lapply(test$dt,
function(j) full[full$dt > (j-i) &
full$dt < (j+i), ])),
mrdwab = {subs <- apply(test, 1,
function(x) c((x-(i-1)):(x+(i-1))))
full[which(full$dt %in% subs), ]},
spacedman = full[anyF(full$dt),],
original = {subs <- data.frame()
for (j in test$dt)
subs <- rbind(subs,
subset(full, full$dt > (j-i) &
full$dt < (j+i)))},
columns = c("test", "replications", "elapsed", "relative"),
order = "relative")
# test replications elapsed relative
# 3 spacedman 100 0.064 1.000000
# 2 mrdwab 100 0.105 1.640625
# 1 andrie 100 0.520 8.125000
# 4 original 100 1.080 16.875000
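For non-integer timestamps, where the %in% approach does not apply, a vectorised alternative can be sketched with findInterval. in_any_window is a hypothetical helper, not one of the benchmarked solutions, and it assumes the half-width is positive. For each value it looks only at the nearest center at or below it and the nearest center above it, which is enough to decide whether the value falls in any window. Unlike the rbind loop, a row lying in two overlapping windows is returned once.

```r
# Hypothetical helper: TRUE where x lies strictly within half_width of any center.
# Assumes half_width > 0; works for non-integer values.
in_any_window <- function(x, centers, half_width) {
  centers <- sort(centers)
  k <- findInterval(x, centers)      # number of centers <= x
  below <- c(-Inf, centers)[k + 1]   # nearest center at or below x (-Inf if none)
  above <- c(centers, Inf)[k + 1]    # nearest center above x (Inf if none)
  (x < below + half_width) | (x > above - half_width)
}

# Illustrative non-integer data with the same column names as the question
set.seed(1)
full <- data.frame(dt = sort(runif(200, 0, 200)), val = rnorm(200))
test <- data.frame(dt = runif(9, 0, 200))
i <- 3

subs <- full[in_any_window(full$dt, test$dt, i), ]
```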
There is nothing inherently wrong with your code. To achieve your aim, you need a loop of some sort around a vectorised subset operation.
But here is a more R-ish way to do it, which might well be faster:
do.call(rbind,
lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)
PS: You can significantly simplify your reproducible example:
set.seed(1)
full <- data.frame(
dt = 1:200,
val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
xx <- do.call(rbind,
lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)
head(xx)
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
one more way using data.tables:
{
temp <- data.table(x=unique(c(full$dt,(test$dt-i),(test$dt+i))),key="x")
temp[,index:=1:nrow(temp)]
startpoints <- temp[J(test$dt-i),index]$index
endpoints <- temp[J(test$dt+i),index]$index
allpoints <- as.vector(mapply(FUN=function(x,y) x:y,x=startpoints,y=endpoints))
setkey(x=temp,index)
ans <- temp[J(allpoints)]$x
}
benchmarks:
number of rows in test:9
number of rows in full:10000
test replications elapsed relative
1 spacedman 100 0.406 1.000
2 new 100 1.179 2.904
number of rows in full:100000
test replications elapsed relative
2 new 100 2.374 1.000
1 spacedman 100 3.753 1.581

Vectorize a product calculation which depends on previous elements?

I'm trying to speed up/vectorize some calculations in a time series.
Can I vectorize a calculation in a for loop which can depend on results from an earlier iteration? For example:
z <- c(1,1,0,0,0,0)
zi <- 2:6
for (i in zi) {z[i] <- ifelse (z[i-1]== 1, 1, 0) }
uses the z[i] values updated in earlier steps:
> z
[1] 1 1 1 1 1 1
In my effort at vectorizing this
z <- c(1,1,0,0,0,0)
z[zi] <- ifelse( z[zi-1] == 1, 1, 0)
the element-by-element operations don't use results updated in the operation:
> z
[1] 1 1 1 0 0 0
So this vectorized operation operates in 'parallel' rather than iterative fashion. Is there a way I can write/vectorize this to get the results of the for loop?
ifelse is vectorized and there's a bit of a penalty if you're using it on one element at a time in a for-loop. In your example, you can get a pretty good speedup by using if instead of ifelse.
fun1 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- ifelse(z[i-1]==1, 1, 0)
}
z
}
fun2 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- if(z[i-1]==1) 1 else 0
}
z
}
z <- c(1,1,0,0,0,0)
identical(fun1(z),fun2(z))
# [1] TRUE
system.time(replicate(10000, fun1(z)))
# user system elapsed
# 1.13 0.00 1.32
system.time(replicate(10000, fun2(z)))
# user system elapsed
# 0.27 0.00 0.26
You can get some additional speed gains out of fun2 by compiling it.
library(compiler)
cfun2 <- cmpfun(fun2)
system.time(replicate(10000, cfun2(z)))
# user system elapsed
# 0.11 0.00 0.11
So there's a 10x speedup without vectorization. As others have said (and some have illustrated) there are ways to vectorize your example, but that may not translate to your actual problem. Hopefully this is general enough to be applicable.
The filter function may be useful to you as well if you can figure out how to express your problem in terms of an autoregressive or moving average process.
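As a sketch of what that can look like, here is an illustrative first-order recursion (the coefficient 0.5 and the input vector are made up for the example) written once as a loop and once with stats::filter using method = "recursive".

```r
# Illustrative recursion: y[i] = x[i] + 0.5 * y[i-1], with y[0] taken as 0
x <- c(1, 0, 0, 2, 0, 1)

y_loop <- numeric(length(x))
y_loop[1] <- x[1]
for (i in 2:length(x)) y_loop[i] <- x[i] + 0.5 * y_loop[i - 1]

# Same recursion expressed as a recursive (autoregressive) filter
y_filter <- as.numeric(stats::filter(x, filter = 0.5, method = "recursive"))

all.equal(y_loop, y_filter)
```

This only helps when the dependence on earlier results is linear, as it is in this example.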
This is a nice and simple example where Rcpp can shine.
So let us first recast functions 1 and 2 and their compiled counterparts:
library(inline)
library(rbenchmark)
library(compiler)
fun1 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- ifelse(z[i-1]==1, 1, 0)
}
z
}
fun1c <- cmpfun(fun1)
fun2 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- if(z[i-1]==1) 1 else 0
}
z
}
fun2c <- cmpfun(fun2)
We write a Rcpp variant very easily:
funRcpp <- cxxfunction(signature(zs="numeric"), plugin="Rcpp", body="
Rcpp::NumericVector z = Rcpp::NumericVector(zs);
int n = z.size();
for (int i=1; i<n; i++) {
z[i] = (z[i-1]==1.0 ? 1.0 : 0.0);
}
return(z);
")
This uses the inline package to compile, load and link the five-liner on the fly.
Now we can define our test data, which we make a little longer than the original (as just running the original too few times results in unmeasurable times):
R> z <- rep(c(1,1,0,0,0,0), 100)
R> all(sapply(list(fun2(z), fun1c(z), fun2c(z), funRcpp(z)), identical, fun1(z)))
[1] TRUE
R>
All answers are seen as identical.
Finally, we can benchmark:
R> res <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications", "elapsed",
+ "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=1000)
R> print(res)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 1000 0.005 1.0 0.01 0
4 fun2c(z) 1000 0.466 93.2 0.46 0
2 fun2(z) 1000 1.918 383.6 1.92 0
3 fun1c(z) 1000 10.865 2173.0 10.86 0
1 fun1(z) 1000 12.480 2496.0 12.47 0
The compiled version wins by a factor of almost 400 against the best R version, and almost 100 against its byte-compiled variant. For function 1, the byte compilation matters much less and both variants trail the C++ by a factor of well over two-thousand.
It took about one minute to write the C++ version. The speed gain suggests it was a minute well spent.
For comparison, here is the result for the original short vector called more often:
R> z <- c(1,1,0,0,0,0)
R> res2 <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications",
+ "elapsed", "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=10000)
R> print(res2)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 10000 0.046 1.000000 0.04 0
4 fun2c(z) 10000 0.132 2.869565 0.13 0
2 fun2(z) 10000 0.271 5.891304 0.27 0
3 fun1c(z) 10000 1.045 22.717391 1.05 0
1 fun1(z) 10000 1.202 26.130435 1.20 0
The qualitative ranking is unchanged: the Rcpp version dominates and function 2 is second-best, with the byte-compiled version being about twice as fast as the plain R variant, but still almost three times slower than the C++ version. The relative differences are smaller here: with such a short vector the function call overhead accounts for most of the time, whereas with the longer vectors the actual looping matters more and C++ gets a bigger advantage on the loop operations. That is an important result, as it suggests that with more real-life-sized data, the compiled version may reap a larger benefit.
Edited to correct two small oversights in the code examples. And edited again with thanks to Josh to catch a setup error relative to fun2c.
I think this is cheating and not generalizable, but: according to the rules you have above, any occurrence of 1 in the vector will make all subsequent elements 1 (by recursion: z[i] is set to 1 if z[i-1] equals 1; therefore z[i] will be set to 1 if z[i-2] equals 1; and so forth). Depending on what you really want to do, there may be such a recursive solution available if you think carefully about it ...
z <- c(1,1,0,0,0,0)
first1 <- min(which(z==1))
z[seq_along(z)>first1] <- 1
edit: this is wrong, but I'm leaving it up to admit my mistakes. Based on a little bit of playing (and less thinking), I think the actual solution to this recursion is more symmetric and even simpler:
rep(z[1],length(z))
Test cases:
z <- c(1,1,0,0,0,0)
z <- c(0,1,1,0,0,0)
z <- c(0,0,1,0,0,0)
Check out the rollapply function in zoo.
I'm not super familiar with it, but I think this does what you want:
> c( 1, rollapply(z,2,function(x) x[1]) )
[1] 1 1 1 1 1 1
I'm sort of kludging it by using a window of 2 and then only using the first element of that window.
For more complicated examples you could perform some calculation on x[1] and return that instead.
Sometimes you just need to think about it totally differently. What you're doing is creating a vector where every item is the same as the first if it's a 1 or 0 otherwise.
z <- c(1,1,0,0,0,0)
if (z[1] != 1) z[1] <- 0
z[2:length(z)] <- z[1]
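A quick check of this observation against the question's loop, using the three test vectors listed in an earlier answer. loop_version and closed_form are illustrative names; closed_form leaves z[1] untouched, as the loop does, and both assume a vector of length at least 2.

```r
# loop_version reproduces the question's loop; closed_form is an illustrative name
loop_version <- function(z) {
  for (i in 2:length(z)) z[i] <- if (z[i - 1] == 1) 1 else 0
  z
}

# Every element after the first ends up as 1 if z[1] is 1, and 0 otherwise
closed_form <- function(z) {
  z[-1] <- as.numeric(z[1] == 1)
  z
}

cases <- list(c(1,1,0,0,0,0), c(0,1,1,0,0,0), c(0,0,1,0,0,0))
agree <- vapply(cases, function(z) identical(loop_version(z), closed_form(z)), logical(1))
```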
There is a function that does this particular calculation: cumprod (cumulative product)
> cumprod(z[zi])
[1] 1 0 0 0 0
> cumprod(c(1,2,3,4,0,5))
[1] 1 2 6 24 0 0
Otherwise, vectorize with Rcpp as other answers have shown.
It's also possible to do this with "apply" using the original vector and a lagged version of the vector as the constituent columns of a data frame.

Resources