fast way in R to do two nested for loops [duplicate] - r

This question already has answers here:
Subtract every element of vector A from every element of vector B
(4 answers)
Closed 5 years ago.
I need to take the difference between any two elements of two vector.
If A<-c(1,2) and B<-c(3,4) then my result R should be c(3-1,3-2,4-1,4-2).
With this snippet
myfunction <- function(N)
{
A = runif(N)
B = runif(N)
R = c()
for(a in A){
for(b in B){
R=c(b-a,R)
}
}
R
}
print(system.time(result <- myfunction(300)))
I get this time
user system elapsed
14.27 0.01 14.39
Is there any faster way to do it?

The fastest base solution is the use of outer:
as.vector(outer(B,A,"-"))
To my own surprise, map2_dbl is actually quite a bit faster than outer:
Not to my surprise, map2_dbl seems faster, but that's because it is not calculating every combination of values in A and B:
test elapsed relative
3 CP(A, B) 7.54 47.125 # using expand.grid
2 JL(A, B) 0.16 1.000 # using map2_dbl
1 JM(A, B) 3.13 19.563 # using outer
But:
> A <- 1:3
> B <- 3:1
> JL(A,B)
[1] -2 0 2
> JM(A,B)
[1] 2 1 0 1 0 -1 0 -1 -2
This is for two vectors of length 1000, and with 100 replications. I didn't include your own solution because that one is ridiculously slow for two reasons:
for loops in R are quite a bit faster than in the old days, but still not as optimal as using functions that have their loops coded in C or equivalent. That's the case for the functions used in the tested code here.
you "grow" your result object. Every loop through the code, that R becomes one value larger, so R has to look for a new place in the memory to store it. That's actually the biggest bottleneck in your code. Try to avoid that kind of construct at all costs, because it's one of the most important causes of terribly slow code.
The benchmark code:
library(tidyverse)
JM <- function(A,B){
as.vector(outer(B,A,"-"))
}
JL <- function(A,B){
map2_dbl(.x = A,
.y = B,
.f = ~ c(.x - .y))
}
CP <- function(A,B){
as.data.frame(expand.grid(A,B)) %>%
mutate(Var3 = Var2-Var1)
}
library(rbenchmark)
A <- runif(1000)
B <- runif(1000)
benchmark(JM(A,B),
JL(A,B),
CP(A,B),
replications = 100,
columns = c("test","elapsed","relative"))

You can use expand.grid to vectorize the approach:
A <- runif(300)
B <- runif(300)
library(dplyr)
R <- as.data.frame(expand.grid(A,B)) %>%
mutate(Var3 = Var2-Var1)
The first 5 lines of output:
Var1 Var2 Var3
1 0.8516676 0.325261 -0.5264066246
2 0.2126453 0.325261 0.1126156694
3 0.5394620 0.325261 -0.2142010126
4 0.1364876 0.325261 0.1887734290
5 0.3248651 0.325261 0.0003958747
This took:
user system elapsed
0.02 0.00 0.02
Your function took:
user system elapsed
42.39 0.43 42.90

Using purrr::map2:
library(tidyverse)
N = 300
A = runif(N)
B = runif(N)
R = c()
print(
system.time(
result <- map(
.x = A,
.f = ~ c(.x - B)) %>% unlist
)
)
Time taken:
user system elapsed
0.02 0 0.02
If I got your attention now, check out this repo for a nice walk through of purrr.

Related

set missing values to constant in R, computational speed

In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.
It has many na value elements in both its rows and columns, and I am looking to replace these values with a zero. I have used:
d[is.na(d)] <- 0
but this is rather slow. Is there a better way to do this in R?
I am open to using other R packages.
I would prefer it if the discussion focused on computational speed rather than, "why would you replace na's with zeros", for example. And, while I realize a similar Q has been asked (How do I replace NA values with zeros in an R dataframe?) the focus has not been towards computational speed on a large data frame with many missing values.
Thanks!
Edited Solution:
As helpfully suggested, changing d to a matrix before applying is.na sped up the computation by an order of magnitude
You can get a considerable performance increase using the data.table package.
It is much faster, in general, with a lot of manipulations and transformations.
The downside is the learning curve of the syntax.
However, if you are looking for a speed performance boost, the investment could be worth it.
Generate fake data
r <- 10500
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))
Base R
df1 <- df
system.time(df1[is.na(df1)] <- 0)
user system elapsed
4.74 0.00 4.78
tidyr - replace_na()
dfReplaceNA <- function (df) {
require(tidyr)
l <- setNames(lapply(vector("list", ncol(df)), function(x) x <- 0), names(df))
replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))
user system elapsed
4.27 0.00 4.28
data.table - set()
dtReplaceNA <- function (df) {
require(data.table)
dt <- data.table(df)
for (j in 1:ncol(dt)) {set(dt, which(is.na(dt[[j]])), j, 0)}
setDF(dt) # Return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))
user system elapsed
0.80 0.31 1.11
Compare data frames
all.equal(df1, df2)
[1] TRUE
all.equal(df1, df3)
[1] TRUE
I guess that all columns must be numeric or assigning 0s to NAs wouldn't be sensible.
I get the following timings, with approximately 10,000 NAs:
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
user system elapsed
0.19 0.12 0.31
> system.time(D[is.na(D)] <- 0)
user system elapsed
3.87 0.06 3.95
So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop -- much less time than it took to answer the question. If the problem really is of this magnitude, why is that slow?
I hope this helps.

Benchmark analysis and display results of analysis AND benchmark?

Probably, I just missed a parameter... but, maybe someone can point me to it: How can run analysis in R benchmark it and still store the result back somewhere?. I know R functions can only return one single object, but I could either make use of a list here or paste the benchmark results and store the analysis in the function's return value.
But, is there any way to evaluate benchmark (or system.time) and analysis without running it twice like this?:
require(rbenchmark)
bmark <- function(x){
res <- list()
res[[1]] <- benchmark(x^6)
res[[2]] <- x^6
res
}
EDIT: I am sorry I caused some confusion about what I really want to do. Maybe the use case makes it clearer: I don't have a typical benchmark situation where I want to check whether my custom function is faster than some other function. It's rather that I run the same thing with different data on different machines. I don't need this in a test environment, but in production – I just want to let users of a script know how long it took. If that's an hour or more people can plan their lunch break :) .
Here's an example using two functions. The first one uses plyr and the second uses data.table.
# dummy data
require(plyr)
require(data.table)
set.seed(45)
x1 <- data.frame(x=rnorm(1e6), grp = sample(letters[1:26], 1e6, replace=T))
x1.dt <- data.table(x1, key="grp")
# function that uses plyr
DF.FUN <- function(x) {
ddply(x1, .(grp), summarise, m.x = mean(x))
}
# function that uses data.table
DT.FUN <- function(x) {
x1.dt[, list(m.x=mean(x)),by=grp]
}
require(rbenchmark)
> benchmark( s1 <- DF.FUN(), s2 <- DT.FUN(), order="elapsed", replications=2)
# test replications elapsed relative user.self sys.self user.child sys.child
# 2 s2 <- DT.FUN() 2 0.036 1.000 0.031 0.006 0 0
# 1 s1 <- DF.FUN() 2 0.527 14.639 0.363 0.163 0 0
Now, s1 and s2 contain the results from each function, and the benchmarked results will be displayed on screen.
# > head(s1)
# grp m.x
# 1 a 0.0069312201
# 2 b -0.0002422315
# 3 c -0.0129449586
# 4 d -0.0036275338
# 5 e 0.0013438022
# 6 f -0.0015428427
# > head(s2)
# grp m.x
# 1: a 0.0069312201
# 2: b -0.0002422315
# 3: c -0.0129449586
# 4: d -0.0036275338
# 5: e 0.0013438022
# 6: f -0.0015428427
Is this what you were after?
I read the question a bit differently than Arun. This would be the answer to what I thought was being asked:
> bres <- bmark(2)
> bres
[[1]]
test replications elapsed relative user.self sys.self user.child sys.child
1 x^6 100 0.001 1 0.001 0.001 0 0
[[2]]
[1] 64
The bmark function is returning a result with the default 100 replications. It you wanted to annotate the results you could use paste() and if you wanted to add a parameter for number of reps:
bmark2 <- function(x, reps=100){
res <- list()
res[[1]] <- benchmark(x^6, replications=reps)
res[[2]] <- paste(reps, " replications of ", x, "to the 6th in", res[[1]]$elapsed)
res
}
I am unsure of what StackOverflow thinks about answering old questions, but it seems like nobody actually answered after your edit. So here goes:
To time a process in R you can use two methods.
The first one uses system.time(expression) and gives you how much time it took to evaluate the expression within the brackets.
If this is not practical in your case you can get system time with Sys.time() before the operation and after the operation and subtract the two.
If this finally answers your question please accept the solution :)

Efficient subsetting in R using 2 dataframes

I have a big time series full in one dataframe and a list of timestamps in a different dataframe test. I need to subset full with data points surrounding the timestamps in test. My first instinct (as an R noob) was to write the below, which was wrong
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
Looking at the result I realized that R is looping through both the vectors simultaneously giving the wrong result. My option is to write a loop like the below:
subs<-data.frame()
for (j in test$dt)
subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))
I feel that there might be a better way to do loops and this article implores us to avoid R loops as much as possible. The other reason is I might be hitting up against performance issues as this would be at the heart of an optimization algorithm. Any suggestions from gurus would be greatly appreciated.
EDIT:
Here is some reproducible code that shows the wrong approach as well as the approach that works but could be better.
#create a times series
full <- data.frame(seq(1:200),rnorm(200,0,1))
colnames(full)<-c("dt","val")
#my smaller array of points of interest
test <- data.frame(seq(5,200,by=23))
colnames(test)<-c("dt")
# my range around the points of interset
i<-3
#the wrong approach
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
#this works, but not sure this is the best way to go about it
subs<-data.frame()
for (j in test$dt)
subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))
EDIT:
I updated the values to better reflect my usecase, and I see #mrdwab 's solution pulling ahead unexpectedly and by a wide margin.
I am using benchmark code from #mrdwab and the initialization is as follows:
set.seed(1)
full <- data.frame(
dt = 1:15000000,
val = floor(rnorm(15000000,0,1))
)
test <- data.frame(dt = floor(runif(24,1,15000000)))
i <- 500
The benchmarks are:
test replications elapsed relative
2 mrdwab 2 1.31 1.00000
3 spacedman 2 69.06 52.71756
1 andrie 2 93.68 71.51145
4 original 2 114.24 87.20611
Totally unexpected. Mind = blown. Can someone please shed some light in this dark corner and enlighten as to what is happening.
Important: As #mrdwab notes below, his solution works only if the vectors are integers. If not, #spacedman has the right solution
Here's a real R way to do it. Functionally. No loops...
Starting with Andrie's example data.
First, an interval comparison function:
> cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
An OR composition function:
> OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
Now there's sort of a loop here, to construct a list of those comparison functions:
> funs = mapply(cf,test$dt-i,test$dt+i)
Now combine all those into one function:
> anyF = Reduce(OR,funs)
And now we apply the OR composition to our interval testing functions:
> head(full[anyF(full$dt),])
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
What you've got now is a function of a single variable that tests if the value is in the ranges you defined.
> anyF(1:10)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
I don't know if this is faster, or better, or what. Someone do some benchmarks!
I don't know if it's any more efficient, but I would think you could also do something like this to get what you want:
subs <- apply(test, 1, function(x) c((x-2):(x+2)))
full[which(full$dt %in% subs), ]
I had to adjust your "3" to "2" since x would be included both ways.
Benchmarking (just for fun)
#Spacedman leads the way!
First, the required data and functions.
## Data
set.seed(1)
full <- data.frame(
dt = 1:200,
val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
## Spacedman's functions
cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
funs = mapply(cf,test$dt-i,test$dt+i)
anyF = Reduce(OR,funs)
Second, the benchmarking.
## Benchmarking
require(rbenchmark)
benchmark(andrie = do.call(rbind,
lapply(test$dt,
function(j) full[full$dt > (j-i) &
full$dt < (j+i), ])),
mrdwab = {subs <- apply(test, 1,
function(x) c((x-(i-1)):(x+(i-1))))
full[which(full$dt %in% subs), ]},
spacedman = full[anyF(full$dt),],
original = {subs <- data.frame()
for (j in test$dt)
subs <- rbind(subs,
subset(full, full$dt > (j-i) &
full$dt < (j+i)))},
columns = c("test", "replications", "elapsed", "relative"),
order = "relative")
# test replications elapsed relative
# 3 spacedman 100 0.064 1.000000
# 2 mrdwab 100 0.105 1.640625
# 1 andrie 100 0.520 8.125000
# 4 original 100 1.080 16.875000
There is nothing inherently wrong with your code. To achieve your aim, you need a loop of some sort around a vectorised subset operation.
But here is more R-ish way to do it, which might well be faster:
do.call(rbind,
lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)
PS: You can significantly simplify your reproducible example:
set.seed(1)
full <- data.frame(
dt = 1:200,
val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
xx <- do.call(rbind,
lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)
head(xx)
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
one more way using data.tables:
{
temp <- data.table(x=unique(c(full$dt,(test$dt-i),(test$dt+i))),key="x")
temp[,index:=1:nrow(temp)]
startpoints <- temp[J(test$dt-i),index]$index
endpoints <- temp[J(test$dt+i),index]$index
allpoints <- as.vector(mapply(FUN=function(x,y) x:y,x=startpoints,y=endpoints))
setkey(x=temp,index)
ans <- temp[J(allpoints)]$x
}
benchmarks:
number of rows in test:9
number of rows in full:10000
test replications elapsed relative
1 spacedman 100 0.406 1.000
2 new 100 1.179 2.904
number of rows in full:100000
test replications elapsed relative
2 new 100 2.374 1.000
1 spacedman 100 3.753 1.581

Count the number of valid observations (no NA) pairwise in a data frame

Say I have a data frame like this:
Df <- data.frame(
V1 = c(1,2,3,NA,5),
V2 = c(1,2,NA,4,5),
V3 = c(NA,2,NA,4,NA)
)
Now I want to count the number of valid observations for every combination of two variables. For that, I wrote a function sharedcount:
sharedcount <- function(x,...){
nx <- names(x)
alln <- combn(nx,2)
out <- apply(alln,2,
function(y)sum(complete.cases(x[y]))
)
data.frame(t(alln),out)
}
This gives the output:
> sharedcount(Df)
X1 X2 out
1 V1 V2 3
2 V1 V3 1
3 V2 V3 2
All fine, but the function itself takes pretty long on big dataframes (600 variables and about 10000 observations). I have the feeling I'm overseeing an easier approach, especially since cor(...,use='pairwise') is running still a whole lot faster while it has to do something similar :
> require(rbenchmark)
> benchmark(sharedcount(TestDf),cor(TestDf,use='pairwise'),
+ columns=c('test','elapsed','relative'),
+ replications=1
+ )
test elapsed relative
2 cor(TestDf, use = "pairwise") 0.25 1.0
1 sharedcount(TestDf) 1.90 7.6
Any tips are appreciated.
Note : Using Vincent's trick, I wrote a function that returns the same data frame. Code in my answer below.
The following is slightly faster:
x <- !is.na(Df)
t(x) %*% x
# test elapsed relative
# cor(Df) 12.345 1.000000
# t(x) %*% x 20.736 1.679708
I thought Vincent's looked really elegant, not to mention being faster than my sophomoric for-loop, except it seems to be needing an extraction step which I added below. This is just an example of the heavy overhead in the apply method when used with dataframes.
shrcnt <- function(Df) {Comb <- t(combn(1:ncol(Df),2) )
shrd <- 1:nrow(Comb)
for (i in seq_len(shrd)){
shrd[i] <- sum(complete.cases(Df[,Comb[i,1]], Df[,Comb[i,2]]))}
return(shrd)}
benchmark(
shrcnt(Df), sharedcount(Df), {prs <- t(x) %*% x; prs[lower.tri(prs)]},
cor(Df,use='pairwise'),
columns=c('test','elapsed','relative'),
replications=100
)
#--------------
test elapsed relative
3 { 0.008 1.0
4 cor(Df, use = "pairwise") 0.020 2.5
2 sharedcount(Df) 0.092 11.5
1 shrcnt(Df) 0.036 4.5
Based on the lovely trick of Vincent and the additional lower.tri() suggestion of DWin, I came up with following function that gives me the same output (i.e. a data frame) as my original one, and runs a whole lot faster :
sharedcount2 <- function(x,stringsAsFactors=FALSE,...){
counts <- crossprod(!is.na(x))
id <- lower.tri(counts)
count <- counts[id]
X1 <- colnames(counts)[col(counts)[id]]
X2 <- rownames(counts)[row(counts)[id]]
data.frame(X1,X2,count)
}
Note the use of crossprod(), as that one gives a small improvement compared to %*%, but it does exactly the same.
The timings :
> benchmark(sharedcount(TestDf),sharedcount2(TestDf),
+ replications=5,
+ columns=c('test','replications','elapsed','relative'))
test replications elapsed relative
1 sharedcount(TestDf) 5 10.00 90.90909
2 sharedcount2(TestDf) 5 0.11 1.00000
Note: I supplied TestDf in the question, as I noticed that the timings differ depending on the size of the data frames. As shown here, the time increase is a lot more dramatic than when compared using a small data frame.

Vectorize a product calculation which depends on previous elements?

I'm trying to speed up/vectorize some calculations in a time series.
Can I vectorize a calculation in a for loop which can depend on results from an earlier iteration? For example:
z <- c(1,1,0,0,0,0)
zi <- 2:6
for (i in zi) {z[i] <- ifelse (z[i-1]== 1, 1, 0) }
uses the z[i] values updated in earlier steps:
> z
[1] 1 1 1 1 1 1
In my effort at vectorizing this
z <- c(1,1,0,0,0,0)
z[zi] <- ifelse( z[zi-1] == 1, 1, 0)
the element-by-element operations don't use results updated in the operation:
> z
[1] 1 1 1 0 0 0
So this vectorized operation operates in 'parallel' rather than iterative fashion. Is there a way I can write/vectorize this to get the results of the for loop?
ifelse is vectorized and there's a bit of a penalty if you're using it on one element at a time in a for-loop. In your example, you can get a pretty good speedup by using if instead of ifelse.
fun1 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- ifelse(z[i-1]==1, 1, 0)
}
z
}
fun2 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- if(z[i-1]==1) 1 else 0
}
z
}
z <- c(1,1,0,0,0,0)
identical(fun1(z),fun2(z))
# [1] TRUE
system.time(replicate(10000, fun1(z)))
# user system elapsed
# 1.13 0.00 1.32
system.time(replicate(10000, fun2(z)))
# user system elapsed
# 0.27 0.00 0.26
You can get some additional speed gains out of fun2 by compiling it.
library(compiler)
cfun2 <- cmpfun(fun2)
system.time(replicate(10000, cfun2(z)))
# user system elapsed
# 0.11 0.00 0.11
So there's a 10x speedup without vectorization. As others have said (and some have illustrated) there are ways to vectorize your example, but that may not translate to your actual problem. Hopefully this is general enough to be applicable.
The filter function may be useful to you as well if you can figure out how to express your problem in terms of a autoregressive or moving average process.
This is a nice and simple example where Rcpp can shine.
So let us first recast functions 1 and 2 and their compiled counterparts:
library(inline)
library(rbenchmark)
library(compiler)
fun1 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- ifelse(z[i-1]==1, 1, 0)
}
z
}
fun1c <- cmpfun(fun1)
fun2 <- function(z) {
for(i in 2:NROW(z)) {
z[i] <- if(z[i-1]==1) 1 else 0
}
z
}
fun2c <- cmpfun(fun2)
We write a Rcpp variant very easily:
funRcpp <- cxxfunction(signature(zs="numeric"), plugin="Rcpp", body="
Rcpp::NumericVector z = Rcpp::NumericVector(zs);
int n = z.size();
for (int i=1; i<n; i++) {
z[i] = (z[i-1]==1.0 ? 1.0 : 0.0);
}
return(z);
")
This uses the inline package to compile, load and link the five-liner on the fly.
Now we can define our test-date, which we make a little longer than the original (as just running the original too few times result in unmeasurable times):
R> z <- rep(c(1,1,0,0,0,0), 100)
R> identical(fun1(z),fun2(z),fun1c(z),fun2c(z),funRcpp(z))
[1] TRUE
R>
All answers are seen as identical.
Finally, we can benchmark:
R> res <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications", "elapsed",
+ "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=1000)
R> print(res)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 1000 0.005 1.0 0.01 0
4 fun2c(z) 1000 0.466 93.2 0.46 0
2 fun2(z) 1000 1.918 383.6 1.92 0
3 fun1c(z) 1000 10.865 2173.0 10.86 0
1 fun1(z) 1000 12.480 2496.0 12.47 0
The compiled version wins by a factor of almost 400 against the best R version, and almost 100 against its byte-compiled variant. For function 1, the byte compilation matters much less and both variants trail the C++ by a factor of well over two-thousand.
It took about one minute to write the C++ version. The speed gain suggests it was a minute well spent.
For comparison, here is the result for the original short vector called more often:
R> z <- c(1,1,0,0,0,0)
R> res2 <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications",
+ "elapsed", "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=10000)
R> print(res2)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 10000 0.046 1.000000 0.04 0
4 fun2c(z) 10000 0.132 2.869565 0.13 0
2 fun2(z) 10000 0.271 5.891304 0.27 0
3 fun1c(z) 10000 1.045 22.717391 1.05 0
1 fun1(z) 10000 1.202 26.130435 1.20 0
The qualitative ranking is unchanged: the Rcpp version dominates, function2 is second-best. with the byte-compiled version being about twice as fast that the plain R variant, but still almost three times slower than the C++ version. And the relative difference are lower: relatively speaking, the function call overhead matters less and the actual looping matters more: C++ gets a bigger advantage on the actual loop operations in the longer vectors. That it is an important result as it suggests that more real-life sized data, the compiled version may reap a larger benefit.
Edited to correct two small oversights in the code examples. And edited again with thanks to Josh to catch a setup error relative to fun2c.
I think this is cheating and not generalizable, but: according to the rules you have above, any occurrence of 1 in the vector will make all subsequent elements 1 (by recursion: z[i] is 1 set to 1 if z[i-1] equals 1; therefore z[i] will be set to 1 if z[i-2] equals 1; and so forth). Depending on what you really want to do, there may be such a recursive solution available if you think carefully about it ...
z <- c(1,1,0,0,0,0)
first1 <- min(which(z==1))
z[seq_along(z)>first1] <- 1
edit: this is wrong, but I'm leaving it up to admit my mistakes. Based on a little bit of playing (and less thinking), I think the actual solution to this recursion is more symmetric and even simpler:
rep(z[1],length(z))
Test cases:
z <- c(1,1,0,0,0,0)
z <- c(0,1,1,0,0,0)
z <- c(0,0,1,0,0,0)
Check out the rollapply function in zoo.
I'm not super familiar with it, but I think this does what you want:
> c( 1, rollapply(z,2,function(x) x[1]) )
[1] 1 1 1 1 1 1
I'm sort of kludging it by using a window of 2 and then only using the first element of that window.
For more complicated examples you could perform some calculation on x[1] and return that instead.
Sometimes you just need to think about it totally differently. What you're doing is creating a vector where every item is the same as the first if it's a 1 or 0 otherwise.
z <- c(1,1,0,0,0,0)
if (z[1] != 1) z[1] <- 0
z[2:length(z)] <- z[1]
There is a function that does this particular calculation: cumprod (cumulative product)
> cumprod(z[zi])
[1] 1 0 0 0 0
> cumprod(c(1,2,3,4,0,5))
[1] 1 2 6 24 0 0
Otherwise, vectorize with Rccp as other answers have shown.
It's also possible to do this with "apply" using the original vector and a lagged version of the vector as the constituent columns of a data frame.

Resources