I have several algorithms that depend on the efficiency of determining whether an element exists in a vector or not. It seems to me that %in% (which is equivalent to is.element()) should be the most efficient, as it simply returns a Boolean value. After testing several methods, to my surprise, %in% and is.element() are by far the most inefficient. Below is my analysis (the results get worse as the size of the vectors increases):
EfficiencyTest <- function(n, Lim) {
  samp1 <- sample(Lim, n)
  set1 <- sample(Lim, Lim)
  print(system.time(for (i in 1:n) {which(set1 == samp1[i])}))
  print(system.time(for (i in 1:n) {samp1[i] %in% set1}))
  print(system.time(for (i in 1:n) {is.element(samp1[i], set1)}))
  print(system.time(for (i in 1:n) {match(samp1[i], set1)}))
  a <- system.time(set1 <- sort(set1))
  b <- system.time(for (i in 1:n) {BinVecCheck(samp1[i], set1)})
  print(a + b)
}
> EfficiencyTest(10^3, 10^5)
   user  system elapsed   (which)
   0.29    0.11    0.40
   user  system elapsed   (%in%)
  19.79    0.39   20.21
   user  system elapsed   (is.element)
  19.89    0.53   20.44
   user  system elapsed   (match)
  20.04    0.28   20.33
   user  system elapsed   (sort + BinVecCheck)
   0.02    0.00    0.03
Where BinVecCheck is a binary search algorithm that I wrote that returns TRUE/FALSE. Note that I include the time it takes to sort the vector with the final method. Here is the code for the binary search:
BinVecCheck <- function(tar, vec) {
  if (tar == vec[1] || tar == vec[length(vec)]) {return(TRUE)}
  size <- length(vec)
  size2 <- trunc(size/2)
  dist <- (tar - vec[size2])
  if (dist > 0) {
    lower <- size2 - 1L
    upper <- size
  } else {
    lower <- 1L
    upper <- size2 + 1L
  }
  while (size2 > 1 && !(dist == 0)) {
    size2 <- trunc((upper - lower)/2)
    temp <- lower + size2
    dist <- (tar - vec[temp])
    if (dist > 0) {
      lower <- temp - 1L
    } else {
      upper <- temp + 1L
    }
  }
  if (dist == 0) {return(TRUE)} else {return(FALSE)}
}
Platform Info:
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Question
Is there a more efficient way of determining whether an element exists in a vector in R? For example, is there an R equivalent of Python's set type that greatly improves on this approach? Also, why are %in% and the like so inefficient even compared to which, which gives more information (not only does it determine existence, it also returns the indices of all matches)?
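For what it's worth, the closest base-R analogue of a Python set that I know of is a hashed environment keyed by the element values; a rough sketch (helper names are made up, and I make no claim about its speed relative to the methods above):

make_set <- function(x) {
  e <- new.env(hash = TRUE, size = length(x))
  for (v in x) assign(as.character(v), TRUE, envir = e)   # key is the value's string form
  e
}
set_has <- function(e, v) exists(as.character(v), envir = e, inherits = FALSE)

s <- make_set(sample(10^5))
set_has(s, 42L)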
My tests aren't bearing out all of your claims, but that seems (?) to be due to cross-platform differences (which makes the question even more mysterious, and possibly worth taking up on r-devel@r-project.org, although maybe not, since the fastmatch solution below dominates anyway ...)
n <- 10^3; Lim <- 10^5
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
library("rbenchmark")
library("fastmatch")
`%fin%` <- function(x, table) {
  stopifnot(require(fastmatch))
  fmatch(x, table, nomatch = 0L) > 0L
}
benchmark(which=sapply(samp1,function(x) which(set1==x)),
infun=sapply(samp1,function(x) x %in% set1),
fin= sapply(samp1,function(x) x %fin% set1),
brc= sapply(samp1,BinVecCheck,vec=sort(set1)),
replications=20,
columns = c("test", "replications", "elapsed", "relative"))
## test replications elapsed relative
## 4 brc 20 0.871 2.329
## 3 fin 20 0.374 1.000
## 2 infun 20 6.480 17.326
## 1 which 20 10.634 28.433
This says that %in% is about twice as fast as which -- your BinVecCheck function is 7x better, but the fastmatch solution from here gets another factor of 2. I don't know if a specialized Rcpp implementation could do better or not ...
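For what it's worth, a sketch of what such an Rcpp version might look like (not benchmarked here; the function name is made up, the table is assumed to be sorted, and it leans on std::binary_search):

library(Rcpp)
cppFunction(includes = "#include <algorithm>", code = '
LogicalVector bin_in(IntegerVector x, IntegerVector table) {
  // table must be sorted; std::binary_search gives an O(log n) membership test
  LogicalVector out(x.size());
  for (int i = 0; i < x.size(); ++i)
    out[i] = std::binary_search(table.begin(), table.end(), x[i]);
  return out;
}')
sorted1 <- sort(set1)
all(bin_in(samp1, sorted1) == (samp1 %in% sorted1))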
In fact, I get different answers even when running your code:
## user system elapsed (which)
## 0.488 0.096 0.586
## user system elapsed (%in%)
## 0.184 0.132 0.315
## user system elapsed (is.element)
## 0.188 0.124 0.313
## user system elapsed (match)
## 0.148 0.164 0.312
## user system elapsed (BinVecCheck)
## 0.048 0.008 0.055
update: on r-devel Peter Dalgaard explains the platform discrepancy (which is an R version difference, not an OS difference) by pointing to the R NEWS entry:
match(x, table) is faster, sometimes by an order of magnitude, when x is of length one and incomparables is unchanged, thanks to Haverty's PR#16491.
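A quick way to check this on your own installation (assuming the microbenchmark package is available):

library(microbenchmark)
set.seed(101)
tab <- sample(10^5)
x1 <- tab[1]
microbenchmark(match(x1, tab), x1 %in% tab, which(tab == x1), times = 100L)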
sessionInfo()
## R Under development (unstable) (2015-10-23 r69563)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu precise (12.04.5 LTS)
%in% is just sugar for match, and is defined as:
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
Both match and which are low level (compiled C) functions called by .Internal(). You can actually see the source code by using the pryr package:
install.packages("pryr")
library(pryr)
pryr::show_c_source(.Internal(which(x)))
pryr::show_c_source(.Internal(match(x, table, nomatch, incomparables)))
You would be pointed to this page for which and this page for match.
which does not perform any of the casting, checks etc that match performs. This might explain its higher performance in your tests (but I haven't tested your results myself).
After many days researching this topic, I have found that the fastest method of determining existence depends on the number of elements being tested. From the answer given by @ben-bolker, %fin% looks like the clear-cut winner. This seems to be the case when the number of elements being tested (all elements in samp1) is small compared to the size of the vector (set1). Before we go any further, let's look at the binary search algorithm above.
First of all, the very first line in the original algorithm has an extremely low probability of evaluating to TRUE, so why check it every time?
if (tar==vec[1] || tar==vec[size]) {return(TRUE)}
Instead, I put this statement inside the else clause at the very end.
Secondly, determining the size of the vector every time is redundant, especially when I know the length of the test vector (set1) ahead of time. So, I added size as an argument to the algorithm and simply pass it as a variable. Below is the modified binary search code.
ModifiedBinVecCheck <- function(tar, vec, size) {
  size2 <- trunc(size/2)
  dist <- (tar - vec[size2])
  if (dist > 0) {
    lower <- size2 - 1L
    upper <- size
  } else {
    lower <- 1L
    upper <- size2 + 1L
  }
  while (size2 > 1 && !(dist == 0)) {
    size2 <- trunc((upper - lower)/2)
    temp <- lower + size2
    dist <- (tar - vec[temp])
    if (dist > 0) {
      lower <- temp - 1L
    } else {
      upper <- temp + 1L
    }
  }
  if (dist == 0) {
    return(TRUE)
  } else {
    if (tar == vec[1] || tar == vec[size]) {return(TRUE)} else {return(FALSE)}
  }
}
As we know, in order to use a binary search, your vector must be sorted, which costs time. The default sorting method for sort is shell, which can be used on all data types but is (generally speaking) slower than the quick method (quick can only be used on doubles or integers). With quick as the sorting method (since we are dealing with numbers), combined with the modified binary search, we get a significant performance increase over the old binary search, depending on the case. It should be noted that fmatch improves on match only when the data type is integer, real, or character.
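The difference between the two sort methods is easy to see on its own (illustrative timings only; they will vary by machine):

set.seed(101)
x <- sample(10^6)
system.time(sort(x))                    # default method
system.time(sort(x, method = "quick"))  # quicksort; only for doubles and integers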
Now, let's look at some test cases with differing sizes of n.
Case 1 (n = 10^3 & Lim = 10^6, so the n-to-Lim ratio is 1:1000):
n <- 10^3; Lim <- 10^6
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
benchmark(fin= sapply(samp1,function(x) x %fin% set1),
brc= sapply(samp1,ModifiedBinVecCheck,vec=sort(set1, method = "quick"),size=Lim),
oldbrc= sapply(samp1,BinVecCheck,vec=sort(set1)),
replications=10,
columns = c("test", "replications", "elapsed", "relative"))
test replications elapsed relative
2 brc 10 0.97 4.217
1 fin 10 0.23 1.000
3 oldbrc 10 1.45 6.304
Case 2 (n = 10^4 & Lim = 10^6, so the n-to-Lim ratio is 1:100):
n <- 10^4; Lim <- 10^6
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
benchmark(fin= sapply(samp1,function(x) x %fin% set1),
brc= sapply(samp1,ModifiedBinVecCheck,vec=sort(set1, method = "quick"),size=Lim),
oldbrc= sapply(samp1,BinVecCheck,vec=sort(set1)),
replications=10,
columns = c("test", "replications", "elapsed", "relative"))
test replications elapsed relative
2 brc 10 2.08 1.000
1 fin 10 2.16 1.038
3 oldbrc 10 2.57 1.236
Case 3 (n = 10^5 & Lim = 10^6, so the n-to-Lim ratio is 1:10):
n <- 10^5; Lim <- 10^6
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
benchmark(fin= sapply(samp1,function(x) x %fin% set1),
brc= sapply(samp1,ModifiedBinVecCheck,vec=sort(set1, method = "quick"),size=Lim),
oldbrc= sapply(samp1,BinVecCheck,vec=sort(set1)),
replications=10,
columns = c("test", "replications", "elapsed", "relative"))
test replications elapsed relative
2 brc 10 13.13 1.000
1 fin 10 21.23 1.617
3 oldbrc 10 13.93 1.061
Case 4 (n = 10^6 & Lim = 10^6, so the n-to-Lim ratio is 1:1):
n <- 10^6; Lim <- 10^6
set.seed(101)
samp1 <- sample(Lim,n)
set1 <- sample(Lim,Lim)
benchmark(fin= sapply(samp1,function(x) x %fin% set1),
brc= sapply(samp1,ModifiedBinVecCheck,vec=sort(set1, method = "quick"),size=Lim),
oldbrc= sapply(samp1,BinVecCheck,vec=sort(set1)),
replications=10,
columns = c("test", "replications", "elapsed", "relative"))
test replications elapsed relative
2 brc 10 124.61 1.000
1 fin 10 214.20 1.719
3 oldbrc 10 127.39 1.022
As you can see, as n gets large relative to Lim, the efficiency of the binary search (both versions) starts to dominate. In Case 1, %fin% was over 4x faster than the modified binary search, in Case 2 there was almost no difference, in Case 3 the binary searches really start to pull ahead, and in Case 4 the modified binary search is about 1.7x as fast as %fin%.
Thus, to answer the question "Which method is faster?": it depends. %fin% is faster when the number of elements to check is small relative to the reference vector, and ModifiedBinVecCheck is faster when that number is large.
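One further note: everything above feeds the elements through sapply one at a time. If all of the elements to check are known up front, a single vectorized call (with fastmatch loaded as above) avoids that per-element overhead entirely; a sketch, not benchmarked here:

res_fin <- fmatch(samp1, set1, nomatch = 0L) > 0L   # one hash build, one probe pass
res_in  <- samp1 %in% set1                          # base-R equivalent
identical(res_fin, res_in)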
any( x == "foo" ) should be plenty fast if you can be sure that x is free of NAs. If you may have NAs, R 3.3 has a speedup for "%in%" that will help.
For binary search, see findInterval before rolling your own. This doesn't sound like a job for binary search unless x is constant and sorted.
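If the reference vector really is sorted, a membership test built on findInterval might look like this (a quick sketch, not benchmarked):

in_sorted <- function(x, vec) {
  idx <- findInterval(x, vec)           # index of the largest element of vec that is <= x (0 if none)
  idx > 0L & vec[pmax(idx, 1L)] == x    # TRUE only where that element actually equals x
}
sorted <- sort(sample(10^5))
in_sorted(c(1L, 17L, 200000L), sorted)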
I've struggled long enough on my own to find an answer. I promise I'll try to learn from the solutions. For the sake of learning, I would like to understand how to do it with explicit loops, but if you want to share a vectorized approach as a bonus that is also greatly appreciated.
Say I'm going to play a game once each day and I know the probability of victory each day. I want a function which takes that vector of probabilities and returns the cumulative probability of success on at least one day. So if I play for 3 days in a row and the probability of victory is 0.5 each day, my function should return "0.5, 0.75, 0.875".
Here is my most recent failed attempt at writing this function:
prob_cum <- function(prob_today) {
  p_cum <- rep(0, length(prob_today))
  for (i in 1:length(prob_today)) {
    for (j in i:length(prob_today)) {
      p_cum[j] <- p_cum[j-1] - ((1 - p_cum[j-1]) * prob_today[j])
    }
  }
  p_cum
}
prob_daily <- c(.5,.5,.5)
prob_cum(prob_daily)
> 1 - cumprod( 1- c(0.5,0.5,0.5) )
[1] 0.500 0.750 0.875
# (1- prob_success) is the prob_non_success vector
Easily wrapped into a function if needed. Your initial test was not a good one because it did not expose my original error of not subtracting the success vector from 1 inside the cumprod argument.
vec<-runif(100)
prob_cum <- function(prob_today) {
  p_cum <- rep(0, length(prob_today))
  p_cum[1] <- prob_today[1]
  for (j in seq_along(prob_today)[-1]) {
    p_cum[j] <- p_cum[j-1] + ((1 - p_cum[j-1]) * prob_today[j])
  }
  p_cum
}
Prob_vec <- function(vec) 1 - cumprod( 1- vec)
require(rbenchmark)
benchmark( prob_cum(vec) , Prob_vec(vec) ,replications=1000)
# test replications elapsed relative user.self sys.self user.child sys.child
#1 prob_cum(vec) 1000 0.538 59.778 0.532 0.008 0 0
#2 Prob_vec(vec) 1000 0.009 1.000 0.008 0.002 0 0
Working through each problem one at a time:
You have a loop over i which doesn't do anything; it just performs the same calculations multiple times and each time overwrites the results (with the same results). Drop that.
prob_cum <- function(prob_today) {
  p_cum <- rep(0, length(prob_today))
  for (j in 1:length(prob_today)) {
    p_cum[j] <- p_cum[j-1] - ((1 - p_cum[j-1]) * prob_today[j])
  }
  p_cum
}
This still has problems. For j=1, you try to access p_cum[0] which is a zero-length vector and your calculation assumes a one-length vector. That is why you get the error message
Error in p_cum[j] <- p_cum[j - 1] - ((1 - p_cum[j - 1]) * prob_today[j]) :
replacement has length zero
Initialize p_cum[1] and then loop over the rest.
prob_cum <- function(prob_today) {
  p_cum <- rep(0, length(prob_today))
  p_cum[1] <- prob_today[1]
  for (j in 2:length(prob_today)) {
    p_cum[j] <- p_cum[j-1] - ((1 - p_cum[j-1]) * prob_today[j])
  }
  p_cum
}
This looping construct is potentially dangerous. It works so long as prob_today has length at least 2, but behaves unexpectedly if the length is 1 (2:length(prob_today) then evaluates to c(2, 1), so the loop runs even though there is nothing past the first element). Better is
prob_cum <- function(prob_today) {
  p_cum <- rep(0, length(prob_today))
  p_cum[1] <- prob_today[1]
  for (j in seq_along(prob_today)[-1]) {
    p_cum[j] <- p_cum[j-1] - ((1 - p_cum[j-1]) * prob_today[j])
  }
  p_cum
}
Now we get to a real problem: your algorithm is wrong. The probability of getting at least one win by day j is the probability of at least one win by day j-1 PLUS the probability that there was no win through day j-1 times the probability of a win on day j. You have a minus.
prob_cum <- function(prob_today) {
  p_cum <- rep(0, length(prob_today))
  p_cum[1] <- prob_today[1]
  for (j in seq_along(prob_today)[-1]) {
    p_cum[j] <- p_cum[j-1] + ((1 - p_cum[j-1]) * prob_today[j])
  }
  p_cum
}
Now you have a function that works:
> prob_cum(prob_daily)
[1] 0.500 0.750 0.875
> prob_cum(c(0.5, 0.01, 0.99))
[1] 0.50000 0.50500 0.99505
The fully vectorized solution follows from expressing the probability differently. The probability of getting at least one win is 1 minus the probability of getting all losses up to that day. Those are independent probabilities, so are just a product of getting a loss each day.
prob_cum <- function(prob_today) {
  1 - cumprod(1 - prob_today)
}
which gives the same results
> prob_cum(prob_daily)
[1] 0.500 0.750 0.875
> prob_cum(c(0.5, 0.01, 0.99))
[1] 0.50000 0.50500 0.99505
and works for single values and empty vectors without any additional adjustments needed
> prob_cum(c(0.75))
[1] 0.75
> prob_cum(c())
numeric(0)
I want to know how to vectorize and memoize a custom function in R. It seems my way of thinking is not aligned with R's way of operating, so I gladly welcome any links to good reading material. For example, The R Inferno is a nice resource, but it didn't help me figure out memoization in R.
More generally, can you provide a relevant usage example for the memoise
or R.cache packages?
I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions. I emailed the mailing list and did not receive a complete
answer.
I am not solely interested in memoizing the GC function, and I am aware of
Bioconductor and the various packages
available there.
Here's my data:
seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
Some sequences are missing, so they're blank "".
I have a function for calculating GC content:
GC <- function(s) {
  if (!is.character(s)) return(NA)
  n <- nchar(s)
  if (n == 0) return(NA)
  m <- gregexpr('[GCSgcs]', s)[[1]]
  if (m[1] < 1) return(0)
  return(100.0 * length(m) / n)
}
It works:
> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
G C CCC T TTCCT
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA
C CTC
100.00000 66.66667
I want to memoize it. Then, I want to vectorize it.
Apparently, I must have the wrong mindset for using the memoise or
R.cache R packages:
> system.time(dummy <- sapply(rep(seqs,100), GC))
user system elapsed
0.044 0.000 0.054
>
> library(memoise)
> GCm1 <- memoise(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm1))
user system elapsed
0.164 0.000 0.173
>
> library(R.cache)
> GCm2 <- addMemoization(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm2))
user system elapsed
10.601 0.252 10.926
Notice that the memoized functions are several orders of magnitude slower.
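Presumably the per-call hashing overhead swamps GC itself, which is cheap. As a sanity check, memoising something deliberately expensive does pay off (the 0.1-second sleep below is just a stand-in for real work):

library(memoise)
slow_gc <- function(s) { Sys.sleep(0.1); GC(s) }   # artificially expensive version of GC
slow_m <- memoise(slow_gc)
system.time(slow_m("TTCCT"))   # first call: ~0.1 s, computed and cached
system.time(slow_m("TTCCT"))   # second call: near-instant, served from the cache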
I tried the hash package, but things seem to be happening behind the
scenes and I don't understand the output. The sequence C should have a
value of 100, not NULL.
Note that using has.key(s, cache) instead of exists(s, cache) results
in the same output. Also, using cache[s] <<- result instead of
cache[[s]] <<- result results in the same output.
library(hash)
cache <- hash()
GCc <- function(s) {
  if (!is.character(s) || nchar(s) == 0) {
    return(NA)
  }
  if (exists(s, cache)) {
    return(cache[[s]])
  }
  result <- GC(s)
  cache[[s]] <<- result
  return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA
$G
[1] 100
$C
NULL
$CCC
[1] 100
$T
NULL
[[6]]
[1] NA
$TTCCT
[1] 40
[[8]]
[1] NA
$C
NULL
$CTC
[1] 66.66667
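With a plain hashed environment (instead of the hash package) I can at least get the behaviour I expected; inherits = FALSE stops exists() from picking up unrelated objects such as base::C or T from enclosing environments, though I am not sure whether that is also what goes wrong with the hash package:

GC_cache <- new.env(hash = TRUE, parent = emptyenv())
GCe <- function(s) {
  if (!is.character(s) || nchar(s) == 0) return(NA)
  if (exists(s, envir = GC_cache, inherits = FALSE))
    return(get(s, envir = GC_cache, inherits = FALSE))
  result <- GC(s)
  assign(s, result, envir = GC_cache)
  result
}
sapply(seqs, GCe)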
At least I figured out how to vectorize:
> GCv <- Vectorize(GC)
> GCv(seqs)
G C CCC T TTCCT
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA
C CTC
100.00000 66.66667
Relevant stackoverflow posts:
Options for caching / memoization / hashing in R
While this won't give you memoization across calls, you can use factors to make individual calls a lot faster if there is a fair bit of repetition, e.g. using Joshua's GC2 (though I had to remove fixed=T to get it to work):
GC2 <- function(s) {
  if (!is.character(s)) stop("'s' must be character")
  n <- nchar(s)
  m <- gregexpr('[GCSgcs]', s)
  len <- sapply(m, length)
  neg <- sapply(m, "[[", 1)
  len <- len * (neg > 0)
  100.0 * len/n
}
One can easily define a wrapper like:
GC3 <- function(s) {
  x <- factor(s)
  GC2(levels(x))[x]
}
system.time(GC2(rep(seqs, 50000)))
# user system elapsed
# 8.97 0.00 8.99
system.time(GC3(rep(seqs, 50000)))
# user system elapsed
# 0.06 0.00 0.06
This doesn't explicitly answer your question, but this function is ~4 times faster than yours.
GC2 <- function(s) {
  if (!is.character(s)) stop("'s' must be character")
  n <- nchar(s)
  m <- gregexpr('[GCSgcs]', s)
  len <- sapply(m, length)
  neg <- sapply(m, "[[", 1)
  len <- len * (neg > 0)
  len/n
}
I'm trying to speed up/vectorize some calculations in a time series.
Can I vectorize a calculation in a for loop which can depend on results from an earlier iteration? For example:
z <- c(1,1,0,0,0,0)
zi <- 2:6
for (i in zi) {z[i] <- ifelse (z[i-1]== 1, 1, 0) }
uses the z[i] values updated in earlier steps:
> z
[1] 1 1 1 1 1 1
In my effort at vectorizing this
z <- c(1,1,0,0,0,0)
z[zi] <- ifelse( z[zi-1] == 1, 1, 0)
the element-by-element operations don't use results updated in the operation:
> z
[1] 1 1 1 0 0 0
So this vectorized operation operates in 'parallel' rather than iterative fashion. Is there a way I can write/vectorize this to get the results of the for loop?
ifelse is vectorized and there's a bit of a penalty if you're using it on one element at a time in a for-loop. In your example, you can get a pretty good speedup by using if instead of ifelse.
fun1 <- function(z) {
  for (i in 2:NROW(z)) {
    z[i] <- ifelse(z[i-1] == 1, 1, 0)
  }
  z
}
fun2 <- function(z) {
  for (i in 2:NROW(z)) {
    z[i] <- if (z[i-1] == 1) 1 else 0
  }
  z
}
z <- c(1,1,0,0,0,0)
identical(fun1(z),fun2(z))
# [1] TRUE
system.time(replicate(10000, fun1(z)))
# user system elapsed
# 1.13 0.00 1.32
system.time(replicate(10000, fun2(z)))
# user system elapsed
# 0.27 0.00 0.26
You can get some additional speed gains out of fun2 by compiling it.
library(compiler)
cfun2 <- cmpfun(fun2)
system.time(replicate(10000, cfun2(z)))
# user system elapsed
# 0.11 0.00 0.11
So there's a 10x speedup without vectorization. As others have said (and some have illustrated) there are ways to vectorize your example, but that may not translate to your actual problem. Hopefully this is general enough to be applicable.
The filter function may be useful to you as well if you can figure out how to express your problem in terms of an autoregressive or moving-average process.
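For instance, filter with method = "recursive" vectorises linear recursions of the form y[i] = x[i] + a*y[i-1]. The 0/1 rule above is not linear, so treat this purely as an illustration of the mechanism:

x <- c(1, 0, 0, 2, 0, 0)
a <- 0.5
stats::filter(x, filter = a, method = "recursive", init = 0)
# the same recursion written as an explicit loop, for comparison
y <- numeric(length(x)); prev <- 0
for (i in seq_along(x)) { y[i] <- x[i] + a * prev; prev <- y[i] }
y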
This is a nice and simple example where Rcpp can shine.
So let us first recast functions 1 and 2 and their compiled counterparts:
library(inline)
library(rbenchmark)
library(compiler)
fun1 <- function(z) {
  for (i in 2:NROW(z)) {
    z[i] <- ifelse(z[i-1] == 1, 1, 0)
  }
  z
}
fun1c <- cmpfun(fun1)

fun2 <- function(z) {
  for (i in 2:NROW(z)) {
    z[i] <- if (z[i-1] == 1) 1 else 0
  }
  z
}
fun2c <- cmpfun(fun2)
We write a Rcpp variant very easily:
funRcpp <- cxxfunction(signature(zs="numeric"), plugin="Rcpp", body="
  Rcpp::NumericVector z = Rcpp::NumericVector(zs);
  int n = z.size();
  for (int i=1; i<n; i++) {
    z[i] = (z[i-1]==1.0 ? 1.0 : 0.0);
  }
  return(z);
")
This uses the inline package to compile, load and link the five-liner on the fly.
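As an aside, the same kernel can be compiled via Rcpp attributes instead of inline (a sketch, assuming a reasonably current Rcpp). The clone() keeps the input vector untouched; the inline version above writes into the caller's vector, because NumericVector(zs) merely wraps the incoming data rather than copying it:

library(Rcpp)
cppFunction('
NumericVector funRcpp2(NumericVector zs) {
  NumericVector z = clone(zs);            // work on a copy, leave the input alone
  for (int i = 1; i < z.size(); i++)
    z[i] = (z[i-1] == 1.0 ? 1.0 : 0.0);
  return z;
}')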
Now we can define our test data, which we make a little longer than the original (running the original only a few times yields times too short to measure):
R> z <- rep(c(1,1,0,0,0,0), 100)
R> identical(fun1(z),fun2(z),fun1c(z),fun2c(z),funRcpp(z))
[1] TRUE
R>
All answers are seen as identical.
Finally, we can benchmark:
R> res <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications", "elapsed",
+ "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=1000)
R> print(res)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 1000 0.005 1.0 0.01 0
4 fun2c(z) 1000 0.466 93.2 0.46 0
2 fun2(z) 1000 1.918 383.6 1.92 0
3 fun1c(z) 1000 10.865 2173.0 10.86 0
1 fun1(z) 1000 12.480 2496.0 12.47 0
The compiled version wins by a factor of almost 400 against the best R version, and almost 100 against its byte-compiled variant. For function 1, the byte compilation matters much less and both variants trail the C++ by a factor of well over two-thousand.
It took about one minute to write the C++ version. The speed gain suggests it was a minute well spent.
For comparison, here is the result for the original short vector called more often:
R> z <- c(1,1,0,0,0,0)
R> res2 <- benchmark(fun1(z), fun2(z),
+ fun1c(z), fun2c(z),
+ funRcpp(z),
+ columns=c("test", "replications",
+ "elapsed", "relative", "user.self", "sys.self"),
+ order="relative",
+ replications=10000)
R> print(res2)
test replications elapsed relative user.self sys.self
5 funRcpp(z) 10000 0.046 1.000000 0.04 0
4 fun2c(z) 10000 0.132 2.869565 0.13 0
2 fun2(z) 10000 0.271 5.891304 0.27 0
3 fun1c(z) 10000 1.045 22.717391 1.05 0
1 fun1(z) 10000 1.202 26.130435 1.20 0
The qualitative ranking is unchanged: the Rcpp version dominates, and function 2 is second best, with its byte-compiled variant about twice as fast as the plain R version but still almost three times slower than the C++ version. The relative differences are smaller, though: relatively speaking, the function call overhead matters more here and the actual looping matters less, so C++ gains a bigger advantage on the loop work in the longer vectors. That is an important result, as it suggests that on more realistically sized data the compiled version may reap a larger benefit.
Edited to correct two small oversights in the code examples. And edited again with thanks to Josh to catch a setup error relative to fun2c.
I think this is cheating and not generalizable, but: according to the rules you have above, any occurrence of 1 in the vector will make all subsequent elements 1 (by recursion: z[i] is set to 1 if z[i-1] equals 1; therefore z[i] will be set to 1 if z[i-2] equals 1; and so forth). Depending on what you really want to do, there may be such a recursive solution available if you think carefully about it ...
z <- c(1,1,0,0,0,0)
first1 <- min(which(z==1))
z[seq_along(z)>first1] <- 1
edit: this is wrong, but I'm leaving it up to admit my mistakes. Based on a little bit of playing (and less thinking), I think the actual solution to this recursion is more symmetric and even simpler:
rep(z[1],length(z))
Test cases:
z <- c(1,1,0,0,0,0)
z <- c(0,1,1,0,0,0)
z <- c(0,0,1,0,0,0)
Check out the rollapply function in zoo.
I'm not super familiar with it, but I think this does what you want:
> c( 1, rollapply(z,2,function(x) x[1]) )
[1] 1 1 1 1 1 1
I'm sort of kludging it by using a window of 2 and then only using the first element of that window.
For more complicated examples you could perform some calculation on x[1] and return that instead.
Sometimes you just need to think about it totally differently. What you're doing is creating a vector where every element is 1 if the first element is 1, and 0 otherwise.
z <- c(1,1,0,0,0,0)
if (z[1] != 1) z[1] <- 0
z[2:length(z)] <- z[1]
There is a function that does this particular calculation: cumprod (cumulative product)
> cumprod(z[zi])
[1] 1 0 0 0 0
> cumprod(c(1,2,3,4,0,5))
[1] 1 2 6 24 0 0
Otherwise, vectorize with Rcpp as other answers have shown.
It's also possible to do this with "apply" using the original vector and a lagged version of the vector as the constituent columns of a data frame.
As a matter of best practice, I'm trying to determine whether it's better to create a function and apply() it across a matrix, or simply to loop the matrix through the function. I tried it both ways and was surprised to find apply() is slower. The task is to take a vector and evaluate each element as either positive or negative, returning 1 if it's positive and -1 if it's negative. The mash() function loops and the squish() function is passed to the apply() function.
million <- as.matrix(rnorm(100000))
mash <- function(x) {
  for (i in 1:NROW(x))
    if (x[i] > 0) {
      x[i] <- 1
    } else {
      x[i] <- -1
    }
  return(x)
}

squish <- function(x) {
  if (x > 0) {
    return(1)
  } else {
    return(-1)
  }
}
ptm <- proc.time()
loop_million <- mash(million)
proc.time() - ptm
ptm <- proc.time()
apply_million <- apply(million,1, squish)
proc.time() - ptm
loop_million results:
user system elapsed
0.468 0.008 0.483
apply_million results:
user system elapsed
1.401 0.021 1.423
What is the advantage to using apply() over a for loop if performance is degraded? Is there a flaw in my test? I compared the two resulting objects for a clue and found:
> class(apply_million)
[1] "numeric"
> class(loop_million)
[1] "matrix"
Which only deepens the mystery. The apply() function cannot accept a simple numeric vector, which is why I cast it with as.matrix() in the beginning. But then it returns a numeric. The for loop is fine with a simple numeric vector, and it returns an object of the same class as the one passed to it.
The point of the apply (and plyr) family of functions is not speed, but expressiveness. They also tend to prevent bugs because they eliminate the bookkeeping code needed with loops.
Lately, answers on stackoverflow have over-emphasised speed. Your code will get faster on its own as computers get faster and R-core optimises the internals of R. Your code will never get more elegant or easier to understand on its own.
In this case you can have the best of both worlds: an elegant answer using vectorisation that is also very fast, (million > 0) * 2 - 1.
As Chase said: Use the power of vectorization. You're comparing two bad solutions here.
To clarify why your apply solution is slower:
Within the for loop, you actually use the vectorized indices of the matrix, meaning there is no type conversion going on. I'm glossing over the details a bit here, but basically the internal calculation more or less ignores the dimensions; they're just kept as an attribute and returned with the vector representing the matrix. To illustrate:
> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE
Now, when you use apply, the matrix is split up internally into 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the results are combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.
The sapply function also first uses as.vector(unlist(...)) to convert everything to a vector, and at the end tries to simplify the answer into a suitable form. That takes time as well, so sapply might be slower here too. Yet it isn't on my machine.
If apply were a solution here (and it isn't), you could compare:
> system.time(loop_million <- mash(million))
user system elapsed
0.75 0.00 0.75
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=F))))
user system elapsed
0.25 0.00 0.25
> system.time(sapply2_million <- matrix(sapply(million,squish)))
user system elapsed
0.34 0.00 0.34
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE
You can use lapply or sapply on vectors if you want. However, why not use the appropriate tool for the job, in this case ifelse()?
> ptm <- proc.time()
> ifelse_million <- ifelse(million > 0,1,-1)
> proc.time() - ptm
user system elapsed
0.077 0.007 0.093
> all.equal(ifelse_million, loop_million)
[1] TRUE
And for comparison's sake, here are the two comparable runs using the for loop and sapply:
> ptm <- proc.time()
> apply_million <- sapply(million, squish)
> proc.time() - ptm
user system elapsed
0.469 0.004 0.474
> ptm <- proc.time()
> loop_million <- mash(million)
> proc.time() - ptm
user system elapsed
0.408 0.001 0.417
It is far faster in this case to do index-based replacement than either the ifelse(), the *apply() family, or the loop:
> million <- million2 <- as.matrix(rnorm(100000))
> system.time(million3 <- ifelse(million > 0, 1, -1))
user system elapsed
0.046 0.000 0.044
> system.time({million2[(want <- million2 > 0)] <- 1; million2[!want] <- -1})
user system elapsed
0.006 0.000 0.007
> all.equal(million2, million3)
[1] TRUE
It is well worth having all these tools at your fingertips. You can use the one that makes the most sense to you (as you need to understand the code months or years later) and then start to move to more optimised solutions if compute time becomes prohibitive.
A better example of the speed advantage of the for loop:
for_loop <- function(x) {
  out <- vector(mode = "numeric", length = NROW(x))
  for (i in seq(length(out)))
    out[i] <- max(x[i, ])
  return(out)
}

apply_loop <- function(x) {
  apply(x, 1, max)
}
million <- matrix(rnorm(1000000),ncol=10)
> system.time(apply_loop(million))
user system elapsed
0.57 0.00 0.56
> system.time(for_loop(million))
user system elapsed
0.32 0.00 0.33
EDIT
Version suggested by Eduardo.
max_col <- function(x) {
  # row-wise maxima via max.col() plus matrix indexing with cbind(row, column)
  x[cbind(seq(NROW(x)), max.col(x))]
}
By row
> system.time(for_loop(million))
user system elapsed
0.99 0.00 1.11
> system.time(apply_loop(million))
user system elapsed
1.40 0.00 1.44
> system.time(max_col(million))
user system elapsed
0.06 0.00 0.06
By column
> system.time(for_loop(t(million)))
user system elapsed
0.05 0.00 0.05
> system.time(apply_loop(t(million)))
user system elapsed
0.07 0.00 0.07
> system.time(max_col(t(million)))
user system elapsed
0.04 0.00 0.06