...regarding execution time and / or memory.
If this is not true, prove it with a code snippet. Note that speedup by vectorization does not count. The speedup must come from apply (tapply, sapply, ...) itself.
The apply functions in R don't provide improved performance over other looping functions (e.g. for). One exception to this is lapply which can be a little faster because it does more work in C code than in R (see this question for an example of this).
But in general, the rule is that you should use an apply function for clarity, not for performance.
I would add to this that apply functions have no side effects, which is an important distinction when it comes to functional programming with R. This can be overridden by using assign or <<-, but that can be very dangerous. Side effects also make a program harder to understand, since a variable's state depends on its history.
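A quick sketch of that point (toy values, just for illustration): a plain local assignment inside the function leaves the outer variable alone, while <<- reaches out and modifies it.
total <- 0
invisible(sapply(1:5, function(i) total <- total + i))   # only a local copy changes
total
# [1] 0
invisible(sapply(1:5, function(i) total <<- total + i))  # <<- reintroduces a side effect
total
# [1] 15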
Edit:
Just to emphasize this with a trivial example that recursively calculates the Fibonacci sequence; this could be run multiple times to get an accurate measure, but the point is that none of the methods have significantly different performance:
> fibo <- function(n) {
+   if ( n < 2 ) n
+   else fibo(n-1) + fibo(n-2)
+ }
> system.time(for(i in 0:26) fibo(i))
user system elapsed
7.48 0.00 7.52
> system.time(sapply(0:26, fibo))
user system elapsed
7.50 0.00 7.54
> system.time(lapply(0:26, fibo))
user system elapsed
7.48 0.04 7.54
> library(plyr)
> system.time(ldply(0:26, fibo))
user system elapsed
7.52 0.00 7.58
Edit 2:
Regarding the use of parallel packages for R (e.g. rpvm, Rmpi, snow), these generally do provide apply-family functions (even the foreach package is essentially equivalent, despite the name). Here's a simple example of the sapply function in snow:
library(snow)
cl <- makeSOCKcluster(c("localhost","localhost"))
parSapply(cl, 1:20, get("+"), 3)
This example uses a socket cluster, for which no additional software needs to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page). snow has the following apply functions:
parLapply(cl, x, fun, ...)
parSapply(cl, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
parApply(cl, X, MARGIN, FUN, ...)
parRapply(cl, x, fun, ...)
parCapply(cl, x, fun, ...)
It makes sense that apply functions should be used for parallel execution since they have no side effects. When you change a variable value within a for loop, it is globally set. On the other hand, all apply functions can safely be used in parallel because changes are local to the function call (unless you try to use assign or <<-, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.
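A brief sketch of that local-vs-global point (reusing a two-worker socket cluster as above; offset is just an illustrative name): objects in the master's workspace are not automatically visible on the workers, so anything the function needs must either be passed as an argument or exported explicitly.
library(snow)
cl <- makeSOCKcluster(c("localhost", "localhost"))  # or reuse the cluster created above

offset <- 10                 # exists only in the master's workspace
clusterExport(cl, "offset")  # ship it to each worker explicitly
parSapply(cl, 1:5, function(i) i + offset)
# [1] 11 12 13 14 15

stopCluster(cl)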
Edit:
Here's a trivial example to demonstrate the difference between for and *apply so far as side effects are concerned:
> df <- 1:10
> # *apply example
> lapply(2:3, function(i) df <- df * i)
> df
[1] 1 2 3 4 5 6 7 8 9 10
> # for loop example
> for(i in 2:3) df <- df * i
> df
[1] 6 12 18 24 30 36 42 48 54 60
Note how the df in the parent environment is altered by for but not *apply.
Sometimes the speedup can be substantial, like when you have to nest for loops to get the average based on a grouping of more than one factor. Here you have two approaches that give you the exact same result:
set.seed(1) # for reproducibility of the results
# The data
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=T))
Z <- as.factor(sample(letters[1:10],100000,replace=T))
# the function forloop that averages X over every combination of Y and Z
forloop <- function(x, y, z){
  # These ones are for optimization, so the functions
  # levels() and length() don't have to be called more than once.
  ylev <- levels(y)
  zlev <- levels(z)
  n <- length(ylev)
  p <- length(zlev)
  out <- matrix(NA, ncol = p, nrow = n)
  for(i in 1:n){
    for(j in 1:p){
      out[i, j] <- mean(x[y == ylev[i] & z == zlev[j]])
    }
  }
  rownames(out) <- ylev
  colnames(out) <- zlev
  return(out)
}
# Used on the generated data
forloop(X,Y,Z)
# The same using tapply
tapply(X,list(Y,Z),mean)
Both give exactly the same result: a 5 x 10 matrix of averages with named rows and columns. But:
> system.time(forloop(X,Y,Z))
user system elapsed
0.94 0.02 0.95
> system.time(tapply(X,list(Y,Z),mean))
user system elapsed
0.06 0.00 0.06
There you go. What did I win? ;-)
...and as I just wrote elsewhere, vapply is your friend!
...it's like sapply, but you also specify the return value type which makes it much faster.
foo <- function(x) x+1
y <- numeric(1e6)
system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
# user system elapsed
# 3.54 0.00 3.53
system.time(z <- lapply(y, foo))
# user system elapsed
# 2.89 0.00 2.91
system.time(z <- vapply(y, foo, numeric(1)))
# user system elapsed
# 1.35 0.00 1.36
Jan. 1, 2020 update:
system.time({z1 <- numeric(1e6); for(i in seq_along(y)) z1[i] <- foo(y[i])})
# user system elapsed
# 0.52 0.00 0.53
system.time(z <- lapply(y, foo))
# user system elapsed
# 0.72 0.00 0.72
system.time(z3 <- vapply(y, foo, numeric(1)))
# user system elapsed
# 0.7 0.0 0.7
identical(z1, z3)
# [1] TRUE
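Beyond the speed difference, the pre-declared template is also what makes vapply safer than sapply; a small illustration (toy function; the exact error wording may differ across R versions):
bar <- function(x) if (x > 0) "positive" else x   # sometimes returns a character

sapply(c(-1, 1), bar)               # silently coerces everything to character
try(vapply(c(-1, 1), bar, numeric(1)))
# Error: values must be type 'double', but FUN(X[[2]]) result is type 'character'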
I've written elsewhere that an example like Shane's doesn't really stress the difference in performance among the various kinds of looping syntax, because the time is all spent within the function rather than actually stressing the loop. Furthermore, the code unfairly compares a for loop that discards its results with apply family functions that return a value. Here's a slightly different example that emphasizes the point.
foo <- function(x) {
  x <- x + 1
}
y <- numeric(1e6)
system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
# user system elapsed
# 4.967 0.049 7.293
system.time(z <- sapply(y, foo))
# user system elapsed
# 5.256 0.134 7.965
system.time(z <- lapply(y, foo))
# user system elapsed
# 2.179 0.126 3.301
If you plan to save the result then apply family functions can be much more than syntactic sugar.
(A simple unlist() of z takes only about 0.2 s, so the lapply is still much faster. Initializing z inside the for loop is also quite cheap; I'm reporting the average of the last 5 of 6 runs, so moving that initialization outside the system.time() would hardly affect things.)
One more thing to note is that there is another reason to use apply family functions, independent of their performance, clarity, or lack of side effects. A for loop typically encourages putting as much as possible within the loop, because each loop requires setting up variables to store information (among other possible operations). Apply statements tend to be biased the other way. Often you want to perform multiple operations on your data, several of which can be vectorized but some of which cannot. In R, unlike other languages, it is best to separate those operations out and run the ones that are not vectorized in an apply statement (or a vectorized version of the function) and the ones that are vectorized as true vector operations. This often speeds up performance tremendously.
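Here is a sketch of that separation (a toy task made up for illustration, not code from the answers above): the scaling and the accumulation are true vector operations, so only the genuinely per-element step stays inside the apply call.
set.seed(1)
x <- rnorm(1e4)

# Everything crammed into one loop: scaling, the per-element step, and accumulation
res1 <- numeric(length(x))
tot1 <- 0
for (i in seq_along(x)) {
  xi      <- 2 * x[i] + 1                              # vectorizable
  res1[i] <- integrate(dnorm, -Inf, xi)$value          # genuinely per-element
  tot1    <- tot1 + res1[i]                            # vectorizable
}

# Separated: vector operations stay vectorized; only integrate() lives in vapply
x2   <- 2 * x + 1
res2 <- vapply(x2, function(xi) integrate(dnorm, -Inf, xi)$value, numeric(1))
tot2 <- sum(res2)

all.equal(tot1, tot2)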
Taking Joris Meys's example, where he replaces a traditional for loop with a handy R function, we can use it to show the efficiency of writing code in a more R-friendly manner for a similar speedup without the specialized function.
set.seed(1) # for reproducibility of the results
# The data - copied from Joris Meys answer
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=T))
Z <- as.factor(sample(letters[1:10],100000,replace=T))
# an R way to generate tapply functionality that is fast and
# shows more general principles about fast R coding
YZ <- interaction(Y, Z)
XS <- split(X, YZ)
m <- vapply(XS, mean, numeric(1))
m <- matrix(m, nrow = length(levels(Y)))
rownames(m) <- levels(Y)
colnames(m) <- levels(Z)
m
This winds up being much faster than the for loop and just a little slower than the built-in, optimized tapply function. It's not because vapply is so much faster than for, but because it is only performing one operation in each iteration of the loop. In this code everything else is vectorized. In Joris Meys's traditional for loop many (7?) operations are occurring in each iteration, and there's quite a bit of setup just for it to execute. Note also how much more compact this is than the for version.
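A hedged way to check that claim on the same data (assuming forloop() from Joris Meys's answer above is still in the workspace; timings omitted here since they vary by machine):
library(microbenchmark)
microbenchmark(
  forloop  = forloop(X, Y, Z),
  splitvap = {
    v <- vapply(split(X, interaction(Y, Z)), mean, numeric(1))
    matrix(v, nrow = length(levels(Y)),
           dimnames = list(levels(Y), levels(Z)))
  },
  tapply   = tapply(X, list(Y, Z), mean),
  times = 10
)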
When applying functions over subsets of a vector, tapply can be quite a bit faster than a for loop. Example:
df <- data.frame(id = rep(letters[1:10], 100000),
                 value = rnorm(1000000))

f1 <- function(x)
  tapply(x$value, x$id, sum)

f2 <- function(x){
  res <- 0
  for(i in seq_along(l <- unique(x$id)))
    res[i] <- sum(x$value[x$id == l[i]])
  names(res) <- l
  res
}
library(microbenchmark)
> microbenchmark(f1(df), f2(df), times=100)
Unit: milliseconds
expr min lq median uq max neval
f1(df) 28.02612 28.28589 28.46822 29.20458 32.54656 100
f2(df) 38.02241 41.42277 41.80008 42.05954 45.94273 100
apply, however, in most situations doesn't provide any speed increase, and in some cases can even be a lot slower:
mat <- matrix(rnorm(1000000), nrow=1000)
f3 <- function(x)
  apply(x, 2, sum)

f4 <- function(x){
  res <- 0
  for(i in 1:ncol(x))
    res[i] <- sum(x[,i])
  res
}
> microbenchmark(f3(mat), f4(mat), times=100)
Unit: milliseconds
expr min lq median uq max neval
f3(mat) 14.87594 15.44183 15.87897 17.93040 19.14975 100
f4(mat) 12.01614 12.19718 12.40003 15.00919 40.59100 100
But for these situations we've got colSums and rowSums:
f5 <- function(x)
colSums(x)
> microbenchmark(f5(mat), times=100)
Unit: milliseconds
expr min lq median uq max neval
f5(mat) 1.362388 1.405203 1.413702 1.434388 1.992909 100
Related
In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.
It has many NA elements in both its rows and columns, and I am looking to replace these values with a zero. I have used:
d[is.na(d)] <- 0
but this is rather slow. Is there a better way to do this in R?
I am open to using other R packages.
I would prefer it if the discussion focused on computational speed rather than, for example, "why would you replace NAs with zeros". And, while I realize a similar question has been asked (How do I replace NA values with zeros in an R dataframe?), the focus there was not on computational speed for a large data frame with many missing values.
Thanks!
Edited Solution:
As helpfully suggested, changing d to a matrix before applying is.na() sped up the computation by an order of magnitude.
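In code, the edit amounts to something along these lines (a sketch; it relies on all columns being numeric, as stated above):
m <- as.matrix(d)       # one contiguous numeric matrix instead of 6000 list columns
m[is.na(m)] <- 0        # a single vectorised replacement
d <- as.data.frame(m)   # convert back if a data frame is still needed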
You can get a considerable performance increase using the data.table package.
It is much faster, in general, with a lot of manipulations and transformations.
The downside is the learning curve of the syntax.
However, if you are looking for a speed performance boost, the investment could be worth it.
Generate fake data
r <- 10500
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))
Base R
df1 <- df
system.time(df1[is.na(df1)] <- 0)
user system elapsed
4.74 0.00 4.78
tidyr - replace_na()
dfReplaceNA <- function (df) {
  require(tidyr)
  l <- setNames(lapply(vector("list", ncol(df)), function(x) x <- 0), names(df))
  replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))
user system elapsed
4.27 0.00 4.28
data.table - set()
dtReplaceNA <- function (df) {
  require(data.table)
  dt <- data.table(df)
  for (j in 1:ncol(dt)) {set(dt, which(is.na(dt[[j]])), j, 0)}
  setDF(dt)  # Return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))
user system elapsed
0.80 0.31 1.11
Compare data frames
all.equal(df1, df2)
[1] TRUE
all.equal(df1, df3)
[1] TRUE
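As a further aside (not benchmarked above, and if I remember the version right it needs data.table 1.12.4 or later), data.table also provides setnafill(), which does the constant fill by reference for numeric columns:
library(data.table)
dt <- as.data.table(df)
setnafill(dt, type = "const", fill = 0)  # replaces NAs in every numeric column in place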
I guess that all columns must be numeric or assigning 0s to NAs wouldn't be sensible.
I get the following timings, with approximately 10,000 NAs:
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
user system elapsed
0.19 0.12 0.31
> system.time(D[is.na(D)] <- 0)
user system elapsed
3.87 0.06 3.95
So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop -- much less time than it took to answer the question. If the problem really is of this magnitude, why is that slow?
I hope this helps.
I have a large graph (several, actually) in igraph—on the order of 100,000 vertices—and each vertex has an attribute which is either true or false. For each vertex, I would like to count how many of the vertices directly connected to it have the attribute. My current solution is the following function, which takes as its argument a graph.
attrcount <- function(g) {
  nb <- neighborhood(g, order = 1)
  return(sapply(nb, function(x) { sum(V(g)$attr[x]) }))
}
This returns a vector of counts which is off by 1 for vertices which have the attribute, but I can adjust this easily.
The problem is that this runs incredibly slowly, and it seems like there should be a fast way to do this, since, for instance, computing the degree of each vertex is practically instantaneous with degree(g).
Am I doing this a stupid way?
As an example, suppose this was our graph.
set.seed(42)
g <- erdos.renyi.game(169081, 178058, type="gnm")
V(g)$att <- as.logical(rbinom(vcount(g), 1, 0.5))
Use get.adjlist to query all adjacent vertices, and then sapply (or tapply might be even faster) on this list to get the counts. It is also worth storing the attribute in a vector, because then you don't need to extract it all the time.
With sapply
system.time({
  al <- get.adjlist(g)
  att <- V(g)$att
  res <- sapply(al, function(x) sum(att[x]))
})
# user system elapsed
# 0.571 0.005 0.576
With tapply
system.time({
  al <- get.adjlist(g)
  alv <- unlist(al)
  alf <- factor(rep(seq_along(al), sapply(al, length)),
                levels = seq_along(al))
  att <- V(g)$att
  res2 <- tapply(att[alv], alf, sum)
  res2[is.na(res2)] <- 0
})
# user system elapsed
# 1.121 0.020 1.144
all(res == res2)
# TRUE
Somewhat to my surprise, the tapply solution is actually slower.
If this is still not enough, then I guess you can still make it faster by writing it in C/C++.
For faster computation, use get.adjacency to pull the adjacency matrix, then multiply the matrix by the attribute vector using %*%:
library(igraph)
set.seed(42)
g <- erdos.renyi.game(1000, 1000, type = "gnm")
V(g)$att <- as.logical(rbinom(vcount(g), 1, 0.5))
system.time({
  ma <- get.adjacency(g)
  att <- V(g)$att
  res1 <- ma %*% att
})
# user system elapsed
# 0.003 0.000 0.003
Compared to using get.adjlist and sapply:
system.time({
  al <- get.adjlist(g)
  att <- V(g)$att
  res2 <- sapply(al, function(x) sum(att[x]))
})
# user system elapsed
# 9.733 0.243 10.107
After converting res1 to a plain numeric vector, the two results are identical:
res1 <- as.numeric(res1)
identical(res1, res2)
# [1] TRUE
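The same idea should scale to the graph size in the question, since get.adjacency() can return a sparse matrix (via the Matrix package), so the product never materialises a dense 169081 x 169081 matrix. A sketch, with no timings claimed here:
set.seed(42)
g2 <- erdos.renyi.game(169081, 178058, type = "gnm")
V(g2)$att <- as.logical(rbinom(vcount(g2), 1, 0.5))

ma2  <- get.adjacency(g2, sparse = TRUE)          # sparse: ~2 * 178058 nonzero entries
res3 <- as.vector(ma2 %*% as.numeric(V(g2)$att))  # neighbour counts with the attribute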
The R function
xts:::na.locf.xts
is extremely slow when used with an xts object that has more than a few columns.
There is indeed a loop over the columns in the code of na.locf.xts.
I am trying to find a way to avoid this loop.
Any idea?
The loop in na.locf.xts is slow because it creates a copy of the entire object for each column in the object. The loop itself isn't slow; the copies created by [.xts are slow.
There's an experimental (and therefore unexported) version of na.locf.xts on R-Forge that moves the loop over columns to C, which avoids copying the object. It's quite a bit faster for very large objects.
set.seed(21)
m <- replicate(20, rnorm(1e6))
is.na(m) <- sample(length(m), 1e5)
x <- xts(m, Sys.time()-1e6:1)
y <- x[1:1e5,1:3]
> # smaller objects
> system.time(a <- na.locf(y))
user system elapsed
0.008 0.000 0.008
> system.time(b <- xts:::.na.locf.xts(y))
user system elapsed
0.000 0.000 0.003
> identical(a,b)
[1] TRUE
> # larger objects
> system.time(a <- na.locf(x))
user system elapsed
1.620 1.420 3.064
> system.time(b <- xts:::.na.locf.xts(x))
user system elapsed
0.124 0.092 0.220
> identical(a,b)
[1] TRUE
timeIndex <- index(x)
x <- apply(x, 2, na.locf)
x <- as.xts(x, order.by = timeIndex)
This avoids the column-by-column data copying. Without this, when filling the nth column, you make a copy of 1 : (n - 1) columns and append the nth column to it, which becomes prohibitively slow when n is large.
Speeding up loops in R can easily be done using a function from the apply family. How can I use an apply function in the code below to speed it up? Note that within the loop, at each iteration, one column is permuted and a function is applied to the new data frame (i.e., the initial data frame with one column permuted). I cannot seem to get apply to work because the new data frame has to be built within the loop.
#x <- data.frame(a=1:10,b=11:20,c=21:30) #small example
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
y <- rowMeans(x)
start <- Sys.time()
totaldiff <- numeric()
for (i in 1:ncol(x)){
  x.after <- x
  x.after[,i] <- sample(x[,i])
  diff <- abs(y-rowMeans(x.after))
  totaldiff[i] <- sum(diff)
}
colnames(x)[which.max(totaldiff)]
Sys.time() - start
After working through this and other replies, the optimization strategies (and approximate speed-up) here seem to be
(30x) Choose an appropriate data representation -- matrix, rather than data.frame
(1.5x) Reduce unnecessary data copies -- difference of columns, rather than of rowMeans
Structure for loops as *apply functions (to emphasize code structure, simplify memory management, and provide type consistency)
(2x) Hoist vector operations outside loops -- abs and sum on columns become abs and colSums on a matrix
for an overall speed-up of about 100x. For this size and complexity of code, the use of the compiler or parallel packages would not be effective.
I put your code into a function
f0 <- function(x) {
  y <- rowMeans(x)
  totaldiff <- numeric()
  for (i in 1:ncol(x)){
    x.after <- x
    x.after[,i] <- sample(x[,i])
    diff <- abs(y-rowMeans(x.after))
    totaldiff[i] <- sum(diff)
  }
  which.max(totaldiff)
}
and here we have
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
set.seed(123)
system.time(res0 <- f0(x))
## user system elapsed
## 1.065 0.000 1.066
Your data can be represented as a matrix, and operations on R matrices are faster than on data.frames.
m <- matrix(runif(50*100),nrow=50,ncol=100)
set.seed(123)
system.time(res0.m <- f0(m))
## user system elapsed
## 0.036 0.000 0.037
identical(res0, res0.m)
##[1] TRUE
That's probably the biggest speed-up. But for the specific operation here we don't need to calculate the row means of the updated matrix, just the change in the mean from shuffling one column
f1 <- function(x) {
  y <- rowMeans(x)
  totaldiff <- numeric()
  for (i in 1:ncol(x)){
    diff <- abs(sample(x[,i]) - x[,i]) / ncol(x)
    totaldiff[i] <- sum(diff)
  }
  which.max(totaldiff)
}
The for loop doesn't follow the right pattern for filling up the result vector totaldiff (you want to "pre-allocate and fill", so totaldiff <- numeric(ncol(x))), but we can use an sapply and let R worry about that (this memory management is one of the advantages of using the apply family of functions):
f2 <- function(x) {
  totaldiff <- sapply(seq_len(ncol(x)), function(i, x) {
    sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
  }, x)
  which.max(totaldiff)
}
set.seed(123); identical(res0, f1(m))
set.seed(123); identical(res0, f2(m))
The timings are
> library(microbenchmark)
> microbenchmark(f0(m), f1(m), f2(m))
Unit: milliseconds
expr min lq median uq max neval
f0(m) 32.45073 33.07804 33.16851 33.26364 33.81924 100
f1(m) 22.20913 23.87784 23.96915 24.06216 24.66042 100
f2(m) 21.02474 22.60745 22.70042 22.80080 23.19030 100
@flodel points out that vapply can be faster (and provides type safety)
f3 <- function(x) {
  totaldiff <- vapply(seq_len(ncol(x)), function(i, x) {
    sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
  }, numeric(1), x)
  which.max(totaldiff)
}
and that
f4 <- function(x)
  which.max(colSums(abs(apply(x, 2, sample) - x)))
is still faster (ncol(x) is a constant factor, so it was removed). The abs and sum are hoisted outside the sapply, maybe at the expense of additional memory use. The advice in the comments to compile functions is good in general; here are some further timings:
> microbenchmark(f0(m), f1(m), f1.c(m), f2(m), f2.c(m), f3(m), f4(m))
Unit: milliseconds
expr min lq median uq max neval
f0(m) 32.35600 32.88326 33.12274 33.25946 34.49003 100
f1(m) 22.21964 23.41500 23.96087 24.06587 24.49663 100
f1.c(m) 20.69856 21.20862 22.20771 22.32653 213.26667 100
f2(m) 20.76128 21.52786 22.66352 22.79101 69.49891 100
f2.c(m) 21.16423 21.57205 22.94157 23.06497 23.35764 100
f3(m) 20.17755 21.41369 21.99292 22.10814 22.36987 100
f4(m) 10.10816 10.47535 10.56790 10.61938 10.83338 100
where the ".c" are compiled versions and
Compilation is particularly helpful in code written with for loops but doesn't do much for vectorized code; this is shown here where's a small but consistent improvement from compiling f1's for loop, but not f2's sapply.
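For completeness, the f1.c and f2.c used in the benchmark above would have been created along these lines (a reconstruction; the original answer doesn't show this step):
library(compiler)
f1.c <- cmpfun(f1)  # byte-compiled version of the for-loop function
f2.c <- cmpfun(f2)  # byte-compiled version of the sapply function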
Since you are looking at efficiency/optimization, start by using the rbenchmark package for comparison purposes.
Rewriting your given example as a function (so that it can be replicated and compared)
forFirst <- function(x) {
  y <- rowMeans(x)
  totaldiff <- numeric()
  for (i in 1:ncol(x)){
    x.after <- x
    x.after[,i] <- sample(x[,i])
    diff <- abs(y-rowMeans(x.after))
    totaldiff[i] <- sum(diff)
  }
  colnames(x)[which.max(totaldiff)]
}
Applying some standard optimizations (pre-allocating totaldiff to the right size, eliminating intermediate variables that are only used once) gives
forSecond <- function(x) {
  y <- rowMeans(x)
  totaldiff <- numeric(ncol(x))
  for (i in 1:ncol(x)){
    x.after <- x
    x.after[,i] <- sample(x[,i])
    totaldiff[i] <- sum(abs(y-rowMeans(x.after)))
  }
  colnames(x)[which.max(totaldiff)]
}
There is not much more that I can see to do to improve the algorithm itself within the loop. A better algorithm would be the most help, but since this particular problem is just an example, it is not worth spending that time on it.
The apply version looks very similar.
applyFirst <- function(x) {
  y <- rowMeans(x)
  totaldiff <- sapply(seq_len(ncol(x)), function(i) {
    x[,i] <- sample(x[,i])
    sum(abs(y-rowMeans(x)))
  })
  colnames(x)[which.max(totaldiff)]
}
Benchmarking them gives:
> library("rbenchmark")
> benchmark(forFirst(x),
+ forSecond(x),
+ applyFirst(x),
+ order = "relative")
test replications elapsed relative user.self sys.self user.child
1 forFirst(x) 100 16.92 1.000 16.88 0.00 NA
2 forSecond(x) 100 17.02 1.006 16.96 0.03 NA
3 applyFirst(x) 100 17.05 1.008 17.02 0.01 NA
sys.child
1 NA
2 NA
3 NA
The differences between these is just noise. In fact, running the benchmark again gives a different ordering:
> benchmark(forFirst(x),
+ forSecond(x),
+ applyFirst(x),
+ order = "relative")
test replications elapsed relative user.self sys.self user.child
3 applyFirst(x) 100 17.05 1.000 17.02 0 NA
2 forSecond(x) 100 17.08 1.002 17.05 0 NA
1 forFirst(x) 100 17.44 1.023 17.41 0 NA
sys.child
3 NA
2 NA
1 NA
So these approaches are the same speed. Any real improvement will come from using a better algorithm than just simple looping and copying to create the intermediate results.
Apply functions do not necessarily speed up loops in R. Sometimes they can even slow them down. There's no reason to believe that turning this into an apply family function will speed it up any appreciable amount.
As an aside, this code seems like a relatively pointless endeavour. It's just going to select a random column. I could get the same result by just doing that in the first place. Perhaps this is nested in a larger loop looking for a distribution?
What is the fastest way to detect if a vector has at least 1 NA in R? I've been using:
sum( is.na( data ) ) > 0
But that requires examining each element, coercion, and the sum function.
As of R 3.1.0 anyNA() is the way to do this. On atomic vectors this will stop after the first NA instead of going through the entire vector as would be the case with any(is.na()). Additionally, this avoids creating an intermediate logical vector with is.na that is immediately discarded. Borrowing Joran's example:
x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
# expr min lq mean median uq
# any(is.na(x)) 13444.674 13509.454 21191.9025 13639.3065 13917.592
# anyNA(x) 6.840 13.187 13.5283 14.1705 14.774
# any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
# anyNA(y) 7193.784 7285.107 7694.1785 7497.9265 7865.064
Notice how it is substantially faster even when we modify the last value of the vector; this is in part because of the avoidance of the intermediate logical vector.
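One small usage note (not part of the benchmark above): anyNA() also takes a recursive argument in recent R versions, which makes it descend into lists rather than only inspecting the top-level elements:
l <- list(a = 1:3, b = c(2, NA, 4))
anyNA(l)                    # FALSE: only the top-level elements are inspected
anyNA(l, recursive = TRUE)  # TRUE: the component vectors are searched too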
I'm thinking:
any(is.na(data))
should be slightly faster.
We mention this in some of our Rcpp presentations and actually have some benchmarks which show a pretty large gain from embedded C++ with Rcpp over the R solution because
a vectorised R solution still computes every single element of the vector expression
if your goal is to just satisfy any(), then you can abort after the first match -- which is what our Rcpp sugar (in essence: some C++ template magic to make C++ expressions look more like R expressions, see this vignette for more) solution does.
So by getting a compiled specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this SO question here, I am reasonably confident about the performance.
Edit: The Rcpp package also contains examples in the sugarPerformance directory. These show a speed-up of several thousand-fold for the 'sugar-can-abort-soon' approach over 'R-computes-full-vector-expression' for any(), though I should add that that case does not involve is.na() but a simple boolean expression.
One could write a for loop stopping at NA, but the system.time then depends on where the NA is... (if there is none, it takes looooong)
set.seed(1234)
x <- sample(c(1:5, NA), 100000000, replace = TRUE)
nacount <- function(x){
  for(i in 1:length(x)){
    if(is.na(x[i])) {
      print(TRUE)
      break
    }
  }
}
system.time(
nacount(x)
)
[1] TRUE
user system elapsed
0.14 0.04 0.18
system.time(
any(is.na(x))
)
user system elapsed
0.28 0.08 0.37
system.time(
sum(is.na(x)) > 0
)
user system elapsed
0.45 0.07 0.53
Here are some actual times from my (slow) machine for some of the various methods discussed so far:
x <- runif(1e7)
x[1e4] <- NA
> system.time(sum(is.na(x)) > 0)
user system elapsed
0.065 0.001 0.065
> system.time(any(is.na(x)))
user system elapsed
0.035 0.000 0.034
> system.time(match(NA,x))
user system elapsed
1.824 0.112 1.918
> system.time(NA %in% x)
user system elapsed
1.828 0.115 1.925
> system.time(which(is.na(x) == TRUE))
user system elapsed
0.099 0.029 0.127
It's not surprising that match and %in% are similar, since %in% is implemented using match.
You can try:
d <- c(1,2,3,NA,5,3)
which(is.na(d) == TRUE, arr.ind=TRUE)
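One small remark on that last snippet: the == TRUE comparison is redundant, since is.na() already returns a logical vector, and arr.ind only matters for matrices and arrays; which(is.na(d)) gives the same positions:
d <- c(1, 2, 3, NA, 5, 3)
which(is.na(d))
# [1] 4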