I have huge amounts of data to analyze, and I tend to leave spaces between words and variable names as I write my code. So the question is: in cases where efficiency is the number one priority, does the whitespace have a cost?
Is c<-a+b more efficient than c <- a + b?
To a first, second, third, ..., approximation, no, it won't cost you any time at all.
The extra time you spend pressing the space bar is orders of magnitude more costly than the cost at run time (and neither matter at all).
The much more significant cost will come from any decreased readability that results from leaving out spaces, which can make code harder (for humans) to parse.
In a word, no!
library(microbenchmark)
f1 <- function(x){
j <- rnorm( x , mean = 0 , sd = 1 ) ;
k <- j * 2 ;
return( k )
}
f2 <- function(x){j<-rnorm(x,mean=0,sd=1);k<-j*2;return(k)}
microbenchmark( f1(1e3) , f2(1e3) , times= 1e3 )
Unit: microseconds
expr min lq median uq max neval
f1(1000) 110.763 112.8430 113.554 114.319 677.996 1000
f2(1000) 110.386 112.6755 113.416 114.151 5717.811 1000
#Even more runs and longer sampling
microbenchmark( f1(1e4) , f2(1e4) , times= 1e4 )
Unit: milliseconds
expr min lq median uq max neval
f1(10000) 1.060010 1.074880 1.079174 1.083414 66.791782 10000
f2(10000) 1.058773 1.074186 1.078485 1.082866 7.491616 10000
EDIT
It seems like using microbenchmark would be unfair because the expressions are parsed before they are ever run in the loop. However, using source should mean that with each iteration the sourced code must be parsed and the whitespace dealt with. So I saved the functions to two separate files, with the last line of each file being a call to the function, e.g. my file f2.R looks like this:
f2 <- function(x){j<-rnorm(x,mean=0,sd=1);k<-j*2;return(k)};f2(1e3)
And I test them like so:
microbenchmark( eval(source("~/Desktop/f2.R")) , eval(source("~/Desktop/f1.R")) , times = 1e3)
Unit: microseconds
expr min lq median uq max neval
eval(source("~/Desktop/f2.R")) 649.786 658.6225 663.6485 671.772 7025.662 1000
eval(source("~/Desktop/f1.R")) 687.023 697.2890 702.2315 710.111 19014.116 1000
And a visual representation of the difference with 1e4 replications....
Maybe it does make a minuscule difference in situations where functions are repeatedly parsed, but this wouldn't happen in normal use cases.
YES
But, No, not really:
TL;DR: It would probably take longer just to run a script that removes the whitespace than the time removing it would ever save.
@Josh O'Brien really hit the nail on the head, but I just couldn't resist benchmarking this.
As you can see, if you are dealing with something on the order of 100 MILLION lines, then you will see a minuscule hindrance.
HOWEVER, with that many lines there would be a high likelihood of there being at least one (if not hundreds of) hotspots,
where simply improving the code in one of them would give you much greater speed than grepping out all the whitespace.
library(microbenchmark)
microbenchmark(LottaSpace = eval(LottaSpace), NoSpace = eval(NoSpace), NormalSpace = eval(NormalSpace), times = 100)  # and times = 1e4 for the second table below
# 100 times; Unit: microseconds
expr min lq median uq max
1 LottaSpace 7.526 7.9185 8.1065 8.4655 54.850
2 NormalSpace 7.504 7.9115 8.1465 8.5540 28.409
3 NoSpace 7.544 7.8645 8.0565 8.3270 12.241
# 10,000 times; Unit: microseconds
expr min lq median uq max
1 LottaSpace 7.284 7.943 8.094 8.294 47888.24
2 NormalSpace 7.182 7.925 8.078 8.276 46318.20
3 NoSpace 7.246 7.921 8.073 8.271 48687.72
WHERE:
LottaSpace <- quote({
    a       <-      3
    b       <-      4
    c       <-      5
    for     (i      in      1:7)
            i       +       i
})
NoSpace <- quote({
a<-3
b<-4
c<-5
for(i in 1:7)
i+i
})
NormalSpace <- quote({
a <- 3
b <- 4
c <- 5
for (i in 1:7)
i + i
})
The only part this can affect is the parsing of the source code into tokens. I can't imagine that the difference in parsing time would be significant. However, you can eliminate this aspect by compiling the functions using the compile or cmpfun functions of the compiler package. Then the parsing is only done once and any whitespace difference cannot affect execution time.
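A minimal sketch of that approach, reusing f1 and f2 from the benchmark above (exact timings will of course vary):
library(compiler)
library(microbenchmark)
f1c <- cmpfun(f1)  # byte-compile once; any whitespace was already handled at parse time
f2c <- cmpfun(f2)
microbenchmark(f1c(1e3), f2c(1e3), times = 1e3)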
There should be no difference in performance, although:
fn1<-function(a,b) c<-a+b
fn2<-function(a,b) c <- a + b
library(rbenchmark)
> benchmark(fn1(1,2),fn2(1,2),replications=10000000)
test replications elapsed relative user.self sys.self user.child
1 fn1(1, 2) 10000000 53.87 1.212 53.4 0.37 NA
2 fn2(1, 2) 10000000 44.46 1.000 44.3 0.14 NA
same with microbenchmark:
Unit: nanoseconds
expr min lq median uq max neval
fn1(1, 2) 0 467 467 468 90397803 1e+07
fn2(1, 2) 0 467 467 468 85995868 1e+07
So the first result was bogus..
As we learn from this answer, there's a substantial performance increase when using anyNA() over any(is.na()) to detect whether a vector has at least one NA element. This makes sense, as the algorithm of anyNA() stops after the first NA value it finds, whereas any(is.na()) has to first run over the entire vector with is.na().
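For illustration, a minimal sketch of the two idioms (no timings claimed here; the linked answer has the benchmarks):
x <- c(NA, rnorm(1e6))
anyNA(x)        # can return TRUE as soon as it sees the first NA
any(is.na(x))   # is.na() first builds a full logical vector of length 1e6 + 1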
By contrast, I want to know whether a vector has at least one non-NA value. This means that I'm looking for an implementation that would stop after the first encounter with a non-NA value. Yes, I can use any(!is.na()), but then I face the issue of having is.na() run over the entire vector first.
Is there a performant opposite equivalent to anyNA(), i.e., "anyNonNA()"?
I'm not aware of a native function that stops if it comes across a non-NA value, but we can write a simple one using Rcpp:
Rcpp::cppFunction("bool any_NonNA(NumericVector v) {
for(size_t i = 0; i < v.length(); i++) {
if(!(Rcpp::traits::is_na<REALSXP>(v[i]))) return true;
}
return false;
}")
This creates an R function called any_NonNA which does what we need. Let's test it on a large vector of 100,000 NA values:
test <- rep(NA, 1e5)
any_NonNA(test)
#> [1] FALSE
any(!is.na(test))
#> [1] FALSE
Now let's make the first element a non-NA:
test[1] <- 1
any_NonNA(test)
#> [1] TRUE
any(!is.na(test))
#> [1] TRUE
So it gives the correct result, but is it faster?
Certainly, in this example, since it should stop after the first element, it should be much quicker. This is indeed the case if we do a head-to-head comparison:
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 275.1 525.0 670.948 533.05 568.7 13029.9 100 b
#> Rcpp 1.6 2.1 4.319 3.30 5.1 33.7 100 a
As expected, this is a couple of orders of magnitude faster. What about if our first non-NA value is mid-way through the vector?
test[1] <- NA
test[50000] <- 1
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 332.1 579.35 810.948 597.95 624.40 12010.4 100 b
#> Rcpp 299.4 300.70 311.516 305.10 309.25 370.1 100 a
Still faster, but not by much.
If we put our non-NA value at the end we shouldn't see much difference:
test[50000] <- NA
test[100000] <- 1
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 395.6 631.65 827.173 642.6 663.8 11357.0 100 a
#> Rcpp 596.3 602.25 608.011 605.8 612.6 632.6 100 a
So this does indeed look to be faster than the base R solution (at least for large vectors).
anyNA() seems to be a collaboration with Google. I think checking whether there are any NA values is far more common than the opposite, which justifies the existence of that "special" function.
Here is my attempt, for numeric vectors only:
anyNonNA <- Rcpp::cppFunction(
  'bool anyNonNA(NumericVector x) {
     // stop at the first non-NA value
     for (double i : x) if (!Rcpp::NumericVector::is_na(i)) return true;
     return false;
   }'
)
var <- rep(NA_real_, 1e7)
any(!is.na(var)) #FALSE
anyNonNA(var) #FALSE
var[5e6] <- 0
any(!is.na(var)) #TRUE
anyNonNA(var) #TRUE
microbenchmark::microbenchmark(any(!is.na(var)))
#Unit: milliseconds
# expr min lq mean median uq max neval
# any(!is.na(var)) 41.1922 46.6087 55.57655 59.1408 61.87265 74.4424 100
microbenchmark::microbenchmark(anyNonNA(var))
#Unit: milliseconds
# expr min lq mean median uq max neval
# anyNonNA(var) 10.6333 10.71325 11.05704 10.8553 11.2082 14.871 100
I have a question about a function which takes a range: I need to execute a while loop over the given range. Below is the pseudo-code I wrote. Here I intend to read files from a sorted list, so start = 4 and end = 8 would mean read files 4 to 8.
readFiles<-function(start,end){
i = start
while(i<end){
#do something
i += 1
}
}
I need to know how to do this in R. Any help is appreciated.
You can try this:
readFiles<-function(start,end){
for (i in start:end){
print(i) # this is an example, here you put the code to read the file
# it just allows you to see that the index starts at 4 and ends at 8
}
}
readFiles(4,8)
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
As pointed out by mra68, if you don't want the function to do anything when start > end, you could do this:
readFiles<-function(start,end){
if (start<=end){
for (i in start:end){
print(i)
}
}
}
It will not do anything for readFiles(8,4). Using print(i) as the function in the loop, it is slightly faster than the while version when start <= end, and also faster when start > end:
Unit: microseconds
expr min lq mean median uq max neval cld
readFiles(1, 10) 591.437 603.1610 668.4673 610.6850 642.007 1460.044 100 a
readFiles2(1, 10) 548.041 559.2405 640.9673 574.6385 631.333 2278.605 100 a
Unit: microseconds
expr min lq mean median uq max neval cld
readFiles(10, 1) 1.75 1.751 2.47508 2.10 2.101 23.098 100 b
readFiles2(10, 1) 1.40 1.401 1.72613 1.75 1.751 6.300 100 a
Here, readFiles2 is the if ... for solution and readFiles is the while solution.
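For reference, a minimal sketch of what the benchmarked while-based readFiles presumably looks like (assuming an i <= end condition and print(i) standing in for the file-reading code):
readFiles <- function(start, end) {
  i <- start
  while (i <= end) {
    print(i)   # placeholder for reading file i
    i <- i + 1
  }
}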
I was looking at the benchmarks in this answer, and wanted to compare them with diag (used in a different answer). Unfortunately, it seems that diag takes ages:
nc <- 1e4
set.seed(1)
m <- matrix(sample(letters,nc^2,replace=TRUE), ncol = nc)
library(microbenchmark)
microbenchmark(
diag = diag(m),
cond = m[row(m)==col(m)],
vec = m[(1:nc-1L)*nc+1:nc],
mat = m[cbind(1:nc,1:nc)],
times=10)
Comments: I tested these with identical. I took "cond" from one of the answers to this homework question. Results are similar with a matrix of integers, 1:26 instead of letters.
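A sketch of the kind of identical() checks referred to above (all comparisons should agree):
v <- (1:nc - 1L)*nc + 1:nc
stopifnot(
  identical(diag(m), m[row(m) == col(m)]),
  identical(diag(m), m[v]),
  identical(diag(m), m[cbind(1:nc, 1:nc)])
)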
Results:
Unit: microseconds
expr min lq mean median uq max neval
diag 604343.469 629819.260 710371.3320 706842.3890 793144.019 837115.504 10
cond 3862039.512 3985784.025 4175724.0390 4186317.5260 4312493.742 4617117.706 10
vec 317.088 329.017 432.9099 350.1005 629.460 651.376 10
mat 272.147 292.953 441.7045 345.9400 637.506 706.860 10
It is just a matrix-subsetting operation, so I don't know why there's so much overhead. Looking inside the function, I see a few checks and then c(m)[v], where v is the same vector used in the "vec" benchmark. Timing these two...
v <- (1:nc-1L)*nc+1:nc
microbenchmark(diaglike=c(m)[v],vec=m[v])
# Unit: microseconds
# expr min lq mean median uq max neval
# diaglike 579224.436 664853.7450 720372.8105 712649.706 767281.5070 931976.707 100
# vec 334.843 339.8365 568.7808 646.799 663.5825 1445.067 100
...it seems I have found my culprit. So, the new variation on my question is: Why is there a seemingly unnecessary and very time-consuming c in diag?
Summary
As of R version 3.2.1 (World-Famous Astronaut) diag() has received an update. The discussion moved to r-devel where it was noted that c() strips non-name attributes and may have been why it was placed there. While some people worried that removing c() would cause unknown issues on matrix-like objects, Peter Dalgaard found that, "The only case where the c() inside diag() has an effect is where M[i,j] != M[(i-1)*m+j] AND c(M) will stringize M in column-major order, so that M[i,j] == c(M)[(i-1)*m+j]."
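A small illustration of the attribute-stripping behaviour of c() mentioned in that discussion (with a hypothetical matrix M):
M <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
attributes(M)     # dim and dimnames are present
attributes(c(M))  # NULL: c() keeps only names, so dim and dimnames are dropped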
Luke Tierney tested @Frank's removal of c(), found that it did not affect anything on CRAN or BIOC, and so the change was implemented, replacing c(x)[...] with x[...] on line 27. This leads to relatively large speedups in diag(). Below is a speed test showing the improvement with R 3.2.1's version of diag().
library(microbenchmark)
nc <- 1e4
set.seed(1)
m <- matrix(sample(letters,nc^2,replace=TRUE), ncol = nc)
microbenchmark(diagOld(m),diag(m))
Unit: microseconds
expr min lq mean median uq max neval
diagOld(m) 451189.242 526622.2775 545116.5668 531905.5635 540008.704 682223.733 100
diag(m) 222.563 646.8675 644.7444 714.4575 740.701 1015.459 100
I have the following R problem. I ran an experiment and observed the speed of some cars. I have a table with cars (where number 1 means, for example, Porsche, 2 Volvo, and so on) and their speeds. A car could be included in the observations more than once. So, for example, the Porsche was observed three times, the Volvo two times.
exp<-data.frame(car=c(1,1,1,2,2,3),speed=c(10,20,30,40,50,60))
I would like to add a third column where, for every row (i.e., every car), the maximum observed speed is given. So it looks like this:
exp<-data.frame(car=c(1,1,1,2,2,3),speed=c(10,20,30,40,50,60), maxSpeed=c(30,30,30,50,50,60))
The maximal observed speed for the Porsche was 30, so every row with the Porsche gets maxSpeed = 30.
I know it should be some apply/sapply-style function, but I have no idea how to implement it. Anyone? :)
@Arun this is my result on a bigger sample (1000 records). The ratio of the medians is now (actually) 0.82:
exp <- data.frame(car=sample(1:10, 1000, T),speed=rnorm(1000, 20, 5))
f1 <- function() mutate(exp, maxSpeed = ave(speed, car, FUN=max))
f2 <- function() transform(exp, maxSpeed = ave(speed, car, FUN=max))
library(microbenchmark)
library(plyr)
> microbenchmark(f1(), f2(), times=1000)
Unit: microseconds
expr min lq median uq max neval
f1() 551.321 565.112 570.565 589.9680 27866.23 1000
f2() 662.933 683.138 689.552 713.7665 28510.24 1000
The plyr documentation itself says "mutate seems to be considerably faster than transform for large data frames."
However, for this case, you're probably right. If I enlarge the sample:
> exp <- data.frame(car=sample(1:1000, 100000, T),speed=rnorm(100000, 20, 5))
> microbenchmark(f1(), f2(), times=100)
Unit: milliseconds
expr min lq median uq max neval
f1() 37.92438 39.00056 40.66607 41.18115 77.41645 100
f2() 39.47731 40.28650 43.11927 43.70779 78.34878 100
The ratio gets close to one. To be honest, I was quite sure about plyr's performance (I always rely on it in my code); that's why I made the 'claim' in the comment. It probably performs better in different situations.
EDIT:
using f3() from @Arun's comment:
> microbenchmark(f1(), f2(), f3(), times=100)
Unit: milliseconds
expr min lq median uq max neval
f1() 38.76050 39.57129 41.48728 42.14812 76.94338 100
f2() 40.38913 41.19767 44.12329 44.78782 79.94021 100
f3() 38.63606 39.58700 40.24272 42.04902 76.07551 100
Yep! slightly faster... moves less data?
Very straightforward with data.table:
library(data.table)
exp <- data.table(exp)
exp[, maxSpeed := max(speed), by=car]
which gives:
exp
car speed maxSpeed
1: 1 10 30
2: 1 20 30
3: 1 30 30
4: 2 40 50
5: 2 50 50
6: 3 60 60
transform(exp, maxSpeed = ave(speed, car, FUN=max))
Another way using split:
exp$maxSpeed <- exp$speed
split(exp$maxSpeed, exp$car) <- lapply(split(exp$maxSpeed, exp$car), max)
exp
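For the example data above, this should give the same result as the other approaches:
#   car speed maxSpeed
# 1   1    10       30
# 2   1    20       30
# 3   1    30       30
# 4   2    40       50
# 5   2    50       50
# 6   3    60       60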