Suppose I have a vector that is nested in a data frame with one or two levels. Is there a quick and dirty way to access its last value, without using the length() function? Something à la Perl's $# special variable?
So I would like something like:
dat$vec1$vec2[$#]
instead of:
dat$vec1$vec2[length(dat$vec1$vec2)]
I use the tail function:
tail(vector, n=1)
The nice thing with tail is that it works on dataframes too, unlike the x[length(x)] idiom.
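For instance (a minimal illustration using the built-in mtcars data set):

# Last element of a vector
x <- c(5, 8, 13)
tail(x, n = 1)
# [1] 13

# Last row of a data frame; note that mtcars[length(mtcars)] would
# instead return the last *column*, since length() of a data frame
# counts columns
tail(mtcars, n = 1)
length(mtcars)
# [1] 11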
To answer this from a performance-oriented rather than an aesthetic point of view, I've put all of the above suggestions through a benchmark. To be precise, I've considered the suggestions
x[length(x)]
mylast(x), where mylast is a C++ function implemented through Rcpp,
tail(x, n=1)
dplyr::last(x)
x[end(x)[1]]
rev(x)[1]
and applied them to random vectors of various sizes (10^3, 10^4, 10^5, 10^6, and 10^7). Before we look at the numbers, I think it should be clear that anything that becomes noticeably slower with greater input size (i.e., anything that is not O(1)) is not an option. Here's the code that I used:
Rcpp::cppFunction('double mylast(NumericVector x) { int n = x.size(); return x[n-1]; }')
options(width=100)
for (n in c(1e3, 1e4, 1e5, 1e6, 1e7)) {
  x <- runif(n)
  print(microbenchmark::microbenchmark(x[length(x)],
                                       mylast(x),
                                       tail(x, n = 1),
                                       dplyr::last(x),
                                       x[end(x)[1]],
                                       rev(x)[1]))
}
It gives me
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 171 291.5 388.91 337.5 390.0 3233 100
mylast(x) 1291 1832.0 2329.11 2063.0 2276.0 19053 100
tail(x, n = 1) 7718 9589.5 11236.27 10683.0 12149.0 32711 100
dplyr::last(x) 16341 19049.5 22080.23 21673.0 23485.5 70047 100
x[end(x)[1]] 7688 10434.0 13288.05 11889.5 13166.5 78536 100
rev(x)[1] 7829 8951.5 10995.59 9883.0 10890.0 45763 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 204 323.0 475.76 386.5 459.5 6029 100
mylast(x) 1469 2102.5 2708.50 2462.0 2995.0 9723 100
tail(x, n = 1) 7671 9504.5 12470.82 10986.5 12748.0 62320 100
dplyr::last(x) 15703 19933.5 26352.66 22469.5 25356.5 126314 100
x[end(x)[1]] 13766 18800.5 27137.17 21677.5 26207.5 95982 100
rev(x)[1] 52785 58624.0 78640.93 60213.0 72778.0 851113 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 214 346.0 583.40 529.5 720.0 1512 100
mylast(x) 1393 2126.0 4872.60 4905.5 7338.0 9806 100
tail(x, n = 1) 8343 10384.0 19558.05 18121.0 25417.0 69608 100
dplyr::last(x) 16065 22960.0 36671.13 37212.0 48071.5 75946 100
x[end(x)[1]] 360176 404965.5 432528.84 424798.0 450996.0 710501 100
rev(x)[1] 1060547 1140149.0 1189297.38 1180997.5 1225849.0 1383479 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 327 584.0 1150.75 996.5 1652.5 3974 100
mylast(x) 2060 3128.5 7541.51 8899.0 9958.0 16175 100
tail(x, n = 1) 10484 16936.0 30250.11 34030.0 39355.0 52689 100
dplyr::last(x) 19133 47444.5 55280.09 61205.5 66312.5 105851 100
x[end(x)[1]] 1110956 2298408.0 3670360.45 2334753.0 4475915.0 19235341 100
rev(x)[1] 6536063 7969103.0 11004418.46 9973664.5 12340089.5 28447454 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 327 722.0 1644.16 1133.5 2055.5 13724 100
mylast(x) 1962 3727.5 9578.21 9951.5 12887.5 41773 100
tail(x, n = 1) 9829 21038.0 36623.67 43710.0 48883.0 66289 100
dplyr::last(x) 21832 35269.0 60523.40 63726.0 75539.5 200064 100
x[end(x)[1]] 21008128 23004594.5 37356132.43 30006737.0 47839917.0 105430564 100
rev(x)[1] 74317382 92985054.0 108618154.55 102328667.5 112443834.0 187925942 100
This immediately rules out anything involving rev or end, since they're clearly not O(1) (and the resulting expressions are evaluated eagerly, not lazily). tail and dplyr::last are not far from being O(1), but they're also considerably slower than mylast(x) and x[length(x)]. Since mylast(x) is slower than x[length(x)] and provides no benefits (on the contrary, it requires a custom C++ function and does not handle an empty vector gracefully), I think the answer is clear: please use x[length(x)].
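As a side note on the empty-vector remark, here is a quick sketch of the edge-case behaviour (the mylast comment is an observation about the Rcpp code shown earlier):

x <- numeric(0)
x[length(x)]    # numeric(0): degrades gracefully, no error
tail(x, n = 1)  # numeric(0)
# mylast(x) would read x[-1] in C++ terms, i.e. one element before the
# start of the array: undefined behaviour rather than a clean result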
If you're looking for something as nice as Python's x[-1] notation, I think you're out of luck. The standard idiom is
x[length(x)]
but it's easy enough to write a function to do this:
last <- function(x) { return( x[length(x)] ) }
This missing feature in R annoys me too!
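A quick check of that helper on a couple of inputs:

last(letters)      # [1] "z"
last(c(2, 4, 6))   # [1] 6
last(numeric(0))   # numeric(0), so empty input does not error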
Combining lindelof's and Gregg Lind's ideas:
last <- function(x) { tail(x, n = 1) }
Working at the prompt, I usually omit the n=, i.e. tail(x, 1).
Unlike last from the pastecs package, head and tail (from utils) work not only on vectors but also on data frames etc., and they can also return the data "without the first/last n elements", e.g.
but.last <- function(x) { head(x, n = -1) }
(Note that you have to use head for this, instead of tail.)
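For example, with the but.last helper above:

x <- 1:5
but.last(x)      # [1] 1 2 3 4  (drops the last element)
tail(x, n = -1)  # [1] 2 3 4 5  (drops the *first* element instead)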
The dplyr package includes a function last():
last(mtcars$mpg)
# [1] 21.4
I just benchmarked these two approaches on a data frame with 663,552 rows using the following code:
system.time(
resultsByLevel$subject <- sapply(resultsByLevel$variable, function(x) {
s <- strsplit(x, ".", fixed=TRUE)[[1]]
s[length(s)]
})
)
user system elapsed
3.722 0.000 3.594
and
system.time(
resultsByLevel$subject <- sapply(resultsByLevel$variable, function(x) {
s <- strsplit(x, ".", fixed=TRUE)[[1]]
tail(s, n=1)
})
)
user system elapsed
28.174 0.000 27.662
So, assuming you're working with vectors, accessing the length position is significantly faster.
Another way is to take the first element of the reversed vector:
rev(dat$vect1$vec2)[1]
I have another method for finding the last element in a vector.
Say the vector is a.
> a<-c(1:100,555)
> end(a) #Gives indices of last and first positions
[1] 101 1
> a[end(a)[1]] #Gives last element in a vector
[1] 555
There you go!
The data.table package also includes a last function:
library(data.table)
last(c(1:10))
# [1] 10
What about
> a <- c(1:100,555)
> a[NROW(a)]
[1] 555
The xts package provides a last function:
library(xts)
a <- 1:100
last(a)
[1] 100
As of purrr 1.0.0, pluck now accepts negative integers to index from the right:
library(purrr)
pluck(LETTERS, -1)
"Z"
Related
I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough (i.e., within 1 second) to an element of another list of valid time-stamps. For this I have a helper function.
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974, 344)
actual_times <- c(200, 210, 215, 220.5, 260)
strain <- c(rep("green", 5))
valid_or_not <- c(rep("NULL", 5))
df <- data.frame(strain, actual_times, valid_or_not)
My data-frame looks like this:
strain actual_times valid_or_not
1 green 200.0 NULL
2 green 210.0 NULL
3 green 215.0 NULL
4 green 220.5 NULL
5 green 260.0 NULL
My helper (that checks to see if an actual_time is within 1 second of a valid time) is as follows:
valid_or_not_fxn <- function(actual_time) {
  c <- "not valid"
  for (i in 1:length(valid_times)) {
    if (abs(valid_times[i] - actual_time) <= 1) {
      c <- "valid"
    } else {
    }
  }
  return(c)
}
What I've tried to do is loop through the entire data-frame using a for loop with this helper function.
However, it's really slow (on my real data set) because it's a nested loop cross-comparing two lists that are hundreds of elements long. I can't figure out how to optimize this.
df$valid_or_not <- as.character(df$valid_or_not)
for (i in 1:nrow(df)) {
  print(df[i, "valid_or_not"])
  df[i, "valid_or_not"] <- valid_or_not_fxn(df[i, "actual_times"])
}
Thank you for any help!
No matter what you do, you essentially have to do at least length(valid_times) comparisons per row. You're probably better off looping over valid_times and comparing each element of that vector against your actual_times column as a vectorised operation. That way you'd only have length(valid_times) (here 6) loop iterations.
One way of doing this is then:
df$test <- Reduce(`|`, lapply(valid_times, function(x) abs(df$actual_times - x) <= 1))
#  strain actual_times valid_or_not  test
#1  green        200.0         NULL FALSE
#2  green        210.0         NULL FALSE
#3  green        215.0         NULL FALSE
#4  green        220.5         NULL  TRUE
#5  green        260.0         NULL  TRUE
A test with 100K rows in df and 1000 valid_times finishes in under 4 seconds:
df2 <- df[sample(1:5,1e5,replace=TRUE),]
valid_times2 <- valid_times[sample(1:5,1000,replace=TRUE)]
system.time(Reduce(`|`, lapply(valid_times2, function(x) abs(df2$actual_times - x) <= 1)))
# user system elapsed
# 3.13 0.40 3.54
The easiest way is to avoid data frame operations altogether: do the check and populate the valid_or_not vector before combining everything into the data frame:
valid_or_not[sapply(actual_times, function(x) any(abs(x - valid_times) <= 1))] <- "valid"
Note that, in this line, the valid_or_not vector is indexed with a logical vector of equal length (whether the condition is satisfied for each entry, TRUE or FALSE), so only the TRUE positions are updated. valid_or_not and actual_times must be of the same length, whereas valid_times can be of a different length.
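A stripped-down illustration of that logical-index assignment, with toy vectors rather than the real data:

v    <- c("NULL", "NULL", "NULL", "NULL")
keep <- c(FALSE, TRUE, FALSE, TRUE)
v[keep] <- "valid"   # only the TRUE positions are overwritten
v
# [1] "NULL"  "valid" "NULL"  "valid"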
By the way "plying" a for loop does not enhance the performance significantly since it is just a "wrapper" for "for" loops. Only performance increase comes from avoiding intermediary objects due to neater and more concise style of code and avoiding redundant copying in some cases. The same case is true for the Vectorize function: It just wraps the for loop that goes through the function and in for example "outer" function, the FUN must be "vectorized" in that manner. In fact it does not give the performance of a truely vectorized operation. In my example the performance enhancement comes from the substitution of the for loop with the "any" function.
And because of some kind of a "bug", subsetting data frames carries a significant penalty. As Hadley Wickham explains in the Performance chapter of Advanced R:
Extracting a single value from a data frame
The following microbenchmark shows five ways to access a single value (the number in the bottom-right corner) from the built-in mtcars dataset. The variation in performance is startling: the slowest method takes 30x longer than the fastest. There’s no reason that there has to be such a huge difference in performance. It’s simply that no one has had the time to fix it.
microbenchmark(
"[32, 11]" = mtcars[32, 11],
"$carb[32]" = mtcars$carb[32],
"[[c(11, 32)]]" = mtcars[[c(11, 32)]],
"[[11]][32]" = mtcars[[11]][32],
".subset2" = .subset2(mtcars, 11)[32] )
## Unit: nanoseconds
## expr min lq mean median uq max neval
## [32, 11] 15,300 16,300 18354 17,000 17,800 76,400 100
## $carb[32] 8,860 9,930 12836 10,600 11,600 85,400 100
## [[c(11, 32)]] 7,200 8,110 9293 8,780 9,350 21,300 100
## [[11]][32] 6,330 7,580 8377 8,100 8,690 20,900 100
## .subset2 334 566 4461 669 800 368,000 100
The most efficient way to subset a data frame is the .subset2 method; most of your poor performance can be attributed to the overhead of ordinary data frame subsetting.
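For a single value from the df in this question, that looks like the following (just a sketch; column 2 is actual_times):

.subset2(df, 2)[4]   # same value as df[4, "actual_times"], but much cheaper
# [1] 220.5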
And a couple of final notes:
If the "else" in your conditional statment does not do anything (just like in your example: else {}) you do not have to include it. R has some lazy operations (does not evaluate a statement as long as it is not executed inside the code), but that does not mean it always skips non-executed code portions.
The "character" values in your example are in fact categoric: Only
one of few values can be chosen for each entry. So there is no need
to store them as "characters" and they can be converted into factors
(which are just integer values). This can also enhance
performance.
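For instance, a minimal sketch of that factor conversion:

df$valid_or_not <- factor(df$valid_or_not)
typeof(df$valid_or_not)   # [1] "integer"  (integer codes plus a levels attribute)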
An addition to @thelatemail's working solution:
In R, the "or" operator (|) isn't lazy, while the any function is: a *ply that combines results with | keeps working to the very end, whereas any stops at the first TRUE it encounters, which improves performance. And vectorized any is almost as fast as native C code, while *ply is only slightly faster than for loops in R.
Some benchmarks showing this:
Pure "any" and | comparison:
> microbenchmark(any(T,F,F,F,F,F), T|F|F|F|F|F)
Unit: nanoseconds
expr min lq mean median uq max neval cld
any(T, F, F, F, F, F) 274 307.0 545.86 366.5 429.5 16380 100 a
T | F | F | F | F | F 597 626.5 903.47 668.5 730.0 18966 100 a
Pure "Reduce" and vectorization comparison:
> vec0 <- rep(1, 1e6)
> microbenchmark(Reduce("+", vec0), sum(vec0), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("+", vec0) 308415.064 310071.953 318503.6048 312940.6355 317648.354
sum(vec0) 930.625 936.775 944.2416 943.5425 949.257
max neval cld
369864.993 10 b
962.349 10 a
And a reduced "|" vs. vectorized "any" comparison (for an extreme case). "any" beats by more than 1e5 times:
> vec1 <- c(T, rep(F, 1e6))
> microbenchmark(Reduce("|", vec1), any(vec1), times = 10)
Unit: nanoseconds
expr min lq mean median uq
Reduce("|", vec1) 394040518 395792399 402703632.6 399191803 400990304
any(vec1) 154 267 1932.5 2588 2952
max neval cld
441805451 10 b
3420 10 a
When the single TRUE is at the very end (so "any" is not lazy anymore and has to check the whole vector), "any" still beats by more than 400 times:
> vec2 <- c(rep(F, 1e6), T)
> microbenchmark(Reduce("|", vec2), any(vec2), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("|", vec2) 396625.318 401744.849 416732.5087 407447.375 424538.222
any(vec2) 736.975 787.047 857.5575 832.137 926.076
max neval cld
482116.632 10 b
1013.732 10 a
I have huge amounts of data to analyze, and I tend to leave spaces between words and variable names as I write my code. So the question is: in cases where efficiency is the number one priority, does the white space have a cost?
Is c<-a+b more efficient than c <- a + b?
To a first, second, third, ..., approximation, no, it won't cost you any time at all.
The extra time you spend pressing the space bar is orders of magnitude more costly than the cost at run time (and neither matters at all).
The much more significant cost will come from any decreased readability that results from leaving out spaces, which can make code harder (for humans) to parse.
In a word, no!
library(microbenchmark)
f1 <- function(x){
j <- rnorm( x , mean = 0 , sd = 1 ) ;
k <- j * 2 ;
return( k )
}
f2 <- function(x){j<-rnorm(x,mean=0,sd=1);k<-j*2;return(k)}
microbenchmark( f1(1e3) , f2(1e3) , times= 1e3 )
Unit: microseconds
expr min lq median uq max neval
f1(1000) 110.763 112.8430 113.554 114.319 677.996 1000
f2(1000) 110.386 112.6755 113.416 114.151 5717.811 1000
#Even more runs and longer sampling
microbenchmark( f1(1e4) , f2(1e4) , times= 1e4 )
Unit: milliseconds
expr min lq median uq max neval
f1(10000) 1.060010 1.074880 1.079174 1.083414 66.791782 10000
f2(10000) 1.058773 1.074186 1.078485 1.082866 7.491616 10000
EDIT
It seems like using microbenchmark would be unfair because the expressions are parsed before they are ever run in the loop. However, using source should mean that with each iteration the sourced code must be parsed and the whitespace removed. So I saved the functions to two separate files, with the last line of each file being a call of the function; e.g. my file f2.R looks like this:
f2 <- function(x){j<-rnorm(x,mean=0,sd=1);k<-j*2;return(k)};f2(1e3)
And I test them like so:
microbenchmark( eval(source("~/Desktop/f2.R")) , eval(source("~/Desktop/f1.R")) , times = 1e3)
Unit: microseconds
expr min lq median uq max neval
eval(source("~/Desktop/f2.R")) 649.786 658.6225 663.6485 671.772 7025.662 1000
eval(source("~/Desktop/f1.R")) 687.023 697.2890 702.2315 710.111 19014.116 1000
And a visual representation of the difference with 1e4 replications....
Maybe it does make a minuscule difference in the situation where functions are repeatedly parsed but this wouldn't happen in normal use cases.
YES
But, No, not really:
TL;DR It would probably take longer just to run your script to remove the whitespace than the time saved by removing it.
@Josh O'Brien really hit the nail on the head, but I just couldn't resist benchmarking it.
As you can see, if you are dealing with something on the order of 100 MILLION lines, you will see a minuscule hindrance. HOWEVER, with that many lines there is a high likelihood of there being at least one (if not hundreds of) hotspots, and simply improving the code in one of them would buy you far more speed than grepping out all the whitespace.
library(microbenchmark)
microbenchmark(LottaSpace = eval(LottaSpace), NoSpace = eval(NoSpace), NormalSpace = eval(NormalSpace), times = 100)   # and again with times = 1e4 for the second table
# 100 times; Unit: microseconds
expr min lq median uq max
1 LottaSpace 7.526 7.9185 8.1065 8.4655 54.850
2 NormalSpace 7.504 7.9115 8.1465 8.5540 28.409
3 NoSpace 7.544 7.8645 8.0565 8.3270 12.241
# 10,000 times; Unit: microseconds
expr min lq median uq max
1 LottaSpace 7.284 7.943 8.094 8.294 47888.24
2 NormalSpace 7.182 7.925 8.078 8.276 46318.20
3 NoSpace 7.246 7.921 8.073 8.271 48687.72
WHERE:
LottaSpace <- quote({   # same code as NormalSpace, padded with extra whitespace
  a     <-     3
  b     <-     4
  c     <-     5
  for   (i in 1:7)
      i     +     i
})
NoSpace <- quote({
a<-3
b<-4
c<-5
for(i in 1:7)
i+i
})
NormalSpace <- quote({
a <- 3
b <- 4
c <- 5
for (i in 1:7)
i + i
})
The only part this can affect is the parsing of the source code into tokens. I can't imagine that the difference in parsing time would be significant. However, you can eliminate this aspect by compiling the functions using the compile or cmpfun functions of the compiler package. Then the parsing is only done once and any whitespace difference can not affect execution time.
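For example (a sketch; f1 and f2 are the functions from the earlier answer):

library(compiler)
f1c <- cmpfun(f1)   # byte-compiled: parsing happened once, at compile time
f2c <- cmpfun(f2)
microbenchmark::microbenchmark(f1c(1e3), f2c(1e3), times = 1e3)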
There should be no difference in performance, although:
fn1<-function(a,b) c<-a+b
fn2<-function(a,b) c <- a + b
library(rbenchmark)
> benchmark(fn1(1,2),fn2(1,2),replications=10000000)
test replications elapsed relative user.self sys.self user.child
1 fn1(1, 2) 10000000 53.87 1.212 53.4 0.37 NA
2 fn2(1, 2) 10000000 44.46 1.000 44.3 0.14 NA
same with microbenchmark:
Unit: nanoseconds
expr min lq median uq max neval
fn1(1, 2) 0 467 467 468 90397803 1e+07
fn2(1, 2) 0 467 467 468 85995868 1e+07
So the apparent difference in the first (rbenchmark) result was bogus.
I am new to R scripting; here is my simple problem:
how to extract the top 100 and bottom 100 values from a file in a single command.
top <- head(xdata, 100)
bottom <- tail(xdata, 100)
but I want it in a single command,
like this...
both <- head(xdata, 100) + tail(xdata, 100)
Thanks
You can do it this way, if n is the length of your data vector.
# Fake data
n <- 10^6
xdata <- runif(n)
# Get first 100 and last 100 in vector
v <- xdata[c(1:100, (n-99):n)]
You can also use tail as someone mentioned in the comments, but it is more efficient to index as I did above. To demonstrate this:
# Load microbenchmark package to compare computation speed
library(microbenchmark)
m <- microbenchmark( "direct index" = xdata[c(1:100, (n-99):n)],
"head/tail" = c(head(xdata, 100), tail(xdata, 100)))
print(m)
#Unit: microseconds
# expr min lq mean median uq max neval
#direct index 2.814 3.028 3.54298 3.422 3.6950 16.255 100
#head/tail 29.239 30.691 34.61539 31.628 33.0045 110.648 100
Indexing is 6.5X faster on my machine.
Suppose I have a 5 million row data frame with two columns, like this (the example data frame only has eight rows for simplicity):
df <- data.frame(start=c(11,21,31,41,42,54,61,63), end=c(20,30,40,50,51,63,70,72))
I want to be able to produce the following numbers in a numeric vector:
11 to 20, 21 to 30, 31 to 40, 41 to 50, 51, 54 to 63, 64 to 70, 71 to 72
And then take the length of the new vector, which in this case is 10+10+10+10+1+10+7+2 = 60.
*NOTE: I do not need the vector itself; just its length will suffice. So if someone has a more intelligent logical approach to obtain the length, that is welcome.
Essentially, what was done was: for each row in the data frame, the sequence from start to end was taken, all these sequences were combined, and the result was filtered for unique values.
So I used an approach as such:
length(unique(c(apply(df, 1, function(x) {
return(as.numeric(x[1]):as.numeric(x[2]))
}))))
which proves incredibly slow on my five million row data frame:
   user  system elapsed
 19.946   0.620  20.477
Any quicker, more efficient solutions? Bonus points for including system.time.
This should work, assuming your data is sorted.
library(dplyr) # for the lag function
with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
#[1] 60
library(microbenchmark)
microbenchmark(
beginneR={with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))},
r2evans={vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))); sum(mm[,2]-vec+1);},
times = 1000
)
Unit: microseconds
expr min lq median uq max neval
beginneR 37.398 41.4455 42.731 44.0795 74.349 1000
r2evans 31.788 35.2470 36.827 38.3925 9298.669 1000
So the matrix version is still faster, but not by much (and the conversion step is still not included here). And I wonder why the max duration in @r2evans's answer is so high compared to all the other values (which are really fast).
Another method:
mm <- as.matrix(df) ## critical for performance/scalability
(vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1))))
## [1] 11 21 31 41 51 54 64 71
sum(mm[,2] - vec + 1)
## [1] 60
(This should scale reasonably well, certainly better than data.frames.)
Edit: after I updated my code to use matrices and no apply calls, I did a quick benchmark of my implementation compared with the other answer (which is also correct):
library(microbenchmark)
library(dplyr)
microbenchmark(
beginneR={
df <- data.frame(start=c(11,21,31,41,42,54,61,63),
end=c(20,30,40,50,51,63,70,72))
with(df, sum(end - pmax(start, lag(end, 1, default = 0)+1) + 1))
},
r2evans={
mm <- matrix(c(11,21,31,41,42,54,61,63,
20,30,40,50,51,63,70,72), nc=2)
vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
sum(mm[,2]-vec+1)
}
)
## Unit: microseconds
## expr min lq median uq max neval
## beginneR 230.410 238.297 244.9015 261.228 443.574 100
## r2evans 37.791 40.725 44.7620 47.880 147.124 100
This benefits greatly from the use of matrices instead of data.frames.
Oh, and system time is not that helpful here :-)
system.time({
mm <- matrix(c(11,21,31,41,42,54,61,63,
20,30,40,50,51,63,70,72), nc=2)
vec <- pmax(mm[,1], c(0,1+head(mm[,2],n=-1)))
sum(mm[,2]-vec+1)
})
## user system elapsed
## 0 0 0