When I was re-reading Hadley's Advanced R recently, I noticed that he said in Chapter 6 that `if` can be used as a function like
`if`(i == 1, print("yes"), print("no"))
(If you have the physical book in hand, it's on Page 80)
We know that ifelse is slow (see "Does ifelse really calculate both of its vectors every time? Is it slow?") because it evaluates all of its arguments. Would `if` be a good alternative, given that it seems to evaluate only the branch that is actually taken (this is just my assumption)?
Update: Based on the answers from @Benjamin and @Roman and the comments from @Gregor and many others, ifelse seems to be the better solution for vectorized calculations. I'm accepting @Benjamin's answer because it provides a more comprehensive comparison, which is more useful to the community. However, both answers (and the comments) are worth reading.
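The assumption about which branches get evaluated can be checked directly. The sketch below is only an illustration: `mark` and `evaluated` are hypothetical names introduced here to record which branch expressions R actually evaluates.

```r
# Hypothetical helper: records a label each time a branch expression is evaluated.
evaluated <- character(0)
mark <- function(label, value) {
  evaluated <<- c(evaluated, label)
  value
}

# `if` evaluates the condition and then only the branch that is taken.
res_if <- `if`(TRUE, mark("yes", 1), mark("no", 2))
after_if <- evaluated

# ifelse() with a mixed test vector evaluates both `yes` and `no` in full,
# then picks elements from each.
evaluated <- character(0)
res_ifelse <- ifelse(c(TRUE, FALSE), mark("yes", c(1, 1)), mark("no", c(2, 2)))
after_ifelse <- evaluated
```

Here `after_if` contains only "yes", while `after_ifelse` contains both labels, because the test vector holds both a TRUE and a FALSE.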
This is more of an extended comment building on Roman's answer, but I need the code utilities to expound:
Roman is correct that if is faster than ifelse, but I am under the impression that the speed boost of if isn't particularly interesting since it isn't something that can easily be harnessed through vectorization. That is to say, if is only advantageous over ifelse when the cond/test argument is of length 1.
Consider the following function which is an admittedly weak attempt at vectorizing if without having the side effect of evaluating both the yes and no conditions as ifelse does.
ifelse2 <- function(test, yes, no){
  result <- rep(NA, length(test))
  for (i in seq_along(test)){
    result[i] <- `if`(test[i], yes[i], no[i])
  }
  result
}
ifelse2a <- function(test, yes, no){
  sapply(seq_along(test),
         function(i) `if`(test[i], yes[i], no[i]))
}
ifelse3 <- function(test, yes, no){
  result <- rep(NA, length(test))
  logic <- test
  result[logic] <- yes[logic]
  result[!logic] <- no[!logic]
  result
}
set.seed(pi)
x <- rnorm(1000)
library(microbenchmark)
microbenchmark(
standard = ifelse(x < 0, x^2, x),
modified = ifelse2(x < 0, x^2, x),
modified_apply = ifelse2a(x < 0, x^2, x),
third = ifelse3(x < 0, x^2, x),
fourth = c(x, x^2)[1L + ( x < 0 )],
fourth_modified = c(x, x^2)[seq_along(x) + length(x) * (x < 0)]
)
Unit: microseconds
expr min lq mean median uq max neval cld
standard 52.198 56.011 97.54633 58.357 68.7675 1707.291 100 ab
modified 91.787 93.254 131.34023 94.133 98.3850 3601.967 100 b
modified_apply 645.146 653.797 718.20309 661.568 676.0840 3703.138 100 c
third 20.528 22.873 76.29753 25.513 27.4190 3294.350 100 ab
fourth 15.249 16.129 19.10237 16.715 20.9675 43.695 100 a
fourth_modified 19.061 19.941 22.66834 20.528 22.4335 40.468 100 a
SOME EDITS: Thanks to Frank and Richard Scriven for noticing my shortcomings.
As you can see, the process of breaking up the vector to be suitable to pass to if is a time consuming process and ends up being slower than just running ifelse (which is probably why no one has bothered to implement my solution).
If you're really desperate for an increase in speed, you can use the ifelse3 approach above. Or better yet, Frank's less obvious* but brilliant solution.
* By 'less obvious' I mean that it took me two seconds to realize what he did. Per nicola's comment below, please note that this works only when yes and no have length 1; otherwise you'll want to stick with ifelse3.
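To make the indexing trick concrete, here is a small sketch of both variants on a three-element vector. The variable names are made up for illustration; the expressions mirror the `fourth` and `fourth_modified` entries of the benchmark.

```r
x <- c(-2, 3, -1)

# Scalar yes/no: index into a two-element lookup vector.
# (x < 0) is coerced to 0/1, so 1L + (x < 0) is 1 for FALSE and 2 for TRUE.
scalar_pick <- c("non-negative", "negative")[1L + (x < 0)]

# Vector yes/no: concatenate c(no, yes) and offset into the second half
# wherever the test is TRUE.
no  <- x
yes <- x^2
vector_pick <- c(no, yes)[seq_along(x) + length(x) * (x < 0)]
```

For vector-valued yes and no, the second form gives the same result as ifelse(x < 0, x^2, x). The first form would only ever return one of the first two elements of the concatenated vector.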
if is a primitive (compiled) function called through the .Primitive interface, while ifelse is R bytecode, so it seems that if will be faster. Running some quick benchmarks:
> microbenchmark(`if`(TRUE, "a", "b"), ifelse(TRUE, "a", "b"))
Unit: nanoseconds
expr min lq mean median uq max neval cld
if (TRUE) "a" else "b" 46 54 372.59 60.0 68.0 30007 100 a
ifelse(TRUE, "a", "b") 1212 1327 1581.62 1442.5 1617.5 11743 100 b
> microbenchmark(`if`(FALSE, "a", "b"), ifelse(FALSE, "a", "b"))
Unit: nanoseconds
expr min lq mean median uq max neval cld
if (FALSE) "a" else "b" 47 55 91.64 61.5 73 2550 100 a
ifelse(FALSE, "a", "b") 1256 1346 1688.78 1460.0 1677 17260 100 b
It seems that, not taking into account the code in the actual branches, if is at least 20x faster than ifelse. However, note that this doesn't account for the complexity of the expression being tested or for possible optimizations of it.
Update: Please note that this quick benchmark represents a very simplified and somewhat biased use case of if vs ifelse (as pointed out in the comments). While it is correct, it underrepresents the ifelse use cases; for those, Benjamin's answer seems to provide a fairer comparison.
Yes. I wrote a for loop over 152589 records: using ifelse() it took 90 min, and using if() brought that down to 25 min.
for(i in ...){
# "Case 1"
# asesorMinimo<-( dummyAsesor%>%filter(FechaAsignacion==min(FechaAsignacion)) )[1,]
# asesorRegla<-tail(dummyAsesor%>%filter( FechaAsignacion<=dumFinClase)%>%arrange(FechaAsignacion),1)
# # Assign advisor
# dummyRow<-dummyRow%>%mutate(asesorRetencion=ifelse(dim(asesorRegla)[1]==0,asesorMinimo$OperadorNombreApellido,asesorRegla$OperadorNombreApellido))
# "Case 2"
asesorRegla<-tail(dummyAsesor%>%filter( FechaAsignacion<=dumFinClase)%>%arrange(FechaAsignacion),1)
asesorMinimo<-( dummyAsesor%>%filter(FechaAsignacion==min(FechaAsignacion)) )[1,]
if(dim(asesorRegla)[1]==0){
dummyRow<-dummyRow%>%mutate(asesorRetencion=asesorMinimo[1,7])
}else{
dummyRow<-dummyRow%>%mutate(asesorRetencion=asesorRegla[1,7])
}
}
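When the condition is a single TRUE/FALSE, as in the loop above, if/else and ifelse() also differ in what they return, not just in speed. A minimal sketch, with `matches` and `fallback` as hypothetical stand-ins for a per-row lookup result and a default value:

```r
# Hypothetical stand-ins for a lookup that found nothing and a fallback value.
matches  <- character(0)
fallback <- c("first", "second")

# if/else: the condition is one TRUE/FALSE and the chosen branch is returned whole.
picked_if <- if (length(matches) == 0) fallback else matches

# ifelse(): the result has the shape of `test` (length 1 here),
# so only the first element of `yes` survives.
picked_ifelse <- ifelse(length(matches) == 0, fallback, matches)
```

So for a length-1 condition, if/else is usually the more natural choice; ifelse() is meant for element-wise selection over a vector of conditions.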
I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough (i.e., within 1 second) to an element of another list of valid time-stamps. For this I have a helper function.
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974, 344)
actual_times <- c(200, 210, 215, 220.5, 260)
strain <- rep("green", 5)          # one entry per row of actual_times
valid_or_not <- rep("NULL", 5)     # placeholder, filled in later
df <- data.frame(strain, actual_times, valid_or_not)
My data-frame looks like this:
strain actual_times valid_or_not
1 green 200.0 NULL
2 green 210.0 NULL
3 green 215.0 NULL
4 green 220.5 NULL
5 green 260.0 NULL
My helper (that checks to see if an actual_time is within 1 second of a valid time) is as follows:
valid_or_not_fxn <- function(actual_time){
  # Default label; overwritten if any valid time is within 1 second
  status <- "not valid"
  for (i in seq_along(valid_times)) {
    if (abs(valid_times[i] - actual_time) <= 1) {
      status <- "valid"
    } else {
    }
  }
  return(status)
}
What I've tried to do is loop through the entire data-frame using a for loop with this helper function.
However, it's really slow (on my real data-set) because it's a nested loop cross-comparing two lists that are 100s of elements long. I can't figure out how to optimize this.
df$valid_or_not <- as.character(df$valid_or_not)
for (i in 1:nrow(df)) {
  print(df[i, "valid_or_not"])
  df[i, "valid_or_not"] <- valid_or_not_fxn(df[i, "actual_times"])
}
Thank you for any help!
No matter what you do, you essentially have to do at least length(valid_times) comparisons. You are probably better off looping over valid_times and comparing each item of that vector to your actual_times column as a vectorised operation. That way you'd only have length(valid_times) loop iterations (6 in this example).
One way of doing this is then:
df$valid_or_not <- Reduce(`|`, lapply(valid_times, function(x) abs(df$actual_times - x) <= 1))
# strain actual_times valid_or_not
#1 green 200.0 FALSE
#2 green 210.0 FALSE
#3 green 215.0 FALSE
#4 green 220.5 TRUE
#5 green 260.0 TRUE
A test with 100K rows in df and 1000 valid_times finishes in under 4 seconds:
df2 <- df[sample(1:5,1e5,replace=TRUE),]
valid_times2 <- valid_times[sample(1:5,1000,replace=TRUE)]
system.time(Reduce(`|`, lapply(valid_times2, function(x) abs(df2$actual_times - x) <= 1)))
# user system elapsed
# 3.13 0.40 3.54
The easiest way to do it is to avoid data frame operations. You can do this check and populate the valid_or_not vector before combining the vectors into the data frame:
valid_or_not[sapply(actual_times, function(x) any(abs(x - valid_times) <= 1))] <- "valid"
Note that in this line the valid_or_not vector is indexed with a logical vector of equal length (whether the condition is satisfied, TRUE or FALSE), so only the elements at TRUE positions are updated. The valid_or_not and actual_times vectors must be the same length, whereas the valid_times vector can be of a different length.
By the way, "plying" a for loop does not enhance performance significantly, since the *apply functions are essentially wrappers around for loops. The only performance gain comes from avoiding intermediate objects, thanks to the neater and more concise style of code, and from avoiding redundant copying in some cases. The same is true of the Vectorize function: it just wraps a loop around the function, which is what is needed when, for example, the FUN passed to "outer" must be "vectorized". It does not give the performance of a truly vectorized operation. In my example the performance gain comes from replacing the inner for loop with the "any" function.
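For comparison, here is one way the check might look with no explicit loop or *apply call at all. This is a sketch that swaps in `outer` with the already-vectorized "-" operator; it is not the approach used in the answer above, and it builds a length(actual_times) by length(valid_times) matrix, so it trades memory for speed.

```r
valid_times  <- c(219.934, 229.996, 239.975, 249.935, 259.974, 344)
actual_times <- c(200, 210, 215, 220.5, 260)

# All pairwise absolute differences: rows = actual times, columns = valid times.
gaps <- abs(outer(actual_times, valid_times, "-"))

# A row is valid if at least one entry in it is within 1 second.
is_valid <- rowSums(gaps <= 1) > 0
labels <- ifelse(is_valid, "valid", "not valid")
```

Here ifelse() is a reasonable fit, since `is_valid` is a vector and both alternatives are cheap constants.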
And because of some kind of a "bug", subsetting data frames has an important penalty. As Hadley Wickham explains in Performance topic of Advanced-R:
Extracting a single value from a data frame
The following microbenchmark shows five ways to access a single value
(the number in the bottom-right corner) from the built-in mtcars
dataset. The variation in performance is startling: the slowest method
takes 30x longer than the fastest. There’s no reason that there has to
be such a huge difference in performance. It’s simply that no one has
had the time to fix it.
microbenchmark(
"[32, 11]" = mtcars[32, 11],
"$carb[32]" = mtcars$carb[32],
"[[c(11, 32)]]" = mtcars[[c(11, 32)]],
"[[11]][32]" = mtcars[[11]][32],
".subset2" = .subset2(mtcars, 11)[32] )
## Unit: nanoseconds
## expr min lq mean median uq max neval
## [32, 11] 15,300 16,300 18354 17,000 17,800 76,400 100
## $carb[32] 8,860 9,930 12836 10,600 11,600 85,400 100
## [[c(11, 32)]] 7,200 8,110 9293 8,780 9,350 21,300 100
## [[11]][32] 6,330 7,580 8377 8,100 8,690 20,900 100
## .subset2 334 566 4461 669 800 368,000 100
The most efficient way to extract a single value from a data frame is to use the .subset2 function. Your poor performance can mostly be attributed to this fact.
And as last notes:
If the "else" in your conditional statement does not do anything (as in your example: else {}), you do not have to include it. R has some lazy operations (it does not evaluate a statement until it is actually executed), but that does not mean it always skips code portions that are never executed.
The "character" values in your example are in fact categorical: only one of a few values can be chosen for each entry. So there is no need to store them as characters; they can be converted into factors (which are stored as integer codes). This can also improve performance.
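As a small illustration of the factor point, the sketch below converts a character vector of labels into a factor and shows the integer codes behind it. The vector `status` is made up for the example.

```r
status <- c("valid", "not valid", "valid")

# Fix the level order explicitly so the integer codes are predictable.
status_f <- factor(status, levels = c("not valid", "valid"))

codes  <- as.integer(status_f)   # underlying integer codes
labels <- levels(status_f)       # lookup table from code to label
```

The original strings can always be recovered with as.character(status_f).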
An addition to @thelatemail's working solution:
In R, the "or" (|) operator isn't lazy, while the "any" function is. A ply that combines or's keeps working until the end, while "any" stops at the first TRUE value it encounters, which improves performance (I will write a blog post on this topic ASAP). Vectorized "any" is almost as fast as native C code, while *ply can be only slightly faster than for loops in R (I will benchmark that and show it in another blog post soon).
Some benchmarks showing this:
Pure "any" and | comparison:
> microbenchmark(any(T,F,F,F,F,F), T|F|F|F|F|F)
Unit: nanoseconds
expr min lq mean median uq max neval cld
any(T, F, F, F, F, F) 274 307.0 545.86 366.5 429.5 16380 100 a
T | F | F | F | F | F 597 626.5 903.47 668.5 730.0 18966 100 a
Pure "Reduce" and vectorization comparison:
> vec0 <- rep(1, 1e6)
> microbenchmark(Reduce("+", vec0), sum(vec0), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("+", vec0) 308415.064 310071.953 318503.6048 312940.6355 317648.354
sum(vec0) 930.625 936.775 944.2416 943.5425 949.257
max neval cld
369864.993 10 b
962.349 10 a
And a reduced "|" vs. vectorized "any" comparison (for an extreme case). "any" beats by more than 1e5 times:
> vec1 <- c(T, rep(F, 1e6))
> microbenchmark(Reduce("|", vec1), any(vec1), times = 10)
Unit: nanoseconds
expr min lq mean median uq
Reduce("|", vec1) 394040518 395792399 402703632.6 399191803 400990304
any(vec1) 154 267 1932.5 2588 2952
max neval cld
441805451 10 b
3420 10 a
When the single TRUE is at the very end (so "any" is not lazy anymore and has to check the whole vector), "any" still beats by more than 400 times:
> vec2 <- c(rep(F, 1e6), T)
> microbenchmark(Reduce("|", vec2), any(vec2), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("|", vec2) 396625.318 401744.849 416732.5087 407447.375 424538.222
any(vec2) 736.975 787.047 857.5575 832.137 926.076
max neval cld
482116.632 10 b
1013.732 10 a
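The point that | always does the full amount of work can also be seen without timing anything. The sketch below contrasts it with the scalar short-circuit operator ||, which is not used in the benchmarks above and is brought in here only for comparison.

```r
# || short-circuits: the right-hand side is never evaluated once the left is TRUE.
short <- TRUE || stop("not reached")

# | is element-wise and evaluates both operands, so the stop() call fires.
elementwise <- tryCatch(TRUE | stop("evaluated"), error = function(e) "error")

# Reduce() with | walks the whole vector even though the first element
# already decides the answer.
reduced <- Reduce(`|`, c(TRUE, FALSE, FALSE))
```

Note that || only works on single TRUE/FALSE values, so for whole vectors "any" remains the tool to reach for.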
I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values.
How can I remove the NA values so that I can compute the max?
Trying ?max, you'll see that it actually has a na.rm = argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)
Setting na.rm=TRUE does just what you're asking for:
d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
If you do want to remove all of the NAs, use this idiom instead:
d <- d[!is.na(d)]
A final note: Other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NA's cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.
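A few of these differently named NA arguments, sketched on a toy vector (the values are made up for illustration):

```r
d <- c(3, NA, 1)

sorted_default <- sort(d)                  # NAs are dropped (na.last = NA is the default)
sorted_keep    <- sort(d, na.last = TRUE)  # NAs are kept and placed at the end

# table() ignores NA unless asked to count it.
counts <- table(c("a", "b", NA), useNA = "ifany")

avg <- mean(d, na.rm = TRUE)               # same na.rm convention as max() and sum()
```

The help page of each function lists which of these arguments it supports.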
The na.omit function is what a lot of the regression routines use internally:
vec <- 1:1000
vec[runif(200, 1, 1000)] <- NA
max(vec)
#[1] NA
max( na.omit(vec) )
#[1] 1000
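One thing worth knowing about na.omit is that its result carries an extra attribute recording what was dropped, which plain subsetting does not. A short sketch:

```r
d <- c(1, 100, NA, 10)

omitted <- na.omit(d)

# Positions of the removed values are stored in the "na.action" attribute.
dropped_at <- as.integer(attr(omitted, "na.action"))

# as.vector() strips the attribute again, leaving a plain numeric vector.
plain <- as.vector(omitted)
```

That attribute is what lets modelling functions report which observations were removed.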
Use discard from purrr (works with lists and vectors).
discard(v, is.na)
The benefit is that it is easy to use pipes; alternatively use the built-in subsetting function [:
v %>% discard(is.na)
v %>% `[`(!is.na(.))
Note that na.omit does not work on lists:
> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1
$b
[1] 2
$c
[1] NA
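If you would rather stay in base R for lists, one possible sketch is to build a logical index with vapply and anyNA. This is an alternative to discard, not something purrr does internally.

```r
x <- list(a = 1, b = 2, c = NA)

# TRUE for every list element that contains at least one NA.
has_na <- vapply(x, anyNA, logical(1))

kept <- x[!has_na]
```

Filter(Negate(anyNA), x) is an equivalent one-liner.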
?max shows you that there is an extra parameter na.rm that you can set to TRUE.
Apart from that, if you really want to remove the NAs, just use something like:
myvec[!is.na(myvec)]
Just in case someone new to R wants a simplified answer to the original question
How can I remove NA values from a vector?
Here it is:
Assume you have a vector foo as follows:
foo = c(1:10, NA, 20:30)
running length(foo) gives 22.
nona_foo = foo[!is.na(foo)]
length(nona_foo) is 21, because the NA values have been removed.
Remember that is.na(foo) returns a logical vector, so indexing foo with its negation gives you all the elements which are not NA.
You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.
I ran a quick benchmark comparing the two base approaches, and it turns out that x[!is.na(x)] is faster than na.omit. User qwr suggested I also try purrr::discard; this turned out to be massively slower (though I'll happily take comments on my implementation and test!)
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
times = 1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
purrr::map(airquality, function(x) { x[!is.na(x)] }) 66.8 75.9 130.5643 86.2 131.80 541125.5 1e+06 a
purrr::map(airquality, na.omit) 95.7 107.4 185.5108 129.3 190.50 534795.5 1e+06 b
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06 c
For reference, here's the original test of x[!is.na(x)] vs na.omit:
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
times = 1000000)
Unit: microseconds
expr min lq mean median uq max neval cld
map(airquality, function(x) { x[!is.na(x)] }) 53.0 56.6 86.48231 58.1 64.8 414195.2 1e+06 a
map(airquality, na.omit) 85.3 90.4 134.49964 92.5 104.9 348352.8 1e+06 b
Another option using complete.cases like this:
d <- c(1, 100, NA, 10)
result <- complete.cases(d)
output <- d[result]
output
#> [1] 1 100 10
max(output)
#> [1] 100
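complete.cases is most useful on data frames, where it flags the rows that have no NA in any column. A minimal sketch on a made-up data frame:

```r
df <- data.frame(
  id    = 1:4,
  score = c(10, NA, 7, 3),
  grade = c("a", "b", NA, "c")
)

keep  <- complete.cases(df)   # one TRUE/FALSE per row
clean <- df[keep, ]

best <- max(clean$score)
```

Rows 2 and 3 are dropped because each has an NA in one of its columns.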
Created on 2022-08-26 with reprex v2.0.2
I'm trying to sum the digits of integers in the last 2 columns of my data frame. I have found a function that does the summing, but I think I may have an issue with applying the function - not sure?
Dataframe
a = c("a", "b", "c")
b = c(1, 11, 2)
c = c(2, 4, 23)
data <- data.frame(a,b,c)
#Digitsum function
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
#Applying function
data[2:3] <- lapply(data[2:3], digitsum)
These are the warnings that I get:
Warning messages:
1: In 0:(nchar(as.character(x)) - 1) :
numerical expression has 3 elements: only the first used
2: In 0:(nchar(as.character(x)) - 1) :
numerical expression has 3 elements: only the first used
Your function digitsum at the moment works fine for a single scalar input, for example,
digitsum(32)
# [1] 5
But it cannot take a vector input; otherwise ":" will complain. You need to vectorize this function, using Vectorize:
vec_digitsum <- Vectorize(digitsum)
Then it works for a vector input:
b = c(1, 11, 2)
vec_digitsum(b)
# [1] 1 2 2
Now you can use lapply without trouble.
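Putting the pieces together with the data frame from the question, the call might look like this:

```r
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
vec_digitsum <- Vectorize(digitsum)

a <- c("a", "b", "c")
b <- c(1, 11, 2)
c <- c(2, 4, 23)
data <- data.frame(a, b, c)

# Each column is passed to vec_digitsum as a whole vector;
# Vectorize then calls digitsum once per element.
data[2:3] <- lapply(data[2:3], vec_digitsum)
```

After this, column b holds 1, 2, 2 and column c holds 2, 4, 5.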
@Zheyuan Li's answer solved your problem of using lapply, though I'd like to add several points:
Vectorize is just a wrapper around mapply, which doesn't give you the performance of true vectorization.
The function itself can be rewritten for much better readability (see sumdigits below):
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
vec_digitsum <- Vectorize(digitsum)
sumdigits <- function(x){
digits <- strsplit(as.character(x), "")[[1]]
sum(as.numeric(digits))
}
vec_sumdigits <- Vectorize(sumdigits)
microbenchmark::microbenchmark(digitsum(12324255231323),
sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
digitsum(12324255231323) 12.223 12.712 14.50613 13.201 13.690 96.801 100 a
sumdigits(12324255231323) 13.689 14.667 15.32743 14.668 15.157 38.134 100 a
The performance of the two versions is similar, but the second one is much easier to understand.
Interestingly, the Vectorize wrapper adds considerable overhead for a single input:
microbenchmark::microbenchmark(vec_digitsum(12324255231323),
vec_sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
vec_digitsum(12324255231323) 92.890 96.801 267.2665 100.223 108.045 16387.07 100 a
vec_sumdigits(12324255231323) 94.357 98.757 106.2705 101.445 107.556 286.00 100 a
Another advantage of this function is that if you have really big numbers in string format, it will still work (with the small modification of removing the as.character call). The first version will have problems with big numbers and may introduce errors.
Note: At first my benchmark compared the vectorized version of the OP's function with the non-vectorized version of my function, which gave me the wrong impression that my function was much faster. It turned out that the difference was caused by the Vectorize overhead.
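Since strsplit already accepts a whole character vector, the string-based version can take vector input without Vectorize. The following is only a sketch: `sumdigits_vec` is a hypothetical name, and it assumes non-negative whole numbers (or digit strings) that as.character does not render in scientific notation.

```r
# Hypothetical vector-aware variant of sumdigits().
sumdigits_vec <- function(x) {
  # strsplit() handles the whole vector at once, returning one
  # character vector of digits per input element.
  digit_lists <- strsplit(as.character(x), "")
  vapply(digit_lists, function(d) sum(as.numeric(d)), numeric(1))
}

small <- sumdigits_vec(c(1, 11, 2, 23))
big   <- sumdigits_vec("99999999999999999999")   # digits supplied as a string
```

There is still a loop inside vapply, but the mapply bookkeeping that Vectorize adds is gone.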
I was looking at the benchmarks in this answer, and wanted to compare them with diag (used in a different answer). Unfortunately, it seems that diag takes ages:
nc <- 1e4
set.seed(1)
m <- matrix(sample(letters,nc^2,replace=TRUE), ncol = nc)
microbenchmark(
diag = diag(m),
cond = m[row(m)==col(m)],
vec = m[(1:nc-1L)*nc+1:nc],
mat = m[cbind(1:nc,1:nc)],
times=10)
Comments: I tested these with identical. I took "cond" from one of the answers to this homework question. Results are similar with a matrix of integers, 1:26 instead of letters.
Results:
Unit: microseconds
expr min lq mean median uq max neval
diag 604343.469 629819.260 710371.3320 706842.3890 793144.019 837115.504 10
cond 3862039.512 3985784.025 4175724.0390 4186317.5260 4312493.742 4617117.706 10
vec 317.088 329.017 432.9099 350.1005 629.460 651.376 10
mat 272.147 292.953 441.7045 345.9400 637.506 706.860 10
It is just a matrix-subsetting operation, so I don't know why there's so much overhead. Looking inside the function, I see a few checks and then c(m)[v], where v is the same vector used in the "vec" benchmark. Timing these two...
v <- (1:nc-1L)*nc+1:nc
microbenchmark(diaglike=c(m)[v],vec=m[v])
# Unit: microseconds
# expr min lq mean median uq max neval
# diaglike 579224.436 664853.7450 720372.8105 712649.706 767281.5070 931976.707 100
# vec 334.843 339.8365 568.7808 646.799 663.5825 1445.067 100
...it seems I have found my culprit. So, the new variation on my question is: Why is there a seemingly unnecessary and very time-consuming c in diag?
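For readers wondering why the "vec" index works: R stores matrices in column-major order, so element [i, i] of a square nc x nc matrix sits at linear position (i - 1) * nc + i. A tiny sketch on a 3 x 3 matrix:

```r
nc <- 3
m <- matrix(letters[1:9], ncol = nc)

# Linear positions of the diagonal in column-major storage: 1, 5, 9.
v <- (1:nc - 1L) * nc + 1:nc

by_linear_index <- m[v]
by_index_matrix <- m[cbind(1:nc, 1:nc)]   # one (row, col) pair per matrix row
by_diag         <- diag(m)
```

All three give the same three letters; they differ only in how much work is done to get there.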
Summary
As of R version 3.2.1 (World-Famous Astronaut) diag() has received an update. The discussion moved to r-devel where it was noted that c() strips non-name attributes and may have been why it was placed there. While some people worried that removing c() would cause unknown issues on matrix-like objects, Peter Dalgaard found that, "The only case where the c() inside diag() has an effect is where M[i,j] != M[(i-1)*m+j] AND c(M) will stringize M in column-major order, so that M[i,j] == c(M)[(i-1)*m+j]."
Luke Tierney tested @Frank's removal of c() and found that it did not affect anything on CRAN or BIOC, so the change was implemented, replacing c(x)[...] with x[...] on line 27. This leads to relatively large speedups in diag(). Below is a speed test showing the improvement with R 3.2.1's version of diag().
library(microbenchmark)
nc <- 1e4
set.seed(1)
m <- matrix(sample(letters,nc^2,replace=TRUE), ncol = nc)
microbenchmark(diagOld(m),diag(m))
Unit: microseconds
expr min lq mean median uq max neval
diagOld(m) 451189.242 526622.2775 545116.5668 531905.5635 540008.704 682223.733 100
diag(m) 222.563 646.8675 644.7444 714.4575 740.701 1015.459 100
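The attribute-stripping behaviour of c() mentioned in the r-devel discussion is easy to see on a small matrix. This sketch only illustrates that behaviour; it does not reproduce the old diag() source.

```r
m <- matrix(1:6, nrow = 2, dimnames = list(c("r1", "r2"), NULL))

flat <- c(m)   # a new plain vector: dim and dimnames are gone

no_attributes <- is.null(attributes(flat))
same_element  <- m[5] == flat[5]   # linear indexing agrees either way
```

For an ordinary matrix the two indexing forms return the same elements, which is why dropping c() changed the timing but not the result.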
Suppose I have a vector that is nested in a dataframe with one or two levels. Is there a quick and dirty way to access the last value, without using the length() function? Something ala PERL's $# special var?
So I would like something like:
dat$vec1$vec2[$#]
instead of:
dat$vec1$vec2[length(dat$vec1$vec2)]
I use the tail function:
tail(vector, n=1)
The nice thing with tail is that it works on dataframes too, unlike the x[length(x)] idiom.
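To illustrate the difference on a data frame (a made-up one here): tail returns the last row, while the length idiom returns the last column, because the length of a data frame is its number of columns.

```r
df <- data.frame(id = 1:3, value = c("x", "y", "z"))

last_row <- tail(df, n = 1)     # one-row data frame
last_col <- df[length(df)]      # one-column data frame
```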
To answer this not from an aesthetical but performance-oriented point of view, I've put all of the above suggestions through a benchmark. To be precise, I've considered the suggestions
x[length(x)]
mylast(x), where mylast is a C++ function implemented through Rcpp,
tail(x, n=1)
dplyr::last(x)
x[end(x)[1]]
rev(x)[1]
and applied them to random vectors of various sizes (10^3, 10^4, 10^5, 10^6, and 10^7). Before we look at the numbers, I think it should be clear that anything that becomes noticeably slower with greater input size (i.e., anything that is not O(1)) is not an option. Here's the code that I used:
Rcpp::cppFunction('double mylast(NumericVector x) { int n = x.size(); return x[n-1]; }')
options(width=100)
for (n in c(1e3,1e4,1e5,1e6,1e7)) {
x <- runif(n);
print(microbenchmark::microbenchmark(x[length(x)],
mylast(x),
tail(x, n=1),
dplyr::last(x),
x[end(x)[1]],
rev(x)[1]))}
It gives me
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 171 291.5 388.91 337.5 390.0 3233 100
mylast(x) 1291 1832.0 2329.11 2063.0 2276.0 19053 100
tail(x, n = 1) 7718 9589.5 11236.27 10683.0 12149.0 32711 100
dplyr::last(x) 16341 19049.5 22080.23 21673.0 23485.5 70047 100
x[end(x)[1]] 7688 10434.0 13288.05 11889.5 13166.5 78536 100
rev(x)[1] 7829 8951.5 10995.59 9883.0 10890.0 45763 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 204 323.0 475.76 386.5 459.5 6029 100
mylast(x) 1469 2102.5 2708.50 2462.0 2995.0 9723 100
tail(x, n = 1) 7671 9504.5 12470.82 10986.5 12748.0 62320 100
dplyr::last(x) 15703 19933.5 26352.66 22469.5 25356.5 126314 100
x[end(x)[1]] 13766 18800.5 27137.17 21677.5 26207.5 95982 100
rev(x)[1] 52785 58624.0 78640.93 60213.0 72778.0 851113 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 214 346.0 583.40 529.5 720.0 1512 100
mylast(x) 1393 2126.0 4872.60 4905.5 7338.0 9806 100
tail(x, n = 1) 8343 10384.0 19558.05 18121.0 25417.0 69608 100
dplyr::last(x) 16065 22960.0 36671.13 37212.0 48071.5 75946 100
x[end(x)[1]] 360176 404965.5 432528.84 424798.0 450996.0 710501 100
rev(x)[1] 1060547 1140149.0 1189297.38 1180997.5 1225849.0 1383479 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 327 584.0 1150.75 996.5 1652.5 3974 100
mylast(x) 2060 3128.5 7541.51 8899.0 9958.0 16175 100
tail(x, n = 1) 10484 16936.0 30250.11 34030.0 39355.0 52689 100
dplyr::last(x) 19133 47444.5 55280.09 61205.5 66312.5 105851 100
x[end(x)[1]] 1110956 2298408.0 3670360.45 2334753.0 4475915.0 19235341 100
rev(x)[1] 6536063 7969103.0 11004418.46 9973664.5 12340089.5 28447454 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 327 722.0 1644.16 1133.5 2055.5 13724 100
mylast(x) 1962 3727.5 9578.21 9951.5 12887.5 41773 100
tail(x, n = 1) 9829 21038.0 36623.67 43710.0 48883.0 66289 100
dplyr::last(x) 21832 35269.0 60523.40 63726.0 75539.5 200064 100
x[end(x)[1]] 21008128 23004594.5 37356132.43 30006737.0 47839917.0 105430564 100
rev(x)[1] 74317382 92985054.0 108618154.55 102328667.5 112443834.0 187925942 100
This immediately rules out anything involving rev or end since they're clearly not O(1) (and the resulting expressions are evaluated in a non-lazy fashion). tail and dplyr::last are not far from being O(1) but they're also considerably slower than mylast(x) and x[length(x)]. Since mylast(x) is slower than x[length(x)] and provides no benefits (rather, it's custom and does not handle an empty vector gracefully), I think the answer is clear: Please use x[length(x)].
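On the empty-vector point: x[length(x)] indexes position 0 when x is empty, which R treats as "select nothing" rather than as an error. The sketch below shows that, plus a hypothetical guard (`last_or_na` is a name made up here) for callers that prefer an explicit NA.

```r
empty <- numeric(0)

from_length <- empty[length(empty)]   # index 0 selects nothing
from_tail   <- tail(empty, n = 1)

# Hypothetical wrapper that returns NA instead of a zero-length result.
last_or_na <- function(x) if (length(x) == 0) NA else x[length(x)]

guarded_empty <- last_or_na(empty)
guarded_full  <- last_or_na(c(4, 5, 6))
```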
If you're looking for something as nice as Python's x[-1] notation, I think you're out of luck. The standard idiom is
x[length(x)]
but it's easy enough to write a function to do this:
last <- function(x) { return( x[length(x)] ) }
This missing feature in R annoys me too!
Combining lindelof's and Gregg Lind's ideas:
last <- function(x) { tail(x, n = 1) }
Working at the prompt, I usually omit the n=, i.e. tail(x, 1).
Unlike last from the pastecs package, head and tail (from utils) work not only on vectors but also on data frames etc., and also can return data "without first/last n elements", e.g.
but.last <- function(x) { head(x, n = -1) }
(Note that you have to use head for this, instead of tail.)
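A quick sketch of what negative n does in each function:

```r
x <- 1:5

all_but_last  <- head(x, n = -1)   # drops the last element
all_but_first <- tail(x, n = -1)   # drops the first element
only_last     <- tail(x, n = 1)
```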
The dplyr package includes a function last():
last(mtcars$mpg)
# [1] 21.4
I just benchmarked these two approaches on a data frame with 663,552 rows using the following code:
system.time(
resultsByLevel$subject <- sapply(resultsByLevel$variable, function(x) {
s <- strsplit(x, ".", fixed=TRUE)[[1]]
s[length(s)]
})
)
user system elapsed
3.722 0.000 3.594
and
system.time(
resultsByLevel$subject <- sapply(resultsByLevel$variable, function(x) {
s <- strsplit(x, ".", fixed=TRUE)[[1]]
tail(s, n=1)
})
)
user system elapsed
28.174 0.000 27.662
So, assuming you're working with vectors, accessing the length position is significantly faster.
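For this particular task (taking the part after the last "."), the per-element function call may be avoidable altogether. The sketch below uses a made-up `variable` vector in place of resultsByLevel$variable and shows two options: a single vectorized regex, and one strsplit call over the whole vector.

```r
# Made-up stand-in for resultsByLevel$variable.
variable <- c("math.level1.alice", "art.bob", "carol")

# Option 1: greedy regex removes everything up to and including the last dot.
last_token_regex <- sub(".*\\.", "", variable)

# Option 2: split the whole vector once, then take the last piece of each.
pieces <- strsplit(variable, ".", fixed = TRUE)
last_token_split <- vapply(pieces, function(s) s[length(s)], character(1))
```

Both give "alice", "bob", "carol" here. Option 2 would fail on an empty string, since strsplit returns a zero-length piece for it, so that case needs separate handling.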
Another way is to take the first element of the reversed vector:
rev(dat$vec1$vec2)[1]
I have another method for finding the last element in a vector.
Say the vector is a.
> a<-c(1:100,555)
> end(a) #Gives indices of last and first positions
[1] 101 1
> a[end(a)[1]] #Gives last element in a vector
[1] 555
There you go!
The data.table package includes a last function:
library(data.table)
last(c(1:10))
# [1] 10
What about
> a <- c(1:100,555)
> a[NROW(a)]
[1] 555
The xts package provides a last function:
library(xts)
a <- 1:100
last(a)
[1] 100
As of purrr 1.0.0, pluck now accepts negative integers to index from the right:
library(purrr)
pluck(LETTERS, -1)
"Z"