With the data set here:
https://www.dropbox.com/s/gyimxbz5f3v0uq3/kfg.RData?dl=0
And executing the below code:
matrix(nrow=1600,ncol=8) -> ctw
for(k in 1:8){
  for(i in 1:1600){
    which(kfg[,9]==i) -> aj
    if(length(aj)!=0){
      sample(kfg[aj,11],prob=kfg[aj,k],size=1) -> ctw[i,k]
    }
    ctw[i,k]
  }
}
This is doable, but the real data set has over 800k rows and it takes a very long time. Is there a way in data.table or another package to do this faster? The which() step in particular is very slow.
I had to revise your original code to check for non-zero probabilities. I also removed the statement ctw[i,k] from the last line of the inner loop, because it has no effect. Your code is
matrix(nrow=1600,ncol=8) -> ctw
for(k in 1:8){
  for(i in 1:1600){
    which(kfg[,9]==i) -> aj
    if ((length(aj)!=0) && any(kfg[aj, k] > 0)) {
      sample(kfg[aj,11],prob=kfg[aj,k],size=1) -> ctw[i,k]
    }
  }
}
ctw
I reversed the order of the loops, so that kfg[,9] == i is only evaluated once instead of 8 times. I also took the test for length(aj) != 0 outside the loops using tabulate(). My revised code is
matrix(nrow=1600,ncol=8) -> ctw
which(tabulate(kfg[, 9], 1600) != 0) -> ii
for(i in ii) {
  kfg[,9] == i -> aj
  for(k in 1:8)
    if (any(kfg[aj, k] > 0))
      sample(kfg[aj,11], 1, prob=kfg[aj,k]) -> ctw[i,k]
}
ctw
This is approximately 5x faster for your sample data.
It is much faster to extract the vector of sample values kfg[,11] (i.e., kfg[[11]]) once, and to work with a matrix of probabilities, as.matrix(kfg[, 1:8]), rather than a data.frame. For the sample data it is marginally faster to hoist the split on column 9 out of the loop, and to avoid the conditional inside the k loop by using a vectorized calculation outside the loop to identify the relevant indices:
nrow <- 1600
matrix(nrow=nrow,ncol=8) -> ctw
x <- kfg[[11]]
pr <- as.matrix(kfg[,1:8])
ajs <- split(seq_len(nrow(kfg)), factor(kfg[[9]], levels=seq_len(nrow)))
ii <- seq_along(ajs)[lengths(ajs) > 0]
for(i in ii) {
  aj <- ajs[[i]]
  kk <- which(colSums(pr[aj,, drop=FALSE]) > 0)
  for(k in kk)
    sample(x[aj], 1, prob=pr[aj,k]) -> ctw[i,k]
}
ctw
These lead to a further 5x speed-up, so 25 times faster than the original.
To measure the speed, I enclosed each of the above in a function, e.g.,
f0 <- function() {
  matrix(nrow=1600,ncol=8) -> ctw
  for(k in 1:8){
    for(i in 1:1600){
      which(kfg[,9]==i) -> aj
      if ((length(aj)!=0) && any(kfg[aj, k] > 0)) {
        sample(kfg[aj,11],prob=kfg[aj,k],size=1) -> ctw[i,k]
      }
    }
  }
  ctw
}
(the two revised versions were wrapped similarly as f1() and f2()) and used the microbenchmark package
> library(microbenchmark)
> microbenchmark(f0(), f1(), f2(), times=10)
Unit: milliseconds
expr min lq mean median uq max neval cld
f0() 466.12527 483.43954 484.34258 483.74805 484.21627 521.19957 10 c
f1() 92.77415 94.79052 94.99273 95.10352 95.45368 96.10641 10 b
f2() 17.33708 17.83257 17.87095 17.87205 18.01723 18.16400 10 a
The results of f1() and f2() should be identical, but they are not
> set.seed(123); res1 <- f1(); set.seed(123); res2 <- f2()
> all.equal(res1, res2)
[1] "'is.NA' value mismatch: 12096 in current 12133 in target"
Investigating, this is because the values in column 9 are numeric but are compared (e.g., kfg[, 9] == i) as though they were integers. For instance,
> kfg[[9]][(kfg[[9]] > 28 & kfg[[9]] <= 29)]
[1] 29 29 29
> kfg[[9]][(kfg[[9]] > 28 & kfg[[9]] <= 29)] == 29
[1] FALSE FALSE FALSE
Perhaps the intention is
kfg[[9]] = round(kfg[[9]])
With this change, we have
> all.equal(res1, res2)
[1] TRUE
> identical(res1, res2)
[1] TRUE
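The question also asks about data.table; for completeness, here is a rough sketch (not benchmarked here, and the column names grp and val are made up for readability) of how the same grouped sampling could be expressed there. Note that groups absent from column 9 simply do not appear in the result, unlike the NA rows left in ctw.
library(data.table)
dt <- as.data.table(kfg)
setnames(dt, c(9, 11), c("grp", "val"))   # hypothetical names for columns 9 and 11
ctw_dt <- dt[, lapply(.SD, function(p)
                 if (any(p > 0)) val[sample.int(.N, 1, prob = p)] else NA_real_),
             by = grp, .SDcols = 1:8]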
Related
I am trying to sum the odd numbers below a specific number N (excluding N itself); for example, for N = 5: 1 + 3 = 4
a<-5
sum<-function(x){
  k<-0
  for (n in x) {
    if(n %% 2 == 1)
      k<-k+1
  }
  return(k)
}
sum(a)
# [1] 1
But the function is not working as intended, because it counts the odd numbers instead of summing them.
We may use a vectorized approach:
a1 <- head(seq_len(a), -1)
sum(a1[a1%%2 == 1])
[1] 4
If we want a loop, perhaps
f1 <- function(x) {
  s <- 0
  k <- 1
  while(k < x) {
    if(k %% 2 == 1) {
      s <- s + k
    }
    k <- k + 1
  }
  s
}
f1(5)
The issue in OP's code is
for(n in x)
where x is just a single value, so the loop body runs only once, with n equal to 5 when the input is 5. Instead, the loop should run over seq_len(x - 1). The correct loop would be something like
f2 <- function(x){
  k <- 0
  for (n in seq_len(x-1)) {
    if(n %% 2 == 1) {
      k <- k + n
    }
  }
  k
}
f2(5)
NOTE: sum is a base R function, so it is better to give the custom function a different name; otherwise the user-defined function masks base::sum, as illustrated below.
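A quick illustration of why masking base sum() is risky: with the custom definition above in the workspace, sum() no longer adds numbers.
sum(1:10)
# [1] 5     (the count of odd values, not the expected 55)
rm(sum)     # remove the masking function; base::sum is found again
sum(1:10)
# [1] 55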
Mathematically, the sum of the first k odd numbers is k^2, and there are ceiling((N - 1) / 2) odd numbers strictly below N, so we can try the following code to calculate the sum (N can be odd or even)
(ceiling((N - 1) / 2))^2
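Checking against the example from the question:
N <- 5
(ceiling((N - 1) / 2))^2
# [1] 4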
It's simple and it does what it says:
sum(seq(1, length.out = floor(N/2), by = 2))
The multiplication solution is probably going to be quicker, though.
NB - an earlier version of this answer was
sum(seq(1, N - 1, 2))
which, as @tjebo points out, silently gives the wrong answer for N = 1.
We could use logical indexing (recycling c(TRUE, FALSE)) to pick out the odd values:
a <- 5
a1 <- head(seq_len(a), -1)
sum(a1[c(TRUE, FALSE)])
output:
[1] 4
Fun benchmarking. Is it any surprise that Thomas' simple formula is by far the fastest solution...?
count_odds_thomas <- function(x){
(ceiling((x - 1) / 2))^2
}
count_odds_akrun <- function(x){
a1 <- head(seq_len(x), -1)
sum(a1[a1%%2 == 1])
}
count_odds_dash2 <- function(x){
sum(seq(1, x - 1, 2))
}
m <- microbenchmark::microbenchmark(
akrun = count_odds_akrun(10^6),
dash2 = count_odds_dash2(10^6),
thomas = count_odds_thomas(10^6)
)
m
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> akrun 22117564 26299922.0 30052362.16 28653712 31891621 70721894 100
#> dash2 4016254 4384944.0 7159095.88 4767401 8202516 52423322 100
#> thomas 439 935.5 27599.34 6223 8482 2205286 100
ggplot2::autoplot(m)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Moreover, Thomas' solution works on really big numbers (also no surprise)... on my machine, count_odds_akrun exhausts the memory at a “mere” 10^10, but Thomas' formula works fine up to Infinity…
count_odds_thomas(10^10)
#> [1] 2.5e+19
count_odds_akrun(10^10)
#> Error: vector memory exhausted (limit reached?)
Let's say you have a function that takes a number as input and outputs a vector. However, the size of the output vector depends on the input and cannot be calculated before running the function.
For example, take the famous 3N+1 algorithm. A simple implementation, returning the whole path down to 1, could look like this:
compute <- function(x) {
  if (x %% 2 == 0)
    return(x / 2)
  return(3*x + 1)
}
algo <- function(x) {
  if (x == 1)
    return(1)
  output <- x
  while(x != 1) {
    x <- compute(x)
    output <- c(output, x)
  }
  return(output)
}
The algo function returns the whole path from an input x down to 1. As you can tell, the output variable grows dynamically, using the c() (combine) function.
Are there any alternatives to this? Is growing a list faster? Should I adopt some classic dynamic-vector logic, such as initializing an empty N-sized vector and doubling it every time it fills up?
EDIT: Please don't mind trying to optimize the way my helper functions are structured. I get it, but that's not the point here! I am only concerned about the c() function and an alternative to it.
Update
As per your edit, maybe you can check the following solution
algo_TIC2 <- function(x) {
  res <- x
  repeat {
    u <- tail(res, 1)
    if (u != 1) {
      res[length(res) + 1] <- if (u %% 2) 3 * u + 1 else u / 2
    } else {
      return(res)
    }
  }
}
You can use recursion as below
compute <- function(x) if (x %% 2) 3*x + 1 else x / 2
algo_TIC1 <- function(x) {
if (x == 1) {
return(1)
}
c(x, algo_TIC1(compute(x)))
}
and you will see
> algo_TIC1(3000)
[1] 3000 1500 750 375 1126 563 1690 845 2536 1268 634 317 952 476 238
[16] 119 358 179 538 269 808 404 202 101 304 152 76 38 19 58
[31] 29 88 44 22 11 34 17 52 26 13 40 20 10 5 16
[46] 8 4 2 1
If you don't want any helper function, i.e., compute, you can try
algo_TIC1 <- function(x) {
if (x == 1) {
return(1)
}
c(x, algo_TIC1(if (x %% 2) 3*x + 1 else x / 2))
}
So, what bothers you is reallocation, and you are right. Let's see.
library(microbenchmark)
microbenchmark({
  a <- c()
  for (i in seq(1e4)) {
    a <- c(a, i)
  }
})
microbenchmark({
  a <- numeric(1e4)
  for (i in seq(1e4)) {
    a[[i]] <- i
  }
})
microbenchmark({
  a <- numeric(1)
  k <- 1
  for (i in seq(1e4)) {
    if (i > k) {
      a <- c(a, numeric(k))
      k <- k + k
    }
    a[[i]] <- i
  }
  a <- head(a, 1e4)
})
And the timings:
Append
min lq mean median uq max neval
78.0162 78.67925 83.36224 79.54515 81.79535 166.6988 100
Preallocate
min lq mean median uq max neval
1.484901 1.516051 1.567897 1.5552 1.569451 1.895601 100
Amortize
min lq mean median uq max neval
3.316501 3.377201 3.62415 3.484351 3.585701 11.7596 100
Never append many elements to a vector. If possible, preallocate, otherwise amortized allocation will do.
Even if you don't know the actual size beforehand, you may have an upper bound. Then you can still preallocate and truncate in the end. Even a reasonable estimate is useful: preallocate that size, and then resort to amortization if needed.
A remark: R is not good at loops. For small loops, for instance over the variables of a data frame or the files in a directory, there is usually no problem. But if you have a long computation that really needs many loops and cannot be vectorized, R might not be the right tool. On occasion, writing a function in C, C++, Fortran or Java can help: it's fairly easy to build plugins or to use Rcpp, and the performance gain is considerable.
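As a rough illustration of the Rcpp route (a sketch only, assuming the Rcpp package is installed; algo_cpp is a name made up here), the same 3N+1 path can be computed in C++, where std::vector::push_back() grows the result with amortized cost:
library(Rcpp)
cppFunction('
NumericVector algo_cpp(double x) {
    std::vector<double> path;
    path.push_back(x);
    while (x != 1) {
        // values are assumed to be whole numbers, so the cast is exact
        x = (((long long) x) % 2 == 0) ? x / 2 : 3 * x + 1;
        path.push_back(x);
    }
    return wrap(path);
}')
algo_cpp(3000)   # same path as algo(3000)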
You can set the length of a vector and then make assignments to specific elements. Your code would look like this:
algo2 <- function(x) {
  if (x == 1)
    return(1)
  output <- x
  index <- 1
  while(x != 1) {
    x <- compute(x)
    index <- index + 1
    if (index > length(output))
      length(output) <- 2*length(output)
    output[index] <- x
  }
  return(output[seq_len(index)])
}
This makes a difference, though not a big one in your example, because all those calls to compute() (and return()!) are quite costly. If you folded that calculation into algo you'd see more improvement. You could also initialize output to a length that is likely to be good enough for most cases, and rarely need doubling.
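For illustration, a sketch of what folding the computation into the loop might look like (algo3 is a name invented here; it uses the same doubling strategy as algo2, but without the compute()/return() call overhead):
algo3 <- function(x) {
  output <- x
  index <- 1
  while (x != 1) {
    x <- if (x %% 2 == 0) x / 2 else 3 * x + 1   # compute() inlined
    index <- index + 1
    if (index > length(output))
      length(output) <- 2 * length(output)
    output[index] <- x
  }
  output[seq_len(index)]
}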
I'm trying to write a for loop to determine the number of schools with room costs (column 34) higher than board costs (column 23).
numrows <- dim(schools)[1]
for(ii in 1:numrows){
  if(schools[ii, 34] > schools[ii, 23], na.rm = TRUE){
    nrow(numrows)
  }
}
I'm getting the following error
Error in if (schools[ii, 34] > schools[ii, 23]) { :
missing value where TRUE/FALSE needed
I did notice that some of the board costs are missing and I'd like to omit those from the comparisons. Also, I'm expecting just the number of rows that satisfy the condition.
To further demonstrate my point, here is a simple example based on a 10,000-row sample data.frame
set.seed(2018)
df <- data.frame(one = runif(10^4), two = runif(10^4))
Running a microbenchmark analysis
library(microbenchmark)
res <- microbenchmark(
vectorised = sum(df[, 1] > df[, 2]),
for_loop = {
ss <- 0
for (i in seq_len(nrow(df))) if (df[i, 1] > df[i, 2]) ss <- ss + 1
ss
})
res
# Unit: microseconds
# expr min lq mean median uq
# vectorised 59.681 65.13 78.33118 72.8305 77.9195
# for_loop 346250.957 359535.08 398508.54996 379421.2305 426452.4265
# max neval
# 152.172 100
# 608490.869 100
library(ggplot2)
autoplot(res)
Notice the nearly four orders of magnitude difference (a factor of roughly 5,000) between the for loop and the vectorised operation. Neither surprising nor interesting.
The structure of the data leading to the error
Error in if (schools[ii, 34] > schools[ii, 23]) { :
missing value where TRUE/FALSE needed
occurs when one or both of the values in the comparison is NA, because the NA propagates through the comparison x > y, e.g.,
> test = 1 > NA
> test
[1] NA
and the flow control if (test) {} can't determine whether the test is TRUE (and so the code should be executed) or FALSE
> if (test) {}
Error in if (test) { : missing value where TRUE/FALSE needed
A simple vectorized solution isn't good enough
> set.seed(123)
> n = 10; x = sample(n); y = sample(n); y[5] = NA
> sum(x > y)
[1] NA
though the 'fix' is obvious and inexpensive
> sum(x > y, na.rm = TRUE)
[1] 3
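Applied to the original question (assuming schools really holds room costs in column 34 and board costs in column 23), the count the OP is after would then simply be
sum(schools[, 34] > schools[, 23], na.rm = TRUE)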
The for loop also fails, but it is not possible (as in part of the original question) to simply add an na.rm = TRUE clause to the if statement
s = 0
for (i in seq_along(x)) {
if (x[i] > y[i], na.rm = TRUE)
s <- s + 1
}
s
because this is not syntactically valid
Error: unexpected ',' in:
"for (i in seq_along(x)) {
if (x[i] > y[i],"
so a more creative solution needs to be found, e.g., testing whether the value of the comparison is actually TRUE
s <- 0
for (i in seq_along(x)) {
if (isTRUE(x[i] > y[i]))
s <- s + 1
}
s
Of course it is not useful to compare the performance of the incorrect code; one needs to write the correct solutions first
f1 <- function(x, y)
  sum(x > y, na.rm = TRUE)
f2 <- function(x, y) {
  s <- 0
  for (i in seq_along(x))
    if (isTRUE(x[i] > y[i]))
      s <- s + 1
  s
}
f1() seems much more compact and readable compared to f2(), but we need to make sure the results are sensible
> x > y
[1] FALSE TRUE FALSE FALSE NA TRUE FALSE FALSE FALSE TRUE
> f1(x, y)
[1] 3
and the same
> identical(f1(x, y), f2(x, y))
[1] FALSE
Hey wait, what's going on? They look the same?
> f2(x, y)
[1] 3
Actually, the results are numerically equal, but f1() returns an integer value whereas f2() returns a numeric
> all.equal(f1(x, y), f2(x, y))
[1] TRUE
> class(f1(x, y))
[1] "integer"
> class(f2(x, y))
[1] "numeric"
and if we're comparing performance we really need the results to be identical -- no sense comparing apples and oranges. We can update f2() to return an integer by making sure the sum s is always an integer -- use a suffix L, e.g., 0L, to create an integer value
> class(0)
[1] "numeric"
> class(0L)
[1] "integer"
and make sure an integer 1L is added to s on each successful iteration
f2a <- function(x, y) {
  s <- 0L
  for (i in seq_along(x))
    if (isTRUE(x[i] > y[i]))
      s <- s + 1L
  s
}
We then have
> f2a(x, y)
[1] 3
> identical(f1(x, y), f2a(x, y))
[1] TRUE
and are now in a position to compare performance
> microbenchmark(f1(x, y), f2a(x, y))
Unit: microseconds
expr min lq mean median uq max neval
f1(x, y) 1.740 1.8965 2.05500 2.023 2.0975 6.741 100
f2a(x, y) 17.505 18.2300 18.67314 18.487 18.7440 34.193 100
Certainly f2a() is much slower, but for a problem of this size, since the unit is 'microseconds', maybe this doesn't matter -- how do the solutions scale with problem size?
> set.seed(123)
> x = lapply(10^(3:7), sample)
> y = lapply(10^(3:7), sample)
> f = f1; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]))
Unit: microseconds
expr min lq mean median uq max neval
f(x[[1]], y[[1]]) 9.655 9.976 10.63951 10.3250 11.1695 17.098 100
f(x[[2]], y[[2]]) 76.722 78.239 80.24091 78.9345 79.7495 125.589 100
f(x[[3]], y[[3]]) 764.034 895.075 914.83722 908.4700 922.9735 1106.027 100
> f = f2a; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]))
Unit: milliseconds
expr min lq mean median uq
f(x[[1]], y[[1]]) 1.260307 1.296196 1.417762 1.338847 1.393495
f(x[[2]], y[[2]]) 12.686183 13.167982 14.067785 13.923531 14.666305
f(x[[3]], y[[3]]) 133.639508 138.845753 144.152542 143.349102 146.913338
max neval
3.345009 100
17.713220 100
165.990545 100
They both scale linearly (not surprising), but even for lengths of 100000 f2a() doesn't seem too bad -- about 1/6th of a second -- and might be a candidate in a situation where vectorization obfuscated the code rather than clarified it. The cost of extracting individual elements from the columns of a data.frame changes this calculus, but also points to the value of operating on atomic vectors rather than complicated data structures.
For what it's worth one can think of worse implementations, especially
f3 <- function(x, y) {
s <- logical(0)
for (i in seq_along(x))
s <- c(s, isTRUE(x[i] > y[i]))
sum(s)
}
which scales quadratically
> f = f3; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]), times = 1)
Unit: milliseconds
expr min lq mean median
f(x[[1]], y[[1]]) 7.018899 7.018899 7.018899 7.018899
f(x[[2]], y[[2]]) 371.248504 371.248504 371.248504 371.248504
f(x[[3]], y[[3]]) 42528.280139 42528.280139 42528.280139 42528.280139
uq max neval
7.018899 7.018899 1
371.248504 371.248504 1
42528.280139 42528.280139 1
(because c(s, ...) copies all of s to add one element to it) and is a pattern found very often in people's code.
A second common slowdown is use of complicated data structures (like the data.frame) rather than simple data structures (like atomic vectors), e.g., comparing
f4 <- function(df) {
  s <- 0L
  x <- df[[1]]
  y <- df[[2]]
  for (i in seq_len(nrow(df))) {
    if (isTRUE(x[i] > y[i]))
      s <- s + 1L
  }
  s
}
f5 <- function(df) {
  s <- 0L
  for (i in seq_len(nrow(df))) {
    if (isTRUE(df[i, 1] > df[i, 2]))
      s <- s + 1L
  }
  s
}
with
> df <- Map(data.frame, x, y)
> identical(f1(x[[1]], y[[1]]), f4(df[[1]]))
[1] TRUE
> identical(f1(x[[1]], y[[1]]), f5(df[[1]]))
[1] TRUE
> microbenchmark(f1(x[[1]], y[[1]]), f2a(x[[1]], y[[1]]), f4(df[[1]]), f5(df[[1]]), times = 10)
Unit: microseconds
expr min lq mean median uq
f1(x[[1]], y[[1]]) 10.042 10.324 13.3511 13.4425 14.690
f2a(x[[1]], y[[1]]) 1310.186 1316.869 1480.1526 1344.8795 1386.322
f4(df[[1]]) 1329.307 1336.869 1363.4238 1358.7080 1365.427
f5(df[[1]]) 37051.756 37106.026 38187.8278 37876.0940 38416.276
max neval
20.753 10
2676.030 10
1439.402 10
42292.588 10
I wrote this function to find the factorial of a number
fact <- function(n) {
  if (n < 0){
    cat ("Sorry, factorial does not exist for negative numbers", "\n")
  } else if (n == 0){
    cat ("The factorial of 0 is 1", "\n")
  } else {
    results = 1
    for (i in 1:n){
      results = results * i
    }
    cat(paste("The factorial of", n ,"is", results, "\n"))
  }
}
Now I want to implement memoization in R. I have a basic idea of R and am trying to implement it, but I am not sure whether this is the right way forward. Could you please also elaborate on this topic? Thanks in advance.
Memoized Factorial
fact_tbl <- c(0, 1, rep(NA, 100))
fact_mem <- function(n){
stopifnot(n > 0)
if(!is.na(fib_tbl[n])){
fib_tbl[n]
} else {
fact_tbl[n-1] <<- fac_mem(n-1) * n
}
}
print (fact_mem(4))
First of all, if you need an efficient implementation, use R's factorial function. Don't write it yourself. Then, the factorial is a good exercise for understanding recursion:
myfactorial <- function(n) {
if (n == 1) return(1)
n * myfactorial(n-1)
}
myfactorial(10)
#[1] 3628800
With this function, memoization is only useful if you intend to use the function repeatedly. You can implement memoization in R using closures; Hadley explains these in his book.
createMemFactorial <- function() {
  res <- 1
  memFactorial <- function(n) {
    if (n == 1) return(1)
    # grow res if necessary
    if (length(res) < n) res <<- `length<-`(res, n)
    # return pre-calculated value
    if (!is.na(res[n])) return(res[n])
    # calculate new values
    res[n] <<- n * factorial(n-1)
    res[n]
  }
  memFactorial
}
memFactorial <- createMemFactorial()
memFactorial(10)
#[1] 3628800
Is it actually faster?
library(microbenchmark)
microbenchmark(factorial(10),
myfactorial(10),
memFactorial(10))
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# factorial(10) 235 264.0 348.02 304.5 378.5 2463 100 a
# myfactorial(10) 4799 5279.5 6491.94 5629.0 6044.5 15955 100 c
# memFactorial(10) 950 1025.0 1344.51 1134.5 1292.0 7942 100 b
Note that microbenchmark evaluates the functions (by default) 100 times. Since we have stored the value for n = 10 when testing the memFactorial, we time only the if conditions and the lookup here. As you can also see, R's implementation, which is mostly written in C, is faster.
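If you want to time the memoized function including the first, uncached computation, one option (a sketch) is to rebuild the closure inside the benchmarked expression:
microbenchmark(cold = { f <- createMemFactorial(); f(10) },
               warm = memFactorial(10))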
A better (and easier) example implements Fibonacci numbers. Here the algorithm itself benefits from memoization.
# naive recursive implementation
fib <- function(n) {
  if(n == 1 || n == 2) return(1)
  fib(n-1) + fib(n-2)
}
# with memoization
fibm <- function(n) {
  if(n == 1 || n == 2) return(1)
  seq <- integer(n)
  seq[1:2] <- 1
  calc <- function(n) {
    if (seq[n] != 0) return(seq[n])
    seq[n] <<- calc(n-1) + calc(n-2)
    seq[n]
  }
  calc(n)
}
#try it:
fib(20)
#[1] 6765
fibm(20)
#[1] 6765
#Is memoization faster?
microbenchmark(fib(20),
fibm(20))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# fib(20) 8005.314 8804.130 9758.75325 9301.6210 9798.8500 46867.182 100 b
#fibm(20) 38.991 44.798 54.12626 53.6725 60.4035 97.089 100 a
Fastest way to find the index of the second (third...) highest/lowest value in a vector or column?
i.e. what
sort(x,partial=n-1)[n-1]
is to
max()
but for
which.max()
One possible route is to use the index.return argument to sort. I'm not sure if this is the fastest though.
set.seed(21)
x <- rnorm(10)
ind <- 2
sapply(sort(x, index.return=TRUE), `[`, length(x)-ind+1)
# x ix
# 1.746222 3.000000
EDIT 2:
As Joshua pointed out, none of the given solutions actually performs correctly when there is a tie on the maxima, so:
X <- c(11:19,19)
n <- length(unique(X))
which(X == sort(unique(X),partial=n-1)[n-1])
is then the fastest way of doing it correctly. I deleted the order() approach, as that one doesn't work and is a lot slower, so it is not a good answer according to the OP.
To illustrate the issue we ran into:
> X <- c(11:19,19)
> n <- length(X)
> which(X == sort(X,partial=n-1)[n-1])
[1] 9 10 # which are the indices of the tied maximum 19
> n <- length(unique(X))
> which(X == sort(unique(X),partial=n-1)[n-1])
[1] 8 # which is the correct index of 18
The timings of the valid solutions :
> x <- runif(1000000)
> ind <- 2
> n <- length(unique(x))
> system.time(which(x == sort(unique(x),partial=n-ind+1)[n-ind+1]))
user system elapsed
0.11 0.00 0.11
> system.time(sapply(sort(unique(x), index.return=TRUE), `[`, n-ind+1))
user system elapsed
0.69 0.00 0.69
The Rfast library has implemented the nth element function with a return-index option.
UPDATE (28/Feb/21): the kit package offers a faster implementation (topn), as shown in the simulations below.
x <- runif(1e+6)
n <- 2
which_nth_highest_richie <- function(x, n)
{
for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
which(x == max(x))
}
which_nth_highest_joris <- function(x, n)
{
ux <- unique(x)
nux <- length(ux)
which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}
microbenchmark::microbenchmark(
topn = kit::topn(x, n,decreasing = T)[n],
Rfast = Rfast::nth(x,n,descending = T,index.return = T),
order = order(x, decreasing = TRUE)[n],
richie = which_nth_highest_richie(x,n),
joris = which_nth_highest_joris(x,n))
Unit: milliseconds
expr min lq mean median uq max neval
topn 3.741101 3.7917 4.517201 4.060752 5.108901 7.403901 100
Rfast 15.8121 16.7586 20.64204 17.73010 20.7083 47.6832 100
order 110.5416 113.4774 120.45807 116.84005 121.2291 164.5618 100
richie 22.7846 24.1552 39.35303 27.10075 42.0132 179.289 100
joris 131.7838 140.4611 158.20704 156.61610 165.1735 243.9258 100
topn is the clear winner in finding the index of the 2nd biggest value among 1 million numbers.
Further, simulations were run to estimate the running time of finding the nth biggest number for varying n.
The variable x was repopulated for each n, but its size was always 1 million numbers.
As shown, topn is the best option for finding the nth biggest element and its index, provided n is not too big. In the plot we can observe that topn becomes slower than Rfast's nth for bigger n.
It is worth noting that topn has not been implemented for n > 1000 and will throw an error in such cases.
Method: Set all max values to -Inf, then find the indices of the max. No sorting required.
X <- runif(1e7)
system.time(
{
X[X == max(X)] <- -Inf
which(X == max(X))
})
Works with ties and is very fast.
If you can guarantee no ties, then an even faster version is
system.time(
{
X[which.max(X)] <- -Inf
which.max(X)
})
EDIT: As Joris mentioned, this method doesn't scale that well for finding third, fourth, etc., highest values.
which_nth_highest_richie <- function(x, n)
{
for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
which(x == max(x))
}
which_nth_highest_joris <- function(x, n)
{
ux <- unique(x)
nux <- length(ux)
which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}
Using x <- runif(1e7) and n = 2, Richie wins
system.time(which_nth_highest_richie(x, 2)) #about half a second
system.time(which_nth_highest_joris(x, 2)) #about 2 seconds
For n = 100, Joris wins
system.time(which_nth_highest_richie(x, 100)) #about 20 seconds, ouch!
system.time(which_nth_highest_joris(x, 100)) #still about 2 seconds
The balance point, where they take the same length of time, is about n = 10.
No ties: which() is probably your friend here. Combine the output from the sort() solution with which() to find the index that matches the output from the sort() step.
> set.seed(1)
> x <- sample(1000, 250)
> n <- length(x)
> sort(x,partial=n-1)[n-1]
[1] 992
> which(x == sort(x,partial=n-1)[n-1])
[1] 145
Ties handling: The solution above doesn't work properly (and wasn't intended to) if there are ties and the tied values are the ith largest or larger values. We need to take the unique values of the vector before sorting them, and then the above solution works:
> set.seed(1)
> x <- sample(1000, 1000, replace = TRUE)
> length(unique(x))
[1] 639
> n <- length(x)
> i <- which(x == sort(x,partial=n-1)[n-1])
> sum(x > x[i])
[1] 0
> x.uni <- unique(x)
> n.uni <- length(x.uni)
> i <- which(x == sort(x.uni, partial = n.uni-1)[n.uni-1])
> sum(x > x[i])
[1] 2
> tail(sort(x))
[1] 994 996 997 997 1000 1000
order() is also very useful here:
> head(ord <- order(x, decreasing = TRUE))
[1] 220 145 209 202 211 163
So the solution here is ord[2] for the index of the 2nd highest/largest element of x.
Some timings:
> set.seed(1)
> X <- sample(1e7, 1e7)
> system.time({n <- length(X); which(X == sort(X, partial = n-1)[n-1])})
user system elapsed
0.319 0.058 0.378
> system.time({ord <- order(X, decreasing = TRUE); ord[2]})
user system elapsed
14.578 0.084 14.708
> system.time({order(X, decreasing = TRUE)[2]})
user system elapsed
14.647 0.084 14.779
But as the linked post was getting at and the timings above show, order() is much slower; both, however, provide the same results:
> all.equal(which(X == sort(X, partial = n-1)[n-1]),
+ order(X, decreasing = TRUE)[2])
[1] TRUE
And for the ties-handling version:
foo <- function(x, i) {
X <- unique(x)
N <- length(X)
i <- i-1
which(x == sort(X, partial = N-i)[N-i])
}
> system.time(foo(X, 2))
user system elapsed
1.249 0.176 1.454
So the extra steps slow this solution down a bit, but it is still very competitive with order().
Use the maxN function given by Zach to find the next max value, and use which() with arr.ind = TRUE.
which(x == maxN(x, 4), arr.ind = TRUE)
Using arr.ind = TRUE will return the index position in any of the above solutions as well and simplify the code.
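For reference, a maxN-style helper in the spirit of the function referred to above (a sketch, not necessarily Zach's exact code) could look like this:
maxN <- function(x, N = 2) {
  len <- length(x)
  if (N > len) stop("N must not exceed length(x)")
  sort(x, partial = len - N + 1)[len - N + 1]   # Nth highest value
}
which(x == maxN(x, 4), arr.ind = TRUE)          # indices of the 4th highest value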
This is my solution for finding the index of the top N highest values in a vector (not exactly what the OP wanted, but this might help other people)
index.top.N = function(xs, N=10){
  if(length(xs) > 0) {
    o = order(xs, na.last=FALSE)
    o.length = length(o)
    if (N > o.length) N = o.length
    o[((o.length-N+1):o.length)]
  } else {
    0
  }
}
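A quick usage example (the data here is made up): indices of the three largest values, with the overall maximum last.
v <- c(10, 40, 20, 50, 30)
index.top.N(v, 3)
# [1] 5 2 4    (the positions of 30, 40 and 50)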