Let's start with a small example in R:
x<- c(1,3,1,4,2)
max(which(x<2))
[1] 3
Now I would like to do this not just for the single value 2, but for many values simultaneously. It should give me something like this:
max(which(x<c(1,2,3,4,5,6)))
[1] NA 3 5 5 5 5
Of course I could run a for loop, but that is very slow:
test <- numeric(6)  # the result vector must exist before it is filled
for(i in c(1,2,3,4,5,6)){
test[i]<-max(which(x<i))
}
Is there a fast way to do this?
Try this:
vapply(1:6, function(i) max(which(x < i)), double(1))
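One caveat with the call above: when no element is below the threshold, which() returns an empty vector, so max() gives -Inf with a warning rather than NA. A minimal sketch of a guarded variant follows; the helper name last_index_below is made up for illustration and is not part of the answer.

```r
x <- c(1, 3, 1, 4, 2)

# Hypothetical helper: last position in `values` that is below `threshold`,
# or NA when there is no such position.
last_index_below <- function(threshold, values) {
  hits <- which(values < threshold)
  if (length(hits) == 0L) NA_integer_ else hits[length(hits)]
}

res <- vapply(1:6, last_index_below, integer(1), values = x)
res
```

Since which() returns positions in increasing order, taking the last hit is the same as taking the maximum.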
A fully vectorized approach:
x <- c(1,3,1,4,2)
y <- c(1,2,3,4,5,6)
f <- function(x, y) {
xo <- sort(unique(x))
xi <- cummax(1 + length(x) - match(xo, rev(x)))
xi[cut(y, c(xo, Inf))]
}
f(x,y)
# [1] NA 3 5 5 5 5
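To make the steps of f easier to follow, here is a sketch of the same idea with named intermediates. It swaps cut() for findInterval(..., left.open = TRUE), which is an assumption of this sketch and not the answer's code; the function and variable names are illustrative.

```r
last_below <- function(x, y) {
  xo <- sort(unique(x))                          # distinct values, ascending
  last_pos <- length(x) + 1L - match(xo, rev(x)) # last position of each distinct value
  best <- cummax(last_pos)                       # last position of any value <= xo[k]
  k <- findInterval(y, xo, left.open = TRUE)     # number of distinct values strictly below y
  k[k == 0L] <- NA_integer_                      # nothing below y: index 0 would be dropped
  best[k]
}

res <- last_below(c(1, 3, 1, 4, 2), 1:6)
res
```

The explicit NA step is needed because indexing with 0 silently drops the element, whereas cut() already yields NA for values outside its breaks.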
The advantages of full vectorization really start to kick in when both x and y are relatively long and each contains many distinct values:
x <- sample(1:1e4)
y <- 1:1e4
microbenchmark(nicola(), frank(), beauvel(), davida(), hallo(), josho(),times=5)
Unit: milliseconds
expr min lq mean median uq max neval cld
nicola() 4927.45918 4980.67901 5031.84199 4991.38240 5052.6861 5207.00330 5 d
frank() 513.05769 513.33547 552.29335 517.65783 540.9536 676.46221 5 b
beauvel() 1091.93823 1114.84647 1167.10033 1121.58251 1161.3828 1345.75158 5 c
davida() 562.71123 575.75352 585.83873 590.90048 597.0284 602.80002 5 b
hallo() 559.11618 574.60667 614.62914 624.19570 641.9639 673.26328 5 b
josho() 36.22829 36.57181 37.37892 37.52677 37.6373 38.93044 5 a
Are you looking for this?
y<-1:6
max.col(outer(y,x,">"),ties.method="last")*NA^(y<=min(x))
#[1] NA 3 5 5 5 5
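To unpack the one-liner, here is a minimal sketch that performs the same steps one at a time. The names below and last_true are illustrative, and the NA^(...) trick is replaced with an explicit assignment.

```r
x <- c(1, 3, 1, 4, 2)
y <- 1:6

# below[i, j] is TRUE when x[j] < y[i]; the matrix has length(y) * length(x) cells
below <- outer(y, x, ">")

# For each row, the column of the last TRUE (ties among the maxima go to the last one)
last_true <- max.col(below, ties.method = "last")

# Rows with no TRUE at all have no valid answer
last_true[rowSums(below) == 0] <- NA_integer_
last_true
```

Because the intermediate matrix grows with length(x) * length(y), this approach is likely to become expensive for long inputs.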
Find the max index of each value seen in x:
xvals <- unique(x)
xmaxindx <- length(x) - match(xvals,rev(x)) + 1L
Rearrange
xvals <- xvals[order(xmaxindx,decreasing=TRUE)]
xmaxindx <- xmaxindx[order(xmaxindx,decreasing=TRUE)]
# 2 4 1 3
# 5 4 3 2
Select from those:
xmaxindx[vapply(1:6,function(z){
ok <- xvals < z
if(length(ok)) which(ok)[1] else NA_integer_
},integer(1))]
# <NA> 1 2 2 2 2
# NA 3 5 5 5 5
It handily reports the values (in the first row) along with the indices (second row).
The sapply way is simpler and probably not slower:
xmaxindx[sapply(1:6,function(z) which(xvals < z)[1])]
Benchmarks. The OP's case is not fully described, but here are some benchmarks anyway:
# setup
nicola <- function() max.col(outer(y,x,">"),ties.method="last")*NA^(y<=min(x))
frank <- function(){
xvals <- unique(x)
xmaxindx <- length(x) - match(xvals,rev(x)) + 1L
xvals <- xvals[order(xmaxindx,decreasing=TRUE)]
xmaxindx <- xmaxindx[order(xmaxindx,decreasing=TRUE)]
xmaxindx[vapply(y,function(z){
ok <- xvals < z
if(length(ok)) which(ok)[1] else NA_integer_
},integer(1))]
}
beauvel <- function()
Vectorize(function(u) ifelse(length(which(x<u))==0,NA,max(which(x<u))))(y)
davida <- function() vapply(y, function(i) c(max(which(x < i)),NA)[1], double(1))
hallo <- function(){
test <- vector("integer",length(y))
for(i in y){
test[i]<-max(which(x<i))
}
test
}
josho <- function(){
xo <- sort(unique(x))
xi <- cummax(1L + length(x) - match(xo, rev(x)))
xi[cut(y, c(xo, Inf))]
}
require(microbenchmark)
(@MrHallo's and @DavidArenburg's throw a bunch of warnings the way I have them written now, but that could be fixed.) Here are some results:
> x <- sample(1:4,1e6,replace=TRUE)
> y <- 1:6
> microbenchmark(nicola(),frank(),beauvel(),davida(),hallo(),josho(),times=10)
Unit: milliseconds
expr min lq mean median uq max neval
nicola() 76.17992 78.01171 99.75596 98.43919 120.81776 127.63058 10
frank() 25.27245 25.44666 36.41508 28.44055 45.32306 73.66652 10
beauvel() 47.70081 59.47828 67.44918 68.93808 74.12869 95.20936 10
davida() 26.52582 26.55827 33.93855 30.00990 35.55436 57.24119 10
hallo() 26.58186 26.63984 32.68850 28.68163 33.54364 50.49190 10
josho() 25.69634 26.28724 37.95341 30.50828 47.90526 68.30376 10
There were 20 warnings (use warnings() to see them)
>
>
> x <- sample(1:80,1e6,replace=TRUE)
> y <- 1:60
> microbenchmark(nicola(),frank(),beauvel(),davida(),hallo(),josho(),times=10)
Unit: milliseconds
expr min lq mean median uq max neval
nicola() 2341.96795 2395.68816 2446.60612 2481.14602 2496.77128 2504.8117 10
frank() 25.67026 25.81119 42.80353 30.41979 53.19950 123.7467 10
beauvel() 665.26904 686.63822 728.48755 734.04857 753.69499 784.7280 10
davida() 326.79072 359.22803 390.66077 397.50163 420.66266 456.8318 10
hallo() 330.10586 349.40995 380.33538 389.71356 397.76407 443.0808 10
josho() 26.06863 30.76836 35.04775 31.05701 38.84259 57.3946 10
There were 20 warnings (use warnings() to see them)
>
>
> x <- sample(sample(1e5,1e1),1e6,replace=TRUE)
> y <- sample(1e5,1e4)
> microbenchmark(frank(),josho(),times=10)
Unit: milliseconds
expr min lq mean median uq max neval
frank() 69.41371 74.53816 94.41251 89.53743 107.6402 134.01839 10
josho() 35.70584 37.37200 56.42519 54.13120 63.3452 90.42475 10
Of course, comparisons might come out differently for the OP's true case.
You can use Vectorize:
func = Vectorize(function(u) ifelse(length(which(x<u))==0,NA,max(which(x<u))))
> func(1:6)
#[1] NA 3 5 5 5 5
For a given positive integer n, I want to know the fastest base R (not Rcpp) algorithm for constructing the integer vector c(1:1, 1:2, ..., 1:n), which has length n*(n+1)/2. There are bonus points for fast and memory-efficient algorithms, since I ultimately want to avoid allocating a vector of length n*n.
I'm aware of at least two approaches:
unlist(lapply(seq_len(n), seq_len), FALSE, FALSE)
{J <- .row(c(n, n)); J[upper.tri(J, TRUE)]}
the latter being particularly inefficient since it allocates two integer vectors of length n*n.
Note that if we assign the value .col(c(n, n)) to J above, then we obtain the sequence 1 2 2 3 3 3 4 4 4 4 .... This sequence can be constructed fast and efficiently with {i <- seq_len(n); rep.int(i, i)}.
I am wondering if a similarly fast (or faster) algorithm exists in the .row(c(n, n)) case, or if unlist-lapply is optimal from a base R standpoint.
FWIW, here is a benchmark of the three procedures I've mentioned so far:
## Seemingly optimal for 1 2 2 3 3 3 4 4 4 4 ...
f0 <- function(n) {i <- seq_len(n); rep.int(i, i)}
## Candidates for 1 1 2 1 2 3 1 2 3 4 ... (the sequence I actually want)
f1 <- function(n) unlist(lapply(seq_len(n), seq_len), FALSE, FALSE)
f2 <- function(n) {J <- .row(c(n, n)); J[upper.tri(J, TRUE)]}
n <- 1000L
microbenchmark::microbenchmark(f0(n), f1(n), f2(n), times = 10000L)
Unit: milliseconds
expr min lq mean median uq max neval
f0(n) 1.711873 1.797891 2.112043 1.810273 1.836636 14.96644 10000
f1(n) 1.986737 2.108630 2.472612 2.148195 2.214369 15.16282 10000
f2(n) 3.785981 4.624821 5.551115 5.051405 5.861954 17.28740 10000
(I'm aware that f1 is pretty close to f0 here, but is there something better than f1?)
I'm not sure which options you already know about, but if a function from base R is acceptable, try sequence.
f3 <- function(n) {sequence(1:n)}
It seems to be about 2 to 3 times faster than f0.
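As a quick sanity check on a small n, the sketch below compares sequence() with the unlist-lapply version from the question. It also shows an arithmetic alternative built only from cumsum and rep.int; the name f_arith is hypothetical and its speed has not been measured here.

```r
f1 <- function(n) unlist(lapply(seq_len(n), seq_len), FALSE, FALSE)
f_seq <- function(n) sequence(seq_len(n))

# Hypothetical alternative: subtract each block's starting offset from 1:total
f_arith <- function(n) {
  i <- seq_len(n)
  offsets <- cumsum(i) - i              # 0, 1, 3, 6, ...: elements before each block
  seq_len(sum(i)) - rep.int(offsets, i)
}

f_seq(4L)
# [1] 1 1 2 1 2 3 1 2 3 4
identical(f1(5L), f_seq(5L)) && identical(f1(5L), f_arith(5L))
```

All three allocate only vectors of length n*(n+1)/2 (plus some of length n), so none of them needs an n*n intermediate.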
I think sequence is the one you are after (if you are not going to use Rcpp for an even faster version):
f1 <- function(n) unlist(lapply(seq_len(n), seq_len), FALSE, FALSE)
f2 <- function(n) {
J <- .row(c(n, n))
J[upper.tri(J, TRUE)]
}
f3 <- function(n) {
v <- 1:n
data.table::rowid(rep.int(v, v))
}
f4 <- function(n) sequence(1:n)
n <- 1000L
microbenchmark::microbenchmark(f1(n), f2(n), f3(n), f4(n), check = "identical")
Benchmarking
> microbenchmark::microbenchmark(f1(n), f2(n), f3(n), f4(n), check = "identical")
Unit: microseconds
expr min lq mean median uq max neval
f1(n) 3928.8 4144.50 5185.839 4227.5 4289.15 67457.1 100
f2(n) 9490.3 10083.90 14415.777 12951.0 15080.50 78014.2 100
f3(n) 8083.5 8572.10 12154.922 9063.0 9534.45 75408.7 100
f4(n) 213.9 425.05 787.637 442.6 494.00 7844.4 100
These two may also be options:
n <- 5
unlist(purrr::map(seq(n), ~seq(.x)))
#> [1] 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5
unlist(mapply(FUN = function(.x) seq(.x), seq(n)))
#> [1] 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5
Created on 2021-12-10 by the reprex package (v2.0.1)
I am looking for a function similar to match, but it should have an inequality as test and stop at the first occurrence.
a <- c(1,3,5,7,9,12)
b <- 6
# the output should be 4
# i tried this one: match(min(a [b <= a ]),a ), but with much data it is slow
1) If you have large vectors where the search should stop near the beginning, like:
a <- c(1,3,5,7,9,12, 100:1e6)
b <- 6
other approaches could be faster.
2) Since we are benchmarking such fast functions, compilation could also make a difference, so I defined each approach as a separate function:
v1 <- function() match(min(a[b <= a ]), a)
v2 <- function() min(which(a >= b))
v3 <- function() which(b <= a)[1]
v4 <- function() match(TRUE, b <= a)
v5 <- function() Position(function(x) b <= x, a)
microbenchmark::microbenchmark(
v1(), v2(), v3(), v4(), v5()
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# v1() 10109.4 13822.10 23205.556 23481.95 26661.30 152809.4 100 c
# v2() 5796.8 6475.75 10404.439 6839.30 16991.90 34333.7 100 b
# v3() 5144.8 5648.75 9381.664 5963.05 6801.25 142202.0 100 b
# v4() 3963.0 4497.35 6938.296 4799.55 5419.80 22011.4 100 b
# v5() 4.3 14.20 35.509 16.85 18.45 1858.8 100 a
For the original data, the last approach isn't the fastest:
a <- c(1,3,5,7,9,12)
b <- 6
microbenchmark::microbenchmark(
v1(), v2(), v3(), v4(), v5()
)
# Unit: nanoseconds
# expr min lq mean median uq max neval cld
# v1() 800 900 1151 1000 1100 12600 100 b
# v2() 900 1000 1105 1000 1100 5600 100 b
# v3() 800 900 953 900 1000 2700 100 b
# v4() 400 500 584 600 600 1500 100 a
# v5() 3600 3700 3983 3800 3900 10600 100 c
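A plausible reading of these two benchmarks is that Position() tests elements one at a time and stops at the first hit, while the other variants first evaluate b <= a over the whole vector. The sketch below counts how often the predicate is called to illustrate this; the calls counter is added purely for demonstration.

```r
a <- c(1, 3, 5, 7, 9, 12, 100:1e6)
b <- 6

calls <- 0L
pos <- Position(function(v) {
  calls <<- calls + 1L   # count predicate evaluations
  b <= v
}, a)

pos     # index of the first element >= b
calls   # number of elements actually inspected

# The vectorized variants compare all of `a` before picking the first hit
full_scan <- length(b <= a)
c(match(TRUE, b <= a), which(b <= a)[1], min(which(a >= b)))
```

This would also explain why Position() loses on the six-element vector: there is little work to skip, and the per-element function call overhead dominates.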
Additional comments on this can be seen here:
Find position of first value greater than X in a vector
This was probably the fastest, but @Gregor found a faster one.
which(b <= a)[1]
# [1] 4
Benchmark
set.seed(42)
a <- sort(sample(1:100, 1e6, replace=T))
b <- 6
microbenchmark::microbenchmark(match(min(a[b <= a ]), a),
min(which(a >= b)),
which(b <= a)[1],
match(TRUE, b <= a))
Unit: milliseconds
expr min lq mean median uq max neval cld
match(min(a[b <= a]), a) 14.273060 14.770065 27.228264 26.043620 29.675095 273.45221 100 b
min(which(a >= b)) 9.537230 9.781012 13.306348 10.207344 10.445258 27.28422 100 a
which(b <= a)[1] 8.192664 8.475993 14.774976 8.869157 19.935542 267.22919 100 a
match(TRUE, b <= a) 6.045952 6.274809 9.125632 6.404545 6.845165 24.42210 100 a
You can still use match: match(TRUE, b <= a)
findInterval gives the index of b in a or the index of the largest value in a less than b if there is no match; therefore, use findInterval and then add 1 if it is not an exact match since you want the next rather than prior index. a must be sorted in ascending order.
ix <- findInterval(b, a, all.inside = TRUE)
ix <- ix + (a[ix] < b)
ix
## [1] 4
or possibly the first line is sufficient if you can change the requirements slightly. There is some question regarding how edge cases are to be handled but I have added all.inside = TRUE to handle one such situation.
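One way to sidestep the edge-case question is sketched below. It assumes a is sorted ascending and uses findInterval with left.open = TRUE to count the elements strictly below b; the helper name first_at_least and the choice to return NA when no element qualifies are assumptions of this sketch, not part of the answer.

```r
# Hypothetical helper: index of the first element of sorted `a` that is >= b,
# or NA if every element of `a` is smaller than b. Vectorized over b.
first_at_least <- function(b, a) {
  k <- findInterval(b, a, left.open = TRUE)  # how many elements of a are < b
  idx <- k + 1L
  idx[idx > length(a)] <- NA_integer_        # b is larger than everything in a
  idx
}

a <- c(1, 3, 5, 7, 9, 12)
res <- first_at_least(c(6, 7, 13, 0), a)
res
```

Because findInterval uses a binary search on sorted input, this avoids comparing b against every element of a.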
I am looking for an efficient solution for the following problem:
b <- matrix(c(0,0,0,1,1,0), nrow = 2, byrow = T)
weight <- c(1,1)
times <- 5
abc <- do.call(rbind, replicate(times, b, simplify=FALSE))
weight <- rep.int(weight,times)
sum1 <- as.numeric(rep.int(NA,nrow(abc)))
##Rprof()
for(j in 1:nrow(abc)){
a <- abc[j,]
sum1[j] <- sum(weight[rowSums(t(a == t(abc)) + 0) == ncol(abc)])
}
##Rprof(NULL)
##summaryRprof()
Is there a faster way to do this? Rprof shows that rowSums(), t(), == and + are quite slow. With 20,000 rows it takes about 21 seconds.
Thanks for helping!
Edit: I have a matrix abc and a vector weight whose length equals nrow(abc). The first value of weight corresponds to the first row of abc, and so on. I would like to determine which rows of abc are equal, record the positions of those rows, and sum the weights at those positions. I want to store the resulting sum for each row.
Here is a way that looks valid and fast:
ff <- function(mat, weights)
{
rs <- apply(mat, 1, paste, collapse = ";")
unlist(lapply(unique(rs),
function(x)
sum(weights[match(rs, x, 0) > 0])))[match(rs, unique(rs))]
}
ff(abc, weight)
# [1] 5 5 5 5 5 5 5 5 5 5
And comparing with your function:
ffOP <- function(mat, weights)
{
sum1 <- as.numeric(rep.int(NA,nrow(mat)))
for(j in 1:nrow(mat)) {
a <- mat[j,]
sum1[j] <- sum(weights[rowSums(t(a == t(mat)) + 0) == ncol(mat)])
}
sum1
}
ffOP(abc, weight)
# [1] 5 5 5 5 5 5 5 5 5 5
library(microbenchmark)
m = do.call(rbind, replicate(1e3, matrix(0:11, 3, 4), simplify = F))
set.seed(101); w = runif(1e3*3)
all.equal(ffOP(m, w), ff(m, w))
#[1] TRUE
microbenchmark(ffOP(m, w), ff(m, w), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ffOP(m, w) 969.83968 986.47941 996.68563 1015.53552 1051.23847 10
# ff(m, w) 20.42426 20.64002 21.36508 21.97182 22.59127 10
For the record, I also implemented your approach in C; here are the benchmarks:
#> microbenchmark(ffOP(m, w), ff(m, w), ffC(m, w), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ffOP(m, w) 957.66691 967.09429 991.35232 1000.53070 1016.74100 10
# ff(m, w) 20.60243 20.85578 21.70578 22.13434 23.04924 10
# ffC(m, w) 36.24618 36.40940 37.18927 37.39877 38.83358 10
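The same idea (collapse each row to a key, then sum the weights per key) can also be sketched with ave(), which returns the group sum at every position. This is an alternative illustration, not the code that was benchmarked above, and the name row_key is made up.

```r
b <- matrix(c(0, 0, 0, 1, 1, 0), nrow = 2, byrow = TRUE)
abc <- do.call(rbind, replicate(5, b, simplify = FALSE))
weight <- rep.int(c(1, 1), 5)

# One string key per row; pasting column-wise avoids apply() over rows
row_key <- do.call(paste, c(as.data.frame(abc), sep = ";"))

# Sum of the weights of all rows sharing a key, repeated for each row
sum1 <- ave(weight, row_key, FUN = sum)
sum1
```

As with ff, the string keys assume that pasting the values of a row identifies it unambiguously, which holds for integer-like entries but may need care with floating-point data.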
I know this is a very basic question, but I think I am having difficulty with the syntax.
I am doing a while loop and I'd like to use the results later. However, I do not know how to store the results in a list.
Here is a "short version" of the while loop I am trying to do.
z <- 0
while(z < 10) {
z <- z + 1
print(z)
}
How do I store the results of this while loop in a list?
Thank you!
The R answers are both disappointing in that they use the dreaded 'copy and append' pattern, the subject of the second chapter of Patrick Burns's R Inferno. The problem is that this makes n * (n-1) / 2 copies of elements as the vector is forced to grow. The first improvement is to pre-allocate and fill; the second is to let R manage things for you with lapply (list) or vapply (vector); the third is to use "vectorized" functions that implement the desired operation.
Here are some bad implementations
f1 <- function(n) {
## BAD, copy and append
res <- c()
for (i in seq_len(n))
res <- c(res, i)
res
}
f2 <- function(n) {
## BAD, copy and append
res <- c()
for (i in seq_len(n))
res[[i]] <- i
res
}
f3 <- function(n) {
## BAD copy and append
res <- c()
i <- 0
while (i < n) {
i <- i + 1
res <- c(res, i)
}
res
}
And a better implementation that still requires the user to manage the result
f4 <- function(n) {
## better, pre-allocate and fill
res <- integer(n)
for (i in seq_len(n))
res[[i]] <- i
res
}
And then implementations that allow R to do all the work
f5 <- function(n)
## better, sapply (built on lapply) manages allocation
sapply(seq_len(n), function(i) i)
f6 <- function(n)
## better, vapply manages allocation and enforces return type
vapply(seq_len(n), function(i) i, integer(1))
Here are some timings
library(microbenchmark)
n <- 100
microbenchmark(f1(n), f2(n), f3(n), f4(n), f5(n), f6(n))
## Unit: microseconds
## expr min lq median uq max neval
## f1(n) 68.857 74.3045 75.5995 76.6050 87.270 100
## f2(n) 180.174 185.1460 187.1960 191.0030 221.571 100
## f3(n) 141.022 146.0605 148.0615 151.0435 184.322 100
## f4(n) 116.976 122.0740 124.8700 127.4540 166.803 100
## f5(n) 214.319 219.9760 223.4540 227.5000 294.203 100
## f6(n) 91.871 94.3685 95.4235 96.8335 126.893 100
n <- 10000
microbenchmark(f1(n), f2(n), f3(n), f4(n), f5(n), f6(n), times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## f1(n) 226.239815 227.871791 229.115319 232.963898 274.052546 10
## f2(n) 134.979884 135.509744 136.726051 137.707050 152.690075 10
## f3(n) 185.598667 187.437479 189.442674 210.786491 333.767094 10
## f4(n) 11.523032 11.676948 11.777627 11.864006 12.099091 10
## f5(n) 14.670557 14.808911 15.041665 15.158167 15.675638 10
## f6(n) 8.295519 8.401100 8.424139 8.525598 10.374145 10
For this particular example of course there's a "vectorized" solution that is faster still
microbenchmark(f6(n), seq_len(n), times=10)
## Unit: microseconds
## expr min lq median uq max neval
## f6(n) 8240.384 9518.9940 9561.2310 9649.877 11427.134 100
## seq_len(n) 20.624 20.9535 22.0295 22.892 34.461 100
listy <- list()
z <- 0
while(z < 10) {
z <- z + 1
listy[z] <- z
print(z)
}
> listy
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 5
[[6]]
[1] 6
[[7]]
[1] 7
[[8]]
[1] 8
[[9]]
[1] 9
[[10]]
[1] 10
Maybe this helps
z <- 0
res <- c()
while(z<10) {
z <- z+1
res <- c(res, z)
}
The result is a vector, not a list, though. And this implementation is very inefficient.
If you know the number of iterations, pre-allocate, as @Dason and @Martin Morgan point out.
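A minimal sketch of that pre-allocation advice applied to the original while loop, assuming the number of iterations n is known up front (the name results is illustrative):

```r
n <- 10
results <- vector("list", n)   # pre-allocated list with n empty slots
z <- 0
while (z < n) {
  z <- z + 1
  results[[z]] <- z            # fill slot z in place, no growing
}
length(results)
```

If a plain vector is enough, numeric(n) can replace vector("list", n) with the rest of the loop unchanged.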
What language are you programming in?
Java example:
List<Integer> list = new ArrayList<>(); // needs java.util.List and java.util.ArrayList
int z = 0;
while (z < 10) { list.add(z); z++; }
Here is a rather inefficient way to extract the number.
as.integer((x%%floor(x)*100)-(signif(x%%floor(x)*100,1)))
Does anyone else have a better way?
You could try
as.integer(x*100) %% 10
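One caveat worth checking before relying on this: x*100 is computed in floating point and as.integer truncates, so a product that lands just below a whole number loses a digit. The sketch below shows one such case and a guarded variant; the helper name digit_after_point and the tolerance eps = 1e-9 are assumptions chosen for illustration, not a general fix.

```r
x <- 0.29
as.integer(x * 100) %% 10   # 0.29 * 100 is slightly below 29, so this yields 8

# Hypothetical helper: d-th digit after the decimal point, nudged by a small
# tolerance before flooring. abs() makes it usable for negative x as well.
digit_after_point <- function(x, d = 2, eps = 1e-9) {
  floor(abs(x) * 10^d + eps) %% 10
}

digit_after_point(0.29)
digit_after_point(-3.14159)
```

The tolerance trades one failure mode for another: inputs whose true value sits within eps below a digit boundary will be rounded up.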
Not sure if it's faster after all the coercion to string and back, but you can try location-based character manipulation:
as.numeric(sub("[-0-9+].[0-9]([0-9]).+","\\1",as.character(x),perl=TRUE))
as.numeric(substr(strsplit(as.character(x),".",fixed=TRUE)[[1]][2],2,2))
Edit: It turns out it's slower.
x <- runif(1)
re.fxn <- function(x) {
as.integer(sub("[0-9+].[0-9]([0-9]).+","\\1",as.character(x),perl=TRUE))
}
ss.fxn <- function(x) {
as.integer(substr(strsplit(as.character(x),".",fixed=TRUE)[[1]][2],2,2))
}
ai.fxn <- function(x) {
as.integer(x*100) %% 10
}
microbenchmark(
as.integer((x%%floor(x)*100)-(signif(x%%floor(x)*100,1))),
as.integer(x*100) %% 10,
re.fxn(x) ,
ss.fxn(x),
ai.fxn(x),
times=10
)
                                                              expr    min     lq   median     uq    max
1                                                        ai.fxn(x)   5989   6844   8555.0   8556  10266
2 as.integer((x%%floor(x) * 100) - (signif(x%%floor(x) * 100, 1)))  11121  12832  12832.5  14543  17965
3                                          as.integer(x * 100)%%10   4278   4278   4278.0   5134  11121
4                                                        re.fxn(x) 103508 103509 104364.0 106075 207017
5                                                        ss.fxn(x)  40206  41062  42344.5  42773  65869
Seems like a speed win for the modulo. And the difference only narrows slightly if you're looking for higher-order digits in a generic function robust to negative x's:
ss.fxn <- function(x,d=2) {
as.integer(substr(strsplit(as.character(x),".",fixed=TRUE)[[1]][2],d,d))
}
ai.fxn <- function(x,d=2) {
as.integer(abs(x)*10^d) %% 10
}
d <- 8
microbenchmark(
ss.fxn(x,d),
ai.fxn(x,d),
times=1000
)
expr min lq median uq max
1 ai.fxn(x, d) 6845 8555 10266 10266 75280
2 ss.fxn(x, d) 40206 41061 41062 41917 284006