Finding the position of the first TRUE of a series of `n` TRUEs in R

From a vector of TRUE/FALSE values
set.seed(1)
x = rnorm(1503501) > 0
I am looking for a fast (performant) method to get the position of the first TRUE of the first series of n TRUEs.
The vectors (x) I am dealing with contain exactly 1503501 elements (except for some that are much shorter). Below is my current solution. It uses a for loop, but for loops are extremely slow in R. Are there nicer and, especially, faster solutions?
n = 20
count = 0
solution = -1
for (i in 1:length(x)) {
  if (x[i]) {
    count = count + 1
    if (count == n) {solution = i + 1 - n; break}
  } else {count = 0}
}
print(solution)
# [1] 1182796
I was thinking about using vectorized functions and doing something like y = which(x) (or possibly y = paste(which(x))) and then searching for a particular pattern, but I am not sure how to do that.
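For illustration, here is one way that which()-based idea could be fleshed out (a sketch added here, not part of the original question; first_run_start is a made-up name). The TRUE at position w[i] starts a run of at least n TRUEs exactly when the TRUE that is n - 1 TRUEs further along in w sits n - 1 positions further along in x:
first_run_start <- function(x, n) {
  w <- which(x)                                   # positions of all TRUEs
  if (length(w) < n) return(-1)
  # distance from each TRUE to the (n-1)-th TRUE after it
  gap <- w[seq.int(n, length(w))] - w[seq.int(1, length(w) - n + 1)]
  hit <- which(gap == n - 1)[1]
  if (is.na(hit)) -1 else w[hit]
}
On a short example, first_run_start(c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE), 3) returns 5.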

You can use Rcpp:
library(Rcpp)
cppFunction('int fC(LogicalVector x, int n) {
  int xs = x.size();
  int count = 0;
  int solution = -1;
  for (int i = 0; i < xs; ++i) {
    if (x[i]) {
      if (++count == n) {solution = i + 2 - n; break;}
    } else {
      count = 0;
    }
  }
  return solution;
}')
Here is a small benchmarking study:
f1 <- function(x, n) {
  count = 0
  solution = -1
  for (i in 1:length(x)) {
    if (x[i]) {
      count = count + 1
      if (count == n) {solution = i + 1 - n; break}
    } else {count = 0}
  }
  solution
}
set.seed(1)
x = rnorm(150350100) > 0
n = 20
print(f1(x,n)==fC(x,n))
# [1] TRUE
library(rbenchmark)
benchmark(f1(x,n),fC(x,n))
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 f1(x, n) 100 80.038 180.673 63.300 16.686 0 0
# 2 fC(x, n) 100 0.443 1.000 0.442 0.000 0 0
[Updated benchmark]
# Suggested by BondedDust
tpos <- function(x, pos) {
  rl <- rle(x)
  len <- rl$lengths
  sum(len[1:(which(len == pos & rl$values == TRUE)[1] - 1)], 1)
}
set.seed(1)
x = rnorm(1503501) > 0
n = 20
print(f1(x,n)==fC(x,n))
# [1] TRUE
print(f1(x,n)==tpos(x,n))
# [1] TRUE
benchmark(f1(x,n),fC(x,n),tpos(x,n),replications = 10)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 f1(x, n) 10 4.756 110.605 4.735 0.020 0 0
# 2 fC(x, n) 10 0.043 1.000 0.043 0.000 0 0
# 3 tpos(x, n) 10 2.591 60.256 2.376 0.205 0 0

Take a look at this transcript (using only a much smaller random sample). I think it is fairly clear that it will be easy to write a function that picks out the first position satisfying the joint condition and uses cumsum on the lengths up to that point:
> x = rnorm(1500) > 0
> rle(x)
Run Length Encoding
lengths: int [1:751] 1 1 1 2 1 3 1 2 2 1 ...
values : logi [1:751] FALSE TRUE FALSE TRUE FALSE TRUE ...
> table( rle(x)$lengths )
  1   2   3   4   5   6   7   8   9 
368 193  94  46  33  10   2   4   1 
> table( rle(x)$lengths , rle(x)$values)
    FALSE TRUE
  1   175  193
  2   100   93
  3    47   47
  4    23   23
  5    21   12
  6     5    5
  7     2    0
  8     3    1
  9     0    1
> which( rle(x)$lengths==8 & rle(x)$values==TRUE)
[1] 542
> which( rle(x)$lengths==7 & rle(x)$values==TRUE)
integer(0)
> which( rle(x)$lengths==6 & rle(x)$values==TRUE)
[1] 12 484 510 720 744
This is my candidate function:
tpos <- function(x, pos) {
  rl <- rle(x)
  len <- rl$lengths
  sum(len[1:(which(len == pos & rl$values == TRUE)[1] - 1)], 1)
}
tpos(x,6)
#[1] 18
Note that I subtracted one from the first index so that the length of the first qualifying run of TRUEs would not be added in, and then added one to that sum so that the position of the first such TRUE would be calculated. I'm guessing the position of the first run of n TRUEs will be distributed as one of the extreme value distributions (although it's not always a monotonic increase).
> tpos(x,8)
[1] 1045
> tpos(x,9)
[1] 1417
> tpos(x,10)
[1] 4806
> tpos(x,11)
[1] 2845
> tpos(x,12)
Error in 1:(which(len == pos & rl$values == TRUE)[1] - 1) :
NA/NaN argument
> set.seed(1)
> x = rnorm(30000) > 0
> tpos(x,12)
[1] 23509

You can take your vector (as a 0/1 vector of integers), prepend a FALSE (zero) to the beginning, drop the last element, and add this shifted copy to your original vector. Then repeat: each time prepend one more zero to the shifted copy, drop the end, and add the result to the running sum vector (again adding as vectors of integers), until you have added up n shifted copies of your vector in total. Then do which(sum_x == n), where sum_x is the sum vector, take the minimum position returned by which(), and subtract n - 1; this gives the start of the first occurrence of n TRUEs in a row. This works much faster if n is somewhat small compared to the length of your vector.
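A rough sketch of that idea (added for illustration; it is not part of the original answer and shift_sum_first is a made-up name):
shift_sum_first <- function(x, n) {
  xi <- as.integer(x)                  # TRUE/FALSE as 1/0
  sum_x <- xi
  for (s in seq_len(n - 1)) {
    # shift a copy s positions to the right (pad with zeros, drop the end) and add it
    sum_x <- sum_x + c(integer(s), xi[seq_len(length(xi) - s)])
  }
  hits <- which(sum_x == n)            # positions where all n shifted copies are 1
  if (length(hits)) min(hits) - (n - 1) else -1
}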


round but .5 should be floored

From the R help for round(): note that for rounding off a 5, the IEC 60559 standard is expected to be used, 'go to the even digit'. Therefore round(0.5) is 0 and round(-1.5) is -2.
> round(0.5)
[1] 0
> round(1.5)
[1] 2
> round(2.5)
[1] 2
> round(3.5)
[1] 4
> round(4.5)
[1] 4
But I need all values ending in .5 to be rounded down. All other values should be rounded as they are by the round() function.
Example of the desired behaviour:
round(3.5) = 3
round(8.6) = 9
round(8.1) = 8
round(4.5) = 4
Is there a fast way to do it?
Per Dietrich Epp's comment, you can use the ceiling() function with an offset to get a fast, vectorized, correct solution:
round_down <- function(x) ceiling(x - 0.5)
round_down(seq(-2, 3, by = 0.5))
## [1] -2 -2 -1 -1 0 0 1 1 2 2 3
I think this is faster and much simpler than many of the other solutions shown here.
As noted by Carl Witthoft, this adds much more bias to your data than simple rounding. Compare:
mean(round_down(seq(-2, 3, by = 0.5)))
## [1] 0.2727273
mean(round(seq(-2, 3, by = 0.5)))
## [1] 0.4545455
mean(seq(-2, 3, by = 0.5))
## [1] 0.5
What is the application for such a rounding procedure?
Check if the remainder of x %% 1 is equal to .5 and then floor or round the numbers:
x <- seq(1, 3, 0.1)
ifelse(x %% 1 == 0.5, floor(x), round(x))
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
I'll join the circus too:
rndflr <- function(x) {
  sel <- vapply(x - floor(x), function(y) isTRUE(all.equal(y, 0.5)), FUN.VALUE = logical(1))
  x[sel] <- floor(x[sel])
  x[!sel] <- round(x[!sel])
  x
}
rndflr(c(3.5,8.6,8.1,4.5))
#[1] 3 9 8 4
The following function works by finding elements whose decimal part equals 0.5 and adding a small negative number to them before rounding, ensuring that they are rounded downwards. (It relies, harmlessly but in a slightly obfuscated manner, on the fact that a logical vector in R is converted to a vector of 0s and 1s when multiplied by a numeric vector.)
f <- function(x) {
  round(x - .1 * (x %% 1 == .5))
}
x <- c(0.5, 1, 1.5, 2, 2.5, 2.01, 2.99)
f(x)
# [1] 0 1 1 2 2 2 3
The function below (not golfed) is very simple: it checks whether the remaining decimals are .5 or less. You could easily make it more general and take the 0.5 limit as an argument:
nice.round <- function(x, myLimit = 0.5) {
  bX <- x
  intX <- as.integer(x)
  decimals <- x %% intX
  if (is.na(decimals)) {
    decimals <- 0
  }
  if (decimals <= myLimit) {
    x <- floor(x)
  } else {
    x <- round(x)
  }
  if (bX > 0.5 & bX < 1) {
    x <- 1
  }
  return(x)
}
Tests
Currently, this function does not work properly with values between 0.5 and 1.0.
> nice.round(1.5)
[1] 1
> nice.round(1.6)
[1] 2
> nice.round(10000.624541)
[1] 10001
> nice.round(0.4)
[1] 0
> nice.round(0.6)
[1] 1

Deleting multiple ranges of rows in R

Let's say I have
v <- matrix(seq(150), 50, 3)
k <- c(10, 40)
delta <- 5
How can I delete rows 10-delta to 10+delta and rows 40-delta to 40+delta simultaneously?
I used vnew <- v[-((k-delta):(k+delta)),], but it seems that this only deletes using the first element of k (which is 10) and does not delete rows 40-delta to 40+delta. Does anyone have any idea how to do this?
Oh and I will need to put this inside a loop where k is being updated in each iteration, so v[c(-{(10-delta):(10+delta)},-{(40-delta):(40+delta)}),] won't work.
If k is growing in each iteration and delta doesn't change I would suggest the following:
d <- -delta:delta
for (...) {
# ...
vnew <- v[-(rep(k, each=length(d)) + d),]
# ...
}
For your example:
d <- -5:5
k <- c(10, 40)
rep(k, each=length(d)) + d
# [1] 5 6 7 8 9 10 11 12 13 14 15 35 36 37 38 39 40 41 42 43 44 45
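Applied to the example above (a small illustration added here), the actual deletion then reads:
vnew <- v[-(rep(k, each = length(d)) + d), ]
dim(vnew)
# [1] 28  3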
EDIT: a benchmark of both solutions:
library("rbenchmark")
idx1 <- function(k, delta) {
d <- -delta:delta
lapply(seq_along(k), function(i) {
rep(k[1:i], each=length(d)) + d
})
}
idx2 <- function(k, delta) {
lapply(seq_along(k), function(i) {
c(sapply(1:i, function(ii) {
(k[ii]-delta):(k[ii]+delta)
}))
})
}
set.seed(1)
k <- sample(1e3, 1e2)
delta <- 5
all.equal(idx1(k, delta), idx2(k, delta))
# [1] TRUE
benchmark(idx1(k, delta), idx2(k, delta), order="relative", replications=100)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 idx1(k, delta) 100 0.174 1.000 0.172 0 0 0
# 2 idx2(k, delta) 100 1.579 9.075 1.576 0 0 0
Richard Scriven's answer only returns the indexes (10-delta):(10+delta) and (40-delta):(40+delta) of the rows to be removed from v. To actually remove them, you must combine it with what you tried, like this:
v[-c(sapply(seq(k), function(i) (k[i]-delta):(k[i]+delta))), ]
or shorter but dirtier(?): v[-sapply(seq(k), function(i) (k[i]-delta):(k[i]+delta)), ]

Combinatorial partial sum

I am looking in R for a function partial.sum() which takes a vector of numbers and returns an ascending sorted vector of all partial sums (the sums over every non-empty subset of the elements):
test=c(2,5,10)
partial.sum(test)
# 2 5 7 10 12 15 17
## 2 is the sum of element 2
## 5 is the sum of element 5
## 7 is the sum of elements 2 & 5
## 10 is the sum of element 10
## 12 is the sum of elements 2 & 10
## 15 is the sum of elements 5 & 10
## 17 is the sum of elements 2 & 5 & 10
Here is one using recursion (not making any claims about it being efficient either):
partial.sum <- function(x) {
  slave <- function(x) {
    if (length(x)) {
      y <- Recall(x[-1])
      c(y + 0, y + x[1])
    } else 0
  }
  sort(unique(slave(x)[-1]))
}
partial.sum(c(2,5,10))
# [1] 2 5 7 10 12 15 17
Edit: well, turns out it is a little faster than I thought:
library(microbenchmark)
x <- 1:20
microbenchmark(flodel(x), dason(x), matthew(x), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# flodel(x) 86.31128 86.9966 94.12023 125.1013 163.5824 10
# dason(x) 2407.27062 2461.2022 2633.73003 2846.2639 3031.7250 10
# matthew(x) 3084.59227 3191.7810 3304.36064 3693.8595 3883.2767 10
(I added sort and/or unique to dason and matthew's functions where appropriate for fair comparison.)
This probably doesn't scale too well and doesn't account for possible duplicates in the input vector or duplicates in the answer. You can use unique later if that is a concern for you.
partial.sum <- function(x) {
  n <- length(x)
  # Something that will help get us every possible subset
  # of the original vector
  out <- do.call(expand.grid, replicate(n, c(T, F), simplify = FALSE))
  # Don't include the case where we don't grab any elements
  out <- head(out, -1)
  # ans <- apply(out, 1, function(row) {sum(x[row])})
  # As flodel points out, the following will be faster than
  # the previous line
  ans <- data.matrix(out) %*% x
  # If you want only unique values then add a call to unique here
  ans <- sort(unname(ans))
  ans
}
Here's an iterative approach using combn to produce the combinations to sum. It works for vectors of length greater than 1.
partial.sum <- function(x) {
  sort(unique(unlist(sapply(seq_along(x), function(i) colSums(combn(x, i))))))
}
## [1] 2 5 7 10 12 15 17
To handle lengths less than 2, test for the length:
partial.sum <- function(x) {
  if (length(x) > 1) {
    sort(unique(unlist(sapply(seq_along(x), function(i) colSums(combn(x, i))))))
  } else {
    x
  }
}
Some timings, from rbenchmark, which don't entirely agree with flodel's results. I modified Dason's code by removing the comments and adding a call to unique. The version of my code used is the first one, without the if. flodel's code is the clear winner here.
> test <- 1:10
> benchmark(matthew(test), flodel(test), dason(test), replications=100)
test replications elapsed relative user.self sys.self user.child sys.child
3 dason(test) 100 0.180 12.857 0.175 0.004 0 0
2 flodel(test) 100 0.014 1.000 0.015 0.000 0 0
1 matthew(test) 100 0.244 17.429 0.242 0.001 0 0
> test <- 1:20
> benchmark(matthew(test), flodel(test), dason(test), replications=1)
test replications elapsed relative user.self sys.self user.child sys.child
3 dason(test) 1 5.231 98.698 5.158 0.058 0 0
2 flodel(test) 1 0.053 1.000 0.053 0.000 0 0
1 matthew(test) 1 2.184 41.208 2.180 0.000 0 0
> test <- 1:25
> benchmark(matthew(test), flodel(test), dason(test), replications=1)
test replications elapsed relative user.self sys.self user.child sys.child
3 dason(test) 1 288.957 163.345 264.068 23.859 0 0
2 flodel(test) 1 1.769 1.000 1.727 0.038 0 0
1 matthew(test) 1 75.712 42.799 74.745 0.847 0 0

Determine position of ith element in vector

I have a vector: a<-rep(sample(1:5,20, replace=T))
I determine the frequency of occurrence of each value:
tabulate(a)
I would now like to determine the position of the most frequently occurring values.
Let's say the vector is:
[1] 3 3 3 5 2 2 4 1 4 2 5 1 2 1 3 1 3 2 5 1
tabulate returns:
[1] 5 5 5 2 3
Now I determine the highest value returned by tabulate: max(tabulate(a))
This returns
[1] 5
There are 3 values with frequency 5. I would like to know the positions of these values in the tabulate output, i.e., in this case the first three entries of the tabulate output.
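For what it's worth, a direct way to read those positions off the tabulate output (a one-line sketch added here, using the example vector above) is:
which(tabulate(a) == max(tabulate(a)))
# [1] 1 2 3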
Perhaps it is easier to work with table:
x <- table(a)
x
# a
# 1 2 3 4 5
# 5 5 5 2 3
names(x)[x == max(x)]
# [1] "1" "2" "3"
which(a %in% names(x)[x == max(x)])
# [1] 1 2 3 5 6 8 10 12 13 14 15 16 17 18 20
Alternatively, there's a similar approach with tabulate:
x <- tabulate(a)
sort(unique(a))[x == max(x)]
Here are some benchmarks on numeric and character vectors. The difference in performance is more noticeable with numeric data.
Sample data
set.seed(1)
a <- sample(20, 1000000, replace = TRUE)
b <- sample(letters, 1000000, replace = TRUE)
Functions to benchmark
t1 <- function() {
  x <- table(a)
  out1 <- names(x)[x == max(x)]
  out1
}
t2 <- function() {
  x <- tabulate(a)
  out2 <- sort(unique(a))[x == max(x)]
  out2
}
t3 <- function() {
  x <- table(b)
  out3 <- names(x)[x == max(x)]
  out3
}
t4 <- function() {
  x <- tabulate(factor(b))
  out4 <- sort(unique(b))[x == max(x)]
  out4
}
The results
library(rbenchmark)
benchmark(t1(), t2(), t3(), t4(), replications = 50)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 t1() 50 30.548 24.244 30.416 0.064 0 0
# 2 t2() 50 1.260 1.000 1.240 0.016 0 0
# 3 t3() 50 8.919 7.079 8.740 0.160 0 0
# 4 t4() 50 5.680 4.508 5.564 0.100 0 0

Advice wanted on getting rid of loops

I have written a program that works with the 3n + 1 problem (aka "wondrous numbers" and various other things). But it has a double loop. How could I vectorize it?
The code is:
count <- vector("numeric", 100000)
L <- length(count)
for (i in 1:L) {
  x <- i
  while (x > 1) {
    if (round(x/2) == x/2) {
      x <- x/2
      count[i] <- count[i] + 1
    } else {
      x <- 3*x + 1
      count[i] <- count[i] + 1
    }
  }
}
Thanks!
I turned this 'inside-out' by creating a vector x where the ith element is the value after each iteration of the algorithm. The result is relatively intelligible as
f1 <- function(L) {
  x <- seq_len(L)
  count <- integer(L)
  while (any(i <- x > 1)) {
    count[i] <- count[i] + 1L
    x <- ifelse(round(x/2) == x/2, x / 2, 3 * x + 1) * i
  }
  count
}
This can be optimized to (a) track only those values still in play (via idx) and (b) avoid unnecessary operations; e.g., ifelse evaluates both arguments for all values of x, and x/2 is evaluated twice.
f2 <- function(L) {
  idx <- x <- seq_len(L)
  count <- integer(L)
  while (length(x)) {
    ix <- x > 1
    x <- x[ix]
    idx <- idx[ix]
    count[idx] <- count[idx] + 1L
    i <- as.logical(x %% 2)
    x[i] <- 3 * x[i] + 1
    i <- !i
    x[i] <- x[i] / 2
  }
  count
}
with f0 the original function, I have
> L <- 10000
> system.time(ans0 <- f0(L))
user system elapsed
7.785 0.000 7.812
> system.time(ans1 <- f1(L))
user system elapsed
1.738 0.000 1.741
> identical(ans0, ans1)
[1] TRUE
> system.time(ans2 <- f2(L))
user system elapsed
0.301 0.000 0.301
> identical(ans1, ans2)
[1] TRUE
A tweak is to update odd values to 3 * x[i] + 1 and then do the division by two unconditionally
x[i] <- 3 * x[i] + 1
count[idx[i]] <- count[idx[i]] + 1L
x <- x / 2
count[idx] <- count[idx] + 1
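Assembled into a complete function (a sketch pieced together here from f2 and the lines above; the original answer only shows the changed lines), f3 might read:
f3 <- function(L) {
  idx <- x <- seq_len(L)
  count <- integer(L)
  while (length(x)) {
    ix <- x > 1
    x <- x[ix]
    idx <- idx[ix]
    i <- as.logical(x %% 2)
    x[i] <- 3 * x[i] + 1                # odd values: one 3x + 1 step
    count[idx[i]] <- count[idx[i]] + 1L
    x <- x / 2                          # every value is now even: halve unconditionally
    count[idx] <- count[idx] + 1L
  }
  count
}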
With this as f3 (not sure why f2 is slower this morning!) I get
> system.time(ans2 <- f2(L))
user system elapsed
0.36 0.00 0.36
> system.time(ans3 <- f3(L))
user system elapsed
0.201 0.003 0.206
> identical(ans2, ans3)
[1] TRUE
It seems like larger steps can be taken at the divide-by-two stage: e.g., 8 is 2^3, so we could take 3 steps (add 3 to count) and be finished; 20 is 2^2 * 5, so we could take two steps and enter the next iteration at 5. Implementations?
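One possible reading of that idea (a sketch added here; f4 is a made-up name and it has not been benchmarked against the functions above) strips all factors of two from every value in one pass per outer iteration:
f4 <- function(L) {
  idx <- x <- seq_len(L)
  count <- integer(L)
  while (length(x)) {
    ix <- x > 1
    x <- x[ix]
    idx <- idx[ix]
    i <- as.logical(x %% 2)
    x[i] <- 3 * x[i] + 1                 # odd values: one 3x + 1 step
    count[idx[i]] <- count[idx[i]] + 1L
    # every remaining value is now even: divide out all factors of two,
    # crediting one step per halving
    k <- integer(length(x))
    repeat {
      even <- x %% 2 == 0
      if (!any(even)) break
      x[even] <- x[even] / 2
      k[even] <- k[even] + 1L
    }
    count[idx] <- count[idx] + k
  }
  count
}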
Because you need to iterate on values of x you can't really vectorize this. At some point, R has to work on each value of x separately and in turn. You might be able to run the computations on separate CPU cores to speed things up, perhaps using foreach in the package of the same name.
Otherwise (and this is just hiding the loop from you), wrap the main body of your loop in a function, e.g.:
wonderous <- function(n) {
  count <- 0
  while (n > 1) {
    if (isTRUE(all.equal(n %% 2, 0))) {
      n <- n / 2
    } else {
      n <- (3 * n) + 1
    }
    count <- count + 1
  }
  return(count)
}
and then you can use sapply() to run the function on a set of numbers:
> sapply(1:50, wonderous)
[1] 0 1 7 2 5 8 16 3 19 6 14 9 9 17 17
[16] 4 12 20 20 7 7 15 15 10 23 10 111 18 18 18
[31] 106 5 26 13 13 21 21 21 34 8 109 8 29 16 16
[46] 16 104 11 24 24
Or you can use Vectorize to return a vectorized version of wonderous which is itself a function that hides even more of this from you:
> wonderousV <- Vectorize(wonderous)
> wonderousV(1:50)
[1] 0 1 7 2 5 8 16 3 19 6 14 9 9 17 17
[16] 4 12 20 20 7 7 15 15 10 23 10 111 18 18 18
[31] 106 5 26 13 13 21 21 21 34 8 109 8 29 16 16
[46] 16 104 11 24 24
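The parallel route mentioned earlier could look roughly like this (a sketch added here; it assumes the foreach and doParallel packages are installed, and any speedup depends on the per-element cost):
library(foreach)
library(doParallel)
registerDoParallel(cores = 2)              # register a parallel backend
counts <- foreach(i = 1:50, .combine = c) %dopar% wonderous(i)
stopImplicitCluster()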
I think that is about as far as you can get with standard R tools at the moment. @Martin Morgan shows you can do a lot better than this with an ingenious take on solving the problem that does use R's vectorised abilities.
A different approach recognizes that one frequently revisits low numbers, so why not remember them and save the re-calculation cost?
memo_f <- function() {
  e <- new.env(parent = emptyenv())
  e[["1"]] <- 0L
  f <- function(x) {
    k <- as.character(x)
    if (!exists(k, envir = e))
      e[[k]] <- 1L + if (x %% 2) f(3L * x + 1L) else f(x / 2L)
    e[[k]]
  }
  f
}
which gives
> L <- 100
> vals <- seq_len(L)
> system.time({ f <- memo_f(); memo1 <- sapply(vals, f) })
user system elapsed
0.018 0.000 0.019
> system.time(won <- sapply(vals, wonderous))
user system elapsed
0.921 0.005 0.930
> all.equal(memo1, won) ## integer vs. numeric
[1] TRUE
This might not parallelize well, but then maybe that's not necessary with the 50x speedup? Also the recursion might get too deep, but the recursion could be written as a loop (which is probably faster, anyway).
