Fill all entries between two specified values - r

I have a long vector, thousands of entries, which has elements 0, 1, 2 in it sporadically. 0 means "no signal", 1 means "signal on", and 2 means "signal off". I am trying to find the runs from 1 to the next occurrence of 2 and fill the space with 1s. I also need to do the same thing between a 2 and the next occurrence of 1 but fill the space with 0s.
I currently have a solution for this using loops, but it is slow and inefficient.
example vector:
exp = c(1,1,1,0,0,1,2,0,2,0,1,0,2)
desired result:
1,1,1,1,1,1,2,0,0,0,1,1,2
Thank you

You could use rle & shift from the data.table-package in the following way:
library(data.table)
# the example vector from the question
x <- c(1,1,1,0,0,1,2,0,2,0,1,0,2)
# create the run-length object
rl <- rle(x)
# create indexes of the spots in the run-length object that need to be replaced
idx1 <- rl$values == 0 & shift(rl$values, fill = 0) == 1 & shift(rl$values, fill = 0, type = 'lead') %in% 1:2
idx0 <- rl$values == 2 & shift(rl$values, fill = 0) == 0 & shift(rl$values, fill = 2, type = 'lead') %in% 0:1
# replace these values
rl$values[idx1] <- 1
rl$values[idx0] <- 0
Now you will get the desired result by using inverse.rle:
> inverse.rle(rl)
[1] 1 1 1 1 1 1 2 0 0 0 1 1 2
As an alternative to the shift function, you could also use the lag and lead functions from dplyr.
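If neither package is available, the same lag/lead comparison can be sketched in base R. The helpers lag1 and lead1 and the wrapper fill_runs below are hypothetical names introduced only for this illustration; they are not part of data.table or dplyr, and the sketch assumes a non-empty vector containing only 0, 1 and 2.

```r
x <- c(1, 1, 1, 0, 0, 1, 2, 0, 2, 0, 1, 0, 2)

# Hypothetical stand-ins for shift(v, fill = ...) and shift(v, fill = ..., type = 'lead')
lag1  <- function(v, fill) c(fill, v[-length(v)])
lead1 <- function(v, fill) c(v[-1], fill)

fill_runs <- function(x) {
  rl <- rle(x)
  v  <- rl$values
  # runs of 0 that sit between a 1 and the next 1 or 2 become 1
  idx1 <- v == 0 & lag1(v, 0) == 1 & lead1(v, 0) %in% 1:2
  # runs of 2 that follow a 0 and precede a 0 or 1 become 0
  idx0 <- v == 2 & lag1(v, 0) == 0 & lead1(v, 2) %in% 0:1
  rl$values[idx1] <- 1
  rl$values[idx0] <- 0
  inverse.rle(rl)
}

fill_runs(x)
# [1] 1 1 1 1 1 1 2 0 0 0 1 1 2
```

The logic is unchanged from the data.table version; only the shifting is done by hand.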
If you want to assess the speed of both approaches, the microbenchmark package is a useful tool. Below you'll find 3 benchmarks, each for a different vector size (john() relies on the function f from the Reduce answer):
# create functions for both approaches
jaap <- function(x) {
rl <- rle(x)
idx1 <- rl$values == 0 & shift(rl$values, fill = 0) == 1 & shift(rl$values, fill = 0, type = 'lead') %in% 1:2
idx0 <- rl$values == 2 & shift(rl$values, fill = 0) == 0 & shift(rl$values, fill = 2, type = 'lead') %in% 0:1
rl$values[idx1] <- 1
rl$values[idx0] <- 0
inverse.rle(rl)
}
john <- function(x) {
Reduce(f, x, 0, accumulate = TRUE)[-1]
}
Execute the benchmarks:
# benchmark on the original data
> microbenchmark(jaap(x), john(x), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
jaap(x) 58.766 61.2355 67.99861 63.8755 72.147 143.841 100 b
john(x) 13.684 14.3175 18.71585 15.7580 23.902 50.705 100 a
# benchmark on a somewhat larger vector
> x2 <- rep(x, 10)
> microbenchmark(jaap(x2), john(x2), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
jaap(x2) 69.778 72.802 84.46945 76.9675 87.3015 184.666 100 a
john(x2) 116.858 121.058 127.64275 126.1615 130.4515 223.303 100 b
# benchmark on a very large vector
> x3 <- rep(x, 1e6)
> microbenchmark(jaap(x3), john(x3), times = 20)
Unit: seconds
expr min lq mean median uq max neval cld
jaap(x3) 1.30326 1.337878 1.389187 1.391279 1.425186 1.556887 20 a
john(x3) 10.51349 10.616632 10.689535 10.670808 10.761191 10.918953 20 b
From this you can conclude that the rle approach has an advantage when applied to vectors longer than roughly 100 elements (which is probably nearly always the case).

You could also use Reduce with the following function:
f <- function(x,y){
if(x == 1){
if(y == 2) 2 else 1
}else{
if(y == 1) 1 else 0
}
}
Then:
> x <- c(1,1,1,0,0,1,2,0,2,0,1,0,2)
> Reduce(f, x, 0, accumulate = TRUE)[-1]
[1] 1 1 1 1 1 1 2 0 0 0 1 1 2
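To make explicit what the fold is doing, here is a minimal sketch of the same state machine written as a plain loop with a preallocated result. The name fill_loop is hypothetical, and the transition rule is copied from f above; no claim is made here about its speed relative to Reduce.

```r
fill_loop <- function(x) {
  out   <- numeric(length(x))
  state <- 0                        # same initial value that is passed to Reduce
  for (i in seq_along(x)) {
    state <- if (state == 1) {
      if (x[i] == 2) 2 else 1       # signal is on: stay on until a 2 arrives
    } else {
      if (x[i] == 1) 1 else 0       # signal is off: stay off until a 1 arrives
    }
    out[i] <- state
  }
  out
}

x <- c(1, 1, 1, 0, 0, 1, 2, 0, 2, 0, 1, 0, 2)
fill_loop(x)
# [1] 1 1 1 1 1 1 2 0 0 0 1 1 2
```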

Related

For a dataset of 0's and 1's, set all but the first 1 in each row to 0's

I have a data.frame of 1,480 rows and 1,400 columns like:
1 2 3 4 5 6 ..... 1399 1400
1 0 0 0 1 0 0 ..... 1 0 #first occurrence would be at 4
2 0 0 0 0 0 1 ..... 0 1
3 1 0 0 1 0 0 ..... 0 0
## and etc
Each row contains a series of 0's and 1's - predominantly 0's. For each row, I want to find at which column the first 1 shows up and set the remaining values to 0's.
My current implementation can efficiently find the occurrence of the first 1, but I've only figured out how to zero out the remaining values iteratively by row. In repeated simulations, this iterative process is taking too long.
Here is the current implementation:
N <- length(df[which(df$arm == 0), "pt_id"]) # of patients
M <- max_days
#
# df is like the data frame shown above
#
df[which(df$arm == 0), 5:length(colnames(df))] <- unlist(lapply(matrix(data = rep(pbo_hr, M*N), nrow=N, ncol = M), rbinom, n=1, size = 1))
event_day_post_rand <- apply(df[,5:length(colnames(df))], MARGIN = 1, FUN = function(x) which (x>0)[1])
df <- add_column(df, "event_day_post_rand" = event_day_post_rand, .after = "arm_id")
##
## From here trial days start on column 6 for df
##
#zero out events that occurred after the first event, since each patient can only have 1 max event which will be taken as the earliest event
for (pt_id in df[which(!is.na(df$event_day_post_rand)),"pt_id"]){
event_idx = df[which(df$pt_id == pt_id), "event_day_post_rand"]
df[which(df$pt_id == pt_id), as.character(5+event_idx+1):"1400"] <- 0
}
We can do
mat <- as.matrix(df) ## data frame to matrix
j <- max.col(mat, ties.method = "first")
mat[] <- 0
mat[cbind(1:nrow(mat), j)] <- 1
df <- data.frame(mat) ## matrix to data frame
I also suggest storing these values in a matrix rather than a data frame. In addition, the result has at most one 1 per row, so it is very sparse. So I recommend
library(Matrix)
sparseMatrix(i = 1:nrow(mat), j = j, x = rep(1, length(j)))
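One caveat worth checking on real data: for a row that contains no 1 at all, every entry ties for the maximum, so max.col with ties.method = "first" returns column 1 and the code above would place a 1 there. A minimal sketch of a guard, assuming such all-zero rows should stay all zero (the small matrix m and the name has_one are made up for illustration):

```r
m <- matrix(c(0, 0, 1, 1,
              0, 0, 0, 0,    # a row without any event
              1, 0, 0, 1), nrow = 3, byrow = TRUE)

j       <- max.col(m, ties.method = "first")
has_one <- rowSums(m) > 0            # rows that really contain a 1

out   <- m
out[] <- 0
out[cbind(which(has_one), j[has_one])] <- 1
out
```

Only rows flagged by has_one receive a 1, so the second row remains empty.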
We can get a little more performance by leaving the matrix in place and zeroing only those 1s that are not the first in their row, i.e. the 1s whose row index is a duplicate.
Since the OP is open to starting with a matrix rather than a data.frame, I'll do the same.
# dummy data
m <- matrix(sample(0:1, 1480L*1400L, TRUE, c(0.9, 0.1)), 1480L, 1400L)
# proposed solution
f1 <- function(m) {
ones <- which(m == 1L)
m[ones[duplicated((ones - 1L) %% nrow(m), nmax = nrow(m))]] <- 0L
m
}
# Zheyuan Li's solution
f2 <- function(m) {
j <- max.col(m, ties.method = "first")
m[] <- 0L
m[cbind(1:nrow(m), j)] <- 1L
m
}
microbenchmark::microbenchmark(f1 = f1(m),
f2 = f2(m),
check = "identical")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1 9.1457 11.45020 12.04258 11.9011 12.3529 37.6716 100
#> f2 12.8424 14.92955 17.31811 15.3251 16.0550 43.6314 100
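The duplicated trick in f1 may be easier to follow on a tiny matrix. The sketch below is only an illustration with made-up data; it relies on which() returning linear indices in column-major order, so within each row the leftmost 1 is always seen first.

```r
m <- matrix(c(0L, 1L, 1L,
              1L, 0L, 1L), nrow = 2, byrow = TRUE)

ones <- which(m == 1L)            # linear (column-major) positions of the 1s: 2 3 5 6
rows <- (ones - 1L) %% nrow(m)    # zero-based row of each 1:                  1 0 0 1
dup  <- duplicated(rows)          # TRUE for every 1 after the first in its row

m[ones[dup]] <- 0L
m
#      [,1] [,2] [,3]
# [1,]    0    1    0
# [2,]    1    0    0
```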
Zheyuan Li's suggestion to go with a sparse matrix is a good idea.
# convert to a memory-efficient nsparseMatrix
library(Matrix)
m1 <- as(Matrix(f1(m), dimnames = list(NULL, NULL), sparse = TRUE), "nsparseMatrix")
object.size(m)
#> 8288216 bytes
object.size(m1)
#> 12864 bytes
# proposed function to go directly to a sparse matrix
f3 <- function(m) {
n <- nrow(m)
ones <- which(m == 1L) - 1L
i <- ones %% n
idx <- which(!duplicated(i, nmax = n))
sparseMatrix(i[idx], ones[idx] %/% n, dims = dim(m), index1 = FALSE, repr = "C")
}
# going directly to a sparse matrix using Zheyuan Li's solution
f4 <- function(m) {
sparseMatrix(1:nrow(m), max.col(m, ties.method = "first"), dims = dim(m), repr = "C")
}
identical(m1, f3(m))
#> [1] TRUE
identical(m1, f4(m))
#> [1] TRUE
microbenchmark::microbenchmark(f1 = f1(m),
f3 = f3(m),
f4 = f4(m))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1 9.1719 9.30715 11.12569 9.52300 11.92740 83.8518 100
#> f3 7.4330 7.59875 12.62412 7.69610 11.08815 84.8291 100
#> f4 8.9607 9.31115 14.01477 9.49415 11.44825 87.1577 100

How to count consecutive zero in last run?

I want to count the number of consecutive zeros in the last run of an atomic vector, if that last run consists of zeros.
For example:
a <- c(1, 0, 0, 0)
So, the number of consecutive zeros in the last run is 3.
If the last run is not made of zeros, then the answer must be zero. For example
a <- c(0, 1, 1, 0, 0, 1)
So, the answer is zero because the last run contains a one, not zeros.
I do not want to use any external package. I managed to write a function that uses a loop, but I think a more efficient method must exist.
czero <- function(a) {
k = 0
for(i in 1:length(a)){
if(a[i] == 0) {
k = k + 1
} else k = 0
}
return(k)
}
Reverse a and then compute its cumulative sum. Assuming a has no negative entries, the only 0's left in the cumulative sum are the leading ones, so ! of it is TRUE for each of those and FALSE for all other elements. The sum of that is the desired number.
sum(!cumsum(rev(a)))
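The intermediate steps can be sketched as follows. The wrapper trailing_zeros is a hypothetical name added here for illustration; it accumulates a non-zero indicator instead of the values themselves, which is one way to stay correct if the vector might contain negative numbers.

```r
a <- c(1, 0, 0, 0)
rev(a)                  # 0 0 0 1
cumsum(rev(a))          # 0 0 0 1
!cumsum(rev(a))         # TRUE TRUE TRUE FALSE
sum(!cumsum(rev(a)))    # 3

# Hypothetical variant: accumulate "is non-zero" rather than the raw values
trailing_zeros <- function(v) sum(!cumsum(rev(v) != 0))

trailing_zeros(a)              # 3
trailing_zeros(c(1, -1, 0))    # 1
sum(!cumsum(rev(c(1, -1, 0)))) # 2, because -1 and 1 cancel in the running sum
```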
The simplest improvement is to start your loop from the end of the vector and work backwards, instead of starting from the front. You can then save time by exiting the loop at the first non-zero element, instead of looping through the whole vector.
I've checked this against the given vectors, and a much longer vector with a small number of zeros at the end, to show a case where looping from the start takes a lot of time.
a <- c(1, 0, 0, 0)
b <- c(0, 1, 1, 0, 0, 1)
long <- rep(c(0, 1, 0, 1, 0), c(4, 6, 5, 10000, 3))
czero is the original function, f1 is the solution by akrun that uses rle, fczero starts the loop from the end, and revczero reverses the vector, then starts from the front.
czero <- function(a) {
k = 0
for(i in 1:length(a)){
if(a[i] == 0) {
k = k + 1
} else k = 0
}
return(k)
}
f1 <- function(vec){
pmax(0, with(rle(vec), lengths[values == 0 &
seq_along(values) == length(values)])[1], na.rm = TRUE)
}
fczero <- function(vec) {
k <- 0L
for (i in length(vec):1) {
if (vec[i] != 0) break
k <- k + 1L
}
return(k)
}
revczero <- function(vec) {
revd <- rev(vec)
k <- 0L
for (i in 1:length(vec)) {
if (revd[i] != 0) break
k <- k + 1L
}
return(k)
}
Time benchmarks are below. EDIT: I've also added Grothendieck's version.
microbenchmark::microbenchmark(czero(a), f1(a), fczero(a), revczero(a), sum(!cumsum(rev(a))), times = 1000)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# czero(a) 0 514 621.035 514 515 21076 1000
# f1(a) 21590 23133 34455.218 27245 30843 3211826 1000
# fczero(a) 0 514 688.892 514 515 28274 1000
# revczero(a) 2570 3085 4626.047 3599 4626 112064 1000
# sum(!cumsum(rev(a))) 2056 2571 3879.630 3085 3599 62201 1000
microbenchmark::microbenchmark(czero(b), f1(b), fczero(b), revczero(b), sum(!cumsum(rev(b))), times = 1000)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# czero(b) 0 514 809.691 514 515 29815 1000
# f1(b) 22104 23647 29372.227 24675 26217 1319583 1000
# fczero(b) 0 0 400.502 0 514 26217 1000
# revczero(b) 2056 2571 3844.176 3085 3599 99727 1000
# sum(!cumsum(rev(b))) 2056 2570 3592.281 3084 3598.5 107952 1000
microbenchmark::microbenchmark(czero(long), f1(long), fczero(long), revczero(long), sum(!cumsum(rev(long))), times = 1000)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# czero(long) 353156 354699 422077.536 383486 443631.0 1106250 1000
# f1(long) 112579 119775 168408.616 132627 165269.5 2068050 1000
# fczero(long) 0 514 855.444 514 1028.0 43695 1000
# revczero(long) 24161 27245 35890.991 29301 36498.0 149591 1000
# sum(!cumsum(rev(long))) 49350 53462 71035.486 56546 71454 2006363 1000
We can use rle
f1 <- function(vec){
pmax(0, with(rle(vec), lengths[values == 0 &
seq_along(values) == length(values)])[1], na.rm = TRUE)
}
f1(a)
#[1] 3
In the second case,
b <- c(0, 1, 1, 0, 0, 1)
f1(b)
#[1] 0
Or another option is to create a function with which and cumsum
f2 <- function(vec) {
i1 <- which(!vec)
if(!length(i1) || i1[length(i1)] != length(vec)) 0 else { # also covers vectors without any zero
sum(!cumsum(rev(c(TRUE, diff(i1) != 1)))) + 1
}
}
f2(a)
f2(b)
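A shorter variant of the which idea, offered only as a sketch: the trailing run of zeros starts right after the last non-zero position, so its length is the vector length minus that position. The name czero_which is hypothetical, and the sketch assumes the vector has no NA values (which() silently drops them).

```r
czero_which <- function(v) {
  # max(0L, integer(0)) is 0, so an all-zero vector returns its full length
  length(v) - max(0L, which(v != 0))
}

czero_which(c(1, 0, 0, 0))         # 3
czero_which(c(0, 1, 1, 0, 0, 1))   # 0
czero_which(c(0, 0, 0))            # 3
```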
With data.table:
ifelse(last(a) == 0,
sum(rleid(a) == last(rleid(a))),
0)
This works because rleid numbers the runs:
> rleid(a)
[1] 1 2 2 2
so the sum is the length of the last run, which is returned only if the last value is 0.

Repeating calculation based on conditions

What I am trying to do is pretty simple. However, I am new to R and have not learned much about loops and functions and am not sure what is the most efficient way to get the results. Basically, I want to count the number of rows that meet my conditions and do a division. Here is an example:
df1 <- data.frame(
Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076),
B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537),
C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662),
D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382)
)
df1
# Main B C D
# 1 0.008900 NA 0.018100 -0.012700
# 2 -0.050667 0.034500 0.000000 -0.025995
# 3 -0.030379 -0.068300 -0.056197 -0.042930
# 4 0.066484 -0.052774 0.040794 0.057816
# 5 0.006439 0.014661 0.035160 0.033458
# 6 -0.026076 -0.040537 -0.022662 -0.058382
My criterion for the numerator is to count the values of B/C/D that are > 0 when Main is > 0; for the denominator, count the values of B/C/D that are != 0 when Main is != 0. I can use length(which(df1$Main >0 & df1$B>0)) / length(which(df1$Main !=0 & df1$B !=0)) to get the ratio for each column individually. But my data set has many more columns, and I am wondering if there is a way to get those ratios all at once so that my result looks like:
# B C D
# 1 0.2 0.6 0.3
Use apply:
apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0)) / length(which(df1$Main !=0 & x !=0)))
You could do this vectorized (No apply or for is needed):
tail(colSums(df1[df1$Main>0,]>0, na.rm = T) / colSums(df1[df1$Main!=0,]!=0, na.rm = T), -1)
# B C D
#0.2000000 0.6000000 0.3333333
One way to do this would be with a for loop that loops over the columns and applies the function that you wrote. Something like this:
ratio1<-vector()
for(i in 2:ncol(df1)){
ratio1[i-1] <- length(which(df1$Main >0 & df1[,i]>0)) / length(which(df1$Main !=0 & df1[,i] !=0))
}
Maybe there is a better way to do this with apply or data.table, but this is a simple solution that I can come up with. Works on any number of columns. Use round() if you want the answer in one decimal.
criteria1 <- df1[which(df1$Main > 0), -1] > 0
criteria2 <- df1[which(df1$Main != 0), -1] != 0
colSums(criteria1, na.rm = T)/colSums(criteria2, na.rm = T)
## B C D
## 0.2000000 0.6000000 0.3333333
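To see where these numbers come from, the numerator and denominator counts can be inspected separately. This is only a worked illustration on the question's data; the names pos, nz, num and den are made up here.

```r
df1 <- data.frame(
  Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076),
  B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537),
  C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662),
  D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382)
)

pos <- df1[which(df1$Main > 0), -1] > 0     # logical matrix, rows where Main > 0
nz  <- df1[which(df1$Main != 0), -1] != 0   # logical matrix, rows where Main != 0

num <- colSums(pos, na.rm = TRUE)   # B = 1, C = 3, D = 2
den <- colSums(nz,  na.rm = TRUE)   # B = 5, C = 5, D = 6
round(num / den, 1)
#   B   C   D
# 0.2 0.6 0.3
```

The NA in column B is dropped by na.rm = TRUE in both counts, which is why the denominator for B is 5 rather than 6.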
Edit: It appears Niek's method is quickest for this specific data
# Unit: microseconds
# expr min lq mean median uq max neval
# Jim(df1) 216.468 230.0585 255.3755 239.8920 263.6870 802.341 300
# emilliman5(df1) 120.109 135.5510 155.9018 142.4615 156.0135 1961.931 300
# Niek(df1) 97.118 107.6045 123.5204 111.1720 119.6155 1966.830 300
# nine89(df1) 211.683 222.6660 257.6510 232.2545 252.6570 2246.225 300
#[[1]]
# [,1] [,2] [,3] [,4]
#median 239.892 142.462 111.172 232.255
#ratio 1.000 0.594 0.463 0.968
#diff 0.000 -97.430 -128.720 -7.637
However, when there are many columns the vectorized approach is quicker.
Nrow <- 1000
Ncol <- 1000
mat <- matrix(runif(Nrow*Ncol),Nrow)
df1 <- data.frame(Main = sample(-2:2,Nrow,T), mat) #1001 columns
#Unit: milliseconds
# expr min lq mean median uq max
# Jim(df1) 46.75627 53.88500 66.93513 56.58143 62.04375 185.0460
#emilliman5(df1) 73.35257 91.87283 151.38991 178.53188 185.06860 292.5571
# Niek(df1) 68.17073 76.68351 89.51625 80.14190 86.45726 200.7119
# nine89(df1) 51.36117 56.79047 74.53088 60.07220 66.34270 191.8294
#[[1]]
# [,1] [,2] [,3] [,4]
#median 56.581 178.532 80.142 60.072
#ratio 1.000 3.155 1.416 1.062
#diff 0.000 121.950 23.560 3.491
Functions used in the benchmarks:
Jim <- function(df1){
criteria1 <- df1[which(df1$Main > 0), -1] > 0
criteria2 <- df1[which(df1$Main != 0), -1] != 0
colSums(criteria1, na.rm = T)/colSums(criteria2, na.rm = T)
}
emilliman5 <- function(df1){
apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0)) / length(which(df1$Main !=0 & x !=0)))
}
Niek <- function(df1){
ratio1<-vector()
for(i in 2:ncol(df1)){
ratio1[i-1] <- length(which(df1$Main >0 & df1[,i]>0)) / length(which(df1$Main !=0 & df1[,i] !=0))
}
ratio1
}
nine89 <- function(df){
tail(colSums(df[df$Main>0,]>0, na.rm = T) / colSums(df[df$Main!=0,]!=0, na.rm = T), -1)
}

Count number of occurrences of vector in list

I have a list of vectors of variable length, for example:
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))
I need to count the number of occurrences for each of the vectors in the list, for example (any other suitable datastructure acceptable):
list(list(c(1,3,5), 2), list(c(2,4), 1), list(c(2,5), 2), list(c(7), 1))
Is there an efficient way to do this? The actual list has tens of thousands of items so quadratic behaviour is not feasible.
match and unique accept and handle "list"s too (?match warns for being slow on "list"s). So, with:
match(q, unique(q))
#[1] 1 2 1 3 4 3
each element is mapped to a single integer. Then:
tabulate(match(q, unique(q)))
#[1] 2 1 2 1
And find a structure to present the results:
as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
# vec n
#1 1, 3, 5 2
#2 2, 4 1
#3 2, 5 2
#4 7 1
As an alternative to the match(x, unique(x)) approach, we could map each element to a single value by deparsing:
table(sapply(q, deparse))
#
# 7 c(1, 3, 5) c(2, 4) c(2, 5)
# 1 2 1 2
Also, since each vector here holds unique integers, and assuming they fall in a small range, we could map each element to a single integer after transforming it to a binary representation:
n = max(unlist(q))
pow2 = 2 ^ (0:(n - 1))
sapply(q, function(x) tabulate(x, nbins = n)) # 'binary' form
sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
#[1] 21 10 21 18 64 18
and then tabulate as before.
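Put together, the binary-key route might look like the sketch below. The names key, i and counts are illustrative. It assumes positive integers without repeats inside a vector, and a maximum value small enough (at most 53) for the sums of powers of two to remain exact in double precision.

```r
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))

n    <- max(unlist(q))
pow2 <- 2 ^ (0:(n - 1))

# one number per vector: the sum of 2^(k - 1) over its elements k
key <- sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
key
# [1] 21 10 21 18 64 18

i      <- match(key, unique(key))   # 1 2 1 3 4 3
counts <- tabulate(i)               # 2 1 2 1

# first occurrence of each distinct vector next to its count
cbind(vec = q[!duplicated(i)], n = counts)
```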
And just to compare the above alternatives:
f1 = function(x)
{
ux = unique(x)
i = match(x, ux)
cbind(vec = ux, n = tabulate(i))
}
f2 = function(x)
{
xc = sapply(x, deparse)
i = match(xc, unique(xc))
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
f3 = function(x)
{
n = max(unlist(x))
pow2 = 2 ^ (0:(n - 1))
v = sapply(x, function(X) sum(tabulate(X, nbins = n) * pow2))
i = match(v, unique(v))
cbind(vec = x[!duplicated(v)], n = tabulate(i))
}
q2 = rep_len(q, 1e3)
all.equal(f1(q2), f2(q2))
#[1] TRUE
all.equal(f2(q2), f3(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.980041 8.161524 10.525946 8.291678 8.848133 178.96333 100 b
# f2(q2) 24.407143 24.964991 27.311056 25.514834 27.538643 45.25388 100 c
# f3(q2) 3.951567 4.127482 4.688778 4.261985 4.518463 10.25980 100 a
Another interesting alternative is based on ordering. R >= 3.3.0 has a grouping function, built off data.table, which, along with the ordering, provides some attributes for further manipulation:
Make all elements of equal length and "transpose" (probably the slowest operation in this case, though I'm not sure how else to feed grouping):
n = max(lengths(q))
qq = .mapply(c, lapply(q, "[", seq_len(n)), NULL)
Use ordering to group similar elements mapped to integers:
gr = do.call(grouping, qq)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
i
#[1] 1 2 1 3 4 3
then, tabulate as before.
To continue the comparisons:
f4 = function(x)
{
n = max(lengths(x))
x2 = .mapply(c, lapply(x, "[", seq_len(n)), NULL)
gr = do.call(grouping, x2)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
all.equal(f3(q2), f4(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2), f4(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.956377 8.048250 8.792181 8.131771 8.270101 21.944331 100 b
# f2(q2) 24.228966 24.618728 28.043548 25.031807 26.188219 195.456203 100 c
# f3(q2) 3.963746 4.103295 4.801138 4.179508 4.360991 35.105431 100 a
# f4(q2) 2.874151 2.985512 3.219568 3.066248 3.186657 7.763236 100 a
In this comparison q's elements are short to accommodate f3, but f3 (because of the large exponentiation) and f4 (because of mapply) will suffer in performance if "list"s with longer elements are used.
One way is to paste each vector, unlist, and tabulate, i.e.
table(unlist(lapply(q, paste, collapse = ',')))
#1,3,5 2,4 2,5 7
# 2 1 2 1
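If the vectors themselves are needed back rather than their pasted labels, the keys can be split again. The sketch below assumes the elements are whole numbers, so that pasting and re-parsing is lossless; keys, tab, vecs and res are illustrative names.

```r
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))

keys <- vapply(q, paste, character(1), collapse = ",")   # "1,3,5" "2,4" ...
tab  <- table(keys)

# turn each label back into a numeric vector and pair it with its count
vecs <- lapply(strsplit(names(tab), ",", fixed = TRUE), as.numeric)
res  <- Map(list, vecs, as.vector(tab))
str(res)
```

This gives a list of (vector, count) pairs, close to the structure asked for in the question.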

Find elements in vector in R

I have a matrix with exactly 2 rows and n columns, for example:
c(0,0,0,0,1,0,2,0,1,0,1,1,1,0,2)->a1
c(0,2,0,0,0,0,2,1,1,0,0,0,0,2,0)->a2
rbind(a1,a2)->matr
For a specific column (in this example column 9, which has 1 in both rows), I need to find the first instance of 2/0 or 0/2 to the left and to the right. In this example the one to the left is column 2 and the one to the right is column 14.
The elements of every row can only be 0, 1, or 2, nothing else. Is there a fast way to do this on large matrices (with 2 rows)? I need to do it 600k times, so speed is a consideration.
library(compiler)
myfun <- cmpfun(function(m, cl) {
li <- ri <- cl
nc <- ncol(m)
repeat {
li <- li - 1
if(li == 0 || ((m[1, li] != 1) && (m[1, li] + m[2, li] == 2))) {
l <- li
break
}
}
repeat {
ri <- ri + 1
if(ri == nc || ((m[1, ri] != 1) && (m[1, ri] + m[2, ri] == 2))) {
r <- ri
break
}
}
c(l, r)
})
and, after taking into account @Martin Morgan's observations,
set.seed(1)
N <- 1000000
test <- rbind(sample(0:2, N, replace = TRUE),
sample(0:2, N, replace = TRUE))
library(microbenchmark)
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 4.658 20.033 2.237153e+01 22.536 26.022 85.567 100 a
# fun(test, N/2) 36685.750 47842.185 9.762663e+04 65571.546 120321.921 365958.316 100 b
# foo(test, N/2) 2622845.039 3009735.216 3.244457e+06 3185893.218 3369894.754 5170015.109 100 d
# AWebb(test, N/2) 121504.084 142926.590 1.990204e+05 193864.670 209918.770 489765.471 100 c
# RHertel(test, N/2) 65998.733 76805.465 1.187384e+05 86089.980 144793.416 385880.056 100 b
set.seed(123)
test <- rbind(sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)),
sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)))
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 81.805 103.732 121.9619 106.459 122.36 307.736 100 a
# fun(test, N/2) 26362.845 34553.968 83582.9801 42325.755 106303.84 403212.369 100 b
# foo(test, N/2) 2598806.742 2952221.561 3244907.3385 3188498.072 3505774.31 4382981.304 100 d
# AWebb(test, N/2) 109446.866 125243.095 199204.1013 176207.024 242577.02 653299.857 100 c
# RHertel(test, N/2) 56045.309 67566.762 125066.9207 79042.886 143996.71 632227.710 100 b
I was slower than @Laterow, but anyhow, this is a similar approach
foo <- function(mtr, targetcol) {
matr1 <- colSums(mtr)
matr2 <- apply(mtr, 2, function(x) x[1]*x[2])
cols <- which(matr1 == 2 & matr2 == 0) - targetcol
left <- cols[cols < 0]
right <- cols[cols > 0]
c(ifelse(length(left) == 0, NA, targetcol + max(left)),
ifelse(length(right) == 0, NA, targetcol + min(right)))
}
foo(matr,9) #2 14
Combine the information by squaring the rows and adding them. Because the entries can only be 0, 1, or 2, the sum is 4 exactly for the 2/0 and 0/2 columns. Then, simply find the first such column to the left of 9 (rev(which())[1]) and the first one to the right of 9 (which()[1]).
fun <- function(matr, col){
valid <- which((matr[1,]^2 + matr[2,]^2) == 4)
if (length(valid) == 0) return(c(NA,NA))
left <- valid[rev(which(valid < col))[1]]
right <- valid[which(valid > col)[1]]
c(left,right)
}
fun(matr,9)
# [1] 2 14
fun(matr,1)
# [1] NA 2
fun(matrix(0,nrow=2,ncol=100),9)
# [1] NA NA
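The claim that a sum of squares of 4 singles out the 2/0 and 0/2 columns can be checked by enumerating all nine possible columns. This is just a verification sketch; combos, r1 and r2 are made-up names.

```r
# every possible column of a 2-row matrix with entries 0, 1, 2
combos     <- expand.grid(r1 = 0:2, r2 = 0:2)
combos$ssq <- combos$r1^2 + combos$r2^2

# only (2, 0) and (0, 2) reach 4; (1, 1) gives 2, (1, 2) gives 5, (2, 2) gives 8
combos[combos$ssq == 4, ]
```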
Benchmark
set.seed(1)
test <- rbind(sample(0:2,1000000,replace=T),
sample(0:2,1000000,replace=T))
microbenchmark::microbenchmark(fun(test,9))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun(test, 9) 22.7297 27.21038 30.91314 27.55106 28.08437 51.92393 100
Edit: Thanks to @MatthewLundberg for pointing out a lot of mistakes.
If you are doing this many times, precompute all the locations
loc <- which((a1==2 & a2==0) | (a1==0 & a2==2))
You can then find the first to the left and right with findInterval
i<-findInterval(9,loc);loc[c(i,i+1)]
# [1] 2 14
Note that findInterval is vectorized should you care to specify multiple target columns.
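A sketch of the vectorized use, with made-up target columns. It assumes none of the targets is itself a 2/0 or 0/2 column (as in the question, where the target holds 1/1), and it returns NA where no matching column exists on that side. Note that an index of 0 has to be turned into NA explicitly, because subsetting with 0 silently drops the element.

```r
a1 <- c(0,0,0,0,1,0,2,0,1,0,1,1,1,0,2)
a2 <- c(0,2,0,0,0,0,2,1,1,0,0,0,0,2,0)
loc <- which((a1 == 2 & a2 == 0) | (a1 == 0 & a2 == 2))   # 2 14 15

targets <- c(1, 9, 12)              # hypothetical columns of interest
i <- findInterval(targets, loc)     # 0 1 1

right <- loc[i + 1]                 # NA would appear if nothing lies to the right
left  <- loc[ifelse(i == 0, NA, i)] # index 0 means nothing lies to the left
cbind(target = targets, left, right)
#      target left right
# [1,]      1   NA     2
# [2,]      9    2    14
# [3,]     12    2    14
```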
That is an interesting question. Here's how I would address it.
First a vector is defined which contains the product of each column:
a3 <- matr[1,]*matr[2,]
Then we can find the columns with pairs of (0/2) or (2/0) rather easily, since we know that the matrix can only contain the values 0, 1, and 2:
the02s <- which(colSums(matr)==2 & a3==0)
Next we want to find the pairs of (0/2) or (2/0) that are closest to a given column number, on the left and on the right of that column. The column number could be 9, for instance:
thecol <- 9
Now we have basically all we need to find the index (the column number in the matrix) of a combination of (0/2) or (2/0) that is closest to the column thecol. We just need to use the output of findInterval():
pos <- findInterval(thecol,the02s)
pos <- c(pos, pos+1)
pos[pos==0] <- NA # output NA if no column was found on the left
And the result is:
the02s[pos]
# 2 14
So the indices of the closest columns on either side of thecol fulfilling the required condition would be 2 and 14 in this case, and we can confirm that these column numbers both contain one of the relevant combinations:
matr[,14]
#a1 a2
# 0 2
matr[,2]
#a1 a2
# 0 2
Edit: I changed the answer such that NA is returned in the case where no column exists on the left and/or on the right of thecol in the matrix that fulfills the required condition.

Resources