I have a dataframe of 0/1 dummy variables. Each dummy variable only takes the value 1 once. For each column, I would want to replace n preceding/following observations counting from the observation with the value 1 to a particular value (say 1).
So for single vector, with n=1:
c(0, 0, 1, 0, 0)
I would want to get
c(0, 1, 1, 1, 0)
What would be a good general approach with n columns and allowing for a different number of preceding/following observations to replace (e.g n-1 before & n after)?
Thanks for help!
x<-c(0,0,1,0,0)
ind<-which(x==1)
x[(ind-1):(ind+x)]<-1
Another option:
f <- function(x, pre, post) {
idx <- which.max(x)
x[max(1, (idx-pre)):min(length(x), (idx+post))] <- 1
x
}
Sample data:
df <- data.frame(x = c(0, 0, 1, 0, 0), y = c(0, 1, 0, 0, 0))
Application:
df[] <- lapply(df, f, pre=2, post=1)
#df
# x y
#1 1 1
#2 1 1
#3 1 1
#4 1 0
#5 0 0
What you can do is the following:
vec <- c(0, 0, 1, 0, 0)
sapply(1:length(vec), function(i) {
minval <- max(0, i - 1)
maxval <- min(i + 1, length(vec))
return(sum(vec[minval:maxval]))
})
# [1] 0 1 1 1 0
Or to put it in a function (same code but a bit more compact)
f <- function(vec){
sapply(1:length(vec), function(i)
sum(vec[max(0, i-1):min(i+1, length(vec))]))
}
f(vec)
# [1] 0 1 1 1 0
Speedtest
To compare the two different solutions, I quickly ran a benchmark using microbenchmark, and the winner is: Clearly #Shenglin's code.... Always nice to see simple solutions (as well as to see how complicated some (my) solutions can be).
fDavid <- function(vec){
sapply(1:length(vec), function(i)
sum(vec[max(0, i-1):min(i+1, length(vec))]))
}
fHeroka <- function(vec){
res <- vec
test <- which(vec==1)
#create indices to be replaced
n=1 #variable n
replace_indices <- c(test+(1:n),test-(1:n))
#filter out negatives (may happen with larger n)
replace_indices <- replace_indices[replace_indices>0]
#replace items in 'res' that need to be replaced with 1
res[replace_indices] <- 1
}
fShenglin <- function(vec){
ind<-which(vec==1)
vec[(ind-1):(ind+x)]<-1
}
vect <- sample(0:1, size = 1000, replace = T)
library(microbenchmark)
microbenchmark(fHeroka(vect), fDavid(vect), fShenglin)
# # Unit: nanoseconds
# expr min lq mean median uq max
# fHeroka(vect) 38929 42999 54422.57 49546 61755.5 145451
# fDavid(vect) 2463805 2577935 2875024.99 2696844 2849548.5 5994596
# fShenglin 0 0 138.63 1 355.0 1063
# neval cld
# 100 a
# 100 b
# 100 a
# Warning message:
# In microbenchmark(fHeroka(vect), fDavid(vect), fShenglin) :
# Could not measure a positive execution time for 30 evaluations.
This might be a start:
myv <- c(0, 0, 1, 0, 0)
#make a copy
res <- myv
#check where the ones are
test <- which(myv==1)
#create indices to be replaced
n=1 #variable n
replace_indices <- c(test+(1:n),test-(1:n))
#filter out negatives (may happen with larger n)
replace_indices <- replace_indices[replace_indices>0]
#replace items in 'res' that need to be replaced with 1
res[replace_indices] <- 1
res
> res
[1] 0 1 1 1 0
This could be a solution:
dat<-data.frame(x=c(0,0,1,0,0,0),y=c(0,0,0,1,0,0),z=c(0,1,0,0,0,0))
which_to_change<-data.frame(prev=c(2,2,1),foll=c(1,1,3))
for(i in 1:nrow(which_to_change)){
dat[(which(dat[,i]==1)-which_to_change[i,1]):(which(dat[,i]==1)+which_to_change[i,2]),i]<-1
}
Related
I have a data.frame of 1,480 rows and 1,400 columns like:
1 2 3 4 5 6 ..... 1399 1400
1 0 0 0 1 0 0 ..... 1 0 #first occurrence would be at 4
2 0 0 0 0 0 1 ..... 0 1
3 1 0 0 1 0 0 ..... 0 0
## and etc
Each row contains a series of 0's and 1's - predominantly 0's. For each row, I want to find at which column the first 1 shows up and set the remaining values to 0's.
My current implementation can efficiently find the occurrence of the first 1, but I've only figured out how to zero out the remaining values iteratively by row. In repeated simulations, this iterative process is taking too long.
Here is the current implementation:
N <- length(df[which(df$arm == 0), "pt_id"]) # of patients
M <- max_days
#
# df is like the data frame shown above
#
df[which(df$arm == 0), 5:length(colnames(df))] <- unlist(lapply(matrix(data = rep(pbo_hr, M*N), nrow=N, ncol = M), rbinom, n=1, size = 1))
event_day_post_rand <- apply(df[,5:length(colnames(df))], MARGIN = 1, FUN = function(x) which (x>0)[1])
df <- add_column(df, "event_day_post_rand" = event_day_post_rand, .after = "arm_id")
##
## From here trial days start on column 6 for df
##
#zero out events that occurred after the first event, since each patient can only have 1 max event which will be taken as the earliest event
for (pt_id in df[which(!is.na(df$event_day_post_rand)),"pt_id"]){
event_idx = df[which(df$pt_id == pt_id), "event_day_post_rand"]
df[which(df$pt_id == pt_id), as.character(5+event_idx+1):"1400"] <- 0
}
We can do
mat <- as.matrix(df) ## data frame to matrix
j <- max.col(mat, ties.method = "first")
mat[] <- 0
mat[cbind(1:nrow(mat), j)] <- 1
df <- data.frame(mat) ## matrix to data frame
I also suggest just using a matrix to store these values. In addition, the result will be a sparse matrix. So I recommend
library(Matrix)
sparseMatrix(i = 1:nrow(mat), j = j, x = rep(1, length(j)))
We can get a little more performance by setting the 1 elements to 0 whose rows are duplicates.
Since the OP is open to starting with a matrix rather than a data.frame, I'll do the same.
# dummy data
m <- matrix(sample(0:1, 1480L*1400L, TRUE, c(0.9, 0.1)), 1480L, 1400L)
# proposed solution
f1 <- function(m) {
ones <- which(m == 1L)
m[ones[duplicated((ones - 1L) %% nrow(m), nmax = nrow(m))]] <- 0L
m
}
# Zheyuan Li's solution
f2 <- function(m) {
j <- max.col(m, ties.method = "first")
m[] <- 0L
m[cbind(1:nrow(m), j)] <- 1L
m
}
microbenchmark::microbenchmark(f1 = f1(m),
f2 = f2(m),
check = "identical")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1 9.1457 11.45020 12.04258 11.9011 12.3529 37.6716 100
#> f2 12.8424 14.92955 17.31811 15.3251 16.0550 43.6314 100
Zheyuan Li's suggestion to go with a sparse matrix is a good idea.
# convert to a memory-efficient nsparseMatrix
library(Matrix)
m1 <- as(Matrix(f1(m), dimnames = list(NULL, NULL), sparse = TRUE), "nsparseMatrix")
object.size(m)
#> 8288216 bytes
object.size(m1)
#> 12864 bytes
# proposed function to go directly to a sparse matrix
f3 <- function(m) {
n <- nrow(m)
ones <- which(m == 1L) - 1L
i <- ones %% n
idx <- which(!duplicated(i, nmax = n))
sparseMatrix(i[idx], ones[idx] %/% n, dims = dim(m), index1 = FALSE, repr = "C")
}
# going directly to a sparse matrix using Zheyuan Li's solution
f4 <- function(m) {
sparseMatrix(1:nrow(m), max.col(m, ties.method = "first"), dims = dim(m), repr = "C")
}
identical(m1, f3(m))
#> [1] TRUE
identical(m1, f4(m))
#> [1] TRUE
microbenchmark::microbenchmark(f1 = f1(m),
f3 = f3(m),
f4 = f4(m))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1 9.1719 9.30715 11.12569 9.52300 11.92740 83.8518 100
#> f3 7.4330 7.59875 12.62412 7.69610 11.08815 84.8291 100
#> f4 8.9607 9.31115 14.01477 9.49415 11.44825 87.1577 100
I just want to count the numbers of consecutive zero in last run if last run is zero for atomic vector.
For example:
a <- c(1, 0, 0, 0)
So, the number of consecutive zero in last run is 3.
If last run is not zero, then answer must be zero. For example
a <- c(0, 1, 1, 0, 0, 1)
So, answer is zero because in the last run there is one, not zero.
I do not want to use any external package. I manage to write a function that use loop. But I think more efficient method must exist.
czero <- function(a) {
k = 0
for(i in 1:length(a)){
if(a[i] == 0) {
k = k + 1
} else k = 0
}
return(k)
}
Reverse a and then compute its cumulative sum. The leading 0's will be the only 0's left and ! of that will be TRUE for each and FALSE for other elements. The sum of that is the desired number.
sum(!cumsum(rev(a)))
The simplest improvement is to start your loop from the end of the vector and work backwards, instead of starting from the front. You can then save time by exiting the loop at the first non-zero element, instead of looping through the whole vector.
I've checked this against the given vectors, and a much longer vector with a small number of zeros at the end, to show a case where looping from the start takes a lot of time.
a <- c(1, 0, 0, 0)
b <- c(0, 1, 1, 0, 0, 1)
long <- rep(c(0, 1, 0, 1, 0), c(4, 6, 5, 10000, 3))
czero is the original function, f1 is the solution by akrun that uses rle, fczero starts the loop from the end, and revczero reverses the vector, then starts from the front.
czero <- function(a) {
k = 0
for(i in 1:length(a)){
if(a[i] == 0) {
k = k + 1
} else k = 0
}
return(k)
}
f1 <- function(vec){
pmax(0, with(rle(vec), lengths[values == 0 &
seq_along(values) == length(values)])[1], na.rm = TRUE)
}
fczero <- function(vec) {
k <- 0L
for (i in length(vec):1) {
if (vec[i] != 0) break
k <- k + 1L
}
return(k)
}
revczero <- function(vec) {
revd <- rev(vec)
k <- 0L
for (i in 1:length(vec)) {
if (revd[i] != 0) break
k <- k + 1L
}
return(k)
}
Time benchmarks are below. EDIT: I've also added Grothendieck's version.
microbenchmark::microbenchmark(czero(a), f1(a), fczero(a), revczero(a), sum(!cumsum(rev(a))), times = 1000)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# czero(a) 0 514 621.035 514 515 21076 1000
# f1(a) 21590 23133 34455.218 27245 30843 3211826 1000
# fczero(a) 0 514 688.892 514 515 28274 1000
# revczero(a) 2570 3085 4626.047 3599 4626 112064 1000
# sum(!cumsum(rev(a))) 2056 2571 3879.630 3085 3599 62201 1000
microbenchmark::microbenchmark(czero(b), f1(b), fczero(b), revczero(b), sum(!cumsum(rev(b))), times = 1000)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# czero(b) 0 514 809.691 514 515 29815 1000
# f1(b) 22104 23647 29372.227 24675 26217 1319583 1000
# fczero(b) 0 0 400.502 0 514 26217 1000
# revczero(b) 2056 2571 3844.176 3085 3599 99727 1000
# sum(!cumsum(rev(b))) 2056 2570 3592.281 3084 3598.5 107952 1000
microbenchmark::microbenchmark(czero(long), f1(long), fczero(long), revczero(long), sum(!cumsum(rev(long))), times = 1000)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# czero(long) 353156 354699 422077.536 383486 443631.0 1106250 1000
# f1(long) 112579 119775 168408.616 132627 165269.5 2068050 1000
# fczero(long) 0 514 855.444 514 1028.0 43695 1000
# revczero(long) 24161 27245 35890.991 29301 36498.0 149591 1000
# sum(!cumsum(rev(long))) 49350 53462 71035.486 56546 71454 2006363 1000
We can use rle
f1 <- function(vec){
pmax(0, with(rle(vec), lengths[values == 0 &
seq_along(values) == length(values)])[1], na.rm = TRUE)
}
f1(a)
#[1] 3
In the second case,
b <- c(0, 1, 1, 0, 0, 1)
f1(b)
#[1] 0
Or another option is to create a function with which and cumsum
f2 <- function(vec) {
i1 <- which(!vec)
if(i1[length(i1)] != length(vec)) 0 else {
sum(!cumsum(rev(c(TRUE, diff(i1) != 1)))) + 1
}
}
f2(a)
f2(b)
with data.table:
ifelse(last(a) == 0,
sum(rleid(a) == last(rleid(a))),
0)
As
> rleid(a)
[1] 1 2 2 2
It is the length of the last group, if the last value is 0
I have a large data frame that is taking to long to compute a for loop, I've tried removing all computations to time the for loop but I still have an inefficient code. I'm new to R but I think there should be a better way of coding my for loop.
If you could provide some guidance it would be appreciated.
My dataFrame has 2,772,807 obs of 6 variables.
Simplified code (Still takes long):
library("tictoc")
tic()
dataFlights <- read_delim("U.S._DOT_O&D_Monthly_Traffic_Report.tsv",
"\t", escape_double = FALSE, trim_ws = TRUE)
dataFlights["Connections"] = ""
pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)
for (row in 1:nrow(dataFlights)) {
dataFlights[row,7] <- 1
setTxtProgressBar(pb, row)
}
close(pb)
toc()
Original Code:
#Reads DOT public flight information for 2017 & 2018,
#and computes the number of connections
#per route (Cp#1 or Cp#2) into a new column. Possible results 0,1, or 2 connections.
library("tictoc")
tic()
dataFlights <- read_delim("U.S._DOT_O&D_Monthly_Traffic_Report.tsv",
"\t", escape_double = FALSE, trim_ws = TRUE)
dataFlights["Connections"] = ""
pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)
for (row in 1:nrow(dataFlights)) {
if(is.na(dataFlights[row,2]) & is.na(dataFlights[row,3])){
dataFlights[row,7] <- 0
} else if (is.na(dataFlights[row,2]) | is.na(dataFlights[row,3])) {
dataFlights[row,7] <- 1
} else {
dataFlights[row,7] <- 2
}
setTxtProgressBar(pb, row)
}
close(pb)
toc()
As indicated in the comments, this can be done effortlessly with ifelse
# data
set.seed(123)
n <- 1e+6
dataFlights <- data.frame(x1 = runif(n),
x2 = sample(c(runif(n/2), rep(NA, n/2)), n),
x3 = sample(c(runif(n/2), rep(NA, n/2)), n),
stringsAsFactors = FALSE
)
# conditions
na_2 <- is.na(.subset2(dataFlights, 2))
na_3 <- is.na(.subset2(dataFlights, 3))
na_sum <- na_2 + na_3
# ifelse
dataFlights$x4 <- ifelse(na_sum == 2, 0, ifelse(na_sum == 1, 1, 2))
head(dataFlights)
# x1 x2 x3 x4
# 1 0.2875775 NA NA 0
# 2 0.7883051 0.4415287 NA 1
# 3 0.4089769 NA 0.3130298 1
# 4 0.8830174 0.3077688 NA 1
# 5 0.9404673 NA NA 0
# 6 0.0455565 0.5718788 NA 1
where for simplicity I set column 4 as opposed to column 7.
Few suggestions:
dataFlights["Connections"] = ""
In this piece, if you use NA instead of "", it will keep the data size smaller. For comparison, I created a 3,000,000 x 3 matrix to see size. With only one column different, the one with "" had size 268Mb but the one with NA was only about 60Mb. Smaller the size, faster it will be to index.
pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)
for (row in 1:nrow(dataFlights)) {
dataFlights[row,7] <- 1
setTxtProgressBar(pb, row)
}
In each iteration, you are assigning 1 to a matrix / data.frame cell. This assignment is a computationally expensive step. For your example, this can be completely vectorized. Here are few ways to get 7th column to replace your for loop
rowSums
col7.rowSums = rowSums(!is.na(dataFlights[, 2:3]))
sapply
col7.sapply = sapply(1:nrow(dataFlights), function(x) sum(!is.na(dataFlights[x, 2:3])))
apply
col7.apply = apply(!is.na(dataFlights[, 2:3]), 1, sum)
Microbenchmark
Unit: microseconds
expr min lq mean median uq max neval
for.loop 52604.86 56768.5590 58810.55595 58137.651 60064.056 81958.717 100
rowSums 35.87 49.2225 61.23889 53.845 72.010 139.409 100
sapply 49756.32 53131.1065 55778.95541 54414.455 56154.496 102558.473 100
apply 997.21 1060.5380 1225.48577 1135.066 1254.936 3864.779 100
I have a data.frame that has several variables with zero values. I need to construct an extra variable that would return the combination of variables that are not zero for each observation. E.g.
df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
A = c(0, 0, 0, 1, 2),
B = c(0, 1, 0, 42, 0),
C = c(1, 1, 0, 0, 0))
Now I would like to generate the new variable:
df$varCombination <- c("C", "B-C", NA, "A-B", "A")
I thought up something like this, which obviously did not work:
for (i in 1:nrow(df)){
df$varCombination[i] <- paste(names(df[i,2:ncol(df) & > 0]), collapse = "-")
}
This could be probably solved easily using apply(df, 1, fun), but here is an attempt to solve this column wise instead of row wise for performance sake (I once saw something similar done by #alexis_laz but can't find it right now)
## Create a logical matrix
tmp <- df[-1] != 0
## or tmp <- sapply(df[-1], `!=`, 0)
## Prealocate result
res <- rep(NA, nrow(tmp))
## Run per column instead of per row
for(j in colnames(tmp)){
res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
}
## Remove the pre-allocated `NA` values from non-NA entries
gsub("NA-", "", res, fixed = TRUE)
# [1] "C" "B-C" NA "A-B" "A"
Some benchmarks on a bigger data set
set.seed(123)
BigDF <- as.data.frame(matrix(sample(0:1, 1e4, replace = TRUE), ncol = 10))
library(microbenchmark)
MM <- function(df) {
var_names <- names(df)[-1]
res <- character(nrow(df))
for (i in 1:nrow(df)){
non_zero_names <- var_names[df[i, -1] > 0]
res[i] <- paste(non_zero_names, collapse = '-')
}
res
}
ZX <- function(df) {
res <-
apply(df[,2:ncol(df)]>0, 1,
function(i)paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))
res[res == ""] <- NA
res
}
DA <- function(df) {
tmp <- df[-1] != 0
res <- rep(NA, nrow(tmp))
for(j in colnames(tmp)){
res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
}
gsub("NA-", "", res, fixed = TRUE)
}
microbenchmark(MM(BigDF), ZX(BigDF), DA(BigDF))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# MM(BigDF) 239.36704 248.737408 253.159460 252.177439 255.144048 289.340528 100 c
# ZX(BigDF) 35.83482 37.617473 38.295425 38.022897 38.357285 76.619853 100 b
# DA(BigDF) 1.62682 1.662979 1.734723 1.735296 1.761695 2.725659 100 a
Using apply:
# paste column names
df$varCombination <-
apply(df[,2:ncol(df)]>0, 1,
function(i)paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))
# convert blank to NA
df$varCombination[df$varCombination == ""] <- NA
# result
df
# firm A B C varCombination
# 1 firm1 0 0 1 C
# 2 firm2 0 1 1 B-C
# 3 firm3 0 0 0 <NA>
# 4 firm4 1 42 0 A-B
# 5 firm5 2 0 0 A
You had the right idea but the logical comparison in your loop wasn't correct.
I've attempted to keep the code fairly similar to what you had before, this should work:
var_names <- names(df)[-1]
df$varCombination <- character(nrow(df))
for (i in 1:nrow(df)){
non_zero_names <- var_names[df[i, -1] > 0]
df$varCombination[i] <- paste(non_zero_names, collapse = '-')
}
> df
firm A B C varCombination
1 firm1 0 0 1 C
2 firm2 0 1 1 B-C
3 firm3 0 0 0
4 firm4 1 42 0 A-B
5 firm5 2 0 0 A
A matrix I have has exactly 2 rows and n columns example
c(0,0,0,0,1,0,2,0,1,0,1,1,1,0,2)->a1
c(0,2,0,0,0,0,2,1,1,0,0,0,0,2,0)->a2
rbind(a1,a2)->matr
for a specific column ( in this example 9 with 1 in both rows) I do need to find to the left and to the right the first instance of 2/0 or 0/2 - in this example to the left is 2 and the other is 14)
The elements of every row can either be 0,1,2 - nothing else . Is there a way to do that operation on large matrixes (with 2 rows) fast? I need to to it 600k times so speed might be a consideration
library(compiler)
myfun <- cmpfun(function(m, cl) {
li <- ri <- cl
nc <- ncol(m)
repeat {
li <- li - 1
if(li == 0 || ((m[1, li] != 1) && (m[1, li] + m[2, li] == 2))) {
l <- li
break
}
}
repeat {
ri <- ri + 1
if(ri == nc || ((m[1, ri] != 1) && (m[1, ri] + m[2, ri] == 2))) {
r <- ri
break
}
}
c(l, r)
})
and, after taking into account #Martin Morgan's observations,
set.seed(1)
N <- 1000000
test <- rbind(sample(0:2, N, replace = TRUE),
sample(0:2, N, replace = TRUE))
library(microbenchmark)
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
expr min lq mean median uq max neval cld
# myfun(test, N/2) 4.658 20.033 2.237153e+01 22.536 26.022 85.567 100 a
# fun(test, N/2) 36685.750 47842.185 9.762663e+04 65571.546 120321.921 365958.316 100 b
# foo(test, N/2) 2622845.039 3009735.216 3.244457e+06 3185893.218 3369894.754 5170015.109 100 d
# AWebb(test, N/2) 121504.084 142926.590 1.990204e+05 193864.670 209918.770 489765.471 100 c
# RHertel(test, N/2) 65998.733 76805.465 1.187384e+05 86089.980 144793.416 385880.056 100 b
set.seed(123)
test <- rbind(sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)),
sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)))
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 81.805 103.732 121.9619 106.459 122.36 307.736 100 a
# fun(test, N/2) 26362.845 34553.968 83582.9801 42325.755 106303.84 403212.369 100 b
# foo(test, N/2) 2598806.742 2952221.561 3244907.3385 3188498.072 3505774.31 4382981.304 100 d
# AWebb(test, N/2) 109446.866 125243.095 199204.1013 176207.024 242577.02 653299.857 100 c
# RHertel(test, N/2) 56045.309 67566.762 125066.9207 79042.886 143996.71 632227.710 100 b
I was slower than #Laterow, but anyhow, this is a similar approach
foo <- function(mtr, targetcol) {
matr1 <- colSums(mtr)
matr2 <- apply(mtr, 2, function(x) x[1]*x[2])
cols <- which(matr1 == 2 & matr2 == 0) - targetcol
left <- cols[cols < 0]
right <- cols[cols > 0]
c(ifelse(length(left) == 0, NA, targetcol + max(left)),
ifelse(length(right) == 0, NA, targetcol + min(right)))
}
foo(matr,9) #2 14
Combine the information by squaring the rows and adding them. The right result should be 4. Then, simply find the first column that is smaller than 9 (rev(which())[1]) and the first column that is larger than 9 (which()[1]).
fun <- function(matr, col){
valid <- which((matr[1,]^2 + matr[2,]^2) == 4)
if (length(valid) == 0) return(c(NA,NA))
left <- valid[rev(which(valid < col))[1]]
right <- valid[which(valid > col)[1]]
c(left,right)
}
fun(matr,9)
# [1] 2 14
fun(matr,1)
# [1] NA 2
fun(matrix(0,nrow=2,ncol=100),9)
# [1] NA NA
Benchmark
set.seed(1)
test <- rbind(sample(0:2,1000000,replace=T),
sample(0:2,1000000,replace=T))
microbenchmark::microbenchmark(fun(test,9))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun(test, 9) 22.7297 27.21038 30.91314 27.55106 28.08437 51.92393 100
Edit: Thanks to #MatthewLundberg for pointing out a lot of mistakes.
If you are doing this many times, precompute all the locations
loc <- which((a1==2 & a2==0) | (a1==0 & a2==2))
You can then find the first to the left and right with findInterval
i<-findInterval(9,loc);loc[c(i,i+1)]
# [1] 2 14
Note that findInterval is vectorized should you care to specify multiple target columns.
That is an interesting question. Here's how I would address it.
First a vector is defined which contains the product of each column:
a3 <- matr[1,]*matr[2,]
Then we can find the columns with pairs of (0/2) or (2/0) rather easily, since we know that the matrix can only contain the values 0, 1, and 2:
the02s <- which(colSums(matr)==2 & a3==0)
Next we want to find the pairs of (0/2) or (2/0) that are closest to a given column number, on the left and on the right of that column. The column number could be 9, for instance:
thecol <- 9
Now we have basically all we need to find the index (the column number in the matrix) of a combination of (0/2) or (2/0) that is closest to the column thecol. We just need to use the output of findInterval():
pos <- findInterval(thecol,the02s)
pos <- c(pos, pos+1)
pos[pos==0] <- NA # output NA if no column was found on the left
And the result is:
the02s[pos]
# 2 14
So the indices of the closest columns on either side of thecol fulfilling the required condition would be 2 and 14 in this case, and we can confirm that these column numbers both contain one of the relevant combinations:
matr[,14]
#a1 a2
# 0 2
matr[,2]
#a1 a2
# 0 2
Edit: I changed the answer such that NA is returned in the case where no column exists on the left and/or on the right of thecol in the matrix that fulfills the required condition.