Finding a pattern in a binary matrix with R

I have an n x n symmetric binary matrix and I want to find the largest rectangle (by area) with 0 at the top-left and bottom-right corners and 1 at the top-right and bottom-left corners. If I just do it with loops, checking all rectangles from the biggest to the smallest, it takes days for n = 100. Does anyone have an idea how to do this efficiently?
Thanks a lot!

Thanks for your answers. The matrices I use are adjacency matrices of random Erdős–Rényi graphs, but any random symmetric binary matrix will do for testing. Until now, I have used 4 nested loops:
find_rect <- function(Mat) {  # renamed from switch(), which masks base::switch
  n <- nrow(Mat)
  for (i in 1:(n - 1)) {
    for (j in seq(n, i + 1, by = -1)) {
      for (k in 1:(n - 1)) {
        if ((k == i) || (k == j) || (Mat[i, k] == 1) || (Mat[j, k] == 0)) next
        for (l in seq(n, k + 1, by = -1)) {
          if ((l == i) || (l == j) || (Mat[i, l] == 0) || (Mat[j, l] == 1)) next
          return(c(i, j, k, l))  # return() takes a single value, so wrap in c()
        }
      }
    }
  }
  NULL  # no such rectangle
}

Here's an approach that you can try for now. It doesn't require symmetry, and it treats all nonzero elements like ones for efficiency.
It loops over the ones, assuming that there are fewer ones than zeros. (You would want to loop over zeros in the reverse case with fewer zeros than ones.)
This approach probably isn't optimal, since it loops over all of the ones even if the largest box is identified early. You can devise a clever stopping condition to short-circuit the loop in that case; a sketch of one follows after the benchmark below.
But it is still fast for n = 100, requiring less than half of a second on my machine, even when ones and zeros occur in roughly equal proportion (the worst case):
f <- function(X) {
  if (!is.logical(X)) {
    storage.mode(X) <- "logical"  # treat all nonzero elements as TRUE
  }
  J <- which(X, arr.ind = TRUE, useNames = FALSE)  # positions of the ones
  i <- J[, 1L]
  j <- J[, 2L]
  nmax <- 0L
  res <- NULL
  for (k in seq_along(i)) {
    ## take (i0, j0) as the bottom-left corner of a candidate rectangle
    i0 <- i[k]
    j0 <- j[k]
    ## candidate top-right corners: ones above and to the right
    ok <- i < i0 & j > j0
    if (any(ok)) {
      i1 <- i[ok]
      j1 <- j[ok]
      ## the bottom-right and top-left corners must both be zero
      ok <- !(X[i0, j1] | X[i1, j0])
      if (any(ok)) {
        i1 <- i1[ok]
        j1 <- j1[ok]
        n <- (i0 - i1 + 1L) * (j1 - j0 + 1L)  # areas of the surviving candidates
        w <- which.max(n)
        if (n[w] > nmax) {
          nmax <- n[w]
          res <- c(i0 = i0, j0 = j0, i1 = i1[w], j1 = j1[w])
        }
      }
    }
  }
  res
}
mkX <- function(n) {
  X <- matrix(sample(0:1, n * n, TRUE), n, n)
  X[upper.tri(X)] <- t(X)[upper.tri(X)]
  X
}
set.seed(1L)
X <- mkX(6L)
X
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 1 0 0 1 0
## [2,] 1 0 1 1 0 0
## [3,] 0 1 0 1 1 1
## [4,] 0 1 1 0 0 0
## [5,] 1 0 1 0 0 1
## [6,] 0 0 1 0 1 0
f(X)
## i0 j0 i1 j1
## 5 1 1 5
Y <- mkX(100L)
microbenchmark::microbenchmark(f(Y))
## Unit: milliseconds
## expr min lq mean median uq max neval
## f(Y) 310.139 318.3363 327.8116 321.4109 326.5088 391.9081 100
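As hinted above, one possible stopping condition is to visit the ones in decreasing order of the largest area they could possibly yield as a bottom-left corner, and break once that bound can no longer beat the current maximum. Here is a sketch of that idea (f_stop is a name I've made up; not benchmarked against f above):
f_stop <- function(X) {
  if (!is.logical(X)) {
    storage.mode(X) <- "logical"
  }
  J <- which(X, arr.ind = TRUE, useNames = FALSE)
  i <- J[, 1L]
  j <- J[, 2L]
  ## upper bound on the area achievable with (i, j) as bottom-left corner:
  ## rows can extend up to row 1, columns up to column ncol(X)
  bound <- i * (ncol(X) - j + 1L)
  o <- order(bound, decreasing = TRUE)
  i <- i[o]; j <- j[o]; bound <- bound[o]
  nmax <- 0L
  res <- NULL
  for (k in seq_along(i)) {
    if (bound[k] <= nmax) break  # no remaining corner can do better
    i0 <- i[k]
    j0 <- j[k]
    ok <- i < i0 & j > j0
    if (any(ok)) {
      i1 <- i[ok]
      j1 <- j[ok]
      ok <- !(X[i0, j1] | X[i1, j0])
      if (any(ok)) {
        i1 <- i1[ok]
        j1 <- j1[ok]
        n <- (i0 - i1 + 1L) * (j1 - j0 + 1L)
        w <- which.max(n)
        if (n[w] > nmax) {
          nmax <- n[w]
          res <- c(i0 = i0, j0 = j0, i1 = i1[w], j1 = j1[w])
        }
      }
    }
  }
  res
}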

Related

Faster way of filling a matrix in R

I want to fill a matrix in R where every column holds the vector with an iterative downward shift.
So in a sense it will be a lower triangular matrix.
My effort is this:
x <- c(3, 4, 8, 9)
E <- matrix(0, length(x), length(x)); E
for (i in 1:nrow(E)) {
  E[i, 1] <- x[i]
}
E
for (i in 2:nrow(E)) {
  for (j in 2:ncol(E)) {
    E[i, 2] <- x[i - 1]
  }
}
E
for (i in 3:nrow(E)) {
  for (j in 3:ncol(E)) {
    E[i, 3] <- x[i - 2]
  }
}
E
for (i in 4:nrow(E)) {
  for (j in 4:ncol(E)) {
    E[i, 4] <- x[i - 3]
  }
}
E
Each time I remove an element from the vector. But is there a faster way to do it with fewer for loops, and generalized to a vector of length n instead of 4? (A sketch of the generalization follows.)
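For reference, the loops above generalize directly to any vector length; here is a sketch of that generalization, just to make the target pattern explicit:
x <- c(3, 4, 8, 9)
n <- length(x)
E <- matrix(0, n, n)
for (j in 1:n) {        # column j starts at row j
  for (i in j:n) {
    E[i, j] <- x[i - j + 1]
  }
}
E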
Sorry, I couldn't resist. Here's another base approach:
x <- c(3,4,8,9)
n <- length(x)
E <- diag(rep(x[1], n))
j <- unlist(sapply(length(x):2, function(i) x[2:i]))
E[lower.tri(E)] <- j
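Printing E should reproduce the target matrix:
E
#      [,1] [,2] [,3] [,4]
# [1,]    3    0    0    0
# [2,]    4    3    0    0
# [3,]    8    4    3    0
# [4,]    9    8    4    3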
Added to Rui's benchmark code we get this (benchmark results not reproduced here). It would also be interesting to add this code to the benchmarking:
TIC <- function(x) {
  E <- diag(x)
  E[lower.tri(E, TRUE)] <- x[sequence(rev(seq_along(x)))]
  E
}
which gives
> TIC(x)
[,1] [,2] [,3] [,4]
[1,] 3 0 0 0
[2,] 4 3 0 0
[3,] 8 4 3 0
[4,] 9 8 4 3
Here is a base R way.
E <- diag(length(x))
apply(lower.tri(E, diag = TRUE), 2, function(i) {
  c(rep(0, nrow(E) - sum(i)), x)[seq_along(x)]
})
# [,1] [,2] [,3] [,4]
#[1,] 3 0 0 0
#[2,] 4 3 0 0
#[3,] 8 4 3 0
#[4,] 9 8 4 3
Performance tests
If the question is about faster code, here are benchmarks.
The functions are mine and Ben Bolker's.
Rui <- function(x) {
  E <- diag(length(x))
  inx <- seq_along(x)
  apply(lower.tri(E, diag = TRUE), 2, function(i) {
    c(rep(0, nrow(E) - sum(i)), x)[inx]
  })
}
Ben <- function(x) {
  E <- matrix(0, nrow = length(x), ncol = length(x))
  diag(E) <- x[1]
  for (i in 2:length(x)) {
    E[row(E) == col(E) + i - 1] <- x[i]
  }
  E
}
Tests with increasing vector size and plot with ggplot.
library(microbenchmark)
library(ggplot2)
test_speed <- function(n) {
  out <- lapply(1:n, function(i) {
    x <- sample(10 * i)
    mb <- microbenchmark(
      Rui = Rui(x),
      Ben = Ben(x)
    )
    mb <- aggregate(time ~ expr, mb, median)
    mb$size <- 10 * i
    mb
  })
  do.call(rbind, out)
}
res <- test_speed(10)
ggplot(res, aes(size, time, color = expr)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(trans = "log10")
This isn't super-efficient but better than your solution. (The inefficiency is that we are constructing the row()/col() matrices and generating a full logical matrix each time, rather than doing something with indexing.) On the other hand, it seems to be almost instantaneous for length(x)==100 (kind of slow when we go to 1000 though).
E <- matrix(0, nrow = length(x), ncol = length(x))
diag(E) <- x[1]
for (i in 2:length(x)) {
  E[row(E) == col(E) + i - 1] <- x[i]
}
It's possible that someone has written more efficient code (in Rcpp?) for indexing sub-diagonals/off-diagonal elements of a matrix.
Despite its slowness, the advantage of this one (IMO) is that it's a little easier to understand; you can also adjust it to a lot of different patterns by coming up with different conditions on the relationship between rows and columns.
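Base R can already index sub-diagonals directly through column-major arithmetic, which avoids materializing the row()/col() matrices; here is a sketch along those lines (subdiag_fill is a hypothetical helper, not benchmarked):
subdiag_fill <- function(x) {
  n <- length(x)
  E <- matrix(0, n, n)
  for (k in 0:(n - 1)) {
    ## the k-th sub-diagonal has n - k entries, starting at linear index
    ## k + 1 and advancing with stride n + 1 in column-major order
    E[(k + 1) + (0:(n - k - 1)) * (n + 1)] <- x[k + 1]
  }
  E
}
subdiag_fill(c(3, 4, 8, 9))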

Adding adjustable random noise to a matrix in R

I have a matrix generating function that produces lower-triangle of 1s and upper-triangle of 0s.
I was wondering if it might be possible to add some adjustable random noise (from some distribution that gives random 0s and 1s) to the output matrix, such that random 0s replace some of the bottom 1s and random 1s replace some of the top 0s.
lower_mat <- function(r, c) {
  m <- matrix(0, nrow = r, ncol = c)
  m[lower.tri(m)] <- 1
  m
}
lower_mat(5,4)
# [,1] [,2] [,3] [,4]
# [1,] 0 0 0 0
# [2,] 1 0 0 0
# [3,] 1 1 0 0
# [4,] 1 1 1 0
# [5,] 1 1 1 1
If you want to assume that you are swapping from lower to upper a certain number of positions, you could do
swap_upper_lower <- function(m, n) {
  tops <- which(upper.tri(m))
  bots <- which(lower.tri(m))
  stopifnot(length(bots) >= n && length(tops) >= n)
  tops <- sample(tops, n)
  bots <- sample(bots, n)  # sample from bots, not tops
  vals <- m[tops]
  m[tops] <- m[bots]
  m[bots] <- vals
  m
}
mm <- lower_mat(5,4)
swap_upper_lower(mm, 3)
That will swap 3 values from the lower triangle with 3 values from the upper triangle.
If you would prefer to think of it as swapping the positions of 0's and 1's you could instead do
swap_0_1 <- function(m, n) {
  ones <- which(m == 1)
  zers <- which(m == 0)
  stopifnot(length(ones) >= n && length(zers) >= n)
  ones <- sample(ones, n)
  zers <- sample(zers, n)
  vals <- m[ones]
  m[ones] <- m[zers]
  m[zers] <- vals
  m
}
Note this will treat values on the diagonal differently than the other function.
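If "adjustable" means a noise level rather than a fixed count of swaps, another option is to flip each off-diagonal cell independently with some probability p. A sketch under that assumption (flip_noise is a hypothetical helper, not from the answers above):
flip_noise <- function(m, p) {
  ## Bernoulli(p) flip indicator for every cell
  flip <- matrix(rbinom(length(m), 1, p) == 1, nrow(m), ncol(m))
  flip <- flip & (upper.tri(m) | lower.tri(m))  # leave the diagonal alone
  m[flip] <- 1 - m[flip]
  m
}
flip_noise(lower_mat(5, 4), 0.2)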

Why isn't my random walk simulation working correctly?

I have written the following code to simulate an unbiased random walk on Z^2. With probability 1/4 each, the "destination" is supposed to move one unit up, left, right, or down. So I made "destination" a matrix with two columns, one for the x-coordinate and one for the y-coordinate, and I increment/decrement the appropriate coordinate according to the value of runif(1).
N_trials <- 10
N_steps <- 10
destination <- matrix(0, N_trials, 2)
for (n in 1:N_steps) {
  p <- runif(1)
  if (p < 1/4) {
    destination[n, 1] <- destination[n, 1] - 1
  } else if (p < 1/2) {
    destination[n, 1] <- destination[n, 1] + 1
  } else if (p < 3/4) {
    destination[n, 2] <- destination[n, 2] + 1
  } else {
    destination[n, 2] <- destination[n, 2] - 1
  }
}
However, the process never seems to move out of the set {(0,0),(1,0),(-1,0),(0,1),(0,-1)}. Why is this? Is there an error in the logic of my code?
The problem is that your loop writes step n into row n of destination, so every row records a single step from the origin rather than a running position; that's why you never leave that set. Rather than using loops, you can vectorize the random walk.
The idea is to first create a matrix of possible steps:
steps <- matrix(c(0,0,-1,1,-1,1,0,0),nrow = 4)
which is:
[,1] [,2]
[1,] 0 -1
[2,] 0 1
[3,] -1 0
[4,] 1 0
Then you can feed random subscripts into it:
steps[sample(1:4,10,replace = TRUE),]
for example, will create a matrix of 10 rows where each row is randomly chosen from the steps matrix.
If you rbind this with c(0,0) as a starting position, and then take the cumulative sum (cumsum) of each column, you have your walk. You can wrap this all in a function:
rand.walk <- function(n) {
  steps <- matrix(c(0, 0, -1, 1, -1, 1, 0, 0), nrow = 4)
  walk <- steps[sample(1:4, n, replace = TRUE), ]
  walk <- rbind(c(0, 0), walk)
  apply(walk, 2, cumsum)
}
For example, plot(rand.walk(1000), type = 'l') produces a trace of the walk (plot not reproduced here).
Here's what I have; note that row n + 1 is built from row n, which is the accumulation your version was missing. Is this what you had in mind?
set.seed(1)
N_trials <- 10
N_steps <- 10
destination <- matrix(0, N_trials, 2)
for (n in 1:(N_steps - 1)) {
  p <- runif(1)
  if (p < 1/4) {
    destination[n + 1, 1] <- destination[n, 1] - 1
    destination[n + 1, 2] <- destination[n, 2]
  } else if (p < 1/2) {
    destination[n + 1, 1] <- destination[n, 1] + 1
    destination[n + 1, 2] <- destination[n, 2]
  } else if (p < 3/4) {
    destination[n + 1, 1] <- destination[n, 1]
    destination[n + 1, 2] <- destination[n, 2] + 1
  } else {
    destination[n + 1, 1] <- destination[n, 1]
    destination[n + 1, 2] <- destination[n, 2] - 1
  }
}
destination
[,1] [,2]
[1,] 0 0
[2,] 1 0
[3,] 2 0
[4,] 2 1
[5,] 2 0
[6,] 1 0
[7,] 1 -1
[8,] 1 -2
[9,] 1 -1
[10,] 1 0

Comparison of rows and columns of a matrix

Let's assume we have a p by n matrix. I want to generate an output matrix w (p x p) such that w_ij counts how many times the i-th row's entry is bigger than the j-th row's (at most n, obviously).
My code is here, I'm looking for a faster way.
p <- dim(dat)[1]
n <- dim(dat)[2]
w <- matrix(0, p, p)
for (i in 1:n) {
  for (j in 1:(p - 1)) {
    for (k in (j + 1):p) {
      if (dat[j, i] > dat[k, i]) {
        w[j, k] <- w[j, k] + 1
      } else {
        w[k, j] <- w[k, j] + 1
      }
    }
  }
}
A small example
If the input data is
dat <- matrix(1:9, 3)
dat
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
the expected outcome is
W <- matrix(c(0,3,3,0,0,3,0,0,0),3)
W
# [,1] [,2] [,3]
#[1,] 0 0 0
#[2,] 3 0 0
#[3,] 3 3 0
This seems to give a quick speed gain without much extra work:
newd <- t(dat)
for (i in 1:p) {
  ## dat[i, ] (length n) recycles down the n rows of newd, column by column
  w[, i] <- colSums((newd - dat[i, ]) > 0)
}
Quick comparison: wrap code in functions
f1 <- function(dat) {
  p <- dim(dat)[1]
  n <- dim(dat)[2]
  w <- matrix(0, p, p)
  for (i in 1:n) {
    for (j in 1:(p - 1)) {
      for (k in (j + 1):p) {
        if (dat[j, i] > dat[k, i]) {
          w[j, k] <- w[j, k] + 1
        } else {
          w[k, j] <- w[k, j] + 1
        }
      }
    }
  }
  w
}
f2 <- function(dat) {
  p <- dim(dat)[1]
  w <- matrix(0, p, p)
  newd <- t(dat)
  for (i in 1:p) {
    w[, i] <- colSums((newd - dat[i, ]) > 0)
  }
  w
}
Generate slightly larger data
set.seed(1)
dat <- matrix(rnorm(1e4), 100)
Compare
all.equal(f1(dat), f2(dat))
Benchmark
library(microbenchmark)
microbenchmark(f1(dat), f2(dat), times=10)
# expr min lq mean median uq max neval cld
# f1(dat) 1586.10589 1594.40701 1619.03102 1616.14899 1635.05695 1688.08589 10 b
# f2(dat) 22.56083 23.13493 23.98392 23.34228 24.39766 28.29201 10 a
Of course, depending on the size of your matrix, it may be worth writing your loops in C++ via Rcpp for larger speed gains.
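For what it's worth, here is a sketch of such an Rcpp translation of the triple loop (count_wins is a hypothetical name; not benchmarked here):
library(Rcpp)
cppFunction('
IntegerMatrix count_wins(NumericMatrix dat) {
  int p = dat.nrow(), n = dat.ncol();
  IntegerMatrix w(p, p);  // zero-initialized
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < p - 1; ++j)
      for (int k = j + 1; k < p; ++k) {
        if (dat(j, i) > dat(k, i)) w(j, k)++; else w(k, j)++;
      }
  return w;
}')
## usage: w <- count_wins(dat)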

Counting column data in a matrix with resets

I'm gathering data on how much my cats poop into a matrix:
m <- cbind(fluffy=c(1.1,1.2,1.3,1.4),misterCuddles=c(0.9,NA,1.1,1.0))
row.names(m) <- c("2013-01-01", "2013-01-02", "2013-01-03","2013-01-04")
Which gives me this:
fluffy misterCuddles
2013-01-01 1.1 0.9
2013-01-02 1.2 NA
2013-01-03 1.3 1.1
2013-01-04 1.4 1.0
On every date, I'd like to know how many days in a row each cat has gone number 2. So the resulting matrix should look like this:
fluffy misterCuddles
2013-01-01 1 1
2013-01-02 2 0
2013-01-03 3 1
2013-01-04 4 2
Is there a way to do this efficiently? The cumsum function does something similar, but that's a primitive so I can't modify it to suit my dirty, dirty needs.
I could run a for loop and store a count like so:
m.output <- matrix(nrow = nrow(m), ncol = ncol(m))
for (column in 1:ncol(m)) {
  sum <- 0
  for (row in 1:nrow(m)) {
    if (is.na(m[row, column])) sum <- 0
    else sum <- sum + 1
    m.output[row, column] <- sum
  }
}
Is this the most efficient way to do this? I have a lot of cats, and I've recorded years' worth of poop data. Can I parallelize this by column somehow?
All of the answers here are actually too complicated (including my own, from earlier, copied below). The Reduce family of answers is just masking a for-loop in a single function call. I like Roland's and Ananda's, but both I think have a little too much going on.
Thus, here's a simple vectorized solution:
reset <- function(x) {
  s <- seq_along(x)
  s[!is.na(x)] <- 0          # keep the index only at NA positions
  seq_along(x) - cummax(s)   # cummax(s) = index of the most recent NA (0 if none)
}
> apply(m, 2, reset)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
It also works on Roland's example:
m2 <- cbind(fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA))
> apply(m2, 2, reset)
fluffy misterCuddles
[1,] 0 0
[2,] 1 1
[3,] 2 2
[4,] 3 0
[5,] 4 0
[6,] 5 1
[7,] 6 0
From earlier: this is not vectorized, but also works:
pooprun <- function(x) {
  z <- numeric(length = length(x))
  count <- 0
  for (i in 1:length(x)) {
    if (is.na(x[i]))
      count <- 0
    else
      count <- count + 1
    z[i] <- count
  }
  z
}
> apply(m, 2, pooprun)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
THE BENCHMARKING
Here I simply wrap everyone's answers in a function call (based on their name).
> library(microbenchmark)
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 1.540 4.6200 5.3890 6.1590 372.185 1000
hadley() 87.755 92.758 94.298 96.6075 1767.012 1000
thomas() 92.373 99.6860 102.7655 106.6140 315.223 1000
matthew() 128.168 136.2505 139.7150 145.4880 5196.344 1000
thomasloop() 133.556 141.6390 145.1030 150.4920 84131.427 1000
usobi() 148.182 159.9210 164.7320 174.1620 5010.445 1000
ananda() 720.507 742.4460 763.6140 801.3335 5858.733 1000
And here are the results for Roland's example data:
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 2.310 5.3890 6.1590 6.9290 75.438 1000
hadley() 75.053 78.902 80.058 83.136 1747.767 1000
thomas() 90.834 97.3770 100.2640 104.3050 358.329 1000
matthew() 139.715 149.7210 154.3405 161.2680 5084.728 1000
thomasloop() 144.718 155.4950 159.7280 167.4260 5182.103 1000
usobi() 177.048 188.5945 194.3680 210.9180 5360.306 1000
ananda() 705.881 729.9370 753.4150 778.8175 8226.936 1000
Note: Alexis's and Hadley's solutions took quite a while to actually define as functions on my machine, whereas the others work out-of-the-box, but Alexis's is otherwise the clear winner.
This should work. Note that each of your cats is an independent individual, so you can turn your data frame into a list and use mclapply (from the parallel package), which takes a parallelized approach.
count <- function(y, x) {
  if (is.na(x)) return(0)
  return(y + 1)
}
oneCat <- m[, 1]
Reduce(count, oneCat, init = 0, accumulate = TRUE)[-1]
EDIT: here is the full answer
count <- function(x, y) {
  if (is.na(y)) return(0)
  return(x + 1)
}
library(parallel)  # for mclapply
mclapply(as.data.frame(m), Reduce, f = count, init = 0, accumulate = TRUE)
EDIT2: The main remaining problem is that each result keeps the initial 0 from init = 0 at the beginning, so...
result <- mclapply(as.data.frame(m), Reduce, f = count, init = 0, accumulate = TRUE)
finalResult <- do.call("cbind", result)[-1, ]  # drop the leading init row
rownames(finalResult) <- rownames(m)
does the job.
Another option, similar to #Usobi's in that it uses Reduce, but with a slightly different approach:
apply(!is.na(m), 2, Reduce, f=function(x,y) if (y) x + y else y, accumulate=TRUE)
# fluffy misterCuddles
# [1,] 1 1
# [2,] 2 0
# [3,] 3 1
# [4,] 4 2
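To see what the Reduce step is doing, here is one column unrolled (a small illustration):
v <- !is.na(m[, "misterCuddles"])  # TRUE FALSE TRUE TRUE
Reduce(function(x, y) if (y) x + y else y, v, accumulate = TRUE)
# [1] 1 0 1 2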
I had saved a snippet from here that translates almost exactly for a problem like this:
countReset <- function(x) {
  x[!is.na(x)] <- 1
  ## rev(cumsum(rev(is.na(x)))) is constant within each run between NAs,
  ## so ave() applies cumsum separately to each run
  y <- ave(x, rev(cumsum(rev(is.na(x)))), FUN = cumsum)
  y[is.na(y)] <- 0
  y
}
apply(m, 2, countReset)
# fluffy misterCuddles
# 2013-01-01 1 1
# 2013-01-02 2 0
# 2013-01-03 3 1
# 2013-01-04 4 2
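To make the grouping transparent, here is the intermediate grouping vector for the misterCuddles column (a small illustration):
x <- m[, "misterCuddles"]      # 0.9 NA 1.1 1.0
rev(cumsum(rev(is.na(x))))     # values: 1 1 0 0 -- a new group starts after each NA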
Since I'm in a period where I'm trying to get used to .Call, here's another idea that seems to work and -probably- is fast. (Don't take my word for it, though, my skills are not trustworthy!!):
library(inline) #use "inline" package for convenience
f <- cfunction(sig = c(R_mat = "numeric", R_dims = "integer"), body = '
  R_len_t *dims = INTEGER(R_dims);
  R_len_t rows = dims[0], cols = dims[1];
  double *mat = REAL(R_mat);
  SEXP ans;
  PROTECT(ans = allocMatrix(INTSXP, rows, cols));
  R_len_t *pans = INTEGER(ans);
  for (int ic = 0; ic < cols; ic++) {
    /* first row: 1 unless NA */
    pans[0 + ic*rows] = ISNA(mat[0 + ic*rows]) ? 0 : 1;
    for (int ir = 1; ir < rows; ir++) {
      if (ISNA(mat[ir + ic*rows])) {
        pans[ir + ic*rows] = 0;                        /* reset on NA */
      } else if (!ISNA(mat[(ir - 1) + ic*rows])) {
        pans[ir + ic*rows] = pans[(ir - 1) + ic*rows] + 1;
      } else {
        pans[ir + ic*rows] = 1;                        /* first value after an NA */
      }
    }
  }
  UNPROTECT(1);
  return ans;
')
f(m, dim(m))
# [,1] [,2]
#[1,] 1 1
#[2,] 2 0
#[3,] 3 1
#[4,] 4 2
f(mm, dim(mm)) #I named Roland's matrix, mm ; I felt that I had to pass this test!
# [,1] [,2]
#[1,] 0 0
#[2,] 1 1
#[3,] 2 2
#[4,] 3 0
#[5,] 4 0
#[6,] 5 1
#[7,] 6 0
So the solution to this problem has two parts:
A function that accepts a vector per cat and returns a vector telling me at each date, how many days since the last NA
A function that accepts an NxM matrix and returns an NxM matrix, applying function (1) to each column
For (2), I adapted this from #Usobi's answer:
library(parallel)  # for mclapply
daysSinceLastNA <- function(matrix, vectorFunction, cores = 1) {
  listResult <- mclapply(as.data.frame(matrix), vectorFunction, mc.cores = cores)
  result <- do.call("cbind", listResult)
  rownames(result) <- rownames(matrix)
  result
}
For (1), I have two solutions:
#ananda-mahto's solution:
daysSinceLastNA_1 <- function(vector) {
  vector[!is.na(vector)] <- 1
  result <- ave(vector, rev(cumsum(rev(is.na(vector)))), FUN = cumsum)
  result[is.na(result)] <- 0
  result
}
#Usobi's solution:
daysSinceLastNA_2 <- function(vector) {
  reduction <- function(total, additional) ifelse(is.na(additional), 0, total + 1)
  Reduce(reduction, vector, init = 0, accumulate = TRUE)[-1]
}
Then I call them like this:
> system.time(result1 <- daysSinceLastNA (test, daysSinceLastNA_1 ))
user system elapsed
5.40 0.01 5.42
> system.time(result2 <- daysSinceLastNA (test, daysSinceLastNA_2 ))
user system elapsed
58.02 0.00 58.03
On my test dataset, which is roughly a 2500x2500 matrix, the first approach is an order of magnitude faster.
If I run on linux with 64 cores, solution (1) runs in 2 seconds, and solution (2) runs in 6 seconds.
For this sort of problem, which is easily solved with a for loop, I find Rcpp a very natural answer.
library(Rcpp)
cppFunction("NumericVector cumsum2(NumericVector x) {
int n = x.length();
NumericVector out(x);
for(int i = 0; i < n; ++i) {
if (NumericVector::is_na(x[i]) || i == 0) {
x[i] = 0;
} else {
x[i] = x[i - 1] + 1;
}
}
return out;
}")
The code requires a little more bookkeeping than the equivalent R code, but the bulk of the function is a very simple for loop.
You can then apply in R like any other vectorised function:
m2 <- cbind(
fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA)
)
apply(m2, 2, cumsum2)
You could of course make the C++ code iterate over the columns of the matrix, but I think that since this is already easily expressed in R, you might as well use the built in tools.
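For completeness, here is a sketch of what iterating over the columns inside C++ could look like (cumsum2mat is a hypothetical name, my assumption rather than part of the answer above):
cppFunction("NumericMatrix cumsum2mat(NumericMatrix x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericMatrix out(nr, nc);
  for (int j = 0; j < nc; ++j) {
    for (int i = 0; i < nr; ++i) {
      if (NumericVector::is_na(x(i, j))) out(i, j) = 0;  // reset on NA
      else if (i == 0) out(i, j) = 1;                    // start a run
      else out(i, j) = out(i - 1, j) + 1;
    }
  }
  return out;
}")
cumsum2mat(m2)  # one call instead of apply(m2, 2, cumsum2)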
