I have an n x n symmetric binary matrix and I want to find the largest rectangle (by area) with 0 at the top-left and bottom-right corners and 1 at the top-right and bottom-left corners. If I just use loops, checking all the rectangles from the biggest to the smallest, it takes "days" for n=100. Does anyone have an idea how to do this efficiently?
Thanks a lot!
Thanks for your answers. The matrices I use are adjacency matrices of random Erdos-Renyi graphs, but any random symmetric binary matrix can be used to test it. So far, I use 4 nested loops:
switch<-function(Mat)
{
n=nrow(Mat)
for (i in 1:(n-1)) {
for(j in seq(n,i+1,by=-1)) {
for(k in 1:(n-1)) {
if ((k==i)||(k==j) || (Mat[i,k]==1)||(Mat[j,k]==0)) next
for(l in seq(n,k+1,by=-1)) {
if ((l==i)||(l==j)|| (Mat[i,l]==0)||(Mat[j,l]==1)) next
# return() accepts a single value, so the four indices are combined into one vector
return(c(i,j,k,l))
}
}
}
}
}
Here's an approach that you can try for now. It doesn't require symmetry, and it treats all nonzero elements like ones for efficiency.
It loops over the ones, assuming that there are fewer ones than zeros. (You would want to loop over zeros in the reverse case with fewer zeros than ones.)
This approach probably isn't optimal, since it loops over all of the ones even if the largest box is identified early. You can devise a clever stopping condition to short-circuit the loop in that case.
But it is still fast for n = 100, requiring less than half of a second on my machine, even when ones and zeros occur in roughly equal proportion (the worst case):
f <- function(X) {
if (!is.logical(X)) {
storage.mode(X) <- "logical"
}
J <- which(X, arr.ind = TRUE, useNames = FALSE)
i <- J[, 1L]
j <- J[, 2L]
nmax <- 0L
res <- NULL
for (k in seq_along(i)) {
i0 <- i[k]
j0 <- j[k]
ok <- i < i0 & j > j0
if (any(ok)) {
i1 <- i[ok]
j1 <- j[ok]
ok <- !(X[i0, j1] | X[i1, j0])
if (any(ok)) {
i1 <- i1[ok]
j1 <- j1[ok]
n <- (i0 - i1 + 1L) * (j1 - j0 + 1L)
w <- which.max(n)
if (n[w] > nmax) {
nmax <- n[w]
res <- c(i0 = i0, j0 = j0, i1 = i1[w], j1 = j1[w])
}
}
}
}
res
}
mkX <- function(n) {
X <- matrix(sample(0:1, n * n, TRUE), n, n)
X[upper.tri(X)] <- t(X)[upper.tri(X)]
X
}
set.seed(1L)
X <- mkX(6L)
X
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 1 0 0 1 0
## [2,] 1 0 1 1 0 0
## [3,] 0 1 0 1 1 1
## [4,] 0 1 1 0 0 0
## [5,] 1 0 1 0 0 1
## [6,] 0 0 1 0 1 0
f(X)
## i0 j0 i1 j1
## 5 1 1 5
Y <- mkX(100L)
microbenchmark::microbenchmark(f(Y))
## Unit: milliseconds
## expr min lq mean median uq max neval
## f(Y) 310.139 318.3363 327.8116 321.4109 326.5088 391.9081 100
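For small inputs, the result of f can be cross-checked against an exhaustive search. The following is only a sketch: largest_rect_brute and X6 are hypothetical names that do not appear in the answer above, and the O(n^4) loops make it unsuitable for anything but small test matrices.

```r
# Hypothetical brute-force checker (not part of the original answer).
# It tries every pair of rows and every pair of columns.
largest_rect_brute <- function(X) {
  X <- X != 0                      # treat every nonzero entry as a one
  n <- nrow(X)
  m <- ncol(X)
  best <- 0L
  res <- NULL
  for (top in seq_len(n - 1L)) {
    for (bottom in (top + 1L):n) {
      for (left in seq_len(m - 1L)) {
        for (right in (left + 1L):m) {
          # zeros at top-left and bottom-right, ones at top-right and bottom-left
          if (!X[top, left] && !X[bottom, right] &&
              X[top, right] && X[bottom, left]) {
            area <- (bottom - top + 1L) * (right - left + 1L)
            if (area > best) {
              best <- area
              res <- c(top = top, left = left, bottom = bottom,
                       right = right, area = area)
            }
          }
        }
      }
    }
  }
  res
}

# The 6 x 6 example matrix printed above, entered row by row
X6 <- matrix(c(0, 1, 0, 0, 1, 0,
               1, 0, 1, 1, 0, 0,
               0, 1, 0, 1, 1, 1,
               0, 1, 1, 0, 0, 0,
               1, 0, 1, 0, 0, 1,
               0, 0, 1, 0, 1, 0), nrow = 6, byrow = TRUE)
largest_rect_brute(X6)
```

On this matrix the search reports rows 1 to 5 and columns 1 to 5 (area 25), which is the same rectangle that f(X) returns as i0 = 5, j0 = 1, i1 = 1, j1 = 5.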
I have written the following code to simulate an unbiased random walk on Z^2. With probability 1/4 each, the "destination" is supposed to move one unit up, left, right, or down. So I made "destination" a matrix with two columns, one for the x-coordinate and one for the y-coordinate, and increment/decrement the appropriate coordinate according to the value of runif(1).
N_trials <- 10
N_steps <- 10
destination <- matrix(0,N_trials,2)
for(n in 1:N_steps) {
p <- runif(1)
if(p < 1/4) {
destination[n,1] <- destination[n,1] - 1
}
else if(p < 1/2) {
destination[n,1] <- destination[n,1] + 1
}
else if(p < 3/4) {
destination[n,2] <- destination[n,2] + 1
}
else if(p < 1) {
destination[n,2] <- destination[n,2] - 1
}
}
However, the process never seems to move out of the set {(0,0),(1,0),(-1,0),(0,1),(0,-1)}. Why is this? Is there an error in the logic of my code?
Rather than using loops, you can vectorize the random walk.
The idea is to first create a matrix of possible steps:
steps <- matrix(c(0,0,-1,1,-1,1,0,0),nrow = 4)
which is:
[,1] [,2]
[1,] 0 -1
[2,] 0 1
[3,] -1 0
[4,] 1 0
Then you can feed random subscripts into it:
steps[sample(1:4,10,replace = TRUE),]
for example will create a matrix of 10 rows where each row is randomly chosen from the steps matrix.
If you rbind this with c(0,0) as a starting position, and then take the cumulative sum (cumsum) of each column, you have your walk. You can wrap this all in a function:
rand.walk <- function(n){
steps <- matrix(c(0,0,-1,1,-1,1,0,0),nrow = 4)
walk <- steps[sample(1:4,n,replace = TRUE),]
walk <-rbind(c(0,0),walk)
apply(walk,2,cumsum)
}
For example, plot(rand.walk(1000),type = 'l') draws the path of a 1000-step walk as a line plot.
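A quick sanity check of the vectorized walk is to confirm that it starts at the origin and that every increment moves exactly one unit along one axis. This is a sketch under assumptions: rand_walk is a renamed copy of the function above, and drop = FALSE is added here so that a one-step walk still indexes to a matrix.

```r
# Renamed copy of the walk function; drop = FALSE is an addition for n = 1
rand_walk <- function(n) {
  steps <- matrix(c(0, 0, -1, 1, -1, 1, 0, 0), nrow = 4)
  moves <- steps[sample(1:4, n, replace = TRUE), , drop = FALSE]
  apply(rbind(c(0, 0), moves), 2, cumsum)
}

set.seed(42)
walk <- rand_walk(1000)

dim(walk)                               # n + 1 positions, 2 coordinates
increments <- diff(walk)                # row-to-row differences
all(rowSums(abs(increments)) == 1)      # every step has unit length
```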
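In your loop, each row n is only ever updated from its own starting value of 0, so every row ends up at most one step from the origin. Each position has to be built from the previous row instead. Here's what I have; is this what you had in mind?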
set.seed(1)
N_trials <- 10
N_steps <- 10
destination <- matrix(0, N_trials, 2)
for(n in 1:(N_steps-1)) {
p <- runif(1)
if(p < 1/4) {
destination[n+1,1] <- destination[n,1] - 1
destination[n+1,2] <- destination[n,2]
}
else if(p < 1/2) {
destination[n+1,1] <- destination[n,1] + 1
destination[n+1,2] <- destination[n,2]
}
else if(p < 3/4) {
destination[n+1,1] <- destination[n,1]
destination[n+1,2] <- destination[n,2] + 1
}
else if(p < 1) {
destination[n+1,1] <- destination[n,1]
destination[n+1,2] <- destination[n,2] - 1
}
}
destination
[,1] [,2]
[1,] 0 0
[2,] 1 0
[3,] 2 0
[4,] 2 1
[5,] 2 0
[6,] 1 0
[7,] 1 -1
[8,] 1 -2
[9,] 1 -1
[10,] 1 0
The trace of a matrix can be computed with the expression below:
sum(diag(matrix))
This may silently return a result even if the matrix is not square (i.e. not of size "n x n"). Are there any other built-in functions for computing the trace of a matrix?
Package: psych
Function: tr()
Example:
> x <- matrix(replicate(9,1), ncol = 3, nrow = 3)
> x
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> tr(x)
[1] 3
> x <- matrix(replicate(12,1), ncol = 4, nrow = 3)
> x
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 1 1 1
[3,] 1 1 1 1
> tr(x)
Fehler in tr(x) : m must be a square matrix
("Fehler" means error)
Moreover
Package: matrixcalc
Function: matrix.trace
Below is a quick function to test if the object is a matrix and then test if it is also square.
tr <- function (m)
{
total_sum <- 0
if(is.matrix(m))
{
row_count <- nrow(m)
col_count <- ncol(m)
if(row_count == col_count)
{
total_sum <-sum(diag(m))
total_sum
}
else
{
message ('Matrix is not square')
}
}
else
{
message( 'Object is not a matrix')
}
}
I also found the following package, which provides matrix.trace:
matrixcalc
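If a non-square input should stop the computation rather than just print a message, a stricter variant can raise an error. This is a minimal sketch; safe_trace is a hypothetical name and is not taken from psych or matrixcalc.

```r
# Hypothetical helper: same checks as above, but failing loudly with stop()
safe_trace <- function(m) {
  if (!is.matrix(m)) stop("object is not a matrix")
  if (nrow(m) != ncol(m)) stop("matrix is not square")
  sum(diag(m))
}

safe_trace(matrix(1, nrow = 3, ncol = 3))

# A non-square input raises an error; here the message is captured for display
res <- tryCatch(safe_trace(matrix(1, nrow = 3, ncol = 4)),
                error = function(e) conditionMessage(e))
res
```

Using stop() means a caller cannot accidentally continue with a NULL result, which is what message() would leave behind.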
You can try using eigenvalues
# first find eigenvalues
# H is assumed to be the square matrix of interest
e = eigen(H)
# Calculate the trace of the matrix, and compare with the sum of the eigenvalues.
# function to calculate the trace using sum of the diagonal
trace <- function(data)sum(diag(data))
trace(H)
# using sum of the eigenvalues
sum(e$values)
Hope it helps.
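As a small worked check of the eigenvalue identity, consider a symmetric 2 x 2 matrix whose eigenvalues are known to be 3 and 1. The names A, tr_diag and tr_eig are illustrative only.

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

tr_diag <- sum(diag(A))            # 2 + 2
tr_eig  <- sum(eigen(A)$values)    # eigenvalues are 3 and 1

# Compare with a tolerance, since eigen() works in floating point
all.equal(tr_diag, tr_eig)
```

Note that eigen() itself requires a square matrix, so this route does not help with the non-square case either.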
I am trying to do something that I am sure should be quite simple: I am trying to write a function which turns a list of number pairs (pairedList) and a vector (botList) into a series of vectors (one for each pair) of length(botList), where every element is zero except at the index positions given by the pair, which are set to 1.
#generating mock data to simulate my application:
pair1 <- c(2,4)
pair2 <- c(1,3)
pair3 <- c(5,6)
pairedList <- c(pair1, pair2, pair3)
botList <- c(1:length(pairedList))
Here is what the output should ultimately look like:
[1] 0 1 0 1 0 0
[1] 1 0 1 0 0 0
[1] 0 0 0 0 1 1
The code below allows me to print the vectors correctly (by replacing the line in the if block with print(prob) and commenting out the final print statement):
library(gtools)
test <- function() {
#initialising empty list
output <- list()
for (i in botList) {
x <- rep(0, length(pairedList))
ind <- pairedList[i:(i+1)]
ind.inv <- sort(ind, decreasing=T)
val <- rep(1,length(ind))
new.x <- vector(mode="numeric",length(x)+length(val))
new.x <- new.x[-ind]
new.x[ind] <- val
prob <- new.x
if (odd(i)) {
output[i] <- prob
}
print(output)
}
}
However I need to return this list of vectors from my function rather than printing it and when I do so, I get the following output and am met with an error and a number of warnings:
[[1]]
[1] 0
[[1]]
[1] 0
[[1]]
[1] 0
[[2]]
NULL
[[3]]
[1] 1
[[1]]
[1] 0
[[2]]
NULL
[[3]]
[1] 1
[[1]]
[1] 0
[[2]]
NULL
[[3]]
[1] 1
[[4]]
NULL
[[5]]
[1] 0
Error in new.x[-ind] : only 0's may be mixed with negative subscripts
In addition: Warning messages:
1: In output[i] <- prob :
number of items to replace is not a multiple of replacement length
2: In output[i] <- prob :
number of items to replace is not a multiple of replacement length
3: In output[i] <- prob :
number of items to replace is not a multiple of replacement length
My question is:
How can I change my code to output what I need from this function? I thought this was going to be a five minute job, and after hours on this one little thing I am stuck!
Thanks in advance
Something you can try, although there must be nicer ways:
# create a list with all the "pair1", "pair2", ... objects
l_pairs <- mget(ls(pattern="^pair\\d+"))
# compute maximum number among the values of pair., it determines the number of columns of the results
n_max <- max(unlist(l_pairs))
# finally, create for each pair. a vector of 0s and put 1s at the positions specified in pair.
res <- t(sapply(l_pairs, function(x){y <- rep(0, n_max); y[x]<-1; y}))
res
# [,1] [,2] [,3] [,4] [,5] [,6]
#pair1 0 1 0 1 0 0
#pair2 1 0 1 0 0 0
#pair3 0 0 0 0 1 1
You could use row/col indexing
m1 <- matrix(0, ncol=max(pairedList), nrow=3)
m1[cbind(rep(1:nrow(m1),each=2), pairedList)] <- 1
m1
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 1 0 1 0 0
#[2,] 1 0 1 0 0 0
#[3,] 0 0 0 0 1 1
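The matrix-indexing idea can be wrapped in a small function that takes the pairs as a list. This is a sketch under assumptions: pairs_to_indicator and its arguments are hypothetical names, and the pairs are passed as a list rather than as the flattened pairedList vector.

```r
# Hypothetical wrapper: one row per pair, 1s at the positions named by the pair
pairs_to_indicator <- function(pairs, n = max(unlist(pairs))) {
  out <- matrix(0, nrow = length(pairs), ncol = n)
  # row index repeated once for each value in the corresponding pair
  rows <- rep(seq_along(pairs), lengths(pairs))
  out[cbind(rows, unlist(pairs))] <- 1
  out
}

pairs <- list(c(2, 4), c(1, 3), c(5, 6))
pairs_to_indicator(pairs)
```

Indexing a matrix with a two-column matrix addresses individual (row, column) cells, so all the 1s are set in a single assignment.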
James, the following should work. I've just tested it.
pair1 <- c(2,4)
pair2 <- c(1,3)
pair3 <- c(5,6)
pairedList <- c(pair1, pair2, pair3)
botList <- c(1:(length(pairedList)/2))
library(gtools)
test <- function(pairedList, botList) {
#initialising empty list
output <- list()
for (i in botList) {
x <- rep(0, length(pairedList))
# pair i occupies positions 2*i - 1 and 2*i of the flattened pairedList
ind <- pairedList[(2*i - 1):(2*i)]
ind.inv <- sort(ind, decreasing=T)
val <- rep(1,length(ind))
new.x <- vector(mode="numeric",length(x)+length(val))
new.x <- new.x[-ind]
new.x[ind] <- val
prob <- new.x
output[[i]] <- prob
print(prob)
}
return(output)
}
The reason for the strange error is that botList was being created with length 6 rather than length 3. Also, to store a whole vector as a single element of a list you need to use double brackets [[ ]] rather than single brackets [ ].
Once the function has returned the list, rbind its elements together as follows:
output <- test(pairedList, botList)
result <- do.call(rbind,output)
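The difference between single and double brackets on a list can be seen in isolation. In this sketch the names vec, a and b are illustrative; suppressWarnings() is used only to keep the console quiet.

```r
vec <- c(0, 1, 0)

a <- list()
# `[<-` treats vec as three separate list items for one slot, so R warns
# ("number of items to replace is not a multiple of replacement length")
suppressWarnings(a[1] <- vec)

b <- list()
b[[1]] <- vec      # `[[<-` stores the whole vector as one element

a[[1]]             # only the first value was kept
b[[1]]             # the full vector
```

This is the source of the single 0s and 1s seen in the printed output of the question.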
I'm gathering data on how much my cats poop into a matrix:
m <- cbind(fluffy=c(1.1,1.2,1.3,1.4),misterCuddles=c(0.9,NA,1.1,1.0))
row.names(m) <- c("2013-01-01", "2013-01-02", "2013-01-03","2013-01-04")
Which gives me this:
fluffy misterCuddles
2013-01-01 1.1 0.9
2013-01-02 1.2 NA
2013-01-03 1.3 1.1
2013-01-04 1.4 1.0
On every date, I'd like to know how many days in a row each cat has gone number 2. So the resulting matrix should look like this:
fluffy misterCuddles
2013-01-01 1 1
2013-01-02 2 0
2013-01-03 3 1
2013-01-04 4 2
Is there a way to do this efficiently? The cumsum function does something similar, but that's a primitive so I can't modify it to suit my dirty, dirty needs.
I could run a for loop and store a count like so:
m.output <- matrix(nrow=nrow(m),ncol=ncol(m))
for (column in 1:ncol(m)) {
sum <- 0
for (row in 1:nrow(m)) {
if (is.na(m[row,column])) sum <- 0
else sum <- sum + 1
m.output[row,column] <- sum
}
}
Is this the most efficient way to do this? I have a lot of cats, and I've recorded years' worth of poop data. Can I parallelize this by column somehow?
All of the answers here are actually too complicated (including my own, from earlier, copied below). The Reduce family of answers is just masking a for-loop in a single function call. I like Roland's and Ananda's, but both I think have a little too much going on.
Thus, here's a simple vectorized solution:
reset <- function(x) {
s <- seq_along(x)
s[!is.na(x)] <- 0
seq_along(x) - cummax(s)
}
> apply(m, 2, reset)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
It also works on Roland's example:
m2 <- cbind(fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA))
> apply(m2, 2, reset)
fluffy misterCuddles
[1,] 0 0
[2,] 1 1
[3,] 2 2
[4,] 3 0
[5,] 4 0
[6,] 5 1
[7,] 6 0
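To see why the cummax trick works, it helps to print the intermediate vectors for one column. The sketch below traces the misterCuddles column of the example above; idx and last_na are names chosen here for illustration.

```r
x <- c(NA, 1.3, 2, NA, NA, 1.1, NA)

idx <- seq_along(x)        # 1 2 3 4 5 6 7
s <- idx
s[!is.na(x)] <- 0L         # keep positions of NAs only: 1 0 0 4 5 0 7
last_na <- cummax(s)       # position of the most recent NA: 1 1 1 4 5 5 7

idx - last_na              # days since the most recent NA: 0 1 2 0 0 1 0
```

When a column has no NA yet, cummax(s) is still 0, so the count is simply the row position.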
From earlier: this is not vectorized, but also works:
pooprun <- function(x){
z <- numeric(length=length(x))
count <- 0
for(i in 1:length(x)){
if(is.na(x[i]))
count <- 0
else
count <- count + 1
z[i] <- count
}
return(z)
}
> apply(m, 2, pooprun)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
THE BENCHMARKING
Here I simply wrap everyone's answers in a function call (based on their name).
> library(microbenchmark)
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 1.540 4.6200 5.3890 6.1590 372.185 1000
hadley() 87.755 92.758 94.298 96.6075 1767.012 1000
thomas() 92.373 99.6860 102.7655 106.6140 315.223 1000
matthew() 128.168 136.2505 139.7150 145.4880 5196.344 1000
thomasloop() 133.556 141.6390 145.1030 150.4920 84131.427 1000
usobi() 148.182 159.9210 164.7320 174.1620 5010.445 1000
ananda() 720.507 742.4460 763.6140 801.3335 5858.733 1000
And here are the results for Roland's example data:
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 2.310 5.3890 6.1590 6.9290 75.438 1000
hadley() 75.053 78.902 80.058 83.136 1747.767 1000
thomas() 90.834 97.3770 100.2640 104.3050 358.329 1000
matthew() 139.715 149.7210 154.3405 161.2680 5084.728 1000
thomasloop() 144.718 155.4950 159.7280 167.4260 5182.103 1000
usobi() 177.048 188.5945 194.3680 210.9180 5360.306 1000
ananda() 705.881 729.9370 753.4150 778.8175 8226.936 1000
Note: Alexis's and Hadley's solutions took quite a while to actually define as functions on my machine, whereas the others work out-of-the-box, but Alexis's is otherwise the clear winner.
This should work. Note that each of your cats is an independent individual, so you can turn your matrix into a data frame (a list of columns) and use mclapply from the parallel package, which processes the columns in parallel.
count <- function(y,x){
if(is.na(x)) return(0)
return (y + 1)
}
oneCat = m[,1]
Reduce(count,oneCat,init=0,accumulate=TRUE)[-1]
EDIT: here is the full answer
count <- function(x,y){
if(is.na(y)) return(0)
return (x + 1)
}
library(parallel) # provides mclapply
mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
EDIT2: The remaining problem is that init=0 leaves an extra 0 at the beginning of each column, so...
result = mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
finalResult = do.call('cbind',result)[-1,]
rownames(finalResult) = rownames(m)
does the job.
Another option, similar to @Usobi's in that it uses Reduce, but with a slightly different approach:
apply(!is.na(m), 2, Reduce, f=function(x,y) if (y) x + y else y, accumulate=TRUE)
# fluffy misterCuddles
# [1,] 1 1
# [2,] 2 0
# [3,] 3 1
# [4,] 4 2
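The accumulating Reduce can be traced on a single logical vector. In this sketch, present and step are illustrative names; step is the same anonymous function as in the one-liner above.

```r
present <- c(TRUE, FALSE, TRUE, TRUE)        # TRUE where a value was recorded

# carry the running count forward while y is TRUE, reset to FALSE (0) otherwise
step <- function(x, y) if (y) x + y else y

runs <- Reduce(step, present, accumulate = TRUE)
as.integer(runs)                             # 1 0 1 2
```

With accumulate = TRUE, Reduce returns every intermediate value rather than only the final one, which is what turns a fold into a running count.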
I had saved a snippet from here that translates almost exactly for a problem like this:
countReset <- function(x) {
x[!is.na(x)] <- 1
y <- ave(x, rev(cumsum(rev(is.na(x)))), FUN=cumsum)
y[is.na(y)] <- 0
y
}
apply(m, 2, countReset)
# fluffy misterCuddles
# 2013-01-01 1 1
# 2013-01-02 2 0
# 2013-01-03 3 1
# 2013-01-04 4 2
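The grouping expression inside ave is the least obvious part of countReset, so here is a sketch that prints it for the misterCuddles column. The names grp and ones are chosen for illustration.

```r
x <- c(0.9, NA, 1.1, 1.0)

# Counting NAs from the end puts each NA in the same group as the run
# that precedes it, so every run after an NA starts in a fresh group
grp <- rev(cumsum(rev(is.na(x))))
grp                                  # 1 1 0 0

ones <- ifelse(is.na(x), NA, 1)      # 1 NA 1 1
y <- ave(ones, grp, FUN = cumsum)    # cumulative sum within each group
y[is.na(y)] <- 0
y                                    # 1 0 1 2
```

Because each NA sits at the end of its group, the NA only contaminates its own position in the cumulative sum, and that position is then set to 0.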
Since I'm in a period where I'm trying to get used to .Call, here's another idea that seems to work and -probably- is fast. (Don't take my word for it, though, my skills are not trustworthy!!):
library(inline) #use "inline" package for convenience
f <- cfunction(sig = c(R_mat = "numeric", R_dims = "integer"), body = '
R_len_t *dims = INTEGER(R_dims);
R_len_t rows = dims[0], cols = dims[1];
double *mat = REAL(R_mat);
SEXP ans;
PROTECT(ans = allocMatrix(INTSXP, rows, cols));
R_len_t *pans = INTEGER(ans);
for(int ic = 0; ic < cols; ic++)
{
pans[0 + ic*rows] = ISNA(mat[0 + ic*rows]) ? 0 : 1;
for(int ir = 1; ir < rows; ir++)
{
if(ISNA(mat[ir + ic*rows]))
{
pans[ir + ic*rows] = 0;
}else
{
if(!ISNA(mat[(ir - 1) + ic*rows]))
{
pans[ir + ic*rows] = pans[(ir - 1) + ic*rows] + 1;
}else
{
pans[ir + ic*rows] = 1;
}
}
}
}
UNPROTECT(1);
return(ans);
')
f(m, dim(m))
# [,1] [,2]
#[1,] 1 1
#[2,] 2 0
#[3,] 3 1
#[4,] 4 2
f(mm, dim(mm)) #I named Roland's matrix, mm ; I felt that I had to pass this test!
# [,1] [,2]
#[1,] 0 0
#[2,] 1 1
#[3,] 2 2
#[4,] 3 0
#[5,] 4 0
#[6,] 5 1
#[7,] 6 0
So the solution to this problem has two parts:
(1) A function that accepts a vector per cat and returns a vector telling me, at each date, how many days since the last NA
(2) A function that accepts an NxM matrix and returns an NxM matrix, applying function (1) to each column
For (2), I adapted this from @Usobi's answer:
library(parallel) # provides mclapply
daysSinceLastNA <- function(matrix, vectorFunction, cores=1) {
listResult <- mclapply(as.data.frame(matrix), vectorFunction, mc.cores=cores)
result <- do.call('cbind', listResult)
rownames(result) <- rownames(matrix)
result
}
For (1), I have two solutions:
#ananda-mahto's solution:
daysSinceLastNA_1 <- function(vector) {
vector[!is.na(vector)] <- 1
result <- ave(vector, rev(cumsum(rev(is.na(vector)))), FUN=cumsum)
result[is.na(result)] <- 0
result
}
#Usobi's solution:
daysSinceLastNA_2 <- function(vector) {
reduction <- function(total, additional) ifelse(is.na(additional), 0, total + 1)
Reduce(reduction, vector, init=0, accumulate=TRUE)[-1]
}
Then I call them like this:
> system.time(result1 <- daysSinceLastNA (test, daysSinceLastNA_1 ))
user system elapsed
5.40 0.01 5.42
> system.time(result2 <- daysSinceLastNA (test, daysSinceLastNA_2 ))
user system elapsed
58.02 0.00 58.03
On my test dataset, which is roughly a 2500x2500 matrix, the first approach is an order of magnitude faster.
If I run on Linux with 64 cores, solution (1) runs in 2 seconds, and solution (2) runs in 6 seconds.
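The same column-wise wrapper can be combined with the vectorized cummax approach from the earlier answer. This is a sketch under assumptions: run_lengths and by_column are hypothetical names, and cores defaults to 1 because mclapply relies on forking, which is not available on Windows.

```r
library(parallel)

# Per-column counter, using the cummax approach shown in an earlier answer
run_lengths <- function(x) {
  s <- seq_along(x)
  s[!is.na(x)] <- 0L
  seq_along(x) - cummax(s)
}

# Hypothetical wrapper: apply run_lengths to each column, optionally in parallel
by_column <- function(m, cores = 1L) {
  cols <- lapply(seq_len(ncol(m)), function(j) m[, j])
  res <- mclapply(cols, run_lengths, mc.cores = cores)
  out <- do.call(cbind, res)
  dimnames(out) <- dimnames(m)      # restore dates and cat names
  out
}

m <- cbind(fluffy = c(1.1, 1.2, 1.3, 1.4),
           misterCuddles = c(0.9, NA, 1.1, 1.0))
rownames(m) <- c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04")
by_column(m)
```

Whether extra cores help depends on the size of the matrix; for a per-column function this cheap, the cost of forking may outweigh the gain.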
For this sort of problem, which is easily solved with a for loop, I find Rcpp a very natural answer.
library(Rcpp)
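cppFunction("NumericVector cumsum2(NumericVector x) {
  int n = x.length();
  // fresh output vector, so the input x is not modified in place
  NumericVector out(n);
  for(int i = 0; i < n; ++i) {
    if (NumericVector::is_na(x[i])) {
      out[i] = 0;
    } else if (i == 0) {
      // a non-missing first value starts a run of length 1
      out[i] = 1;
    } else {
      out[i] = out[i - 1] + 1;
    }
  }
  return out;
}")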
The code requires a little more bookkeeping than the equivalent R code, but the bulk of the function is a very simple for loop.
You can then apply in R like any other vectorised function:
m2 <- cbind(
fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA)
)
apply(m2, 2, cumsum2)
You could of course make the C++ code iterate over the columns of the matrix, but I think that since this is already easily expressed in R, you might as well use the built in tools.