I'm doing some kind of optical character recognition and have run into the following issue. I store the glyphs in a list of binary matrices. They can be of different sizes, but their maximum possible width is wid = 3 columns (this may be any defined constant, not just 3). In some cases, after the first stage of processing, I get data that looks like this:
myll <- list(matrix(c(0, 0, 0, 1, 1, 0), ncol = 2),
matrix(c(0), ncol = 1),
matrix(c(1, 1, 0), ncol = 3),
matrix(c(1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1), ncol = 7),
matrix(c(1, 1, 1, 1), ncol = 2))
# [[1]]
# [,1] [,2]
# [1,] 0 1
# [2,] 0 1
# [3,] 0 0
#
# [[2]]
# [,1]
# [1,] 0
#
# [[3]]
# [,1] [,2] [,3]
# [1,] 1 1 0
#
# [[4]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 1 1 1 0 0 0 1
# [2,] 0 1 0 1 0 0 1
# [3,] 1 1 1 1 0 0 1
#
# [[5]]
# [,1] [,2]
# [1,] 1 1
# [2,] 1 1
So some glyphs may not have been separated, for whatever reason. This happens only with glyphs of the maximum possible width. Moreover, there may be some junk at the end of the matrix. I have to split them into matrices of width ncol = wid, leaving the last piece (junk) as is. Then I store these matrices in separate elements of the list to get the following output:
# [[1]]
# [,1] [,2]
# [1,] 0 1
# [2,] 0 1
# [3,] 0 0
#
# [[2]]
# [,1]
# [1,] 0
#
# [[3]]
# [,1] [,2] [,3]
# [1,] 1 1 0
#
# [[4]]
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 0 1 0
# [3,] 1 1 1
#
# [[5]]
# [,1] [,2] [,3]
# [1,] 0 0 0
# [2,] 1 0 0
# [3,] 1 0 0
#
# [[6]]
# [,1]
# [1,] 1
# [2,] 1
# [3,] 1
#
# [[7]]
# [,1] [,2]
# [1,] 1 1
# [2,] 1 1
At the moment I can do this with the help of these functions:
checkGlyphs <- function(gl_m, wid = 3) {
if (ncol(gl_m) > wid)
return(list(gl_m[,1:wid], matrix(gl_m[,-(1:wid)], nrow = nrow(gl_m)))) else
return(gl_m)
}
separateGlyphs <- function(myll, wid = 3) {
require("magrittr")
presplit <- lapply(myll, checkGlyphs, wid)
total_new_length <-
presplit[unlist(lapply(presplit, is.list))] %>% lapply(length) %>% unlist() %>% sum() +
as.integer(!unlist(lapply(presplit, is.list))) %>% sum()
splitted <- vector("list", length = total_new_length)
spl_index <- 1
for (i in 1:length(presplit))
{
if (!is.list(presplit[[i]]))
{
splitted[[spl_index]] <- presplit[[i]]
spl_index <- spl_index + 1
} else
{
for (j in 1:length(presplit[[i]]))
{
splitted[[spl_index]] <- presplit[[i]][[j]]
spl_index <- spl_index + 1
}
}
}
if (any(lapply(splitted, ncol) > wid)) return(separateGlyphs(splitted, wid)) else
return(splitted)
}
But I believe there is a faster and more convenient way to achieve the same result (without for loops, this looped reassignment of elements, and then recursion if needed O_o).
I would be thankful for any suggestions on this point or, alternatively, for recommendations of OCR packages for R.
This should do the trick, with the values in final being what you're after.
combined <- do.call(cbind, lapply(myll, unlist))
idx <- seq(1, ncol(combined), 2)
final <- do.call(list, lapply(idx, function(x) combined[, x:(x+1)]))
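Note that cbind requires all of its matrix arguments to have the same number of rows, so it cannot be applied directly to glyphs of different heights. A minimal loop-free sketch that instead splits each glyph on its own is shown here. The helper names splitGlyph and separateGlyphs2 are hypothetical, and the sketch assumes that any trailing piece narrower than wid is kept as is.

```r
# Example data from the question
myll <- list(matrix(c(0, 0, 0, 1, 1, 0), ncol = 2),
             matrix(c(0), ncol = 1),
             matrix(c(1, 1, 0), ncol = 3),
             matrix(c(1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
                      0, 0, 0, 0, 0, 0, 1, 1, 1), ncol = 7),
             matrix(c(1, 1, 1, 1), ncol = 2))

# Hypothetical helper: cut one matrix into chunks of at most `wid` columns
splitGlyph <- function(m, wid = 3) {
  grp <- ceiling(seq_len(ncol(m)) / wid)        # chunk id for every column
  lapply(split(seq_len(ncol(m)), grp),
         function(j) m[, j, drop = FALSE])      # drop = FALSE keeps matrices
}

# Hypothetical wrapper: split every glyph and flatten the result by one level
separateGlyphs2 <- function(ll, wid = 3) {
  unname(do.call(c, lapply(ll, splitGlyph, wid = wid)))
}

res <- separateGlyphs2(myll, wid = 3)
```

Because every matrix is split in a single pass, no recursion is needed, and drop = FALSE preserves one-row and one-column glyphs as matrices.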
How can I create all binary combinations of matrices with the condition that there can only be a single 1 per column and row? The example will clarify. This particular example must have 6 matrices of combinations; I am showing only the first 2.
c1 <- matrix(c(1, 0, 0, 0, 1, 0, 0, 0, 1), nrow = 3) #First combination
c2 <- matrix(c(0, 1, 0, 1, 0, 0, 0, 0, 1), nrow = 3) #Second combination
What you are asking for is equivalent to finding all permutations of length = n where n = nrow(c1) (or c2 above). Using the FUN argument of permuteGeneral from RcppAlgos (I am the author), we can easily generate the desired outcome:
n <- 3L
myIdentity <- diag(nrow = n)
library(RcppAlgos)
permuteGeneral(n, n, FUN = function(x) myIdentity[x, ])
[[1]]
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
[[2]]
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 0 1
[3,] 0 1 0
[[3]]
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 0
[3,] 0 0 1
[[4]]
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 0 1
[3,] 1 0 0
[[5]]
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 1 0 0
[3,] 0 1 0
[[6]]
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 1 0
[3,] 1 0 0
There are many other ways of generating the requested output. Most notably, utilizing the tried and true combinat package, we can get a similar result (the output will be in a different order):
combinat::permn(3, fun = function(x) myIdentity[x, ])
Now that we have reduced the problem to simply generating permutations, we can use any of the great packages (arrangements, gtools, multicool, partitions, etc.) for generating permutations to obtain our desired result with the help of lapply:
library(arrangements)
myPerms <- permutations(n)
lapply(1:nrow(myPerms), function(x) myIdentity[myPerms[x,], ])
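If installing a package is not an option, the same reduction to permutations can be sketched in base R. The recursive helper perms below is hypothetical (it is not part of any package mentioned above) and is only practical for small n, since the number of permutations grows as n!.

```r
# Hypothetical helper: all permutations of the vector v, returned as a list
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # put v[i] first, then append every permutation of the remaining elements
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

n <- 3L
myIdentity <- diag(nrow = n)
# Reordering the rows of the identity matrix gives one permutation matrix each
permMats <- lapply(perms(seq_len(n)), function(p) myIdentity[p, , drop = FALSE])
```

Each element of permMats has exactly one 1 in every row and every column, which is the condition stated in the question.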
I wrote a loop to conditionally replace elements in a list of lists. While still relatively new to R, I feel certain that I am not going about this as efficiently as possible. The following loop runs very slowly (an hour or so) on my actual data. I've included a minimal working example below that exactly replicates the structure of my data.
A <- matrix(c(0, 1, 1, 2, 0, 0, 1, 0, 1, 2, 0, 0), nrow = 2, ncol = 6, byrow = TRUE)
B <- matrix(c(1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 0), nrow = 2, ncol = 6, byrow = TRUE)
C <- matrix(c(1, 0, 0, 1, 0, 1), nrow = 1, ncol = 6, byrow = TRUE)
D <- matrix(c(0, 0, 0, 1, 1, 1), nrow = 1, ncol = 6, byrow = TRUE)
mList <- list(list(A, B))
dList <- list(list(C, D))
The goal of the loop is to replace all cells in column j of the nth item of mList with 0 if element j of the nth item of dList equals 0.
mList
# [[1]]
# [[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 1 1 2 0 0
# [2,] 1 0 1 2 0 0
#
# [[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 1 1 2 0 1
# [2,] 1 0 1 2 0 0
dList
# [[1]]
# [[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 0 0 1 0 1
#
# [[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 1 1 1
Is this another instance of unrealized gains from using the set of apply functions? Is there some better way of doing this that doesn't involve using four indices?
for(i in 1:length(dList)) {
for(j in 1:length(dList[[i]])) {
for(k in 1:length(dList[[i]][[j]])) {
for(m in 1:nrow(mList[[i]][[j]])) {
mList[[i]][[j]][m, k] <-
ifelse(
dList[[i]][[j]][k] == 1,
mList[[i]][[j]][m, k],
0
)
}
}
}
}
Resulting in:
mList
# [[1]]
# [[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 2 0 0
# [2,] 1 0 0 2 0 0
# [[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 2 0 1
# [2,] 0 0 0 2 0 0
I would use a nested lapply to loop through the nested structure of the input lists, recomputing the relevant entry in mList using a single vectorized operation instead of a loop through the columns and rows:
lapply(seq_along(dList), function(i) {
lapply(seq_along(dList[[i]]), function(j) {
t(t(mList[[i]][[j]]) * as.vector(dList[[i]][[j]] != 0))
})
})
# [[1]]
# [[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 2 0 0
# [2,] 1 0 0 2 0 0
#
# [[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 2 0 1
# [2,] 0 0 0 2 0 0
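The double transpose is needed because R recycles a vector down the columns of a matrix, so multiplying the matrix directly by a vector with one entry per column would pair the values up incorrectly. The sketch below shows, on a single matrix from the example, that t(t(M) * keep) gives the same result as two other base R formulations; the variable names are chosen for illustration only.

```r
M <- matrix(c(0, 1, 1, 2, 0, 0,
              1, 0, 1, 2, 0, 0), nrow = 2, byrow = TRUE)
d <- c(1, 0, 0, 1, 0, 1)
keep <- d != 0                               # TRUE for columns to keep

# After transposing, columns become rows, so recycling lines up with keep
viaTranspose <- t(t(M) * keep)

# sweep() applies the multiplication along margin 2 (columns) explicitly
viaSweep <- sweep(M, 2, keep, `*`)

# Repeating each flag nrow(M) times matches R's column-major storage
viaRep <- M * rep(keep, each = nrow(M))
```

All three produce a matrix in which every column whose flag is 0 has been zeroed.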
Here's a benchmark on a list of the same structure, with 10 x 10000 matrices in mList. I've benchmarked the solution you provide, the solution by @thelatemail, and my solution:
set.seed(144)
A <- matrix(sample(0:2, 100000, replace=TRUE), nrow=10)
B <- matrix(sample(0:2, 100000, replace=TRUE), nrow=10)
C <- matrix(sample(0:1, 10000, replace=TRUE), nrow=1)
D <- matrix(sample(0:1, 10000, replace=TRUE), nrow=1)
mList <-list(list(A, B))
dList <- list(list(C, D))
OP <- function(mList, dList) {
for(i in 1:length(dList)) {
for(j in 1:length(dList[[i]])) {
for(k in 1:ncol(dList[[i]][[j]])) {
for(m in 1:nrow(mList[[i]][[j]])) {
mList[[i]][[j]][m, k] <-
ifelse(
dList[[i]][[j]][k] == 1,
mList[[i]][[j]][m, k],
0
)
}
}
}
}
mList
}
josilber <- function(mList, dList) {
lapply(seq_along(dList), function(i) {
lapply(seq_along(dList[[i]]), function(j) {
t(t(mList[[i]][[j]]) * as.vector(dList[[i]][[j]] != 0))
})
})
}
thelatemail <- function(mList, dList) {
Map(
function(L,s) Map(function(sL,ss) {sL[,ss] <- 0; sL}, L, s),
mList,
lapply(dList, function(x) lapply(x, function(y) y==0) )
)
}
library(microbenchmark)
microbenchmark(OP(mList, dList), josilber(mList, dList), thelatemail(mList, dList), times=10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# OP(mList, dList) 12252.468288 13318.745019 13478.116388 13486.732412 13840.106332 14259.053497 10
# josilber(mList, dList) 2.299442 2.401806 2.561809 2.480822 2.552620 3.511609 10
# thelatemail(mList, dList) 4.259594 4.438562 4.683855 4.612297 5.002605 5.122605 10
Both solutions run more than 1000 times faster, mainly because they're not tightly looping through the matrices but instead performing the operation in a vectorized manner.
It's not pretty, but you essentially always need to double loop with a nested list structure.
Map(
function(L,s) Map(function(sL,ss) {sL[,ss] <- 0; sL}, L, s),
mList,
lapply(dList, function(x) lapply(x, function(y) y==0) )
)
#[[1]]
#[[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 0 0 2 0 0
#[2,] 1 0 0 2 0 0
#
#[[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 0 0 2 0 1
#[2,] 0 0 0 2 0 0
I've just replaced all the for() loops with a nested lapply to generate the lists of columns to overwrite, then used a nested Map to replace the columns within each list.
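The same idea may be easier to read when the per-matrix step is pulled out into a named helper. This is only a sketch; the name zeroCols is hypothetical, and it assumes every element of dList is a one-row matrix with one flag per column of the matching element of mList.

```r
A <- matrix(c(0, 1, 1, 2, 0, 0, 1, 0, 1, 2, 0, 0), nrow = 2, byrow = TRUE)
B <- matrix(c(1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 0), nrow = 2, byrow = TRUE)
C <- matrix(c(1, 0, 0, 1, 0, 1), nrow = 1)
D <- matrix(c(0, 0, 0, 1, 1, 1), nrow = 1)
mList <- list(list(A, B))
dList <- list(list(C, D))

# Hypothetical helper: zero every column of M whose flag in d equals 0
zeroCols <- function(M, d) {
  M[, as.vector(d) == 0] <- 0   # logical column index, whole columns at once
  M                             # return the modified matrix, not the assigned 0
}

# Outer Map walks the top-level lists, inner Map walks the paired matrices
res <- Map(function(ms, ds) Map(zeroCols, ms, ds), mList, dList)
```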
Say I have three lists:
l_zero
[[1]]
[,1] [,2]
[1,] 0 0
[2,] 0 0
[[2]]
[,1] [,2]
[1,] 0 0
[2,] 0 0
l_ind <- list(matrix(c(1,1), ncol = 2), matrix(c(1,1,1,2), ncol = 2))
l_ind
[[1]]
[,1] [,2]
[1,] 1 1
[[2]]
[,1] [,2]
[1,] 1 1
[2,] 1 2
l_val <- list(5, c(4, 7))
l_val
[[1]]
[1] 5
[[2]]
[1] 4 7
I would like to run Map over the three lists, with the goal of replacing the zeros in l_zero at the coordinates given in l_ind with the values from l_val.
My attempt gives me the following:
Map(function(l_zero, l_ind, l_val) l_zero[l_ind] <- l_val, l_zero = l_zero, l_ind = l_ind, l_val = l_val)
[[1]]
[1] 5
[[2]]
[1] 4 7
As you can see, the original dimensions of the matrices are reduced, but I would like to keep the dimensions of the matrices and just replace the values with the coordinates in l_ind. I tried l_zero[l_ind, drop = FALSE], but that didn't help either.
Can someone help me with this?
Here's a slightly simpler method. The [<- replacement function can be used as Map()'s function argument. It takes three arguments, in order: the object, the index, and the replacement value.
Map("[<-", l_zero, l_ind, l_val)
# [[1]]
# [,1] [,2]
# [1,] 5 0
# [2,] 0 0
#
# [[2]]
# [,1] [,2]
# [1,] 4 7
# [2,] 0 0
You need to return the modified value from your mapped function (see return(l_zero) below).
l_zero <- replicate(2,matrix(0,2,2),simplify=FALSE)
l_ind <- list(matrix(c(1,1), ncol = 2), matrix(c(1,1,1,2), ncol = 2))
l_val <- list(5, c(4, 7))
ff <- function(l_zero, l_ind, l_val) {
l_zero[l_ind] <- l_val
return(l_zero)
}
Map(ff, l_zero = l_zero, l_ind = l_ind, l_val = l_val)
Results:
## [[1]]
## [,1] [,2]
## [1,] 5 0
## [2,] 0 0
##
## [[2]]
## [,1] [,2]
## [1,] 4 7
## [2,] 0 0
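The reason the original attempt loses the matrix is that an assignment expression in R evaluates to the assigned value, so a function whose last expression is x[i] <- v returns v rather than the modified x. A small sketch of the difference, with hypothetical function names f_bad and f_good:

```r
z   <- matrix(0, 2, 2)
ind <- matrix(c(1, 1, 1, 2), ncol = 2)   # rows are (row, col) pairs: (1,1), (1,2)
val <- c(4, 7)

# Last expression is the assignment, so the value of `v` comes back
f_bad <- function(x, i, v) x[i] <- v

# Returning x explicitly gives the modified matrix
f_good <- function(x, i, v) {
  x[i] <- v
  x
}

bad  <- f_bad(z, ind, val)     # the replacement values only
good <- f_good(z, ind, val)    # the 2 x 2 matrix with 4 and 7 in row 1
same <- `[<-`(z, ind, val)     # the replacement function returns the object too
```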
I'm looking for a (built-in) function that efficiently returns the list of building blocks of a block-diagonal matrix in the following way (rather than iterating over the slots to get the list manually):
#construct bdiag-matrix
library("Matrix")
listElems <- list(matrix(1:4,ncol=2,nrow=2),matrix(5:8,ncol=2,nrow=2))
mat <- bdiag(listElems)
#get back the list
res <- theFunctionImLookingFor(mat)
The result res yields the building blocks:
[[1]]
[,1] [,2]
[1,] 1 3
[2,] 2 4
[[2]]
[,1] [,2]
[1,] 5 7
[2,] 6 8
Edit: Regarding my use case, the list elements in listElems are square and symmetric matrices. If the block is a diagonal matrix, theFunctionImLookingFor should return a list element for each diagonal element.
However, the function should be able to deal with building block matrices like
[,1] [,2] [,3]
[1,] 1 1 0
[2,] 1 1 1
[3,] 0 1 1
or
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 1
[3,] 1 1 1
i.e. deal with zeros in blocks, which are not diagonal matrices.
I hope this will work for all your cases; the test at the bottom includes a block that contains zeroes.
theFunctionImLookingFor <- function(mat, plot.graph = FALSE) {
stopifnot(nrow(mat) == ncol(mat))
x <- mat
diag(x) <- 1
edges <- as.matrix(summary(x)[c("i", "j")])
library(igraph)
g <- graph.edgelist(edges, directed = FALSE)
if (plot.graph) plot(g)
groups <- unique(Map(sort, neighborhood(g, nrow(mat))))
sub.Mat <- Map(`[`, list(mat), groups, groups, drop = FALSE)
sub.mat <- Map(as.matrix, sub.Mat)
return(sub.mat)
}
listElems <- list(matrix(1:4,ncol=2,nrow=2),
matrix(5:8,ncol=2,nrow=2),
matrix(c(0, 1, 0, 0, 0, 1, 0, 0, 1),ncol=3,nrow=3),
matrix(1:1,ncol=1, nrow=1))
mat <- bdiag(listElems)
theFunctionImLookingFor(mat, plot.graph = TRUE)
# [[1]]
# [,1] [,2]
# [1,] 1 3
# [2,] 2 4
# [[2]]
# [,1] [,2]
# [1,] 5 7
# [2,] 6 8
# [[3]]
# [,1] [,2] [,3]
# [1,] 0 0 0
# [2,] 1 0 0
# [3,] 0 1 1
# [[4]]
# [,1]
# [1,] 1
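For comparison, here is a base R sketch that avoids the graph package. It is not the method used above: it assumes a dense matrix whose blocks are contiguous along the diagonal, and it places a cut after index k whenever everything above-right and below-left of that point is zero. The names blockDiag and splitBlocks are hypothetical.

```r
# Hypothetical helper: build a dense block-diagonal matrix from square blocks
blockDiag <- function(blocks) {
  n <- sum(vapply(blocks, nrow, integer(1)))
  out <- matrix(0, n, n)
  off <- 0
  for (b in blocks) {
    idx <- off + seq_len(nrow(b))
    out[idx, idx] <- b
    off <- off + nrow(b)
  }
  out
}

# Hypothetical helper: recover the contiguous diagonal blocks
splitBlocks <- function(mat) {
  n <- nrow(mat)
  stopifnot(n == ncol(mat))
  nz <- mat != 0
  # A cut after k is valid if both off-diagonal rectangles are entirely zero
  isCut <- vapply(seq_len(n - 1), function(k) {
    !any(nz[seq_len(k), (k + 1):n]) && !any(nz[(k + 1):n, seq_len(k)])
  }, logical(1))
  ends   <- c(which(isCut), n)
  starts <- c(1, head(ends, -1) + 1)
  Map(function(s, e) mat[s:e, s:e, drop = FALSE], starts, ends)
}

listElems <- list(matrix(1:4, ncol = 2, nrow = 2),
                  matrix(5:8, ncol = 2, nrow = 2),
                  matrix(c(0, 1, 0, 0, 0, 1, 0, 0, 1), ncol = 3, nrow = 3),
                  matrix(1, ncol = 1, nrow = 1))
blocks <- splitBlocks(blockDiag(listElems))
```

With this rule a block that is itself a diagonal matrix is split into one element per diagonal entry, which is the behaviour requested in the question's edit.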
A few questions for which the R language might have elegant solutions.
Given a matrix m containing binary values 1 and 0, and a vector v of column indices:
How would I write a function to extract all the rows in m that have
the value 1 in each of the columns indexed by the integers in v?
As an extra feature, how would one return the row indices along with
the corresponding rows?
It is probably best if I illustrate with an example.
Assume the logic I'm asking for resides in a function selectByIndices(matrix, indexVector).
So if we have the matrix (or perhaps the equivalent data frame):
>(m= matrix(c( 1, 0, 1, 1, 1,0, 1, 1, 0, 1,1, 0, 1, 1, 0,1, 1, 1,
0, 1,0, 1, 0, 0, 1), 5))
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
[2,] 0 1 0 1 1
[3,] 1 1 1 1 0
[4,] 1 0 1 0 0
[5,] 1 1 0 1 1
and index vectors:
>c1 = c(1,3,4)
>c2 = c(4,5)
>c3 = c(1,3,5)
The function would behave something like this:
>selectByIndices( m, c1)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
[3,] 1 1 1 1 0
>selectByIndices( m, c2)
[,1] [,2] [,3] [,4] [,5]
[2,] 0 1 0 1 1
[5,] 1 1 0 1 1
>selectByIndices( m, c3)
#no rows (i.e. empty collection) returned
Hoping it's clear enough, thanks in advance for your help.
## Create a function that extracts the qualifying rows
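f <- function(m, j) {
  # drop = FALSE keeps the matrix shape when j has a single index
  # or when only one row qualifies
  m[rowSums(m[, j, drop = FALSE]) == length(j), , drop = FALSE]
  # m[apply(m[, j, drop = FALSE], 1, function(X) all(X==1)), , drop = FALSE] ## This would also work
  # which(rowSums(m[, j, drop = FALSE]) == length(j)) ## & this would get row indices
}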
## Try it out
f(m, c1)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 0 1 1 0
# [2,] 1 1 1 1 0
f(m, c2)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 1 0 1 1
# [2,] 1 1 0 1 1
> selectRows <- function(mat, rown) suppressWarnings(mat[apply( mat[, rown], 1, all) , ])
> selectRows(m, c1)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
[2,] 1 1 1 1 0
> whichRows <-function(mat, rown) suppressWarnings( which( apply( mat[, rown], 1, all) ) )
> whichRows(m, c1)
[1] 1 3
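To return the row indices together with the rows in one call, as asked in the question, the two pieces can be bundled in a list. This is a sketch under the assumption that m contains only 0 and 1; the list element names index and rows are chosen for illustration.

```r
m <- matrix(c(1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
              0, 1, 0, 1, 0, 0, 1), 5)

selectByIndices <- function(m, v) {
  # TRUE for rows that have a 1 in every column listed in v
  hit <- rowSums(m[, v, drop = FALSE] == 1) == length(v)
  list(index = which(hit),               # original row numbers
       rows  = m[hit, , drop = FALSE])   # matching rows, always a matrix
}

r1 <- selectByIndices(m, c(1, 3, 4))
r2 <- selectByIndices(m, c(4, 5))
r3 <- selectByIndices(m, c(1, 3, 5))     # no row qualifies: 0-row matrix
```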