Subset selection from binary matrix with dynamic column indices - r

A question for which the R language might have an elegant solution.
Given a matrix m containing the binary values 1 and 0, and a vector v of column indices,
how would I write a function to extract all the rows in m that have
the value of 1 in each of the columns indexed by the integers in v?
As an extra feature, how would one return the row indices along with
the corresponding rows?
It is probably best if I illustrate with an example.
Assume the logic I'm asking for resides in a function selectByIndices(matrix, indexVector).
So if we have the matrix (or perhaps the equivalent data frame):
>(m= matrix(c( 1, 0, 1, 1, 1,0, 1, 1, 0, 1,1, 0, 1, 1, 0,1, 1, 1,
0, 1,0, 1, 0, 0, 1), 5))
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
[2,] 0 1 0 1 1
[3,] 1 1 1 1 0
[4,] 1 0 1 0 0
[5,] 1 1 0 1 1
and index vectors:
>c1 = c(1,3,4)
>c2 = c(4,5)
>c3 = c(1,3,5)
The function would behave something like this:
>selectByIndices( m, c1)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
[3,] 1 1 1 1 0
>selectByIndices( m, c2)
[,1] [,2] [,3] [,4] [,5]
[2,] 0 1 0 1 1
[5,] 1 1 0 1 1
>selectByIndices( m, c3)
#no rows (i.e. empty collection) returned
Hoping it's clear enough, thanks in advance for your help.

## Create a function that extracts the qualifying rows
f <- function(m, j) {
  # drop = FALSE keeps m[, j] a matrix when j has length 1, so rowSums() still works
  m[rowSums(m[, j, drop = FALSE]) == length(j), ]
  # m[apply(m[, j], 1, function(X) all(X==1)),]   ## This would also work
  # which(rowSums(m[, j]) == length(j))           ## & this would get row indices
}
## Try it out
f(m, c1)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 0 1 1 0
# [2,] 1 1 1 1 0
f(m, c2)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 1 0 1 1
# [2,] 1 1 0 1 1
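For the extra feature in the question (row indices together with the rows), here is a minimal sketch along the same lines. The name select_by_indices and the list fields rows and values are my own choices, not anything from the answers here. drop = FALSE is used so that a single column index or a single matching row still yields a matrix.

```r
# Matrix from the question
m <- matrix(c(1, 0, 1, 1, 1,  0, 1, 1, 0, 1,  1, 0, 1, 1, 0,
              1, 1, 1, 0, 1,  0, 1, 0, 0, 1), 5)

# Hypothetical helper: returns the matching row indices and the rows themselves
select_by_indices <- function(m, j) {
  # TRUE for rows that have a 1 in every column listed in j
  keep <- rowSums(m[, j, drop = FALSE] == 1) == length(j)
  idx <- which(keep)
  list(rows = idx, values = m[idx, , drop = FALSE])
}

res <- select_by_indices(m, c(1, 3, 4))
res$rows    # rows 1 and 3
res$values  # the corresponding rows of m
```

With c(1, 3, 5) the values element is a matrix with zero rows rather than an error.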

> selectRows <- function(mat, rown) suppressWarnings(mat[apply( mat[, rown], 1, all) , ])
> selectRows(m, c1)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 0
[2,] 1 1 1 1 0
> whichRows <-function(mat, rown) suppressWarnings( which( apply( mat[, rown], 1, all) ) )
> whichRows(m, c1)
[1] 1 3

Related

Conditional combinations matrixes in r

How can I create all binary combinations of matrices with the condition that there can only be a single 1 per column and row. The example will clarify. This particular example must have 6 matrices of combinations, I am showing only the first 2.
c1 <- matrix(c(1, 0, 0, 0, 1, 0, 0, 0, 1), nrow = 3) #First combination
c2 <- matrix(c(0, 1, 0, 1, 0, 0, 0, 0, 1), nrow = 3) #Second combination
What you are asking for is equivalent to finding all permutations of length = n where n = nrow(c1) (or c2 above). Using the FUN argument of permuteGeneral from RcppAlgos (I am the author), we can easily generate the desired outcome:
n <- 3L
myIdentity <- diag(nrow = n)
library(RcppAlgos)
permuteGeneral(n, n, FUN = function(x) myIdentity[x, ])
[[1]]
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
[[2]]
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 0 1
[3,] 0 1 0
[[3]]
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 0
[3,] 0 0 1
[[4]]
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 0 1
[3,] 1 0 0
[[5]]
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 1 0 0
[3,] 0 1 0
[[6]]
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 1 0
[3,] 1 0 0
There are many other ways of generating the requested output. Most notably, utilizing the tried and true combinat package, we can get a similar result (the output will be in a different order):
combinat::permn(3, fun = function(x) myIdentity[x, ])
Now that we have reduced the problem to simply generating permutations, we can use any of the great packages (arrangements, gtools, multicool, partitions, etc.) for generating permutations to obtain our desired result with the help of lapply:
library(arrangements)
myPerms <- permutations(n)
lapply(1:nrow(myPerms), function(x) myIdentity[myPerms[x,], ])
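If you would rather not depend on a package, the same idea (permute the rows of an identity matrix) can be sketched in base R. The recursive helper perms below is my own illustration, not part of any of the packages mentioned, and it is only sensible for small n since the number of permutations grows as n!.

```r
# Hypothetical helper: all permutations of a vector, as a list
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # Fix v[i] in front, then permute the remaining elements
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

n <- 3L
id <- diag(n)
# Reordering the rows of the identity matrix gives one permutation matrix each
mats <- lapply(perms(seq_len(n)), function(p) id[p, , drop = FALSE])
length(mats)  # 6 matrices for n = 3
```

Each element of mats has exactly one 1 in every row and every column, which is the condition asked for.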

create a X % probability matrix from list of matrices

I have a list of matrices (some hundred thousand). I want to create a single matrix in which each cell summarizes a threshold of, e.g., 95%. By that I mean: if e.g. cell mat[1,2] is positive (i.e. >0) in 95% of the matrices it is scored a 1, and if e.g. cell mat[2,1] is negative (i.e. <0) in 95% of the matrices it is scored a -1. Cells that fall below this threshold are scored a 0.
#Dummy data
listX <- list()
for(i in 1:10){listX[[i]]<-matrix(rnorm(n = 25, mean = 0.5, sd = 1),5,5)}
listX2 <- listX
for(i in 1:10) { listX2[[i]] <- ifelse(listX[[i]] >0, 1, -1) }
For the sake of the dummy data, the 95% can be changed to say 60%, such that the cells that keep their sign in 6 out of 10 matrices are kept and scored either 1 or -1 and the rest 0.
I'm stuck on this, hence cannot provide any more code.
I would do:
listX <- list()
set.seed(20)
# I set a seed for reproducibility, and changed your mean so you could see the negatives
for(i in 1:10){listX[[i]]<-matrix(rnorm(n = 25, mean = 0, sd = 1),5,5)}
threshold <- 0.7
(Reduce('+',lapply(listX,function(x){x > 0}))/length(listX) >= threshold) - (Reduce('+',lapply(listX,function(x){x < 0}))/length(listX) >= threshold)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 -1 1
[2,] -1 1 -1 -1 1
[3,] 0 0 0 1 1
[4,] 0 1 0 0 0
[5,] 0 0 0 0 0
This basically checks both conditions and subtracts the second check from the first. To break down one of the conditions, (Reduce('+',lapply(listX,function(x){x > 0}))/length(listX) >= threshold):
lapply(listX,function(x){x > 0}) loops through each matrix and converts it to a matrix of TRUE/FALSE values marking every value that is above zero.
Reduce('+',lapply(listX,function(x){x > 0}))/length(listX) then adds these all together (Reduce), and divides by the number of observations. If the proportion is at least our threshold, we set that value to one, and if not it is zero.
We then subtract the same matrix with x < 0 as the test, which gives -1 in each case where enough sub-values are negative.
You can change the list to an array and then take the mean over the dimensions.
arr <- simplify2array(listX)
grzero = rowMeans(arr > 0, dims = 2)
lezero = rowMeans(arr < 0, dims = 2)
prop = 0.6
1* (grzero >= prop) + -1* (lezero >= prop)
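The array approach can be wrapped in a small function for reuse. This is only a sketch: the name sign_consensus and the tiny hand-checkable input list are my own, and it assumes every matrix in the list has the same dimensions and that a proportion equal to the threshold counts as a hit (>=).

```r
# Hypothetical wrapper around the rowMeans() idea
sign_consensus <- function(matrix_list, prop) {
  arr <- simplify2array(matrix_list)   # rows x cols x number of matrices
  pos <- rowMeans(arr > 0, dims = 2)   # share of matrices with a positive cell
  neg <- rowMeans(arr < 0, dims = 2)   # share of matrices with a negative cell
  (pos >= prop) - (neg >= prop)        # 1, -1 or 0 per cell
}

# Small illustrative input: three 2 x 2 matrices
small_list <- list(
  matrix(c(1, 0, -1, 1), nrow = 2),
  matrix(c(1, 0, -1, 1), nrow = 2),
  matrix(c(1, 0,  1, 0), nrow = 2)
)

sign_consensus(small_list, 0.5)
sign_consensus(small_list, 1)
```

With prop = 0.5 the cell [1,2] is scored -1 (negative in two of three matrices); with prop = 1 only cell [1,1] keeps a non-zero score.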
Test case showing which answers work so far! (edit)
Below you'll find my original answer. It ended up producing comparable results to the other answers on test cases involving randomly seeded data. To triple check, I created a small test data set with a known answer. It turns out that only the answer by @Chris passes right now (though @user20650's should be OK if using >= on this example, as indicated in the comments). Here it is in case anyone else wants to use it:
listX <- list(
matrix(c(1,0,-1,1), nrow = 2),
matrix(c(1,0,-1,1), nrow = 2),
matrix(c(1,0, 1,0), nrow = 2)
)
# With any threshold < .67,
# result should be...
matrix(c(1, 0, -1, 1), nrow = 2)
#> [,1] [,2]
#> [1,] 1 -1
#> [2,] 0 1
# Otherwise...
matrix(c(1, 0, 0, 0), nrow = 2)
#> [,1] [,2]
#> [1,] 1 0
#> [2,] 0 0
# @Chris answer passes
threshold <- 0.5
(Reduce('+',lapply(listX,function(x){x > 0}))/length(listX) >= threshold) - (Reduce('+',lapply(listX,function(x){x < 0}))/length(listX) >= threshold)
#> [,1] [,2]
#> [1,] 1 -1
#> [2,] 0 1
threshold <- 1.0
(Reduce('+',lapply(listX,function(x){x > 0}))/length(listX) >= threshold) - (Reduce('+',lapply(listX,function(x){x < 0}))/length(listX) >= threshold)
#> [,1] [,2]
#> [1,] 1 0
#> [2,] 0 0
# My function fails...
prob_matrix(listX, .5)
#> [,1] [,2]
#> [1,] 1 -1
#> [2,] 0 1
prob_matrix(listX, 1)
#> [,1] [,2]
#> [1,] 1 0
#> [2,] 0 1
# @user20650 answer fails...
arr <- simplify2array(listX)
grzero = rowSums(arr > 0, dims = 2) / length(listX)
lezero = rowSums(arr < 0, dims = 2) / length(listX)
prop = 0.5
1* (grzero > prop) + -1* (lezero > prop)
#> [,1] [,2]
#> [1,] 1 -1
#> [2,] 0 1
arr <- simplify2array(listX)
grzero = rowSums(arr > 0, dims = 2) / length(listX)
lezero = rowSums(arr < 0, dims = 2) / length(listX)
prop = 1.0
1* (grzero > prop) + -1* (lezero > prop)
#> [,1] [,2]
#> [1,] 0 0
#> [2,] 0 0
Original answer
Here's one approach...
Combine sign and Reduce to sum the signs of the values in each cell across the list, returning a single matrix.
Any cell where the absolute value of this sum is less than the threshold number (derived from your probability and the number of matrices in the list) is converted to 0.
Return the sign() of all cells.
Below is an example with a wrapper function:
Toy data...
set.seed(12)
listX <- list()
for(i in 1:10){listX[[i]]<-matrix(rnorm(n = 25, mean = 0, sd = 1), 5, 5)}
Function...
prob_matrix <- function(matrix_list, prob) {
  # Sum the signs of values in each cell
  matrix_list <- lapply(matrix_list, sign)
  x <- Reduce(`+`, matrix_list)
  # Convert cells below prob to 0, others to relevant sign
  x[abs(x) < (prob * length(matrix_list)) / 2] <- 0
  sign(x)
}
Example cases...
prob_matrix(listX, .2)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] -1 1 0 1 0
#> [2,] -1 0 -1 -1 0
#> [3,] 1 -1 1 1 1
#> [4,] 0 -1 1 1 -1
#> [5,] -1 0 -1 0 -1
prob_matrix(listX, .4)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] -1 1 0 1 0
#> [2,] -1 0 -1 -1 0
#> [3,] 1 -1 1 1 1
#> [4,] 0 -1 1 1 -1
#> [5,] -1 0 -1 0 -1
prob_matrix(listX, .6)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0 1 0 1 0
#> [2,] -1 0 0 -1 0
#> [3,] 1 -1 0 1 1
#> [4,] 0 0 0 1 -1
#> [5,] -1 0 0 0 -1
prob_matrix(listX, .8)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0 1 0 1 0
#> [2,] -1 0 0 -1 0
#> [3,] 1 -1 0 1 1
#> [4,] 0 0 0 1 -1
#> [5,] -1 0 0 0 -1

Slow loop for element replacement in list of lists

I wrote a loop to conditionally replace elements in a list of lists. While still relatively new to R, I feel certain that I am not going about this as efficiently as possible. The following loop runs very slowly (an hour or so) on my actual data. I've included a minimal working example below that exactly replicates the structure of my data.
A <- matrix(c(0, 1, 1, 2, 0, 0, 1, 0, 1, 2, 0, 0), nrow = 2, ncol = 6, byrow = TRUE)
B <- matrix(c(1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 0), nrow = 2, ncol = 6, byrow = TRUE)
C <- matrix(c(1, 0, 0, 1, 0, 1), nrow = 1, ncol = 6, byrow = TRUE)
D <- matrix(c(0, 0, 0, 1, 1, 1), nrow = 1, ncol = 6, byrow = TRUE)
mList <-list(list(A, B))
dList <- list(list(C, D))
The goal of the loop is to replace all cells in column j of the nth item of mList with 0 if element j of the nth item of dList equals 0.
mList
# [[1]]
# [[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 1 1 2 0 0
# [2,] 1 0 1 2 0 0
#
# [[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 1 1 2 0 1
# [2,] 1 0 1 2 0 0
dList
# [[1]]
# [[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 0 0 1 0 1
#
# [[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 1 1 1
Is this another instance of unrealized gains from using the set of apply functions? Is there some better way of doing this that doesn't involve using four indices?
for(i in 1:length(dList)) {
  for(j in 1:length(dList[[i]])) {
    for(k in 1:length(dList[[i]][[j]])) {
      for(m in 1:nrow(mList[[i]][[j]])) {
        mList[[i]][[j]][m, k] <-
          ifelse(
            dList[[i]][[j]][k] == 1,
            mList[[i]][[j]][m, k],
            0
          )
      }
    }
  }
}
Resulting in:
mList
# [[1]]
# [[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 2 0 0
# [2,] 1 0 0 2 0 0
# [[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 2 0 1
# [2,] 0 0 0 2 0 0
I would use a nested lapply to loop through the nested structure of the input lists, recomputing the relevant entry in mList using a single vectorized operation instead of a loop through the columns and rows:
lapply(seq_along(dList), function(i) {
  lapply(seq_along(dList[[i]]), function(j) {
    t(t(mList[[i]][[j]]) * as.vector(dList[[i]][[j]] != 0))
  })
})
# [[1]]
# [[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 2 0 0
# [2,] 1 0 0 2 0 0
#
# [[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0 0 0 2 0 1
# [2,] 0 0 0 2 0 0
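To see why the double transpose zeroes whole columns: R recycles a vector down the rows of a matrix, so multiplying t(M) by a mask of length ncol(M) scales each original column by one mask element. The small check below is a sketch using the first matrix and first mask from the question; the variable names are mine, and sweep() is shown only as an equivalent base R spelling.

```r
# First matrix and first mask from the question's example
M <- matrix(c(0, 1, 1, 2, 0, 0,
              1, 0, 1, 2, 0, 0), nrow = 2, byrow = TRUE)
d <- matrix(c(1, 0, 0, 1, 0, 1), nrow = 1)

mask <- as.vector(d != 0)             # TRUE for columns to keep

via_transpose <- t(t(M) * mask)       # mask recycled along the columns of M
via_sweep <- sweep(M, 2, mask, `*`)   # same operation, stated explicitly

all(via_transpose == via_sweep)       # both spellings agree
```

Either form touches each matrix once instead of once per cell, which is where the speed-up over the four nested loops comes from.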
Here's a benchmark on a list of the same structure, with 10 x 10000 matrices in mList. I've benchmarked the solution you provide, the solution by @thelatemail, and my solution:
set.seed(144)
A <- matrix(sample(0:2, 100000, replace=TRUE), nrow=10)
B <- matrix(sample(0:2, 100000, replace=TRUE), nrow=10)
C <- matrix(sample(0:1, 10000, replace=TRUE), nrow=1)
D <- matrix(sample(0:1, 10000, replace=TRUE), nrow=1)
mList <-list(list(A, B))
dList <- list(list(C, D))
OP <- function(mList, dList) {
  for(i in 1:length(dList)) {
    for(j in 1:length(dList[[i]])) {
      for(k in 1:ncol(dList[[i]][[j]])) {
        for(m in 1:nrow(mList[[i]][[j]])) {
          mList[[i]][[j]][m, k] <-
            ifelse(
              dList[[i]][[j]][k] == 1,
              mList[[i]][[j]][m, k],
              0
            )
        }
      }
    }
  }
  mList
}
josilber <- function(mList, dList) {
  lapply(seq_along(dList), function(i) {
    lapply(seq_along(dList[[i]]), function(j) {
      t(t(mList[[i]][[j]]) * as.vector(dList[[i]][[j]] != 0))
    })
  })
}
thelatemail <- function(mList, dList) {
  Map(
    function(L,s) Map(function(sL,ss) {sL[,ss] <- 0; sL}, L, s),
    mList,
    lapply(dList, function(x) lapply(x, function(y) y==0) )
  )
}
library(microbenchmark)
microbenchmark(OP(mList, dList), josilber(mList, dList), thelatemail(mList, dList), times=10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# OP(mList, dList) 12252.468288 13318.745019 13478.116388 13486.732412 13840.106332 14259.053497 10
# josilber(mList, dList) 2.299442 2.401806 2.561809 2.480822 2.552620 3.511609 10
# thelatemail(mList, dList) 4.259594 4.438562 4.683855 4.612297 5.002605 5.122605 10
Both solutions run more than 1000 times faster, mainly because they're not tightly looping through the matrices but instead performing the operation in a vectorized manner.
It's not pretty, but you essentially always need to double loop with a nested list structure.
Map(
  function(L,s) Map(function(sL,ss) {sL[,ss] <- 0; sL}, L, s),
  mList,
  lapply(dList, function(x) lapply(x, function(y) y==0) )
)
#[[1]]
#[[1]][[1]]
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 0 0 2 0 0
#[2,] 1 0 0 2 0 0
#
#[[1]][[2]]
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 0 0 2 0 1
#[2,] 0 0 0 2 0 0
I've just replaced all the for() loops with a nested lapply to generate the lists of columns to overwrite, then used a nested Map to replace the columns within each list.

R index matrix with vector / create index matrix from index vector

I am looking for an easier way to do the following:
m <- matrix(0, nrow=3, 3)
v <- c(1, 3, 2)
for (i in 1:nrow(m)) {
m[[i, v[i]]] = 1
}
The above code creates the following index matrix:
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 0 1
[3,] 0 1 0
There surely must be a better way to do this?!
One way to do it without pre-defining the matrix would be to use outer:
num.col <- 3
outer(v, seq_len(num.col), "==") * 1
# [,1] [,2] [,3]
# [1,] 1 0 0
# [2,] 0 0 1
# [3,] 0 1 0
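Two other base R spellings may be worth knowing; both are sketches that assume v holds valid column numbers between 1 and the number of columns. The first uses matrix indexing, where a two-column matrix of (row, column) pairs picks out individual cells; the second reorders the rows of an identity matrix.

```r
v <- c(1, 3, 2)

# Matrix indexing: each row of cbind(...) is one (row, column) pair to set
m <- matrix(0, nrow = length(v), ncol = 3)
m[cbind(seq_along(v), v)] <- 1

# Alternative: take row v[i] of the identity matrix as row i
m2 <- diag(3)[v, ]

all(m == m2)  # both give the same index matrix
```

The cbind form also works when the matrix already exists and only some cells should be switched on.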

splitting list elements expanding the list

I'm doing some kind of optical character recognition and face the following issue. I store the glyphs in a list of binary matrices and they can be of different size, but their maximum possible width is wid = 3 columns (may be any defined constant, not just 3). In some cases after the first stage of processing I get data which look like this:
myll <- list(matrix(c(0, 0, 0, 1, 1, 0), ncol = 2),
matrix(c(0), ncol = 1),
matrix(c(1, 1, 0), ncol = 3),
matrix(c(1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1), ncol = 7),
matrix(c(1, 1, 1, 1), ncol = 2))
# [[1]]
# [,1] [,2]
# [1,] 0 1
# [2,] 0 1
# [3,] 0 0
#
# [[2]]
# [,1]
# [1,] 0
#
# [[3]]
# [,1] [,2] [,3]
# [1,] 1 1 0
#
# [[4]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 1 1 1 0 0 0 1
# [2,] 0 1 0 1 0 0 1
# [3,] 1 1 1 1 0 0 1
#
# [[5]]
# [,1] [,2]
# [1,] 1 1
# [2,] 1 1
So some glyphs may not be separated, for whatever reason. This happens only with glyphs of maximum possible width. Moreover, there may be some junk at the end of the matrix. I have to split them into matrices of width ncol = wid, leaving the last piece (junk) as is. Then I store these matrices in separate elements of a list to get the following output:
# [[1]]
# [,1] [,2]
# [1,] 0 1
# [2,] 0 1
# [3,] 0 0
#
# [[2]]
# [,1]
# [1,] 0
#
# [[3]]
# [,1] [,2] [,3]
# [1,] 1 1 0
#
# [[4]]
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 0 1 0
# [3,] 1 1 1
#
# [[5]]
# [,1] [,2] [,3]
# [1,] 0 0 0
# [2,] 1 0 0
# [3,] 1 0 0
#
# [[6]]
# [,1]
# [1,] 1
# [2,] 1
# [3,] 1
#
# [[7]]
# [,1] [,2]
# [1,] 1 1
# [2,] 1 1
At the moment I can do this with the help of these functions:
checkGlyphs <- function(gl_m, wid = 3) {
  if (ncol(gl_m) > wid)
    return(list(gl_m[,1:wid], matrix(gl_m[,-(1:wid)], nrow = nrow(gl_m)))) else
      return(gl_m)
}
separateGlyphs <- function(myll, wid = 3) {
  require("magrittr")
  presplit <- lapply(myll, checkGlyphs, wid)
  total_new_length <-
    presplit[unlist(lapply(presplit, is.list))] %>% lapply(length) %>% unlist() %>% sum() +
    as.integer(!unlist(lapply(presplit, is.list))) %>% sum()
  splitted <- vector("list", length = total_new_length)
  spl_index <- 1
  for (i in 1:length(presplit))
  {
    if (!is.list(presplit[[i]]))
    {
      splitted[[spl_index]] <- presplit[[i]]
      spl_index <- spl_index + 1
    } else
    {
      for (j in 1:length(presplit[[i]]))
      {
        splitted[[spl_index]] <- presplit[[i]][[j]]
        spl_index <- spl_index + 1
      }
    }
  }
  if (any(lapply(splitted, ncol) > wid)) return(separateGlyphs(splitted, wid)) else
    return(splitted)
}
But I believe there is a faster and more convenient way to achieve the same result (without for loops, this looped reassignment of elements, and then recursion if needed O_o).
I will be thankful for any suggestions on this point or, alternatively, for recommendations of OCR packages for R.
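This should do the trick when every matrix in the list has the same number of rows (cbind requires matching row counts), with the values in final being what you're after. Note that it cuts the combined matrix into pieces two columns wide.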
combined <- do.call(cbind, lapply(myll, unlist))
idx <- seq(1, ncol(combined), 2)
final <- do.call(list, lapply(idx, function(x) combined[, x:(x+1)]))
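The glyphs in the question have different row counts, so here is an alternative sketch that chunks each matrix on its own and then flattens the result by one level. The names split_glyph and separate_glyphs are my own; the sketch assumes every element is a matrix with at least one column, and that a trailing piece narrower than wid should simply be kept as is.

```r
# Example data from the question
myll <- list(matrix(c(0, 0, 0, 1, 1, 0), ncol = 2),
             matrix(c(0), ncol = 1),
             matrix(c(1, 1, 0), ncol = 3),
             matrix(c(1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
                      0, 0, 0, 0, 0, 0, 1, 1, 1), ncol = 7),
             matrix(c(1, 1, 1, 1), ncol = 2))

# Hypothetical helper: cut one glyph into pieces at most wid columns wide
split_glyph <- function(g, wid = 3) {
  starts <- seq(1, ncol(g), by = wid)
  lapply(starts, function(s) {
    g[, s:min(s + wid - 1, ncol(g)), drop = FALSE]  # drop = FALSE keeps a matrix
  })
}

# Apply to every glyph, then flatten the list of lists by one level
separate_glyphs <- function(glyphs, wid = 3) {
  unlist(lapply(glyphs, split_glyph, wid = wid), recursive = FALSE)
}

res <- separate_glyphs(myll)
length(res)        # 7 pieces
sapply(res, ncol)  # widths 2 1 3 3 3 1 2
```

No recursion or pre-computed output length is needed, because seq() already produces every chunk start for a matrix of any width.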
