R: remove duplicated elements between two sparse matrices

We have a sparse matrix created with the Matrix library:
library(Matrix)
M = sparseMatrix(i = uidx,j = midx,x = freq)
Suppose the matrix M looks like:
i j x
1 2 0.2
1 3 0.3
1 15 0.15
2 7 0.1
...
280 2 0.6
281 7 0.25
After some calculation we get another sparse matrix Q:
i j x
1 2 18
1 4 16
1 9 8
2 10 19
...
I want to use Q as the base matrix and remove from Q those (i,j) entries that already exist in M,
something like a set difference:
Q-M
In my example this gives a result like:
i j x
1 4 16
1 9 8
...
# Q originally contains 1 2 18, but an entry with the same index (1,2) already exists in M (1 2 0.2), so that row is removed from Q.
Is there an efficient way or an existing function to do this?
To reproduce this case you can run the following code:
library(Matrix)
M = sparseMatrix(i = c(1,1,1),j = c(2,3,15),x = c(0.2,0.3,0.15))
Q = sparseMatrix(i = c(1,1,1),j = c(2,4,9),x = c(18,16,8))
# the result should be a sparse matrix like:
#R = sparseMatrix(i = c(1,1),j = c(4,9),x = c(16,8))

You can do this with the summary() function, which is available once the Matrix package is loaded. It returns the non-zero entries of a sparse matrix as (i, j, x) triplets in a data-frame-like object, so you can compare the (i, j) index pairs of the two matrices and keep only the records of Q whose index pair does not occur in M. I expanded the example a bit to check that other values are kept or removed as expected. The result matches your expected R matrix.
library(Matrix)
M = sparseMatrix(i = c(1,1,1,1, 2, 2),
j = c(2,3,15, 16, 4, 8),
x = c(0.2,0.3,0.15, 0.16, 0.2, 0.08))
Q = sparseMatrix(i = c(1,1,1,1, 2),
j = c(2,4,9,16, 4),
x = c(18,16,8,50, 40))
#result should produce a sparse matrix like:
R = sparseMatrix(i = c(1,1),
j = c(4,9),
x = c(16, 8))
# creates a summary of the sparse matrices (summary is coming from Matrix)
summary_m <- summary(M)
summary_q <- summary(Q)
# which records to remove:
# all records of Q whose (i, j) index pair also occurs in M (x is ignored).
# Match on the index pair itself rather than row by row, because the two
# summaries generally differ in number of rows and in ordering.
remove <- paste(summary_q$i, summary_q$j) %in% paste(summary_m$i, summary_m$j)
# keep all Q records whose index pair does not occur in M
q_left <- summary_q[!remove, ]
# build full sparse matrix
result <- sparseMatrix(i = q_left$i, j = q_left$j, x = q_left$x)
identical(result, R)
[1] TRUE
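If the result should keep the dimensions of Q (the call above lets sparseMatrix() infer them from the largest remaining indices), the same idea can be wrapped in a small helper. This is only a minimal sketch: sparse_index_setdiff is a hypothetical name, not a function from Matrix, and it assumes both inputs are numeric sparse matrices so that summary() returns an x column.

```r
library(Matrix)

# Hypothetical helper (not part of Matrix): drop from Q every stored entry
# whose (i, j) index pair also occurs among the stored entries of M.
sparse_index_setdiff <- function(Q, M) {
  sq <- summary(Q)
  sm <- summary(M)
  # Encode each (i, j) pair as one number; n_row is only a radix that is
  # large enough for both matrices. as.numeric() avoids integer overflow.
  n_row <- max(nrow(Q), nrow(M))
  key_q <- (as.numeric(sq$j) - 1) * n_row + sq$i
  key_m <- (as.numeric(sm$j) - 1) * n_row + sm$i
  keep <- !(key_q %in% key_m)
  # Rebuild with the original dimensions of Q
  sparseMatrix(i = sq$i[keep], j = sq$j[keep], x = sq$x[keep], dims = dim(Q))
}

M <- sparseMatrix(i = c(1, 1, 1), j = c(2, 3, 15), x = c(0.2, 0.3, 0.15))
Q <- sparseMatrix(i = c(1, 1, 1), j = c(2, 4, 9), x = c(18, 16, 8))
R <- sparse_index_setdiff(Q, M)
summary(R)
```

Numeric keys are used here instead of pasted strings only because they tend to be cheaper to build and match for many entries; either encoding expresses the same set difference.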

Related

How to modify non-zero elements of a large sparse matrix based on a second sparse matrix in R

I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax like mat[rn, cn] (whether mat is a dense or sparse matrix) selects all rows in rn and all columns in cn, so you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select only the elements (rn[1], cn[1]), (rn[2], cn[2]), .... The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
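The same difference applies to assignment, which is what the slow code was doing. The following minimal sketch uses a small dense matrix purely for illustration (the object names block and pairs are made up here): with n selected positions, the block form overwrites up to n x n cells, while the matrix-index form overwrites exactly n.

```r
A <- matrix(1:9, 3, 3)
rn <- 1:2
cn <- 2:3

# Block assignment: every combination of rn and cn is overwritten (2 x 2 = 4 cells).
block <- A
block[rn, cn] <- 0L

# Matrix-index assignment: only the pairs (1,2) and (2,3) are overwritten.
pairs <- A
pairs[cbind(rn, cn)] <- 0L

sum(block == 0L)  # 4
sum(pairs == 0L)  # 2
```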
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame of (i, j, x) triplets, where both i and j are indices starting from 1. You could have worked with this nice output directly, as follows:
# Getting the (i, j, x) triplets (stored as a data.frame) for `m1`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there an even better solution? Yes! Note that the nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2@x according to m1@x.
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0
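The slot-level update relies on both matrices storing their nonzero entries in the same order. Here is a small sketch, on made-up toy data, of how that assumption could be checked before touching the x slots; it also uses drop0() from Matrix to remove the explicit zeros that the assignment leaves stored in m1.

```r
library(Matrix)

set.seed(1)
# Toy data (sizes are arbitrary): both matrices share the same (i, j) positions.
i <- sample(1:20, 60, replace = TRUE)
j <- sample(1:30, 60, replace = TRUE)
m1 <- sparseMatrix(i = i, j = j, x = runif(60), dims = c(20, 30))
m2 <- sparseMatrix(i = i, j = j, x = sample(1:3, 60, replace = TRUE), dims = c(20, 30))

# For column-compressed matrices, identical row indices (i) and column
# pointers (p) mean the x slots line up element by element.
same_pattern <- identical(m1@i, m2@i) && identical(m1@p, m2@p)
stopifnot(same_pattern)

ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0

# The zeros written into m1@x are still stored explicitly; drop0() removes them.
m1 <- drop0(m1)
```

If the check fails, the slot-level shortcut does not apply and the summary()-based indexing is the safer route.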
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting the (i, j, x) triplets (stored as a data.frame) for `m1`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.

Replace a specific row depending on input in a matrix with zeros

I want to create a function which replaces a chosen row of a matrix with zeros. I try to think of the matrix as arbitrary, but for this example I have used a sample 3x3 matrix with the numbers 1-9, called a_matrix
1 4 7
2 5 8
3 6 9
I have done:
zero_row <- function(M, n){
n <- c(0,0,0)
M*n
}
And then I have set the matrix and tried to get my desired result by using my zero_row function
mat1 <- a_matrix
zero_row(M = mat1, n = 1)
zero_row(M = mat1, n = 2)
zero_row(M = mat1, n = 3)
However, right now all I get is a matrix with only zeros, and I do understand why. But if I instead change the vector n to one of the following
n <- c(0,1,1)
n <- c(1,0,1)
n <- c(1,1,0)
I get my desired result for n=1, n=2 and n=3 separately. But what I want is that, depending on which n I put in, that row is set to zero, so that I have one function that works for every n instead of having to change the vector for each n. So that I get (n=2 for example)
1 4 7
0 0 0
3 6 9
And is it better to do it in another form, instead of using vectors?
Here is a way.
zero_row <- function(M, n){
stopifnot(n <= nrow(M))
M[n, ] <- 0
M
}
A <- matrix(1:9, nrow = 3)
zero_row(A, 1)
zero_row(A, 2)
zero_row(A, 3)
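The same indexing works for several rows at once. As a small sketch (zero_rows is a hypothetical variant, not part of the answer above), the check can be extended to a vector of row numbers; note that the input matrix itself is not modified, since the function returns a changed copy.

```r
# Hypothetical variant of zero_row() that accepts a vector of row numbers.
zero_rows <- function(M, rows) {
  stopifnot(is.matrix(M), all(rows >= 1), all(rows <= nrow(M)))
  M[rows, ] <- 0
  M
}

A <- matrix(1:9, nrow = 3)
B <- zero_rows(A, c(1, 3))  # rows 1 and 3 become zero, row 2 is kept
B
A                           # unchanged
```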

Find combinations of genome without any package

I want to find how many combinations of nucleotides are found in a genome sequence. For 2-element combinations: AA, AT, AG, AC, ... (16 combinations); for 3-element combinations: ATG, ACG, ... (64 combinations). I know how to do that with a package and will show it here, but I want to write my own code to do this.
The seqinr package does this job perfectly. This is the code that I used:
install.packages('seqinr')
library(seqinr)
m = read.fasta(file='sequence.fasta')
mseq = m[[1]]
count(mseq,2) # gives how many binary combinations are found in the seq
count(mseq,3) # gives how many 3-elemented combinations are found in the seq
This is a slow way to do it. I am certain it is faster in the bioconductor package.
# some practice data
mseq = paste(sample(c("A", "C", "G", "T"), 1000, rep=T), collapse="")
# define a function called count
count = function(mseq, n){
# split the sequence into every possible sub sequence of length n
x = sapply(1:(nchar(mseq) - n + 1), function(i) substr(mseq, i, i+n-1))
# how many unique sub sequences of length n are there?
length(table(x))
}
Actually just checked and this is pretty much how they did it:
function (seq, wordsize, start = 0, by = 1, freq = FALSE, alphabet = s2c("acgt"),
frame = start)
{
if (!missing(frame))
start = frame
istarts <- seq(from = 1 + start, to = length(seq), by = by)
oligos <- seq[istarts]
oligos.levels <- levels(as.factor(words(wordsize, alphabet = alphabet)))
if (wordsize >= 2) {
for (i in 2:wordsize) {
oligos <- paste(oligos, seq[istarts + i - 1], sep = "")
}
}
counts <- table(factor(oligos, levels = oligos.levels))
if (freq == TRUE)
counts <- counts/sum(counts)
return(counts)
}
If you want to find the code for a function use getAnywhere()
getAnywhere(count)
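Combining the two ideas above (a sliding window over the sequence and a factor whose levels are all possible words), a package-free per-combination count could be sketched as follows. kmer_counts is a hypothetical name, and the sketch assumes the sequence is a single character string that is at least k characters long.

```r
# Hypothetical helper: count every overlapping length-k word in the string s,
# reporting all possible words over the alphabet (including those with count 0).
kmer_counts <- function(s, k, alphabet = c("A", "C", "G", "T")) {
  # every start position of a window of width k
  starts <- seq_len(nchar(s) - k + 1)
  kmers <- substring(s, starts, starts + k - 1)
  # all length-k words over the alphabet, e.g. 16 words for k = 2
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  words <- sort(do.call(paste0, grid))
  table(factor(kmers, levels = words))
}

counts <- kmer_counts("ATGATG", 2)
counts[counts > 0]
```

For "ATGATG" the overlapping 2-letter words are AT, TG, GA, AT, TG, so AT and TG are counted twice and GA once.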
The simple thing to do is just something like this:
# Generate a test sequence
set.seed(1234)
testSeq <- paste(sample(LETTERS[1:3], 100, replace = T), collapse = "")
# Split string into chunks of size 2 and then count occurrences
testBigram <- substring(testSeq, seq(1, nchar(testSeq), 2), seq(2, nchar(testSeq), 2))
table(testBigram)
AA AB AC BA BB BC CA CB CC
10 10 14 3 3 2 2 5 1
Here is a way using a "function factory" (https://adv-r.hadley.nz/function-factories.html).
The 2-element and 3-element combinations are n-grams of size 2 and 3. So we make this n-gram function factory.
# Generate a function to create a function
ngram <- function(size) {
function(myvector) {
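# start positions of complete chunks only (a trailing partial chunk is dropped)
starts <- seq(1, by = size, length.out = nchar(myvector) %/% size)
substring(myvector, starts, starts + size - 1)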
}
}
# Assign the functions names (optional)
bigram <- ngram(2)
trigram <- ngram(3)
# 2 element combinations
table(bigram(testSeq))
AA AB AC BA BB BC CA CB CC
10 10 14 3 3 2 2 5 1
# count of 2 element combinations
length(unique(bigram(testSeq)))
[1] 9
# counting function
count <- function(mseq, n) length(unique(ngram(n)(mseq)))
count(testSeq, 2)
[1] 9
# and if we wanted to do this with 3 element combinations
table(trigram(testSeq))

Extract the combinations of cells without repeating the index

I am trying to calculate the combinations of elements of a matrix but each element should appear only once.
The (real) matrix is symmetric, and can have more than 5 elements (up to ~2000):
o <- matrix(runif(25), ncol = 5, nrow = 5)
dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
# A B C D E
# A 0.4400317 0.1715681 0.7319108946 0.3994685 0.4466997
# B 0.5190471 0.1666164 0.3430245044 0.3837903 0.9322599
# C 0.3249180 0.6122229 0.6312876740 0.8017402 0.0141673
# D 0.1641411 0.1581701 0.0001703419 0.7379847 0.8347536
# E 0.4853255 0.5865909 0.6096330935 0.8749807 0.7230507
I want to calculate the product for every combination of pairs, where for each pair one letter is the row and the other one is the column. If possible, every element should appear in each combination (AB, CD, EF if the matrix has 6 elements). Here are some combinations:
AB, CD, E
AC, BD, E
AD, BC, E
AE, BC, D
AE, BD, C
Where the value of the single element is just 1.
Combinations not desired:
AB, BC: Element B appears twice
AB, AC: Element A appears twice
Things I tried:
I thought about removing the unwanted part of the matrix:
out <- which(upper.tri(o), arr.ind = TRUE)
out <- cbind.data.frame(out, value = o[upper.tri(o)])
out[, 1] <- colnames(o)[out[, 1]]
out[, 2] <- colnames(o)[out[, 2]]
# row col value
# 1 A B 0.1715681
# 2 A C 0.7319109
# 3 B C 0.3430245
# 4 A D 0.3994685
# 5 B D 0.3837903
# 6 C D 0.8017402
# 7 A E 0.4466997
# 8 B E 0.9322599
# 9 C E 0.0141673
# 10 D E 0.8347536
My attempt involves the following process:
1. Make a copy of the matrix (out)
2. Store the first value of the first row.
3. Remove all pairs that involve either element of that pair.
4. Select the next pair of the resulting matrix
5. Repeat until all rows of the matrix are removed
6. Repeat 2:5 starting from a different row
However, this method has one big problem, it doesn't guarantee that all the combinations are stored, and it could store several times the same combination.
My expected output is a vector, where each element is the product of the values in the cell selected by the combination:
AB, CD: 0.137553
How can I extract all those combinations efficiently?
This might work. I tested this on N elements = 5 and 6.
Note that this is not optimised, and hopefully can provide a framework for you to work from. With a much larger array, I can see steps involving apply and combn being a bottleneck.
The idea here is to generate a collection of unique sets first before calculating the product of the sets from another data.frame that stores values of sets.
Unique sets are identified by counting the number of unique elements in all combination pairs. For example, if N elements = 6, we expect length(unlist(combination)) == 6. The same is true if N elements = 7 (there will only be 3 pairs plus a remainder element). In cases where N elements is odd, we can ignore the remaining, unpaired element since it is constrained by the other elements.
library(dplyr)
library(reshape2)
## some functions
unique_by_n <- function(inlist, N){
## select unique combinations by count
## (if unique, expect n = 6 if n elements = 6)
if(N %% 2) N <- N - 1 ## for odd numbers
return(length(unique(unlist(inlist))) == N)
}
get_combs <- function(x,xall){
## format and catches remainder if matrix of odd elements
xu <- unlist(x)
remainder <- setdiff(xall,xu) ## catch remainder if any
xset <- unlist(lapply(x, paste0, collapse=''))
finalset <- c(xset, remainder)
return(finalset)
}
## make dataset
set.seed(0) ## set reproducible example
#o <- matrix(runif(25), ncol = 5, nrow = 5) ## uncomment to test 5
#dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
o <- matrix(runif(36), ncol = 6, nrow = 6)
dimnames(o) <- list(LETTERS[1:6], LETTERS[1:6])
o[lower.tri(o)] <- t(o)[lower.tri(o)] ## make matrix symmetric
n_elements = nrow(o)
#### get matrix
dat <- melt(o, varnames = c('Rw', 'Cl'), as.is = TRUE)
dat$Set <- apply(dat, 1, function(x) paste0(sort(unique(x[1:2])), collapse = ''))
## get unique sets (since your matrix is symmetric)
dat <- subset(dat, !duplicated(Set))
#### get sets
elements <- rownames(o)
allpairs <- expand.grid(Rw = elements, Cl = elements) %>%
filter(Rw != Cl) ## get all pairs
uniqpairsgrid <- unique(t(apply(allpairs,1,sort)))
uniqpairs <- split(uniqpairsgrid, seq(nrow(uniqpairsgrid))) ## get unique pairs
allpaircombs <- combn(uniqpairs,floor(n_elements/2)) ## get combinations of pairs
uniqcombs <- allpaircombs[,apply(allpaircombs, 2, unique_by_n, N = n_elements)] ## remove pairs with repeats
finalcombs <- apply(uniqcombs, 2, get_combs, xall=elements)
#### calculate results
res <- apply(finalcombs, 2, function(x) prod(subset(dat, Set %in% x)$value)) ## calculate product
names(res) <- apply(finalcombs, 2, paste0, collapse=',') ## add names
resdf <- data.frame(Sets = names(res), Products = res, stringsAsFactors = FALSE, row.names = NULL)
print(resdf)
#> Sets Products
#> 1 AB,CD,EF 0.130063454
#> 2 AB,CE,DF 0.171200062
#> 3 AB,CF,DE 0.007212619
#> 4 AC,BD,EF 0.012494787
#> 5 AC,BE,DF 0.023285088
#> 6 AC,BF,DE 0.001139712
#> 7 AD,BC,EF 0.126900247
#> 8 AD,BE,CF 0.158919605
#> 9 AD,BF,CE 0.184631344
#> 10 AE,BC,DF 0.042572488
#> 11 AE,BD,CF 0.028608495
#> 12 AE,BF,CD 0.047056905
#> 13 AF,BC,DE 0.003131029
#> 14 AF,BD,CE 0.049941770
#> 15 AF,BE,CD 0.070707311
Created on 2018-07-23 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0.9000).
Maybe the following does what you want.
Note that I was more interested in being right than in performance.
Also, I have set the RNG seed, to have reproducible results.
set.seed(9840) # Make reproducible results
o <- matrix(runif(25), ncol = 5, nrow = 5)
dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
cmb <- combn(LETTERS[1:5], 2)
n <- ncol(cmb)
res <- NULL
nms <- NULL
for(i in seq_len(n)){
for(j in seq_len(n)[-seq_len(i)]){
x <- unique(c(cmb[, i], cmb[, j]))
if(length(x) == 4){
res <- c(res, o[cmb[1, i], cmb[2, i]] * o[cmb[1, j], cmb[2, j]])
nms <- c(nms, paste0(cmb[1, i], cmb[2, i], '*', cmb[1, j], cmb[2, j]))
}
}
}
names(res) <- nms
res
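As an alternative to filtering combn() output, the pairings can be built recursively: pair the first remaining element with each possible partner and recurse on the rest. The sketch below is only an illustration under assumptions; pair_products is a hypothetical name, the matrix is assumed to have matching row and column names, and an unpaired element (odd count) contributes a factor of 1 as in the question. For an even number n of elements there are (n-1)!! = (n-1)(n-3)...1 complete pairings, so exhaustive enumeration is only feasible for small matrices, not for ~2000 elements.

```r
# Hypothetical helper: enumerate every way to split `elems` into disjoint pairs
# and return the product of the matching cells of `o`, named by the pairing.
pair_products <- function(elems, o) {
  if (length(elems) == 0) return(setNames(1, ""))
  if (length(elems) == 1) return(setNames(1, elems))
  out <- numeric(0)
  if (length(elems) %% 2 == 1) {
    # odd count: each element takes a turn as the unpaired one (factor 1)
    for (s in elems) {
      sub <- pair_products(setdiff(elems, s), o)
      names(sub) <- paste(names(sub), s, sep = ",")
      out <- c(out, sub)
    }
    return(out)
  }
  # even count: the first element must be paired with exactly one partner
  first <- elems[1]
  for (partner in elems[-1]) {
    sub <- pair_products(setdiff(elems, c(first, partner)), o)
    label <- paste0(first, partner)
    names(sub) <- ifelse(names(sub) == "", label, paste(label, names(sub), sep = ","))
    out <- c(out, sub * o[first, partner])
  }
  out
}

# Toy symmetric matrix with cell (i, j) equal to i * j, so every complete
# pairing of A..D has the same product 1 * 2 * 3 * 4 = 24.
o4 <- outer(1:4, 1:4)
dimnames(o4) <- list(LETTERS[1:4], LETTERS[1:4])
res4 <- pair_products(LETTERS[1:4], o4)
res4
```

Fixing the first element in the even case is what prevents the same pairing from being generated more than once.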

Computing number of bits that are set to 1 for matching rows in terms of hamming distance between two data frames

I have two data frames, df1 and df2, with the same number of columns (but not rows). For each row in df2, I was able to find the best (and second best) matching rows from df1 in terms of Hamming distance in my previous post. In that post, we used the following example data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
I now need to compute the number of bits equal to 1 for:
each row in df2
the best matching rows in df1
the second matching rows in df1
The number of bits equal to 1 of an integer a may be computed as
sum(as.integer(intToBits(a)))
And I have applied this to @ZheyuanLi's original function, so I have got item 1>. However, I'm unable to apply the same logic to get items 2> and 3> by simple modification of @ZheyuanLi's function.
Below are the functions from @ZheyuanLi, with my modification:
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
foo <- function(df1, df2, p = 2) {
## check p
if (p > nrow(df2)) p <- nrow(df2)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
sb <- integer(n)
k <- 1:p
for (i in 1:n) {
set.bits <- sum(as.integer(intToBits(yt[,i])))
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
sb[i] <- set.bits
k <- k + p
}
## recode "id", "d" and "sb" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
sb <- as.data.frame(matrix(sb, ncol = 1)) ## no need for byrow as you have only 1 column
colnames(sb) <- "set.bits.1"
list(id = id, d = d, sb = sb)
}
Running these gives:
> foo(df1, df2)
$id
min.1 min.2 ## row id for best/second best match in df1
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2 ## minimum 2 hamming distance
1 2 2
2 1 3
3 1 3
$sb
set.bits.1 ## number of bits equal to 1 for each row of df2
1 3
2 2
3 4
OK, after reading through while re-editing your question (many times!), I think I know what you want. Essentially we need to change nothing in hmd(). Your required items 1>, 2>, 3> can all be computed after the for loop in foo().
To get item 1>, which you called sb, we can use a tapply(). However, your computation of sb along the for loop is fine, so I will not change it. In the following, I will demonstrate the basic procedure to get item 2> and item 3>.
The id vector inside foo() stores all matching rows in df1:
id <- c(1, 4, 2, 3, 5, 2)
so we can simply extract those rows of df1 (actually, columns of xt) to compute the number of bits equal to 1. As you can see, there are lots of duplicates in id, so we only need to compute on unique(id):
id0 <- sort(unique(id))
## [1] 1 2 3 4 5
We now extract those subset columns of xt:
sub_xt <- xt[, id0]
## [,1] [,2] [,3] [,4] [,5]
## V1 9 3 10 5 6
## V2 2 4 8 7 1
To compute the number of bits equal to 1 for each column of sub_xt, we again use tapply() and a vectorized approach.
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
FUN = sum))
## [1] 3 3 3 5 3
Now we need to map sbxt0 to sbxt:
sbxt <- sbxt0[match(id, id0)]
## [1] 3 5 3 3 3 3
Then we can convert sbxt to a data frame sb1:
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## min.1.set.bits.1 min.2.set.bits.1
## 1 3 5
## 2 3 3
## 3 3 3
Finally we can assemble these things up:
foo <- function(df1, df2, p = 2) {
## check p
if (p > nrow(df2)) p <- nrow(df2)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
sb2 <- integer(n)
k <- 1:p
for (i in 1:n) {
set.bits <- sum(as.integer(intToBits(yt[,i])))
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
sb2[i] <- set.bits
k <- k + p
}
## compute "sb1"
id0 <- sort(unique(id))
sub_xt <- xt[, id0]
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
FUN = sum))
sbxt <- sbxt0[match(id, id0)]
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## recode "id", "d" and "sb2" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
sb2 <- as.data.frame(matrix(sb2, ncol = 1)) ## no need for byrow as you have only 1 column
colnames(sb2) <- "set.bits.1"
list(id = id, d = d, sb1 = sb1, sb2 = sb2)
}
Now, running foo(df1, df2) gives:
> foo(df1,df2)
$id
min.1 min.2
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2
1 2 2
2 1 3
3 1 3
$sb1
min.1.set.bits.1 min.2.set.bits.1
1 3 5
2 3 3
3 3 3
$sb2
set.bits.1
1 3
2 2
3 4
Note that I have renamed the sb you used to sb2.
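The per-column bit counting done above with tapply() can also be sketched with a matrix reshape, since intToBits() returns exactly 32 bits per integer. popcount is a hypothetical helper name, and the small matrix below simply re-enters the sub_xt values shown earlier.

```r
# Hypothetical helper: number of bits equal to 1 for each element of x.
popcount <- function(x) {
  # one column of 32 bits per integer
  bits <- matrix(as.integer(intToBits(as.integer(x))), nrow = 32)
  colSums(bits)
}

popcount(c(9L, 2L, 5L, 7L))  # 2 1 2 3

# Values of sub_xt from above: one column per matched row of df1.
sub_xt <- matrix(c(9, 2, 3, 4, 10, 8, 5, 7, 6, 1), nrow = 2)
# Sum the per-element counts within each column.
per_column <- colSums(matrix(popcount(sub_xt), nrow = nrow(sub_xt)))
per_column                   # 3 3 3 5 3
```

The result agrees with the sbxt0 vector computed by the tapply() version.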
