Related
I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row to df1 in terms of hamming distance.
I used e1071 package to compute hamming distance. Hamming distance between two vectors x and y can be computed as for example:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))
and the resulting distance is 25. Yet I need to do this for all rows of df1 and df2. A trivial method takes a double loop nest and looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the 2nd lowest distance;
a column with the second lowest distance.
Thanks.
Fast computation of hamming distance between two integers vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
to compute hamming distance between two integers vectors of equal length x and y. This only uses R base, yet is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)
Fast hamming distance between a matrix and a vector
If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
Note that:
hmd performs computation column-wise. It is designed to be CPU cache friendly. In this way, if we want to do some row-wise computation, we should transpose the matrix first;
there is no obvious loop here; instead, we use tapply().
Fast hamming distance computation between two matrices/data frames
This is what you want. The following function foo takes two data frames or matrices df1 and df2, computing the distance between df1 and each row of df2. argument p is an integer, showing how many results you want to retain. p = 3 will keep the smallest 3 distances with their row ids in df1.
foo <- function(df1, df2, p) {
## check p
if (p > nrow(df2)) p <- nrow(df2)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
k <- 1:p
for (i in 1:n) {
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
k <- k + p
}
## recode "id" and "d" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
list(id = id, d = d)
}
Note that:
transposition is done at the beginning, according to reasons before;
a for loop is used here. But this is actually efficient because there is considerable computation done in each iteration. It is also more elegant than using *apply family, since we ask for multiple output (row id id and distance d).
Experiment
This part uses small dataset to test/demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
# min1 min2
# 1 1 4
# 2 2 3
# 3 5 2
# $d
# mindist.1 mindist.2
# 1 2 2
# 2 1 3
# 3 1 3
If you want to append some columns to df2, you know what to do, right?
Please don't be surprised why I take another section. This part gives something relevant. It is not what OP asks for, but may help any readers.
General hamming distance computation
In the previous answer, I start from a function hmd0 that computes hamming distance between two integer vectors of the same length. This means if we have 2 integer vectors:
set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)
we will end up with a scalar:
hmd0(x,y)
# 13
What if we want to compute pairwise hamming distance of two vectors?
In fact, a simple modification to our function hmd will do:
hamming.distance <- function(x, y, pairwise = TRUE) {
nx <- length(x)
ny <- length(y)
rawx <- intToBits(x)
rawy <- intToBits(y)
if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x),intToBits(y)))))
if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
bits <- length(intToBits(0)) ## 32-bit or 64 bit?
result <- unname(tapply(as.logical(xor(rawx,rawy)), rep(1:ny, each = bits), sum))
}
if (pairwise) result else sum(result)
}
Now
hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13
Hamming distance matrix
If we want to compute the hamming distance matrix, for example,
set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)
The distance matrix between x and y is:
outer(x, y, hamming.distance) ## pairwise argument has no effect here
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 2 3 4 3 4 4 2
# [2,] 7 6 3 4 3 3 3
# [3,] 4 5 4 3 6 4 2
# [4,] 2 3 2 5 6 4 2
# [5,] 4 3 4 3 2 0 2
We can also do:
outer(x, x, hamming.distance)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 5 2 2 4
# [2,] 5 0 3 5 3
# [3,] 2 3 0 2 4
# [4,] 2 5 2 0 4
# [5,] 4 3 4 4 0
In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.
Here's an alternative solution that uses only base R, and should be very fast, especially when your df1 and df2 have many rows. The main reason for this is that it does not use any R-level looping for calculating the Hamming distances, such as for-loops, while-loops, or *apply functions. Instead, it uses matrix multiplication for computing the Hamming distance. In R, this is much faster than any approach using R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop. Two other efficiency-related features of this approach are: (1) It uses partial sorting for finding the best two matches for each row in df2, and (2) It stores the entire bitwise representation of df1 in one matrix (same for df2), and does so in one single step, without using any R-level loops.
The function that does all the work:
# INPUT:
# X corresponds to your entire df1, but is a matrix
# Y corresponds to your entire df2, but is a matrix
# OUTPUT:
# Matrix with four columns corresponding to the values
# that you specified in your question
fun <- function(X, Y) {
# Convert integers to bits
X <- intToBits(t(X))
# Reshape into matrix
dim(X) <- c(ncols * 32, nrows)
# Convert integers to bits
Y <- intToBits(t(Y))
# Reshape into matrix
dim(Y) <- c(ncols * 32, nrows)
# Calculate pairwise hamming distances using matrix
# multiplication.
# Columns of H index into Y; rows index into X.
# The code for the hamming() function was retrieved
# from this page:
# https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
H <- hamming(X, Y)
# Now, for each row in Y, find the two best matches
# in X. In other words: for each column in H, find
# the two smallest values and their row indices.
t(apply(H, 2, function(h) {
mindists <- sort(h, partial = 1:2)
c(
ind1 = which(h == mindists[1])[1],
val1 = mindists[1],
hmd2 = which(h == mindists[2])[1],
val2 = mindists[2]
)
}))
}
To call the function on some random data:
# Generate some random test data with no. of columns
# corresponding to your data
nrows <- 1000
ncols <- 26
# X corresponds to your df1
X <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
# Y corresponds to your df2
Y <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
res <- fun(X, Y)
The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.
Sometimes I want to transform several data columns (usually character or factor) into one new column (usually a number). I try to do this using a lookup matrix. For example, my dataset is
dset <- data.frame(
x=c("a", "a", "b"),
y=c("v", "w", "w"),
stringsAsFactors=FALSE
)
lookup <- matrix(c(1:4), ncol=2)
rownames(lookup) <- c("a", "b")
colnames(lookup) <- c("v", "w")
Ideally (for my purpose here), I would now do
transform(dset, z=lookup[x,y])
and get my new data column. While this works in the one-dimensional case, this fails here, as lookup[x,y] returns a matrix. I came up with this function, which looks rather slow:
fill_from_matrix <- function(m, ...) {
arg <- list(...)
len <- sapply(arg, length)
if(sum(diff(len))!=0) stop("differing lengths in fill_from_matrix")
if(length(arg)!=length(dim(m))) stop("differing dimensions in fill_from_matrix")
n <- len[[1]]
dims <- length(dim(m))
res <- rep(NA, n)
for (i in seq(1,n)) {
one_arg <- list(m)
for (j in seq(1,dims)) one_arg[[j+1]] <- arg[[j]][[i]]
res[i] <- do.call("[", one_arg)
}
return(res)
}
With this function, I can call transform and get the result I wanted:
transform(dset, z=fill_from_matrix(lookup,x,y))
# x y z
# 1 a v 1
# 2 a w 3
# 3 b w 4
However, I am not satisfied with the code and wonder if there is a more elegant (and faster) way to perform this kind of transformation. How do I get rid of the for loops?
This is really quite easy and I suspect fast with base R indexing because the "[" function accepts a two-column matrix for this precise purpose:
> dset$z <- lookup[ with(dset, cbind(x,y)) ]
> dset
x y z
1 a v 1
2 a w 3
3 b w 4
If you needed it as a specific function then:
lkup <- function(tbl, rowidx, colidx){ tbl[ cbind(rowidx, colidx)]}
zvals <- lkup(lookup, dset$x, dset$y)
zvals
#[1] 1 3 4
(I'm pretty sure you can also use three and four column matrices if you have arrays of those dimensions.)
You can use library dplyr for inner_join and use a data.frame instead of matrix as lookup table:
library(dplyr)
lookup = transform(expand.grid(c('a','b'),c('v','w')), v=1:4) %>%
setNames(c('x','y','val'))
inner_join(dset, lookup, by=c('x','y'))
# x y val
#1 a v 1
#2 a w 3
#3 b w 4
A fast way is also to use data.table package, with my definition of lookup:
library(data.table)
setDT(lookup)
setDT(dset)
setkey(lookup, x ,y)[dset]
# x y val
#1: a v 1
#2: a w 3
#3: b w 4
If for any reason you have your matrix lookup as input, transform it in a dataframe:
lookup = transform(expand.grid(rownames(lookup), colnames(lookup)), v=c(lookup))
names(lookup) = c('x','y','val')
I wish to combine equivalent, deeply-nested columns from all elements of a reasonably long list. What I would like to do, though it's not possible in R, is this:
combined.columns <- my.list[[1:length(my.list)]]$my.matrix[,"my.column"]
The only thing I can think of is to manually type out all the elements in cbind() like this:
combined.columns <- cbind(my.list[[1]]$my.matrix[,"my.column"], my.list[[2]]$my.matrix[,"my.column"], . . . )
This answer is pretty close to what I need, but I can't figure out how to make it work for the extra level of nesting.
There must be a more elegant way of doing this, though. Any ideas?
Assuming all your matrices have the same column name you wish to extract you could use sapply
set.seed(123)
my.list <- vector("list")
my.list[[1]] <- list(my.matrix = data.frame(A=rnorm(10,sd=0.3), B=rnorm(10,sd=0.3)))
my.list[[2]] <- list(my.matrix = data.frame(C=rnorm(10,sd=0.3), B=rnorm(10,sd=0.3)))
my.list[[3]] <- list(my.matrix = data.frame(D=rnorm(10,sd=0.3), B=rnorm(10,sd=0.3)))
sapply(my.list, FUN = function(x) x$my.matrix[,"B"])
Free data:
myList <- list(list(myMat = matrix(1:10, 2, dimnames=list(NULL, letters[1:5])),
myVec = 1:10),
list(myMat = matrix(10:1, 2, dimnames=list(NULL, letters[1:5])),
myVec = 10:1))
We can get column a of myMat a few different ways. Here's one that uses with.
sapply(myList, with, myMat[,"a"])
# [,1] [,2]
# [1,] 1 10
# [2,] 2 9
This mapply one might be better for a more recursive type problem. It works too and might be faster than sapply.
mapply(function(x, y, z) x[[y]][,z] , myList, "myMat", "a")
# [,1] [,2]
# [1,] 1 10
# [2,] 2 9
I am trying to implement the following simple formulas in R:
Formula 1:
I have no idea how to implement in R the product operator when the limits of the interval are very large (e.g. value of the upper limit = 10,000 instead of 5)
Formula 2
Example input for second formula (in reality, the dimension of the interval S is much much bigger)
S = list(c(1,0,0), c(0,1,0), c(0,0,1))
X = c(1,2,3)
Any help would be appreciated!
For the first, take the log:
i *log(1+x)
For the second formula: (not clear what is the expected output)
ss<-matrix(unlist(S), ncol = 3, byrow = TRUE)
X<-as.matrix(X)
crossprod(crossprod(X,ss),t(X))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 6
[3,] 3 6 9
Maybe more compactly:
First formula:
function(n, x) exp(sum(seq_len(n)*log(1+x)))
Second formula:
function(X, S) rowSums(sapply(S, function(y) sum(X*y)*X ))
For the first formula it has been mentioned that it is better to do this on the log scale, if your true values of x are near 0 then the log1p function may be of help.
In general for these types of problems you can use lapply or sapply to compute the pieces that need to be multiplied or summed (or whatever), then use sum or prod to sum, multiply. If you want to collapse/combine the values with an operator that does not have a nice function like sum or prod then use the Reduce function.
S = list c((1,0,0), c(0,1,0), c(0,0,1))
X = c(1,2,3)
lapply( lapply(S, function(x) X %*% x %*% t(X) ) , sum)
[[1]]
[1] 6
[[2]]
[1] 12
[[3]]
[1] 18
I have a sparse matrix (dgCMatrix) as the result of fitting a glmnet. I want to write this result to a .csv but can't use write.table() the matrix because it can't coerced into a data.frame.
Is there a way to coerce the sparse matrix to either a data.frame or a regular matrix? Or is there a way to write it to a file while keeping the coefficient names which are probably row names?
That will be dangerous to transform the sparse matrix to a normal one, if the sparse matrix size is too large. In my case (text classification task), I got a matrix of size 22490 by 120,000. If you try get the dense matrix, that will be more than 20 GB, I think. Then R will break down !
So my suggestion, you may simply store the sparse matrix in an efficient and memory friendly way, such as Matrix Market Format, which keeps all non-zero values and their coordinates (row & col number). In the R you can use the method writeMM
as.matrix() will convert to the full dense representation:
> as.matrix(Matrix(0, 3, 2))
[,1] [,2]
[1,] 0 0
[2,] 0 0
[3,] 0 0
You can write the resulting object out using write.csv or write.table.
Converting directly to a dense matrix is likely to waste a lot of memory. The R package Matrix allows converting the sparse matrix into a memory-efficient coordinate triplet format data frame using the summary() function, which could then be written easily to csv. This is probably simpler and easier than the matrix market approach. See the answer to this related question: Sparse matrix to a data frame in R
Also, here is an illustration from the Matrix package documentation:
## very simple export - in triplet format - to text file:
data(CAex)
s.CA <- summary(CAex)
s.CA # shows (i, j, x) [columns of a data frame]
message("writing to ", outf <- tempfile())
write.table(s.CA, file = outf, row.names=FALSE)
## and read it back -- showing off sparseMatrix():
str(dd <- read.table(outf, header=TRUE))
## has columns (i, j, x) -> we can use via do.call() as arguments to sparseMatrix():
mm <- do.call(sparseMatrix, dd)
stopifnot(all.equal(mm, CAex, tolerance=1e-15))
# input: a sparse matrix with named rows and columns (dimnames)
# returns: a data frame representing triplets (r, c, x) suitable for writing to a CSV file
sparse2triples <- function(m) {
SM = summary(m)
D1 = m#Dimnames[[1]][SM[,1]]
D2 = m#Dimnames[[2]][SM[,2]]
data.frame(row=D1, col=D2, x=m#x)
}
Example
> library(Matrix)
> dn <- list(LETTERS[1:3], letters[1:5])
> m <- sparseMatrix(i = c(3,1,3,2,2,1), p= c(0:2, 4,4,6), x = 1:6, dimnames = dn)
> m
3 x 5 sparse Matrix of class "dgCMatrix"
a b c d e
A . 2 . . 6
B . . 4 . 5
C 1 . 3 . .
> sparse2triples(m)
row col x
1 C a 1
2 A b 2
3 B c 4
4 C c 3
5 A e 6
6 B e 5
[EDIT: use data.frame]