Related
I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1#i, j = summary(m1)$j, x = m1#x)
position_matrix_from_m2 <- rbind(i = m2#i, j = summary(m2)$j, x = m2#x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses to much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2#x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax as mat[rn, cn] (whether mat is a dense or sparse matrix) is selecting all rows in rn and all columns in cn. So you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select (rc[1], cn[1]), (rc[2], cn[2]) ..., only. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame, giving (i, j, x) triplet, where both i and j are index starting from 1. You could have worked with this nice output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there a even better solution? Yes! Note that nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2#x according to m1#x.
ind <- m1#x > 0 & m1#x < 0.05
m2#x[ind] <- 1
m1#x[ind] <- 0
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1#x > 0 & m1#x < 0.05
m2b#x[ind] <- 1
m1b#x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.
I am generating a matrix in R using following,
ncolumns = 3
nrows = 10
my.mat <- matrix(runif(ncolumns*nrows), ncol=ncolumns)
This matrix indicates the co-ordinates of a point in 3D. How to calculate following in R?
sum of x(i)*y(i)
e.g. if the matrix is,
x y z
1 2 3
4 5 6
then output = 1*2 + 4*5
I'm trying to learn R. So any help will be really appreciated.
Thanks
You're looking for the %*% function.
ncolumns = 3
nrows = 10
my.mat <- matrix(runif(ncolumns*nrows), ncol=ncolumns)
(my.answer <- my.mat[,1] %*% my.mat[,2])
# [,1]
# [1,] 1.519
you simply do:
# x is the first column; y is the 2nd
sum(my.mat[i, 1] * my.mat[i, 2])
Now, if you want to name your columns, you can refer to them directly
colnames(my.mat) <- c("x", "y", "z")
sum(my.mat[i, "x"] * my.mat[i, "y"])
# or if you want to get the product of each i'th element
# just leave empty the space where the i would go
sum(my.mat[ , "x"] * my.mat[ , "y"])
each column is designated by the second argument in [], so
my_matrix[,1] + my_matrix[,2]
is all you need.
I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row to df1 in terms of hamming distance.
I used e1071 package to compute hamming distance. Hamming distance between two vectors x and y can be computed as for example:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))
and the resulting distance is 25. Yet I need to do this for all rows of df1 and df2. A trivial method takes a double loop nest and looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the 2nd lowest distance;
a column with the second lowest distance.
Thanks.
Fast computation of hamming distance between two integers vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
to compute hamming distance between two integers vectors of equal length x and y. This only uses R base, yet is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)
Fast hamming distance between a matrix and a vector
If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
Note that:
hmd performs computation column-wise. It is designed to be CPU cache friendly. In this way, if we want to do some row-wise computation, we should transpose the matrix first;
there is no obvious loop here; instead, we use tapply().
Fast hamming distance computation between two matrices/data frames
This is what you want. The following function foo takes two data frames or matrices df1 and df2, computing the distance between df1 and each row of df2. argument p is an integer, showing how many results you want to retain. p = 3 will keep the smallest 3 distances with their row ids in df1.
foo <- function(df1, df2, p) {
## check p
if (p > nrow(df2)) p <- nrow(df2)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
k <- 1:p
for (i in 1:n) {
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
k <- k + p
}
## recode "id" and "d" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
list(id = id, d = d)
}
Note that:
transposition is done at the beginning, according to reasons before;
a for loop is used here. But this is actually efficient because there is considerable computation done in each iteration. It is also more elegant than using *apply family, since we ask for multiple output (row id id and distance d).
Experiment
This part uses small dataset to test/demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
# min1 min2
# 1 1 4
# 2 2 3
# 3 5 2
# $d
# mindist.1 mindist.2
# 1 2 2
# 2 1 3
# 3 1 3
If you want to append some columns to df2, you know what to do, right?
Please don't be surprised why I take another section. This part gives something relevant. It is not what OP asks for, but may help any readers.
General hamming distance computation
In the previous answer, I start from a function hmd0 that computes hamming distance between two integer vectors of the same length. This means if we have 2 integer vectors:
set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)
we will end up with a scalar:
hmd0(x,y)
# 13
What if we want to compute pairwise hamming distance of two vectors?
In fact, a simple modification to our function hmd will do:
hamming.distance <- function(x, y, pairwise = TRUE) {
nx <- length(x)
ny <- length(y)
rawx <- intToBits(x)
rawy <- intToBits(y)
if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x),intToBits(y)))))
if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
bits <- length(intToBits(0)) ## 32-bit or 64 bit?
result <- unname(tapply(as.logical(xor(rawx,rawy)), rep(1:ny, each = bits), sum))
}
if (pairwise) result else sum(result)
}
Now
hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13
Hamming distance matrix
If we want to compute the hamming distance matrix, for example,
set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)
The distance matrix between x and y is:
outer(x, y, hamming.distance) ## pairwise argument has no effect here
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 2 3 4 3 4 4 2
# [2,] 7 6 3 4 3 3 3
# [3,] 4 5 4 3 6 4 2
# [4,] 2 3 2 5 6 4 2
# [5,] 4 3 4 3 2 0 2
We can also do:
outer(x, x, hamming.distance)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 5 2 2 4
# [2,] 5 0 3 5 3
# [3,] 2 3 0 2 4
# [4,] 2 5 2 0 4
# [5,] 4 3 4 4 0
In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.
Here's an alternative solution that uses only base R, and should be very fast, especially when your df1 and df2 have many rows. The main reason for this is that it does not use any R-level looping for calculating the Hamming distances, such as for-loops, while-loops, or *apply functions. Instead, it uses matrix multiplication for computing the Hamming distance. In R, this is much faster than any approach using R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop. Two other efficiency-related features of this approach are: (1) It uses partial sorting for finding the best two matches for each row in df2, and (2) It stores the entire bitwise representation of df1 in one matrix (same for df2), and does so in one single step, without using any R-level loops.
The function that does all the work:
# INPUT:
# X corresponds to your entire df1, but is a matrix
# Y corresponds to your entire df2, but is a matrix
# OUTPUT:
# Matrix with four columns corresponding to the values
# that you specified in your question
fun <- function(X, Y) {
# Convert integers to bits
X <- intToBits(t(X))
# Reshape into matrix
dim(X) <- c(ncols * 32, nrows)
# Convert integers to bits
Y <- intToBits(t(Y))
# Reshape into matrix
dim(Y) <- c(ncols * 32, nrows)
# Calculate pairwise hamming distances using matrix
# multiplication.
# Columns of H index into Y; rows index into X.
# The code for the hamming() function was retrieved
# from this page:
# https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
H <- hamming(X, Y)
# Now, for each row in Y, find the two best matches
# in X. In other words: for each column in H, find
# the two smallest values and their row indices.
t(apply(H, 2, function(h) {
mindists <- sort(h, partial = 1:2)
c(
ind1 = which(h == mindists[1])[1],
val1 = mindists[1],
hmd2 = which(h == mindists[2])[1],
val2 = mindists[2]
)
}))
}
To call the function on some random data:
# Generate some random test data with no. of columns
# corresponding to your data
nrows <- 1000
ncols <- 26
# X corresponds to your df1
X <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
# Y corresponds to your df2
Y <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
res <- fun(X, Y)
The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.
I am trying to implement the following simple formulas in R:
Formula 1:
I have no idea how to implement in R the product operator when the limits of the interval are very large (e.g. value of the upper limit = 10,000 instead of 5)
Formula 2
Example input for second formula (in reality, the dimension of the interval S is much much bigger)
S = list(c(1,0,0), c(0,1,0), c(0,0,1))
X = c(1,2,3)
Any help would be appreciated!
For the first, take the log:
i *log(1+x)
For the second formula: (not clear what is the expected output)
ss<-matrix(unlist(S), ncol = 3, byrow = TRUE)
X<-as.matrix(X)
crossprod(crossprod(X,ss),t(X))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 6
[3,] 3 6 9
Maybe more compactly:
First formula:
function(n, x) exp(sum(seq_len(n)*log(1+x)))
Second formula:
function(X, S) rowSums(sapply(S, function(y) sum(X*y)*X ))
For the first formula it has been mentioned that it is better to do this on the log scale, if your true values of x are near 0 then the log1p function may be of help.
In general for these types of problems you can use lapply or sapply to compute the pieces that need to be multiplied or summed (or whatever), then use sum or prod to sum, multiply. If you want to collapse/combine the values with an operator that does not have a nice function like sum or prod then use the Reduce function.
S = list c((1,0,0), c(0,1,0), c(0,0,1))
X = c(1,2,3)
lapply( lapply(S, function(x) X %*% x %*% t(X) ) , sum)
[[1]]
[1] 6
[[2]]
[1] 12
[[3]]
[1] 18
Let's say you have a matrix
m <- matrix(1:25*2, nrow = 5, ncol=5)
How do you go from matrix subscripts (row index, column index) to a linear index you can use on the matrix. For example you can extract values of the matrix with either of these two methods
m[2,3] == 24
m[12] == 24
How do you go from (2,3) => 12 or 12 => (2,3) in R
In Matlab the functions you would use for converting matrix subscripts to linear indices and vice versa are ind2sub and `sub2ind
Is there an equivalent way in R?
This is not something I've used before, but according to this handy dandy Matlab to R cheat sheet, you might try something like this, where m is the number of rows in the matrix, r and c are row and column numbers respectively, and ind the linear index:
MATLAB:
[r,c] = ind2sub(size(A), ind)
R:
r = ((ind-1) %% m) + 1
c = floor((ind-1) / m) + 1
MATLAB:
ind = sub2ind(size(A), r, c)
R:
ind = (c-1)*m + r
For higher dimension arrays, there is the arrayInd function.
> abc <- array(dim=c(10,5,5))
> arrayInd(12,dim(abc))
dim1 dim2 dim3
[1,] 2 2 1
You mostly don't need those functions in R. In Matlab you need those because you can't do e.g.
A(i, j) = x
where i,j,x are three vectors of row and column indices and x contains the corresponding values. (see also this question)
In R you can simply:
A[ cbind(i, j) ] <- x
There are row and col functions that return those indices in matrix form. So it should be as simple as indexing the return from those two functions:
M<- matrix(1:6, 2)
row(M)[5]
#[1] 1
col(M)[5]
#[1] 3
rc.ind <- function(M, ind) c(row(M)[ind], col(M)[ind] )
rc.ind(M,5)
[1] 1 3
Late answer but there's an actual function for ind2sub in the base package called arrayInd
m <- matrix(1:25, nrow = 5, ncol=5)
# linear indices in R increase row number first, then column
arrayInd(5, dim(m))
arrayInd(6, dim(m))
# so, for any arbitrary row/column
numCol <- 3
numRow <- 4
arrayInd(numRow + ((numCol-1) * nrow(m)), dim(m))
# find the row/column of the maximum element in m
arrayInd(which.max(m), dim(m))
# actually which has an arr.ind parameter for returning array indexes
which(m==which.max(m), arr.ind = T)
For sub2ind, JD Long's answer seems to be the best
Something like this works for arbitrary dimensions-
ind2sub = function(sz,ind)
{
ind = as.matrix(ind,ncol=1);
sz = c(1,sz);
den = 1;
sub = c();
for(i in 2:length(sz)){
den = den * sz[i-1];
num = den * sz[i];
s = floor(((ind-1) %% num)/den) + 1;
sub = cbind(sub,s);
}
return(sub);
}