Changes of sign of the rows of a matrix in R

I have a matrix with three columns (say), for example:
M0 <- rbind(
c(1, 2, 3),
c(4, 5, 6)
)
I want to generate all changes of sign of every row of the matrix. Here the desired output is:
[,1] [,2] [,3]
[1,] -1 -2 -3
[2,] 1 -2 -3
[3,] -1 2 -3
[4,] 1 2 -3
[5,] -1 -2 3
[6,] 1 -2 3
[7,] -1 2 3
[8,] 1 2 3
[9,] -4 -5 -6
[10,] 4 -5 -6
[11,] -4 5 -6
[12,] 4 5 -6
[13,] -4 -5 6
[14,] 4 -5 6
[15,] -4 5 6
[16,] 4 5 6
Here is my solution:
signs <- as.matrix(expand.grid(c(-1, 1), c(-1, 1), c(-1, 1)))
M1 <- vapply(1:nrow(M0),
             function(i) t(signs %*% diag(M0[i, ])),
             array(0, dim = c(3, 8)))
t(array(M1, dim = c(3, 8 * dim(M1)[3])))
# [,1] [,2] [,3]
# [1,] -1 -2 -3
# [2,] 1 -2 -3
# [3,] -1 2 -3
# [4,] 1 2 -3
# [5,] -1 -2 3
# [6,] 1 -2 3
# [7,] -1 2 3
# [8,] 1 2 3
# [9,] -4 -5 -6
# [10,] 4 -5 -6
# [11,] -4 5 -6
# [12,] 4 5 -6
# [13,] -4 -5 6
# [14,] 4 -5 6
# [15,] -4 5 6
# [16,] 4 5 6
Do you have a more elegant solution?
Moreover, there's one caveat with this solution: if a row of the source matrix contains zeros, it generates duplicate rows (because -0 == 0). I currently remove them with mgcv::uniquecombs. Do you have a solution which doesn't generate duplicates when there are zeros, without resorting to a "unique"-style function?
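To make the caveat concrete, here is a minimal illustration (constructed for this write-up, not part of the original question): with a zero in the row, half of the eight sign combinations collapse onto one another.
M <- rbind(c(0, 2, 3))
signs <- as.matrix(expand.grid(c(-1, 1), c(-1, 1), c(-1, 1)))
res <- signs %*% diag(M[1, ])
sum(duplicated(res))  # 4 of the 8 rows are duplicates, since -0 == 0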
EDIT: Benchmark of solutions
Let's compare the performance of three given solutions.
# @Aurèle
changesOfSign1 <- function(M) {
  signs <- as.matrix(expand.grid(rep(list(c(1, -1)), ncol(M))))
  out <- matrix(c(apply(M, 1, `*`, c(t(signs)))), ncol = ncol(M), byrow = TRUE)
  out[!duplicated(out), ]
}
# @989
changesOfSign2 <- function(M) {
  signs <- as.matrix(expand.grid(rep(list(c(1, -1)), ncol(M))))
  # signs for each row in the resultant matrix
  m1 <- signs[rep(1:nrow(signs), times = nrow(M)), ]
  # values for each row in the resultant matrix
  m2 <- M[rep(1:nrow(M), each = nrow(signs)), ]
  res <- m1 * m2
  res[!duplicated(res), ]
}
# @DS_UNI
changesOfSign3 <- function(M) {
  as.matrix(do.call(rbind, apply(M, 1, function(row) {
    expand.grid(lapply(row, function(x) if (x == 0) 0 else c(-x, x)))
  })))
}
# benchmark ####
library(microbenchmark)
benchmark <- function(nrows, ncols) {
  M0 <- matrix(rpois(nrows * ncols, 3), nrow = nrows, ncol = ncols)
  microbenchmark(
    changesOfSign1 = changesOfSign1(M0),
    changesOfSign2 = changesOfSign2(M0),
    changesOfSign3 = changesOfSign3(M0),
    times = 1000
  )
}
benchmark(nrows = 20, ncols = 3)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# changesOfSign1 493.990 542.4075 639.2895 577.884 642.589 7912.316 1000 a
# changesOfSign2 475.248 522.7730 618.2550 554.232 608.005 7346.927 1000 a
# changesOfSign3 3506.123 3757.8030 4380.9164 3928.491 4464.204 22603.045 1000 b
benchmark(nrows = 20, ncols = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# changesOfSign1 30.09545 35.95840 46.39465 41.37086 49.56855 344.2176 1000 b
# changesOfSign2 41.20642 47.99532 58.59760 52.83705 60.85200 349.4958 1000 c
# changesOfSign3 13.56397 15.21439 21.34205 18.21113 22.34445 319.3990 1000 a
@Aurèle's and @989's solutions win when there are 3 columns; @DS_UNI's wins when there are 10 columns.
We can improve @DS_UNI's solution with data.table:
# @DS_UNI with data.table
library(data.table)
changesOfSign4 <- function(M) {
  as.matrix(rbindlist(apply(M, 1, function(row) {
    do.call(function(...) CJ(..., sorted = FALSE),
            lapply(row, function(x) if (x == 0) 0 else c(-x, x)))
  })))
}
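As a quick sanity check (my addition, not part of the original benchmark), the data.table variant produces the same set of rows as changesOfSign3(); sorting both row-wise makes them comparable, since CJ() may enumerate combinations in a different order than expand.grid():
M0 <- rbind(c(1, 2, 3), c(4, 5, 6))
r3 <- changesOfSign3(M0)
r4 <- changesOfSign4(M0)
all.equal(r3[do.call(order, as.data.frame(r3)), ],
          r4[do.call(order, as.data.frame(r4)), ],
          check.attributes = FALSE)
# [1] TRUE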

signs <- as.matrix(expand.grid(rep(list(c(1, -1)), ncol(M0))))
matrix(c(apply(M0, 1, `*`, c(t(signs)))), ncol = ncol(M0), byrow = TRUE)
No claims about elegance :)

I would suspect this to be fast (no loop):
signs <- as.matrix(expand.grid(c(-1, 1), c(-1, 1), c(-1, 1)))
# signs for each row in the resultant matrix
m1 <- signs[rep(1:nrow(signs), times = nrow(M0)), ]
# values for each row in the resultant matrix
m2 <- M0[rep(1:nrow(M0), each = nrow(signs)), ]
res <- m1 * m2
# Var1 Var2 Var3
# [1,] -1 -2 -3
# [2,] 1 -2 -3
# [3,] -1 2 -3
# [4,] 1 2 -3
# [5,] -1 -2 3
# [6,] 1 -2 3
# [7,] -1 2 3
# [8,] 1 2 3
# [9,] -4 -5 -6
# [10,] 4 -5 -6
# [11,] -4 5 -6
# [12,] 4 5 -6
# [13,] -4 -5 6
# [14,] 4 -5 6
# [15,] -4 5 6
# [16,] 4 5 6
To deal with duplicated rows caused by zeros:
res[ !duplicated(res), ]

How about this:
M0 <- rbind(
c(1, 2, 3),
c(4, 5, 6)
)
signs <- expand.grid(rep(list(c(1, -1)), ncol(M0)))
do.call(rbind, apply(M0, FUN = `*`, signs, MARGIN = 1))
EDIT:
OK, I gave up on elegance, and I do prefer the one-liner solution by @Aurèle; however, I'm editing the answer to at least get the desired output, and on the plus side it works with zeros :P
my_fun <- function(row) {
  expand.grid(lapply(row, function(x) {
    if (x != -x) c(x, -x) else x
  }))
}
do.call(rbind, apply(M0, FUN = my_fun, MARGIN = 1))

Hi, is this elegant enough? :)
x <- 1:3
signs <- as.matrix(expand.grid(x = c(-1,1), y = c(-1, 1), z = c(-1, 1)))
t(x * t(signs))
x y z
[1,] -1 -2 -3
[2,] 1 -2 -3
[3,] -1 2 -3
[4,] 1 2 -3
[5,] -1 -2 3
[6,] 1 -2 3
[7,] -1 2 3
[8,] 1 2 3
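The trick above handles a single row; to apply it to every row of M0, here is a small extension of my own (not part of the original answer):
M0 <- rbind(c(1, 2, 3), c(4, 5, 6))
do.call(rbind, lapply(seq_len(nrow(M0)), function(i) t(M0[i, ] * t(signs))))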

Related

Extract every two elements of a matrix row in R in sequence to calculate Euclidean distance

How do I extract every two elements in sequence from a matrix row and return the result as a matrix, so that I can feed the answer into a formula for calculation?
For example, I have a one-row matrix with 6 columns:
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    1    5    5   10    1
I want to extract columns 1 and 2 in the first iteration, 3 and 4 in the second, and so on. The result has to be in the form of a matrix:
[1,] 2 1
[2,] 5 5
[3,] 10 1
My original code:
data <- matrix(c(1,1,1,2,2,1,2,2,5,5,5,6,10,1,10,2,11,1,11,2), ncol = 2)
Center Matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 2 1 5 5 10 1
[2,] 1 1 2 1 10 1
[3,] 5 5 5 6 11 2
[4,] 2 2 5 5 10 1
[5,] 2 1 5 6 5 5
[6,] 2 2 5 5 11 1
[7,] 2 1 5 5 10 1
[8,] 1 1 5 6 11 1
[9,] 2 1 5 5 10 1
[10,] 5 6 11 1 10 2
objCentroidDist <- function(data, centers) {
  resultMatrix <- matrix(NA, nrow = dim(data)[1], ncol = dim(centers)[1])
  for (i in 1:nrow(centers)) {
    resultMatrix[, i] <- sqrt(rowSums(t(t(data) - centers[i, ])^2))
  }
  resultMatrix
}
objCentroidDist(data, centers)
I want the result matrix to be as per below (an empty skeleton of the desired 10 x 3 output):
      [,1] [,2] [,3]
 [1,]
 [2,]
 [3,]
 [4,]
 [5,]
 [6,]
 [7,]
 [8,]
 [9,]
[10,]
My concern is how to calculate the data-to-centers distance when the data matrix has two columns and the centers matrix has six (that is, the distance from the data matrix to every two columns of the centers matrix). Each row of the centers matrix holds three centers.
Something like this maybe?
m <- matrix(c(2, 1, 5, 5, 10, 1), ncol = 6)
list.seq.pairs <- lapply(seq(1, ncol(m), 2), function(x) {
  m[, c(x, x + 1)]
})
> list.seq.pairs
[[1]]
[1] 2 1
[[2]]
[1] 5 5
[[3]]
[1] 10 1
And, in case you're wanting to iterate over multiple rows in a matrix,
you can expand on the above like this:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
apply(mm, 1, function(x) {
  lapply(seq(1, length(x), 2), function(y) {
    x[c(y, y + 1)]
  })
})
EDIT:
I'm really not sure what you're after exactly. I think, if you want each row transformed into a 3 x 2 matrix:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
list.mats <- lapply(1:nrow(mm), function(x) {
  matrix(mm[x, ], ncol = 2, byrow = TRUE)
})
> list.mats
[[1]]
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[[2]]
[,1] [,2]
[1,] 7 8
[2,] 9 10
[3,] 11 12
[[3]]
[,1] [,2]
[1,] 13 14
[2,] 15 16
[3,] 17 18
If, however, you want to get to your results matrix- I think it's probably easiest to do whatever calculations you need to do while you're dealing with each row:
results <- t(apply(mm, 1, function(x) {
  sapply(seq(1, length(x), 2), function(y) {
    val1 <- x[y]      # get item one
    val2 <- x[y + 1]  # get item two
    val1 / val2       # do your calculation here
  })
}))
> results
[,1] [,2] [,3]
[1,] 0.5000000 0.7500 0.8333333
[2,] 0.8750000 0.9000 0.9166667
[3,] 0.9285714 0.9375 0.9444444
That said, I don't understand what you're trying to do, so this may miss the mark. You may have more luck asking a new question showing example input and the actual expected output, with the real values you expect.

Element intersection between two matrices in R

This probably has an easy solution, but I still cannot find one. I have two matrices: M1 of size (4, 2000000) and M2 of size (4, 209). I want to find the size of the element intersection between each column of M2 and every column of M1.
For one column in M2 I do:
res <- apply(M1, 2, function(x) length(intersect(tmp, x)))
where tmp is the first column of M2.
This takes about 30 seconds. To speed up the calculation for all columns of M2, I use foreach:
library(foreach)
library(doParallel)
registerDoParallel()  # %dopar% needs a registered parallel backend
res_list <- foreach(k = 1:ncol(M2)) %dopar% {
  tmp <- M2[, k]
  apply(M1, 2, function(x) length(intersect(tmp, x)))
}
This takes about 20 minutes.
Is there a way to avoid this foreach loop with an apply function?
Thank you!
Having data:
set.seed(991)
M1 = matrix(sample(5, 50, TRUE), 5)
M2 = matrix(sample(5, 25, TRUE), 5)
your solution returns:
op = sapply(1:ncol(M2),
            function(k) apply(M1, 2, function(x) length(intersect(M2[, k], x))))
op
# [,1] [,2] [,3] [,4] [,5]
# [1,] 3 1 3 2 3
# [2,] 3 2 3 3 4
# [3,] 2 2 2 2 3
# [4,] 2 3 3 2 3
# [5,] 2 2 3 1 2
# [6,] 2 2 2 2 3
# [7,] 2 3 3 2 3
# [8,] 2 2 3 3 3
# [9,] 2 2 3 3 3
#[10,] 1 3 2 1 2
which is what
ans1 = tcrossprod(table(col(M1), M1) > 0L, table(col(M2), M2) > 0L)
returns: each table(col(M), M) > 0L is a 0/1 incidence matrix recording which values occur in each column, and tcrossprod() counts, for every pair of columns, the values present in both.
all.equal(op, ans1, check.attributes = FALSE)
#[1] TRUE
Since we don't need the number of occurrences, we can replace the expensive calls to table with simple matrix manipulations:
m1 = matrix(0L, ncol(M1), max(M1))
m1[cbind(rep(1:ncol(M1), each = nrow(M1)), c(M1))] = 1L
m2 = matrix(0L, ncol(M2), max(M2))
m2[cbind(rep(1:ncol(M2), each = nrow(M2)), c(M2))] = 1L
ans2 = tcrossprod(m1, m2)
all.equal(op, ans2)
#[1] TRUE
For your case, it seems more suitable to start by building sparse tabulations, if that helps avoid memory constraints:
library(Matrix)
sm1 = sparseMatrix(x = 1L,
                   i = rep(1:ncol(M1), each = nrow(M1)),
                   j = M1,
                   use.last.ij = TRUE)
sm2 = sparseMatrix(x = 1L,
                   i = rep(1:ncol(M2), each = nrow(M2)),
                   j = M2,
                   use.last.ij = TRUE)
ans3 = tcrossprod(sm1, sm2)
all.equal(op, as.matrix(ans3), check.attributes = FALSE)
#[1] TRUE
Given your matrix dimensions, you could do this, which should be faster:
apply(m2, 2, function(x) colSums(m1==x[1] | m1==x[2] | m1==x[3] | m1==x[4]))
For example, suppose:
m1
[,1] [,2] [,3]
[1,] 3 6 4
[2,] 9 8 11
[3,] 10 1 12
[4,] 2 5 7
m2
[,1] [,2]
[1,] 3 6
[2,] 2 7
[3,] 1 5
[4,] 8 4
Then, it will give you:
[,1] [,2]
[1,] 2 0
[2,] 2 2
[3,] 0 2
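To generalize beyond exactly four rows (a sketch of my own, not from the original answer), the hardcoded comparisons m1==x[1] | ... | m1==x[4] can be built with Reduce(). Like the answer above, this counts the entries in each column of m1 whose value occurs in the given column of m2, which equals the intersection size as long as no column of m1 repeats a value:
count_shared <- function(m1, m2) {
  apply(m2, 2, function(x) {
    colSums(Reduce(`|`, lapply(unique(x), function(v) m1 == v)))
  })
}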
Update on time efficiency
To summarize, as the OP mentioned in the comments, for the same job:
- the naive for-loop solution takes about 20 minutes;
- my solution takes about 36 seconds;
- that of @alexis_laz, about 12 seconds.

Optimization: conditional test on several matrices & extraction in R

I have a matrix
mat_a <- matrix(data = c(c(rep(1, 3), rep(2, 3), rep(3, 3)),
                         rep(seq(1, 300, 100), 3),
                         runif(9, 0, 1)),
                ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 0.8393401
[2,] 1 101 0.5486805
[3,] 1 201 0.4449259
[4,] 2 1 0.3949137
[5,] 2 101 0.4002575
[6,] 2 201 0.3288861
[7,] 3 1 0.7865035
[8,] 3 101 0.2581155
[9,] 3 201 0.8987769
that I compare to another matrix with more rows:
mat_b <- matrix(data = c(c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)),
                         rep(seq(1, 300, 100), 4),
                         rep(3:5, 4)),
                ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 101 4
[3,] 1 201 5
[4,] 2 1 3
[5,] 2 101 4
[6,] 2 201 5
[7,] 3 1 3
[8,] 3 101 4
[9,] 3 201 5
[10,] 4 1 3
[11,] 4 101 4
[12,] 4 201 5
I need to extract the rows of mat_a where column 2 of both matrices matches. For those matches, the two columns 1 also have to match, and column 3 of mat_b must be greater than or equal to 4.
I cannot find any solution based on vectorization; I only came up with a loop-based solution.
output <- NULL
for (i in 1:nrow(mat_a)) {
  if (mat_a[i, 2] %in% mat_b[, 2][mat_b[, 3] >= 4]) {
    rows <- which(mat_b[, 2] %in% mat_a[i, 2])
    row <- which(mat_b[, 1][rows] == mat_a[i, 1])
    if (mat_b[, 3][rows[row]] >= 4) {
      output <- rbind(output, mat_a[i, ])
    }
  }
}
This works but is extremely slow: it took just under an hour to run. mat_a has 9 columns and 40,000 rows (it could get bigger); mat_b has 5 columns and around 1.2 million rows.
Any idea?
It is better to work with data frames when comparing tables as you are; that plays to the strengths of R's structures instead of working against them. We use a simple merge to match the correct values, subsetting b beforehand with the necessary condition, b$V3 >= 4. At the end, [-4] drops the extra column so the result more closely matches your desired output:
a <- as.data.frame(mat_a)
b <- as.data.frame(mat_b)
merge(a,b[b$V3 >= 4,], by=c("V1","V2"))[-4]
# V1 V2 V3.x
# 1 1 101 0.1118960
# 2 1 201 0.1543351
# 3 2 101 0.3950491
# 4 2 201 0.5688684
# 5 3 201 0.4749941
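If merge() ever becomes a memory problem at 1.2 million rows, a base-R alternative (a sketch of my own, not from the answer) is to build composite keys and test membership directly:
keep <- mat_b[, 3] >= 4
key_a <- paste(mat_a[, 1], mat_a[, 2])
key_b <- paste(mat_b[keep, 1], mat_b[keep, 2])
mat_a[key_a %in% key_b, , drop = FALSE]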

Mean imputation in a matrix in R

I have a matrix in R with 440 rows and 261 columns.
There are some 0 values.
In each row I need to replace the 0 values with the mean of that row.
I tried to do it with the code below, but each time it substituted only the first row's mean.
snp2 <- read.table("snp2.txt", h = T)
mean <- rowMeans(snp2)
for (k in 1:nrow(snp2)) {
  snp2[k == 0] <- mean[k]
}
Instead of looping through the rows, you could do this in one shot by identifying all the 0 indices in the matrix and replacing them with the appropriate row mean:
# Sample data
(mat <- matrix(c(0, 1, 2, 1, 0, 3, 11, 11, 11), nrow=3))
# [,1] [,2] [,3]
# [1,] 0 1 11
# [2,] 1 0 11
# [3,] 2 3 11
(zeroes <- which(mat == 0, arr.ind=TRUE))
# row col
# [1,] 1 1
# [2,] 2 2
mat[zeroes] <- rowMeans(mat)[zeroes[,"row"]]
mat
# [,1] [,2] [,3]
# [1,] 4 1 11
# [2,] 1 4 11
# [3,] 2 3 11
While you could fix up your function to replace these missing values row by row, that would not be as efficient as the one-shot approach (in addition to being more typing):
josilber <- function(mat) {
  zeroes <- which(mat == 0, arr.ind = TRUE)
  mat[zeroes] <- rowMeans(mat)[zeroes[, "row"]]
  mat
}
OP.fixed <- function(mat) {
  means <- rowMeans(mat)
  for (k in 1:nrow(mat)) {
    mat[k, ][mat[k, ] == 0] <- means[k]
  }
  mat
}
bgoldst <- function(m) ifelse(m == 0, rowMeans({ mt <- m; mt[mt == 0] <- NA; mt }, na.rm = TRUE)[row(m)], m)
# 4400 x 2610 matrix
bigger <- matrix(sample(0:10, 4400*2610, replace=TRUE), nrow=4400)
all.equal(josilber(bigger), OP.fixed(bigger))
# [1] TRUE
# bgoldst differs because it takes means of non-zero values only
library(microbenchmark)
microbenchmark(josilber(bigger), OP.fixed(bigger), bgoldst(bigger), times=10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# josilber(bigger) 262.541 382.0706 406.1107 395.3815 452.0872 532.4742 10
# OP.fixed(bigger) 1033.071 1184.7288 1236.6245 1238.8298 1271.7677 1606.6737 10
# bgoldst(bigger) 3820.044 4033.5826 4368.5848 4201.6302 4611.9697 5581.5514 10
For a fairly large matrix (4400 x 2610), the one-shot procedure is about 3 times faster than the fixed-up solution from the question and about 10 times faster than the one proposed by @bgoldst.
Here's a solution using ifelse(), assuming you want to exclude zeroes from the mean calculation; rowMeans(...)[row(m)] expands the per-row means to the full matrix shape so ifelse() can pick from them elementwise:
NR <- 5; NC <- 5;
set.seed(1); m <- matrix(sample(c(rep(0,5),1:5),NR*NC,replace=T),NR);
m;
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 4 0 0 5
## [2,] 0 5 0 3 0
## [3,] 1 2 2 5 2
## [4,] 5 2 0 0 0
## [5,] 0 0 3 3 0
ifelse(m==0,rowMeans({ mt <- m; mt[mt==0] <- NA; mt; },na.rm=T)[row(m)],m);
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4.5 4 4.5 4.5 5.0
## [2,] 4.0 5 4.0 3.0 4.0
## [3,] 1.0 2 2.0 5.0 2.0
## [4,] 5.0 2 3.5 3.5 3.5
## [5,] 3.0 3 3.0 3.0 3.0

Subset data.table altering subset limit between factor levels

I am having a hard time trying to subset a data.table (from the data.table package) in R. Consider the following example:
library(data.table)
x = c(rep("a", 6), rep("b", 5))
y = c(0,2,1,0,1,2, 0,1,0,2,1)
z = c(1:6,1:5) + rnorm(11, 0.02, 0.1)
DT = data.table(ind = x, cond = y, dist = z)
ind cond dist
[1,] a 0 1.078966
[2,] a 2 1.987159
[3,] a 1 3.143391
[4,] a 0 3.937058
[5,] a 1 5.037681
[6,] a 2 6.036432
[7,] b 0 1.057809
[8,] b 1 2.144755
[9,] b 0 3.010903
[10,] b 2 3.937765
[11,] b 1 4.976273
I want to subset everything after the first 1 in the cond column. In other words, everything with dist larger than 3.143391 for a and 2.144755 for b (in this example).
DT.sub <- DT[cond == "1",] # Please, combine this row
DT.sub[,.SD[dist==min(dist)],by=ind] # With this to make the code shorter, if you can.
ind cond dist
[1,] a 1 3.143391
[2,] b 1 2.144755
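As an aside, the two steps above can be combined into a single data.table call (a sketch of my own, not from the original question):
DT[cond == 1, .SD[dist == min(dist)], by = ind]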
The result should look like this:
ind cond dist
[1,] a 0 3.937058
[2,] a 1 5.037681
[3,] a 2 6.036432
[4,] b 0 3.010903
[5,] b 2 3.937765
[6,] b 1 4.976273
How about:
DT[,.SD[seq(match(1,cond)+1,.N)],by=ind]
ind cond dist
[1,] a 0 3.937058
[2,] a 1 5.037681
[3,] a 2 6.036432
[4,] b 0 3.010903
[5,] b 2 3.937765
[6,] b 1 4.976273
Btw, it's good to set.seed(1) first so we can work with the same random data.
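One caveat with seq(match(1, cond) + 1, .N) (my own observation, not from the answer): if the first 1 is the last row of a group, seq() counts downwards and returns the wrong rows. A slightly more defensive variant drops the rows up to and including the first 1, though it still assumes every group contains at least one 1 (match() would return NA otherwise):
DT[, .SD[-seq_len(match(1, cond))], by = ind]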
