Identical rows between two matrices in R

Let's say we have two matrices, M1 and M2, of dimensions n1 x m and n2 x m, respectively.
How can we find which rows of M1 are identical to rows of M2 (and vice versa)?
The preferred output is a two-column matrix with one row per pair of identical rows: the first column holds the row number in M1 and the second the row number in M2.

There might be a slicker way, but this seems to work...
#dummy data
M1 <- matrix(1:8,ncol=2)
M2 <- matrix(c(1,3,4,5,6,8),ncol=2)
M1
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
M2
[,1] [,2]
[1,] 1 5
[2,] 3 6
[3,] 4 8
which(apply(M2, 1, function(v)
apply(M1, 1, function(w) sum(abs(w-v))))==0,
arr.ind = TRUE)
row col
[1,] 1 1
[2,] 4 3
The row column is the row index of M1, the col column is the index of matching rows in M2.
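If the rows contain values that print unambiguously (integers, short strings), a similar result can be sketched by collapsing each row into a string key and comparing all keys at once. This is only an illustration under that assumption, not a replacement for the numeric comparison above; the names key1, key2 and pairs are made up for the example.

```r
M1 <- matrix(1:8, ncol = 2)
M2 <- matrix(c(1, 3, 4, 5, 6, 8), ncol = 2)

# Collapse every row into one string key (assumes values print unambiguously)
key1 <- apply(M1, 1, paste, collapse = " ")
key2 <- apply(M2, 1, paste, collapse = " ")

# Compare every key of M1 with every key of M2 and keep the matching positions
pairs <- which(outer(key1, key2, "=="), arr.ind = TRUE)
colnames(pairs) <- c("M1", "M2")
pairs
```

Because outer() compares all combinations, a row of M1 that occurs several times in M2 is reported once per occurrence, which also covers the "vice versa" direction.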

Create example matrices with 4 matching rows
set.seed(0)
M1 <- matrix(runif(100), 10)
M2 <- rbind(M1[sample(10, 4),], matrix(runif(60), 6))
Create output
splits <- lapply(list(M1, M2), function(x) split(x, row(x)))
out <- cbind(M1 = seq(nrow(M1)), M2 = do.call(match, splits))
out[!is.na(out[,2]),]
# M1 M2
# [1,] 2 4
# [2,] 3 3
# [3,] 6 2
# [4,] 7 1
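One thing to keep in mind with this approach: match() returns only the position of the first match, so a row of M1 that appears more than once in M2 is reported a single time. A small sketch with made-up data (not the matrices above) shows this:

```r
M1 <- matrix(c(1, 2, 5, 6), ncol = 2)        # rows: (1,5), (2,6)
M2 <- matrix(c(1, 1, 9, 5, 5, 9), ncol = 2)  # rows: (1,5), (1,5), (9,9)

# Same idea as above: one list element per row, then match the two lists
splits <- lapply(list(M1, M2), function(x) split(x, row(x)))
idx <- do.call(match, splits)
idx  # position in M2 of the first row equal to each row of M1, NA if none
```

Here the first row of M1 equals rows 1 and 2 of M2, but only row 1 is returned.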

Related

Identify groups of identical rows in a matrix

tl;dr What is the idiomatic way to identify groups of identical rows in a matrix in R?
Given an n-by-2 matrix where some rows occur more than once,
> mat <- matrix(c(2,5,5,3,4,6,2,5,4,6,4,6), ncol=2, byrow=TRUE)
> mat
[,1] [,2]
[1,] 2 5
[2,] 5 3
[3,] 4 6
[4,] 2 5
[5,] 4 6
[6,] 4 6
I am looking to get the groups of row indices of identical rows. In the example above, rows (1,4) are identical, and so are rows (3,5,6). Finally, there is row (2). I am looking to get these groups, represented in whatever way is idiomatic in R.
The output could be something like this,
> groups <- matrix(c(1,1, 2,2, 3,3, 4,1, 5,3, 6,3), ncol=2, byrow=TRUE)
> groups
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 1
[5,] 5 3
[6,] 6 3
where the first column contains the row indices of mat and the second the group index for each row index. Or it could be like this:
> split(groups[,1], groups[,2])
$`1`
[1] 1 4
$`2`
[1] 2
$`3`
[1] 3 5 6
Either will do. I am not sure what is the best way to represent groups in R, and advice on this is also welcome.
For benchmarking purposes, here's a larger dataset:
set.seed(123)
n <- 10000000
mat <- matrix(sample.int(10, 2*n, replace = TRUE), ncol=2)
Paste each row into a single key, then cbind the sequence of row numbers with the match of each key against the unique keys:
v1 <- paste(mat[,1], mat[,2])
# or if there are more columns
#v1 <- do.call(paste, as.data.frame(mat))
out <- cbind(seq_len(nrow(mat)), match(v1, unique(v1)))
-output
> out
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 1
[5,] 5 3
[6,] 6 3
If we want a list output
split(out[,1], out[,2])
-output
$`1`
[1] 1 4
$`2`
[1] 2
$`3`
[1] 3 5 6
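For matrices with any number of columns, the same idea could be wrapped in a small helper. This is a sketch only: row_groups is a hypothetical name, and the "\r" separator is an assumption chosen because it is unlikely to occur inside the data.

```r
# Hypothetical helper: group index for every row of a matrix
row_groups <- function(m) {
  # One string key per row; the separator is assumed not to occur in the values
  key <- do.call(paste, c(as.data.frame(m), sep = "\r"))
  match(key, unique(key))
}

mat <- matrix(c(2,5,5,3,4,6,2,5,4,6,4,6), ncol = 2, byrow = TRUE)
g <- row_groups(mat)
g                        # group index per row
split(seq_along(g), g)   # row indices per group
```

Groups are numbered in order of first appearance, so the first row always belongs to group 1.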
Benchmarks
With the OP's big data
> system.time({
+ v1 <- paste(mat[,1], mat[,2])
+
+ out <- cbind(seq_len(nrow(mat)), match(v1, unique(v1)))
+
+ })
user system elapsed
2.603 0.130 2.706

rowsums for matrix over randomly specified subsets of columns in R

I have this matrix
mu<-1:100
sigma<-100:1
sample.size<-10
toy.mat<-mapply(function(x,y){rnorm(x,y,n=sample.size)},x=mu,y=sigma)
colnames(toy.mat) <- c(rep(1,10),rep(2,10), rep(3,10), rep(4,10), rep(5,10),
rep(6,10), rep(7,10), rep(8,10), rep(9,10), rep(10,10) )
For the 10 columns named 1, I would like to randomly select 5 pairs and take the rowSums of each pair, generating 5 columns named 1a, 1b, 1c, 1d, 1e. I will do the same for the columns named 2, 3, and so on up to 10.
Is there a data.table method to do this?
I'm still unsure about what you're trying to do.
This is what I understood.
I first split toy.mat into a list of 10 matrices (chunks). This is for convenience.
# Split toy.mat into list of matrices
lst <- lapply(seq(1, 100, by = 10), function(i) toy.mat[, i:(i+9)]);
Next, generate 5 random pairs by sampling all 10 numbers from the sequence 1:10 without replacement and arranging them in a 2x5 matrix, so that each column holds one pair. Repeat for all 10 matrix chunks.
# Generate 5 random pairs
set.seed(2017); # For reproducibility of results
rand <- replicate(10, matrix(sample(1:10, 10), ncol = 5), simplify = FALSE);
head(rand, n = 2);
#[[1]]
# [,1] [,2] [,3] [,4] [,5]
#[1,] 10 4 9 1 6
#[2,] 5 3 8 2 7
#
#[[2]]
# [,1] [,2] [,3] [,4] [,5]
#[1,] 7 9 3 5 10
#[2,] 1 4 2 6 8
Select corresponding columns based on pairs from rand and calculate the rowSums. Do that for every matrix chunk.
# Select column pairs and calculate rowSums
lst.rand <- lapply(1:10, function(i)
sapply(as.data.frame(rand[[i]]), function(w) rowSums(lst[[i]][, w])));
Bind list elements into matrix, and set column names.
# Bind into a single matrix
mat <- do.call(cbind, lst.rand);
colnames(mat) <- as.vector(sapply(1:10, function(i) paste0(i, letters[1:5])));
mat[1:5, 1:6];
# 1a 1b 1c 1d 1e 2a
#[1,] 21.410826 34.90337 -11.297396 -50.56332 -115.82456 51.32369
#[2,] 5.323713 -144.26640 169.697538 -58.35540 96.25637 -78.95717
#[3,] -78.925937 -45.32790 -177.546469 251.69348 -52.85132 123.38741
#[4,] -33.673704 -95.64937 3.561921 -253.95046 -136.88182 -10.20650
#[5,] 51.080564 -180.87033 -161.861342 108.41120 188.07454 52.34226
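To see the pairing step in isolation, here is a reduced sketch on a single made-up chunk of 3 rows and 10 columns (chunk, pairs and pair_sums are names invented for this illustration). Since the 5 pairs are disjoint and cover all 10 columns, the pair sums must add up to the row sums of the whole chunk, which gives a quick sanity check.

```r
set.seed(1)
chunk <- matrix(rnorm(30), nrow = 3)            # 3 rows, 10 columns

# Shuffle the 10 column indices; each column of `pairs` is one random pair
pairs <- matrix(sample(ncol(chunk)), nrow = 2)

# For each pair, sum the two selected columns row by row
pair_sums <- apply(pairs, 2, function(p) rowSums(chunk[, p]))
colnames(pair_sums) <- paste0(1, letters[1:5])

# Sanity check: the 5 pair sums together equal the row sums of the chunk
all.equal(rowSums(pair_sums), rowSums(chunk))
```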

what does rbind.fill.matrix really do?

I have this code and can't understand how rbind.fill.matrix is used.
dtmat is a matrix with the documents on rows and words on columns.
word <- do.call(rbind.fill.matrix,lapply(1:ncol(dtmat), function(i) {
t(rep(1:length(dtmat[,i]), dtmat[,i]))
}))
I read the description of the function; it says that it binds matrices and fills missing columns with NA, but I cannot understand which matrices are being bound here.
From what I understand, the function fills in NA for columns that a matrix does not have.
Let's say I have two matrices: A with two columns, col1 and col2, and B with three columns, col1, col2 and colA. Base rbind only binds matrices with the same number of columns, so it cannot stack A and B. rbind.fill.matrix binds them anyway and puts NA in every cell of a column that one of the matrices lacks. The code below explains it more clearly.
> a <- matrix(c(1,1,2,2), nrow = 2, byrow = TRUE)
> a
[,1] [,2]
[1,] 1 1
[2,] 2 2
>
> b <- matrix(c(1,1,1,2,2,2,3,3,3), nrow = 3, byrow = TRUE)
> b
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
>
> library(plyr)
> r <- rbind.fill.matrix(a,b)
> r
1 2 3
[1,] 1 1 NA
[2,] 2 2 NA
[3,] 1 1 1
[4,] 2 2 2
[5,] 3 3 3
The documentation also covers column names: when the matrices have them, columns are matched by name rather than by position, following the same logic as the example.
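To make the name-based behaviour concrete without depending on plyr, here is a rough base-R sketch of the idea. It is not plyr's implementation: rbind_fill_sketch is a hypothetical helper that assumes every input matrix has column names.

```r
# Hypothetical helper mimicking the idea of rbind.fill.matrix in base R.
# Assumes every matrix passed in has column names.
rbind_fill_sketch <- function(...) {
  mats <- list(...)
  # Union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(mats, colnames)))
  filled <- lapply(mats, function(m) {
    # Start from an all-NA matrix and copy in the columns that m has
    out <- matrix(NA, nrow = nrow(m), ncol = length(all_cols),
                  dimnames = list(NULL, all_cols))
    out[, colnames(m)] <- m
    out
  })
  do.call(rbind, filled)
}

A <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("col1", "col2")))
B <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("col1", "col2", "colA")))
r <- rbind_fill_sketch(A, B)
r
```

The rows coming from A get NA in colA, because A has no such column, while the rows coming from B are copied unchanged.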

Moving Data from Matrix A to Matrix B in R

I want to move (cut and paste) some data, 2500 cells to be precise, from matrix A to matrix B in R.
For example, move cell (i,j) of matrix A to cell (i,j) of matrix B, where both i and j range over a fixed number of values (50 to be precise), and replace that cell (i,j) in matrix A with 0.
Since I am new to programming, can anyone help me with the code?
Thanks in Advance
Regards
You can first define a two-column coordinate matrix of the cells you want to replace, where the first column is the row index and the second column is the column index. As an example, suppose you want to replace the cells c(2,1), c(2,2) and c(1,2) in a 3x3 matrix B with the values from a 3x3 matrix A:
ind <- cbind(c(2,2,1), c(1,2,2))
A <- matrix(1:9, ncol = 3)
B <- matrix(NA, ncol = 3, nrow = 3)
B[ind] <- A[ind]; A[ind] <- 0
B
[,1] [,2] [,3]
[1,] NA 4 NA
[2,] 2 5 NA
[3,] NA NA NA
A
[,1] [,2] [,3]
[1,] 1 0 7
[2,] 0 0 8
[3,] 3 6 9
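If the 2500 cells in the question form a 50 x 50 block (an assumption on my part, since 50 row values times 50 column values gives 2500), plain row and column index vectors are enough. The 60 x 60 size and the names rows and cols below are invented for the sketch.

```r
A <- matrix(seq_len(3600), nrow = 60)   # made-up 60 x 60 source matrix
B <- matrix(NA, nrow = 60, ncol = 60)   # empty target of the same shape
A_orig <- A                             # kept only to check the result

rows <- 1:50                            # assumed row indices to move
cols <- 1:50                            # assumed column indices to move

B[rows, cols] <- A[rows, cols]          # copy the block into B
A[rows, cols] <- 0                      # then blank it out in A

sum(!is.na(B))                          # number of cells moved
```

For cells that do not form a rectangular block, the coordinate-matrix approach above is the one to use.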

Questions about missing data

Suppose a matrix has some missing data recorded as NA.
How can I delete the rows that contain NA?
Can I use na.rm?
na.omit() will take matrices (and data frames) and return only those rows with no NA values whatsoever - it takes complete.cases() one step further by deleting the FALSE rows for you.
> x <- data.frame(c(1,2,3), c(4, NA, 6))
> x
c.1..2..3. c.4..NA..6.
1 1 4
2 2 NA
3 3 6
> na.omit(x)
c.1..2..3. c.4..NA..6.
1 1 4
3 3 6
I think na.rm is an argument of summary functions such as mean() rather than a way to drop rows. I would go with complete.cases: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.htm
Let's say you have the following 3x3 matrix:
x <- matrix(c(1:8, NA), 3, 3)
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 NA
then you can get the complete cases of this matrix with
y <- x[complete.cases(x),]
> y
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
The complete.cases-function returns a vector of truth values that says whether or not a case is complete:
> complete.cases(x)
[1] TRUE TRUE FALSE
You then use this logical vector as the row index of x, leaving the column index after the comma empty to keep all columns.
If you want to remove rows that contain NA's you can use apply() to apply a quick function to check each row. E.g., if your matrix is x,
goodIdx <- apply(x, 1, function(r) !any(is.na(r)))
newX <- x[goodIdx,]
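The approaches in this thread can be compared side by side on the same matrix. The sketch below also adds a vectorised variant based on rowSums(is.na(x)), which is my own substitution for the apply() call, and uses drop = FALSE so the result stays a matrix even if only one row survives.

```r
x <- matrix(c(1:8, NA), 3, 3)

# Rows without any NA, computed without apply()
keep <- rowSums(is.na(x)) == 0

y1 <- x[complete.cases(x), , drop = FALSE]  # complete.cases approach
y2 <- x[keep, , drop = FALSE]               # vectorised variant (substitution)
y3 <- na.omit(x)                            # also records the dropped rows

identical(y1, y2)
attr(y3, "na.action")                       # which rows na.omit removed
```

na.omit() returns the same rows but attaches an "na.action" attribute, so compare its values rather than using identical() against the other results.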

Resources