Quickly identify duplicated rows (with the same entries) in a matrix in R

I have a matrix like this:
M <- rbind(c("CD4", "CD8"),
c("CD8", "CD4"),
c("DN", "CD8"),
c("CD8", "DN"),
c("CD4", "DN"),
c("DN", "CD4"))
The 1st and 2nd rows are duplicates, as are the 3rd and 4th and the 5th and 6th, since each pair contains the same elements (regardless of order).
I know that the following code can do it:
Msort <- t(apply(M, 1, sort))
duplicated(Msort)
I want to get this Logical vector:
> duplicated(Msort)
[1] FALSE TRUE FALSE TRUE FALSE TRUE
But if the matrix is large, say 10,000 rows and 10,000 columns, how can I handle this efficiently?
Thanks.

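M has no column names, so it cannot be indexed with "V1" and "V2". Instead, sort within each row first and use the resulting logical vector to index M. This returns the duplicated rows themselves:
M[duplicated(t(apply(M, 1, sort))), ]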
# [,1] [,2]
#[1,] "CD8" "CD4"
#[2,] "CD8" "DN"
#[3,] "DN" "CD4"
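For the large-matrix case raised in the question, one option is to avoid calling sort() once per row. The sketch below sorts all entries with a single order() call, keyed first by row index and then by value, and rebuilds the row-sorted matrix from that. The helper name sort_rows is hypothetical, and whether this is actually faster than apply() for a 10,000 x 10,000 matrix is an assumption worth checking with timings on your own data.

```r
# Hypothetical helper: sort the entries within each row without apply().
sort_rows <- function(M) {
  # Order all cells by row index first, then by value within each row.
  ord <- order(as.vector(row(M)), as.vector(M))
  # The cells now come out row by row, so fill the result by row.
  matrix(M[ord], nrow = nrow(M), ncol = ncol(M), byrow = TRUE)
}

M <- rbind(c("CD4", "CD8"),
           c("CD8", "CD4"),
           c("DN", "CD8"),
           c("CD8", "DN"),
           c("CD4", "DN"),
           c("DN", "CD4"))

dup <- duplicated(sort_rows(M))
dup
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
```

The result is the same logical vector as duplicated(t(apply(M, 1, sort))) for this example.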

Related

Return the indices of the k smallest values of a matrix in R

Suggest I have the following matrix in R:
set.seed(123) # the only way to have a reproducible result
m <- matrix(runif(8,1,20), 8,1)
m
[,1]
[1,] 6.463973
[2,] 15.977798
[3,] 8.770562
[4,] 17.777331
[5,] 18.868878
[6,] 1.865573
[7,] 11.034004
[8,] 17.955962
I now wish to return the indices of the 4 smallest values, in sorted order. In this example I want an object that contains the indices 6, 1, 3, 7, ordered from the smallest to the 4th smallest value. Thanks!
sort() arranges the matrix elements in ascending order. Setting index.return to TRUE makes it return the original indices in addition to the sorted values, and $ix extracts those indices directly. Subset the first four to get the 4 smallest:
sort(m,index.return = TRUE)$ix[1:4]
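A shorter route to the same indices is order(), which returns the permutation that sorts its input. The sketch below rebuilds m from the values printed in the question (so it does not depend on the random seed) and keeps the first k positions; the names k and smallest_idx are only illustrative.

```r
# Values copied from the matrix printed in the question.
m <- matrix(c(6.463973, 15.977798, 8.770562, 17.777331,
              18.868878, 1.865573, 11.034004, 17.955962), 8, 1)

k <- 4
# order() gives the indices that would sort m; keep the first k.
smallest_idx <- head(order(m), k)
smallest_idx
# [1] 6 1 3 7

# The corresponding values, from smallest to 4th smallest.
m[smallest_idx]
```

Using head() instead of [1:k] avoids NA entries if k is larger than the number of elements.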

Using which(), !is.na() and parameter like [1,]

Can someone describe exactly (I understand partially) what the following line does?
which(!is.na(table[1,]))
1) table[1,] = ? Is it row 1 or column 1 of an object called "table"?
2) !is.na = why the !? (is.na is used to eliminate the NA, but why the !? Normally, ! represents negation, i.e. "not".)
If we split the expression into pieces,
table[1,]
subsets the first row of the dataset
is.na(table[1,])
checks whether there are NA values in the first row. It returns a vector of logical elements (TRUE for NA and FALSE for non-NA).
! is the negation operator. It converts TRUE to FALSE and vice versa, giving a logical vector that is TRUE for the non-NA elements
!is.na(table[1,])
and lastly the which wrapper gives the numeric indices of the TRUE values
To demonstrate, say we have a matrix
m1 <- matrix(c(NA, 0, 1, 2), 2, 2)
Then, if we follow the steps
m1[1,] #returns the 1st row as a vector
#[1] NA 1
is.na(m1[1,]) #returns TRUE for NA
#[1] TRUE FALSE
!is.na(m1[1,]) #returns TRUE for non-NA elements
#[1] FALSE TRUE
which(!is.na(m1[1,]))
#[1] 2
#or perhaps more usefully
which(is.na(m1[1,]))
#[1] 1
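The same steps apply when the object is a data frame rather than a matrix. The sketch below uses a made-up data frame named tbl (a hypothetical stand-in for "table" in the question) and maps the positions returned by which() back to column names.

```r
# Hypothetical data frame standing in for "table" in the question.
tbl <- data.frame(a = c(NA, 1), b = c(2, NA), c = c(3, 4))

first_row <- tbl[1, ]          # one-row data frame
non_na <- !is.na(first_row)    # 1 x 3 logical matrix: FALSE TRUE TRUE
idx <- unname(which(non_na))   # positions of the non-NA columns
idx
# [1] 2 3

# Translate the positions back into column names.
names(tbl)[idx]
# [1] "b" "c"
```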

reporting identical values across columns in matrix

I have a matrix that I am performing a for loop over. I want to know if the values of position i in the for loop exist anywhere else in the matrix, and if so, report TRUE. The matrix looks like this
dim
x y
[1,] 5 1
[2,] 2 2
[3,] 5 1
[4,] 5 9
In this case, dim[1,] is the same as dim[3,] and should therefore report TRUE if I am in position i=1 in the for loop. I could write another for loop to deal with this, but I am sure there are more clever and possibly vectorized ways to do this.
We can use duplicated
duplicated(m1)|duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE TRUE FALSE
duplicated(m1) gives a logical vector with one element per row. An element is TRUE when that row duplicates an earlier row
duplicated(m1)
#[1] FALSE FALSE TRUE FALSE
In this case, the third row is a duplicate of the first row. If we need to flag both the first and the third row, we can also check for duplicates from the reverse direction and use | to make both positions TRUE, i.e.
duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE FALSE FALSE
duplicated(m1)|duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE TRUE FALSE
According to ?duplicated, the input data can be
x: a vector or a data frame or an array or ‘NULL’.
data
m1 <- cbind(x=c(5,2,5,5), y=c(1,2,1,9))
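If the for loop from the question has to stay, the flags can be computed once before the loop and looked up by position, so no nested loop is needed. This is a minimal sketch; the name has_twin and the message text are illustrative only.

```r
m1 <- cbind(x = c(5, 2, 5, 5), y = c(1, 2, 1, 9))

# TRUE for every row that has an identical row somewhere else in m1.
has_twin <- duplicated(m1) | duplicated(m1, fromLast = TRUE)

for (i in seq_len(nrow(m1))) {
  if (has_twin[i]) {
    # Row i matches at least one other row.
    message("row ", i, " also appears elsewhere in the matrix")
  }
}
```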

Remove rows from dataframe with boolean values

I have the following dataframe a:
> a <- cbind(c(FALSE,FALSE,TRUE,TRUE),c(TRUE,FALSE,FALSE,TRUE))
> a
[,1] [,2]
[1,] FALSE TRUE
[2,] FALSE FALSE
[3,] TRUE FALSE
[4,] TRUE TRUE
I want to remove all rows whose first column value and second column value are both FALSE. Note that I do have some other, non-boolean columns.
So you want to keep each row which contains at least one TRUE column:
keep <- a[,1] | a[,2]
a <- a[keep, ]
You can use rowSums.
a[(rowSums(a[,1:2])!=0),]
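Since the question mentions additional non-boolean columns, the same filter can be sketched on a data frame. The column names flag1, flag2 and value below are made up for illustration; only the two logical columns take part in the test.

```r
# Hypothetical data frame: two logical columns plus one numeric column.
df <- data.frame(flag1 = c(FALSE, FALSE, TRUE, TRUE),
                 flag2 = c(TRUE, FALSE, FALSE, TRUE),
                 value = c(10, 20, 30, 40))

# Keep rows where at least one of the two flags is TRUE.
kept <- df[df$flag1 | df$flag2, ]
kept$value
# [1] 10 30 40
```

Only the second row, where both flags are FALSE, is dropped; the numeric column is carried along unchanged.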

How to specify the last column of a matrix

I have a matrix A. How can I refer to its last column? I want to sort the matrix based on it.
> A <- matrix(rnorm(16), 4, 4)
> ncol(A)
[1] 4
> # Get the last column
> A[,ncol(A)]
[1] 0.7593943 0.0726012 2.2784912 -0.2571095
> # If you want to sort based on the last column...
> A[order(A[,ncol(A)]),]
[,1] [,2] [,3] [,4]
[1,] -0.9013910 -0.06612518 -1.51267548 -0.2571095
[2,] 0.3851738 -0.81303780 0.01062751 0.0726012
[3,] -1.6940473 -1.15323294 -1.50261705 0.7593943
[4,] 0.3120409 -0.30047966 0.59672449 2.2784912
If A is your matrix then the last column of A is:
A[,ncol(A)]
If you are not familiar with bracket indexing in R, this code selects all rows of A (because the position before the comma is left blank) and then the last column of A. R indexing begins at 1 (unlike languages like Python), so the index of the last column equals ncol(A), which returns the number of columns in A as an integer. Indexing in this way gives your desired result.
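A small deterministic sketch of the sort-by-last-column idiom follows, using fixed values instead of rnorm() so the result is easy to check by eye. Adding drop = FALSE is an optional safeguard (not part of the answers above) that keeps the result a matrix even when A has a single row.

```r
# Small fixed matrix: first column 3, 1, 2 and last column 9, 7, 8.
A <- matrix(c(3, 1, 2,
              9, 7, 8), nrow = 3)

last <- A[, ncol(A)]                       # values of the last column
sorted <- A[order(last), , drop = FALSE]   # reorder rows by that column
sorted
#      [,1] [,2]
# [1,]    1    7
# [2,]    2    8
# [3,]    3    9
```

For descending order, order(last, decreasing = TRUE) can be used in the same position.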
