Order a matrix depending on the rows and columns of another in R

Hello, I need to order the column and row names of a matrix according to another matrix. Here is an example:
M1
D E F
A 1 2 3
B 4 5 6
C 7 8 9
M2
F D E
C T F F
A F T T
So here I would like to (1) sort the M2 columns so they are in the same order as in M1,
and then (2) sort the rows (as you can see, row B from M1 is missing in M2, so I simply add a new one filled with F values).
New_M2
D E F
A T T F
B F F F
C F F T
I know, for example, how to sort the columns using M2[, colnames(M1)], but that is all...

Step 1. Match column and row names of M1 and M2
M3 <- M2[match(rownames(M1), rownames(M2)),
         match(colnames(M1), colnames(M2))]
# D E F
# A TRUE TRUE FALSE
# <NA> NA NA NA
# C FALSE FALSE TRUE
Step 2. Set the dimnames and replace missing values with FALSE
dimnames(M3) <- dimnames(M1)
M3[is.na(M3)] <- FALSE
# D E F
# A TRUE TRUE FALSE
# B FALSE FALSE FALSE
# C FALSE FALSE TRUE
Data
M1 <- matrix(1:9, 3, 3, T, dimnames = list(c("A", "B", "C"), c("D", "E", "F")))
M2 <- matrix(c(T, F, F, T, F, T), 2, 3, dimnames = list(c("C", "A"), c("F", "D", "E")))
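A short aside (not part of the original answer): plain name-based indexing cannot be used directly here, because row "B" of M1 does not exist in M2 and character indexing errors on a missing name; match() returns NA instead, which is why Step 2 then fills in the NA row.
# Sketch, using the M1/M2 objects defined in Data above
try(M2[rownames(M1), colnames(M1)])
# Error in M2[rownames(M1), colnames(M1)] : subscript out of bounds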

Here's a way, but perhaps it isn't the best one:
# Get the rownames which are missing from M2
diff_row <- setdiff(rownames(M1), rownames(M2))
# Create a matrix with `FALSE` values for those rownames
M3 <- matrix(FALSE, nrow = length(diff_row), ncol = ncol(M2),
             dimnames = list(diff_row, colnames(M2)))
# rbind it to the M2 matrix
M4 <- rbind(M2, M3)
# Rearrange based on the M1 matrix
M4[rownames(M1), colnames(M1)]
# D E F
#A TRUE TRUE FALSE
#B FALSE FALSE FALSE
#C FALSE FALSE TRUE
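For completeness, a compact sketch of the same idea (not from either answer above), using the M1/M2 objects from the Data block: pre-allocate a FALSE matrix with M1's dimnames and fill it by name.
New_M2 <- matrix(FALSE, nrow(M1), ncol(M1), dimnames = dimnames(M1))
New_M2[rownames(M2), colnames(M2)] <- M2  # name-based assignment handles the reordering
New_M2
#       D     E     F
# A  TRUE  TRUE FALSE
# B FALSE FALSE FALSE
# C FALSE FALSE  TRUE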

Related

Sort matrix by colnames from another matrix

I have two matrices with the same dimensions and they both have the same stock names as colnames, but in a different order!
I would like to sort the matrix "A" by the colnames of the matrix "B".
So the A colnames and the corresponding values should be in the same order as the colnames of B.
How can I do this?
Example:
Kind Regards
Your example in R terms would be
A <- matrix(c(1, 4, 2), nrow = 1)
colnames(A) <- c("B", "D", "E")
A
# B D E
# [1,] 1 4 2
B <- matrix(c(2, 5, 1), nrow = 1)
colnames(B) <- c("E", "B", "D")
B
# E B D
# [1,] 2 5 1
Then we may simply subset the columns of A in the same order as they are in B:
A[, colnames(B)]
# E B D
# 2 1 4
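A side note (not from the original answer): because A has only one row, selecting several columns drops the result to a named vector, which is what the output above shows. A minimal sketch if you want to keep the matrix structure:
A[, colnames(B), drop = FALSE]
#      E B D
# [1,] 2 1 4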

Summarise a logical Matrix [duplicate]

I have a large matrix filled with TRUE/FALSE values in each column. Is there a way I can summarize the matrix so that every row is unique and I have a new column with the count of how often that row appeared?
Example:
A B C D E
[1] T F F T F
[2] T T T F F
[3] T F F T T
[4] T T T F F
[5] T F F T F
Would become:
A B C D E total
[1] T F F T F 2
[2] T T T F F 2
[3] T F F T T 1
EDIT
I cbind this matrix with a new column rev so I now have a data.frame that looks like
A B C D E rev
[1] T F F T F 2
[2] T T T F F 3
[3] T F F T T 5
[4] T T T F F 2
[5] T F F T F 1
And I would like a data.frame that also sums the rev column, as follows:
A B C D E rev total
[1] T F F T F 3 2
[2] T T T F F 5 2
[3] T F F T T 5 1
An approach with dplyr:
Use as.data.frame (or, here, as_tibble) first if you start from a matrix. You need a data.frame in the end anyway, as you'll have both numeric and logical columns in your table.
mat <- matrix(
c(T, F, F, T, F, T, T, T, F, F, T, F, F, T, T, T, T, T, F, F, T, F, F, T, F),
ncol = 5,
byrow = TRUE,
dimnames = list(NULL, LETTERS[1:5])
)
library(dplyr)
mat %>%
as_tibble %>% # convert matrix to tibble, to be able to group
group_by_all %>% # group by every column so we can count by group of equal values
tally %>% # tally will add a count column and keep distinct grouped values
ungroup # ungroup the table to be clean
#> # A tibble: 3 x 6
#> A B C D E n
#> <lgl> <lgl> <lgl> <lgl> <lgl> <int>
#> 1 TRUE FALSE FALSE TRUE FALSE 2
#> 2 TRUE FALSE FALSE TRUE TRUE 1
#> 3 TRUE TRUE TRUE FALSE FALSE 2
Created on 2018-05-29 by the reprex package (v0.2.0).
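For the edited question (summing the rev column as well), a possible dplyr sketch, assuming the edited data.frame is named df with columns A:E and rev, dplyr loaded as above and version >= 1.0 for across() (this is not part of the original answer):
df %>%
  group_by(across(A:E)) %>%                              # group by the logical columns
  summarise(rev = sum(rev), total = n(), .groups = "drop")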
And a base solution:
df <- as.data.frame(mat)
df$n <- 1
aggregate(n ~ ., df, sum)
# A B C D E n
# 1 TRUE TRUE TRUE FALSE FALSE 2
# 2 TRUE FALSE FALSE TRUE FALSE 2
# 3 TRUE FALSE FALSE TRUE TRUE 1
Or as a one-liner: aggregate(n ~ ., data.frame(mat, n = 1), sum)
The count function from plyr is exactly what you are looking for (suppose m is your matrix):
plyr::count(m)
# x.A x.B x.C x.D x.E freq
#1 TRUE FALSE FALSE TRUE FALSE 2
#2 TRUE FALSE FALSE TRUE TRUE 1
#3 TRUE TRUE TRUE FALSE FALSE 2
If you have an object mat as defined in #Moody_Mudskipper's answer, you can do
library(data.table)
dt <- as.data.table(mat)
dt[, .N, by = names(dt)]
# A B C D E N
# 1: TRUE FALSE FALSE TRUE FALSE 2
# 2: TRUE TRUE TRUE FALSE FALSE 2
# 3: TRUE FALSE FALSE TRUE TRUE 1
Explanation
by = <names> divides the data table into groups of rows, where the value of all the variables in <names> is equal across rows. If you do by = names(dt) it will divide into groups where all variables are equal.
.N is the number of observations in the given group of rows.
For your edit, if your data.frame is named df, you can do
setDT(df) # convert to data table
df[, .(rev = sum(rev), total = .N), by = A:E] # get desired output
# A B C D E rev total
# 1: TRUE FALSE FALSE TRUE FALSE 3 2
# 2: TRUE TRUE TRUE FALSE FALSE 5 2
# 3: TRUE FALSE FALSE TRUE TRUE 5 1
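A small follow-up (not from the original answer): keyby can be used in place of by to also sort the result by the grouping columns, e.g.
dt[, .N, keyby = names(dt)]  # same counts, ordered by the A..E columns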

Overlap in row values from previous rows

I have a dataframe like this:
set.seed(123)
a <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
df <- data.frame(
V1 = sample(a,4, replace=TRUE),
V2 = sample(a,4, replace=TRUE),
V3 = sample(a,4, replace=TRUE),
V4 = sample(a,4, replace=TRUE)
)
which looks like
V1 V2 V3 V4
1 C I E G
2 H A E F
3 D E I A
4 H I E I
I'd like to count the number of unique values in a row in comparison to the previous rows, so the result would look like:
V1 V2 V3 V4 V5
1 C I E G 4
2 H A E F 3
3 D E I A 2
4 H I E I 1
V5 equals 4 for row 1 since it's the 1st row and all are unique
V5 equals 3 for row 2 since H, A, and F were not in row 1
V5 equals 2 for row 3 since 1) D and I were not in row 2, and 2) D and A were not in row 1.
V5 equals 1 for row 4 since 1) H was not in row 1, 2) I was not in row 2, and 3) H was not in row 3.
If row 4 were H I E A, then V5 for row 4 would still have been 1, since it only has 1 value not in row 3, even though it would have 2 values not in row 2 and 2 values not in row 1.
Here is a multi-step method in base R.
# Create a list of the elements by row, using mike H's method
myList <- strsplit(Reduce(paste0, df), "")
# previous method, could create new object first t(df) if large df
# myList <- split(t(df), col(t(df)))
# get pairwise combinations of rows
combos <- t(combn(nrow(df):1, 2))[choose(nrow(df), 2):1,]
# get desired values: sapply runs through pairs of rows, tapply calculates the min within each row
df$cnts <- c(length(unique(myList[[1]])), # value for first row
tapply(sapply(1:nrow(combos), # sapply through pairs, taking set diffs
function(x) length(setdiff(myList[[combos[x,1]]],
myList[[combos[x,2]]]))),
combos[,1], min)) # split set diff lengths by row, get min length
This returns
df
V1 V2 V3 V4 cnts
1 C I E G 4
2 H A E F 3
3 D E I A 2
4 H I E I 1
For such tasks, storing the rows/sets of data like "df" in a tabulation format can be helpful:
tab = table(as.matrix(df), row(df)) > 0
#> tab
#
# 1 2 3 4
# A FALSE TRUE TRUE FALSE
# C TRUE FALSE FALSE FALSE
# D FALSE FALSE TRUE FALSE
# E TRUE TRUE TRUE TRUE
# F FALSE TRUE FALSE FALSE
# G TRUE FALSE FALSE FALSE
# H FALSE TRUE FALSE TRUE
# I TRUE FALSE TRUE TRUE
crossprod can be used to retrieve, in a very efficient manner, the number of items that belong to one row but not to another:
ct = crossprod(tab, !tab)
#> ct
#
# 1 2 3 4
# 1 0 3 2 2
# 2 3 0 2 2
# 3 2 2 0 2
# 4 1 1 1 0
Above we can see that, e.g., row 4 contains 1 element that row 1 does not contain, while row 1 contains 2 elements that are not in row 4, etc.
Since here we only care about the previous rows of each row and, specifically, the minimum of each set of one-to-all comparisons, an idea to get the result is:
ct[upper.tri(ct, TRUE)] = Inf ## to ignore 'upper.tri' values in 'max.col'
j_min = max.col(-ct, "first") ## row-index of the minimum difference per row
c(sum(tab[, 1]),
ct[cbind(2:nrow(df), j_min[-1])])
#[1] 4 3 2 1
Here's an approach that uses Reduce and mapply:
df$cols_paste <- strsplit(Reduce(paste0, df), split = "")
df$V5 <- lapply(1:length(df$cols_paste), function(x){
if(x==1) compare = NA
else compare = df$cols_paste[seq(1:(x-1))]
min(mapply(function(x, y) length(setdiff(x,y)), df$cols_paste[x], compare))
})
df[,setdiff(names(df), "cols_paste")]
V1 V2 V3 V4 V5
1 C I E G 4
2 H A E F 3
3 D E I A 2
4 H I E I 1

Select rows based on value in multiple columns defined by vector

I have the following data frame
df <- data.frame(A1 = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B"),
B2 = c("C","D","C","D","C","D","C","D","C","D","C","D","C","D","C","D","C","D","C","D"),
C3 = c("E","F","E","F","E","F","E","F","E","F","E","F","E","F","E","F","E","F","E","F"),
D4=c(1,12,5,41,45,4,5,6,12,7,3,4,6,8,12,4,12,1,6,7))
and I would like to subset all the rows for which the first 3 columns match the vector c("A","C","E").
I have tried to use which, but it does not work:
vct <- c("A","C","E")
df[which(df[1:3] == vct)]
You can probably use paste (or interaction):
vct <- c("A","C","E")
do.call(paste, df[1:3]) %in% paste(vct, collapse = " ")
# [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
# [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
df[do.call(paste, df[1:3]) %in% paste(c("A", "C", "E"), collapse = " "), ]
# A1 B2 C3 D4
# 1 A C E 1
# 3 A C E 5
# 5 A C E 45
# 7 A C E 5
# 9 A C E 12
## with "interaction"
df[interaction(df[1:3], drop=TRUE) %in% paste(vct, collapse = "."), ]
You can also do something like this:
df[with(df, A1 == "A" & B2 == "C" & C3 == "E"), ]
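Another base R sketch (not part of the original answers): compare each of the first three columns to the corresponding element of vct with Map, then combine the logical vectors with Reduce.
keep <- Reduce(`&`, Map(`==`, df[1:3], vct))  # TRUE where all three columns match
df[keep, ]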

Order while splitting (e.g. "TA" should be split into two columns, "A" in the first and "T" in the second) in R

I have the following issue, which I could partly solve:
set.seed (1234)
mydf <- data.frame (var1a = sample (c("TA", "AA", "TT"), 5, replace = TRUE),
varb2 = sample (c("GA", "AA", "GG"), 5, replace = TRUE),
varAB = sample (c("AC", "AA", "CC"), 5, replace = TRUE)
)
mydf
var1a varb2 varAB
1 TA AA CC
2 AA GA AA
3 AA GA AC
4 AA AA CC
5 TT AA AC
I want to split the two letters into different columns, and then order them alphabetically.
Edit: Ordering can be done before the split, for example the var1a value "TA" should become "AT", or after the split, so that var1aa should be "A" and var1ab should be "T" (instead of "T", "A").
So sorting is within each cell.
split_col <- function(.col, data){
.x <- colsplit( data[[.col]], names = paste0(.col, letters[1:2]))
}
Split each column and combine:
require(reshape)
splitdf <- do.call(cbind, lapply(names(mydf), split_col, data = mydf))
var1aa var1ab varb2a varb2b varABa varABb
1 T A A A C C
2 A A G A A A
3 A A G A A C
4 A A A A C C
5 T T A A A C
But the unsolved part is that I want to order each pair of columns so that columnname "a" and columnname "b" are ordered alphabetically. Thus the expected output:
var1aa var1ab varb2a varb2b varABa varABb
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
How can I order (sort within each pair of variables)?
mylist <-as.list(mydf)
splits <- lapply(mylist, reshape::colsplit, names=c("a", "b"))
rowsort <- lapply(splits, function(x) t(apply(x, 1, sort)))
comb <- do.call(data.frame, rowsort)
comb
var1a.1 var1a.2 varb2.1 varb2.2 varAB.a varAB.b
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
EDIT:
If names are important, you can replace them:
replaceNums <- function(x){
.which <- regmatches(x, regexpr("[[:alnum:]]*(?=.)", x, perl=TRUE))
stopifnot(length(x) %% 2 == 0) #checkstep
paste0(.which, c("a", "b"))
}
names(comb) <- replaceNums(names(comb))
comb
var1aa var1ab varb2a varb2b varABa varABb
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
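A further base R sketch (not part of the original answer), avoiding the reshape package and assuming two-character values as in mydf: split each value with substr and sort each pair with pmin/pmax, which compare character vectors lexicographically.
sorted <- lapply(names(mydf), function(nm) {
  x <- as.character(mydf[[nm]])
  out <- data.frame(pmin(substr(x, 1, 1), substr(x, 2, 2)),   # alphabetically first letter
                    pmax(substr(x, 1, 1), substr(x, 2, 2)))   # alphabetically second letter
  names(out) <- paste0(nm, c("a", "b"))
  out
})
do.call(cbind, sorted)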
