Create sub-matrices that have identical cell values - r

I have a matrix with many rows and columns (r x p), and I am trying to create sub-matrices that contain only those rows that have identical cell values. For example:
This is my matrix
a b c d
a 0 1 1 1
b 1 0 0 1
c 1 0 0 1
d 0 1 0 0
e 0 1 1 1
Here rows b and c have identical values, and so do rows a and e, so the code should create a 1st sub-matrix with only rows b and c and a 2nd sub-matrix with rows a and e:
a b c d
b 1 0 0 1
c 1 0 0 1
a b c d
a 0 1 1 1
e 0 1 1 1

Presumably there can be more than one set of repeated rows, so if m is your matrix, this creates a list of data frames, one for each set of repeated rows:
DF <- as.data.frame(m)
Filter(function(x) nrow(x) > 1, split(DF, do.call(paste, DF)))
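As a minimal sketch of how the split approach above behaves, the block below rebuilds the five-row matrix from the question and groups its rows by a pasted key. The names key and groups are introduced here for illustration only, and the conversion back to matrices is an optional extra step, not part of the original answer.

```r
# Example matrix from the question (rows a-e, columns a-d)
m <- matrix(c(0, 1, 1, 1,
              1, 0, 0, 1,
              1, 0, 0, 1,
              0, 1, 0, 0,
              0, 1, 1, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("a", "b", "c", "d", "e"),
                            c("a", "b", "c", "d")))

DF <- as.data.frame(m)

# One text key per row, e.g. "1 0 0 1"; identical rows share a key
key <- do.call(paste, DF)

# Group rows by key and keep only groups with at least two rows
groups <- Filter(function(x) nrow(x) > 1, split(DF, key))

# Optional: turn each group back into a matrix
groups <- lapply(groups, as.matrix)
groups
```

The list elements are named after the row keys, so a particular pattern can be looked up with, for example, groups[["1 0 0 1"]].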

You could use duplicated in both directions.
m[duplicated(m) | duplicated(m, fromLast=TRUE),]
# a b c d
# b 1 0 0 1
# c 1 0 0 1
Where m is
structure(c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L,
1L, 1L, 0L), .Dim = c(4L, 4L), .Dimnames = list(c("a", "b", "c",
"d"), c("a", "b", "c", "d")))

You could also use the following, which returns a list of matrices (the output shown is for the full five-row matrix from the question, rows a to e):
indx <- which(duplicated(m))
lapply(indx, function(i) m[colSums(t(m)==m[i,])==ncol(m),])
[[1]]
# a b c d
#b 1 0 0 1
#c 1 0 0 1
[[2]]
# a b c d
#a 0 1 1 1
#e 0 1 1 1
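One thing to keep in mind with the lapply approach: if a row occurs three or more times, duplicated() flags it more than once and the same sub-matrix appears repeatedly in the result. A possible variation, sketched here under the assumption that one sub-matrix per distinct pattern is wanted, is to loop over the unique repeated patterns instead. The names reps and res are hypothetical.

```r
# Five-row example matrix from the question
m <- matrix(c(0, 1, 1, 1,
              1, 0, 0, 1,
              1, 0, 0, 1,
              0, 1, 0, 0,
              0, 1, 1, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("a", "b", "c", "d", "e"),
                            c("a", "b", "c", "d")))

# One representative row per repeated pattern
reps <- unique(m[duplicated(m), , drop = FALSE])

# For each pattern, keep every row of m that matches it in all columns
res <- lapply(seq_len(nrow(reps)), function(i) {
  same <- colSums(t(m) == reps[i, ]) == ncol(m)
  m[same, , drop = FALSE]
})
res
```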


Replace multiple columns by head string into one column

I want to replace multiple columns of a data frame with a single column per group, and I also want to change the numbers. Example:
A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0
I want to collapse this data frame by its headers, meaning I want only one column "A" instead of the 4 here and only one column "B" instead of the 3 here. The numbers should change according to the following pattern: if you are in group "A2" and the observation has the number "1", it should be changed into a "2". If you are in group "A3" and the observation has the number "1", it should be changed into a "3". In the end, each row should contain the highest such number within each group (if a row has three "1"s in a group, the number that replaces all of them is the one from the highest-numbered column).
If the number is 0 then nothing changes. Here is the result I'm looking for:
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
How can I replace all of these groups by a single column each? (one column for each group)
So far I've tried a lot with the function unite(data= testdata, col= "A") for example, but doing this manually would take too long. There has to be a better way, right?
Thanks in advance!
You can do:
dat <- read.table(header=TRUE, text=
"A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0")
myfu <- function(x) if (any(x)) max(which(x)) else 0
new <- data.frame(
A=apply(dat[, 1:4]==1, 1, myfu),
B=apply(dat[, 5:7]==1, 1, myfu))
new
A more general solution:
new2 <- data.frame(
A=apply(dat[, grepl("^A", names(dat))]==1, 1, myfu),
B=apply(dat[, grepl("^B", names(dat))]==1, 1, myfu))
new2
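If there are many groups, writing one line per prefix gets tedious. The sketch below loops over the prefixes found in the column names instead. It assumes, like the answers here, that every column name is a letter prefix followed by a number and that the columns of each group appear in increasing numeric order starting at 1. The names prefixes and new3 are introduced for illustration.

```r
# Example data from the question
dat <- data.frame(A1 = c(1, 1, 1, 0, 0), A2 = c(1, 0, 1, 0, 0),
                  A3 = c(0, 1, 1, 1, 0), A4 = c(1, 1, 1, 0, 0),
                  B1 = c(1, 0, 0, 0, 0), B2 = c(0, 1, 1, 0, 1),
                  B3 = c(0, 1, 1, 1, 0))

# Position of the last TRUE in a logical vector, or 0 if there is none
myfu <- function(x) if (any(x)) max(which(x)) else 0

# Strip the trailing digits to get the group prefixes ("A", "B")
prefixes <- unique(sub("\\d+$", "", names(dat)))

# One result column per prefix
new3 <- as.data.frame(sapply(prefixes, function(p) {
  cols <- grepl(paste0("^", p, "\\d+$"), names(dat))
  apply(dat[, cols, drop = FALSE] == 1, 1, myfu)
}))
new3
```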
You can try the code below
dfout <- as.data.frame(
lapply(
split.default(df, gsub("\\d+$", "", names(df))),
function(v) max.col(v, ties.method = "last") * +(rowSums(v) >= 1)
)
)
such that
> dfout
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
Data
df <- structure(list(A1 = c(1L, 1L, 1L, 0L, 0L), A2 = c(1L, 0L, 1L,
0L, 0L), A3 = c(0L, 1L, 1L, 1L, 0L), A4 = c(1L, 1L, 1L, 0L, 0L
), B1 = c(1L, 0L, 0L, 0L, 0L), B2 = c(0L, 1L, 1L, 0L, 1L), B3 = c(0L,
1L, 1L, 1L, 0L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5"))
Assuming your data is in a data.frame called df1, this works in base R:
df1 <- t(df1)*as.numeric(regmatches(colnames(df1), regexpr("\\d+$", colnames(df1))))
df1 <- split(as.data.frame(df1),sub("\\d+$","",row.names(df1)))
df1 <- sapply(df1, apply, 2, max)
output:
> df1
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2

Refer to column name and row name within an apply statement in R

I have a dataframe in R which looks like the one below.
a b c d e f
0 1 1 0 0 0
1 1 1 1 0 1
0 0 0 1 0 1
1 0 0 1 0 1
1 1 1 0 0 0
The dataset is big, spanning over 100 columns and 5000 rows, and contains only binary values (0's and 1's). I want to compute the overlap between each pair of columns in R, something like the one given below. This overlap dataframe will be a square matrix whose number of rows and columns equals the number of columns in the 1st dataframe.
a b c d e f
a 3 2 2 2 0 2
b 2 3 3 1 0 1
c 2 3 3 1 0 1
d 2 1 1 3 0 3
e 0 0 0 0 0 0
f 2 1 1 3 0 3
Each cell of the second dataframe is populated by the number of cases where both row and column have 1 in the first dataframe.
I'm thinking of constructing an empty matrix like this:
df <- matrix(ncol = ncol(data), nrow = ncol(data))
colnames(df) <- names(data)
rownames(df) <- names(data)
... and iterating over each cell of this matrix using an apply command, reading the corresponding row name (say, x) and column name (say, y) and running a function like the one below.
summation <- function(x, y) sum(data[[x]] * data[[y]])
The problem is that I can't find out the row name and column name while within an apply function. Any help will be appreciated.
Any more efficient way than what I'm thinking is more than welcome.
You are looking for crossprod
crossprod(as.matrix(df1))
# a b c d e f
#a 3 2 2 2 0 2
#b 2 3 3 1 0 1
#c 2 3 3 1 0 1
#d 2 1 1 3 0 3
#e 0 0 0 0 0 0
#f 2 1 1 3 0 3
data
df1 <- structure(list(a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L,
1L), c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L), e = c(0L,
0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f"), class = "data.frame", row.names = c(NA,
-5L))
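To see why crossprod answers the question, note that crossprod(X) computes t(X) %*% X, so entry (i, j) is the sum over rows of column i times column j, which for 0/1 data is the number of rows where both columns are 1. The sketch below compares it with an explicit loop over column names, which is closer to what the question set out to do. The names X, nms, overlap and overlap2 are introduced here for illustration.

```r
# Example data from the answer
df1 <- data.frame(a = c(0, 1, 0, 1, 1), b = c(1, 1, 0, 0, 1),
                  c = c(1, 1, 0, 0, 1), d = c(0, 1, 1, 1, 0),
                  e = c(0, 0, 0, 0, 0), f = c(0, 1, 1, 1, 0))

X <- as.matrix(df1)

# Matrix form: equivalent to t(X) %*% X
overlap <- crossprod(X)

# Explicit form: iterate over column names and count joint 1s
nms <- names(df1)
overlap2 <- sapply(nms, function(y) {
  sapply(nms, function(x) sum(df1[[x]] * df1[[y]]))
})

# Both approaches agree cell by cell
all(overlap == overlap2)
```

For 100 columns and 5000 rows the matrix form should be the faster of the two, since it avoids the nested R-level loop.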

How to convert a non-square matrix to a square matrix with R?

I have network data that I am trying to analyze. The problem is that it has some missing rows or columns. I want to match the rows and columns so that it becomes a square matrix.
My data looks like this:
A B C D E
A 0 2 1 4 5
B 1 0 2 4 2
D 2 4 0 2 2
E 1 2 2 2 0
And I want to make it look like this:
A B C D E
A 0 2 1 4 5
B 1 0 2 4 2
C NA NA NA NA NA
D 2 4 0 2 2
E 1 2 2 2 0
My data is very large, so I cannot do this by hand. Is there any syntax to do it automatically?
One option is to create an NA matrix based on the unique column names and row names (assuming that it is symmetric) and then fill it by matching row names and column names in the original dataset
un1 <- unique(sort(c(colnames(m1), rownames(m1))))
m2 <- matrix(NA, length(un1), length(un1), dimnames = list(un1, un1))
m2[row.names(m1), colnames(m1)] <- m1
m2
# A B C D E
#A 0 2 1 4 5
#B 1 0 2 4 2
#C NA NA NA NA NA
#D 2 4 0 2 2
#E 1 2 2 2 0
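An alternative sketch, assuming the same example matrix, is to index the original matrix with match(): names that are absent yield an NA index, and indexing a matrix with NA produces a row (or column) of NA. The names all_names and m2 are introduced for illustration.

```r
# Example matrix from the question: row C is missing
m1 <- matrix(c(0, 2, 1, 4, 5,
               1, 0, 2, 4, 2,
               2, 4, 0, 2, 2,
               1, 2, 2, 2, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("A", "B", "D", "E"),
                             c("A", "B", "C", "D", "E")))

# Every name that appears as a row or a column
all_names <- sort(union(rownames(m1), colnames(m1)))

# match() returns NA for missing names, which gives NA rows/columns
m2 <- m1[match(all_names, rownames(m1)), match(all_names, colnames(m1))]
dimnames(m2) <- list(all_names, all_names)
m2
```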
data
m1 <- structure(c(0L, 1L, 2L, 1L, 2L, 0L, 4L, 2L, 1L, 2L, 0L, 2L, 4L,
4L, 2L, 2L, 5L, 2L, 2L, 0L), .Dim = 4:5, .Dimnames = list(c("A",
"B", "D", "E"), c("A", "B", "C", "D", "E")))

Ordering rows and columns of R Matrix by criteria

I have a matrix in R like this:
A B C D E F
A 2 5 0 1 3 6
B 5 0 0 1 5 9
C 0 0 0 0 0 1
D 6 1 1 3 4 4
E 3 1 5 2 1 6
F 0 0 1 1 7 9
mat = structure(c(2L, 5L, 0L, 6L, 3L, 0L, 5L, 0L, 0L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 5L, 1L, 1L, 1L, 0L, 3L, 2L, 1L, 3L, 5L, 0L, 4L, 1L,
7L, 6L, 9L, 1L, 4L, 6L, 9L), .Dim = c(6L, 6L), .Dimnames = list(
c("A", "B", "C", "D", "E", "F"), c("A", "B", "C", "D", "E",
"F")))
The matrix is not symmetric.
I want to reorder the rows and columns according to the following criteria:
NAME TYPE
A Dog
B Cat
C Cat
D Other
E Cat
F Dog
crit = structure(list(NAME = c("A", "B", "C", "D", "E", "F"), TYPE = c("Dog",
"Cat", "Cat", "Other", "Cat", "Dog")), .Names = c("NAME", "TYPE"
), row.names = c(NA, -6L), class = "data.frame")
I am trying to get the matrix rows and columns to be re-ordered, so that each category is grouped together:
A F B C E D
A
F
B
C
E
D
I am unable to find any reasonable way of doing this.
In case it matters, or makes things simpler, I can get rid of the category 'Other' and just stick with 'Cat' and 'Dog'.
I need to find a way to write code for this reordering, as the matrix is quite big.
In base, just index by order:
mat[order(crit$TYPE), order(crit$TYPE)]
#
# B C E A F D
# B 0 0 5 5 9 1
# C 0 0 0 0 1 0
# E 1 5 1 3 6 2
# A 5 0 3 2 6 1
# F 0 1 7 0 9 1
# D 1 1 4 6 4 3
It orders on an alphabetical sort of crit$TYPE, so Cat (B, C, and E) comes before Dog (A and F). If you want to set the order, use factor levels:
mat[order(factor(crit$TYPE, levels = c('Dog', 'Cat', 'Other'))),
order(factor(crit$TYPE, levels = c('Dog', 'Cat', 'Other')))]
#
# A F B C E D
# A 2 6 5 0 3 1
# F 0 9 0 1 7 1
# B 5 9 0 0 5 1
# C 0 1 0 0 0 0
# E 3 6 1 5 1 2
# D 6 4 1 1 4 3
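Indexing by order(crit$TYPE) relies on the rows of crit being in the same order as the rows and columns of mat. If that is not guaranteed, a slightly more defensive sketch looks up each name's type with match() first. The shuffled crit below is a hypothetical variation of the question's data, used only to show that the row order of crit no longer matters; it also assumes the row names and column names of mat are identical.

```r
# Matrix from the question (values listed column by column)
mat <- matrix(c(2, 5, 0, 6, 3, 0,
                5, 0, 0, 1, 1, 0,
                0, 0, 0, 1, 5, 1,
                1, 1, 0, 3, 2, 1,
                3, 5, 0, 4, 1, 7,
                6, 9, 1, 4, 6, 9),
              nrow = 6,
              dimnames = list(LETTERS[1:6], LETTERS[1:6]))

# Hypothetical: same criteria as the question, but in reversed row order
crit <- data.frame(NAME = c("F", "E", "D", "C", "B", "A"),
                   TYPE = c("Dog", "Cat", "Other", "Cat", "Cat", "Dog"))

# Type of each matrix row/column, looked up by name
type <- crit$TYPE[match(rownames(mat), crit$NAME)]

# Order by the desired sequence of categories
ord <- order(factor(type, levels = c("Dog", "Cat", "Other")))
res <- mat[ord, ord]
res
```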

R: add matrices based on row and column names

I have two matrices I want to sum based on their row and column names. The matrices will not necessarily have all rows and columns in common - some may be missing from either matrix.
For example, consider two matrices A and B:
A=             B=
  a b c d        a c d e
v 1 1 1 0      v 0 0 0 1
w 1 1 0 1      w 0 0 1 0
x 1 0 1 1      y 0 1 0 0
y 0 1 1 1      z 1 0 0 0
Column e is missing from matrix A and column b is missing from matrix B.
Row z is missing from matrix A and row x is missing from matrix B.
The summed table I'm looking for is:
Sum=
a b c d e
v 1 1 1 0 1
w 1 1 0 2 0
x 1 0 1 1 NA
y 0 1 2 1 0
z 1 NA 0 0 0
The row and column ordering in the final matrix doesn't matter, as long as the matrix is complete, i.e. has all the data. Missing values don't have to be NA, but could be 0 instead.
I'm not sure if there is a way to do this that doesn't involve for loops. Any help would be much appreciated.
My solution
I managed to do this easily by converting the matrices to dataframes, binding the dataframes by row and then casting the resulting dataframe back into a matrix. This looks like it works, but maybe someone could double check or let me know if there is a better way.
library(reshape2)
A_df=as.data.frame(as.table(A))
B_df=as.data.frame(as.table(B))
merged_df=rbind(A_df,B_df)
Summed_matrix=acast(merged_df, Var1 ~ Var2, sum)
merged_df looks like this:
Var1 Var2 Freq
1 v a 1
2 w a 1
3 x a 1
4 y a 0
5 v b 1
6 w b 1
etc...
Maybe you can try:
cAB <- union(colnames(A), colnames(B))
rAB <- union(rownames(A), rownames(B))
A1 <- matrix(0, ncol=length(cAB), nrow=length(rAB), dimnames=list(rAB, cAB))
B1 <- A1
indxA <- outer(rAB, cAB, FUN=paste) %in% outer(rownames(A), colnames(A), FUN=paste)
indxB <- outer(rAB, cAB, FUN=paste) %in% outer(rownames(B), colnames(B), FUN=paste)
A1[indxA] <- A
B1[indxB] <- B
A1+B1 #because it was mentioned to have `0` as missing values
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 0
#y 0 1 2 1 0
#z 1 0 0 0 0
If you want to get the NA as missing values
A1 <- matrix(NA, ncol=length(cAB), nrow=length(rAB), dimnames=list(rAB, cAB))
B1 <- A1
A1[indxA] <- A
B1[indxB] <- B
indxNA <- is.na(A1) & is.na(B1)
A1[is.na(A1)!= indxNA] <- 0
B1[is.na(B1)!= indxNA] <- 0
A1+B1
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 NA
#y 0 1 2 1 0
#z 1 NA 0 0 0
Or using reshape2
library(reshape2)
acast(rbind(melt(A), melt(B)), Var1~Var2, sum) #Inspired from the OP's idea
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 0
#y 0 1 2 1 0
#z 1 0 0 0 0
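A simpler variation on the zero-filled approach, sketched here as an alternative rather than as the answer's own method, is to index the result by row and column names directly. This avoids the logical index built with outer() and does not depend on the rows and columns of A and B appearing in the same relative order as in the unions. The name S is introduced for illustration.

```r
# Example matrices from the question
A <- matrix(c(1, 1, 1, 0,
              1, 1, 0, 1,
              1, 0, 1, 1,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("v", "w", "x", "y"), c("a", "b", "c", "d")))
B <- matrix(c(0, 0, 0, 1,
              0, 0, 1, 0,
              0, 1, 0, 0,
              1, 0, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("v", "w", "y", "z"), c("a", "c", "d", "e")))

rAB <- union(rownames(A), rownames(B))
cAB <- union(colnames(A), colnames(B))

# Zero-filled result covering every row and column name
S <- matrix(0, nrow = length(rAB), ncol = length(cAB),
            dimnames = list(rAB, cAB))

# Add each matrix into the block selected by its own names
S[rownames(A), colnames(A)] <- S[rownames(A), colnames(A)] + A
S[rownames(B), colnames(B)] <- S[rownames(B), colnames(B)] + B
S
```

Cells that exist in neither matrix stay at 0 in this version; they would need separate handling if NA is preferred.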
data
A <- structure(c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L,
1L, 1L, 1L), .Dim = c(4L, 4L), .Dimnames = list(c("v", "w", "x",
"y"), c("a", "b", "c", "d")))
B <- structure(c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L), .Dim = c(4L, 4L), .Dimnames = list(c("v", "w", "y",
"z"), c("a", "c", "d", "e")))
