Find reciprocal row duplicates - r

I would like to extract all reciprocal pairs from an asymmetric square matrix.
Some dummy data to clarify:
m <- matrix(c(NA,0,1,0,0,-1,NA,1,-1,0,1,1,NA,-1,-1,-1,1,0,NA,0,-1,1,0,0,NA), ncol=5, nrow=5)
colnames(m) <- letters[seq(ncol(m))]
rownames(m) <- letters[seq(nrow(m))]
require(reshape2)
m.m <- melt(m) # get all pairs
m.m <- m.m[complete.cases(m.m),] # remove NAs
How would I now extract all "reciprocal duplicates" from m.m (or directly from m)?
This is what I mean by a reciprocal duplicate:
Var1 Var2 value
b a 0
a b -1
I would like to store each value combination, i.e. {1,1}, {-1,-1}, {1,0}, {-1,0}, {0,0}, in a list keyed by its Var combination, i.e. {a,b}, {a,c}, {a,d}, {a,e}, {b,c}, {b,d}, {b,e}, {c,d}, {c,e}, {d,e}, something like
$`a,b`
[1] 0,-1
I haven't managed to solve this. I feel like it could be possible with merge() or inner_join(). Also, I apologize for not providing the best example.
Any pointers would be highly appreciated.

Here's an approach based on the object m.m:
# extract the unique combinations
levs <- apply(m.m[-3], 1, function(x) paste(sort(x), collapse = ","))
# create a list of values for these combinations
split(m.m$value, levs)
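On the dummy data above, this should yield a named list with one element per unordered pair; a sketch of the expected output (not a verified console transcript):
res <- split(m.m$value, levs)
res$`a,b`
# [1]  0 -1   # i.e. m["b","a"] followed by m["a","b"]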

Using the matrix representation, you can get vectors of each triangle of the matrix (which align as you wish) using:
m[upper.tri(m)]
t(m)[upper.tri(m)]
To name them:
nm <- matrix(paste("(",rep(rownames(m),times=nrow(m)), ",",rep(rownames(m),each=nrow(m)),")",sep=""), nrow=nrow(m))
nm[as.vector(upper.tri(m))]
Finally, convert to a list as you wish. First I put the two vectors into a new 10 x 2 matrix, then I used lapply to create the list structure.
pairs <- cbind(m[upper.tri(m)], t(m)[upper.tri(m)])
rownames(pairs) <- nm[as.vector(upper.tri(m))]
pairs
m.list <- lapply(seq_len(nrow(pairs)),function(i) pairs[i,])
names(m.list) <- rownames(pairs)
m.list
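Note that each element here lists the upper-triangle value first, so on the example matrix the output should look roughly like this (a sketch, not a verified transcript):
m.list$`(a,b)`
# [1] -1  0   # i.e. m["a","b"] followed by m["b","a"]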

Related

I want to apply two functions: one to the block-diagonal elements and a second to the off-diagonal elements of a data frame.

df <- data.frame(a = c(1:10), b = c(21:30), c = c(1:10), d = c(14:23), e = c(11:20), f = c(-6:-15), g = c(11:20), h = c(-14:-23), i = c(4:13), j = c(1:10))
In this data frame I have three diagonal blocks (rows/columns 1:2, 3:5, and 6:10), originally illustrated with an image. I want to apply the sine function to the block-diagonal elements and the cosine function to all other elements, producing a data frame with the same structure:
sin(df[1:2,1:2])
sin(df[3:5,3:5])
sin(df[6:10,6:10])
cos(the rest of the elements)
1) outer/arithmetic Create a logical block-diagonal matrix indicating whether each cell is on the block diagonal or not, and then use that to take a convex combination of the sin and cos values, giving a data.frame as follows:
v <- rep(1:3, c(2, 3, 5))
ind <- outer(v, v, `==`)
ind * sin(df) + (!ind) * cos(df)
2) ifelse Alternatively, this gives a matrix result (or use as.matrix on the result of (1)). ind is from above.
m <- as.matrix(df)
ifelse(ind, sin(m), cos(m))
3) Matrix::bdiag Another approach is to use bdiag in the Matrix package (which comes with R -- no need to install it).
library(Matrix)
ones <- function(n) matrix(1, n, n)
ind <- bdiag(ones(2), ones(3), ones(5)) == 1
Now proceed as in the last line of (1) or as in (2).
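For completeness, one way to finish approach (3) is sketched below; converting the sparse indicator back to a plain logical matrix is my own assumption, made to avoid mixing Matrix objects with the data frame arithmetic of (1):
library(Matrix)
ones <- function(n) matrix(1, n, n)
ind <- as.matrix(bdiag(ones(2), ones(3), ones(5))) == 1  # plain logical matrix
ind * sin(df) + (!ind) * cos(df)                         # same combination as the last line of (1)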
If it's okay for you that the result is stored in a new data frame, you could change the order of your instructions and do it like this:
ndf <- cos(df)
ndf[1:2,1:2] <- sin(df[1:2,1:2])
ndf[3:5,3:5] <- sin(df[3:5,3:5])
ndf[6:10,6:10] <- sin(df[6:10,6:10])

Way to reference list names in R S3 object?

I'm just learning how to create my own S3 object in R. The class I'm creating is named DAT. It contains a number of matrices that will be populated as the pre-processing of my data ensues. Here's how I define it:
createDAT <- function(M){ # M is a data.matrix
  # it's assumed that samples are rows and genes are columns already
  z <- list(M_orig <- M,                  # assumed log2 scale
            M_nat <- matrix(),            # M_orig on natural scale
            M_filt <- matrix(),           # after gene filtering
            M_scaled <- matrix(),         # as fraction of all counts
            M_norm <- matrix(),           # after gene normalization
            ZEROGENES <- list(),
            outcome <- list(),
            RefSampleName <- character(),
            RefSample <- matrix(),
            RefSampleUnscaled <- matrix(),
            seed <- numeric())
  #names(z[[2]]) <- "Nat"
  class(z) <- "DAT"
  return(z)
}
I'll instantiate this here with the following code:
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(x, y)
df_wide <- t(df)
data <- createDAT(df_wide)
I have a list of "zero-expressed genes", called ZERO. All I want to do is to add that list to the data instance of DAT. I can successfully do that with the line:
data[[6]] <- ZERO
However, to make things more intuitive, instead of referencing data[[6]], I'd somehow like to use data$ZERO or something.
Is there some way of doing that? I haven't been able to find anything online.
Thank you!!
The problem with your code is that you are using arrows (<-, the assignment operator) within the list() call. Use equals signs instead and the elements of your DAT object will be named. That way you will be able to access its elements with the $ operator as you expect.
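For illustration, a minimal sketch of the corrected constructor with named elements (only a few fields are shown; the remaining ones follow the same pattern):
createDAT <- function(M) {
  z <- list(M_orig    = M,        # assumed log2 scale
            M_nat     = matrix(), # M_orig on natural scale
            ZEROGENES = list(),
            seed      = numeric())
  class(z) <- "DAT"
  return(z)
}
data <- createDAT(df_wide)
data$ZEROGENES <- ZERO  # named access now works instead of data[[6]]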

R: Performance issue when finding the maximum of a split list

When trying to find the maximum values of a split list, I run into serious performance issues.
Is there a way I can optimize the following code:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind(y, x)
my_data <- data.frame(my_data)
# This is the critical part I would like to optimize
my_data_split <- split(my_data, y)
max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
I want to get the rows where a given column hits its maximum for a given group (it should be easier to understand from the code).
I know that splitting into a list is probably the reason for the slow performance, but I don't know how to circumvent it.
This may not be immediately obvious, but there is a base R function, max.col, that does something similar, except that it finds the position of the maximum along each matrix row (not column). So if you transpose your original matrix x, you will be able to use this function.
Complexity steps in when you want to do max.col by group; the split-lapply convention is needed. But if, after the transpose, we convert the matrix to a data frame, we can use split.default. (Note it is not split or split.data.frame: here the data frame is treated as a list (vector), so the split happens among the data frame columns.) Finally, we do an sapply to apply max.col by group and cbind the result into a matrix.
tx <- data.frame(t(x))
tx.group <- split.default(tx, y) ## note the `split.default`, not `split`
pos <- sapply(tx.group, max.col)
The resulting pos is something like a look-up table: it has 9000 rows and 100 columns (groups). pos[i, j] gives the index you want for the i-th column (of your original, non-transposed matrix) and the j-th group. So your final extraction for the 50th column and all groups is
max_values <- Map("[[", tx.group, pos[50, ])
You generate the look-up table once and can then make arbitrary extractions at any time.
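For instance, to pull the rows maximising a different column, say the 100th, the same table can be reused (a small usage sketch):
max_values_100 <- Map("[[", tx.group, pos[100, ])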
Disadvantage of this method:
After the split, data in each group are stored in a data frame rather than a matrix. That is, for example, tx.group[[1]] is a 9000 x 9 data frame. But max.col expects a matrix so it will convert this data frame into a matrix internally.
Thus, the major performance / memory overhead includes:
initial matrix transposition;
matrix to data frame conversion;
data frame to matrix conversion (per group).
I am not sure whether all of the above can be eliminated with functions from the matrixStats package; I look forward to seeing a solution along those lines.
But anyway, this answer is already much faster than the OP's original approach.
A solution using {dplyr}:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind.data.frame(y, x)
# This is the critical part I would like to optimize
system.time({
  my_data_split <- split(my_data, y)
  max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
})
# Using {dplyr} is 9 times faster, but you get results in a slightly different format
library(dplyr)
system.time({
  max_values2 <- my_data %>%
    group_by(y) %>%
    do(max_values = .[which.max(.[[50]]), ])
})
all.equal(max_values[[1]], max_values2$max_values[[1]], check.attributes = FALSE)
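As a side note, with dplyr >= 1.0 the same grouped maximum can also be written with slice_max; this is only a hedged sketch (the column is addressed by position to match .[[50]] above), not a benchmarked alternative:
max_values3 <- my_data %>%
  group_by(y) %>%
  slice_max(order_by = .data[[names(my_data)[50]]], n = 1, with_ties = FALSE) %>%
  ungroup()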

Multiplying each matrix in a list by the same vector

I am working with the consumer price index (CPI), and in order to calculate it I have to multiply the index matrix by the corresponding weights:
grossCPI77_10 <- grossIND1977 %*% weights1910/100
grossCPI82_10 <- grossIND1982 %*% weights1910/100
Of course, I would rather have code like the following:
grossIND1982 <- replicate(20, cbind(1:61))
grossIND1993 <- replicate(20, cbind(1:61))
weights1910_sc <- c(1:20)
grossIND_list <- mget(ls(pattern = "grossIND...."))
totalCPI <- mapply("*", grossIND_list, weights1910_sc)
The problem is that it gives me a 1200x20 matrix. I expected a normal matrix (61x20) by vector (20x1) multiplication, which should result in a 61x1 vector. Could you explain what I am doing wrong? Thanks.
Part of your problem is that you don't have matrices but 3D arrays, with one singleton dimension. The other issue is that mapply likes to try to combine the results into a matrix, and that constant arguments should be passed via MoreArgs. But actually, this is more a case for lapply.
grossIND1982 <- replicate(20, cbind(1:61))[,1,]
grossIND1993 <- replicate(20, cbind(1:61))[,1,]
weights1910_sc <- c(1:20)
grossIND_list <- mget(ls(pattern = "grossIND...."))
# mapply version, passing the constant weights via MoreArgs:
totalCPI <- mapply("*", grossIND_list, MoreArgs = list(e2 = weights1910_sc), SIMPLIFY = FALSE)
# or, more simply, with lapply:
totalCPI <- lapply(grossIND_list, "*", e2 = weights1910_sc)
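If the goal is the original matrix multiplication by the weights (as in grossIND1977 %*% weights1910/100), the same lapply pattern can be reused; a sketch under that assumption:
totalCPI_mm <- lapply(grossIND_list, function(ind) ind %*% weights1910_sc / 100)
# each element is then a 61 x 1 matrix (one weighted value per row)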
I am not sure if I understood all aspects of your problem (especially concerning what should be columns, what should be rows, and in which order the cross product should be applied), but I will try at least to cover some aspects. See the comments in the code below for clarification of what you did and what you might want. I hope it helps; let me know if this is what you need.
# instead of using mget, I recommend using a list structure,
# otherwise you might capture other variables with similar names
# that you do not want
INDlist <- sapply(c("1990", "1991"), function(x) {
  # this is how to set up a matrix correctly, check `?matrix`
  # I think your combination of cbind and rep did not give you what you wanted
  matrix(rep(1:61, 20), nrow = 61)
}, USE.NAMES = TRUE, simplify = FALSE)
weights <- list(c(1:20))
# the first argument of mapply needs to be a function, in this case of two variables
# the body of the function calculates the cross product
# you feed the arguments (both lists) in the following part of mapply
# I have repeated your weights, but you might assign different weights for each year
res <- mapply(function(x, y) {x %*% y}, INDlist, rep(weights, length(INDlist)))
dim(res)
#[1] 61 2

Can the loop parameter be accessed in implicit loops during vectorization?

I have a question for the R gurus out there. I'll illustrate it with the following example:
I have a vector, say 1,2,3,4,5,6,7,8
I'd like to get a vector of sums of 2 elements: 3,5,7,9,11,13,15
This is just an example; I'm not looking for a trick. I want to do it with just vectorization and indexing. Is there any way to get access to the implicit loop parameter as it goes through?
Thanks a lot.
You can use rollapply from the zoo package:
> x <- 1:8
> rollapply(x, width=2, FUN=sum)
[1] 3 5 7 9 11 13 15
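Since the question asks for plain vectorization and indexing, the same adjacent-pair sums can also be written with shifted subsets in base R (a minimal sketch):
x <- 1:8
x[-length(x)] + x[-1]
# [1]  3  5  7  9 11 13 15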
You can use sapply or a variation of it and write a function that sums up the appropriate elements given the indexes and your matrix. For example,
m <- matrix(1:9, nrow=3)
m
Create a data frame with all possible index pairs
m_ind <- expand.grid(1:nrow(m),1:ncol(m), stringsAsFactors = FALSE)
names(m_ind) <- c("i","j")
m_ind
m[as.matrix(m_ind[,1:2])]
Diagonals, and the lines parallel to them, can be described by a constant difference or a constant sum of the indexes:
m_ind$dif_ij <- m_ind$i - m_ind$j
m_ind$sum_ij <- m_ind$i + m_ind$j
Then sum up the elements you want
m_ind$sum1 <- sapply(1:nrow(m_ind), function(k, mydf, colname, mymatr)
    sum(mymatr[as.matrix(mydf[mydf[, colname] == mydf[k, colname], c("i", "j")])]),
  mydf = m_ind, colname = "dif_ij", mymatr = m)
m_ind$sum2 <- sapply(1:nrow(m_ind), function(k, mydf, colname, mymatr)
    sum(mymatr[as.matrix(mydf[mydf[, colname] == mydf[k, colname], c("i", "j")])]),
  mydf = m_ind, colname = "sum_ij", mymatr = m)
and, finally, combine them:
m_ind$sum <- m_ind$sum1 + m_ind$sum2
m_ind
