R manipulate a list of matrices - r

Been stuck with this for a while!
How can I manipulate the first list into the second?
list("X" = X, "Y" = Y, ...)
list("X" = c(X,n), "Y" = c(Y,n), ...)
where X and Y are matrices and n is an integer and the lists are of unknown size. Thanks!

If c(X,n), which coerces matrix to vector, is what you really want, then
lst <- list(a = matrix(1:4, 2), b = matrix(1:4, 2))
n <- 5
lapply(lst, c, n)
# $a
# [1] 1 2 3 4 5
#
# $b
# [1] 1 2 3 4 5

Related

R values that go into matrix multiplication

What is the fastest approach to saving unique values that go into matrix multiplication (without 0)?
For example, if I have a data.table object
library(data.table)
A = data.table(j3=c(3,0,3),j5=c(0,5,5),j7=c(0,7,0),j8=c(8,0,8))
I would like to see which unique values go into A*transpose(A) (or as.matrix(A) %*% as.matrix(t(A))). Right now, I can do it using for loops as:
B=t(A)
L = list()
models = c('A1','A2','A3')
for(i in 1:nrow(A)){
for(j in 1:ncol(B)){
u = union(unlist(A[i,]),B[,j])
u = u[u!=0] # remove 0
L[[paste(models[i],models[j])]]= u
}
}
However, is there a faster and more RAM-efficient way? The output doesn't have to be a list object, as in my case, it can be a data.table (data.frame) as well. Also, the order of values is not important. For example, 3 5 8 is as good as 5 3 8, 8 5 3 etc.
Any help is appreciated.
EDIT: So as.matrix(A) %*% as.matrix(t(A)) is:
[,1] [,2] [,3]
[1,] 73 0 73
[2,] 0 74 25
[3,] 73 25 98
The first element is calculated as 3*3+0*0+0*0+8*8 = 73, the second element is 3*0+0*5+0*7+8*0 = 0, etc. I need unique numbers that go to this calculation but without 0.
Therefore outputs (saved in the list L) are:
> L
$`A1 A1`
[1] 3 8
$`A1 A2`
[1] 3 8 5 7
$`A1 A3`
[1] 3 8 5
$`A2 A1`
[1] 5 7 3 8
$`A2 A2`
[1] 5 7
$`A2 A3`
[1] 5 7 3 8
$`A3 A1`
[1] 3 5 8
$`A3 A2`
[1] 3 5 8 7
$`A3 A3`
[1] 3 5 8
Once again, the output doesn't have to be a list object. I would prefer data.table if it is doable. Is it possible to rewrite my approach as Rcpp function?
Potential optimizations
Following up on #user2554330's answer, note that if A is an m-by-n matrix, then AAT = A %*% t(A) (equivalently tcrossprod(A)) is an m-by-m symmetric matrix. AAT[i, j] and AAT[j, i] are computed using the same entries of A, so you only need to inspect m*(m+1)/2 pairs of rows of A, not m*m.
You can do even better by finding and caching the unique elements of each row before pairing them. Preprocessing in this way avoids redundant computation and should noticeably improve performance when m << n.
Limitations
Another aspect of the problem is how unique works under the hood. unique has an argument nmax that you can use to specify an expected maximum number of unique elements. From ?duplicated:
Except for factors, logical and raw vectors the default nmax = NA is equivalent to nmax = length(x). Since a hash table of size 8*nmax bytes is allocated, setting nmax suitably can save large amounts of memory. For factors it is automatically set to the smaller of length(x) and the number of levels plus one (for NA). If nmax is set too small there is liable to be an error: nmax = 1 is silently ignored.
Long vectors are supported for the default method of duplicated, but may only be usable if nmax is supplied.
These comments apply to unique as well. Since you have a 300-by-4e+07 matrix, you would be evaluating (with preprocessing):
unique(<4e+07-length vector>), 300 times,
unique(<up to 8e+07-length vector>), 299*300/2 times.
That can consume a lot of memory if you don't know anything about your matrix that might allow you to set nmax. And it can take a long time if you don't have access to many CPUs.
So I agree with comments asking you to consider why you need to do this at all and whether your underlying problem has a nicer solution.
Two answers
FWIW, here are two approaches to your general problem that actually take advantage of symmetry. f and g are without and with preprocessing. [[.utri allows you to extract elements from the return value, an m*(m+1)/2-length list, as if it were an m-by-m matrix. as.matrix.utri constructs the full, symmetric m-by-m list matrix.
f <- function(A, nmax = NA) {
a <- seq_len(nrow(A))
J <- cbind(sequence(a), rep.int(a, a))
FUN <- function(i) {
if (i[1L] == i[2L]) {
x <- A[i[1L], ]
} else {
x <- c(A[i[1L], ], A[i[2L], ])
}
unique.default(x[x != 0], nmax = nmax)
}
res <- apply(J, 1L, FUN, simplify = FALSE)
class(res) <- "utri"
res
}
g <- function(A, nmax = NA) {
l <- lapply(asplit(A, 1L), function(x) unique.default(x[x != 0], nmax = nmax))
a <- seq_along(l)
J <- cbind(sequence(a), rep.int(a, a))
FUN <- function(i) {
if (i[1L] == i[2L]) {
l[[i[1L]]]
} else {
unique.default(c(l[[i[1L]]], l[[i[2L]]]))
}
}
res <- apply(J, 1L, FUN, simplify = FALSE)
class(res) <- "utri"
res
}
`[[.utri` <- function(x, i, j) {
stopifnot(length(i) == 1L, length(j) == 1L)
class(x) <- NULL
if (i <= j) {
x[[i + (j * (j - 1L)) %/% 2L]]
} else {
x[[j + (i * (i - 1L)) %/% 2L]]
}
}
as.matrix.utri <- function(x) {
p <- length(x)
n <- as.integer(round(0.5 * (-1 + sqrt(1 + 8 * p))))
i <- rep.int(seq_len(n), n)
j <- rep.int(seq_len(n), rep.int(n, n))
r <- i > j
ir <- i[r]
i[r] <- j[r]
j[r] <- ir
res <- x[i + (j * (j - 1L)) %/% 2L]
dim(res) <- c(n, n)
res
}
Here is a simple test on a 4-by-4 integer matrix:
mkA <- function(m, n) {
A <- sample(0:(n - 1L), size = as.double(m) * n, replace = TRUE,
prob = rep.int(c(n - 1, 1), c(1L, n - 1L)))
dim(A) <- c(m, n)
A
}
set.seed(1L)
A <- mkA(4L, 4L)
A
## [,1] [,2] [,3] [,4]
## [1,] 0 0 2 3
## [2,] 0 1 0 0
## [3,] 2 1 0 3
## [4,] 1 2 0 0
identical(f(A), gA <- g(A))
## [1] TRUE
gA[[1L, 1L]] # used for 'tcrossprod(A)[1L, 1L]'
## [1] 2 3
gA[[1L, 2L]] # used for 'tcrossprod(A)[1L, 2L]'
## [1] 2 3 1
gA[[2L, 1L]] # used for 'tcrossprod(A)[2L, 1L]'
## [1] 2 3 1
gA # under the hood, an 'm*(m+1)/2'-length list
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 2 3 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 2 3 1
##
## [[5]]
## [1] 1 2 3
##
## [[6]]
## [1] 2 1 3
##
## [[7]]
## [1] 2 3 1
##
## [[8]]
## [1] 1 2
##
## [[9]]
## [1] 2 1 3
##
## [[10]]
## [1] 1 2
##
## attr(,"class")
## [1] "utri"
mgA <- as.matrix(gA) # the full, symmetric, 'm'-by-'m' list matrix
mgA
## [,1] [,2] [,3] [,4]
## [1,] integer,2 integer,3 integer,3 integer,3
## [2,] integer,3 1 integer,3 integer,2
## [3,] integer,3 integer,3 integer,3 integer,3
## [4,] integer,3 integer,2 integer,3 integer,2
mgA[1L, ] # used for first row of 'tcrossprod(A)'
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 2 3 1
##
## [[3]]
## [1] 2 3 1
##
## [[4]]
## [1] 2 3 1
## If you need names
dimnames(mgA) <- rep.int(list(sprintf("A%d", seq_len(nrow(mgA)))), 2L)
mgA["A1", ]
## $A1
## [1] 2 3
##
## $A2
## [1] 2 3 1
##
## $A3
## [1] 2 3 1
##
## $A4
## [1] 2 3 1
## If you need an 'm'-by-'m' 'data.table' result
DT <- data.table::as.data.table(mgA)
DT
## A1 A2 A3 A4
## 1: 2,3 2,3,1 2,3,1 2,3,1
## 2: 2,3,1 1 1,2,3 1,2
## 3: 2,3,1 1,2,3 2,1,3 2,1,3
## 4: 2,3,1 1,2 2,1,3 1,2
And here are two benchmarks on two large integer matrices, showing that preprocessing can help quite a bit:
set.seed(1L)
A <- mkA(100L, 1e+04L)
microbenchmark::microbenchmark(f(A), g(A), times = 10L, setup = gc(FALSE))
## Unit: milliseconds
## expr min lq mean median uq max neval
## f(A) 2352.0572 2383.3100 2435.7954 2403.8968 2431.6214 2619.553 10
## g(A) 843.0206 852.5757 858.7262 858.2746 863.8239 881.450 10
A <- mkA(100L, 1e+06L)
microbenchmark::microbenchmark(f(A), g(A), times = 10L, setup = gc(FALSE))
## Unit: seconds
## expr min lq mean median uq max neval
## f(A) 290.93327 295.54319 302.57001 301.17810 307.50226 318.14203 10
## g(A) 72.85608 73.83614 76.67941 76.57313 77.78056 83.73388 10
Perhaps we can try this
f <- function(A, models) {
AA <- replace(A, A == 0, NA)
setNames(
c(t(outer(
1:nrow(A),
1:nrow(A),
Vectorize(function(x, y) unique(na.omit(c(t(AA[c(x, y)])))))
))),
t(outer(models, models, paste))
)
}
which gives
$`A1 A1`
[1] 3 8
$`A1 A2`
[1] 3 8 5 7
$`A1 A3`
[1] 3 8 5
$`A2 A1`
[1] 5 7 3 8
$`A2 A2`
[1] 5 7
$`A2 A3`
[1] 5 7 3 8
$`A3 A1`
[1] 3 5 8
$`A3 A2`
[1] 3 5 8 7
$`A3 A3`
[1] 3 5 8
If you care about the speed, you can try
lst <- asplit(replace(A, A == 0, NA), 1)
mat <- matrix(list(), nrow = nrow(A), ncol = nrow(A))
mat[lower.tri(mat)] <- combn(lst, 2, function(...) unique(na.omit(unlist(...))), simplify = FALSE)
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
diag(mat) <- Map(function(x) unname(x)[!is.na(x)], lst)
L <- c(t(mat))
Thanks for posting the additional information in your edits. From what you posted, it appears that for all pairs of rows of a matrix or data table A, you want the unique non-zero values in those two rows.
To do that efficiently I'd suggest ensuring that A is a matrix. Row indexing in dataframes or data tables is a lot slower than doing so in matrices. (Column indexing can be faster, but I doubt if it's worth transposing the table to get that.)
Once you have a matrix, A[i, ] is a vector containing the values in row i, and that's a pretty fast calculation. You want the unique non-zero values in c(A[i, ], A[j, ]). The unique function will produce this, but won't leave out the zeros. I'd suggest experimenting. Depending on the contents of each row, it is conceivable that leaving the zeros out of the rows first before computing the unique entries could be either faster or slower than calculating all the unique values and deleting 0 afterwards.
You say you want to do this for a few hundred rows, but each row is very long. I'd guess you won't be able to improve much on nested loops: the time will be spent on each entry, not on the loops. However, you could experiment with vectorization using the apply() function, e.g.
result <- vector("list", nrows)
for (i in 1:nrows)
result[[i]] <- apply(A, 1, function(row) setdiff(unique(c(row, A[i,])), 0))
This will give a list of lists; if you want to examine entry i, j, you can use result[[c(i,j)]].

Index n-dimensional array with vector and a blank/dummy index

How do you index a n dimensional array in R using a vector, but leaving eg one index blank/dummy index? Here is an example.
m <- array(1:9, c(3,3,3))
m[,c(1,2)]
#Error in m[, c(1, 2)] : incorrect number of dimensions
m[,1,2]
#[1] 1 2 3
I want
m[,c(1,2)]
#[1] 1 2 3
This works with a matrix (https://rspatial.org/intr/4-indexing.html)
m <- matrix(1:9, nrow=3, ncol=3, byrow=TRUE)
colnames(m) <- c('a', 'b', 'c')
m
m[, c('a', 'c')]
## a c
## [1,] 1 3
## [2,] 4 6
## [3,] 7 9
An option is to construct a matrix
ind <- do.call(cbind, c(list(seq(dim(m)[1])), as.list(c(1, 2))))
m[ind]
#[1] 1 2 3
Or as #user20650 commented
do.call("[", c(list(m), TRUE, 1, 2))

Naming objects from functions

I am a beginner in R. I have a vast data set and I am trying to create a loop to optimize the time.
I have something like:
a <- c ('exam12', 'example22', 'e33')
b <- list (c (2,4,5,6), c (10,4,8,6), c (25, 3, 7, 30))
And I would like to use the strings of a as the name of objects for other values, obtaining, in my environment, something like:
exam <- c (2,4,5,6)
example <- c (10,4,8,6)
e <- c (25, 3, 7, 30)
I tried the following:
for (i in seq_along (a)) {
for (j in seq_along (b)) {
str_sub (a [i], start = 1, end = -1) <- b [j]
}
}
But I was not successful. I appreciate any help.
You can use list2env:
a <- c ('exam12', 'example22', 'e33')
b <- list (c (2,4,5,6), c (10,4,8,6), c (25, 3, 7, 30))
a
# [1] "exam12" "example22" "e33"
b
# [[1]]
# [1] 2 4 5 6
#
# [[2]]
# [1] 10 4 8 6
#
# [[3]]
# [1] 25 3 7 30
ls()
# [1] "a" "b"
list2env(setNames(b, sub("\\d+$", "", a)), .GlobalEnv)
# <environment: R_GlobalEnv>
ls()
# [1] "a" "b" "e" "exam" "example"
exam
# [1] 2 4 5 6
For reference, you could also do this with assign, for example:
for (i in seq_along(a)) {
assign(sub("\\d+$", "", a[i]), b[[i]])
}

Find indices of vector elements in a another vector

This extends a previous question I asked.
I have 2 vectors:
a <- c("a","b","c","d","e","d,e","f")
b <- c("a","b","c","d,e","f")
I created b from a by eliminating elements of a that are contained in other, comma separated, elements in a (e.g., "d" and "e" in a are contained in "d,e" and therefore only "d,e" is represented in b).
I am looking for an efficient way to map between indices of the elements of a and b.
Specifically, I would like to have a list of the length of b where each element is a vector with the indices of the elements in a that map to that b element.
For this example the output should be:
list(1, 2, 3, c(4,5,6), 7)
Modifying slightly from my answer at your previous question, try:
a <- c("a","b","c","d","e","d,e","f")
b <- c("a","b","c","d,e","f")
B <- setNames(lapply(b, gsub, pattern = ",", replacement = "|"), seq_along(b))
lapply(B, function(x) which(grepl(x, a)))
# $`1`
# [1] 1
#
# $`2`
# [1] 2
#
# $`3`
# [1] 3
#
# $`4`
# [1] 4 5 6
#
# $`5`
# [1] 7

R print out matrix with row and column names using apply

I have the following matrix 'x'
a b
a 1 3
b 2 4
It is a really large matrix (trimmed down for this question)
I would like to print out this matrix by each row name and column name combination along with the value in that cell. So the expected output would be
a,a,1
a,b,3
b,a,2
b,b,4
I could loop through them, but I'm pretty sure this can be avoided (apply?). Any pointers much appreciated.
One way is to use the melt function from the reshape2 package.
x <- matrix(1:4, nrow = 2, ncol = 2,
dimnames = list(dim1 = c("a", "b"), dim2 = c("a", "b")))
library(reshape2)
melt(x)
# dim1 dim2 value
# 1 a a 1
# 2 b a 2
# 3 a b 3
# 4 b b 4
Edit
If your data is so big that speed is an issue, I would also suggest:
data.frame(dim1 = rep(rownames(x), ncol(x)),
dim2 = rep(colnames(x), each = nrow(x)),
value = c(x))
Edit2
After testing with relatively big data, I would not rule out melt:
x <- matrix(runif(9e6), nrow = 3000, ncol = 3000,
dimnames = list(dim1 = paste0("x", runif(3000)),
dim2 = paste0("y", runif(3000))))
system.time(y1 <- melt(x))
# user system elapsed
# 1.17 0.44 1.61
system.time(y2 <- data.frame(dim1 = rep(rownames(x), ncol(x)),
dim2 = rep(colnames(x), each = nrow(x)),
value = c(x)))
# user system elapsed
# 1.98 0.37 2.36
You can also use the base R function row and col
If you want to reference the row.names and col.names then use as.factor = T. Using as.character and as.numeric flattens the matrix.
do.call(data.frame,list(lapply(list(row = row(x, T),col=col(x,T)), as.character),
value =as.numeric(x)))
## row col value
## 1 a a 1
## 2 b a 2
## 3 a b 3
## 4 b b 4
If you want a matrix you will need to have all the columns as the same class (character or numeric. You could then use
do.call(cbind, lapply(list(row = row(x), col = col(x), value = x), as.numeric))
## row col value
## [1,] 1 1 1
## [2,] 2 1 2
## [3,] 1 2 3
## [4,] 2 2 4
Or as character
do.call(cbind, lapply(list(row = row(x, T), col = col(x, T), value = x), as.character))
## row col value
## [1,] "a" "a" "1"
## [2,] "b" "a" "2"
## [3,] "a" "b" "3"
## [4,] "b" "b" "4"

Resources