Short version:
How does one programmatically select sub-arrays from an n-dimensional array when n is arbitrary?
(If the short version of this question is clear enough, feel free to skip the rest of this post.)
Suppose that A is an array such that dim(A) is the vector of positive integers (d1, d2, …, dn), with n > 2.
For example:
> d <- 5:2
> set.seed(0)
> A <- array(runif(prod(d)), dim = d)
Here the array A corresponds to the definition given earlier, with n = 4, and dk = 6 - k, for k ∈ {1, 2, 3, 4}.
Then, if 1 ≤ i ≤ d1 and 1 ≤ j ≤ d2, the expression A[i, j … ] (where … is a placeholder for n - 2 commas) evaluates to an (n - 2)-dimensional array.
To continue the previous example, if we take i = 3 and j = 2, my notation A[i, j … ] would denote the (n - 2 = 2)-dimensional array shown below:
> A[3, 2, ,]
[,1] [,2]
[1,] 0.94467527 0.4785452
[2,] 0.01339033 0.7111212
[3,] 0.02333120 0.1293723
More generally, if
1 ≤ k1 < k2 < … < km ≤ n
and
1 ≤ ir ≤ dkr, ∀r ∈ {1, … m}, then an expression of the general form
A[ … i1 … i2 … … im … ]
...(where the …'s are placeholders for sequences of indices ik and commas), evaluates to an (n - m)-dimensional array.
For example,
> d <- c(4, 2, 5, 4, 2, 7, 3)
> set.seed(1)
> A <- array(runif(prod(d)), dim = d)
> A[3, 1, 4, , 1, 6, ]
[,1] [,2] [,3]
[1,] 0.5320469 0.77282382 0.18034186
[2,] 0.6817434 0.08627063 0.77227529
[3,] 0.8572805 0.32337850 0.63322550
[4,] 0.6555618 0.20578391 0.01257377
Now, one can write out expressions like A[i, j … ] and
A[ … i1 … i2 … … im … ] in full (i.e. filling in all the … placeholders) only if one knows n.
Of course, when one is working interactively, one usually knows (or can easily find out) what n is, and can use this knowledge to decide how many commas to insert in, e.g., A[i, j … ]. This is not the case, however, when one is writing code to work with multi-dimensional arrays of any number of dimensions.
How would one express selections such as A[i, j … ] and A[ … i1 … i2 … … im … ] when one does not know n?
Perhaps this will work for you:
func <- function(ary, ..., drop = TRUE) {
d <- length(dim(ary))
dots <- list(...)
if (length(dots) > d) stop("incorrect number of dimensions")
rest <- rep(TRUE, d - length(dots))
do.call(`[`, c(list(ary), c(dots, rest, drop = drop)))
}
Using your data:
d <- rev(2:5)
set.seed(0)
A <- array(runif(prod(d)), dim = d)
You normally need to know how many commas to include for the correct dimensionality:
A[3,2]
# Error in A[3, 2] : incorrect number of dimensions
This function "fills in" the rest of it for you:
func(A, 3, 2)
# [,1] [,2]
# [1,] 0.94467527 0.4785452
# [2,] 0.01339033 0.7111212
# [3,] 0.02333120 0.1293723
func(A, 3)
# , , 1
# [,1] [,2] [,3]
# [1,] 0.3721239 0.21214252 0.6470602
# [2,] 0.9446753 0.01339033 0.0233312
# [3,] 0.1765568 0.59956583 0.8612095
# [4,] 0.7176185 0.79423986 0.3162717
# , , 2
# [,1] [,2] [,3]
# [1,] 0.2936034 0.71251468 0.3531973
# [2,] 0.4785452 0.71112122 0.1293723
# [3,] 0.8394404 0.05893438 0.7317925
# [4,] 0.8643395 0.45527445 0.7155661
It correctly handles all dimensions:
A[3,2,1,1]
# [1] 0.9446753
func(A, 3, 2, 1, 1)
# [1] 0.9446753
And errors similarly with too many dimensions:
A[3,2,1,1,1]
# Error in A[3, 2, 1, 1, 1] : incorrect number of dimensions
func(A, 3, 2, 1, 1, 1)
# Error in func(A, 3, 2, 1, 1, 1) (from #4) : incorrect number of dimensions
Edit: and the part that I missed. In order to catch blanks, we need to have a little fun.
func <- function(ary, ..., drop = TRUE) {
d <- length(dim(ary))
dots <- as.list(match.call()[-(1:2)])
if (length(dots) > d) stop("incorrect number of dimensions")
pf <- parent.frame()
dots <- lapply(seq_along(dots), function(i) {
x <- dots[[i]]
# a blank argument arrives as the empty symbol; treat it as "select everything"
if (missing(x)) TRUE else eval(dots[[i]], envir = pf)
})
rest <- rep(TRUE, d - length(dots))
do.call(`[`, c(list(ary), c(dots, rest, drop = drop)))
}
I had a simpler version of this function (without the lapply), but it tended to fail if any of the positional arguments were variables rather than literals.
d <- c(4, 2, 5, 4, 2, 7, 3)
set.seed(1)
A <- array(runif(prod(d)), dim = d)
A[3, 1, 4, , 1, 6, ]
# [,1] [,2] [,3]
# [1,] 0.007668596 0.1818094 0.3278203
# [2,] 0.286473525 0.4119333 0.4825088
# [3,] 0.008869468 0.4767760 0.7649491
# [4,] 0.330141563 0.3438217 0.8710419
func(A, 3, 1, 4, , 1, 6)
# [,1] [,2] [,3]
# [1,] 0.007668596 0.1818094 0.3278203
# [2,] 0.286473525 0.4119333 0.4825088
# [3,] 0.008869468 0.4767760 0.7649491
# [4,] 0.330141563 0.3438217 0.8710419
i <- 3
func(A, i, 1, 2+2, , 1, 6)
# [,1] [,2] [,3]
# [1,] 0.007668596 0.1818094 0.3278203
# [2,] 0.286473525 0.4119333 0.4825088
# [3,] 0.008869468 0.4767760 0.7649491
# [4,] 0.330141563 0.3438217 0.8710419
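For the fully general case in the question, where the fixed indices sit at arbitrary dimensions k1 < k2 < … < km, one possible sketch builds the index list explicitly instead of relying on blank arguments. The name slice_dims and its arguments dims and idx are made up for this illustration; they are not part of base R.

```r
# Hypothetical helper: fix the dimensions listed in `dims` at the values in
# `idx`, and select everything along all remaining dimensions.
slice_dims <- function(ary, dims, idx, drop = TRUE) {
  n <- length(dim(ary))
  stopifnot(length(dims) == length(idx), all(dims >= 1), all(dims <= n))
  args <- rep(list(TRUE), n)   # TRUE selects a whole dimension
  args[dims] <- as.list(idx)   # overwrite the fixed dimensions
  do.call(`[`, c(list(ary), args, list(drop = drop)))
}

d <- 5:2
set.seed(0)
A <- array(runif(prod(d)), dim = d)
same_12 <- identical(slice_dims(A, c(1, 2), list(3, 2)), A[3, 2, , ])
same_24 <- identical(slice_dims(A, c(2, 4), list(2, 1)), A[, 2, , 1])
```

Because the dimensions are passed as data, the caller never needs to know how many commas a literal A[...] expression would require.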
I was reviewing the Python question "How to remove every occurrence of sub-list from list".
Now I want to know what creative ways there are to do the same in R.
For example, removing every occurrence of sub_list from main_list.
main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
desired result: 2 3 4 2 2 1
My suggestions:
a<-c()
for(i in 1:(length(main_list)-1)){
if (all(main_list[c(i,i+1)]==sub_list))
{a<-c(a,c(i,i+1))}
}
main_list[-a]
[1] 2 3 4 2 2 1
Second suggestion:
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
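Oh, the second suggestion is really dangerous: collapsing without a separator makes the single element 12 indistinguishable from the pair 1, 2. Let's try: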
main_list = c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1)
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
[1] 2 3 4 2 2 1
However, the loop version keeps the 12:
a<-c()
for(i in 1:(length(main_list)-1)){
if (all(main_list[c(i,i+1)]==sub_list))
{a<-c(a,c(i,i+1))}
}
main_list[-a]
[1] 2 3 12 4 2 2 1
Update Sat Sep 08 2018
Benchmarking Solutions:
I benchmarked the solutions on the time and memory each one takes, using a big numeric vector and the profmem and microbenchmark packages.
set.seed(1587)
main_list<-sample(c(8:13,102:105),size = 10000000,replace = T)
main_list<-c(c(8,9,12,103),main_list,c(8,9,12,103))
sub_list<-c(8,9,12,103)
d.b's solution does not work for main_list so I modified it as follows:
ML = paste(main_list, collapse = ",") # collapse should not be empty
SL = paste(sub_list, collapse = ",")
out<-gsub(SL, "", ML)
out<-gsub(",{2,}",",",out) # collapse every run of commas left behind by removals
out<-gsub("^,|,$","",out) # then drop a leading or trailing comma
out<-as.numeric(unlist(strsplit(out,split = ",")))
The result:
  solution       seconds memory_byte memory_base seconds_base
  <chr>            <dbl>       <dbl>       <dbl>        <dbl>
1 d.b              26.0    399904560        1           16.8
2 Grothendieck_2    1.55  1440070304        3.60         1
3 Grothendieck_1  109.    4968036376       12.4         70.3
4 李哲源            2.17  1400120824        3.50         1.40
Any comment about the benchmarking?
Here are two solutions. The first one is obviously simpler and would be used if you favour clarity and maintainability while the second one has no package dependencies and is faster.
1) zoo. Use a moving window to compare each subsequence of c(main_list, sub_list) having the required length to sub_list. (We append sub_list to ensure that there is always something to remove.) This statement returns TRUE or FALSE according to whether the current position is the end of a matching subsequence. Then compute the TRUE index numbers, derive from them the indices of all elements to be removed, and remove them.
library(zoo)
w <- length(sub_list)
r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE)
main_list[-c(outer(which(r), seq_len(w) - 1, "-"))]
## [1] 2 3 4 2 2 1
2) Base R. The middle line setting r has the same purpose as the corresponding line in (1), and the last line is the same as the last line in (1) except that we use + instead of - because embed effectively uses left alignment.
w <- length(sub_list)
r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
## [1] 2 3 4 2 2 1
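To see why (2) compares against rev(sub_list) and adds offsets rather than subtracting them, it may help to look at what embed returns on a tiny input. This is only an illustration; the variable name win is made up here.

```r
# embed(x, w) puts one window of length w in each row, newest value first.
# Row i covers positions i .. i + w - 1 of x, but in reversed order.
win <- embed(1:5, 2)
win
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    4    3
# [4,]    5    4
```

Since row i starts at position i, which(r) marks the first element of each match, so the remaining positions are reached by adding 0, 1, ..., w - 1.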
Here is a function that does this general thing.
xm is a main list of integer / character / logical values;
xs is a sub list of integer /character / logical values.
It is required that length(xm) > length(xs) but no such check is made right now.
foo <- function (xm, xs) {
nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
d <- xm[shift_ind] == xs
first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
if (length(first_drop_ind) > 0L) {
drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
return(xm[-drop_ind])
} else {
return(xm)
}
}
main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
foo(main_list, sub_list)
#[1] 2 3 4 2 2 1
Explanation
xm <- main_list
xs <- sub_list
nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
MAT <- matrix(xm[shift_ind], ns)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] 2 1 2 3 1 2 4 2 2
#[2,] 1 2 3 1 2 4 2 2 1
So the first step is a shifting and matrix representation, as above.
LOGIC <- MAT == xs
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
If a co-occurrence is found, a column should contain all TRUE, i.e., the colSums should be ns. In this way we can identify the location of the first value of the matching.
first_drop_ind <- which(colSums(LOGIC) == ns)
#[1] 2 5
Now we need to expand it to cover the subsequent values after those initial matches.
drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
# [,1] [,2]
#[1,] 2 5
#[2,] 3 6
Finally we remove values at those positions from xm:
xm[-drop_ind]
#[1] 2 3 4 2 2 1
Note that in the function, the matrix is not explicitly formed. .colSums is used instead of colSums.
watch out for bug
The if ... else ... in the function is necessary. If no match is found, drop_ind is empty, and xm[-drop_ind] is then the same as xm[integer(0)], which returns a zero-length vector instead of xm unchanged.
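A minimal demonstration of that pitfall, using a made-up short vector:

```r
xm <- c(2, 1, 2, 3)
drop_ind <- integer(0)      # what "no match" produces

# Negating an empty index is still an empty index, so nothing is selected
# (rather than nothing being dropped).
wrong <- xm[-drop_ind]

# Guarding on the length keeps the vector intact when there is no match.
right <- if (length(drop_ind) > 0L) xm[-drop_ind] else xm
```

Here wrong has length 0, while right is the original four values.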
comparison with zoo::rollapplyr
## require package `zoo`
bar <- function (xm, xs) {
w <- length(xs)
r <- rollapplyr(xm, w, identical, xs, fill = FALSE)
# r always has the same length as xm, so test for an actual match with any()
if (any(r)) {
return(xm[-c(outer(which(r), seq_len(w) - 1, "-"))])
} else {
return(xm)
}
}
set.seed(0)
xm <- sample.int(10, 10000, TRUE)
xs <- 1:2
library(zoo)
system.time(a <- foo(xm, xs))
# user system elapsed
# 0.004 0.000 0.001
system.time(b <- bar(xm, xs))
# user system elapsed
# 0.276 0.000 0.273
all.equal(a, b)
#[1] TRUE
I guess that rollapplyr is slower because
it needs to first coerce xm to a "zoo" object;
internally it uses lapply so that there is a frequent jump between R and C.
I want to apply a function over one margin (columns in my example) of a matrix. The problem is that the function returns a matrix, and apply flattens each result into a vector, so the overall result is again a matrix. My goal is to get a three-dimensional array. Here is an example (note that matrix() is not the function of interest, just an example):
x <- matrix(1:12, 4, 3)
apply(x, 2, matrix, nrow = 2, ncol = 2)
The output is exactly the same as the input. I have a pretty dull solution to this:
library(abind)
library(magrittr) # provides %>%
abind2 <- function (x, ...)
abind(x, ..., along = length(dim(x)) + 1) # bind along a new last dimension
apply(x, 2, list) %>%
lapply(unlist) %>%
lapply(matrix, nrow = 2, ncol = 2) %>%
do.call(what = 'abind2')
I believe there must exist something better than this. Something that does not include list()ing and unlist()ing columns.
Edit:
Also, the solution should be easy to apply to an array of any dimension with any choice of MARGIN, which mine is not.
Here, for example, I want to get a 4-dimensional array back.
x <- array(1:24, c(4,3,2))
apply(x, 2:3, list) %>%
lapply(unlist) %>%
lapply(matrix, nrow = 2, ncol = 2) %>%
do.call(what = 'abind2')
Not that complicated at all. Simply use
array(x, dim = c(2, 2, ncol(x)))
Matrices and general arrays are stored in memory column by column as one long 1-D vector, so you can simply reassign the dimensions.
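As a small check of that claim on the question's matrix, the sketch below reshapes x two ways and compares one slice with what matrix() would give for the corresponding column. The names y and z are only for illustration.

```r
x <- matrix(1:12, 4, 3)

# Build a new array from the same underlying values ...
y <- array(x, dim = c(2, 2, ncol(x)))

# ... or just overwrite the dim attribute on a copy.
z <- x
dim(z) <- c(2, 2, 3)

# Slice j of the result is column j of x laid out as a 2 x 2 matrix.
y[, , 2]
#      [,1] [,2]
# [1,]    5    7
# [2,]    6    8
```

This only works because the function of interest here (matrix) does not change the values; a function that transforms them still needs an explicit loop or apply call.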
OK, here is possibly what you want to do in general:
tapply(x, col(x), FUN = matrix, nrow = 2, ncol = 2)
#$`1`
# [,1] [,2]
#[1,] 1 3
#[2,] 2 4
#
#$`2`
# [,1] [,2]
#[1,] 5 7
#[2,] 6 8
#
#$`3`
# [,1] [,2]
#[1,] 9 11
#[2,] 10 12
You can convert your matrix into a data.frame and use lapply to apply your function to its columns (a data.frame is a list). This returns a list in which each element is the function result for one column:
lapply(as.data.frame(x), matrix, nrow = 2, ncol = 2)
# $V1
# [,1] [,2]
# [1,] 1 3
# [2,] 2 4
# $V2
# [,1] [,2]
# [1,] 5 7
# [2,] 6 8
# $V3
# [,1] [,2]
# [1,] 9 11
# [2,] 10 12
EDIT with the second definition of x:
x <- array(1:24, c(4,3,2))
lapply(as.data.frame(x), matrix, nrow = 2, ncol = 2)
# $V1
# [,1] [,2]
# [1,] 1 3
# [2,] 2 4
# $V2
# [,1] [,2]
# [1,] 5 7
# [2,] 6 8
# $V3
# [,1] [,2]
# [1,] 9 11
# [2,] 10 12
# $V4
# [,1] [,2]
# [1,] 13 15
# [2,] 14 16
# $V5
# [,1] [,2]
# [1,] 17 19
# [2,] 18 20
# $V6
# [,1] [,2]
# [1,] 21 23
# [2,] 22 24
EDIT2: an attempt to get an array as the result
Based on this similar question, you may try this code:
x <- array(1:24, c(4,3,2))
sapply(1:3,
function(y) sapply(1:ncol(x[, y, ]),
function(z) matrix(x[,y,z], ncol=2, nrow=2),
simplify="array"),
simplify="array")
Dimension of the result is 2 2 2 3.
Actually, the problem here is that two nested calls are needed when x is an array with more than 2 dimensions. In the last example of the question (with x <- array(1:24, c(4,3,2))), the outer sapply loops over the second dimension and, for each of its elements, the inner sapply loops over the third dimension and calls matrix on the corresponding slice.
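An alternative sketch, assuming the function returns a result of the same known shape (here 2 x 2) for every slice: let apply flatten each result, then restore the shape by resetting dim. Note that the margin dimensions come last in their original order, so the result is 2 2 3 2 rather than the 2 2 2 3 obtained above.

```r
x <- array(1:24, c(4, 3, 2))
MARGIN <- 2:3
out_dim <- c(2, 2)                 # assumed shape of each FUN result

# apply() returns each 2 x 2 result flattened into a column of length 4,
# giving an array of dimension c(4, 3, 2).
res <- apply(x, MARGIN, matrix, nrow = 2, ncol = 2)

# Split the first dimension back into the shape of a single result.
dim(res) <- c(out_dim, dim(x)[MARGIN])
dim(res)
# [1] 2 2 3 2
```

Each res[, , j, k] then equals matrix(x[, j, k], 2, 2), and the same two steps work for any MARGIN as long as out_dim is known in advance.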
I'd like to insert a data frame into a data frame element, such that if I called df1[1,1] I would get:
[A B]
[C D]
I thought this was possible in R, but perhaps I am mistaken. In a project of mine, I am essentially working with a 50x50 matrix in which I'd like each element to hold a column of numeric data with labeled rows.
Trying to do something like df1[1,1] <- df2 yields the following warning:
Warning message:
In `[<-.data.frame`(`*tmp*`, i, j, value = list(DJN.10 = c(0, 3, :
replacement element 1 has 144 rows to replace 1 rows
Calling df1[1,1] afterwards yields 0. I've tried inserting the data in various ways, such as with as.vector() and as.list(), with no success.
Best,
Perhaps a matrix could work for you, like so:
x <- matrix(list(), nrow=2, ncol=3)
print(x)
# [,1] [,2] [,3]
#[1,] NULL NULL NULL
#[2,] NULL NULL NULL
x[[1,1]] <- data.frame(a=c("A","C"), b=c("B","D"))
x[[1,2]] <- data.frame(c=2:3)
x[[2,3]] <- data.frame(x=1, y=2:4)
x[[2,1]] <- list(1,2,3,5)
x[[1,3]] <- list("a","b","c","d")
x[[2,2]] <- list(1:5)
print(x)
# [,1] [,2] [,3]
#[1,] List,2 List,1 List,4
#[2,] List,4 List,1 List,2
x[[1,1]]
# a b
#1 A B
#2 C D
class(x)
#[1] "matrix"   (R >= 4.0.0 prints "matrix" "array")
typeof(x)
#[1] "list"
See here for details.
Each column in your data.frame can be a list. Just make sure that the list is as long as the number of rows in your data.frame.
Columns can be added using the standard $ notation.
Example:
x <- data.frame(matrix(NA, nrow=2, ncol=3))
x$X1 <- I(list(data.frame(a=c("A","C"), b=c("B","D")), matrix(1:10, ncol = 5)))
x$X2 <- I(list(data.frame(c = 2:3), list(1, 2, 3, 4)))
x$X3 <- I(list(list("a", "b", "c"), 1:5))
x
# X1 X2 X3
# 1 1:2, 1:2 2:3 a, b, c
# 2 1, 2, 3,.... 1, 2, 3, 4 1, 2, 3,....
x[1, 1]
# [[1]]
# a b
# 1 A B
# 2 C D
#
x[2, 1]
# [[1]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 3 5 7 9
# [2,] 2 4 6 8 10
I would like to quickly find the top k values in each row of a matrix and set every value that is not among them to zero. I currently use the solution below. Can somebody improve it? It is not very fast when the matrix has many rows.
thanks.
mat <- matrix(c(5, 1, 6, 4, 9, 1, 8, 9, 10), nrow = 3, byrow = TRUE)
sortedMat <- t(apply(mat, 1, function(x) sort(x, decreasing = TRUE, method = "quick")))
topK <- 2
sortedMat <- sortedMat[, 1:topK, drop = FALSE]
lmat <- mat
for (i in 1:nrow(mat)) {
lmat[i, ] <- mat[i, ] %in% sortedMat[i, ]
}
kMat <- mat * lmat
> mat
[,1] [,2] [,3]
[1,] 5 1 6
[2,] 4 9 1
[3,] 8 9 10
> kMat
[,1] [,2] [,3]
[1,] 5 0 6
[2,] 4 9 0
[3,] 0 9 10
In Rfast, the command sort_mat sorts the columns of a matrix, colOrder gives the order for each column, colRanks gives the ranks for each column, and colnth gives the nth value for each column. I believe at least one of them will suit you.
You could use rank to speed this up. In case there are ties, you would have to decide on a method to break these (e.g. ties.method = "random").
kmat <- function(mat, k){
mat[t(apply(mat, 1, rank)) <= (ncol(mat)-k)] <- 0
mat
}
kmat(mat, 2)
## [,1] [,2] [,3]
## [1,] 5 0 6
## [2,] 4 9 0
## [3,] 0 9 10
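To illustrate the remark about ties, here is a small variant that passes a tie-breaking rule through to rank. The name kmat_ties and the one-row test matrix are made up for this sketch.

```r
# Same idea as kmat(), with the tie-breaking rule exposed as an argument.
kmat_ties <- function(mat, k, ties.method = "average") {
  r <- t(apply(mat, 1, rank, ties.method = ties.method))
  mat[r <= (ncol(mat) - k)] <- 0
  mat
}

m <- matrix(c(9, 9, 1), nrow = 1)

# Default ranks are 2.5, 2.5, 1: both tied maxima survive although k = 1.
keep_avg <- kmat_ties(m, 1)

# "first" gives ranks 2, 3, 1: exactly k values survive.
keep_first <- kmat_ties(m, 1, ties.method = "first")
```

With the default rule keep_avg is 9 9 0, while with "first" keep_first is 0 9 0, so the choice matters whenever exactly k non-zero entries per row are required.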
I would like to determine the k nearest neighbors for each row of a matrix. Here is my code. I think data.table could do this more efficiently, but so far I cannot figure out how.
mm <- matrix(c(5, 1, 2, 3, 5, 4), 3, 2)
mm <- tcrossprod(mm)
getNearest <- function(simmat, k = 2) {
# apply() returns one column per row of simmat, so transpose back to rows
res <- t(apply(simmat, 1, function(x) {
s <- sort(x, index.return = TRUE)
x[s$ix[1:(length(x) - k)]] <- 0
return(x)
}))
return(res / rowSums(res))
}
getNearest(mm, k = 2)
> getNearest(mm, k = 2)
[,1] [,2] [,3]
[1,] 0.6071429 0.0000000 0.3928571
[2,] 0.0000000 0.5416667 0.4583333
[3,] 0.5000000 0.5000000 0.0000000
The original mm is:
> mm
[,1] [,2] [,3]
[1,] 34 20 22
[2,] 20 26 22
[3,] 22 22 20
For a 2000 by 2000 matrix and k = 30, this takes 1.17 seconds. Is there a more efficient way to do the same thing, perhaps with data.table?
Thanks.