Check that a vector is contained in a matrix in R - r

I can't believe this is taking me this long to figure out, and I still can't figure it out.
I need to keep a collection of vectors, and later check that a certain vector is in that collection. I tried lists combined with %in% but that doesn't appear to work properly.
My next idea was to create a matrix and rbind vectors to it, but now I don't know how to check if a vector is contained in a matrix. %in appears to compare sets and not exact rows. Same appears to apply to intersect.
Help much appreciated!

Do you mean like this:
wantVec <- c(3,1,2)
myList <- list(A = c(1:3), B = c(3,1,2), C = c(2,3,1))
sapply(myList, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or, is the vector in the set?
any(sapply(myList, function(x, want) isTRUE(all.equal(x, want)), wantVec))
We can do a similar thing with a matrix:
myMat <- matrix(unlist(myList), ncol = 3, byrow = TRUE)
## As the vectors are now in the rows, we use apply over the rows
apply(myMat, 1, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or
any(apply(myMat, 1, function(x, want) isTRUE(all.equal(x, want)), wantVec))
Or by columns:
myMat2 <- matrix(unlist(myList), ncol = 3)
## As the vectors are now in the cols, we use apply over the cols
apply(myMat, 2, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or
any(apply(myMat, 2, function(x, want) isTRUE(all.equal(x, want)), wantVec))
If you need to do this a lot, write your own function
vecMatch <- function(x, want) {
isTRUE(all.equal(x, want))
}
And then use it, e.g. on the list myList:
> sapply(myList, vecMatch, wantVec)
A B C
FALSE TRUE FALSE
> any(sapply(myList, vecMatch, wantVec))
[1] TRUE
Or even wrap the whole thing:
vecMatch <- function(x, want) {
out <- sapply(x, function(x, want) isTRUE(all.equal(x, want)), want)
any(out)
}
> vecMatch(myList, wantVec)
[1] TRUE
> vecMatch(myList, 5:3)
[1] FALSE
EDIT: Quick comment on why I used isTRUE() wrapped around the all.equal() calls. This is due to the fact that where the two arguments are not equal, all.equal() doesn't return a logical value (FALSE):
> all.equal(1:3, c(3,2,1))
[1] "Mean relative difference: 1"
isTRUE() is useful here because it returns TRUE iff it's argument is TRUE, whilst it returns FALSE if it is anything else.

> M
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
v <- c(2, 5, 8)
check each column:
c1 <- which(M[, 1] == v[1])
c2 <- which(M[, 2] == v[2])
c3 <- which(M[, 3] == v[3])
Here is a way to still use intersect() on more than 2 elements
> intersect(intersect(c1, c2), c3)
[1] 2

Related

concatenation sublists of two different lists [duplicate]

I have two lists
first = list(a = 1, b = 2, c = 3)
second = list(a = 2, b = 3, c = 4)
I want to merge these two lists so the final product is
$a
[1] 1 2
$b
[1] 2 3
$c
[1] 3 4
Is there a simple function to do this?
If lists always have the same structure, as in the example, then a simpler solution is
mapply(c, first, second, SIMPLIFY=FALSE)
This is a very simple adaptation of the modifyList function by Sarkar. Because it is recursive, it will handle more complex situations than mapply would, and it will handle mismatched name situations by ignoring the items in 'second' that are not in 'first'.
appendList <- function (x, val)
{
stopifnot(is.list(x), is.list(val))
xnames <- names(x)
for (v in names(val)) {
x[[v]] <- if (v %in% xnames && is.list(x[[v]]) && is.list(val[[v]]))
appendList(x[[v]], val[[v]])
else c(x[[v]], val[[v]])
}
x
}
> appendList(first,second)
$a
[1] 1 2
$b
[1] 2 3
$c
[1] 3 4
Here are two options, the first:
both <- list(first, second)
n <- unique(unlist(lapply(both, names)))
names(n) <- n
lapply(n, function(ni) unlist(lapply(both, `[[`, ni)))
and the second, which works only if they have the same structure:
apply(cbind(first, second),1,function(x) unname(unlist(x)))
Both give the desired result.
Here's some code that I ended up writing, based upon #Andrei's answer but without the elegancy/simplicity. The advantage is that it allows a more complex recursive merge and also differs between elements that should be connected with rbind and those that are just connected with c:
# Decided to move this outside the mapply, not sure this is
# that important for speed but I imagine redefining the function
# might be somewhat time-consuming
mergeLists_internal <- function(o_element, n_element){
if (is.list(n_element)){
# Fill in non-existant element with NA elements
if (length(n_element) != length(o_element)){
n_unique <- names(n_element)[! names(n_element) %in% names(o_element)]
if (length(n_unique) > 0){
for (n in n_unique){
if (is.matrix(n_element[[n]])){
o_element[[n]] <- matrix(NA,
nrow=nrow(n_element[[n]]),
ncol=ncol(n_element[[n]]))
}else{
o_element[[n]] <- rep(NA,
times=length(n_element[[n]]))
}
}
}
o_unique <- names(o_element)[! names(o_element) %in% names(n_element)]
if (length(o_unique) > 0){
for (n in o_unique){
if (is.matrix(n_element[[n]])){
n_element[[n]] <- matrix(NA,
nrow=nrow(o_element[[n]]),
ncol=ncol(o_element[[n]]))
}else{
n_element[[n]] <- rep(NA,
times=length(o_element[[n]]))
}
}
}
}
# Now merge the two lists
return(mergeLists(o_element,
n_element))
}
if(length(n_element)>1){
new_cols <- ifelse(is.matrix(n_element), ncol(n_element), length(n_element))
old_cols <- ifelse(is.matrix(o_element), ncol(o_element), length(o_element))
if (new_cols != old_cols)
stop("Your length doesn't match on the elements,",
" new element (", new_cols , ") !=",
" old element (", old_cols , ")")
}
return(rbind(o_element,
n_element,
deparse.level=0))
return(c(o_element,
n_element))
}
mergeLists <- function(old, new){
if (is.null(old))
return (new)
m <- mapply(mergeLists_internal, old, new, SIMPLIFY=FALSE)
return(m)
}
Here's my example:
v1 <- list("a"=c(1,2), b="test 1", sublist=list(one=20:21, two=21:22))
v2 <- list("a"=c(3,4), b="test 2", sublist=list(one=10:11, two=11:12, three=1:2))
mergeLists(v1, v2)
This results in:
$a
[,1] [,2]
[1,] 1 2
[2,] 3 4
$b
[1] "test 1" "test 2"
$sublist
$sublist$one
[,1] [,2]
[1,] 20 21
[2,] 10 11
$sublist$two
[,1] [,2]
[1,] 21 22
[2,] 11 12
$sublist$three
[,1] [,2]
[1,] NA NA
[2,] 1 2
Yeah, I know - perhaps not the most logical merge but I have a complex parallel loop that I had to generate a more customized .combine function for, and therefore I wrote this monster :-)
merged = map(names(first), ~c(first[[.x]], second[[.x]])
merged = set_names(merged, names(first))
Using purrr. Also solves the problem of your lists not being in order.
In general one could,
merge_list <- function(...) by(v<-unlist(c(...)),names(v),base::c)
Note that the by() solution returns an attributed list, so it will print differently, but will still be a list. But you can get rid of the attributes with attr(x,"_attribute.name_")<-NULL. You can probably also use aggregate().
We can do a lapply with c(), and use setNames to assign the original name to the output.
setNames(lapply(1:length(first), function(x) c(first[[x]], second[[x]])), names(first))
$a
[1] 1 2
$b
[1] 2 3
$c
[1] 3 4
Following #Aaron left Stack Overflow and #Theo answer, the merged list's elements are in form of vector c.
But if you want to bind rows and columns use rbind and cbind.
merged = map(names(first), ~rbind(first[[.x]], second[[.x]])
merged = set_names(merged, names(first))
Using dplyr, I found that this line works for named lists using the same names:
as.list(bind_rows(first, second))

Checking for sequences in an R vector

I'm looking for a function or operation such that if I have
A <- c(1, 2, 3, 4, 5)
and
B <- c(1, 2, 3)
and C <- c(2, 1)
I'd get a TRUE when checking whether A contained B, and FALSE when checking whether A contained C
basically, the equivalent of the %in% operator but that actually cares about the order of elements
In a perfect world, I'd be able to do this without some kind of apply statement, but I may end up having to
Well, if one's allowd to use a kind-of apply loop, then this could work:
"%seq_in%" = function(b,a) any(sapply(1:(length(a)-length(b)+1),function(i) all(a[i:(i+length(b)-1)]==b)))
(edited thanks to bug-finding by John Coleman!)
EDIT 2:
I couldn't resist trying to solve the 'non-contiguous' case, too:
# find_subseq() returns positions within vec of ordered elements of x, or stops with NA upon failing
find_subseq = function(x,vec) {
p=match(x[1],vec)
if(is.na(p)||length(x)==1){ p }
else { c(p,p+find_subseq(x[-1],vec[-seq_len(p)])) }
}
"%seq_somewhere_in%" = function(b,a) all(!is.na(find_subseq(b,a)))
Examples:
1:3 %seq_in% 1:10
[1] TRUE
c(3,1,2) %seq_in% 1:10
[1] FALSE
c(1,2,3) %seq_in% c(3,2,1,2,3)
[1] TRUE
2:1 %seq_in% c(1,2,1)
[1] TRUE
1:3 %seq_somewhere_in% c(1,10,10,2,10,10,10,3,10)
[1] TRUE
Maybe you can define a custom function subseq_check like below
subseq_check <- function(x,y) grepl(toString(y),toString(x),fixed = TRUE)
which gives
> subseq_check(A,B)
[1] TRUE
> subseq_check(A,C)
[1] FALSE
A Hard-core approach
subseq_find <- function(x,y) {
inds <- which(x == head(y,1))
if (length(inds)==0) return(FALSE)
any(sapply(inds, function(k) all(x[k:(k+length(y)-1)]==y)))
}
such that
> subseq_find(A,B)
[1] TRUE
> subseq_find(A,C)
[1] FALSE

How to concisely deal with subsets when their lengths become zero?

To exclude elements from a vector x,
x <- c(1, 4, 3, 2)
we can subtract a vector of positions:
excl <- c(2, 3)
x[-excl]
# [1] 1 2
This also works dynamically,
(excl <- which(x[-which.max(x)] > quantile(x, .25)))
# [1] 2 3
x[-excl]
# [1] 1 2
until excl is of length zero:
excl.nolength <- which(x[-which.max(x)] > quantile(x, .95))
length(excl.nolength)
# [1] 0
x[-excl.nolength]
# integer(0)
I could kind of reformulate that, but I have many objects to which excl is applied, say:
letters[1:4][-excl.nolength]
# character(0)
I know I could use setdiff, but that's rather long and hard to read:
x[setdiff(seq(x), excl.nolength)]
# [1] 1 4 3 2
letters[1:4][setdiff(seq(letters[1:4]), excl.nolength)]
# [1] "a" "b" "c" "d"
Now, I could exploit the fact that nothing is excluded if the element number is greater than the number of elements:
length(x)
# [1] 4
x[-5]
# [1] 1 4 3 2
To generalize that I should probably use .Machine$integer.max:
tmp <- which(x[-which.max(x)] > quantile(x, .95))
excl <- if (!length(tmp) == 0) tmp else .Machine$integer.max
x[-excl]
# [1] 1 4 3 2
Wrapped into a function,
e <- function(x) if (!length(x) == 0) x else .Machine$integer.max
that's quite handy and clear:
x[-e(excl)]
# [1] 1 2
x[-e(excl.nolength)]
# [1] 1 4 3 2
letters[1:4][-e(excl.nolength)]
# [1] "a" "b" "c" "d"
But it seems a little fishy to me...
Is there a better equally concise way to deal with a subset of length zero in base R?
Edit
excl comes out as dynamic result of a function before (as shown with which above) and might be of length zero or not. If length(excl) == 0 nothing should be excluded. Following lines of code, e.g. x[-excl] should not have to be changed at best or as little as possible.
You can overwrite [ with your own function.
"[" <- function(x,y) {if(length(y)==0) x else .Primitive("[")(x,y)}
x <- c(1, 4, 3, 2)
excl <- c(2, 3)
x[-excl]
#[1] 1 2
excl <- integer()
x[-excl]
#[1] 1 4 3 2
rm("[") #Go back to normal mode
I would argue this is somewhat opinion based.
For example i find:
x <- x[-if(length(excl <- which(x[-which.max(x)] > quantile(x, .95))) == 0) .Machine$integer.max else excl]
very unreadable, but some people like one-liners. Reading package code you'll often find this is instead split up into one of the many suggestions you gave
excl <- which(x[-which.max(x)] > quantile(x, .95))
if(length(excl) != 0)
x <- x[-excl]
Alternatively, you could avoid which, and simply use the logical vector for subsetting, and this would likely be considered more clean by most
x <- x[!x[-which.max(x)] > quantile(x, .95)]
This would avoid zero-length index problem, at the cost of some loss of efficiency.
As a side note, the very example used above and in the question seems somewhat off. First which.max only returns the first index which is equal to the max value, and in addition the index will be offset for every value removed. More likely the expected example would be
x <- x[!(x > quantile(x, .95))[-which(x == max(x))]]
How bout this?
a <- letters[1:3]
excl1 <- c(1,3)
excl2 <- c()
a[!(seq_along(a) %in% excl1)]
a[!(seq_along(a) %in% excl2)]

How to store multidimensional subscript as variable in R

Suppose I have a matrix,
mat <- matrix((1:9)^2, 3, 3)
I can slice the matrix like so
> mat[2:3, 2]
[1] 25 36
How does one store the subscript as a variable? That is, what should my_sub be, such that
> mat[my_sub]
[1] 25 36
A list gets "invalid subscript type" error. A vector will lose the multidimensionality. Seems like such a basic operation to not have a primitive type that fits this usage.
I know I can access the matrix via vector addressing, which means converting from [2:3, 2] to c(5, 6), but that mapping presumes knowledge of matrix shape. What if I simply want [2:3, 2] for any matrix shape (assuming it is at least those dimensions)?
Here are some alternatives. They both generalize to higher dimenional arrays.
1) matrix subscripting If the indexes are all scalar except possibly one, as in the question, then:
mi <- cbind(2:3, 2)
mat[mi]
# test
identical(mat[mi], mat[2:3, 2])
## [1] TRUE
In higher dimensions:
a <- array(1:24, 2:4)
mi <- cbind(2, 2:3, 3)
a[mi]
# test
identical(a[mi], a[2, 2:3, 3])
## [1] TRUE
It would be possible to extend this to eliminate the scalar restriction using:
L <- list(2:3, 2:3)
array(mat[as.matrix(do.call(expand.grid, L))], lengths(L))
however, in light of (2) which also uses do.call but avoids the need for expand.grid it seems unnecessarily complex.
2) do.call This approach does not have the scalar limitation. mat and a are from above:
L2 <- list(2:3, 1:2)
do.call("[", c(list(mat), L2))
# test
identical(do.call("[", c(list(mat), L2)), mat[2:3, 1:2])
## [1] TRUE
L3 <- list(2, 2:3, 3:4)
do.call("[", c(list(a), L3))
# test
identical(do.call("[", c(list(a), L3)), a[2, 2:3, 3:4])
## [1] TRUE
This could be made prettier by defining:
`%[%` <- function(x, indexList) do.call("[", c(list(x), indexList))
mat %[% list(2:3, 1:2)
a %[% list(2, 2:3, 3:4)
Use which argument arr.ind = TRUE.
x <- c(25, 36)
inx <- which(mat == x, arr.ind = TRUE)
Warning message:
In mat == x :
longer object length is not a multiple of shorter object length
mat[inx]
#[1] 25 36
This is an interesting question. The subset function can actually help. You cannot subset directly your matrix using a vector or a list, but you can store the indexes in a list and use subset to do the trick.
mat <- matrix(1:12, nrow=4)
mat[2:3, 1:2]
# example using subset
subset(mat, subset = 1:nrow(mat) %in% 2:3, select = 1:2)
# double check
identical(mat[2:3, 1:2],
subset(mat, subset = 1:nrow(mat) %in% 2:3, select = 1:2))
# TRUE
Actually, we can write a custom function if we want to store the row- and column- indexes in the same list.
cust.subset <- function(mat, dim.list){
subset(mat, subset = 1:nrow(mat) %in% dim.list[[1]], select = dim.list[[2]])
}
# initialize a list that includes your sub-setting indexes
sbdim <- list(2:3, 1:2)
sbdim
# [[1]]
# [1] 2 3
# [[2]]
# [1] 1 2
# subset using your custom f(x) and your list
cust.subset(mat, sbdim)
# [,1] [,2]
# [1,] 2 6
# [2,] 3 7

R split numeric vector at position

I am wondering about the simple task of splitting a vector into two at a certain index:
splitAt <- function(x, pos){
list(x[1:pos-1], x[pos:length(x)])
}
a <- c(1, 2, 2, 3)
> splitAt(a, 4)
[[1]]
[1] 1 2 2
[[2]]
[1] 3
My question: There must be some existing function for this, but I can't find it? Is maybe split a possibility? My naive implementation also does not work if pos=0 or pos>length(a).
An improvement would be:
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
which can now take a vector of positions:
splitAt(a, c(2, 4))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2 2
#
# [[3]]
# [1] 3
And it does behave properly (subjective) if pos <= 0 or pos >= length(x) in the sense that it returns the whole original vector in a single list item. If you'd like it to error out instead, use stopifnot at the top of the function.
I tried to use flodel's answer, but it was too slow in my case with a very large x (and the function has to be called repeatedly). So I created the following function that is much faster, but also very ugly and doesn't behave properly. In particular, it doesn't check anything and will return buggy results at least for pos >= length(x) or pos <= 0 (you can add those checks yourself if you're unsure about your inputs and not too concerned about speed), and perhaps some other cases as well, so be careful.
splitAt2 <- function(x, pos) {
out <- list()
pos2 <- c(1, pos, length(x)+1)
for (i in seq_along(pos2[-1])) {
out[[i]] <- x[pos2[i]:(pos2[i+1]-1)]
}
return(out)
}
However, splitAt2 runs about 20 times faster with an x of length 106:
library(microbenchmark)
W <- rnorm(1e6)
splits <- cumsum(rep(1e5, 9))
tm <- microbenchmark(
splitAt(W, splits),
splitAt2(W, splits),
times=10)
tm
Another alternative that might be faster and/or more readable/elegant than flodel's solution:
splitAt <- function(x, pos) {
unname(split(x, findInterval(x, pos)))
}

Resources