Using R `outer` with `%in%` operator - r

I am trying to perform the following outer operation:
x <- c(1, 11)
choices <- list(1:10, 10:20)
outer(x, choices, FUN=`%in%`)
I expect the following matrix:
[,1] [,2]
[1,] TRUE FALSE
[2,] FALSE TRUE
which would correspond to the following operations:
outer(x, choices, FUN=paste, sep=" %in% ")
[,1] [,2]
[1,] "1 %in% 1:10" "1 %in% 10:20"
[2,] "11 %in% 1:10" "11 %in% 10:20"
But for some reason I am getting:
[,1] [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
What is happening?

As noted in the comments, the table argument of match (the function called by %in%) isn't intended to be a list (if it is, its elements get coerced to character). You can use vapply instead:
vapply(choices,function(y) x %in% y,logical(length(x)))
# [,1] [,2]
#[1,] TRUE FALSE
#[2,] FALSE TRUE
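One way to see where the all-FALSE matrix comes from: outer() does not call FUN once per pair. It expands both arguments to full length and calls FUN a single time. The sketch below mimics that expansion by hand and then wraps %in% so it is applied pair by pair. The names X, Y and in_pairwise are illustrative only and do not come from the answers here.

```r
x <- c(1, 11)
choices <- list(1:10, 10:20)

# Roughly what outer() passes to FUN: both arguments expanded to length 4
X <- rep(x, times = length(choices))
Y <- rep(choices, each = length(x))
expanded <- X %in% Y   # table is a list, so its elements are coerced before matching

# A wrapper that applies %in% element by element over the expanded arguments
in_pairwise <- function(a, b) mapply(function(ai, bi) ai %in% bi, a, b)
res <- outer(x, choices, FUN = in_pairwise)
print(res)
```

Because the wrapper returns one logical value per pair, outer() can reshape the result into the expected 2 x 2 matrix.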

Another way, closer to your train of thought, is to use expand.grid() to create the combinations and then apply %in% across the two columns with mapply, i.e.
d1 <- expand.grid(x, choices)
matrix(mapply(`%in%`, d1$Var1, d1$Var2), nrow = length(x))
# or you can use Map(`%in%`, ...) in order to keep results in a list
Or, as @nicola suggests, a tidier version:
d1 <- expand.grid(list(x), choices)
mapply(`%in%`, d1$Var1, d1$Var2)
both giving,
[,1] [,2]
[1,] TRUE FALSE
[2,] FALSE TRUE

Related

Compare every element of each member of the list with every element of each member of another list [duplicate]

I want to compare every element of each member of the list with every element of each member of another list.
A = B = list(1, 2, c(1,2))
The expected outcome is as follows:
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] FALSE TRUE TRUE
[3,] TRUE TRUE TRUE
I can solve similar task for the data.frame:
df = data.frame(A = c(1, 2, "1,2"), B = c(1, 2, "1,2"))
sapply(df$A, grepl, df$B)
which gives:
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] TRUE TRUE TRUE
But that is not exactly the solution I'm looking for.
Any help is much appreciated!
Here is a complicated way:
A = B = list(1, 2, c(1,2))
outer(A, B, function(a,b) sapply((Vectorize(`%in%`))(a,b),any) )
# [,1] [,2] [,3]
#[1,] TRUE FALSE TRUE
#[2,] FALSE TRUE TRUE
#[3,] TRUE TRUE TRUE
Here is an easy way:
eg <- expand.grid(A,B)
matrix(
mapply(function(x,y) {any(x %in% y)}, x = eg$Var1, y = eg$Var2 ), nrow = length(A), ncol = length(B)
)
(just for fun:)
matrix(
mapply(function(x,y) {length(intersect(x, y)) != 0}, x = eg$Var1, y = eg$Var2 ), nrow = length(A), ncol = length(B)
)
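A further variant, offered only as a sketch and not taken from the answers above, nests two sapply calls so that no expand.grid step is needed. The name overlap is illustrative.

```r
A <- B <- list(1, 2, c(1, 2))

# Rows follow A, columns follow B; a cell is TRUE when the two members share a value
overlap <- sapply(B, function(b) sapply(A, function(a) any(a %in% b)))
print(overlap)
```

The inner sapply produces one column per member of B, so the result already has length(A) rows and length(B) columns.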

Using dplyr functions inside apply

I want to use dplyr functions inside apply, to every element of a matrix (BRCK), which is a matrix of dataframes.
I tried something like this:
apply(BRCK, c(1,2), function(x) dplyr::select(x, dplyr::contains("_01_"), 1) %>%
dplyr::filter((month(`BRCK[[la, lo]]`) == 1)) %>%
dplyr::select(-contains("BRCK")))
But it returns
Error: Variable context not set
And the traceback:
13. stop(cnd)
12. abort("Variable context not set")
11. cur_vars_env$selected %||% abort("Variable context not set")
10. current_vars()
9. tolower(vars)
8. dplyr::contains("_01_")
7. select.list(x, dplyr::contains("_01_"), 1)
6. dplyr::select(x, dplyr::contains("_01_"), 1)
5. eval(lhs, parent, parent)
4. eval(lhs, parent, parent)
3. dplyr::select(x, dplyr::contains("_01_"), 1) %>% dplyr::filter(x,
(month(`BRCK[[la, lo]]`) == 1)) %>% dplyr::select(x, -contains("BRCK"))
2. FUN(newX[, i], ...)
1. apply(BRCK, c(1, 2), function(x) dplyr::select(x, dplyr::contains("_01_"), 1) %>% dplyr::filter(x, (month(`BRCK[[la, lo]]`) == 1)) %>%
dplyr::select(x, -contains("BRCK")))
BRCK is a very large object. It works with for loops, but I'm trying to replace them with apply functions.
With apply, x is passed to the function as a list, and dplyr verbs only work on data frames.
apply(BRCK, c(1,2), is.data.frame)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
but:
apply(BRCK, c(1,2), function(x) is.data.frame(x[[1]]))
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE TRUE
[3,] TRUE TRUE TRUE
so:
library(tidyverse)
apply(BRCK, c(1,2),
function(x) {
x[[1]] %>%
dplyr::select(dplyr::contains("_01_"), 1) %>%
dplyr::filter(lubridate::month(`BRCK[[la, lo]]`) == 1) %>%
dplyr::select(-contains("BRCK"))
}
)
One problem is that each element in the loop is a list of a single data frame, not an actual data frame. Compare:
apply(BRCK, c(1,2), function(x) {
class(x)
})
[,1] [,2] [,3]
[1,] "list" "list" "list"
[2,] "list" "list" "list"
[3,] "list" "list" "list"
apply(BRCK, c(1,2), function(x) {
class(x[[1]])
})
[,1] [,2] [,3]
[1,] "data.frame" "data.frame" "data.frame"
[2,] "data.frame" "data.frame" "data.frame"
[3,] "data.frame" "data.frame" "data.frame"
I would suggest not using an apply loop here (use lapply on the indices instead), since the way apply subsets objects and modifies them is not well documented.
I would also suggest not storing data.frames in a matrix. You could store them in a list, and set attributes for the metadata that is implied by the matrix indices.
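A minimal sketch of the lapply-over-indices idea, using a small stand-in for BRCK because the real object is not shown. BRCK_toy, cells and the value > 2 filter are all made up for illustration; the dplyr pipeline would go where the base subsetting is.

```r
# Toy stand-in for BRCK: a 2 x 2 list-matrix whose cells are data frames
cells <- lapply(1:4, function(k) data.frame(id = k, value = k * (1:3)))
BRCK_toy <- cells
dim(BRCK_toy) <- c(2, 2)

# Iterate over linear indices; [[k]] returns the data frame itself, not a list
res <- lapply(seq_along(BRCK_toy), function(k) {
  df <- BRCK_toy[[k]]
  df[df$value > 2, , drop = FALSE]   # any data-frame operation could go here
})
dim(res) <- dim(BRCK_toy)            # restore the matrix layout

print(is.data.frame(res[[1, 2]]))
```

Restoring dim at the end keeps the row and column positions, so res[[i, j]] still corresponds to BRCK_toy[[i, j]].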

Recursive indexing to unlist a matrix

Consider the following vector x and list s
x <- c("apples and pears", "one banana", "pears, oranges, and pizza")
s <- strsplit(x, "(,?)\\s+")
The desired result will be the following, but please keep reading.
> t(sapply(s, `length<-`, 4))
# [,1] [,2] [,3] [,4]
#[1,] "apples" "and" "pears" NA
#[2,] "one" "banana" NA NA
#[3,] "pears" "oranges" "and" "pizza"
That's fine, it's a good way to do it. But R's vectorization is one of its best features, and I'd like to see if I can do this with recursive indexing, that is, using only [ subscript indexing.
I want to start with the following, and use the row and column indices to turn the list s into a 3x4 matrix. So I'm calling cbind on the list s, and starting from there.
(cb <- cbind(s))
# s
# [1,] Character,3
# [2,] Character,2
# [3,] Character,4
class(cb[1])
#[1] "list"
is.recursive(cb)
#[1] TRUE
I've gotten this far, but now I'm struggling with the higher dimensions. Here's the first row. From here I want to unlist the rest of the matrix using the [ and [[ index.
w <- character(nrow(cb)+nrow(cb)^2)
dim(w) <- c(3,4)
w[cbind(1, 1:3)] <- cb[[1]]
# [,1] [,2] [,3] [,4]
#[1,] "apples" "and" "pears" ""
#[2,] "" "" "" ""
#[3,] "" "" "" ""
At level 2 it gets more difficult. I've been doing things like this
> cb[[c(1,2,1), exact = TRUE]]
# Error in cb[[c(1, 2, 1), exact = TRUE]] :
# recursive indexing failed at level 2
> cb[[cbind(1,2,1)]]
# Error in cb[[cbind(1, 2, 1)]] : recursive indexing failed at level 2
Here's an example of how the indexing proceeds. I've tried all kinds of combinations of w[[cbind(1, 1:2)]] and the like
w[cbind(1, 1:3)] <- cb[[1]]
w[cbind(2, 1:2)] <- cb[[2]]
w[cbind(3, 1:4)] <- cb[[3]]
From the empty matrix w, this produces the result
# [,1] [,2] [,3] [,4]
#[1,] "apples" "and" "pears" ""
#[2,] "one" "banana" "" ""
#[3,] "pears" "oranges" "and" "pizza"
Is it possible to use recursive indexing on all levels, so that I can unlist cb into an empty matrix directly from when it was a list? i.e. put the three w[] <- cb[[]] lines into one.
I'm asking this because it gets to the heart of matrix structures in R. It's about learning the indexing, and not about finding an alternative solution to my problem.
You can use the rbind.fill.matrix function from the plyr package.
library(plyr)
rbind.fill.matrix(lapply(s, rbind))
This returns
1 2 3 4
[1,] "apples" "and" "pears" NA
[2,] "one" "banana" NA NA
[3,] "pears" "oranges" "and" "pizza"
Note that this does use as.matrix internally: rbind.fill.matrix calls matrices[] <- lapply(matrices, as.matrix)
If you wanted to bypass the intermediary steps, you can just use my cSplit function (from the splitstackshape package), like this:
cSplit(as.data.table(x), "x", "(,?)\\s+", fixed = FALSE)
# x_1 x_2 x_3 x_4
# 1: apples and pears NA
# 2: one banana NA NA
# 3: pears oranges and pizza
as.matrix(.Last.value)
# x_1 x_2 x_3 x_4
# [1,] "apples" "and" "pears" NA
# [2,] "one" "banana" NA NA
# [3,] "pears" "oranges" "and" "pizza"
Under the hood, however, that still does require creating a matrix and filling it in. It uses matrix indexing to fill in the values, so it is quite fast.
A manual approach would look something like:
myFun <- function(invec, split, fixed = TRUE) {
s <- strsplit(invec, split, fixed)
Ncol <- vapply(s, length, 1L)
M <- matrix(NA_character_, ncol = max(Ncol),
nrow = length(invec))
M[cbind(rep(sequence(length(invec)), times = Ncol),
sequence(Ncol))] <- unlist(s, use.names = FALSE)
M
}
myFun(x, "(,?)\\s+", FALSE)
# [,1] [,2] [,3] [,4]
# [1,] "apples" "and" "pears" NA
# [2,] "one" "banana" NA NA
# [3,] "pears" "oranges" "and" "pizza"
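On the recursive indexing itself, a short sketch of what [[ does with a vector index may help explain the "failed at level 2" errors in the question. It uses the plain list s instead of cb, and the names second_word and third_level are illustrative. A vector index is applied one level at a time, and every step except the last has to return a list.

```r
x <- c("apples and pears", "one banana", "pears, oranges, and pizza")
s <- strsplit(x, "(,?)\\s+")

# Two levels work: s[[1]] is a character vector, and the final step picks its 2nd element
second_word <- s[[c(1, 2)]]          # same as s[[1]][[2]]

# A third level would have to descend into an atomic vector, which [[ refuses to do
third_level <- tryCatch(s[[c(1, 2, 1)]],
                        error = function(e) conditionMessage(e))

print(second_word)
print(third_level)
```

This is why a single [[ call can reach one word at a time but cannot spread a whole ragged list over a matrix; the two-column index matrix used in myFun does that part.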
Speed is not everything, but it certainly should be a consideration for this type of transformation.
Here are some tests of what has been suggested so far:
## The manual approach
fun1 <- function(x) myFun(x, "(,?)\\s+", FALSE)
## The cSplit approach
fun2 <- function(x) cSplit(as.data.table(x), "x", "(,?)\\s+", fixed = FALSE)
## The OP's approach
fun3 <- function(x) {
s <- strsplit(x, "(,?)\\s+")
mx <- max(sapply(s, length))
do.call(rbind, lapply(s, function(x) { length(x) <- mx; x }))
}
## The plyr approach
fun4 <- function(x) {
s <- strsplit(x, "(,?)\\s+")
rbind.fill.matrix(lapply(s, rbind))
}
And, for fun, here's another approach, this one using dcast.data.table:
fun5 <- function(x) {
dcast.data.table(
data.table(
strsplit(x, "(,?)\\s+"))[, list(
unlist(V1)), by = sequence(length(x))][, N := sequence(
.N), by = sequence], sequence ~ N, value.var = "V1")
}
Testing is on slightly bigger data. Not very big--12k values:
x <- unlist(replicate(4000, x, FALSE))
length(x)
# [1] 12000
## I expect `rbind.fill.matrix` to be slow:
system.time(fun4(x))
# user system elapsed
# 3.38 0.00 3.42
library(microbenchmark)
microbenchmark(fun1(x), fun2(x), fun3(x), fun5(x))
# Unit: milliseconds
# expr min lq median uq max neval
# fun1(x) 97.22076 100.8013 102.5754 107.8349 166.6632 100
# fun2(x) 115.01466 120.6389 125.0622 138.0614 222.7428 100
# fun3(x) 146.33339 155.9599 158.8394 170.3917 228.5523 100
# fun5(x) 257.53868 266.5994 273.3830 296.8003 346.3850 100
A bit bigger data, but still not what others might consider big: 1.2M values.
X <- unlist(replicate(100, x, FALSE))
length(X)
# [1] 1200000
## Dropping fun3 and fun5 now, though they are very close...
## I wonder how fun5 scales further (but don't have the patience to wait)
system.time(fun5(X))
# user system elapsed
# 31.28 0.43 31.76
system.time(fun3(X))
# user system elapsed
# 31.62 0.33 31.99
microbenchmark(fun1(X), fun2(X), times = 10)
# Unit: seconds
# expr min lq median uq max neval
# fun1(X) 11.65622 11.76424 12.31091 13.38226 13.46488 10
# fun2(X) 12.71771 13.40967 14.58484 14.95430 16.15747 10
The penalty for the cSplit approach would be in terms of having to convert to a "data.table" and the checking of different conditions, but as your data grows, those penalties become less noticeable.

Why does this vectorized matrix comparison fail?

I am trying to compare 1st row of a matrix with all rows of the same matrix. But the vectorized comparison is not returning correct results. Any reason why this may be happening?
m <- matrix(c(1,2,3,1,2,4), nrow=2, ncol=3, byrow=TRUE)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
> # Why does the first row not have 3 TRUE values?
> m[1,] == m
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE FALSE
> m[1,] == m[1,]
[1] TRUE TRUE TRUE
> m[1,] == m[2,]
[1] TRUE TRUE FALSE
Follow-up. In my actual data I have a large number of rows (at least 10 million), so both time and memory add up. Any additional suggestions on the approaches below, as suggested by others?
m <- matrix(rep(c(1,2,3), 1000000), ncol=3, byrow=TRUE)
> # by @alexis_laz
> m1 <- matrix(m[1,], nrow = nrow(m), ncol = ncol(m), byrow = T)
> system.time(m == m1)
user system elapsed
0.21 0.03 0.31
> object.size(m1)
24000112 bytes
> # by @PaulHiemstra
> system.time( t(apply(m, 1, function(x) x == m[1,])) )
user system elapsed
35.18 0.08 36.04
Follow-up 2. @alexis_laz you are correct. I want to compare every row with each other and have posted a follow-up question on that (How to vectorize comparing each row of matrix with all other rows)
In the comparison m[1,] == m, the first term m[1,] is recycled (once) to equal the length of m. The comparison is then done column-wise.
You're comparing c(1,2,3) with c(1,1,2,2,3,4), that is, c(1,2,3,1,2,3) with c(1,1,2,2,3,4), so you have one TRUE followed by five FALSE (packaged as a matrix to match the dimensions of m).
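The recycling can be reproduced by hand. This is only an illustration of the explanation above; lhs, rhs and manual are made-up names.

```r
m <- matrix(c(1, 2, 3, 1, 2, 4), nrow = 2, ncol = 3, byrow = TRUE)

lhs <- rep(m[1, ], length.out = length(m))  # 1 2 3 1 2 3 (first row recycled)
rhs <- as.vector(m)                         # 1 1 2 2 3 4 (m read column by column)

# Element-wise comparison, reshaped to the dimensions of m
manual <- matrix(lhs == rhs, nrow = nrow(m))
print(manual)
print(all(manual == (m[1, ] == m)))
```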
As @MatthewLundberg pointed out, the recycling rules of R do not behave as you expected. In my opinion it is always better to explicitly state what to compare and not rely on R's assumptions. One way to make the correct comparison:
t(apply(m, 1, function(x) x == m[1,]))
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or:
m == rbind(m[1,], m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or by making R's recycling work in your favor (thanks to @Arun):
t(t(m) == m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
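Two more variants in the same spirit, shown on the small example. They are sketches and were not timed on the 10-million-row case, so no speed or memory claim is made for them. by_sweep and by_rep are illustrative names.

```r
m <- matrix(c(1, 2, 3, 1, 2, 4), nrow = 2, ncol = 3, byrow = TRUE)

# sweep() lines the first row up against the columns (MARGIN = 2)
by_sweep <- sweep(m, 2, m[1, ], FUN = "==")

# Repeating each element nrow(m) times matches R's column-major storage
by_rep <- m == rep(m[1, ], each = nrow(m))

print(by_sweep)
print(by_rep)
```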

Select an element from each row of a matrix in R

The question is the same as here, but in R. I have a matrix and a vector such that
length(vec) == nrow(mat)
How do I get a vector v such that
v[i] == mat[i, vec[i]]
I tried to achieve this by using logical matrix:
> a = matrix(runif(12),4,3)
> a
[,1] [,2] [,3]
[1,] 0.6077585 0.5354680 0.2802681
[2,] 0.2596180 0.6358106 0.9336301
[3,] 0.5317069 0.4981082 0.8668405
[4,] 0.6150885 0.5164009 0.5797668
> sel = col(a) == c(1,3,2,1)
> sel
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE FALSE
> a[sel]
[1] 0.6077585 0.6150885 0.4981082 0.9336301
It selects the right elements but messes up the order. I also thought of using mapply, but I don't know how to make it iterate through rows, like in apply.
Update: @gsk3 suggested using as.list(as.data.frame(t(a))), and this works. But I would still like to know if there is a more vectorized way, without lists.
I am not 100% sure I understand your question, but it seems like this may be close?
> b=c(1,3,2,1)
> i=cbind(1:nrow(a),b)
> a[i]
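To make the indexing easier to check, here is the same idea on a fixed matrix instead of random numbers. The matrix values and the names picked and looped are assumptions for illustration. A two-column index matrix is read as (row, column) pairs, so the result comes back in row order.

```r
a <- matrix(1:12, nrow = 4, ncol = 3)   # columns hold 1:4, 5:8, 9:12
b <- c(1, 3, 2, 1)                      # column to take from each row

# Each row of the index matrix is one (row, column) pair
picked <- a[cbind(seq_len(nrow(a)), b)]

# Same result with an explicit loop, for comparison
looped <- vapply(seq_len(nrow(a)), function(i) a[i, b[i]], integer(1))

print(picked)
```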
