Select rows from a data frame using an array in R

I wish to select a set of rows from a data frame in R given multiple parameters. Normally this could be done with an OR statement, but the values are housed in an array. I am querying them as follows (with no luck):
Some data to get us rolling:
x = array(c(1,2,3),c(5,5))
y=c(1,2)
The command I'm presently using is (filtering by column 1):
x[x[,1] == y, ]
The above command yields this error:
Warning message:
In x[, i] == y :
longer object length is not a multiple of shorter object length
Which makes sense. I just don't know how to get around it.
What I am looking for is:
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    2    1    3
[2,]    2    1    3    2    1
[3,]    1    3    2    1    3
[4,]    2    1    3    2    1
Thanks in advance for the help!

You are looking for %in%.
> x[x[,1] %in% y, ]
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    3    2    1    3
# [2,]    2    1    3    2    1
# [3,]    1    3    2    1    3
# [4,]    2    1    3    2    1
As @Ricardo said in a comment, here is a better explanation of why this happens. When you equate x[,1] to y, you get:
x[,1] == y
# [1]  TRUE  TRUE FALSE FALSE FALSE
Since y is c(1, 2), R compares it against the first two values of x[, 1], and since both match, it returns TRUE for them. Because the output must have the same length as x[, 1], y is "recycled" over the rest (y = 1, 2, 1 against x = 3, 1, 2), which results in FALSE. So if you use x[., ] to fetch the rows, only the first two values are TRUE and only the first two rows are picked. Using %in% instead gives:
x[,1] %in% y
# [1] TRUE TRUE FALSE TRUE TRUE
Which is what you expect.
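For the record, base R defines %in% in terms of match(), so an equivalent spelling of the same test is:
x[,1] %in% y
match(x[,1], y, nomatch = 0) > 0
# [1]  TRUE  TRUE FALSE  TRUE  TRUE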

To add to @Arun's answer: if the two vectors being compared are of different lengths, R will recycle the shorter one so that it is comparing two vectors of the same length, and then it makes a pair-wise comparison (i.e., it compares the first element of each vector, then the second element of each vector, etc.).
It does not, for example, compare the first element of vector one against all of the elements in vector two (for that you need %in%, as @Arun mentioned, or outer(); see the sketch at the end of this answer).
For example, take a look at the following.
The first two examples yield equivalent output
> c(0, 1, 2, 0, 1, 2) == c(1, 2)
[1] FALSE FALSE FALSE FALSE TRUE TRUE
> c(0, 1, 2, 0, 1, 2) == c(1, 2, 1, 2, 1, 2)
[1] FALSE FALSE FALSE FALSE TRUE TRUE
The comparisons being made are:
# element#  LHS  RHS  areEqual
#   1.       0    1   FALSE    <~~ Notice that the '0' from LHS is being compared with the '1' from RHS
#   2.       1    2   FALSE
#   3.       2    1   FALSE
#   4.       0    2   FALSE
#   5.       1    1   TRUE
#   6.       2    2   TRUE
Here is another example, with the LHS "shifted" relative to the previous example.
> c(1, 2, 0, 1, 2, 0) == c(1, 2)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
Note what happens when the length of the longer vector is not an exact multiple of the length of the shorter one (i.e., 2 does not go into 7). The recycling still occurs, but the last recycle of the shorter vector gets cropped. R gives us a warning, just in case we did not expect the lengths to differ:
> c(1, 2, 3, 4, 1, 2, 0) == c(1, 2)
[1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE
Warning message:
In c(1, 2, 3, 4, 1, 2, 0) == c(1, 2) :
longer object length is not a multiple of shorter object length
Notice that it does not matter whether the longer vector is on the RHS or the LHS; the recycling works just the same:
> c(1, 2) == c(1, 2, 0, 1, 2, 0)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
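Finally, if you do want every element of one vector compared against every element of the other, a small sketch with outer() shows the full cross-comparison as a matrix:
outer(c(0, 1, 2), c(1, 2), "==")
#       [,1]  [,2]
# [1,] FALSE FALSE
# [2,]  TRUE FALSE
# [3,] FALSE  TRUE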

Related

R: sum elements in a matrix up to threshold

I have a matrix of values with thousands of rows and a couple dozen columns. For a given row, R_0, I'd like to find all other complementary rows. A complementary row is defined as follows:
- if the given row has a non-zero value in a column, then the complement must have a zero value in that column;
- the sum of the elements of the given row and its complement must be less than 1.0.
To illustrate, here is a toy matrix:
     [,1]      [,2]     [,3]      [,4]      [,5]      [,6]
[1,] 0         0        0         0.1816416 0         0.1796779
[2,] 0.1889351 0        0         0         0         0
[3,] 0         0        0.1539683 0         0         0.1983812
[4,] 0         0.155489 0.1869410 0         0         0
[5,] 0         0        0         0         0.1739382 0
For row 1, there are values for columns 4 and 6. A complementary row must have "0" for columns 4 and 6.
I don't know what data structure my desired output should be. But I know the output should tell me:
row 1 has the following complementary rows: 2, 3, 5
row 2 has the following complementary rows: 1, 3, 4, 5
row 3 has the following complementary rows: 2, 5
row 4 has the following complementary rows: 1, 2, 5
row 5 has the following complementary rows: 1, 2, 3, 4
Perhaps a list of lists? I.e.:
[1: 2, 3, 5;
2: 1, 3, 4, 5;
3: 2, 5;
4: 1, 2, 5;
5: 1, 2, 3, 4]
But I'm open to other data structures.
The following code generates the toy matrix above.
set.seed(1)
a = runif(n=30, min=0, max=0.2)
a[a<0.15] = 0
A = matrix(a,            # the data elements
           nrow = 5,     # number of rows
           ncol = 6,     # number of columns
           byrow = TRUE) # fill matrix by rows
Is there a package or clever way to approach this problem?
We can create a function to check whether one row is a complement of another:
check_compliment <- function(x, y) {
  all(A[y, A[x, ] != 0] == 0) & sum(c(A[x, ], A[y, ])) < 1
}
Here, we subset row y at the columns where row x is non-zero and check that all of them are 0. We also check that the sum of rows x and y is less than 1.
Apply this function for every combination using outer
sapply(data.frame(outer(1:nrow(A), 1:nrow(A), Vectorize(check_compliment))), which)
#$X1
#[1] 2 4 5
#$X2
#[1] 1 3 4 5
#$X3
#[1] 2 5
#$X4
#[1] 1 2 5
#$X5
#[1] 1 2 3 4
The outer step gives us a TRUE/FALSE value for every combination of a row with every other row, indicating whether it is a complement:
outer(1:nrow(A), 1:nrow(A), Vectorize(check_compliment))
#       [,1]  [,2]  [,3]  [,4]  [,5]
# [1,] FALSE  TRUE FALSE  TRUE  TRUE
# [2,]  TRUE FALSE  TRUE  TRUE  TRUE
# [3,] FALSE  TRUE FALSE FALSE  TRUE
# [4,]  TRUE  TRUE FALSE FALSE  TRUE
# [5,]  TRUE  TRUE  TRUE  TRUE FALSE
We convert this to a data frame and use which to get the indices for every column.
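If you prefer to skip the data frame conversion, an equivalent sketch applies which row-wise to the logical matrix returned by outer (the matrix is symmetric here, since having disjoint non-zero columns is a symmetric condition):
res <- outer(1:nrow(A), 1:nrow(A), Vectorize(check_compliment))
lapply(seq_len(nrow(res)), function(i) which(res[i, ]))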

How to remove any co-occurrence of sub-list elements from vector (R)

I was reviewing the Python question How to remove every occurrence of sub-list from list. Now I want to know how many creative ways there are to do this in R.
For example, removing any occurrences of sub_list from the main_list.
main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
desired result: 2 3 4 2 2 1
My suggestions:
1)
a <- c()
for (i in 1:(length(main_list) - 1)) {
  if (all(main_list[c(i, i + 1)] == sub_list))
    a <- c(a, c(i, i + 1))
}
main_list[-a]
[1] 2 3 4 2 2 1
2)
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
Ohh, the second one is really dangerous. Let's try:
main_list = c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1)
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
[1] 2 3 4 2 2 1
However, the loop approach handles it correctly:
a <- c()
for (i in 1:(length(main_list) - 1)) {
  if (all(main_list[c(i, i + 1)] == sub_list))
    a <- c(a, c(i, i + 1))
}
main_list[-a]
[1] 2 3 12 4 2 2 1
Update (Sat Sep 08 2018)
Benchmarking the solutions:
I benchmarked the solutions on the time and memory each one takes, using a big vector of numbers and the profmem and microbenchmark libraries.
set.seed(1587)
main_list<-sample(c(8:13,102:105),size = 10000000,replace = T)
main_list<-c(c(8,9,12,103),main_list,c(8,9,12,103))
sub_list<-c(8,9,12,103)
d.b's solution does not work for main_list so I modified it as follows:
ML = paste(main_list, collapse = ",") # collapse should not be empty
SL = paste(sub_list, collapse = ",")
out<-gsub(SL, "", ML)
out<-gsub("^\\,","",out)
out<-gsub("\\,$","",out)
out<-gsub("\\,,","\\,",out)
out<-as.numeric(unlist(strsplit(out,split = ",")))
The result:
  solution        seconds memory_byte memory_base seconds_base
  <chr>             <dbl>       <dbl>       <dbl>        <dbl>
1 d.b               26.0    399904560        1            16.8
2 Grothendieck_2     1.55  1440070304        3.60          1
3 Grothendieck_1   109.    4968036376       12.4          70.3
4 李哲源             2.17  1400120824        3.50          1.40
Any comment about the benchmarking?
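For reference, a minimal sketch of the kind of harness this implies (the wrapper my_solution is hypothetical; each candidate is wrapped as a zero-argument function):
library(microbenchmark)
library(profmem)
bench_one <- function(f) {
  mem <- total(profmem(f()))  # total bytes allocated during one run
  sec <- summary(microbenchmark(f(), times = 1L), unit = "s")$mean
  c(seconds = sec, memory_byte = mem)
}
# usage: bench_one(function() my_solution(main_list, sub_list))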
Here are two solutions. The first one is obviously simpler and would be used if you favour clarity and maintainability while the second one has no package dependencies and is faster.
1) zoo. Use a moving window to compare each subsequence of c(main_list, sub_list) of the required length to sub_list. (We append sub_list to ensure that there is always something to remove.) This statement returns TRUE or FALSE according to whether the current position is the end of a matching subsequence. Then compute the TRUE index numbers, and from those the indices of all elements to be removed, and remove them.
library(zoo)
w <- length(sub_list)
r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE)
main_list[-c(outer(which(r), seq_len(w) - 1, "-"))]
## [1] 2 3 4 2 2 1
2) Base R. The middle line setting r has the same purpose as the corresponding line in (1), and the last line is the same as the last line in (1), except we use + instead of - because embed effectively uses left alignment.
w <- length(sub_list)
r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
## [1] 2 3 4 2 2 1
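To see why rev(sub_list) is compared in (2), note that embed writes each window in reverse order:
embed(1:5, 2)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    4    3
# [4,]    5    4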
Here is a function that does this general thing.
- xm is a main list of integer / character / logical values;
- xs is a sub list of integer / character / logical values.
It is required that length(xm) > length(xs), but no such check is made right now.
foo <- function(xm, xs) {
  nm <- length(xm)
  ns <- length(xs)
  shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
  d <- xm[shift_ind] == xs
  first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
  if (length(first_drop_ind) > 0L) {
    drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
    return(xm[-drop_ind])
  } else {
    return(xm)
  }
}
main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
foo(main_list, sub_list)
#[1] 2 3 4 2 2 1
Explanation
xm <- main_list
xs <- sub_list
nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
MAT <- matrix(xm[shift_ind], ns)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,]    2    1    2    3    1    2    4    2    2
#[2,]    1    2    3    1    2    4    2    2    1
So the first step is a shifting and matrix representation, as above.
LOGIC <- MAT == xs
#      [,1] [,2]  [,3]  [,4] [,5]  [,6]  [,7]  [,8]  [,9]
#[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE TRUE FALSE  TRUE  TRUE FALSE
If a co-occurrence is found, the corresponding column contains all TRUE values, i.e., its column sum equals ns. In this way we can identify the location of the first value of each match.
first_drop_ind <- which(colSums(LOGIC) == ns)
#[1] 2 5
Now we need to expand it to cover the subsequent values after those initial matches.
drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
#     [,1] [,2]
#[1,]    2    5
#[2,]    3    6
Finally we remove values at those positions from xm:
xm[-drop_ind]
#[1] 2 3 4 2 2 1
Note that in the function, the matrix is not explicitly formed. .colSums is used instead of colSums.
Watch out for a bug
The if ... else ... in the function is necessary. If no match is found then drop_ind would be integer(0), and using xm[-drop_ind] gives xm[integer(0)] which is integer(0).
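A quick demonstration of the pitfall:
xm <- c(2, 1, 2, 3)
drop_ind <- integer(0)  # no match found
xm[-drop_ind]           # drops everything rather than nothing
# numeric(0)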
Comparison with zoo::rollapplyr
## require package `zoo`
bar <- function(xm, xs) {
  w <- length(xs)
  r <- rollapplyr(xm, w, identical, xs, fill = FALSE)
  if (any(r)) {  # test for a match; `length(r) > 0L` would hit the integer(0) bug above
    return(xm[-c(outer(which(r), seq_len(w) - 1, "-"))])
  } else {
    return(xm)
  }
}
set.seed(0)
xm <- sample.int(10, 10000, TRUE)
xs <- 1:2
library(zoo)
system.time(a <- foo(xm, xs))
# user system elapsed
# 0.004 0.000 0.001
system.time(b <- bar(xm, xs))
# user system elapsed
# 0.276 0.000 0.273
all.equal(a, b)
#[1] TRUE
I guess that rollapplyr is slower because:
- it needs to first coerce xm to a "zoo" object;
- internally it uses lapply, so there is a frequent jump between R and C.

Per-row index of a matrix in R (including 0-rows)

Assume we have the following logical matrix in R:
A <- matrix(as.logical(c(0,0,0,1,0,1,0,0,1,0,0,0)), nrow=4)
# [,1] [,2] [,3]
# [1,] FALSE FALSE TRUE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE FALSE
# [4,] TRUE FALSE FALSE
I want to convert this matrix into a column-wise index using
B <- column_wise_index(A)
where column_wise_index returns a vector with one element per row of A (here 4), each element giving the column of A that is TRUE in that row. For A above, B should resemble
B <- c(3,2,0,1)
# [1] 3 2 0 1
where 0 indicates a row that has no TRUE value.
The closest I've come is applying which by row:
unlist(apply(A, 1, function(x) which(x)))
# [1] 3 2 1
However, the result skips 0, and I'm not sure how efficient this is for large matrices (say ~100K x 100 entries).
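For reference, one fully vectorized sketch (assuming each row has at most one TRUE) uses which(A, arr.ind = TRUE):
B <- integer(nrow(A))
idx <- which(A, arr.ind = TRUE)  # (row, col) positions of the TRUE entries
B[idx[, "row"]] <- idx[, "col"]
B
# [1] 3 2 0 1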
Here is a solution that is more in the spirit of how you started, but you have to admire @rawr's clever solution.
A <- matrix(as.logical(c(0,0,0,1,0,1,0,0,1,0,0,0)), nrow=4)
TrueSpots = apply(A, 1, which)
TrueSpots[!sapply(TrueSpots, length)] = 0
unlist(TrueSpots)
[1] 3 2 0 1
Update including @akrun's suggestion:
TrueSpots = apply(A, 1, which)
TrueSpots[!lengths(TrueSpots)] = 0
unlist(TrueSpots)
[1] 3 2 0 1
max.col(A) identifies the index where the maximum entry occurs within each row; ties are broken at random by default. rowSums(A) on a logical matrix counts the TRUE values per row.
Under the assumption that each row has at most one TRUE value, rowSums(A) yields a binary vector, and multiplying by it zeroes out the all-FALSE rows of A.
> A <- matrix(as.logical(c(0,0,0,1,0,1,0,0,1,0,0,0)), nrow=4)
> max.col(A)*rowSums(A)
[1] 3 2 0 1
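If reproducibility matters, ties can instead be broken deterministically; the result is unchanged here because rowSums(A) zeroes out the all-FALSE rows anyway:
> max.col(A, ties.method = "first") * rowSums(A)
[1] 3 2 0 1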

Sum of matrix elements between diagonals efficiently in R

I have data in the form of an n*n matrix on which I want to do some computations (e.g. sum) over the elements lying between the diagonals (excluding the diagonals).
For example for this matrix:
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    0    1    4    3
[2,]    5    3    6    0    4
[3,]    3    5    2    3    1
[4,]    2    1    5    3    2
[5,]    1    4    3    4    1
The result for sum (between diagonal elements) would be:
# left slice   5+3+2+5 = 15
# bottom slice 4+3+4+5 = 16
# right slice  4+1+2+3 = 10
# top slice    0+1+4+6 = 11
# dput(m)
m <- structure(c(2, 5, 3, 2, 1, 0, 3, 5, 1, 4, 1, 6, 2, 5, 3, 4, 0,
3, 3, 4, 3, 4, 1, 2, 1), .Dim = c(5L, 5L))
How to accomplish that efficiently?
Here's how you can get the "top slice":
sum(m[lower.tri(m)[nrow(m):1,] & upper.tri(m)])
#[1] 11
to visualize it:
lower.tri(m)[nrow(m):1,] & upper.tri(m)
# [,1] [,2] [,3] [,4] [,5]
#[1,] FALSE TRUE TRUE TRUE FALSE
#[2,] FALSE FALSE TRUE FALSE FALSE
#[3,] FALSE FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE FALSE FALSE
#[5,] FALSE FALSE FALSE FALSE FALSE
Here's how you can compute all 4 of the slices:
up <- upper.tri(m)
lo <- lower.tri(m)
n <- nrow(m)
# top
sum(m[lo[n:1,] & up])
# left
sum(m[lo[n:1,] & lo])
# right
sum(m[up[n:1,] & up])
# bottom
sum(m[up[n:1,] & lo])
sum(sapply(1:dim(m)[[2L]], function(i) sum(m[c(-i,-(dim(m)[[1L]]-i+1)),i])))
This goes column by column and, for each column, takes out the two diagonal elements and sums the rest. These partial results are then summed up.
I believe this is fast because we go column by column, and matrices in R are stored column by column (i.e., it is CPU-cache friendly). We also do not have to produce a large vector of indices, only a vector of two indices (those taken out) per column.
EDIT: I read the question again more carefully. The code can be updated to produce a list of four values (one per region) for each element in sapply. The idea stays the same: for a large matrix, it will be fast if you go column by column, not jumping back and forth between columns.
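As a cross-check of the slice sums above: the four slices together cover everything except the two diagonals, so their total can be computed directly (when n is odd, the center element sits on both diagonals and must be added back once):
n <- nrow(m)
sum(m) - sum(diag(m)) - sum(m[cbind(n:1, 1:n)]) +
  if (n %% 2 == 1) m[(n + 1) / 2, (n + 1) / 2] else 0
# [1] 52  (= 15 + 16 + 10 + 11)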

Identify which objects of list are contained (subset of) in another list in R

Thank you for your kind reply to my previous questions. I have two lists: list1 and list2. I would like to know if each object of list1 is contained in each object of list2. For example:
> list1
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
> list2
[[1]]
[1] 1 2 3
[[2]]
[1] 2 3
[[3]]
[1] 2 3
Here are my questions:
1.) How do I ask R to check if an object is a subset of another object in a list?
For instance, I would like to check whether list2[[3]] = {2,3} is contained in (a subset of) list1[[2]] = {2}. When I do list2[[3]] %in% list1[[2]], I get [1] TRUE FALSE. However, this is not what I want. I just want to check whether list2[[3]] is a subset of list1[[2]], i.e. is {2,3} a subset of {2} in the set-theoretic sense? I do not want the element-wise check that R seems to be performing with the %in% command. Any suggestions?
2.) Is there an efficient way to make all pairwise subset comparisons (i.e. list1[[i]] subset of list2[[j]], for all i,j combinations)? Would something like outer(list1, list2, func.subset) work once question 1 is answered?
Thank you for your feedback!
setdiff compares unique values
length(setdiff(5, 1:5)) == 0
Alternatively, all(x %in% y) will work nicely.
To do all comparisons, something like this would work:
dt <- expand.grid(list1,list2)
dt$subset <- apply(dt,1, function(.v) all(.v[[1]] %in% .v[[2]]) )
  Var1    Var2 subset
1    1 1, 2, 3   TRUE
2    2 1, 2, 3   TRUE
3    3 1, 2, 3   TRUE
4    1    2, 3  FALSE
5    2    2, 3   TRUE
6    3    2, 3   TRUE
7    1    2, 3  FALSE
8    2    2, 3   TRUE
9    3    2, 3   TRUE
Note that expand.grid isn't the fastest way to do this when dealing with a lot of data (@DWin's solution is better in that regard), but it allows you to quickly check visually whether this is doing what you want.
You can use the sets package as follows:
library(sets)
is.subset <- function(x, y) as.set(x) <= as.set(y)
outer(list1, list2, Vectorize(is.subset))
# [,1] [,2] [,3]
# [1,] TRUE FALSE FALSE
# [2,] TRUE TRUE TRUE
# [3,] TRUE TRUE TRUE
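Note that outer requires a vectorized function, hence the Vectorize wrapper. Called on its own, the vectorized version pairs list elements positionally, which makes a handy sanity check:
V <- Vectorize(is.subset)
V(list1, list2)  # list1[[i]] vs list2[[i]], element-wise
# [1] TRUE TRUE TRUE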
@Michael's or @DWin's base version of is.subset will work just as well, but for part two of your question, I'd maintain that outer is the way to go.
is.subset <- function(x,y) {length(setdiff(x,y)) == 0}
First the combos of list1 elements that are subsets of list2 items:
> sapply(1:length(list1), function(i1) sapply(1:length(list2),
    function(i2) is.subset(list1[[i1]], list2[[i2]])))
      [,1] [,2] [,3]
[1,]  TRUE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
Then the unsurprising lack of any list2 items (all of length > 1) that are subsets of list1 items (all of length 1):
> sapply(1:length(list1), function(i1) sapply(1:length(list2),
    function(i2) is.subset(list2[[i2]], list1[[i1]])))
      [,1]  [,2]  [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
Adding to @Michael's answer, here's a neat way to avoid the messiness of expand.grid by using the I() (AsIs) function:
list2 <- list(1:3,2:3,2:3)
a <- data.frame(list1 = 1:3, I(list2))
a$subset <- apply(a, 1, function(.v) all(.v[[1]] %in% .v[[2]]) )
  list1   list2 subset
1     1 1, 2, 3   TRUE
2     2    2, 3   TRUE
3     3    2, 3   TRUE
