Removing all subsets from a list - R

I have a list that looks as follows:
a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)
list(a, b, c, d)
From this list I would like to remove all subsets such that the list looks as follows:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
How do I do this? In my actual data I am working with a very long list (> 500k elements) so any suggestions for an efficient implementation are welcome.

Here is an approach.
lst <- list(a, b, c, d) # The list
First, remove all duplicates.
lstu <- unique(lst)
If the list still contains more than one element, we order the list by the lengths of its elements (decreasing).
lstuo <- lstu[order(-lengths(lstu))]
Then subsets can be filtered with this command:
lstuo[c(TRUE, !sapply(2:length(lstuo),
                      function(x) any(sapply(seq_along(lstuo)[-x],
                                             function(y) all(lstuo[[x]] %in% lstuo[[y]])))))]
The result:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
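If you need to run this on the full 500k-element list more than once, the steps above can be wrapped into a single function. A minimal sketch, where drop_subsets is a hypothetical name:
# Hypothetical helper wrapping the steps above: deduplicate, order by
# decreasing length, then keep only elements not contained in another element.
drop_subsets <- function(lst) {
  lstu <- unique(lst)
  if (length(lstu) < 2) return(lstu)
  lstuo <- lstu[order(-lengths(lstu))]
  keep <- c(TRUE, !sapply(2:length(lstuo), function(x)
    any(sapply(seq_along(lstuo)[-x], function(y)
      all(lstuo[[x]] %in% lstuo[[y]])))))
  lstuo[keep]
}
drop_subsets(list(a, b, c, d))  # returns the two vectors shown above
Note that the element-vs-element check is still quadratic in the number of unique elements, so expect it to take a while on very long lists.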

Alternative approach
Your data
lst <- list(a, b, c, d) # The list
lstu <- unique(lst) # remove duplicates, piggyback Sven's approach
Make matrix of values and index
m <- combn(lstu, 2) # 2-row matrix of non-self pairwise combinations of values
n <- combn(length(lstu), 2) # 2-row matrix of non-self pairwise combination of index
Determine if subset
issubset <- t(sapply(list(c(1, 2), c(2, 1)), function(z)
  mapply(function(x, y) all(x %in% y), m[z[1], ], m[z[2], ])))
Discard subset vectors from list
discard <- c(n*issubset)[c(n*issubset)>0]
ans <- lstu[-discard]
Output
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
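The pairwise version can be wrapped up in the same way. A minimal sketch (drop_subsets_pairwise is a hypothetical name), with a guard for the case where no element is a subset of another:
# Hypothetical wrapper around the combn-based filtering above.
drop_subsets_pairwise <- function(lst) {
  lstu <- unique(lst)
  if (length(lstu) < 2) return(lstu)
  m <- combn(lstu, 2)          # pairwise combinations of values
  n <- combn(length(lstu), 2)  # pairwise combinations of indices
  issubset <- t(sapply(list(c(1, 2), c(2, 1)), function(z)
    mapply(function(x, y) all(x %in% y), m[z[1], ], m[z[2], ])))
  discard <- c(n * issubset)[c(n * issubset) > 0]
  if (length(discard)) lstu[-discard] else lstu  # lstu[-numeric(0)] would drop everything
}
drop_subsets_pairwise(list(a, b, c, d))  # same two vectors as above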

Related

Differing number of rows

Suppose I have a vector of numbers a <- c(1, 2, 3, 4, 5, 6) and a vector of positions b <- c(1, 2, 3).
I want to get, for each position in b, the numbers of a up to that position.
I do lapply(b, function(x) a[1:x]) and get the result
[[1]]
[1] 1
[[2]]
[1] 1 2
[[3]]
[1] 1 2 3
Now I want to combine these into a data frame. If the number of values were the same for every position, I would simply do t(as.data.frame(lapply(b, function(x) a[1:x]))), but I cannot do that here because the numbers of rows differ. How can I put zeros for the non-existing values?
If the output list is 'lst1', make all lengths equal with a length<- assignment and then replace the padding NAs with zeros:
lapply(lst1, function(x) {
  length(x) <- max(lengths(lst1))
  replace(x, is.na(x), 0)
})
data
lst1 <- lapply(b, function(x) a[1:x])
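To get the rectangular result asked for, the padded list can then be bound together row by row. A minimal sketch building on the code above:
# Pad each element with zeros (as above), then rbind into a matrix,
# one row per position in b.
padded <- lapply(lst1, function(x) {
  length(x) <- max(lengths(lst1))
  replace(x, is.na(x), 0)
})
do.call(rbind, padded)
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    1    2    0
# [3,]    1    2    3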

How to cast a list to a data frame with unequal column names, base R only

I have read
What is the most efficient way to cast a list as a data frame?
Convert a list to a data frame
I have a list with unequal column names which I am trying to convert to a data frame, with NA for the missing entries in the shorter rows. It is easy with the tidyverse (for example with bind_rows), but this is for a low-level package that should use base R only.
mylist = list(
list(a = 3, b = "anton"),
list(a = 5, b = "bertha"),
list(a = 7, b = "caesar", d = TRUE)
)
# No problem with equal number of columns
do.call(rbind, lapply(mylist[1:2], data.frame))
# The list of my names
unique(unlist(lapply(mylist, names)))
# rbind does not like unequal numbers
do.call(rbind, lapply(mylist, data.frame))
Find the unique column names in the list; then, inside lapply, add the missing columns using setdiff.
cols <- unique(unlist(sapply(mylist, names)))
do.call(rbind, lapply(mylist, function(x) {
  x <- data.frame(x)
  x[setdiff(cols, names(x))] <- NA
  x
}))
# a b d
#1 3 anton NA
#2 5 bertha NA
#3 7 caesar TRUE
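If this is needed in more than one place, the same idea can be packaged as a small base-R helper. A sketch, where rbind_fill_base is a hypothetical name:
# Hypothetical base-R helper: rbind a list of named lists / one-row data
# frames, filling columns missing from any element with NA.
rbind_fill_base <- function(lst) {
  cols <- unique(unlist(lapply(lst, names)))
  do.call(rbind, lapply(lst, function(x) {
    x <- data.frame(x, stringsAsFactors = FALSE)
    x[setdiff(cols, names(x))] <- NA
    x[cols]  # same column order for every piece
  }))
}
rbind_fill_base(mylist)
#   a      b    d
# 1 3  anton   NA
# 2 5 bertha   NA
# 3 7 caesar TRUE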
A pandas (Python) take on the same problem: build the frame by index instead of by column, then transpose it afterwards.
import pandas as pd
l1 = [1, 1]
l2 = [2, 2, 2, 2]
df = pd.DataFrame([l1, l2], index=('l1', 'l2'))
df.T
df.T
# l1 l2
# 0 1 2
# 1 1 2
# 2 NaN 2
# 3 NaN 2
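For completeness, the same pad-and-line-up idea works in base R too: pad each vector to a common length with NA and cbind them, so each input becomes one column, as in df.T above. A rough sketch:
# Base-R sketch of the same idea as the pandas transpose above.
l1 <- c(1, 1)
l2 <- c(2, 2, 2, 2)
pad <- function(x, n) { length(x) <- n; x }  # pads with NA
n <- max(length(l1), length(l2))
cbind(l1 = pad(l1, n), l2 = pad(l2, n))
#      l1 l2
# [1,]  1  2
# [2,]  1  2
# [3,] NA  2
# [4,] NA  2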

Nested for loop leading to: Error in `[<-.data.frame`(`*tmp*`, ...): replacement has x rows, data has y

I have 6 data frames (dfs) with a lot of data of different biological groups and another 6 data frames (tax.dfs) with taxonomical information about those groups. I want to replace a column of each of the 6 dfs with a column with the scientific name of each species present in the 6 tax.dfs.
To do that I created two lists of the data frames and I'm trying to apply a nested for loop:
dfs <- list(df.birds, df.mammals, df.crocs, df.snakes, df.turtles, df.lizards)
tax.dfs <- list(tax.birds,tax.mammals, tax.crocs, tax.snakes, tax.turtles, tax.lizards )
for (i in dfs) {
  for (y in tax.dfs) {
    i[, 1] <- y[, 2]
  }
}
And this is the output I'm getting:
Error in `[<-.data.frame`(`*tmp*`, , 1, value = c("Aotus trivirgatus", :
replacement has 64 rows, data has 43
But both data frames have the same number of rows; I actually used dfs to create tax.dfs by applying the tnrs_match_names function from the rotl package.
Any suggestions on how to fix this error, or on another way to do what I need, would be greatly appreciated.
Thank You!
For what it is worth, to iterate over two objects simultaneously, the following works:
Example Data:
df1 <- data.frame(a=1, b=2)
df2 <- data.frame(c=3, d=4)
df3 <- data.frame(e=5, f=6)
df_1 <- data.frame(a='A', b='B')
df_2 <- data.frame(c='C', d='D')
df_3 <- data.frame(e='E', f='F')
dfs <- list(df1, df2, df3)
df_s <- list(df_1, df_2, df_3)
Using mapply:
out <- mapply(function(one, two) {
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = F )
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
e f
1 F 6
Here, one and two in mapply correspond to the different elements in dfs and df_s. Having said that, let's make it a bit more interesting. Let's change my third example to the following:
df_3 <- data.frame(e=c('E', 'e'), f=c('F', 'f'))
df_s <- list(df_1, df_2, df_3) # needs to be executed again
Now, let's adjust the function:
out <- mapply(function(one, two) {
if(nrow(one) != nrow(two)){return('Wrong dimensions')}
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = F )
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
[1] "Wrong dimensions"

How to find elements common in at least 2 vectors?

Say I have 5 vectors:
a <- c(1,2,3)
b <- c(2,3,4)
c <- c(1,2,5,8)
d <- c(2,3,4,6)
e <- c(2,7,8,9)
I know I can calculate the intersection between all of them by using Reduce() together with intersect(), like this:
Reduce(intersect, list(a, b, c, d, e))
[1] 2
But how can I find elements that are common in, say, at least 2 vectors? i.e.:
[1] 1 2 3 4 8
It is much simpler than a lot of people are making it look. This should be very efficient.
Put everything into a vector:
x <- unlist(list(a, b, c, d, e))
Look for duplicates
unique(x[duplicated(x)])
# [1] 2 3 1 4 8
and sort if needed.
Note: In case there can be duplicates within a vector (which your example does not seem to imply), replace x with x <- unlist(lapply(list(a, b, c, d, e), unique))
Edit: as the OP has expressed interest in a more general solution where n >= 2, I would do:
which(tabulate(x) >= n)
if the data is only made of natural integers (1, 2, etc.) as in the example. If not:
f <- table(x)
names(f)[f >= n]
This is now not too far from James's solution, but it avoids the costly-ish sort. And it is miles faster than computing all possible combinations.
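Putting those pieces together, a minimal sketch of the general at-least-n version (at_least_n is a hypothetical name):
at_least_n <- function(vecs, n = 2) {
  x <- unlist(lapply(vecs, unique))  # guard against duplicates within a vector
  f <- table(x)
  as.numeric(names(f)[f >= n])       # drop as.numeric() for non-numeric data
}
at_least_n(list(a, b, c, d, e), 2)
# [1] 1 2 3 4 8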
You could try all possible combinations, for example:
## create a list
l <- list(a, b, c, d, e)
## get combinations
cbn <- combn(1:length(l), 2)
## Intersect them
unique(unlist(apply(cbn, 2, function(x) intersect(l[[x[1]]], l[[x[2]]]))))
## 2 3 1 4 8
Here's another option:
# For each vector, get a vector of values without duplicates
deduplicated_vectors <- lapply(list(a,b,c,d,e), unique)
# Flatten the lists, then sort and use rle to determine how many
# lists each value appears in
rl <- rle(sort(unlist(deduplicated_vectors)))
# Get the values that appear in two or more lists
rl$values[rl$lengths >= 2]
This is an approach that counts the number of vectors each unique value occurs in.
unique_vals <- unique(c(a, b, c, d, e))
setNames(rowSums(!!(sapply(list(a, b, c, d, e), match, x = unique_vals)),
na.rm = TRUE), unique_vals)
# 1 2 3 4 5 8 6 7 9
# 2 5 3 2 1 2 1 1 1
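To turn those counts into the requested values, keep the ones that reach at least 2, for example:
counts <- rowSums(!!(sapply(list(a, b, c, d, e), match, x = unique_vals)),
                  na.rm = TRUE)
unique_vals[counts >= 2]
# [1] 1 2 3 4 8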
A variation of @rengis' method would be:
unique(unlist(Map(`intersect`, cbn[1,], cbn[2,])))
#[1] 2 3 1 4 8
where,
l <- mget(letters[1:5])
cbn <- combn(l,2)
Yet another approach, applying a vectorised function with outer:
L <- list(a, b, c, d, e)
f <- function(x, y) intersect(x, y)
fv <- Vectorize(f, list("x","y"))
o <- outer(L, L, fv)
table(unlist(o[upper.tri(o)]))
# 1 2 3 4 8
# 1 10 3 1 1
The output above gives the number of pairs of vectors that share each of the duplicated elements 1, 2, 3, 4, and 8.
When the vector is huge, solutions like duplicated or tabulate might strain your system's memory. In that case, dplyr comes in handy with the following code (as_tibble() puts the vector into a column named value):
library(dplyr)
combination_of_vectors <- c(a, b, c, d, e)
# For values appearing in more than 1 vector
combination_of_vectors %>% as_tibble() %>% group_by(value) %>% filter(n() > 1)
# For values appearing in more than 2 vectors
combination_of_vectors %>% as_tibble() %>% group_by(value) %>% filter(n() > 2)
# For values appearing in more than 3 vectors
combination_of_vectors %>% as_tibble() %>% group_by(value) %>% filter(n() > 3)
Hope it helps somebody

Find indices of rows from matrix A in matrix B

Consider two matrices A and B, where the rows of A are a subset of the rows of B. How do I find the index of each row of A in matrix B?
Here is a reproducible example:
set.seed(30)
B <- matrix(rnorm(n = 30, mean = 0), ncol = 3)
A <- subset(B, B[, 1] > 1)
The goal is to find the indices idx, which in this case are rows 4 and 5.
Nested apply loops should do it.
apply(A, 1, function(a)
which(apply(B, 1, function(b) all(b==a)))
)
# [1] 4 5
Or alternatively, using colSums
apply(A, 1, function(a)
which(colSums(t(B) == a) == ncol(B)))
# [1] 4 5
Alternatively, you could use duplicated() on the row-bound matrices: a row of B that already appeared in A shows up as a duplicate, and which() returns its row index:
which(duplicated(rbind(A, B))[-seq_len(nrow(A))])
# [1] 4 5
A nice solution that avoids the nested apply loops, originally by @Arun:
match(apply(A, 1, paste, collapse="\b"), apply(B, 1, paste, collapse="\b"))
# [1] 4 5
This takes a slightly different approach and relies on the fact that a matrix is a vector; it won't work if you have data.frames:
which( B %in% A , arr.ind=TRUE )[1:nrow(A)]
#[1] 4 5
And if you had really big matrices and wanted to be a bit more efficient you could use %in% on a subset like so:
which( B[1:nrow(B)] %in% A[1:nrow(A)] , arr.ind=TRUE )
But I don't expect this would make too much of a difference except in really big matrices.
If you had your data as data.frames you could do the same thing by passing just the first column to which:
A <- data.frame(A)
B <- data.frame(B)
which( B$X1 %in% A$X1 )
#[1] 4 5
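If the data frames have several relevant columns, the paste-rows trick shown above carries over. A sketch (key is a hypothetical helper):
# Collapse each row into a single string, then match rows of A in B.
key <- function(d) do.call(paste, c(d, sep = "\b"))
match(key(A), key(B))
#[1] 4 5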

Resources