Finding all possible combinations of vector intersections? - r

I have a set of four vectors that look like this:
[1] PRI2CO HEISCO PRI2CO DIALGU DIALGU ALSEBL
Levels: ALSEBL DIALGU HEISCO PRI2CO
[1] PRI2CO TET2PA ALSEBL PRI2CO ALSEBL TET2PA
[7] HEISCO TET2PA
Levels: ALSEBL HEISCO PRI2CO TET2PA
I would like to generate a vector that contains all values that match between every possible combination of the four vectors. For the two above, it would contain ALSEBL, HEISCO, and PRI2CO. I've been doing every combination by hand so far, but it's tedious and I figure there has to be a better way. I tried writing a loop for it, but I'm pretty new to R and it hasn't worked yet. Here's what I've been doing:
trees.species.P234<-intersect(intersect(trees.species.P2,trees.species.P3),trees.species.P4)
> trees.species.P234
[1] "PRI2CO " "ALSEBL "
I was thinking a for loop that involved a factorial might do it, but I can't get it to work.

Here you go, using the same vectors as proposed by gadzooks:
v1 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v2 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
v3 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v4 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
veclist <- list(v1,v2,v3,v4)
combos <- Reduce(c,lapply(2:length(veclist),
function(x) combn(1:length(veclist),x,simplify=FALSE) ))
lapply(combos, function(x) Reduce(intersect,veclist[x]) )
#[[1]]
#[1] "PRI2CO" "HEISCO" "ALSEBL"
#
#[[2]]
#[1] "PRI2CO" "HEISCO" "DIALGU" "ALSEBL"
#
#[[3]]
#[1] "PRI2CO" "HEISCO" "ALSEBL"
#etc etc
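If the end goal is a single vector rather than a list, the per-combination results can be labelled and then flattened. The following is a minimal sketch, assuming the four vectors are kept in a named list; the names P1 to P4 and the variables idx, shared, and any_shared are made up for illustration and do not come from the question.

```r
v1 <- c("PRI2CO", "HEISCO", "PRI2CO", "DIALGU", "DIALGU", "ALSEBL")
v2 <- c("PRI2CO", "TET2PA", "ALSEBL", "PRI2CO", "ALSEBL", "TET2PA", "HEISCO", "TET2PA")

# Hypothetical plot names; v3 and v4 repeat v1 and v2 as in the answer above.
veclist <- list(P1 = v1, P2 = v2, P3 = v1, P4 = v2)
idx <- seq_along(veclist)

# All index combinations of size 2, 3, ..., length(veclist).
combos <- unlist(lapply(idx[-1], function(k) combn(idx, k, simplify = FALSE)),
                 recursive = FALSE)

# Label each combination, e.g. "P1&P2", so the results are self-describing.
names(combos) <- vapply(combos,
                        function(i) paste(names(veclist)[i], collapse = "&"),
                        character(1))

# Intersection for every combination.
shared <- lapply(combos, function(i) Reduce(intersect, veclist[i]))

# One vector holding every value shared by at least one combination.
any_shared <- unique(unlist(shared))
```

Individual results can then be looked up by label, for example shared[["P1&P2"]].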

First you have to list all the combinations. For that, use the combn function.
> combn(1:4,2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 2 2 3
[2,] 2 3 4 3 4 4
Now we can use the apply function to find the intersections between your vectors. But before that,
let's create a list of vectors. For easy reproducibility, I created this list.
c <- combn(1:4,2)
l <- list(c("a","b"),c("b","c"),c("c","d"),c("d","e"))
Result <- apply(c,2,function(x){intersect(l[[x[1]]],l[[x[2]]])})
The result will be a list. If you want it as a vector, you can use do.call:
do.call("c",Result)
[1] "b" "c" "d"
For unique components
unique(do.call("c",Result))
This can be used for large lists as well.
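combn can also apply a function to each combination directly through its FUN argument, which avoids indexing through a separate matrix (and avoids storing that matrix under the name c, which is also the name of a base function). A short sketch along those lines, reusing the toy list above; pair_shared and all_shared are illustrative names:

```r
l <- list(c("a", "b"), c("b", "c"), c("c", "d"), c("d", "e"))

# combn() hands each pair of list elements to FUN as a list of length 2.
pair_shared <- combn(l, 2,
                     FUN = function(p) intersect(p[[1]], p[[2]]),
                     simplify = FALSE)

# Flatten the per-pair results and drop duplicates.
all_shared <- unique(unlist(pair_shared))
all_shared
```

Using simplify = FALSE keeps the result a list even when the pairwise intersections have different lengths.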

v1 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v2 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
v3 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v4 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
vall <- unique(c(v1,v2,v3,v4))
for(x in vall){
if((x %in% v1)&(x %in% v2)&(x %in% v3)&(x %in% v4)){
print(x)}
}

Related

Storing unique values of each column (of a df) in list

It is straightforward to obtain unique values of a column using unique. However, I am looking to do the same but for multiple columns in a dataframe and store them in a list, all using base R. Importantly, it is not combinations I need but simply unique values for each individual column. I currently have the below:
# dummy data
df = data.frame(a = LETTERS[1:4]
,b = 1:4)
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols)
{
x = unique(i)
unique_values_by_col[[i]] = x
}
The problem comes when displaying unique_values_by_col, as it does not contain the column values. I believe the problem is that i is being passed to the loop as text, not as a variable.
Any help would be greatly appreciated. Thank you.
Why not avoid the for loop altogether using lapply:
lapply(df, unique)
Resulting in:
> $a
> [1] A B C D
> Levels: A B C D
> $b
> [1] 1 2 3 4
Alternatively, there is apply, which is designed to run over the rows (margin 1) or columns (margin 2):
apply(df,2,unique)
result:
> apply(df,2,unique)
a b
[1,] "A" "1"
[2,] "B" "2"
[3,] "C" "3"
[4,] "D" "4"
Though if you want a list, lapply returns a list, so it may be the better choice.
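One difference between the two that is easy to miss: apply() first converts the data frame to a matrix, so a mix of character and numeric columns comes back entirely as character, which is why the output above shows "1" to "4" in quotes. A small sketch of the comparison, using the question's dummy data (via_lapply and via_apply are illustrative names):

```r
df <- data.frame(a = LETTERS[1:4], b = 1:4)

via_lapply <- lapply(df, unique)    # list, one element per column
via_apply  <- apply(df, 2, unique)  # df is coerced to a character matrix first

class(via_lapply$b)       # column type is preserved
class(via_apply[, "b"])   # numbers have become strings
```

apply() returns a matrix here only because every column has the same number of unique values; with unequal counts it would return a list instead.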
Your for loop is almost right, just needs one fix to work:
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
x = unique(df[[i]])
unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4
i is just a character string, the name of a column within df, so unique(i) returns that name rather than the column's unique values.
Anyhow, the most standard way for this task is lapply() as shown by demirev.
Could this be what you're trying to do?
Map(unique,df)
Result:
$a
[1] A B C D
Levels: A B C D
$b
[1] 1 2 3 4

How to subset an environment by its variable names in r

I would like to subset an environment by its variable names.
e <- new.env(parent=emptyenv())
e$a <- 1
e$b <- 2
e$d <- 3
e[ls(e) %in% c("a","b", "c")]
### if e was a list, this would return the subset list(a=1, b=2)
I could not figure out how to subset elements of an environment by their names. Using lapply or eapply does not work either. What is the proper or easy way to subset an environment by its variable names?
Thank you.
Okay, after thinking this through a bit more, may I suggest:
mget(c("a","b"), envir=e)
#$a
#[1] 1
#
#$b
#[1] 2
My original solution was to use get() / mget() (maybe the OP saw my deleted comment earlier). Then I noticed that the OP had tried eapply(), so I thought about possible solutions with that. Here it is (with help from @thelatemail).
# try some different data type
e <- new.env(parent=emptyenv())
e$a <- 1:3
e$b <- matrix(1:4, 2)
e$c <- data.frame(x=letters[1:2],y=LETTERS[1:2])
You can use either of the following to collect objects in environment e into a list:
elst <- eapply(e, "[") ## my idea
elst <- eapply(e, identity) ## thanks to @thelatemail
elst <- as.list.environment(e) ## thanks to @thelatemail
#$a
#[1] 1 2 3
#$b
# [,1] [,2]
#[1,] 1 3
#[2,] 2 4
#$c
# x y
#1 a A
#2 b B
The as.list.environment() can be seen as the inverse operation of list2env(). It is mentioned in the "See Also" part of ?list2env.
The result elst is just an ordinary list. There are various ways to subset this list. For example:
elst[names(elst) %in% c("a","b")] ## no need to use "ls(e)" now
#$a
#[1] 1 2 3
#$b
# [,1] [,2]
#[1,] 1 3
#[2,] 2 4
mget(ls(e)[ls(e) %in% c('a','b','d')], e)
The [ operator usually returns the same type of object as the original, so I guess you're expecting an environment, rather than a list. The same environment but with a different set of elements, or a new environment with the specified elements? Either way I think you'll end up iterating, e.g.,
f = new.env(parent=emptyenv())
for (elt in c("a", "b"))
f[[elt]] = e[[elt]]
Working with environments is not very idiomatic R code, which might explain why there is not a more elegant solution.
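If a new environment is what you want, the loop can also be expressed with mget() and list2env(). This is only a sketch: subset_env is a hypothetical helper name, not a base R function, and it silently ignores requested names that are not bound in the environment.

```r
e <- new.env(parent = emptyenv())
e$a <- 1
e$b <- 2
e$d <- 3

# Hypothetical helper: copy the requested bindings into a fresh environment.
subset_env <- function(env, nms) {
  # Keep only the names that actually exist in env.
  keep <- intersect(nms, ls(env, all.names = TRUE))
  # mget() collects the values into a named list; list2env() turns that
  # list into bindings in the new environment.
  list2env(mget(keep, envir = env),
           envir = new.env(parent = parent.env(env)))
}

f <- subset_env(e, c("a", "b", "c"))
sort(ls(f))
```

The original environment e is left untouched, so this corresponds to the "new environment with the specified elements" reading of the question.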
You can use rlang::env_get_list() to get a list of the bindings:
rlang::env_get_list(env=e, c("a","b"))
#$a
#[1] 1
#
#$b
#[1] 2
If you're trying to get an environment, rather than a list, I'm not sure how you would do that, other than just creating a new environment using the output of rlang::env_get_list().
If you want to include elements in your list that might not exist in the environment (like "c"), you have to specify a default value - otherwise you'll get an error:
env_get_list(env = e, c("a","b","c"))
#Error in env_get_list(env = e, c("a", "b", "c")) : argument "default" is missing, with no default
env_get_list(env = e, c("a","b","c"),default=NULL)
#$a
#[1] 1
#
#$b
#[1] 2
#
#$c
#NULL
I assume you don't want c at all, so I'd do something like:
temp <- c("a","b","c")[c("a","b","c") %in% env_names(e)]
temp
[1] "a" "b"
env_get_list(env=e,temp)
#$a
#[1] 1
#
#$b
#[1] 2

R: Creating a data frame from list with missing values.

I have a list here that looks like this:
head(h)
[[1]]
[1] "gene=dnaA" "locus_tag=CD630_00010" "location=1..1320"
[[2]]
character(0)
[[3]]
[1] "locus_tag=CD630_05950" "location=719777..720313"
[[4]]
[1] "gene=dnrA" "locus_tag=CD630_00010" "location=50..1320"
I'm having trouble trying to manipulate this list to create a data.frame with three columns. For the rows with missing gene info, I want to list them as "gene=unnamed" and completely remove the empty rows into a matrix as shown:
[,1] [,2] [,3]
[1,] "gene=dnaA" "locus_tag=CD630_00010" "location=1..1320"
[2,] "gene=thrA" "locus_tag=CD630_05950" "location=719777..720313"
[3,] "gene=dnrA" "locus_tag=CD630_00010" "location=50..1320"
This is what I have right now, but I get an error about missing values in the gene column. Any suggestions?
h <- data.frame(h[lapply(h,length)>0])
h <- t(h)
rownames(h) <- NULL
# Data
l <- list(c("gene=dnaA","locus_tag=CD630_00010", "location=1..1320"),
character(0), c("locus_tag=CD630_05950", "location=719777..720313"),
c("gene=dnrA","locus_tag=CD630_00010" ,"location=50..1320" ))
# Manipulation
n <- sapply(l, length)
seq.max <- seq_len(max(n))
# pad every element with NA up to the longest length
df <- t(sapply(l, "[", i = seq.max))
# move the NAs to the front of each row (assumes the missing field is the gene)
df <- t(apply(df,1,function(x){
c(x[is.na(x)],x[!is.na(x)])}))
# drop rows that are entirely NA
df <- df[rowSums(!is.na(df))>0, ]
df[is.na(df)] <- "gene=unnamed"
Output:
[,1] [,2] [,3]
[1,] "gene=dnaA" "locus_tag=CD630_00010" "location=1..1320"
[2,] "gene=unnamed" "locus_tag=CD630_05950" "location=719777..720313"
[3,] "gene=dnrA" "locus_tag=CD630_00010" "location=50..1320"
There are a number of methods for binding lists with unequal lengths. See bind_rows from dplyr, rbind.fill from plyr, or rbindlist from data.table. Here is an approach using base R:
## Sample data
h <- list(letters[1:3],
character(0),
letters[4:5])
out <- do.call(rbind, lapply(h, `length<-`, 3)) # fix lengths and make matrix
out <- out[rowSums(!is.na(out))>0, ] # remove empty rows
out[is.na(out)] <- "gene=unnamed" # rename NA
data.frame(out)
# X1 X2 X3
# 1 a b c
# 2 d e gene=unnamed
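Both answers above fill gaps by position, which works when the gene entry is the only field that can be missing. A variant that places each entry by its key (the text before the "=") does not rely on that assumption. This is a minimal sketch, assuming the three keys shown in the question; keys, rows, and m are illustrative names.

```r
h <- list(c("gene=dnaA", "locus_tag=CD630_00010", "location=1..1320"),
          character(0),
          c("locus_tag=CD630_05950", "location=719777..720313"),
          c("gene=dnrA", "locus_tag=CD630_00010", "location=50..1320"))

keys <- c("gene", "locus_tag", "location")  # assumed set of fields

h <- h[lengths(h) > 0]                      # drop the empty entries

rows <- lapply(h, function(x) {
  vals <- setNames(x, sub("=.*$", "", x))   # name each entry by its key
  out <- vals[keys]                         # reorder; absent keys become NA
  names(out) <- keys
  out
})

m <- do.call(rbind, rows)                   # character matrix, one row per entry
m[is.na(m[, "gene"]), "gene"] <- "gene=unnamed"
m
```

Wrapping the result in as.data.frame(m) gives a data frame with the keys as column names.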

How to automate making a list of lists in R

I can make this list by hand:
list( list(n=1) , list(n=2), list(n=3) )
But how do I automate this, for instance if I want n to go up to 10? I tried as.list(1:10), which firstly is a different type of data structure, and secondly I couldn't work out how to specify n.
I'm hoping the answer can be expanded to multiple element lists, e.g. all combinations of 1:3 and c('A','B'):
list( list(n=1,z='A') , list(n=2,z='A'), list(n=3,z='A'),
list(n=1,z='B') , list(n=2,z='B'), list(n=3,z='B') )
Background: I'll be using it along the lines of: lapply( outer_list, function(params) do.call(FUN,params) )
UPDATE:
It was difficult to choose which answer to give the tick to. I went with the expand.grid approach as it can scale to more than two parameters more easily; the use of mapply as shown in the comment makes the two examples above look reasonably compact and readable:
outer_list=with( expand.grid(n=1:10,stringsAsFactors=FALSE),
mapply(list, n=n, SIMPLIFY=FALSE)
)
outer_list=with( expand.grid(n=1:3,z=c('A','B'), stringsAsFactors=FALSE),
mapply(list, n=n, z=z, SIMPLIFY=FALSE)
)
They violate the DRY principle, by repeating the parameter names in the mapply() call, which bothers me a little. So, when it bothers me enough I will use the alply call as shown in Sebastian's answer.
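One possible base R way around the repeated parameter names is to hand the columns of the grid to mapply() through do.call(), so the names are written only once in expand.grid(). This is a sketch rather than a recommendation; grid and label are illustrative names, and label is a made-up stand-in for whatever function the parameters are meant for.

```r
grid <- expand.grid(n = 1:3, z = c("A", "B"),
                    stringsAsFactors = FALSE, KEEP.OUT.ATTRS = FALSE)

# Each column of grid becomes a named argument to mapply(), so the
# parameter names are taken from the grid itself.
outer_list <- do.call(mapply, c(list(FUN = list, SIMPLIFY = FALSE), grid))

# Usage along the lines described in the question.
label <- function(n, z) paste0(z, n)
res <- lapply(outer_list, function(params) do.call(label, params))
unlist(res)
```

Adding a third parameter then only requires another argument to expand.grid().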
You don't need to expand using expand.grid.
L <- mapply(function(x, y) list("n"=x,"z"=y),
rep(1:10, each=10), LETTERS[1:10],
SIMPLIFY=FALSE)
EDIT (see comment below)
L <- mapply(function(x, y) list("n"=x,"z"=y),
rep(1:10, each=length(LETTERS[1:10])), LETTERS[1:10],
SIMPLIFY=FALSE)
vals <- expand.grid(n=1:3, z=c("A", "B"),
KEEP.OUT.ATTRS=FALSE, stringsAsFactors=FALSE)
library(plyr)
alply(vals, 1, as.list)
$`1`
$`1`$n
[1] 1
$`1`$z
[1] "A"
$`2`
$`2`$n
[1] 2
$`2`$z
[1] "A"
$`3`
$`3`$n
[1] 3
$`3`$z
[1] "A"
$`4`
$`4`$n
[1] 1
$`4`$z
[1] "B"
$`5`
$`5`$n
[1] 2
$`5`$z
[1] "B"
$`6`
$`6`$n
[1] 3
$`6`$z
[1] "B"
attr(,"split_type")
[1] "array"
attr(,"split_labels")
n z
1 1 A
2 2 A
3 3 A
4 1 B
5 2 B
6 3 B

R - preserve order when using matching operators (%in%)

I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces > [1] 1 2 3 which is the order "a" "b" "c" appears in the data frame. However, I would like to generate >[1] 2 1 3 which is the order they appear in the original vector.
Thanks!
Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
Be aware that if you have duplicate values in either vec or rownames(df), match may not behave as expected.
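To make the caveat concrete: match() reports only the position of the first match, so later duplicates among the row names are never returned, while a value repeated in the lookup vector yields the same position each time. A small sketch with made-up vectors keys and wanted:

```r
keys   <- c("a", "b", "b", "c")  # "b" occurs twice
wanted <- c("b", "a", "b")

first_hits <- match(wanted, keys)      # first matching position per element
all_hits   <- which(keys %in% wanted)  # every matching position, in keys order

first_hits
all_hits
```

If all duplicates are needed in the order of the lookup vector, the positions have to be collected per value, for example with lapply(wanted, function(w) which(keys == w)).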
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
Use match on the lookup vector, then drop the NA positions produced by any element of vec that has no matching row name:
df[Filter(function(x) !is.na(x), match(vec, rownames(df))), 1]
Since row name indexing also works on vectors, we can take this one step further and define:
'%ino%' <- function(x, table) {
xSeq <- seq_along(x)
names(xSeq) <- x
Out <- xSeq[as.character(table)]
Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, names() automatically converts its input to character, and table is converted with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
As with %in%, values that have no match are dropped:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in%, the logical vector is recycled along the dimension of the object being subsetted. This is not the case with %ino%, which returns positions:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A

Resources