How to make a unique combination of vectors in R?

I have a couple of vectors consisting of three names. I want to get all unique pairwise combinations of these vectors. As an example, with two of those vectors, I can get the non-unique combinations with
sham1 <- c('a', 'b')
sham2 <- c('d', 'e')
shams <- list(sham1, sham2)
combinations <- apply(expand.grid(shams, shams),1, unname)
which gives the following combinations
> dput(combinations)
list(
list(c("a", "b"), c("a", "b")),
list(c("d", "e"), c("a", "b")),
list(c("a", "b"), c("d", "e")),
list(c("d", "e"), c("d", "e"))
)
I tried using unique(combinations), but this gives the same result. What I would like to get is
> dput(combinations)
list(
list(c("a", "b"), c("a", "b")),
list(c("d", "e"), c("a", "b")),
list(c("d", "e"), c("d", "e"))
)
Because there is already the combination list(c("d", "e"), c("a", "b")), I don't need the combination list(c("a", "b"), c("d", "e"))
How can I get only the unique combination of vectors?

s <- seq(length(shams))
# get unique pairs of shams indexes, including each index with itself.
uniq.pairs <- unique(as.data.frame(t(apply(expand.grid(s, s), 1, sort))))
# V1 V2
# 1 1 1
# 2 1 2
# 4 2 2
result <- apply(uniq.pairs, 1, function(x) shams[x])
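A variant of the same index-pair idea, as a minimal sketch: instead of sorting every pair and calling unique(), keep only the index pairs with i <= j. The names idx and pairs are illustrative only and are not from the question.

```r
sham1 <- c('a', 'b')
sham2 <- c('d', 'e')
shams <- list(sham1, sham2)

# All index pairs, then keep each unordered pair once (i <= j keeps self-pairs too)
idx <- expand.grid(i = seq_along(shams), j = seq_along(shams))
idx <- idx[idx$i <= idx$j, ]

# Turn each index pair back into a list of two vectors
pairs <- Map(function(i, j) list(shams[[i]], shams[[j]]), idx$i, idx$j)
length(pairs)
```

This keeps the mirrored pair in the opposite orientation from the desired output in the question (a/b first, then d/e), which should not matter if order within a pair is irrelevant.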

I am also not exactly sure what you want but this function might help:
combn
Here is a simple example:
> combn(letters[1:4], 2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "c"
[2,] "b" "c" "d" "c" "d" "d"
I don't think this is what you want, but if you clarify perhaps I can edit to get you what you want:
> sham1<-c('a','b')
> sham2<-c('d','e')
> combn(c(sham1,sham2),2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "d"
[2,] "b" "d" "e" "d" "e" "e"

combn gets you the combinations (so, unique), but not the repeated ones. So combine that with something that gives you the repeated ones and you have it:
c(combn(shams, 2, simplify=FALSE),
lapply(shams, function(s) list(s,s)))

No idea what your examples are saying. If you want unique pairwise combinations:
strsplit(levels(interaction(sham1, sham2, sep="*")), "\\*")

I don't understand what you want, and it seems that you changed the desired output from your other question.
Do you want your two vectors nested in a list inside another list?
Isn't it simpler to nest just once, as you already have with shams?
dput(shams)
list(c("Sham1.r1", "Sham1.r2", "Sham1.r3"), c("Sham2.r1", "Sham2.r2",
"Sham2.r3"))
To create such a nested list you could use this:
combinations <- list(shams, "")
dput(combinations)
list(list(c("Sham1.r1", "Sham1.r2", "Sham1.r3"), c("Sham2.r1", "Sham2.r2",
"Sham2.r3")), "")
Although it is not exactly what you asked for...

Related

Returning the values of a list based on "two" parameters

Very new to R. So I am wondering if you can use two different parameters to get the position of both elements from a list. See the below example...
x <- c("A", "B", "A", "A", "B", "B", "C", "C", "A", "A", "B")
y <- c(which(x == "A"))
[1] 1 3 4 9 10
x[y]
[1] "A" "A" "A" "A" "A"
x[y+1]
[1] "B" "A" "B" "A" "B"
But I would like to return the positions of both y and y+1 together in the same list. My current solution is to merge the two above lists by row number and create a dataframe from there. I don't really like that and was wondering if there is another way. Thanks!
I don't know exactly what you want, but this could help:
newY = c(which(x == "A"),which(x == "A")+1)
After that you can sort it with
finaldata <- newY[order(newY)]
Or you do both in one step:
finaldata <- c(which(x == "A"),which(x == "A")+1)[order(c(which(x == "A"),which(x == "A")+1))]
Then you could also delete duplicates if you want to. Please tell me if this is what you wanted.
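The steps above (combine, sort, drop duplicates) can be sketched more compactly. This is only one way to do it; pos is an illustrative name, and the bounds check is an added assumption for the case where the last element of x is "A", so that y + 1 would point past the end of the vector.

```r
x <- c("A", "B", "A", "A", "B", "B", "C", "C", "A", "A", "B")
y <- which(x == "A")

# Positions of every "A" and of the element right after it, sorted, without duplicates
pos <- sort(unique(c(y, y + 1)))

# Guard: drop an index that would fall past the end of x
pos <- pos[pos <= length(x)]

x[pos]
```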

How do I get all pairs of values in a variable based on shared values in a different variable

My problem is perhaps a little difficult to formulate, hence I haven't found any solutions yet, but I'll try:
I want to find all pairs of values in a variable based on whether they share any value in another variable. Maybe the following example can illustrate it more clearly.
In a 2 variable data frame like this:
data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162"))
#> scaffold geneID
#> A 162
#> A 276
#> B 64
#> B 276
#> B 281
#> C 64
#> C 162
#> D 162
... I want to find all pairs of "scaffolds" A, B, C, and D, that share any of the "geneID"s 64, 162, 276, and 281, so that the above would become a data frame with all pairs of scaffolds in 2 new columns like this:
data.frame(V1 = c("A", "A", "A", "B", "C"), V2 =c("B", "C", "D", "C", "D"))
#> V1 V2
#> A B
#> A C
#> A D
#> B C
#> C D
Obviously A and B is the same pair as B and A, so these should be removed somehow, but that's probably easy. Afterwards, this data frame needs to be combined with a data frame containing x/y coordinates of the scaffolds for drawing a line between the pairs on top of a plot with the scaffolds.
I do have a working for-loop to do the job, but I need to replace that with a much faster alternative. I'll spare you the code, it's complicated and doesn't always do it right. Running it on just 20 scaffolds can take seconds, but I need to do it on thousands. I was hoping a series of dplyr or data.table functions could do the job as they probably are as fast as it gets, but I haven't been able to get my head around how.
I hope you can help me, or perhaps something similar is already in another thread I just wasn't able to find.
A performance comparison of the two solutions by @Florian and @Roman can be found at http://rpubs.com/kasperskytte/SO_question_48407650
Here is a possible solution. Note that I modified your example df so A and C share both 162 and 64, and we have to make sure that this group does not occur twice in the output.
df = data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D","A"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162","64"),stringsAsFactors = F)
y = split(df$scaffold,df$geneID)
unique(do.call(rbind,(lapply(y[which(sapply(y, length) > 1)],function(x){t(combn(sort(x),2))}))))
Output:
[,1] [,2]
[1,] "A" "C"
[2,] "A" "D"
[3,] "C" "D"
[4,] "A" "B"
[5,] "B" "C"
How it works: First we split the data into groups based on df$geneID, the result we call y. Then we lapply over every element of y that has more than 1 element in it a function that gives us all n possible combinations of 2 as a nx2 matrix. By calling sort() on x inside this function we make removing duplicates easier later on, because we then rbind this list into a large matrix, and call unique() on the result to remove duplicates.
Hope this helps!
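For comparison, the same pairs can also be found with a self-join on geneID in base R. This is a sketch of a different technique from the split/combn approach above, not a benchmarked replacement; joined and pairs are illustrative names, and it uses the original example data from the question.

```r
df <- data.frame(
  scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
  geneID   = c("162", "276", "64", "276", "281", "64", "162", "162"),
  stringsAsFactors = FALSE
)

# Self-join: every pair of rows that share a geneID
# (merge() names the two scaffold columns scaffold.x and scaffold.y)
joined <- merge(df, df, by = "geneID")

# Keep each unordered pair once and drop self-pairs
pairs <- joined[joined$scaffold.x < joined$scaffold.y, c("scaffold.x", "scaffold.y")]

# A pair that shares several genes appears several times, so de-duplicate
pairs <- unique(pairs)
pairs <- pairs[order(pairs$scaffold.x, pairs$scaffold.y), ]
names(pairs) <- c("V1", "V2")
rownames(pairs) <- NULL
pairs
```

The join materialises every matching row pair, so on genes shared by very many scaffolds the intermediate table can become large.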
See the comments in the code.
xy <- data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162"))
# split by gene
xy1 <- split(xy, f = xy$geneID)
# find all combinations
out <- sapply(xy1, FUN = function(x) {
x$scaffold <- as.character(x$scaffold)
# add NA so that we can remove any cases that have a single scaffold
tryCatch(t(combn(x$scaffold, 2)), error = function(e) NA)
}, simplify = FALSE)
# remove NAs and some fiddling to get the desired format
out <- out[!is.na(out)]
out <- do.call(rbind, out)
# sort the data
out <- t(apply(out, MARGIN = 1, FUN = function(x) sort(x)))
# remove duplicates
out <- out[!duplicated(out), ]
out
[,1] [,2]
[1,] "A" "C"
[2,] "A" "D"
[3,] "C" "D"
[4,] "A" "B"
[5,] "B" "C"

Return all elements of list containing certain strings

I have a list of vectors containing strings and I want R to give me another list with all vectors that contain certain strings. MWE:
list1 <- list("a", c("a", "b"), c("a", "b", "c"))
Now, I want a list that contains all vectors with "a" and "b" in it. Thus, the new list should contain two elements, c("a", "b") and c("a", "b", "c").
As list1[grep("a|b", list1)] gives me a list of all vectors containing either "a" or "b", I expected list1[grep("a&b", list1)] to do what I want, but it did not (it returned a list of length 0).
This should work:
test <- list("a", c("a", "b"), c("a", "b", "c"))
test[sapply(test, function(x) sum(c('a', 'b') %in% x) == 2)]
Try purrr::keep
library(purrr)
keep(list1, ~ all(c("a", "b") %in% .))
We can use Filter
Filter(function(x) all(c('a', 'b') %in% x), test)
#[[1]]
#[1] "a" "b"
#[[2]]
#[1] "a" "b" "c"
A solution with grepl:
> list1[grepl("a", list1) & grepl("b", list1)]
[[1]]
[1] "a" "b"
[[2]]
[1] "a" "b" "c"
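One caveat worth checking, shown as a small sketch: grepl() on a list matches against the text of each element, so a single string such as "ab" satisfies both patterns, while the %in% based answers test for whole elements. The helper contains_all and the list with_ab are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: keep the vectors that contain every value in `needed`
contains_all <- function(lst, needed) {
  lst[vapply(lst, function(v) all(needed %in% v), logical(1))]
}

list1 <- list("a", c("a", "b"), c("a", "b", "c"))
exact <- contains_all(list1, c("a", "b"))

# Where the two approaches differ: "ab" is a single element, not "a" and "b"
with_ab <- list("ab", c("a", "b"))
by_grepl <- with_ab[grepl("a", with_ab) & grepl("b", with_ab)]
by_match <- contains_all(with_ab, c("a", "b"))

length(by_grepl)  # both elements are kept
length(by_match)  # only c("a", "b") is kept
```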

How do I filter two matrices based on common values in a column?

I am trying to filter two matrices based on the first column
a <- matrix(c("b", "s", "a", "w", "r", "te", "fds", "s", "h", "a", "df", "tyi"), nrow = 4)
colnames(a) <- c("fir", "sec", "thi")
fir sec thi
[1,] "b" "r" "h"
[2,] "s" "te" "a"
[3,] "a" "fds" "df"
[4,] "w" "s" "tyi"
b <- matrix(c("a","b","c","d", "e", "f", "g", "h", "i"), nrow = 3)
colnames(b) <- c("fir", "sec", "thi")
fir sec thi
[1,] "a" "d" "g"
[2,] "b" "e" "h"
[3,] "c" "f" "i"
Basically what I want to do is subset matrix a based on the hits in b[,1]
So since (row1, col1) and (row3, col1) in matrix a match certain values in column 1 in matrix b, I'd like to extract those two rows.
I appreciate any tips and advice. Thank you.
Also, can someone explain why this doesn't work?
> c <- intersect(a[,1], b[,1])
> c
[1] "b" "a"
> a[a[,1]==c]
[1] "b" "r" "h"
You could try this, although there may be a more elegant way to do it.
matched <- a[,1] %in% b[,1]
a[matched,]
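As for why a[a[,1]==c] doesn't work: == compares element by element and recycles the shorter vector, so the first column is compared against "b", "a", "b", "a" rather than tested for membership. The resulting logical vector is then used without a comma, which indexes the matrix as a plain vector. A short sketch of both points, using the matrices from the question (the names common, recycled, and subset_a are illustrative):

```r
a <- matrix(c("b", "s", "a", "w", "r", "te", "fds", "s", "h", "a", "df", "tyi"), nrow = 4)
colnames(a) <- c("fir", "sec", "thi")
b <- matrix(c("a", "b", "c", "d", "e", "f", "g", "h", "i"), nrow = 3)
colnames(b) <- c("fir", "sec", "thi")

common <- intersect(a[, 1], b[, 1])

# == recycles `common`: c("b","s","a","w") is compared with c("b","a","b","a")
recycled <- a[, 1] == common

# %in% tests membership; the comma selects whole rows,
# and drop = FALSE keeps a matrix even if only one row matches
subset_a <- a[a[, 1] %in% common, , drop = FALSE]
subset_a
```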

Order a numeric vector by length in R

I've got two numeric vectors that I want to order by the length of their observations, i.e., the number of times each observation appears.
For example:
x <- c("a", "a", "a", "b", "b", "b", "b", "c", "e", "e")
Here, b occurs four times, a three times, e two and c one time. I'd like my result in this order.
ans <- c("b", "b", "b", "b", "a", "a", "a", "e", "e", "c")
I've tried this:
x <- x[order(-length(x))] # and some similar lines.
Thanks
Using rle you can get the values and their run lengths. You order the lengths, and use the values to recreate the vector in the new order:
xx <- c('a', 'a', 'a', 'b', 'b', 'b','b', 'c', 'e', 'e')
rr <- rle(xx)
ord <- order(rr$lengths,decreasing=TRUE)
rep(rr$values[ord],rr$lengths[ord])
## [1] "b" "b" "b" "b" "a" "a" "a" "e" "e" "c"
You may also use ave when calculating the lengths
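# ave() over seq_along(x) keeps the counts numeric, so counts of 10 or more still sort correctly
x[order(ave(seq_along(x), x, FUN = length), decreasing = TRUE)]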
# [1] "b" "b" "b" "b" "a" "a" "a" "e" "e" "c"
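Note that rle counts runs, so it relies on equal values being adjacent, as they are in the example. A sketch that does not depend on adjacency looks up each element's count in a frequency table; counts, freq, and sorted are illustrative names.

```r
x <- c("a", "a", "a", "b", "b", "b", "b", "c", "e", "e")

# Frequency of each distinct value
counts <- table(x)

# Look up the count for every element by name
freq <- as.integer(counts[x])

# Most frequent first; ties keep their original relative order
sorted <- x[order(-freq)]
sorted
```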

Resources