Pairwise analysis in R

I have a large data frame in which, for each pair of individuals (rows), I need to find the columns where the two rows hold the same value.
Here is an example of the dataframe:
>data
ID pos1234 pos1345 pos1456 pos1678
1 1 C A C G
2 2 C G A G
3 3 C A G A
4 4 C G C T
I transformed the dataframe into a pairwise matrix with:
apply(data, 2, combn, m=2)
ID pos1234 pos1345 pos1456 pos1678
[1,] "1" "C" "A" "C" "G"
[2,] "2" "C" "G" "A" "G"
[3,] "1" "C" "A" "C" "G"
[4,] "3" "C" "A" "G" "A"
[5,] "1" "C" "A" "C" "G"
[6,] "4" "C" "G" "C" "T"
[7,] "2" "C" "G" "A" "G"
[8,] "3" "C" "A" "G" "A"
[9,] "2" "C" "G" "A" "G"
[10,] "4" "C" "G" "C" "T"
[11,] "3" "C" "A" "G" "A"
[12,] "4" "C" "G" "C" "T"
I am now having trouble identifying the column containing the identical letters between pairs. For example, for pairs 1 and 2 the columns containing the identical letters would be pos1234 and pos1678.
Would it be possible to get a dataframe with just the identical letters for each pair of individuals?
Thanks in advance.

You can pass a function to combn:
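# lapply (not sapply) keeps a list even when every column differs for a pair;
# sapply would simplify to a matrix there and lengths() would then be wrong
res0 <- combn(nrow(data), 2, FUN = function(x)
  names(data[-1])[ lengths(lapply(data[x,-1], unique)) == 1 ], simplify=FALSE)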
which gives
[[1]]
[1] "pos1234" "pos1678"
[[2]]
[1] "pos1234" "pos1345"
[[3]]
[1] "pos1234" "pos1456"
[[4]]
[1] "pos1234"
[[5]]
[1] "pos1234" "pos1345"
[[6]]
[1] "pos1234"
To figure out which of these [[1]]..[[6]] correspond to which pairs, take combn again:
res <- setNames(res0, combn(data$ID, 2, paste, collapse="."))
which gives
$`1.2`
[1] "pos1234" "pos1678"
$`1.3`
[1] "pos1234" "pos1345"
$`1.4`
[1] "pos1234" "pos1456"
$`2.3`
[1] "pos1234"
$`2.4`
[1] "pos1234" "pos1345"
$`3.4`
[1] "pos1234"
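If a data frame is wanted rather than a list of column names, one possible sketch (not part of the answer above) compares the two rows of every pair in a single vectorised step and keeps the shared letter, with NA where the pair differs. The names pairs, shared and res_df are made up for illustration, and the example data frame is retyped from the question.

```r
# Example data retyped from the question
data <- data.frame(ID = 1:4,
                   pos1234 = c("C", "C", "C", "C"),
                   pos1345 = c("A", "G", "A", "G"),
                   pos1456 = c("C", "A", "G", "C"),
                   pos1678 = c("G", "G", "A", "T"),
                   stringsAsFactors = FALSE)

pairs <- combn(nrow(data), 2)            # 2 x 6 matrix of row indices
a <- as.matrix(data[pairs[1, ], -1])     # first member of each pair
b <- as.matrix(data[pairs[2, ], -1])     # second member of each pair

shared <- ifelse(a == b, a, NA)          # keep the letter only where both agree
rownames(shared) <- NULL

res_df <- data.frame(ID1 = data$ID[pairs[1, ]],
                     ID2 = data$ID[pairs[2, ]],
                     shared,
                     stringsAsFactors = FALSE)
res_df
```

Each row of res_df is one pair; a non-NA cell means both individuals carry that letter at that position.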

Related

Creating result groups in R, using each element once (combination without repetition)

I have a dataset of 6 individuals: A,B,C,D,E,F
I want to group these into two groups of three individuals and have done so with the combn function in R:
m <- combn(n, 3)
This gives me all 20 possible groups, in which individuals occur in multiple groups. From this set of groups I then want to find all possible combinations of results in which each individual is used only once.
I would like to do this using combinations without repetition:
C(n,r) = n! / (r! (n-r)!), and would therefore get 10 results (half of the 20 groups, because each group is paired with its complement) that would look like this:
abc + def
abd + cef
abe + cdf
abf + cde
acd + bef
ace + bdf
acf + bde
ade + bcf
adf + bce
aef + bcd
I am not sure how to code this in R, from the list of groups that I have generated.
Edit: to generate the dataset I am using I have used the following code:
individuals <- c("a","b","c","d","e","f")
n <- length(individuals)
x <- 3
comb = function(n, x) {
factorial(n) / factorial(n-x) / factorial(x)
}
comb(n,x)
(m <- combn(n, 3))
numbers <- m
letters <- individuals
for (i in 1:length(numbers)) {
m[i] <- letters[numbers[i]]
}
In base R:
Create combinations of 3 letters and store them in a list (simplify = FALSE)
Create new combinations of 2 groups (of 3 letters each)
Filter the list to keep only the combinations where the two parts have no element in common
individuals <- c("a","b","c","d","e","f")
combn(individuals, 3, simplify = FALSE) |>
combn(m = 2, simplify = FALSE) |>
Filter(f = \(x) !any(x[[1]] %in% x[[2]]))
output
[[1]]
[[1]][[1]]
[1] "a" "b" "c"
[[1]][[2]]
[1] "d" "e" "f"
[[2]]
[[2]][[1]]
[1] "a" "b" "d"
[[2]][[2]]
[1] "c" "e" "f"
[[3]]
[[3]][[1]]
[1] "a" "b" "e"
[[3]][[2]]
[1] "c" "d" "f"
[[4]]
[[4]][[1]]
[1] "a" "b" "f"
[[4]][[2]]
[1] "c" "d" "e"
[[5]]
[[5]][[1]]
[1] "a" "c" "d"
[[5]][[2]]
[1] "b" "e" "f"
[[6]]
[[6]][[1]]
[1] "a" "c" "e"
[[6]][[2]]
[1] "b" "d" "f"
[[7]]
[[7]][[1]]
[1] "a" "c" "f"
[[7]][[2]]
[1] "b" "d" "e"
[[8]]
[[8]][[1]]
[1] "a" "d" "e"
[[8]][[2]]
[1] "b" "c" "f"
[[9]]
[[9]][[1]]
[1] "a" "d" "f"
[[9]][[2]]
[1] "b" "c" "e"
[[10]]
[[10]][[1]]
[1] "a" "e" "f"
[[10]][[2]]
[1] "b" "c" "d"
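The approach above builds all 190 pairs of groups and then filters them. As an alternative sketch (an assumption on my part, not the answerer's method), the first individual can be fixed in the first group, so each split is generated exactly once and no filtering is needed. The name splits is hypothetical.

```r
individuals <- c("a", "b", "c", "d", "e", "f")

# Always put the first individual in group 1, then choose its two partners;
# group 2 is whatever is left over.
splits <- combn(individuals[-1], 2, simplify = FALSE, FUN = function(partners) {
  g1 <- c(individuals[1], partners)
  list(g1, setdiff(individuals, g1))
})

length(splits)  # choose(5, 2) splits
```

For larger inputs this avoids enumerating pairs of groups that are discarded anyway, although it only covers the case of two equal-sized groups.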

What is the fastest way to reduce elements of a list by frequency?

Suppose that I have a list similar to this one:
set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = TRUE)})
[[1]]
[1] "b"
[[2]]
[1] "d" "c"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "d" "d" "b" "c"
[[5]]
[1] "d" "d" "c" "c" "b"
[[6]]
[1] "b" "d" "b" "d" "c" "c"
[[7]]
[1] "a" "b" "d" "d" "b" "a" "d"
I would like to reduce each list element to a vector of length one holding the element with the highest overall frequency in the list. Note that a result of length > 1 is possible if there are no duplicates. The frequency table looks like this:
table(unlist(out))[order(table(unlist(out)), decreasing = T)]
b c d a
16 14 13 12
The outcome of the example is something like this:
list("b", "c", "b", "b", "b", "b", "b")
REMARK
It is possible to have vectors of length > 1 if there are no duplicates.
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = TRUE)})
length(out)
[1] 10
out[[length(out)+1]] <- c("L", "K")
out
[[1]]
[1] "c"
[[2]]
[1] "d" "a"
[[3]]
[1] "c" "b" "a"
[[4]]
[1] "b" "c" "b" "c"
[[5]]
[1] "a" "a" "d" "c" "d"
[[6]]
[1] "d" "b" "d" "d" "d" "a"
[[7]]
[1] "d" "b" "c" "c" "d" "c" "a"
[[8]]
[1] "d" "a" "d" "b" "d" "a" "b" "d"
[[9]]
[1] "a" "b" "b" "b" "c" "c" "a" "c" "d"
[[10]]
[1] "d" "d" "d" "a" "d" "d" "c" "c" "a" "c"
[[11]]
[1] "L" "K"
Expected outcome:
list("c", "d", "c", "c", "d", "d", "d", "d", "d", "d", c("L", "K"))
I believe that this should work for what you are looking for.
# get counts for entire list and order them
myRanks <- sort(table(unlist(out)), decreasing=TRUE)
This produces
myRanks
b c d a
10 9 5 4
# calculate if most popular, then second most popular, ... item shows up for each list item
sapply(out, function(i) names(myRanks)[min(match(i, names(myRanks)))])
[1] "b" "b" "b" "c" "b" "b" "b"
Here, sapply runs through each list item and returns a vector. It applies a function that selects the name of the first element (via min) of the myRanks table that appears in the list element, using match.
In the case of multiple elements having the same count (ties) in the myRanks table, the following code should return a list of the top observations per list item:
sapply(out,
function(i) {
intersect(names(myRanks)[myRanks == max(unique(myRanks[match(i, names(myRanks))]))],
i)})
Here, the names of myRanks that have the same value as the value in the list item with the highest value in myRanks are intersected with the names present in the list item in order to only return values in both sets.
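To make the tie handling concrete, here is a small self-contained sketch on a hand-written list (not the random data from the question). It looks up the overall count of each distinct value in a list element and keeps every value that reaches the maximum. The function name top_by_frequency and the example list are assumptions made for illustration.

```r
top_by_frequency <- function(lst) {
  freq <- table(unlist(lst))
  counts <- setNames(as.vector(freq), names(freq))  # plain named vector of counts
  lapply(lst, function(v) {
    f <- counts[unique(v)]       # overall counts of the values present in v
    names(f)[f == max(f)]        # keep all values tied for the top count
  })
}

example <- list("b", c("d", "c", "c"), c("b", "a", "b"), c("L", "K"))
res <- top_by_frequency(example)
str(res)
```

In this example "b" occurs three times overall and "c" twice, so the first three elements reduce to a single letter, while "L" and "K" are tied and are both kept.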
This should work:
set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = TRUE)})
out
#[[1]]
#[1] "b"
#[[2]]
#[1] "c" "b"
#[[3]]
#[1] "b" "b" "b"
#[[4]]
#[1] "d" "c" "c" "d"
#[[5]]
#[1] "d" "b" "a" "a" "c"
#[[6]]
#[1] "a" "b" "c" "b" "c" "c"
#[[7]]
#[1] "a" "c" "d" "b" "d" "c" "b"
tbl <- table(unlist(out))[order(table(unlist(out)), decreasing = TRUE)]
sapply(out, function(x) intersect(names(tbl), x)[1])
# [1] "b" "b" "b" "c" "b" "b" "b"
[EDIT]
set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = TRUE)})
out[[length(out)+1]] <- c("L", "K")
out
#[[1]]
#[1] "b"
#[[2]]
#[1] "c" "b"
#[[3]]
#[1] "b" "b" "b"
#[[4]]
#[1] "d" "c" "c" "d"
#[[5]]
#[1] "d" "b" "a" "a" "c"
#[[6]]
#[1] "a" "b" "c" "b" "c" "c"
#[[7]]
#[1] "a" "c" "d" "b" "d" "c" "b"
#[[8]]
#[1] "L" "K"
tbl <- table(unlist(out))[order(table(unlist(out)), decreasing = TRUE)]
#tbl
#b c d a K L
#10 9 5 4 1 1
lapply(out, function(x) names(tbl[tbl==max(tbl[names(tbl) %in% intersect(names(tbl), x)])]))
#[[1]]
#[1] "b"
#[[2]]
#[1] "b"
#[[3]]
#[1] "b"
#[[4]]
#[1] "c"
#[[5]]
#[1] "b"
#[[6]]
#[1] "b"
#[[7]]
#[1] "b"
#[[8]]
#[1] "K" "L"

R : Extract elements of same index & depth level of a list

Here is a list :
# Build a toy list
x1=letters[1:3]
x2=letters[4:5]
x3=letters[1:8]
toy_list=list(list(list("ABX",x1),
list("ZHK",x2)),
list(list("CCC",x3)))
[[1]]
[[1]][[1]]
[[1]][[1]][[1]]
[1] "ABX"
[[1]][[1]][[2]]
[1] "a" "b" "c"
[[1]][[2]]
[[1]][[2]][[1]]
[1] "ZHK"
[[1]][[2]][[2]]
[1] "d" "e"
[[2]]
[[2]][[1]]
[[2]][[1]][[1]]
[1] "CCC"
[[2]][[1]][[2]]
[1] "a" "b" "c" "d" "e" "f" "g" "h"
Let's suppose I want to extract all elements in, for example, the 2nd position at a depth level of 3. In other words, I want to extract the elements at index [[1]][[1]][[2]], [[1]][[2]][[2]] and [[2]][[1]][[2]], which means I want my output to be
[[1]]
[1] "a" "b" "c"
[[2]]
[1] "d" "e"
[[3]]
[1] "a" "b" "c" "d" "e" "f" "g" "h"
How would you do that in a generalized way?
With purrr, you can use map_depth(2, ...) (older purrr releases offered at_depth() for this, which is no longer available in current versions), where 2 indicates the depth level and ... is an extractor (name/integer) or function. Flatten the structure afterwards:
library(purrr)
toy_list %>% map_depth(2, 2) %>% flatten()
## [[1]]
## [1] "a" "b" "c"
##
## [[2]]
## [1] "d" "e"
##
## [[3]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
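For a version without purrr, a short recursive helper in base R can do the same job. This is only a sketch under the question's convention that "depth 3, position 2" means indices of the form [[i]][[j]][[2]]; the function name extract_at_depth is made up.

```r
# Collect element `index` from every sub-list that sits `depth` levels down
extract_at_depth <- function(x, depth, index) {
  if (depth == 1) {
    return(list(x[[index]]))     # wrap so that results can be concatenated
  }
  do.call(c, lapply(x, extract_at_depth, depth = depth - 1, index = index))
}

x1 <- letters[1:3]
x2 <- letters[4:5]
x3 <- letters[1:8]
toy_list <- list(list(list("ABX", x1),
                      list("ZHK", x2)),
                 list(list("CCC", x3)))

res <- extract_at_depth(toy_list, depth = 3, index = 2)
str(res)
```

The helper assumes every branch is at least `depth` levels deep and has an element at `index`; a ragged list would need extra checks.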

R: Create list from vector in "triangular" form

I found it hard to formulate the question, but I would like to find a clever way (without a loop) to get the following result:
> my.vector = letters[1:6]
> print(my.vector)
[1] "a" "b" "c" "d" "e" "f"
>
> my.list = (rep(list(NA),6))
> for (i in 1:length(my.vector)){
+ x = my.vector[1:i]
+ my.list[[i]] = x
+ }
> print(my.list)
[[1]]
[1] "a"
[[2]]
[1] "a" "b"
[[3]]
[1] "a" "b" "c"
[[4]]
[1] "a" "b" "c" "d"
[[5]]
[1] "a" "b" "c" "d" "e"
[[6]]
[1] "a" "b" "c" "d" "e" "f"
Thanks in advance,
Gabriel.
You can do:
lapply(seq_along(my.vector), head, x = my.vector)
Here's an approach (more verbose than @akrun's, but not dependent on the actual values in the original vector).
split(my.vector[sequence(seq_along(my.vector))],
rep(seq_along(my.vector), seq_along(my.vector)))
## $`1`
## [1] "a"
##
## $`2`
## [1] "a" "b"
##
## $`3`
## [1] "a" "b" "c"
##
## $`4`
## [1] "a" "b" "c" "d"
##
## $`5`
## [1] "a" "b" "c" "d" "e"
##
## $`6`
## [1] "a" "b" "c" "d" "e" "f"
##
If you wanted a matrix instead of a list, you can try:
> x <- t(replicate(length(my.vector), my.vector))
> x[upper.tri(x)] <- ""
> x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "" "" "" "" ""
[2,] "a" "b" "" "" "" ""
[3,] "a" "b" "c" "" "" ""
[4,] "a" "b" "c" "d" "" ""
[5,] "a" "b" "c" "d" "e" ""
[6,] "a" "b" "c" "d" "e" "f"
We can use
v1 <- my.vector[sequence(seq_along(my.vector))]
split(v1, cumsum(v1=='a'))
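Another base R option, offered here as a sketch rather than as one of the original answers, is Reduce() with accumulate = TRUE, which returns every intermediate result of growing the vector one element at a time. Like the head() approach, it does not depend on the values in the vector. The name prefixes is hypothetical.

```r
my.vector <- letters[1:6]

# Each step appends the next element; accumulate = TRUE keeps all partial results
prefixes <- Reduce(c, my.vector, accumulate = TRUE, simplify = FALSE)
str(prefixes)
```

simplify = FALSE keeps the result a list even for a vector of length one.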

R: generate all possible combinations of size m from a character vector of n elements [duplicate]

I have the vector c("T", "A", "C", "G") for genomic data. I want to generate all possible combinations of size 3, with repeats, such as:
T T T
T T A
T T C
T T G
T A T
..
that would give me 4^3=64 combinations. Combinations of size 4 would give 4^4, and for size 5 should give 4^5=1024 rows.
I searched through Stack Overflow and think expand.grid() would do that, but I couldn't work out how to use it to get the desired output. Any ideas?
x <- c("T", "A", "C", "G")
do.call(expand.grid, rep(list(x), 3))
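expand.grid() varies its first column fastest, so the rows come out in a different order from the one listed in the question. A small sketch of one way to get the "T T T, T T A, ..." order is to reverse the columns afterwards; the names m, grid and codons are made up for illustration.

```r
x <- c("T", "A", "C", "G")
m <- 3                                   # size of each combination

grid <- expand.grid(rep(list(x), m), stringsAsFactors = FALSE)
grid <- grid[, m:1]                      # reverse columns: last position now varies fastest

codons <- apply(grid, 1, paste, collapse = "")
nrow(grid)                               # 4^3 rows
head(codons, 5)
```

Changing m to 4 or 5 gives the 4^4 and 4^5 rows mentioned in the question.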
permutations from gtools is designed to do just this:
library(gtools)
data <- c("T", "A", "C", "G")
permutations(4, 3, data, repeats.allowed = TRUE)
## [,1] [,2] [,3]
## [1,] "A" "A" "A"
## [2,] "A" "A" "C"
## [3,] "A" "A" "G"
## [4,] "A" "A" "T"
## [5,] "A" "C" "A"
## [6,] "A" "C" "C"
## [7,] "A" "C" "G"
## [8,] "A" "C" "T"
## [9,] "A" "G" "A"
## [10,] "A" "G" "C"
## [11,] "A" "G" "G"
## [12,] "A" "G" "T"
## [13,] "A" "T" "A"
## [14,] "A" "T" "C"
## [15,] "A" "T" "G"
## [16,] "A" "T" "T"
## [17,] "C" "A" "A"
## [18,] "C" "A" "C"
## [19,] "C" "A" "G"
## [20,] "C" "A" "T"
…
