Creating result groups in R, using each element once (combination without repetition) - r

I have a dataset of 6 individuals: A,B,C,D,E,F
I want to group these into two groups of three individuals and have done so with the combn function in R:
m <- combn(n, 3)
This gives me all 20 possible groups, where individuals occur in multiple groups. From this set of groups I then want to find all possible pairings of groups in which each individual is used only once.
I would like to do this using combinations without repetition:
C(n,r) = n! / (r!(n-r)!), and would therefore expect 10 results (each split pairs one of the 20 groups with its complement, so 20 / 2 = 10) that would look like this:
abc + def
abd + cef
abe + cdf
abf + cde
acd + bef
ace + bdf
acf + bde
ade + bcf
adf + bce
aef + bcd
I am not sure how to code this in R, from the list of groups that I have generated.
Edit: to generate the dataset I am using I have used the following code:
individuals <- c("a","b","c","d","e","f")
n <- length(individuals)
x <- 3
comb = function(n, x) {
factorial(n) / factorial(n-x) / factorial(x)
}
comb(n,x)
(m <- combn(n, 3))
numbers <- m
letters <- individuals
for (i in 1:length(numbers)) {
m[i] <- letters[numbers[i]]
}
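As a side note, the index-to-letter loop is probably not needed: combn accepts the vector itself, and choose gives the count directly. A minimal sketch, assuming the same six individuals:

```r
individuals <- c("a", "b", "c", "d", "e", "f")

# combn works on the character vector directly; each column is one group of 3
m <- combn(individuals, 3)
n_groups <- ncol(m)

# choose(n, k) is the built-in binomial coefficient
n_choose <- choose(length(individuals), 3)

# each split into two groups is counted twice (group + complement)
n_splits <- n_choose / 2
```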

In base R:
Create combinations of 3 letters and store them in a list (simplify = FALSE)
Create new combinations of 2 groups (of 3 letters each)
Filter the list to keep only the combinations where the two parts have no element in common
individuals <- c("a","b","c","d","e","f")
combn(individuals, 3, simplify = FALSE) |>
combn(m = 2, simplify = FALSE) |>
Filter(f = \(x) !any(x[[1]] %in% x[[2]]))
output
[[1]]
[[1]][[1]]
[1] "a" "b" "c"
[[1]][[2]]
[1] "d" "e" "f"
[[2]]
[[2]][[1]]
[1] "a" "b" "d"
[[2]][[2]]
[1] "c" "e" "f"
[[3]]
[[3]][[1]]
[1] "a" "b" "e"
[[3]][[2]]
[1] "c" "d" "f"
[[4]]
[[4]][[1]]
[1] "a" "b" "f"
[[4]][[2]]
[1] "c" "d" "e"
[[5]]
[[5]][[1]]
[1] "a" "c" "d"
[[5]][[2]]
[1] "b" "e" "f"
[[6]]
[[6]][[1]]
[1] "a" "c" "e"
[[6]][[2]]
[1] "b" "d" "f"
[[7]]
[[7]][[1]]
[1] "a" "c" "f"
[[7]][[2]]
[1] "b" "d" "e"
[[8]]
[[8]][[1]]
[1] "a" "d" "e"
[[8]][[2]]
[1] "b" "c" "f"
[[9]]
[[9]][[1]]
[1] "a" "d" "f"
[[9]][[2]]
[1] "b" "c" "e"
[[10]]
[[10]][[1]]
[1] "a" "e" "f"
[[10]][[2]]
[1] "b" "c" "d"
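A minimal sketch of a more direct route, assuming the order of the two groups does not matter: fixing the first individual in group one generates each split exactly once, so no filtering step is needed. The names rest and splits are illustrative and do not come from the answer above.

```r
individuals <- c("a", "b", "c", "d", "e", "f")

# The first individual always goes into group one; choose its 2 companions.
rest <- individuals[-1]
splits <- combn(rest, 2, simplify = FALSE, FUN = function(pair) {
  g1 <- c(individuals[1], pair)       # group one
  g2 <- setdiff(individuals, g1)      # group two is the complement
  paste(paste(g1, collapse = ""), paste(g2, collapse = ""), sep = " + ")
})
splits <- unlist(splits)
splits
```

This should list the splits in the same "abc + def" form used in the question.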

Related

Split a vector into non-overlapping sub-list with increasing length

Let's say I have this vector:
letters[1:7]
[1] "a" "b" "c" "d" "e" "f" "g"
I would like to split it into a list of non-overlapping sub-lists whose lengths increase by 1, keeping whatever is left over at the end (e.g. sub-list 4 should have 4 elements, but there is only one element left, and I would like to keep that one), like the following:
[[1]]
[1] "a"
[[2]]
[1] "b" "c"
[[3]]
[1] "d" "e" "f"
[[4]]
[1] "g"
Please do let me know any direction to solve this, thank you!
Example vector:
x <- letters[1:7]
Solution:
n <- ceiling(0.5 * sqrt(1 + 8 * length(x)) - 0.5)
split(x, rep(1:n, 1:n)[1:length(x)])
#$`1`
#[1] "a"
#
#$`2`
#[1] "b" "c"
#
#$`3`
#[1] "d" "e" "f"
#
#$`4`
#[1] "g"
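The value of n can be read as the smallest n whose triangular number n(n+1)/2 reaches length(x), which comes from solving n^2 + n - 2*length(x) = 0 and rounding up. A small sketch, wrapping the two lines above in a hypothetical helper split_increasing() (the name is an assumption for illustration):

```r
# Hypothetical wrapper around the one-liner above.
split_increasing <- function(x) {
  len <- length(x)
  # smallest n with n * (n + 1) / 2 >= len
  n <- ceiling(0.5 * sqrt(1 + 8 * len) - 0.5)
  # group ids 1, 2, 2, 3, 3, 3, ... truncated to the length of x
  groups <- rep(seq_len(n), seq_len(n))[seq_len(len)]
  unname(split(x, groups))
}

len7 <- lengths(split_increasing(letters[1:7]))    # last group is the remainder
len10 <- lengths(split_increasing(letters[1:10]))  # 10 is triangular, no remainder
```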
Something quick'n dirty
splitter = function(x) {
n = length(x)
i = 1
while ( i * (i + 1L) / 2L < (n-i) ) i = i + 1
out = rep(i+1, n)
out[1:(i * (i + 1L) / 2L)] = rep(1:i, 1:i)
unname(split(x, out))
}
splitter(x)
[[1]]
[1] "a"
[[2]]
[1] "b" "c"
[[3]]
[1] "d" "e" "f"
[[4]]
[1] "g"
x <- letters[1:7]
splt <- rep(seq(length(x)), seq(length(x)))[seq(length(x))]
split(x, splt)
#> $`1`
#> [1] "a"
#>
#> $`2`
#> [1] "b" "c"
#>
#> $`3`
#> [1] "d" "e" "f"
#>
#> $`4`
#> [1] "g"
Created on 2022-08-04 by the reprex package (v2.0.1)

Retrieve list of path name from igraph all_simple_paths

I have an adjacency matrix for a directed, cyclic graph and need to extract all the simple paths between any i and j.
The following is my example matrix:
>M2<-matrix(c(1,1,0,0,0,1,1,1,1,0,0,1,1,1,0,0,1,0,1,1,0,0,0,1,1), 5, byrow=T)
>colnames(M2)<-c("A", "B", "C", "D", "E")
>row.names(M2)=colnames(M2)
>M2
A B C D E
A 1 1 0 0 0
B 1 1 1 1 0
C 0 1 1 1 0
D 0 1 0 1 1
E 0 0 0 1 1
I use igraph to convert the matrix to a graph object using the graph_from_adjacency_matrix function.
>graph<-graph_from_adjacency_matrix(M2, mode=c("directed"), weighted=NULL, diag=F, add.colnames=NULL, add.rownames=NA)
>graph
IGRAPH DN-- 5 9 --
+ attr: name (v/c)
+ edges (vertex names):
[1] A->B B->A B->C B->D C->B C->D D->B D->E E->D
From there I use the all_simple_paths function to get all the simple paths between i and j. This is where my questions start.
1) I can set j (the to argument, as to=V(graph)) to all possible end vertices, but I can't set the from argument so that all vertices are treated as possible starting points; I have to specify each vertex one at a time. Any solution?
2) The all_simple_paths function works well and gives me all the simple paths between i and j, e.g. for simple paths starting in A and ending in any possible j:
>Simple_path_list<-all_simple_paths(graph, from ="A", to=V(graph), mode = c("out"))
>Simple_path_list
[[1]]
+ 2/5 vertices, named:
[1] A B
[[2]]
+ 3/5 vertices, named:
[1] A B C
[[3]]
+ 4/5 vertices, named:
[1] A B C D
[[4]]
+ 5/5 vertices, named:
[1] A B C D E
[[5]]
+ 3/5 vertices, named:
[1] A B D
[[6]]
+ 4/5 vertices, named:
[1] A B D E
My problem is that I need to collect all those paths and put them in a list, e.g.:
Paths
A B
A B C
A B C D
A B C D E
A B D
A B D E
I tried to create a list and get the path names with the usual list<-Simple_path_list[1] or similar, but I always get, together with the paths, the information on the number of vertices involved (e.g., + 4/5 vertices, named). Any idea how I can retrieve only the path names and not the other information?
Calling all_simple_paths inside lapply over V(graph) produces a list of lists (one list of paths per starting vertex). Flatten it to a single list with unlist(..., recursive = FALSE), then use names or igraph's as_ids to extract just the vertex ids.
library(igraph)
M2<-matrix(c(1,1,0,0,0,1,1,1,1,0,0,1,1,1,0,0,1,0,1,1,0,0,0,1,1), 5, byrow=T)
colnames(M2)<-c("A", "B", "C", "D", "E")
row.names(M2)=colnames(M2)
M2
graph<-graph_from_adjacency_matrix(M2, mode=c("directed"), weighted=NULL, diag=F, add.colnames=NULL, add.rownames=NA)
l <- unlist(lapply(V(graph) , function(x) all_simple_paths(graph, from=x)), recursive = F)
paths <- lapply(1:length(l), function(x) as_ids(l[[x]]))
This produces:
> paths
[[1]]
[1] "A" "B"
[[2]]
[1] "A" "B" "C"
[[3]]
[1] "A" "B" "C" "D"
[[4]]
[1] "A" "B" "C" "D" "E"
[[5]]
[1] "A" "B" "D"
[[6]]
[1] "A" "B" "D" "E"
[[7]]
[1] "B" "A"
[[8]]
[1] "B" "C"
[[9]]
[1] "B" "C" "D"
[[10]]
[1] "B" "C" "D" "E"
[[11]]
[1] "B" "D"
[[12]]
[1] "B" "D" "E"
[[13]]
[1] "C" "B"
[[14]]
[1] "C" "B" "A"
[[15]]
[1] "C" "B" "D"
[[16]]
[1] "C" "B" "D" "E"
[[17]]
[1] "C" "D"
[[18]]
[1] "C" "D" "B"
[[19]]
[1] "C" "D" "B" "A"
[[20]]
[1] "C" "D" "E"
[[21]]
[1] "D" "B"
[[22]]
[1] "D" "B" "A"
[[23]]
[1] "D" "B" "C"
[[24]]
[1] "D" "E"
[[25]]
[1] "E" "D"
[[26]]
[1] "E" "D" "B"
[[27]]
[1] "E" "D" "B" "A"
[[28]]
[1] "E" "D" "B" "C"
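If a flat table in the style requested in the question ("A B C" per row) is more convenient, the character vectors can be collapsed with paste. A sketch in base R, using a short hand-written stand-in for the first entries of paths so that it runs without igraph; the column names from, to and path are illustrative assumptions:

```r
# Stand-in for the first three entries of `paths` shown above
paths <- list(c("A", "B"), c("A", "B", "C"), c("A", "B", "C", "D"))

path_df <- data.frame(
  from = vapply(paths, function(p) p[1], character(1)),          # start vertex
  to   = vapply(paths, function(p) p[length(p)], character(1)),  # end vertex
  path = vapply(paths, paste, character(1), collapse = " "),     # "A B C"
  stringsAsFactors = FALSE
)
path_df
```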
Addition
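For all_shortest_paths, each call returns a list that holds the paths together with the geodesic counts, so you must keep only the paths for each node before flattening.
l <- lapply(V(graph), function(x) all_shortest_paths(graph, from = x))
l <- lapply(l, function(x) x$res)  # keep the paths, drop the geodesic counts (nrgeo)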
l <- unlist(l, recursive = F)
paths <- lapply(1:length(l), function(x) as_ids(l[[x]]))

What is the fastest way to reduce elements of a list by frequency?

Suppose that I have a list similar to this one:
set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = T)})
[[1]]
[1] "b"
[[2]]
[1] "d" "c"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "d" "d" "b" "c"
[[5]]
[1] "d" "d" "c" "c" "b"
[[6]]
[1] "b" "d" "b" "d" "c" "c"
[[7]]
[1] "a" "b" "d" "d" "b" "a" "d"
I would like to reduce each vector to length one, keeping the element that has the highest frequency across the whole list. Note that it is possible to have vectors of length > 1 if there are no duplicates. The frequency table looks like this:
table(unlist(out))[order(table(unlist(out)), decreasing = T)]
b c d a
16 14 13 12
The outcome of the example is something like this:
list("b", "c", "b", "b", "b", "b", "b")
REMARK
It is possible to have vectors of length > 1 if there are no duplicates, for example:
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = T)})
length(out)
[1] 10
out[[length(out)+1]] <- c("L", "K")
out
[[1]]
[1] "c"
[[2]]
[1] "d" "a"
[[3]]
[1] "c" "b" "a"
[[4]]
[1] "b" "c" "b" "c"
[[5]]
[1] "a" "a" "d" "c" "d"
[[6]]
[1] "d" "b" "d" "d" "d" "a"
[[7]]
[1] "d" "b" "c" "c" "d" "c" "a"
[[8]]
[1] "d" "a" "d" "b" "d" "a" "b" "d"
[[9]]
[1] "a" "b" "b" "b" "c" "c" "a" "c" "d"
[[10]]
[1] "d" "d" "d" "a" "d" "d" "c" "c" "a" "c"
[[11]]
[1] "L" "K"
Expected outcome:
list("c", "d", "c", "c", "d", "d", "d", "d", "d", "d", c("L", "K"))
I believe that this should work for what you are looking for.
# get counts for entire list and order them
myRanks <- sort(table(unlist(out)), decreasing=TRUE)
This produces
myRanks
b c d a
10 9 5 4
# calculate if most popular, then second most popular, ... item shows up for each list item
sapply(out, function(i) names(myRanks)[min(match(i, names(myRanks)))])
[1] "b" "b" "b" "c" "b" "b" "b"
Here, sapply runs through each list item and returns a vector. It applies a function that selects the name of the first element (via min) of the myRanks table that appears in the list element, using match.
In the case of multiple elements having the same count (duplicates) in the myRanks table, the following code should return a list of the top observations per list item:
sapply(out,
function(i) {
intersect(names(myRanks)[myRanks == max(unique(myRanks[match(i, names(myRanks))]))],
i)})
Here, the names of myRanks that have the same value as the value in the list item with the highest value in myRanks are intersected with the names present in the list item in order to only return values in both sets.
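To see the match/min step in isolation, here is a small sketch on a hand-made list chosen so that no two letters share the same count (an assumption made to keep the ranking unambiguous; the data are not from the question):

```r
# Toy input: overall counts are b = 4, c = 3, d = 2, a = 1
out <- list("b", c("d", "c"), c("b", "a", "b"), c("d", "b", "c", "c"))

# Overall ranking, most frequent first
myRanks <- sort(table(unlist(out)), decreasing = TRUE)

# For each vector, the smallest rank position among its elements
# identifies its globally most frequent element.
top <- vapply(out,
              function(i) names(myRanks)[min(match(i, names(myRanks)))],
              character(1))
top
```

Using vapply instead of sapply here is a design choice: it fails loudly if any item does not reduce to exactly one string.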
This should work:
set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = T)})
out
#[[1]]
#[1] "b"
#[[2]]
#[1] "c" "b"
#[[3]]
#[1] "b" "b" "b"
#[[4]]
#[1] "d" "c" "c" "d"
#[[5]]
#[1] "d" "b" "a" "a" "c"
#[[6]]
#[1] "a" "b" "c" "b" "c" "c"
#[[7]]
#[1] "a" "c" "d" "b" "d" "c" "b"
tbl <- table(unlist(out))[order(table(unlist(out)), decreasing = T)]
sapply(out, function(x) intersect(names(tbl), x)[1])
# [1] "b" "b" "b" "c" "b" "b" "b"
[EDIT]
set.seed(12731)
out <- lapply(1:sample.int(10, 1), function(x){sample(letters[1:4], x, replace = T)})
out[[length(out)+1]] <- c("L", "K")
out
#[[1]]
#[1] "b"
#[[2]]
#[1] "c" "b"
#[[3]]
#[1] "b" "b" "b"
#[[4]]
#[1] "d" "c" "c" "d"
#[[5]]
#[1] "d" "b" "a" "a" "c"
#[[6]]
#[1] "a" "b" "c" "b" "c" "c"
#[[7]]
#[1] "a" "c" "d" "b" "d" "c" "b"
#[[8]]
#[1] "L" "K"
tbl <- table(unlist(out))[order(table(unlist(out)), decreasing = T)]
#tbl
#b c d a K L
#10 9 5 4 1 1
lapply(out, function(x) names(tbl[tbl==max(tbl[names(tbl) %in% intersect(names(tbl), x)])]))
#[[1]]
#[1] "b"
#[[2]]
#[1] "b"
#[[3]]
#[1] "b"
#[[4]]
#[1] "c"
#[[5]]
#[1] "b"
#[[6]]
#[1] "b"
#[[7]]
#[1] "b"
#[[8]]
#[1] "K" "L"

R : Extract elements of same index & depth level of a list

Here is a list :
# Build a toy list
x1=letters[1:3]
x2=letters[4:5]
x3=letters[1:8]
toy_list=list(list(list("ABX",x1),
list("ZHK",x2)),
list(list("CCC",x3)))
[[1]]
[[1]][[1]]
[[1]][[1]][[1]]
[1] "ABX"
[[1]][[1]][[2]]
[1] "a" "b" "c"
[[1]][[2]]
[[1]][[2]][[1]]
[1] "ZHK"
[[1]][[2]][[2]]
[1] "d" "e"
[[2]]
[[2]][[1]]
[[2]][[1]][[1]]
[1] "CCC"
[[2]][[1]][[2]]
[1] "a" "b" "c" "d" "e" "f" "g" "h"
Let's suppose I want to extract all elements that are, for example, in 2nd position at a depth level of 3. In other words, I want to extract the elements at index [[1]][[1]][[2]], [[1]][[2]][[2]], [[2]][[1]][[2]], which means I want my output to be
[[1]]
[1] "a" "b" "c"
[[2]]
[1] "d" "e"
[[3]]
[1] "a" "b" "c" "d" "e" "f" "g" "h"
How would you do that in a generalized way?
With purrr, you can use at_depth(2, ...), where 2 indicates the depth level and ... is an extractor (name/integer) or function; later purrr releases renamed at_depth() to modify_depth(). Then simplify the structure with flatten():
library(purrr)
toy_list %>% at_depth(2, 2) %>% flatten()
## [[1]]
## [1] "a" "b" "c"
##
## [[2]]
## [1] "d" "e"
##
## [[3]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
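The same extraction can be sketched in base R, without purrr, by flattening one level at a time and then indexing. The helper extract_at() below is hypothetical (not part of any package) and assumes every branch of the list reaches the requested depth and has an element at the requested position:

```r
x1 <- letters[1:3]
x2 <- letters[4:5]
x3 <- letters[1:8]
toy_list <- list(list(list("ABX", x1),
                      list("ZHK", x2)),
                 list(list("CCC", x3)))

# Hypothetical helper: take element `idx` of every list found at `depth`.
extract_at <- function(lst, depth, idx) {
  # remove one nesting level per iteration until the parents of the
  # target elements are the top-level items
  for (k in seq_len(depth - 2)) lst <- unlist(lst, recursive = FALSE)
  lapply(lst, `[[`, idx)
}

res <- extract_at(toy_list, depth = 3, idx = 2)
res
```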

pairwise analysis in R

I have a large data frame in which, for each pair of individuals (rows), I have to find the columns where both rows hold the same value.
Here is an example of the dataframe:
>data
ID pos1234 pos1345 pos1456 pos1678
1 1 C A C G
2 2 C G A G
3 3 C A G A
4 4 C G C T
I transformed the dataframe into a pairwise matrix with:
apply(data, 2, combn, m=2)
ID pos1234 pos1345 pos1456 pos1678
[1,] "1" "C" "A" "C" "G"
[2,] "2" "C" "G" "A" "G"
[3,] "1" "C" "A" "C" "G"
[4,] "3" "C" "A" "G" "A"
[5,] "1" "C" "A" "C" "G"
[6,] "4" "C" "G" "C" "T"
[7,] "2" "C" "G" "A" "G"
[8,] "3" "C" "A" "G" "A"
[9,] "2" "C" "G" "A" "G"
[10,] "4" "C" "G" "C" "T"
[11,] "3" "C" "A" "G" "A"
[12,] "4" "C" "G" "C" "T"
I am now having trouble identifying the columns that contain identical letters within each pair. For example, for the pair of individuals 1 and 2, the columns containing identical letters would be pos1234 and pos1678.
Would it be possible to get a dataframe with just the identical letters for each pair of individuals?
Thanks in advance.
You can pass a function to combn:
res0 <- combn(nrow(data), 2, FUN = function(x)
names(data[-1])[ lengths(sapply(data[x,-1], unique)) == 1 ], simplify=FALSE)
which gives
[[1]]
[1] "pos1234" "pos1678"
[[2]]
[1] "pos1234" "pos1345"
[[3]]
[1] "pos1234" "pos1456"
[[4]]
[1] "pos1234"
[[5]]
[1] "pos1234" "pos1345"
[[6]]
[1] "pos1234"
To figure out which of these [[1]]..[[6]] correspond to which pairs, take combn again:
res <- setNames(res0, combn(data$ID, 2, paste, collapse="."))
which gives
$`1.2`
[1] "pos1234" "pos1678"
$`1.3`
[1] "pos1234" "pos1345"
$`1.4`
[1] "pos1234" "pos1456"
$`2.3`
[1] "pos1234"
$`2.4`
[1] "pos1234" "pos1345"
$`3.4`
[1] "pos1234"
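If a rectangular result is preferred over a list, one possible variant is a logical matrix with one row per pair and one column per position. This is a sketch that rebuilds the example data by hand; the names pairs and same are illustrative assumptions:

```r
data <- data.frame(
  ID      = 1:4,
  pos1234 = c("C", "C", "C", "C"),
  pos1345 = c("A", "G", "A", "G"),
  pos1456 = c("C", "A", "G", "C"),
  pos1678 = c("G", "G", "A", "T"),
  stringsAsFactors = FALSE
)

# Each column of `pairs` holds the row indices of one pair
pairs <- combn(nrow(data), 2)

# TRUE where both individuals of the pair share the same letter
same <- t(apply(pairs, 2, function(p) {
  unlist(data[p[1], -1]) == unlist(data[p[2], -1])
}))
rownames(same) <- apply(pairs, 2, function(p) paste(data$ID[p], collapse = "."))
same
```

Wrapping the result in as.data.frame(same) would give a data frame, and names(which(same["1.2", ])) recovers the matching columns for a single pair.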
