Find disjoint groups (Connected Components) in a set of vectors [duplicate]

Given a list:
foo <- list(c("a", "b", "d"), c("c", "b"), c("c"),
            c("b", "d"), c("e", "f"), c("e", "g"))
what is an efficient way to get a list that contains the disjoint sets of its content?
Here I want to obtain:
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
The solutions I have managed to come up with seemed overly complicated and slow (I'm working with a largish list of 4,000+ elements, each containing up to hundreds of values).
Thanks!
Benchmarking the solutions
Thank you all for your input. The igraph approach is really nice. I did some benchmarking of the proposed solutions, and using igraph with @flodel's suggestion (build chained edges with head()/tail() rather than all pairwise combinations from combn(), so the edge list grows linearly with element length) is by far the most efficient. The example here (iGrp) has 3170 elements.
> microbenchmark(igraph_method(iGrp), igraph_method2(iGrp), iterative_method(iGrp), times=10L)
## Unit: milliseconds
##                    expr       min        lq    median        uq       max neval
##     igraph_method(iGrp) 6892.8534 7140.0287 7229.5569 7396.2458 8044.9796    10
##    igraph_method2(iGrp)  381.4555  391.2097  442.3282  472.5641  537.4885    10
##  iterative_method(iGrp) 7118.7857 7272.9568 7595.9700 7675.2888 8485.4388    10
#### functions used
library(igraph)
igraph_method <- function(lst) {
  edg <- do.call("rbind", lapply(lst, function(x) {
    if (length(x) > 1) t(combn(x, 2)) else NULL
  }))
  g <- graph.data.frame(edg)
  split(V(g)$name, clusters(g)$membership)
}
igraph_method2 <- function(lst) {
  edg <- do.call("rbind", lapply(lst, function(x) {
    if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
  }))
  g <- graph.data.frame(edg)
  split(V(g)$name, clusters(g)$membership)
}
iterative_method <- function(lst) {
  Reduce(function(l, x) {
    matches <- sapply(l, function(i) any(x %in% i))
    if (any(matches)) {
      combined <- unique(c(unlist(l[matches]), x))
      l[matches] <- NULL          # Delete old entries
      l <- c(l, list(combined))   # Add combined entry
    } else {
      l <- c(l, list(x))          # New list entry
    }
    l
  }, lst, init=list())
}

One way to approach this sort of problem is to build a graph where nodes are the values in your list and edges record whether those values have appeared together. Then you're just asking for the connected components of that graph. The igraph package in R makes this pretty easy. First, you'll want to build a two-column edge matrix:
edges <- do.call(rbind, lapply(foo, function(x) {
  if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
}))
edges
# [,1] [,2]
# [1,] "a" "b"
# [2,] "b" "d"
# [3,] "c" "b"
# [4,] "b" "d"
# [5,] "e" "f"
# [6,] "e" "g"
Then, you can build your graph from the edges and compute the connected components:
library(igraph)
g <- graph.data.frame(edges, directed=FALSE)
split(V(g)$name, clusters(g)$membership)
# $`1`
# [1] "a" "b" "c" "d"
#
# $`2`
# [1] "e" "f" "g"
For reasonably large problems, this approach seems to be modestly faster than an iterative approach:
values = as.character(1:2000)
set.seed(144)
foo <- lapply(1:4000, function(x) sample(values, rbinom(1, 10, .5)))
library(microbenchmark)
microbenchmark(josilber(foo), lundberg(foo))
# Unit: milliseconds
#           expr      min       lq   median       uq       max neval
#  josilber(foo) 251.8007 281.0168 297.2446 314.6714  635.7916   100
#  lundberg(foo) 640.0575 714.9658 761.3777 827.5415 1118.3517   100
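The josilber() and lundberg() wrappers themselves are not shown in the post; presumably they simply wrap the igraph approach above and the Reduce()-based approach in the next answer. A hedged reconstruction of what josilber() might look like (an assumption, not the author's actual code):
# Plausible reconstruction of the benchmarked igraph wrapper (assumption only)
josilber <- function(lst) {
  edges <- do.call(rbind, lapply(lst, function(x) {
    if (length(x) > 1) cbind(head(x, -1), tail(x, -1)) else NULL
  }))
  g <- graph.data.frame(edges, directed = FALSE)
  split(V(g)$name, clusters(g)$membership)
}
# lundberg() would then be the Reduce()-based iterative approach shown below.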

Here is an iterative approach, building a list for the result, and combining elements as they are seen together:
Reduce(function(l, x) {
  matches <- sapply(l, function(i) any(x %in% i))
  if (any(matches)) {
    combined <- unique(c(unlist(l[matches]), x))
    l[matches] <- NULL          # Delete old entries
    l <- c(l, list(combined))   # Add combined entry
  } else {
    l <- c(l, list(x))          # New list entry
  }
  l
}, foo, init=list())
## [[1]]
## [1] "a" "b" "d" "c"
##
## [[2]]
## [1] "e" "f" "g"

Related

Understanding the behaviour of map vs sapply: viewing column names containing a particular value

My dataset codes "Not available" differently depending on the variable (-99, -100, NA). It has hundreds of variables, so the first step was to look up which columns are affected, in order to recode them appropriately.
EDIT: thanks to @joran and @G. Grothendieck, I got answers pretty quickly. Just to provide a TL;DR: the option with colSums is probably best: fast, succinct and flexible (although its arguments are not so easy to put into a variable?)
f1 <- function() {colnames(tbl_df[map_lgl(tbl_df, ~any(. == -100, na.rm = TRUE))])}
f2 <- function() {names(tbl_df)[colSums(tbl_df == -100) > 0]}
f3 <- function() {colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -100, na.rm = TRUE))])}
microbenchmark(f1(), f2(), f3(), unit = "relative")
#> Unit: relative
#>  expr      min       lq     mean   median       uq       max neval
#>  f1() 2.924239 2.694531 2.026845 2.578680 2.604190 0.8291649   100
#>  f2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
#>  f3() 1.113641 1.140000 1.053742 1.167211 1.178409 0.8241631   100
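Regarding the parenthetical above, one way to put the colSums() arguments behind a variable is to wrap the idiom in a small helper. A minimal sketch (the name cols_with() is my own, not from the thread):
cols_with <- function(d, value) {
  # NA needs is.na(), since x == NA is always NA rather than TRUE
  hits <- if (is.na(value)) is.na(d) else d == value
  names(d)[colSums(hits, na.rm = TRUE) > 0]
}
cols_with(tbl_df, -100)
#> [1] "c"
cols_with(tbl_df, NA)
#> [1] "a"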
Original post continues here
I've tried to generalise the sapply answer here, and after some trial and error have succeeded with purrr::map... But I don't understand why some of the things I tried do not work, in particular, sapply seems erratic.
Here's a reprex:
library(tidyverse)
set.seed(124)
df <- data.frame(a = c(sample(1:49, 49), -99, NA),
                 b = c(sample(1:50, 50), -99),
                 c = c(sample(1:50, 50), -100),
                 d = sample(1:51, 51),
                 e = sample(1:51, 51))
# First puzzle: answer in other thread doesn't work with data.frame
colnames(df[,sapply(df, function(x) any(is.na(x)))])
#> NULL
# but works with a tibble
tbl_df <- as.tibble(df)
colnames(tbl_df[,sapply(tbl_df, function(x) any(is.na(x)))])
#> [1] "a"
# However, this doesn't work for any other missing value coding
# (Edit: it seems to work if there's more than one column??)
colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -99))])
#> [1] "a" "b"
colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -100))])
#> Error in tbl_df[, sapply(tbl_df, function(x) any(x == -100))]:
#> object of type 'closure' is not subsettable
#(NB: I get "Error: NA column indexes not supported" on my console)
I can imagine this has something to do with the way sapply works but the documentation and answers like this one don't quite cut it for me...
I've come up with the following, which works quite fine for checking values both individually and in groups. I'd welcome any improvements (e.g. keeping the values alongside the columns where they're found).
colnames(tbl_df[unlist(map(tbl_df, ~any(. %in% c(-99, -100, NA))))])
#> [1] "a" "b" "c"
On a side note, I don't really understand why trying to achieve a similar thing in the pipe yielded the wrong result:
tbl_df %>%
  filter_all(all_vars(. == -99)) %>%
  colnames()
#> [1] "a" "b" "c" "d" "e"
Sorry if this seems like a motley collection of questions; but I'd appreciate any clarification!
1) drop=FALSE: Subscripting a data.frame will drop the dimensions for 1d results unless drop = FALSE is used, so try this. (Tibble subscripting does not drop dimensions.)
# colnames(df[,sapply(df, function(x) any(is.na(x)))])
colnames(df[, sapply(df, function(x) any(is.na(x))), drop = FALSE])
## [1] "a"
or easier:
names(df)[apply(is.na(df), 2, any)]
## [1] "a"
or
names(df)[colSums(is.na(df)) > 0]
## [1] "a"
2) na.rm=TRUE: In the next example there is an NA in the first column. If we exclude it we get an answer:
# colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -100))])
colnames(tbl_df[, sapply(tbl_df, function(x) any(x == -100, na.rm = TRUE))])
## [1] "c"
or
names(tbl_df)[colSums(tbl_df == -100, na.rm = TRUE) > 0]
## [1] "c"
or use which
names(tbl_df[, sapply(tbl_df, function(x) length(which(x == -100)) > 0)])
## [1] "c"
or
names(tbl_df)[lengths(lapply(as.data.frame(tbl_df == -100), which)) > 0]
## [1] "c"
or using which(..., arr.ind = TRUE)
names(tbl_df)[ unique(which(tbl_df == -100, arr.ind = TRUE)[, "col"]) ]
## [1] "c"
3) simplification: We can simplify this by factoring out the generic part, which does not depend on the data, into is.bad:
# colnames(tbl_df[unlist(map(tbl_df, ~any(. %in% c(-99, -100, NA))))])
is.bad <- function(x) any(x %in% c(-99, -100, NA))
names(tbl_df)[ sapply(tbl_df, is.bad) ]
## [1] "a" "b" "c"
or
Filter(function(x) is.bad(tbl_df[[x]]), names(tbl_df))
## [1] "a" "b" "c"
or for a different approach:
names(tbl_df)[colSums(is.na(tbl_df) | tbl_df == -99 | tbl_df == -100) > 0]
## [1] "a" "b" "c"
4) select_if: filter_all with all_vars goes row by row and picks out those rows for which all the columns satisfy the condition. You want to go column by column, not row by row, so use select_if instead:
tbl_df %>%
  select_if(~ any(. == -99)) %>%
  names
## [1] "a" "b"
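The original post also wished for a way to keep the offending values alongside the column names where they are found. A rough sketch of one option (my own addition, not from the answers; assumes the bad codes are -99, -100 and NA):
bad <- c(-99, -100, NA)
found <- lapply(tbl_df, function(col) intersect(bad, col))  # bad values actually present per column
Filter(length, found)                                       # drop columns with no hits
#> $a
#> [1] -99  NA
#>
#> $b
#> [1] -99
#>
#> $c
#> [1] -100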

How to rename elements inside a list of lists in an efficient manner in R?

I currently have a large list with around 5000 elements. A reproducible example:
List.5000 <- replicate(5000, c(list(A='A',value.A=10),list(B='B',value.B=20)), simplify = F)
which has:
> List.5000
[[1]]
[[1]]$A
[1] "A"
[[1]]$value.A
[1] 10
[[1]]$B
[1] "B"
[[1]]$value.B
[1] 20
[[2]]
[[2]]$A
[1] "A"
[[2]]$value.A
[1] 10
[[2]]$B
[1] "B"
[[2]]$value.B
[1] 20
....
When I call names(List.5000) it returns NULL. But when I call names(List.5000[[1]]), it gives:
"A" "value.A" "B" "value.B"
I would like to change the name "B" to "Z". Is there a way to do this without resorting to creating a new list, then looping and reconstructing?
for(i in seq_along(List.5000))
  names(List.5000[[i]])[names(List.5000[[i]]) == 'B'] <- 'Z'
Or, if you prefer lapply/map:
List.5000 <- lapply(List.5000,
                    function(x) {names(x)[names(x) == 'B'] <- 'Z'; x})
library(purrr)
List.5000 <- map(List.5000, ~{names(.)[names(.) == 'B'] <- 'Z'; .})
If the names really are all the same you could just do
ind <- names(List.5000[[1]]) == 'B'
for(i in seq_along(List.5000))
  names(List.5000[[i]])[ind] <- 'Z'
The bracket syntax is a little faster:
x <- List.5000[[1]]
microbenchmark(
    sub     = names(x) <- sub("^B$", "Z", names(x))
  , ifelse  = names(x) <- ifelse(names(x) == 'B', 'Z', names(x))
  , stringi = names(x) <- str_replace(names(x), "^B$", "Z")
  , replace = names(x) <- replace(names(x), names(x) == 'B', 'Z')
  , bracket = names(x)[names(x) == 'B'] <- 'Z'
)
# Unit: microseconds
# expr min lq mean median uq max neval
# sub 22.041 31.2265 58.24097 46.9650 78.5075 373.637 100
# ifelse 13.309 22.4110 44.00665 30.2235 65.1395 113.693 100
# stringi 153.880 313.0400 346.41543 358.4795 383.4130 631.354 100
# replace 4.067 6.3205 13.09022 8.1760 11.9280 54.075 100
# bracket 3.246 4.5265 10.38177 5.9180 7.9925 55.278 100
Still a little unsatisfying, as none of these methods modify the list in place.
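If true modification by reference matters, one possibility is data.table::setattr(), which sets attributes without the usual copy. This is only a hedged sketch, not something from the answers, and the usual caveats about modifying objects by reference apply:
library(data.table)
for (i in seq_along(List.5000)) {
  nm <- names(List.5000[[i]])
  nm[nm == 'B'] <- 'Z'
  setattr(List.5000[[i]], "names", nm)  # set the names attribute by reference
}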
You can use purrr::map() and sub():
library(purrr)
map(List.5000, ~sub("^B$", "Z", names(.x)))
Or use stringr::str_replace to keep to tidyverse syntax:
library(stringr)
map(List.5000, ~str_replace(names(.x), "^B$", "Z"))
You don't even really need purrr; you can just use lapply, though I find purrr's syntax clearer:
lapply(List.5000, function(x) sub("^B$", "Z", names(x)))
Update
Per Ryan's comment, if you actually want to change the list names themselves, but not the values, you can use this instead:
change_names <- function(x) {
  names(x) <- sub("^B$", "Z", names(x))
  x
}
map(List.5000, ~change_names(.x))
[[1]]
[[1]]$A
[1] "A"
[[1]]$value.A
[1] 10
[[1]]$Z
[1] "B"
[[1]]$value.B
[1] 20
...
We could use plyr::rename
out <- lapply(List.5000, plyr::rename, c("B" = "Z"))

How to extract elements and their indices from a list in R

I would like to extract list elements and their indices in R while removing items with 0 length. Let's say I have the following list in R:
l1 <- character(0)
l2 <- c("a","b")
l3 <- c("c","d","e")
list1 <- list(l1, l1, l2, l1, l3)
Then list1 returns the following:
[[1]]
character(0)
[[2]]
character(0)
[[3]]
[1] "a" "b"
[[4]]
character(0)
[[5]]
[1] "c" "d" "e"
I would like to somehow extract an object that displays the index/position for each non-empty element, as well as the contents of that element. So something that looks like this:
[[3]]
[1] "a" "b"
[[5]]
[1] "c" "d" "e"
The closest I've come to doing this is by removing the empty elements, but then I lose the original index/position of the remaining elements:
list2 <- list1[lapply(list1, length) > 0]
list2
[[1]]
[1] "a" "b"
[[2]]
[1] "c" "d" "e"
keep() will keep elements matching a predicate; negate(is_empty) creates a function that returns TRUE if a vector is not empty.
library("purrr")
names(list1) <- seq_along(list1)
keep(list1, negate(is_empty))
#> $`3`
#> [1] "a" "b"
#>
#> $`5`
#> [1] "c" "d" "e"
Overview
Keeping the indices required me to name each element in the list. This answer uses which() to set the condition that I apply to list1 to keep non-zero length elements.
# load data
l1 <- character(0)
l2 <- c("a","b")
l3 <- c("c","d","e")
list1 <- list( l1, l1, l2, l1, l3)
# name each element in the list
names( list1 ) <- as.character( 1:length( list1 ) )
# create a condition that
# keeps only non zero length elements
# from list1
non.zero.length.elements <-
which( lapply( X = list1, FUN = length ) != 0 )
# apply the condition to list1
# to view the non zero length elements
list1[ non.zero.length.elements ]
# $`3`
# [1] "a" "b"
#
# $`5`
# [1] "c" "d" "e"
# end of script #
I'm not sure exactly what 'extract an object that displays' means, but if you just want to print, you can use a modified print method.
I just slightly edited print.listof (note that it is not recursive: zero-length sub-elements will still be displayed):
print2 <- function (x, ...)
{
  nn <- names(x)
  ll <- length(x)
  if (length(nn) != ll)
    nn <- paste0("[[", seq.int(ll), "]]")
  for (i in seq_len(ll)[lengths(x) > 0]) {
    cat(nn[i], "\n")
    print(x[[i]], ...)
    cat("\n")
  }
  invisible(x)
}
print2(list1)
[[3]]
[1] "a" "b"
[[5]]
[1] "c" "d" "e"
A very simple solution is to provide names to the elements of your list and then run your function again. There are several ways to name your elements.
l1 <- character(0)
l2 <- c("a","b")
l3 <- c("c","d","e")
list1 <- list(e1=l1, e2=l1, e3=l2, e4=l1, e5=l3)
list1
names(list1)<-paste0("element",seq(length(list1)))
list1[lapply(list1, length) > 0]

Union of intersecting vectors in a list in R

I have a list of vectors as follows.
data <- list(v1=c("a", "b", "c"), v2=c("g", "h", "k"),
             v3=c("c", "d"), v4=c("n", "a"), v5=c("h", "i"))
I am trying to achieve the following:
Check whether any of the vectors intersect with each other
If intersecting vectors are found, get their union
So the desired output is
out <- list(v1=c("a", "b", "c", "d", "n"), v2=c("g", "h", "k", "i"))
I can get the union of a group of intersecting sets as follows.
Reduce(union, list(data[[1]], data[[3]], data[[4]]))
Reduce(union, list(data[[2]], data[[5]]))
How to first identify the intersecting vectors? Is there a way of dividing the list into lists of groups of intersecting vectors?
Update
Here is an attempt using data.table. It gets the desired results, but is still slow for large lists such as this example dataset.
library(data.table)
data <- sapply(data, function(x) paste(x, collapse=", "))
data <- as.data.frame(data, stringsAsFactors = F)
repeat {
  M <- nrow(data)
  data <- data.table(data, key = "data")
  data <- data[, list(dataelement = unique(unlist(strsplit(data, ", ")))), by = list(data)]
  data <- data.table(data, key = "dataelement")
  data <- data[, list(data = paste0(sort(unique(unlist(strsplit(data, split=", ")))), collapse=", ")), by = "dataelement"]
  data$dataelement <- NULL
  data <- unique(data)
  N <- nrow(data)
  if (M == N)
    break
}
data <- strsplit(as.character(data$data), ", ")
This is kind of like a graph problem, so I like to use the igraph library for it. Using your sample data, you can do
library(igraph)
#build edgelist
el <- do.call("rbind",lapply(data, embed, 2))
#make a graph
gg <- graph.edgelist(el, directed=F)
#partition the graph into disjoint sets
split(V(gg)$name, clusters(gg)$membership)
# $`1`
# [1] "b" "a" "c" "d" "n"
#
# $`2`
# [1] "h" "g" "k" "i"
And we can view the results with
V(gg)$color=c("green","purple")[clusters(gg)$membership]
plot(gg)
Here's another approach using only base R
Update
Next update after akrun's comment and with his sample data:
data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))
Modified function:
x <- lapply(seq_along(data), function(i) {
  if(!any(data[[i]] %in% unlist(data[-i]))) {
    data[[i]]
  } else if (any(data[[i]] %in% unlist(data[seq_len(i-1)]))) {
    NULL
  } else {
    z <- lapply(data[-seq_len(i)], intersect, data[[i]])
    z <- names(z[sapply(z, length) >= 1L])
    if (is.null(z)) NULL else union(data[[i]], unlist(data[z]))
  }
})
x[!sapply(x, is.null)]
#[[1]]
#[1] "g" "k"
#
#[[2]]
#[1] "a" "b" "c" "d"
This works well with the original sample data, MrFlick's sample data and akrun's sample data.
Efficiency be damned and do you people even sleep? Base R only and much slower than the fastest answer. Since I wrote it, might as well post it.
f.union = function(x) {
  repeat {
    n = length(x)
    m = matrix(F, nrow = n, ncol = n)
    for (i in 1:n) {
      for (j in 1:n) {
        m[i,j] = any(x[[i]] %in% x[[j]])
      }
    }
    o = apply(m, 2, function(v) Reduce(union, x[v]))
    if (all(apply(m, 1, sum) == 1)) {return(o)} else {x = unique(o)}
  }
}
f.union(data)
[[1]]
[1] "a" "b" "c" "d" "n"
[[2]]
[1] "g" "h" "k" "i"
Because I like being slow. (loaded library outside of benchmark)
Unit: microseconds
expr min lq mean median uq max neval
vlo() 896.435 1070.6540 1315.8194 1129.4710 1328.6630 7859.999 1000
akrun() 596.263 658.6590 789.9889 694.1360 804.9035 3470.158 1000
flick() 805.854 928.8160 1160.9509 1001.8345 1172.0965 5780.824 1000
josh() 2427.752 2693.0065 3344.8671 2943.7860 3524.1550 16505.909 1000 <- deleted :-(
doc() 254.462 288.9875 354.6084 302.6415 338.9565 2734.795 1000
One option would be to use combn and then find the intersections. There may be easier options.
indx <- combn(names(data), 2)
lst <- lapply(split(indx, col(indx)),
              function(i) Reduce(`intersect`, data[i]))
indx1 <- names(lst[sapply(lst, length) > 0])
indx2 <- indx[, as.numeric(indx1)]
indx3 <- apply(indx2, 2, sort)
lapply(split(1:ncol(indx3), indx3[1,]),
       function(i) unique(unlist(data[c(indx3[,i])], use.names=FALSE)))
#$v1
#[1] "a" "b" "c" "d" "n"
#$v2
#[1] "g" "h" "k" "i"
Update
You could use combnPrim from library(gRbase) to make this even faster. Using a slightly bigger dataset
library(gRbase)
set.seed(25)
data <- setNames(lapply(1:1e3, function(i) sample(letters,
                        sample(1:20), replace=FALSE)), paste0("v", 1:1000))
and comparing with the fastest. These are modified functions based on the OP's comments to @docendo discimus.
akrun2M <- function() {
  ind <- sapply(seq_along(data), function(i) {  # copied from @docendo discimus
    !any(data[[i]] %in% unlist(data[-i]))
  })
  data1 <- data[!ind]
  indx <- combnPrim(names(data1), 2)
  lst <- lapply(split(indx, col(indx)),
                function(i) Reduce(`intersect`, data1[i]))
  indx1 <- names(lst[sapply(lst, length) > 0])
  indx2 <- indx[, as.numeric(indx1)]
  indx3 <- apply(indx2, 2, sort)
  c(data[ind], lapply(split(1:ncol(indx3), indx3[1,]),
                      function(i) unique(unlist(data[c(indx3[,i])], use.names=FALSE))))
}
doc2 <- function() {
  x <- lapply(seq_along(data), function(i) {
    if (!any(data[[i]] %in% unlist(data[-i]))) {
      data[[i]]
    } else {
      z <- unlist(data[names(unlist(lapply(data[-c(1:i)],
                                           intersect, data[[i]])))])
      if (is.null(z)) {
        z
      } else {
        union(data[[i]], z)
      }
    }
  })
  x[!sapply(x, is.null)]
}
Benchmarks
microbenchmark(doc2(), akrun2M(), times=10L)
# Unit: seconds
#      expr      min       lq     mean   median       uq      max neval cld
#    doc2() 35.43687 53.76418 54.77813 54.34668 62.86665 67.76754    10   b
# akrun2M() 26.64997 28.74721 38.02259 35.35081 47.56781 49.82158    10   a
I came across a similar problem that prompted me to look everywhere for a solution. I finally found a very good one thanks to a number of great contributors here; however, seeing this post, I thought I would write my own custom function for the purpose. It's not particularly elegant and is rather slow, but I think it's effective and can do the trick until I make some improvements:
anoush <- function(x) {
  # First we check whether x is a list
  stopifnot(is.list(x))
  # Then we take every element of the input and calculate the intersect between
  # that element & the others. In case there were some, we store the indices
  # in the `vec` vector. So in the end we have a list called `ind` whose elements
  # are all the indices connected with the corresponding elements of the original
  # list; for example, the first element of `ind` is `1`, `2`, `3`, which means in
  # the original list these elements have common values.
  ind <- lapply(1:length(x), function(a) {
    vec <- c()
    for(i in 1:length(x)) {
      if(length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
        vec <- c(vec, i)
      }
    }
    vec
  })
  # Then we go on to again compare each element of `ind` with the other elements;
  # in case there is any intersect, we calculate the `union` of them.
  # For each element we will end up with a list of accumulated values, but
  # in the end we use `Reduce` to capture only the last one. So for each
  # element of `ind` we end up having a collection of indices that also
  # result in duplicated values. For example, elements `1` through `5` of
  # `dup_ind` contain the same value because in the original list these
  # elements have common values.
  dup_ind <- lapply(1:length(ind), function(a) {
    out <- c()
    for(i in 1:length(ind)) {
      if(length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
        out[[i]] <- union(ind[[a]], ind[[i]])
      }
      vec2 <- Reduce("union", out)
    }
    vec2
  })
  # Here we get rid of the duplicated elements of the list by means of the
  # `relist` function, and since in this process all the duplicated elements
  # turn to `integer(0)` I have filtered those out.
  un <- unlist(dup_ind)
  res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
  res2 <- Filter(length, res)
  sapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))
}
OP's Data Sample
> anoush(data)
[[1]]
[1] "a" "b" "c" "d" "n"
[[2]]
[1] "g" "h" "k" "i"
Dear @akrun's Data Sample
data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))
> anoush(data)
[[1]]
[1] "g" "k"
[[2]]
[1] "a" "b" "c" "d"
In general, you cannot do much better/faster than the Floyd-Warshall algorithm, which is as follows:
library(Rcpp)
cppFunction(
  "LogicalMatrix floyd(LogicalMatrix w){
     int n = w.nrow();
     for( int k = 0; k < n; k++ )
       for( int i = 0; i < (n-1); i++ )
         for( int j = i+1; j < n; j++ )
           if( w(i,k) && w(k,j) ) {
             w(i,j) = true;
             w(j,i) = true;
           }
     return w;
   }")
fw.union <- function(x) {
  n <- length(x)
  w <- matrix(F, nrow = n, ncol = n)
  for( i in 1:n ) {
    w[i,i] <- T
  }
  for( i in 1:(n-1) ) {
    for( j in (i+1):n ) {
      w[i,j] <- w[j,i] <- any(x[[i]] %in% x[[j]])
    }
  }
  apply( unique( floyd(w) ), 1, function(y) { Reduce(union, x[y]) } )
}
Running benchmarks would be interesting, though. Preliminary tests suggest that my implementation is about 2-3 times faster than Vlo's.
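A minimal sketch of how such a comparison could be run (assuming the f.union() definition from the earlier base-R answer and the sample data are in scope):
library(microbenchmark)
microbenchmark(fw.union(data), f.union(data), times = 100L)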

Generate all combinations given a constraint

How can I generate all of the 6 combinations of 2 treatments (A,B) in blocks of 4, such that in each block there is an equal number of A's and B's, using R?
"AABB","ABAB","ABBA","BBAA","BABA","BAAB"
P.S. The number of combinations is calculated as follows. If
T = number of treatments,
n = number of treatment slots in each block = k*T,
then the number of combinations equals n! / (k! * k! * ... * k!) with T factors of k!, i.e. n!/(k!)^T.
For the example above, T = 2, k = 2, n = 4, giving 4!/(2!*2!) = 6.
Thank you
Something like this should work:
library(gtools)
t <- c('A','B')
k <- 2
n <- k * length(t)
t2 <- rep(t, k)
m <- permutations(n,n)
res <- unique(apply(m,MARGIN=1,function(x) paste(t2[x],collapse='')))
--------------------------------------------------------------------
res
[1] "ABAB" "ABBA" "AABB" "BAAB" "BABA" "BBAA"
The multicool package implements an algorithm for permuting multisets --- exactly the task you want to have performed. Here's an example of what it can do:
library(multicool)
# Create a simple convenience function
enumAllPartitions <- function(multiset) {
  m1 <- initMC(multiset)            # Initialize the permutation object
  N <- fact(length(multiset)) /     # Calculate number of permutations
       prod(fact(table(multiset)))
  sapply(seq_len(N), function(X) paste(nextPerm(m1), collapse=""))
}
# Try it out with a few different multisets
x <- c("A", "A", "B", "B")
y <- c("G", "L", "L", "L")
z <- c("X", "X", "Y", "Z", "Z")
lapply(list(x,y,z), enumAllPartitions)
[[1]]
[1] "BBAA" "ABBA" "BABA" "ABAB" "AABB" "BAAB"
[[2]]
[1] "LLLG" "GLLL" "LGLL" "LLGL"
[[3]]
[1] "ZZYXX" "XZZYX" "ZXZYX" "ZZXYX" "XZZXY" "ZXZXY" "XZXZY" "XXZZY" "ZXXZY"
[10] "ZZXXY" "YZZXX" "ZYZXX" "XZYZX" "ZXYZX" "YZXZX" "XYZZX" "YXZZX" "ZYXZX"
[19] "XZYXZ" "ZXYXZ" "XZXYZ" "XXZYZ" "ZXXYZ" "YZXXZ" "XYZXZ" "YXZXZ" "XYXZZ"
[28] "XXYZZ" "YXXZZ" "ZYXXZ"
The expected solution can also be achieved using the new iterpc package.
I <- iterpc(c(2, 2), labels=c("A", "B"), ordered=TRUE)
getall(I)
# [,1] [,2] [,3] [,4]
# [1,] "A" "A" "B" "B"
# [2,] "A" "B" "A" "B"
# [3,] "A" "B" "B" "A"
# [4,] "B" "A" "A" "B"
# [5,] "B" "A" "B" "A"
# [6,] "B" "B" "A" "A"
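For completeness, here is a base-R sketch of the same enumeration that avoids extra packages by choosing the positions of the A's directly (my own addition, not from the answers above):
n <- 4; k <- 2                 # block size and number of A's per block
pos <- combn(n, k)             # every way to place the A's
apply(pos, 2, function(p) {
  blk <- rep("B", n)
  blk[p] <- "A"
  paste(blk, collapse = "")
})
# [1] "AABB" "ABAB" "ABBA" "BAAB" "BABA" "BBAA"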
