Getting distinct combinations in R with repetition

I have a list of integers, say: (1,2,3,4,5)
I want to obtain all the possible lists of size 5, such that:
1. The lists can contain repeat elements, e.g. (1,1,1,2,2)
2. Ordering does not matter, e.g. (1,1,2,2,1) is the same as (1,1,1,2,2)
How do I obtain this whole list? I am actually looking for combinations of size 10 from a set of 10 integers.

Using the RcppAlgos solution recommended in this answer, we want to choose sets of 5 elements from your input, with repetition, where order doesn't matter (hence comboGeneral(); we would use permuteGeneral() if order mattered). Because it is coded in C++ under the hood, this is a very fast solution, and the profiling in the linked answer also found it to be memory efficient. Generating the sets for 10 multichoose 10 still took less than a second on my laptop.
library(RcppAlgos)
x <- 1:5
result <- comboGeneral(x, m = 5, repetition = TRUE)
dim(result)
# [1] 126   5
head(result)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    1    1    1
# [2,]    1    1    1    1    2
# [3,]    1    1    1    1    3
# [4,]    1    1    1    1    4
# [5,]    1    1    1    1    5
# [6,]    1    1    1    2    2
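For the question's actual size (10 multichoose 10), the same call scales directly. A minimal sketch (my addition), where the expected row count is the multiset coefficient choose(10 + 10 - 1, 10):
library(RcppAlgos)
result10 <- comboGeneral(1:10, m = 10, repetition = TRUE) # 10 multichoose 10
nrow(result10) == choose(10 + 10 - 1, 10)                 # TRUE: 92378 sets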

The link provided by Gregor seems to rely entirely on third-party packages to produce multisets, so I wanted to give you a base R solution. Note that the packages mentioned in that link will almost certainly be far more efficient for extremely large datasets.
We can use expand.grid() to first generate all possible arrangements with repetition of the elements in (1,2,3,4,5); at this stage, different orderings are still considered distinct. We then want to remove the "extra" rows that contain the same elements in a different order, which we can do with apply() and duplicated().
If you use the multiset calculator here, you'll find that the code below produces the correct number of combinations:
x <- 1:5
df <- expand.grid(x, x, x, x, x) # generates all 5^5 arrangements, allowing repetition
index <- !duplicated(t(apply(df, 1, sort))) # flag the first occurrence of each sorted row
df <- df[index, ] # keep only distinct combinations
# check the number of rows. It should be 126; one for each combination
nrow(df)
# Output
# [1] 126
# Quick look at part of the dataframe:
head(df)
  Var1 Var2 Var3 Var4 Var5
1    1    1    1    1    1
2    2    1    1    1    1
3    3    1    1    1    1
4    4    1    1    1    1
5    5    1    1    1    1
7    2    2    1    1    1
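As a quick sanity check (my addition), the count of 126 can also be reproduced in base R without the online calculator, since the number of multisets of size k drawn from n items is choose(n + k - 1, k):
choose(5 + 5 - 1, 5)
# [1] 126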

For a similar approach to @MarcusCampbell's within the tidyverse, we can use expand() to enumerate all possible combinations, and then keep only the distinct combinations that are invariant under permutation (i.e. where the ordering does not matter):
library(tidyverse)
tibble(V1 = 1:5, V2 = 1:5, V3 = 1:5, V4 = 1:5, V5 = 1:5) %>%
  expand(V1, V2, V3, V4, V5) %>%
  rowwise() %>%
  mutate(cmbn = paste(sort(c(V1, V2, V3, V4, V5)), collapse = ",")) %>%
  distinct(cmbn)
## A tibble: 126 x 1
#    cmbn
#    <chr>
#  1 1,1,1,1,1
#  2 1,1,1,1,2
#  3 1,1,1,1,3
#  4 1,1,1,1,4
#  5 1,1,1,1,5
#  6 1,1,1,2,2
#  7 1,1,1,2,3
#  8 1,1,1,2,4
#  9 1,1,1,2,5
# 10 1,1,1,3,3
## ... with 116 more rows
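If you want the five values back as separate numeric columns rather than a single comma-separated string, tidyr's separate() can split them; a minimal sketch (my addition, assuming the result of the pipeline above was saved as res):
res %>% separate(cmbn, into = paste0("V", 1:5), convert = TRUE)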

Related

Count instances of overlap in two vectors in R

I am hoping to create a matrix that shows a count of instances of overlapping values for a grouping variable based on a second variable. Specifically, I am hoping to determine the degree to which primary studies overlap across meta-analyses in order to create a network diagram.
So, in this example, I have three meta-analyses that include some portion of three primary studies.
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))
  metas studies
1     1       1
2     1       3
3     1       2
4     2       1
5     3       2
6     3       3
I would like it to return:
   v1 v2 v3
1   3  1  2
2   1  1  0
3   2  0  2
The value in row 1, column 1 indicates that Meta-analysis 1 had three studies in common with itself (i.e., it included three studies). Row 1, column 2 indicates that Meta-analysis 1 had one study in common with Meta-analysis 2. Row 1, column 3 indicates that Meta-analysis 1 had two studies in common with Meta-analysis 3.
I believe you are looking for a symmetric matrix of intersecting studies.
dfspl <- split(df$studies, df$metas)
out <- outer(seq_along(dfspl), seq_along(dfspl),
             function(a, b) lengths(Map(intersect, dfspl[a], dfspl[b])))
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
If you need names on them, you can go with the names as defined by df$metas:
rownames(out) <- colnames(out) <- names(dfspl)
out
# 1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2
If you need the names defined as v plus the meta name, go with
rownames(out) <- colnames(out) <- paste0("v", names(dfspl))
out
# v1 v2 v3
# v1 3 1 2
# v2 1 1 0
# v3 2 0 2
If you need to understand what this is doing, outer creates an expansion of the two argument vectors, and passes them all at once to the function. For instance,
outer(seq_along(dfspl), seq_along(dfspl), function(a, b) { browser(); 1; })
# Called from: FUN(X, Y, ...)
# debug at #1: [1] 1
# Browse[2]> a
# [1] 1 2 3 1 2 3 1 2 3
# Browse[2]> b
# [1] 1 1 1 2 2 2 3 3 3
What we ultimately want to do is find the intersection of each pair of studies.
dfspl[[1]]
# [1] 1 3 2
dfspl[[3]]
# [1] 2 3
intersect(dfspl[[1]], dfspl[[3]])
# [1] 3 2
length(intersect(dfspl[[1]], dfspl[[3]]))
# [1] 2
Granted, we are doing each comparison twice (once for 1 and 3, once for 3 and 1, which gives the same result), so this is a little inefficient; it would be better to compute only the upper or lower triangle and copy it across to the other.
Edited for a more efficient process (calculating each intersection pair only once, and never calculating self-intersections):
eg <- expand.grid(a = seq_along(dfspl), b = seq_along(dfspl))
eg <- eg[ eg$a < eg$b, ]
eg
# a b
# 4 1 2
# 7 1 3
# 8 2 3
lens <- lengths(Map(intersect, dfspl[eg$a], dfspl[eg$b]))
lens
# 1 1 2 ## btw, these are just names, from eg$a
# 1 2 0
out <- matrix(nrow = length(dfspl), ncol = length(dfspl))
out[ cbind(eg$a, eg$b) ] <- lens
out
# [,1] [,2] [,3]
# [1,] NA 1 2
# [2,] NA NA 0
# [3,] NA NA NA
out[ lower.tri(out) ] <- t(out)[ lower.tri(out) ] # mirror the upper triangle into the lower half
diag(out) <- lengths(dfspl)
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
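As an aside (my addition, not part of either answer): when every (meta, study) pair occurs at most once, the same matrix is just the incidence table multiplied by its own transpose:
inc <- table(df$metas, df$studies) # metas x studies incidence (0/1) table
tcrossprod(inc)                    # entry [i, j] counts the studies metas i and j share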
Same idea as @r2evans's, also base R (and a bit less eloquent; edited as required):
# Create df using sample data:
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,7,2,1,2,3))
# Test for equality between the values in the metas vector and the rest of
# the values in the dataframe, then construct a symmetric matrix from that vector:
v1 <- rowSums(data.frame(sapply(df$metas, `==`, unique(unlist(df)))))
m1 <- diag(v1)
m1[, 1] <- m1[1, ] <- v1
# Coerce matrix to dataframe setting the names as desired; dropping non matches:
df_2 <- setNames(data.frame(m1[which(rowSums(m1) > 0), which(colSums(m1) > 0)]),
                 paste0("v", 1:ncol(m1[which(rowSums(m1) > 0), which(colSums(m1) > 0)])))

Printing the sorted elements of a matrix in descending order with array indices in the fastest fashion

This seems like a simple problem but I am having trouble doing this in a fast manner.
Say I have a matrix and I want to sort this matrix and store the indices of the elements in descending order. Is there a quick way to do this? Right now, I am extracting the maximum, storing the result, changing it to -2, and then extracting the next maximum in a for loop. Which is probably the most inefficient way to do it.
My problem actually requires me to work on a 20,000 X 20,000 matrix. Memory is not an issue. Any ideas about the fastest way to do it would be great.
For example if I have a matrix
m <- matrix(c(1, 4, 2, 3), 2, 2)
m
     [,1] [,2]
[1,]    1    2
[2,]    4    3
I want the result to list the values in descending order, along with their row and column indices:
row col val
  2   1   4
  2   2   3
  1   2   2
  1   1   1
Here's a possible data.table solution:
library(data.table)
rows <- nrow(m) ; cols <- ncol(m)
res <- data.table(
  row = rep(seq_len(rows), cols),
  col = rep(seq_len(cols), each = rows),
  val = c(m)
)
setorder(res, -val)
res
#    row col val
# 1:   2   1   4
# 2:   2   2   3
# 3:   1   2   2
# 4:   1   1   1
Edit: a base R alternative
res <- cbind(
  row = rep(seq_len(rows), cols),
  col = rep(seq_len(cols), each = rows),
  val = c(m)
)
res[order(-res[, 3]), ]
#      row col val
# [1,]   2   1   4
# [2,]   2   2   3
# [3,]   1   2   2
# [4,]   1   1   1
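Another compact base R route (my addition, a sketch rather than part of the original answer): one order() call over the whole matrix, with arrayInd() mapping the linear indices back to rows and columns. The single sort should dominate the cost even for the 20,000 x 20,000 case:
o <- order(m, decreasing = TRUE)        # linear indices of m, largest value first
res <- cbind(arrayInd(o, dim(m)), m[o]) # recover (row, col) for each linear index
colnames(res) <- c("row", "col", "val")
res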

Summarizing category values within a nested list in R

I have a list of elements in R structured like this:
[[1]]
     value      weight
[1,]     1 0.085308057
[2,]     1 0.251184834
[3,]     1 0.009478673
[4,]     1 0.180094787
[5,]     1 0.445497630
[6,]     1 0.028436019

[[2]]
     value    weight
[1,]     1 0.1753555
[2,]     2 0.1706161
[3,]     1 0.3317536
[4,]     3 0.3222749
I am trying to add the weights for each "value" category within each level of the list which would result in something like the following:
Unit value    weight
   1     1 1.0000000
   2     1 0.5071091
   2     2 0.1706161
   2     3 0.3222749
There are approximately 2000 "units" that I need to summarize, so it would not be feasible to extract values from each one separately without a loop function, but I am having trouble writing the code to perform this task.
I also understand that I could turn the list into a dataframe in order to perform these calculations, but because each element of the list has different numbers of rows, I am unsure of how to go about doing this.
I am still new to learning R, so any help would be greatly appreciated!
This is very easy to solve using rbindlist() from the data.table package (version 1.9.5 or later; see here for installation instructions).
I'm not sure whether your list contains data.frames or matrices. If the latter is the case, first do (we will call your list l)
l <- lapply(l, as.data.frame)
Then the solution is straightforward
library(data.table)
rbindlist(l, idcol = "Unit")[, .(weight = sum(weight)), by = .(Unit, value)]
#    Unit value    weight
# 1:    1     1 1.0000000
# 2:    2     1 0.5071091
# 3:    2     2 0.1706161
# 4:    2     3 0.3222749
Alternatively, the same result can be achieved using a combination of the tidyr and dplyr packages
library(tidyr)
library(dplyr)
unnest(l, "Unit") %>%
group_by(Unit, value) %>%
summarise(weight = sum(weight))
# Source: local data frame [4 x 3]
# Groups: Unit
#
#   Unit value    weight
# 1   X1     1 1.0000000
# 2   X2     1 0.5071091
# 3   X2     2 0.1706161
# 4   X2     3 0.3222749
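For completeness, a dependency-free base R sketch (my addition, again assuming the list l holds matrices):
l2 <- lapply(l, as.data.frame)                              # matrices -> data frames
dfl <- do.call(rbind, Map(cbind, Unit = seq_along(l2), l2)) # stack with a Unit id column
aggregate(weight ~ Unit + value, data = dfl, FUN = sum)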

finding pairs of duplicate columns in R

Thank you for viewing this post. I am a newbie to the R language.
I want to find whether any column (not a specified one) is a duplicate of another, and return a matrix with dimensions num.duplicates x 2, each row giving both indices of a pair of duplicated variables. The matrix is organized so that the first column holds the lower index of each pair, sorted in increasing order.
Let say I have a dataset
  v1 v2 v3 v4 v5 v6
1  1  1  2  4  2  1
2  2  2  3  5  3  2
3  3  3  4  6  4  3
and I want this
[,1] [,2]
[1,] 1 2
[2,] 1 6
[3,] 2 6
[4,] 3 5
Please help, thank you!
Something like this I suppose:
out <- data.frame(t(combn(1:ncol(dd), 2)))
out[combn(1:ncol(dd), 2, FUN = function(x) all(dd[x[1]] == dd[x[2]])), ]
# X1 X2
#1 1 2
#5 1 6
#9 2 6
#11 3 5
I feel like I'm missing something simpler, but this seems to work.
Here's the sample data.
dd <- data.frame(
v1 = 1:3, v2 = 1:3, v3 = 2:4,
v4 = 4:6, v5 = 2:4, v6 = 1:3
)
Now I'll assign each column to a group using ave(), so that duplicated columns share the same label (the smallest index among the identical columns):
groups <- ave(1:ncol(dd), as.list(as.data.frame(t(dd))), FUN = min, drop = TRUE)
Now that I have the groups, I'll split the column indexes up by group; whenever a group has more than one member, I'll take all pairwise combinations. That creates a wide matrix, which I flip to the tall format you want with t().
morethanone <- function(x) length(x)>1
dups <- t(do.call(cbind,
lapply(Filter(morethanone, split(1:ncol(dd), groups)), combn, 2)
))
That returns
[,1] [,2]
[1,] 1 2
[2,] 1 6
[3,] 2 6
[4,] 3 5
as desired
First, generate all possible combinations with expand.grid. Second, remove duplicates and sort in the desired order. Third, use sapply to find the indexes of repeated columns:
kk <- expand.grid(1:ncol(dd), 1:ncol(dd))
nn <- kk[kk[, 1] > kk[, 2], 2:1]
nn[sapply(1:nrow(nn),
          function(i) all(dd[, nn[i, 1]] == dd[, nn[i, 2]])), ]
   Var2 Var1
2     1    2
6     1    6
12    2    6
17    3    5
The approach I propose is R-ish, but I suppose writing a simple double loop is justified for this case, especially if you recently started learning the language.
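One more compact sketch (my addition, not from the answers above): serialize each column of dd into a single string so that identical columns share a key, then expand every group of two or more columns into index pairs:
key <- vapply(dd, paste, character(1), collapse = "\r") # one key string per column
grp <- Filter(function(g) length(g) > 1, split(seq_along(dd), key))
pairs <- do.call(rbind, lapply(grp, function(g) t(combn(g, 2))))
pairs[order(pairs[, 1], pairs[, 2]), ]                  # same 4 x 2 result as above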
