Subsets of data frame - r

I have a data frame in R and want to create every unique subset that consists of one pairwise combination of two columns drawn from the original data frame. If the original data frame has Y columns, I should therefore get Y*(Y-1)/2 subsets. The columns in each subset should also keep the names they had in the original data frame. How do I do it?

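# Return a list with one two-column data frame per pair of columns
colpairs <- function(d) {
  apply(combn(ncol(d), 2), 2, function(x) d[, x])
}
x <- colpairs(iris)
# lapply() keeps the list structure; sapply() would try to simplify it
lapply(x, head, n = 2)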
## [[1]]
## Sepal.Length Sepal.Width
## 1 5.1 3.5
## 2 4.9 3.0
##
## [[2]]
## Sepal.Length Petal.Length
## 1 5.1 1.4
## 2 4.9 1.4
...

I'd use combn to generate the column indices, and lapply to take the subsets of your data.frame and store them in a list, e.g.
# Example data
set.seed(1)
df <- data.frame( a = sample( 2 , 4 , replace = TRUE ) ,
b = runif(4) ,
c = sample(letters ,4 ),
d = sample( LETTERS , 4 ) )
# Use combn to get indices
ind <- combn( x = 1:ncol(df) , m = 2 , simplify = FALSE )
# ind is a list of column-index pairs. Written as a matrix (one pair per column), the pairs are:
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4
# Make subsets, combine in list
out <- lapply( ind , function(x) df[,x] )
out
[[1]]
# a b
#1 1 0.2016819
#2 1 0.8983897
#3 2 0.9446753
#4 2 0.6607978
[[2]]
# a c
#1 1 q
#2 1 b
#3 2 e
#4 2 x
[[3]]
# a d
#1 1 R
#2 1 J
#3 2 S
#4 2 L
[[4]]
# b c
#1 0.2016819 q
#2 0.8983897 b
#3 0.9446753 e
#4 0.6607978 x
[[5]]
# b d
#1 0.2016819 R
#2 0.8983897 J
#3 0.9446753 S
#4 0.6607978 L
[[6]]
# c d
#1 q R
#2 b J
#3 e S
#4 x L
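
Both answers above leave the list elements unnamed. As a minimal sketch (the helper name col_pair_subsets and the "_" separator are my own choices, not from the answers), you can build the pairs from the column names instead of the indices, which also makes it easy to label each subset and to check the Y*(Y-1)/2 count:

```r
# Hypothetical helper: one data frame per unordered pair of columns,
# with each list element named after the two columns it contains.
col_pair_subsets <- function(d) {
  pairs <- combn(names(d), 2, simplify = FALSE)
  out <- lapply(pairs, function(p) d[, p, drop = FALSE])
  names(out) <- vapply(pairs, paste, character(1), collapse = "_")
  out
}

subsets <- col_pair_subsets(iris)
length(subsets) == ncol(iris) * (ncol(iris) - 1) / 2   # TRUE: 5 columns give 10 pairs
names(subsets)[1:2]
head(subsets[[1]], 2)
```

Selecting by name keeps the original column names in every subset, and drop = FALSE guarantees a data frame is returned.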

Related

R rbind command remove extra information

x=rbind(rep(1:3),rep(1:3))
x
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 3
How can I remove the bracketed index labels such as [1,] and [,1] from the output? I tried make.row.names = FALSE, but that does not work.
You can do it with rownames and colnames:
colnames(x) <- 1:3
rownames(x) <- 1:2
x
# 1 2 3
#1 1 2 3
#2 1 2 3
You're probably confusing matrices with data frames?
x <- rbind(rep(1:3), rep(1:3))
x
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 1 2 3
The display is perfectly fine, since x is a matrix:
class(x)
# [1] "matrix" "array"   (R >= 4.0.0; earlier versions print only "matrix")
You could change dimnames like so
dimnames(x) <- list(1:nrow(x), 1:ncol(x))
x
# 1 2 3
# 1 1 2 3
# 2 1 2 3
However, probably you want a data frame.
x <- as.data.frame(rbind(rep(1:3), rep(1:3)))
x
# V1 V2 V3
# 1 1 2 3
# 2 1 2 3
class(x)
# [1] "data.frame"
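
If the goal is only to display the values without any labels, a small sketch (assuming plain printed output is all you need) is to use write.table for the matrix, or print with row.names = FALSE for the data frame:

```r
x <- rbind(1:3, 1:3)

# Matrix: write.table() can print the values without row or column labels
write.table(x, row.names = FALSE, col.names = FALSE)

# Data frame: print() can drop the row names, but the column names remain
print(as.data.frame(x), row.names = FALSE)
```

Neither call changes x itself; they only affect what is shown.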

R - Combinations of a list WITH duplicates?

How can I get all the combinations of the elements of a list, including duplicates? By duplicates I mean an element paired with itself. I am building a symmetric matrix.
names.list<-c("A","B","C")
as.data.frame(t(combn(names.list,2)))
Result is:
V1 V2
1 A B
2 A C
3 B C
When I want:
V1 V2
1 A A
2 A B
3 A C
4 B B
5 B C
6 C C
Or even:
V1 V2
1 A A
2 A B
3 A C
4 B A
5 B B
6 B C
7 C A
8 C B
9 C C
But my matrices are large, so I would like to keep the number of combinations to a minimum (so preferably the six-row result), since more combinations mean more computation and longer run times.
Thanks.
It sounds like you're looking for expand.grid instead of combn:
expand.grid(names.list, names.list)
# Var1 Var2
# 1 A A
# 2 B A
# 3 C A
# 4 A B
# 5 B B
# 6 C B
# 7 A C
# 8 B C
# 9 C C
Update
There's also combinations from "gtools" which would give you your preferred output.
library(gtools)
combinations(3, 2, names.list, repeats.allowed = TRUE)
# [,1] [,2]
# [1,] "A" "A"
# [2,] "A" "B"
# [3,] "A" "C"
# [4,] "B" "B"
# [5,] "B" "C"
# [6,] "C" "C"
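
If you'd rather not load a package, the six-row result can also be sketched in base R by building the full grid of index pairs and keeping only those with i <= j. The names g and pairs here are just illustrative:

```r
names.list <- c("A", "B", "C")

# All index pairs; j is listed first so that it varies fastest
g <- expand.grid(j = seq_along(names.list), i = seq_along(names.list))

# Keep each unordered pair once, including an element paired with itself
g <- g[g$i <= g$j, ]

pairs <- data.frame(V1 = names.list[g$i], V2 = names.list[g$j],
                    stringsAsFactors = FALSE)
pairs
nrow(pairs) == length(names.list) * (length(names.list) + 1) / 2   # TRUE
```

Note that this still builds the full n^2 grid before filtering, so for very large inputs a dedicated routine such as gtools::combinations may be preferable.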

interleave rows of matrix stored in a list in R

I want to create an interleaved matrix from a list of matrices.
Example input:
> l <- list(a=matrix(1:4,2),b=matrix(5:8,2))
> l
$a
[,1] [,2]
[1,] 1 3
[2,] 2 4
$b
[,1] [,2]
[1,] 5 7
[2,] 6 8
Expected output:
1 3
5 7
2 4
6 8
I have checked the interleave function in gdata but it does not show this behaviour for lists. Any help appreciated.
Here is a one-liner:
do.call(rbind, l)[order(sequence(sapply(l, nrow))), ]
# [,1] [,2]
# [1,] 1 3
# [2,] 5 7
# [3,] 2 4
# [4,] 6 8
To help understand, the matrices are first stacked on top of each other with do.call(rbind, l), then the rows are extracted in the right order:
sequence(sapply(l, nrow))
# a1 a2 b1 b2
# 1 2 1 2
order(sequence(sapply(l, nrow)))
# [1] 1 3 2 4
It will work with any number of matrices and it will do "the right thing" (subjective) even if they don't have the same number of rows.
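
To illustrate the unequal-rows case, here is the same one-liner wrapped in a function (interleave_rows is a name I made up for this sketch) and applied to a three-row and a two-row matrix:

```r
# Hypothetical wrapper around the one-liner above
interleave_rows <- function(mats) {
  stacked <- do.call(rbind, mats)
  # Within-matrix row number of every stacked row: 1 2 3 1 2 for the example below
  within <- sequence(vapply(mats, nrow, integer(1)))
  # order() is stable, so ties keep the order of the matrices in the list
  stacked[order(within), , drop = FALSE]
}

l2 <- list(a = matrix(1:6, 3), b = matrix(7:10, 2))
res <- interleave_rows(l2)
res
#      [,1] [,2]
# [1,]    1    4
# [2,]    7    9
# [3,]    2    5
# [4,]    8   10
# [5,]    3    6
```

The leftover third row of a simply ends up last, since no other matrix has a third row to interleave with it.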
Rather than reinventing the wheel, you can just modify it to get you to your destination.
The interleave function from "gdata" starts with ... to let you specify a number of data.frames or matrices to put together. The first few lines of the function look like this:
head(interleave)
#
# 1 function (..., append.source = TRUE, sep = ": ", drop = FALSE)
# 2 {
# 3 sources <- list(...)
# 4 sources[sapply(sources, is.null)] <- NULL
# 5 sources <- lapply(sources, function(x) if (is.matrix(x) ||
# 6 is.data.frame(x))
You can just rewrite lines 1 and 3 as I did in this Gist to create a list version of interleave (here, I've called it Interleave)
head(Interleave)
#
# 1 function (myList, append.source = TRUE, sep = ": ", drop = FALSE)
# 2 {
# 3 sources <- myList
# 4 sources[sapply(sources, is.null)] <- NULL
# 5 sources <- lapply(sources, function(x) if (is.matrix(x) ||
# 6 is.data.frame(x))
Does it work?
l <- list(a=matrix(1:4,2),b=matrix(5:8,2), c=matrix(9:12,2))
Interleave(l)
# [,1] [,2]
# a 1 3
# b 5 7
# c 9 11
# a 2 4
# b 6 8
# c 10 12

data.table "key indices" or "group counter"

After creating a key on a data.table:
library(data.table)
set.seed(12345)
DT <- data.table(x = sample(LETTERS[1:3], 10, replace = TRUE),
y = sample(LETTERS[1:3], 10, replace = TRUE))
setkey(DT, x, y)
DT
# x y
# [1,] A B
# [2,] A B
# [3,] B B
# [4,] B B
# [5,] C A
# [6,] C A
# [7,] C A
# [8,] C A
# [9,] C C
# [10,] C C
I would like to get an integer vector giving for each row the corresponding "key index". I hope the expected output (column i) below will help clarify what I mean:
# x y i
# [1,] A B 1
# [2,] A B 1
# [3,] B B 2
# [4,] B B 2
# [5,] C A 3
# [6,] C A 3
# [7,] C A 3
# [8,] C A 3
# [9,] C C 4
# [10,] C C 4
I thought about using something like cumsum(!duplicated(DT[, key(DT), with = FALSE])) but am hoping there is a better solution. I feel this vector could be part of the table's internal representation, and maybe there is a way to access it? Even if it is not the case, what would you suggest?
Update: From v1.8.3, you can simply use the inbuilt special .GRP:
DT[ , i := .GRP, by = key(DT)]
See history for older answers.
I'd probably just do this, since I'm fairly confident that no index counter is available from within the call to [.data.table():
ii <- unique(DT)
ii[ , i := seq_len(nrow(ii))]
DT[ii]
# x y i
# 1: A B 1
# 2: A B 1
# 3: B B 2
# 4: B B 2
# 5: C A 3
# 6: C A 3
# 7: C A 3
# 8: C A 3
# 9: C C 4
# 10: C C 4
You could make this a one-liner, at the expense of an additional call to unique.data.table():
DT[unique(DT)[ , i := seq_len(nrow(unique(DT)))]]
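
For comparison, the cumsum(!duplicated(...)) idea from the question can be sketched in base R on a plain data frame. This is only an illustration with the example values typed in by hand, and it assumes the rows are already sorted by the key columns, as they are after setkey:

```r
# Same ten rows as the keyed table shown in the question
df <- data.frame(
  x = c("A", "A", "B", "B", "C", "C", "C", "C", "C", "C"),
  y = c("B", "B", "B", "B", "A", "A", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# A new group starts at every row that is not a duplicate of an earlier row;
# the running count of those starts is the group index
df$i <- cumsum(!duplicated(df[c("x", "y")]))
df$i
# [1] 1 1 2 2 3 3 3 3 4 4
```

If the rows were not sorted, equal keys that are not adjacent would still count only once, but the index would no longer follow key order.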

in R, how to retrieve a complete matrix using combn?

My problem, stripped of its specific purpose, looks like this: how do I transform a set of combinations as follows?
First, use combn(letters[1:4], 2) to compute the combinations:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "c"
[2,] "b" "c" "d" "c" "d" "d"
Then use each column to compute one value, giving another data frame:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
Each element is obtained from the corresponding column above; for example, the first element comes from the first column.
How can I then transform this data frame into a matrix like the following?
a b c d
a 0 1 2 3
b 1 0 4 5
c 2 4 0 6
d 3 5 6 0
Cells whose row and column names are the same are zero; every other cell holds the value computed for that pair.
Here is one way that works:
inputs <- letters[1:4]
combs <- combn(inputs, 2)
N <- seq_len(ncol(combs))
nams <- unique(as.vector(combs))
out <- matrix(ncol = length(nams), nrow = length(nams))
out[lower.tri(out)] <- N
out <- t(out)
out[lower.tri(out)] <- N
out <- t(out)
diag(out) <- 0
rownames(out) <- colnames(out) <- inputs
Which gives:
> out
a b c d
a 0 1 2 3
b 1 0 4 5
c 2 4 0 6
d 3 5 6 0
If I had to do this a lot, I'd wrap those function calls into a function.
Another option is to use as.matrix.dist() to do the conversion for us by setting up a "dist" object by hand. Using some of the objects from earlier:
## Far easier
out2 <- N
class(out2) <- "dist"
attr(out2, "Labels") <- as.character(inputs)
attr(out2, "Size") <- length(inputs)
attr(out2, "Diag") <- attr(out2, "Upper") <- FALSE
out2 <- as.matrix(out2)
Which gives:
> out2
a b c d
a 0 1 2 3
b 1 0 4 5
c 2 4 0 6
d 3 5 6 0
Again, I'd wrap this in a function if I had to do it more than once.
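
As a rough sketch of what such a wrapper might look like (pairs_to_matrix and its argument names are hypothetical, and it assumes numeric values given in combn order):

```r
# Hypothetical wrapper: build a symmetric matrix with a zero diagonal
# from one numeric value per pair, in the order produced by combn(labels, 2)
pairs_to_matrix <- function(labels, values) {
  n <- length(labels)
  stopifnot(length(values) == n * (n - 1) / 2)
  m <- matrix(0, n, n, dimnames = list(labels, labels))
  # lower.tri() fills column by column, which matches the combn() pair order
  m[lower.tri(m)] <- values
  # Mirror the lower triangle into the upper one; the diagonal stays 0
  m + t(m)
}

res <- pairs_to_matrix(letters[1:4], 1:6)
res
#   a b c d
# a 0 1 2 3
# b 1 0 4 5
# c 2 4 0 6
# d 3 5 6 0
```

Adding the transpose avoids the double fill-and-transpose step, at the cost of only working for numeric values.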
Does it have to be a mirrored matrix with zeros on the diagonal?
combo <- combn(letters[1:4], 2)
in.combo <- matrix(1:6, nrow = 1)
combo <- rbind(combo, in.combo)
out.combo <- matrix(rep(NA, 16), ncol = 4)
colnames(out.combo) <- letters[1:4]
rownames(out.combo) <- letters[1:4]
for(cols in 1:ncol(combo)) {
vec1 <- combo[, cols]
out.combo[vec1[1], vec1[2]] <- as.numeric(vec1[3])
}
> out.combo
a b c d
a NA 1 2 3
b NA NA 4 5
c NA NA NA 6
d NA NA NA NA
