Finding pairs of duplicate columns in R

Thank you for viewing this post. I am a newbie to the R language.
I want to find whether any column (not a specified one) is a duplicate of another, and return a matrix with dimensions num.duplicates x 2, each row giving the indices of one pair of duplicated columns. The matrix should be organized so that the first column holds the lower index of each pair, and the rows are sorted in increasing order.
Say I have this dataset:
v1 v2 v3 v4 v5 v6
1 1 1 2 4 2 1
2 2 2 3 5 3 2
3 3 3 4 6 4 3
and I want this
[,1] [,2]
[1,] 1 2
[2,] 1 6
[3,] 2 6
[4,] 3 5
Please help, thank you!

Something like this I suppose:
out <- data.frame(t(combn(1:ncol(dd), 2)))
out[combn(1:ncol(dd), 2, FUN = function(x) all(dd[x[1]] == dd[x[2]])), ]
# X1 X2
#1 1 2
#5 1 6
#9 2 6
#11 3 5

I feel like I'm missing something simpler, but this seems to work.
Here's the sample data.
dd <- data.frame(
  v1 = 1:3, v2 = 1:3, v3 = 2:4,
  v4 = 4:6, v5 = 2:4, v6 = 1:3
)
Now I'll assign each column to a group, using ave() to look for duplicates. Then I'll count the number of columns in each group.
groups <- ave(1:ncol(dd), as.list(as.data.frame(t(dd))), FUN=min, drop=T)
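For the sample data, the groups vector comes out like this (shown here to make the grouping concrete):
groups
# [1] 1 1 3 4 3 1
Columns 1, 2 and 6 fall in one group, columns 3 and 5 in another, and column 4 stands alone; each column is labelled with the lowest column index in its group, thanks to FUN=min.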
Now that I have the groups, I'll split the column indexes up by group. For any group with more than one column, I'll grab all pairwise combinations. That creates a wide matrix, which I flip to the tall format you want with t():
morethanone <- function(x) length(x) > 1
dups <- t(do.call(cbind,
  lapply(Filter(morethanone, split(1:ncol(dd), groups)), combn, 2)
))
That returns
[,1] [,2]
[1,] 1 2
[2,] 1 6
[3,] 2 6
[4,] 3 5
as desired

First, generate all possible combinations with expand.grid. Second, remove duplicates and sort in the desired order. Third, use sapply to find the indexes of repeated columns:
kk <- expand.grid(1:ncol(dd), 1:ncol(dd))
nn <- kk[kk[, 1] > kk[, 2], 2:1]
nn[sapply(1:nrow(nn),
          function(i) all(dd[, nn[i, 1]] == dd[, nn[i, 2]])), ]
Var2 Var1
2 1 2
6 1 6
12 2 6
17 3 5
The approach I propose is R-ish, but writing a simple double loop is also perfectly justified here, especially if you have only recently started learning the language.
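For reference, here is what that double loop could look like; this is just a sketch using the sample data dd from above, not code from any of the answers:
pairs <- NULL
for (i in 1:(ncol(dd) - 1)) {
  for (j in (i + 1):ncol(dd)) {
    if (all(dd[[i]] == dd[[j]])) {  # columns i and j hold identical values
      pairs <- rbind(pairs, c(i, j))
    }
  }
}
pairs
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    6
# [3,]    2    6
# [4,]    3    5
Because i always runs below j, the lower index lands in the first column and the rows come out in increasing order, which matches the requested format.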

Related

Getting distinct Combinations in R with repetition

I have a list of integers, say: (1,2,3,4,5)
I want to obtain all the possible lists of size 5, such that:
1. The lists can contain repeat elements, e.g. (1,1,1,2,2)
2. Ordering does not matter, e.g. (1,1,2,2,1) is the same as (1,1,1,2,2)
How do I obtain this whole list? I am actually looking for combinations of size 10 from a set of 10 integers.
Using the RcppAlgos solution recommended in this answer, we want to choose sets of 5 elements from your input, with repetition, where order doesn't matter (thus comboGeneral(); we would use permuteGeneral() if order mattered). Being coded in C++ under the hood, this is a very fast solution, and the profiling in the linked answer also found it to be memory efficient. Generating the sets for 10 multichoose 10 still took less than a second on my laptop.
library(RcppAlgos)
x = 1:5
result = comboGeneral(x, m = 5, repetition = TRUE)
dim(result)
# [1] 126 5
head(result)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 1 1 1
# [2,] 1 1 1 1 2
# [3,] 1 1 1 1 3
# [4,] 1 1 1 1 4
# [5,] 1 1 1 1 5
# [6,] 1 1 1 2 2
The link provided by Gregor seems to rely entirely on third-party packages to produce multisets, so I wanted to give you a base R solution. Note that the packages mentioned in that link will almost certainly be far more efficient for extremely large datasets.
We can use expand.grid() to first generate all possible permutations with repetition of the elements in (1,2,3,4,5). In this case, different orderings are still considered to be distinct. We now want to remove these "extra" combinations that contain the same elements but have different orders, which we can do by using apply() and duplicated().
If you use the multiset calculator here, you'll find that the code below produces the correct number of combinations. Here's the code:
x <- 1:5
df <- expand.grid(x, x, x, x, x) # generates all 5^5 ordered tuples, allowing repetition
index <- !duplicated(t(apply(df, 1, sort))) # flag the first occurrence of each sorted tuple
df <- df[index, ] # keep only the unique combinations
# check the number of rows. It should be 126; one for each combination
nrow(df)
# Output
# [1] 126
# Quick look at part of the dataframe:
head(df)
Var1 Var2 Var3 Var4 Var5
1 1 1 1 1 1
2 2 1 1 1 1
3 3 1 1 1 1
4 4 1 1 1 1
5 5 1 1 1 1
7 2 2 1 1 1
For a similar approach to @MarcusCampbell's within the tidyverse, we can use expand() to enumerate all possible combinations, and then keep only the distinct combinations that are invariant under permutation (i.e. where the ordering does not matter):
library(tidyverse);
tibble(V1 = 1:5, V2 = 1:5, V3 = 1:5, V4 = 1:5, V5 = 1:5) %>%
expand(V1, V2, V3, V4, V5) %>%
rowwise() %>%
mutate(cmbn = paste(sort(c(V1, V2, V3, V4, V5)), collapse = ",")) %>%
distinct(cmbn);
## A tibble: 126 x 1
# cmbn
# <chr>
# 1 1,1,1,1,1
# 2 1,1,1,1,2
# 3 1,1,1,1,3
# 4 1,1,1,1,4
# 5 1,1,1,1,5
# 6 1,1,1,2,2
# 7 1,1,1,2,3
# 8 1,1,1,2,4
# 9 1,1,1,2,5
#10 1,1,1,3,3
## ... with 116 more rows
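If you need the combinations back as separate columns rather than a single string, one option (a sketch, not part of the original answer; combos is a hypothetical name for the distinct(cmbn) result above) is to follow with tidyr::separate():
combos %>%
  separate(cmbn, into = paste0("V", 1:5), convert = TRUE)
The default separator regex splits on the commas, and convert = TRUE turns the pieces back into integers.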

R combinations of column vectors with names of new vectors as combination of original vectors

Apologies for the embarrassingly simple problem. I want to create combinations of all column vectors in a data frame, add the new vectors and rename them as a combination of the original column vector names.
For example
A B V3 V4 V5 V6
1 1 3 1 3 3 9
2 2 4 4 8 8 16
3 3 5 9 15 15 25
I'd like V3 to be named AA, V4 to be AB, V5 to be BA, and so on.
The closest I have come is a Python-style solution via a for loop. That makes sense, but what is the R syntax to name the columns?
df <- data.frame(x1 = 1:3, x2 = 3:5)
n <- ncol(df)  # fix the loop bound first, since the loop adds columns to df
for (i in 1:n) {
  for (j in 1:n) {
    df[paste(i, j, sep = "-")] <- df[, i] * df[, j]
  }
}
Alternatively, I could use sapply() instead of a loop, but I am still stuck on renaming the new columns:
df <- data.frame(x1 = 1:3, x2 = 3:5)
df[3:6] <- sapply(df[1:2], "*", df[1:2])
One option is expand.grid
d1 <- expand.grid(names(df), names(df))
df[do.call(paste0, d1)] <- apply(d1, 1, FUN = function(x) do.call(`*`, df[x]))
df
# x1 x2 x1x1 x2x1 x1x2 x2x2
#1 1 3 1 3 3 9
#2 2 4 4 8 8 16
#3 3 5 9 15 15 25
Or another option is
do.call(polym, c(df, degree = 2, raw = TRUE))
Or
poly(as.matrix(df), degree = 2, raw = TRUE)
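Note that the poly()/polym() routes return a slightly different set of columns than the expand.grid() option: the linear terms plus each cross product only once, with exponent-style names rather than pasted variable names. A quick check (a sketch; the names come from how stats::polym labels degree combinations):
p <- do.call(polym, c(df, degree = 2, raw = TRUE))
colnames(p)
# [1] "1.0" "2.0" "0.1" "1.1" "0.2"
# i.e. x1, x1^2, x2, x1*x2, x2^2; the product x1*x2 appears once, not as both x1x2 and x2x1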

Efficiently augment a data frame by values found in a matrix

I have the following data frame (called cp):
v1 v2 v3 v4
1 1 2 3 4
2 3 1 2 4
3 4 2 1 3
where 1, 2, 3 and 4 are nodes of a directed graph. The distances between the nodes are given by the following weighted adjacency matrix (let's call it B):
0 3 1 2
3 0 1 4
1 1 0 2
2 4 2 0
I need to augment the data frame with columns holding the distances between the nodes, looked up by row and column in the adjacency matrix (again showing data frame cp):
v1 v2 v3 v4 V5 V6 V7
1 1 2 3 4 3 1 2
2 3 1 2 4 1 3 4
3 4 2 1 3 4 3 1
That is, the values in columns V5, V6 and V7 come from looking up the distance between adjacent pairs of nodes in columns v1 to v4. For example, the 3 in column V5 is the distance from node 1 to node 2, which is found in the first row and second column of matrix B (that is, 3), and so on.
I have written the following code in R to achieve this:
for (i in 1:3) {
  for (j in 5:7) {
    cp[i, j] <- B[cp[i, j - 4], cp[i, j - 3]]
  }
}
The code works fine with a data frame of just a few observations. The problem is that it takes many many hours to process a data frame of 9 columns and 11 million observations. Can you please help me find a more efficient way to do this without the for loops?
In R, a matrix is stored as a one-dimensional vector in column-major order. You can exploit this property to look up many (row, column) pairs at once with a single vector of linear indices.
Hence you can do this:
B <- as.matrix(B)
cp <- cbind(cp, sapply(1:(ncol(cp) - 1), function(ii) {
  B[cp[, ii] + nrow(B) * (cp[, ii + 1] - 1)]
}))
cp
# v1 v2 v3 v4 1 2 3
# 1 1 2 3 4 3 1 2
# 2 3 1 2 4 1 3 4
# 3 4 2 1 3 4 3 1
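To see the linear-index arithmetic in isolation: R stores a matrix column-major, so element (r, c) sits at position r + nrow * (c - 1) of the underlying vector. A tiny sanity check (a sketch, not part of the original answer):
m <- matrix(1:6, nrow = 2)
m[2, 3]                   # element at row 2, column 3
# [1] 6
m[2 + nrow(m) * (3 - 1)]  # the same element via its linear index
# [1] 6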
You can try this, which should be considerably faster since your data frame cp has 9 columns:
res <- apply(mapply(seq, seq(ncol(cp) - 1), 2:ncol(cp)), 2, function(i) B[cp[, i]])
# [,1] [,2] [,3]
#[1,] 3 1 2
#[2,] 1 3 4
#[3,] 4 3 1
cbind(cp, res) will give your desired output.
Note that you first need to convert your data frame cp to a matrix with as.matrix(cp); the matrix type is used here because it makes the vectorized indexing possible.
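The indexing rule this relies on: when a matrix is indexed by a two-column matrix, each row of the index is treated as a (row, column) pair. A small illustration (a sketch using the matrix B from the question):
B <- matrix(c(0, 3, 1, 2,
              3, 0, 1, 4,
              1, 1, 0, 2,
              2, 4, 2, 0), nrow = 4, byrow = TRUE)
idx <- cbind(c(1, 3), c(2, 4))  # look up B[1, 2] and B[3, 4]
B[idx]
# [1] 3 2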
Benchmarking (cp of dim 1e+6 x 9)
library(microbenchmark)
set.seed(1)
cp <- t(replicate(1e+6, sample(9)))
B <- t(replicate(9, sample(9)-1))
f989 <- function() apply(mapply(seq, seq(ncol(cp) - 1), 2:ncol(cp)), 2, function(i) B[cp[, i]])
fikop <- function() sapply(1:(ncol(cp) - 1), function(ii) {
  B[cp[, ii] + nrow(B) * (cp[, ii + 1] - 1)]
})
all(f989()==fikop())
# [1] TRUE
microbenchmark(f989(), fikop())
# Unit: milliseconds
# expr min lq mean median uq max neval
# f989() 157.4025 165.0029 190.5306 200.8816 204.6907 239.720 100
# fikop() 212.2289 255.1914 259.2568 261.1330 266.3382 310.974 100

Printing the sorted elements of a matrix in descending order with array indices in the fastest fashion

This seems like a simple problem, but I am having trouble doing it in a fast manner.
Say I have a matrix and I want to sort it and store the indices of the elements in descending order. Is there a quick way to do this? Right now, I am extracting the maximum, storing the result, changing it to -2, and then extracting the next maximum in a for loop, which is probably the most inefficient way to do it.
My problem actually requires me to work on a 20,000 X 20,000 matrix. Memory is not an issue. Any ideas about the fastest way to do it would be great.
For example if I have a matrix
> m <- matrix(c(1, 4, 2, 3), 2, 2)
> m
[,1] [,2]
[1,] 1 2
[2,] 4 3
I want the result to indicate the numbers in descending order:
row col val
2 1 4
2 2 3
1 2 2
1 1 1
Here's a possible data.table solution
library(data.table)
rows <- nrow(m) ; cols <- ncol(m)
res <- data.table(
row = rep(seq_len(rows), cols),
col = rep(seq_len(cols), each = rows),
val = c(m)
)
setorder(res, -val)
res
# row col val
# 1: 2 1 4
# 2: 2 2 3
# 3: 1 2 2
# 4: 1 1 1
Edit: a base R alternative
res <- cbind(
row = rep(seq_len(rows), cols),
col = rep(seq_len(cols), each = rows),
val = c(m)
)
res[order(-res[, 3]),]
# row col val
# [1,] 2 1 4
# [2,] 2 2 3
# [3,] 1 2 2
# [4,] 1 1 1
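Since the question emphasizes speed on a 20,000 x 20,000 matrix, another compact base R route is to order() the values once and recover the array indices with arrayInd(). A sketch (not part of the original answers):
o <- order(m, decreasing = TRUE)               # linear positions, sorted by value
res <- cbind(arrayInd(o, dim(m)), val = m[o])  # convert positions to (row, col) pairs
colnames(res)[1:2] <- c("row", "col")
res
#      row col val
# [1,]   2   1   4
# [2,]   2   2   3
# [3,]   1   2   2
# [4,]   1   1   1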
