I'm sure it's something simple, but I have a data frame
df <- data.frame(a = c(1, 2, 3),
b = c(2, 3, 1),
c = c(3, 1, 4))
And I want a new data frame that contains the unique combinations of values in the rows, irrespective of which column they're in. So in the case above I'd want
a b c
1 2 3
3 1 4
I've tried
unique(df[c('a', 'b', 'c')])
but it sees (1, 2, 3) as unique from (2, 3, 1), which I don't want.
Maybe something like this:
indx <- !duplicated(t(apply(df, 1, sort))) # finds non-duplicates among sorted rows
df[indx, ] # selects only the non-duplicates using that index
# a b c
# 1 1 2 3
# 3 3 1 4
If your data.frame is quite big, speed may matter to you. You can find duplicated sets much faster with the following idea.
Let's assign a prime number to each possible value in the rows and compute the product for each row. For example, for the given df we can take primenums = c(2,3,5,7) and get the products c(30,30,70). Duplicates in this product vector then correspond to duplicated sets in our data.frame. Since multiplication is computed much faster than any kind of sorting, you gain efficiency.
The code is as follows.
require("numbers")
primenums <- Primes(100)[1:4]
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult)
df[my_indx,]
Here we initialize the vector primenums with the function Primes from the numbers package, but you can define it manually as well.
Take a look at the following example, which compares the efficiency of the different approaches.
require("numbers")
# generate all unique combinations 10 out of 20
allcomb <- t(combn(20,10))
# make sample of 1 million rows
set.seed(789)
df <- allcomb[sample(nrow(allcomb), 1e6, T),]
# let's sort the matrix to show we have duplicates
df <- df[do.call(order, lapply(1:ncol(df), function(i) df[, i])), ]
head(df, 10)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1 2 3 4 5 6 7 8 9 10
# [2,] 1 2 3 4 5 6 7 8 9 10
# [3,] 1 2 3 4 5 6 7 8 9 10
# [4,] 1 2 3 4 5 6 7 8 9 10
# [5,] 1 2 3 4 5 6 7 8 9 11
# [6,] 1 2 3 4 5 6 7 8 9 11
# [7,] 1 2 3 4 5 6 7 8 9 11
# [8,] 1 2 3 4 5 6 7 8 9 11
# [9,] 1 2 3 4 5 6 7 8 9 11
# [10,] 1 2 3 4 5 6 7 8 9 11
# to be fair, we need to permute the numbers within rows before searching for identical sets
df <- t(apply(df, 1, function(z) z[sample(10,10)] ))
df <- as.data.frame(df)
names(df) <- letters[1:10]
# what does it look like now?
head(df, 10)
# a b c d e f g h i j
# 1 2 3 7 9 10 1 4 8 5 6
# 2 4 2 6 3 8 10 9 1 5 7
# 3 4 2 6 8 5 1 10 7 3 9
# 4 6 8 5 4 2 1 10 9 7 3
# 5 11 2 7 6 8 1 9 4 5 3
# 6 9 6 3 11 4 2 8 7 5 1
# 7 5 2 3 11 1 8 6 9 7 4
# 8 3 9 7 1 2 5 4 8 11 6
# 9 6 2 8 3 4 1 11 5 9 7
# 10 4 6 3 9 7 2 1 5 11 8
# now let's shuffle the rows to make df more realistic
df <- df[sample(nrow(df), nrow(df)),]
Now that the data.frame is ready, we can test the different algorithms.
system.time(indx <- !duplicated(t(apply(df, 1, sort))) )
# user system elapsed
# 119.75 0.06 120.03
# doesn't impress, frankly speaking
library(sets)
system.time(indx <- !duplicated(apply(df, 1, as.set)) )
# user system elapsed
# 91.60 0.00 91.89
# better, but we want faster! =)
# now lets check out the method with prime numbers
primenums <- Primes(100)[1:20]
# [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71
system.time({
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult) })
# user system elapsed
# 6.44 0.16 6.61
# not bad, is it? but let's compare the results
identical(indx, my_indx)
# [1] TRUE
# So, if there is no difference, why wait more? ;)
There is one important assumption here: we use as.matrix(df), but what if our data.frame contains non-numeric variables? A more general solution is as follows:
system.time({
dfmult <- apply(
apply(df, 2, function(colmn) as.integer(factor(colmn,
levels = unique(c(as.matrix(df)))))),
1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult) })
# user system elapsed
# 27.48 0.34 27.84
# distinctly slower, but still much faster than the previous methods
And what if we have very many columns or very many different values? In that case, instead of prod() we can use sum(log()), which also sidesteps numeric overflow when the product gets very large. Take a look at this.
pr <- Primes(5e7)
length(pr)
# [1] 3001134
system.time(N <- sum(log(pr)))
# user system elapsed
# 0.12 0.00 0.13
N
# [1] 49993718
It's hard to imagine a df with 3 million columns, but here it is fine. This approach lets us handle a df of practically any size, with as many columns as our RAM can hold.
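For completeness, here is a sketch of the dedup step with sum(log()) swapped in for prod(), on the small df from the question. The log-sums are rounded before duplicated(), since floating-point addition order could otherwise produce spurious tiny differences:

```r
# sketch: prime log-sum variant of the dedup idea
df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 1),
                 c = c(3, 1, 4))
primenums <- c(2, 3, 5, 7)  # one prime per possible value 1..4
# sum of logs instead of a product; round to guard against float noise
dflogsum <- round(apply(as.matrix(df), 1, function(z) sum(log(primenums[z]))), 10)
my_indx <- !duplicated(dflogsum)
df[my_indx, ]
#   a b c
# 1 1 2 3
# 3 3 1 4
```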
As an alternative approach, the package sets provides a fast way of checking for set equality:
library(sets)
df.sets <- apply(df, 1, as.set)
df.sets
#[[1]]
#{1, 2, 3}
#[[2]]
#{1, 2, 3}
#[[3]]
#{1, 3, 4}
df[!duplicated(df.sets),]
# a b c
#1 1 2 3
#3 3 1 4
I was wondering if you had any idea what R code I could use to automate my process.
I would like to repeat "chunks" of an initial vector (Vec1). I divide the vector into groups of 4 values and repeat each group 5 times. Currently, with my clumsy technique, each time I add a new experiment to the analysis I have to manually create a vector indicating which chunk to repeat next. In the end I put the vectors corresponding to each experiment together to get my desired output.
Vec1 is a simple numeric vector that grows in size with each new experiment; each new experiment extends the vector by 4 additional values.
Exp1 <- rep(Vec1 [1:4], times=5)
Exp2 <- rep(Vec1 [5:8], times=5)
Exp3 <- rep(Vec1 [9:12], times=5)
NewVector<- c(Exp1, Exp2, Exp3)
Could I use a trick to automate it?
Many thanks for the help,
Best regards,
Edouard M.
I don't know about "automate". You could write a function that takes the values 1:4 and adds multiples of 4 to it.
add_exp <- function(values = 1:4, n = 0) {
rep(values, 5) + 4 * n
}
Then add_exp() gives:
[1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
And add_exp(n = 1) gives:
[1] 5 6 7 8 5 6 7 8 5 6 7 8 5 6 7 8 5 6 7 8
So you could get NewVector using:
NewVector<- c(add_exp(), add_exp(n = 1), add_exp(n = 2))
Or if you wanted to use lapply to supply the values of n:
NewVector <- unlist(lapply(0:2, function(x) add_exp(n = x)))
Using sequence:
n <- 3L # number of experiments
v <- 4L # length of vector added for each experiment
r <- 5L # number of replications
sequence(rep(v, n*r), rep(seq(1, n*v, v), each = r))
#> [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 5 6 7 8 5
#> [26] 6 7 8 5 6 7 8 5 6 7 8 5 6 7 8 9 10 11 12 9 10 11 12 9 10
#> [51] 11 12 9 10 11 12 9 10 11 12
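If you prefer a reusable helper, here is a sketch (repeat_chunks is a made-up name) that splits the vector into consecutive groups of 4 and repeats each group 5 times, for however many experiments Vec1 currently holds:

```r
# sketch: generalize the Exp1/Exp2/Exp3 pattern to any number of experiments
repeat_chunks <- function(vec, chunk = 4L, times = 5L) {
  groups <- split(vec, ceiling(seq_along(vec) / chunk))  # consecutive groups of `chunk` values
  unlist(lapply(groups, rep, times = times), use.names = FALSE)
}

Vec1 <- 1:12  # three experiments' worth of values
identical(repeat_chunks(Vec1),
          c(rep(Vec1[1:4], 5), rep(Vec1[5:8], 5), rep(Vec1[9:12], 5)))
# [1] TRUE
```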
I have the following matrix:
m <- matrix(1:9, ncol=3, byrow=TRUE)
m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
that I need to flatten, i.e., convert to a vector.
However, instead of going along the columns:
as.vector(m)
[1] 1 4 7 2 5 8 3 6 9
I need the resulting vector to go along the rows, starting from the bottom row and reading left to right, e.g.:
[1] 7 8 9 4 5 6 1 2 3
How can I do that?
1) Reverse the first dimension, transpose, and then unravel:
c(t(m[nrow(m):1, ]))
## [1] 7 8 9 4 5 6 1 2 3
2) Here is a second approach which computes the indices and then applies them. It is longer but avoids the transpose:
nr <- nrow(m)
nc <- ncol(m)
c(m[cbind(rep(nr:1, each = nc), 1:nc)])
## [1] 7 8 9 4 5 6 1 2 3
2a) A variation of (2) is to use a 1d index:
m[rep(nr:1, each = nc) + nr * (0:(nc - 1))]
## [1] 7 8 9 4 5 6 1 2 3
Note
I tried it for a 100x100 and a 1000x1000 matrix. In the first case (1) was the fastest; in the second, (2) and (2a) were. So if speed is a concern, the actual dimensions make a difference as to which approach to choose.
One option could be also using asplit():
unlist(rev(asplit(m, 1)))
[1] 7 8 9 4 5 6 1 2 3
Maybe you can use one of the following approaches:
Solution 1:
as.vector(t(apply(m, 2, rev)))
which gives:
> as.vector(t(apply(m, 2, rev)))
[1] 7 8 9 4 5 6 1 2 3
Solution 2:
unlist(rev(data.frame(t(m))),use.names = F)
which gives:
> unlist(rev(data.frame(t(m))),use.names = F)
[1] 7 8 9 4 5 6 1 2 3
I have a matrix sort of like:
data <- round(runif(30)*10)
dimnames <- list(c("1","2","3","4","5"),c("1","2","3","2","3","2"))
values <- matrix(data, ncol=6, dimnames=dimnames)
# 1 2 3 2 3 2
# 1 5 4 9 6 7 8
# 2 6 9 9 1 2 5
# 3 1 2 5 3 10 1
# 4 6 5 1 8 6 4
# 5 6 4 5 9 4 4
Some of the column names are the same. I want to essentially reduce the columns in this matrix by taking the min of all values in the same row where the columns have the same name. For this particular matrix, the result would look like this:
# 1 2 3
# 1 5 4 7
# 2 6 1 2
# 3 1 1 5
# 4 6 4 1
# 5 6 4 4
The actual data set I'm using here has around 50,000 columns and 4,500 rows. None of the values are missing and the result will have around 40,000 columns. The way I tried to solve this was by melting the data then using group_by from dplyr before reshaping back to a matrix. The problem is that it takes forever to generate the data frame from the melt and I'd like to be able to iterate faster.
We can use rowMins from library(matrixStats)
library(matrixStats)
res <- vapply(split(1:ncol(values), colnames(values)),
function(i) rowMins(values[,i,drop=FALSE]), rep(0, nrow(values)))
res
# 1 2 3
#[1,] 5 4 7
#[2,] 6 1 2
#[3,] 1 1 5
#[4,] 6 4 1
#[5,] 6 4 4
row.names(res) <- row.names(values)
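If you would rather avoid the matrixStats dependency, the same grouping idea works in base R with apply(). A sketch on a small deterministic matrix (the question's data is random, so the first two rows of its expected output are hard-coded here):

```r
# base-R sketch: split column indices by (duplicated) column name,
# then take the row-wise min within each group
values <- matrix(c(5, 4, 9, 6, 7, 8,
                   6, 9, 9, 1, 2, 5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("1", "2"), c("1", "2", "3", "2", "3", "2")))
res <- sapply(split(seq_len(ncol(values)), colnames(values)),
              function(i) apply(values[, i, drop = FALSE], 1, min))
res
#   1 2 3
# 1 5 4 7
# 2 6 1 2
```

For 4,500 rows and 50,000 columns the rowMins version should still be noticeably faster, since apply() loops in R.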
This is my vector before kmeans -
> sort(table(mydata))
mydata
23 7 9 4 10 3 5 8 2 1
1 3 3 4 5 6 6 6 7 9
km <- kmeans(mydata, centers = 10)
After kmeans -
> sort(table(km$cluster))
km$cluster
1 6 7 3 5 2 4 10 8 9
1 3 3 4 5 6 6 6 7 9
Clearly, all my 1s are stored in cluster 9, all 2s are stored in cluster 8, and so on.
Can I find using R which cluster a particular number belongs to? Say, finding which cluster my 1s are in?
The values for $cluster are returned in the same order as your original data.
mydata <- rep(c(23,7,9,4,10,3,5,8,2,1), c(1,3,3,4,5,6,6,6,7,9))
sort(table(mydata))
# mydata
# 23 7 9 4 10 3 5 8 2 1
# 1 3 3 4 5 6 6 6 7 9
km <- kmeans(mydata, centers = 10)
unique(cbind(value=mydata, clust=km$cluster))
# value clust
# [1,] 23 9
# [2,] 7 5
# [3,] 9 7
# [4,] 4 4
# [5,] 10 1
# [6,] 3 10
# [7,] 5 2
# [8,] 8 8
# [9,] 2 6
# [10,] 1 3
Here I've just re-joined the two with cbind and used unique to eliminate all the duplicates, since you have such discrete data.
Extending MrFlick's answer (upvoted), in case you want the cluster number programmatically, you could also do this (using the magrittr package to get rid of all the nested parentheses):
library(magrittr)
data.point <- 5 # put the data point here
cluster.no <- c(mydata==data.point) %>% which %>% km$cluster[.] %>% unique
Examples:
library(magrittr)
set.seed(42) # for reproducibility
mydata <- rep(c(23,7,9,4,10,3,5,8,2,1), c(1,3,3,4,5,6,6,6,7,9))
km <- kmeans(mydata, centers = 10)
data.point <- 23
c(mydata==data.point) %>% which %>% km$cluster[.] %>% unique
# 8
data.point <- 10
c(mydata==data.point) %>% which %>% km$cluster[.] %>% unique
# 1
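The same lookup also works in base R without magrittr, by indexing km$cluster directly. A sketch; note that, unlike the thread's random starts, passing the distinct values themselves as initial centers makes the run deterministic (there are exactly 10 distinct values for the 10 centers):

```r
mydata <- rep(c(23, 7, 9, 4, 10, 3, 5, 8, 2, 1), c(1, 3, 3, 4, 5, 6, 6, 6, 7, 9))
# use the distinct values as initial centers: deterministic, no random start
km <- kmeans(mydata, centers = unique(mydata))
# which cluster do the 1s fall into?
unique(km$cluster[mydata == 1])
# [1] 10
```

Since cluster labels follow the row order of the supplied centers, the 1s land in cluster 10 here (1 is the last distinct value in mydata).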
I have a csv file which reads like this:
1 5
2 3
3 2
4 6
5 3
6 7
7 2
8 1
9 1
What I want to do is this:
1 5 4 6 7 2
2 3 5 3 8 1
3 2 6 7 9 1
i.e., after every third row, I want the values to continue in new columns side by side. Any advice?
Thanks a lot
Here's a way to do this with matrix indexing. It's a bit strange, but I find it interesting so I will post it.
You want an index matrix, with indices as follows. This gives the order of your data as a matrix (column-major order):
1, 1
2, 1
3, 1
1, 2
2, 2
3, 2
4, 1
...
8, 2
9, 2
This gives the pattern that you need to select the elements. Here's one approach to building such a matrix. Say that your data is in the object dat, a data frame or matrix:
m <- matrix(
c(
outer(rep(1:3, 2), seq(0,nrow(dat)-1,by=3), FUN='+'),
rep(rep(1:2, each=3), nrow(dat)/3)
),
ncol=2
)
The outer expression is the first column of the desired index matrix, and the rep expression is the second column. Now just index dat with this index matrix, and build a result matrix with three rows:
matrix(dat[m], nrow=3)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 5 4 6 7 2
## [2,] 2 3 5 3 8 1
## [3,] 3 2 6 7 9 1
a <- read.table(text = "1 5
2 3
3 2
4 6
5 3
6 7
7 2
8 1
9 1")
(seq_len(nrow(a))-1) %/% 3
# [1] 0 0 0 1 1 1 2 2 2
split(a, (seq_len(nrow(a))-1) %/% 3)
# $`0`
# V1 V2
# 1 1 5
# 2 2 3
# 3 3 2
# $`1`
# V1 V2
# 4 4 6
# 5 5 3
# 6 6 7
# $`2`
# V1 V2
# 7 7 2
# 8 8 1
# 9 9 1
do.call(cbind,split(a, (seq_len(nrow(a))-1) %/% 3))
# 0.V1 0.V2 1.V1 1.V2 2.V1 2.V2
# 1 1 5 4 6 7 2
# 2 2 3 5 3 8 1
# 3 3 2 6 7 9 1