Difference between distinct and unique in R

What are the differences between distinct and unique in R using dplyr in consideration to:
Speed
Capabilities (valid inputs, parameters, etc) & Uses
Output
For example:
library(dplyr)
data(iris)
# creating data with duplicates
iris_dup <- bind_rows(iris, iris)
d <- distinct(iris_dup)
u <- unique(iris_dup)
all(d == u) # returns TRUE
In this example distinct and unique perform the same function. Are there examples of times you should use one but not the other? Are there any tricks or common uses of one?

These functions can often be used interchangeably, since equivalent calls exist for both. The main differences lie in speed and output format.
distinct() comes from the dplyr package and can be customized. For example, the following snippet returns only the distinct combinations of a specified set of columns in the data frame:
distinct(iris_dup, Petal.Width, Species)
unique() is base R and strictly returns the unique rows of a data frame: two rows are treated as duplicates only if every element matches.
Edit: As Imo points out, unique() can achieve the same column-subset behaviour by building a temporary data frame and finding the unique rows of that. This extra step may be slower for large data frames:
unique(iris_dup[c("Petal.Width", "Species")])
Both return the same rows, with one small difference: the row names. distinct() renumbers the result from 1, whereas unique() keeps the row name of the first occurrence of each unique row.
Petal.Width Species
1 0.2 setosa
2 0.4 setosa
3 0.3 setosa
4 0.1 setosa
5 0.5 setosa
6 0.6 setosa
7 1.4 versicolor
8 1.5 versicolor
9 1.3 versicolor
10 1.6 versicolor
11 1.0 versicolor
12 1.1 versicolor
13 1.8 versicolor
14 1.2 versicolor
15 1.7 versicolor
16 2.5 virginica
17 1.9 virginica
18 2.1 virginica
19 1.8 virginica
20 2.2 virginica
21 1.7 virginica
22 2.0 virginica
23 2.4 virginica
24 2.3 virginica
25 1.5 virginica
26 1.6 virginica
27 1.4 virginica
Overall, both functions return the unique rows based on the combined set of columns chosen. However, as the dplyr documentation itself claims, distinct() is considerably faster.
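One dplyr-specific trick worth knowing: distinct() takes a .keep_all argument, so you can de-duplicate on a subset of columns while keeping every column. A rough base-R analogue (just a sketch) uses duplicated():
distinct(iris_dup, Petal.Width, Species, .keep_all = TRUE)       # all columns, unique Petal.Width/Species combos
iris_dup[!duplicated(iris_dup[c("Petal.Width", "Species")]), ]   # base-R analogue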

With regard to two of your criteria, speed and input, here's a little function using the tictoc library. It shows that distinct() is notably faster (the input has numeric and character columns):
library(dplyr)
library(tictoc)
library(glue)
make_a_df <- function(nrows = NULL){
  # time unique(); the timing also includes building the tibble
  tic()
  df <- tibble(
    alpha   = sample(letters, nrows, replace = TRUE),
    numeric = rnorm(mean = 0, sd = 1, n = nrows)
  )
  unique(df)
  print(glue('Unique with {nrows}: '))
  toc()
  # time distinct() on a freshly generated tibble of the same size
  tic()
  df <- tibble(
    alpha   = sample(letters, nrows, replace = TRUE),
    numeric = rnorm(mean = 0, sd = 1, n = nrows)
  )
  distinct(df)
  print(glue('Distinct with {nrows}: '))
  toc()
}
Result:
> make_a_df(50); make_a_df(500); make_a_df(5000); make_a_df(50000); make_a_df(500000)
Unique with 50:
0.02 sec elapsed
Distinct with 50:
0 sec elapsed
Unique with 500:
0 sec elapsed
Distinct with 500:
0 sec elapsed
Unique with 5000:
0.02 sec elapsed
Distinct with 5000:
0 sec elapsed
Unique with 50000:
0.09 sec elapsed
Distinct with 50000:
0.01 sec elapsed
Unique with 5e+05:
1.77 sec elapsed
Distinct with 5e+05:
0.34 sec elapsed
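If you want timings averaged over repeated runs rather than a single call, a small sketch using the microbenchmark package (assuming it is installed) could look like this; the data generation is kept outside the timing so only unique() and distinct() are compared:
library(dplyr)
library(microbenchmark)
df <- tibble(
  alpha = sample(letters, 5e5, replace = TRUE),
  numeric = rnorm(n = 5e5)
)
microbenchmark(unique = unique(df), distinct = distinct(df), times = 10)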

Related

Repeatedly apply a conditional summary to groups in a dataframe

I have a large dataframe that looks like this:
group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
...
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32
...
The dataframe is already sorted by group_id and then distance. I want to know the efficient dplyr or data.table equivalent of doing the following operations:
Within each group_id:
Let the unique and sorted values of distance within the current group_id be d1, d2, ..., d_n.
For each d in d1, d2, ..., d_n: compute some function f on all values of metric whose distance value is less than d. The function f is a custom user-defined function that takes in a vector and returns a scalar. Assume that f is well defined on an empty vector.
So, in the example above, the desired dataframe would look like:
group_id distance_less_than metric
1 1.1 f(empty vector)
1 1.7 f(0.85, 0.37)
1 2.3 f(0.85, 0.37, 0.93)
...
1 7.9 f(0.85, 0.37, 0.93, 0.45,...,0.29)
2 2.5 f(empty vector)
2 2.8 f(0.78)
...
Notice how distance values can be repeated, like the value 1.1 under group 1. In such cases, both of the rows should be excluded when the distance is less than 1.1 (in this case this results in an empty vector).
A possible approach is to use the non-equi join available in data.table. The i table is the unique set of group_id/distance combinations, and the join picks up, for each of those rows, all rows of DT whose distance is strictly smaller.
f <- sum
DT[unique(DT, by = c("group_id", "distance")),   # one row per (group_id, distance)
   on = .(group_id, distance < distance),        # non-equi join: strictly smaller distances
   allow.cartesian = TRUE,
   f(metric), by = .EACHI]                       # apply f within each i row
output:
group_id distance V1
1: 1 1.1 NA
2: 1 1.7 1.22
3: 1 2.3 2.15
4: 1 6.3 2.60
5: 1 7.9 2.89
6: 2 2.5 NA
7: 2 2.8 0.78
data:
library(data.table)
DT <- fread("group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32")
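The NA rows in the output above come from the non-equi join itself: when no row of DT has a smaller distance, metric is joined in as a single NA, and sum(NA) is NA. If you would rather have f applied to an empty vector in that case (giving 0 for sum), one hedged tweak is to drop the join NAs first (note this also drops any genuine NA metric values):
DT[unique(DT, by = c("group_id", "distance")),
   on = .(group_id, distance < distance),
   allow.cartesian = TRUE,
   f(na.omit(metric)), by = .EACHI]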
I don't think this would be faster than the data.table option, but here is one way using dplyr:
library(dplyr)
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~ f(metric[distance < .])))
where f is your function. map_dbl expects the function to return a double; if your function returns a different type, use map_int, map_chr, or the like.
If you want to keep only one entry per distance, you can drop the duplicates with filter and duplicated:
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~ f(metric[distance < .]))) %>%
  filter(!duplicated(distance))
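For a concrete check of the dplyr route, here is a minimal sketch assuming f is simply sum and df is the sample data from the question (note that f(numeric(0)) is 0 for sum, whereas the data.table result above shows NA for those rows):
library(dplyr)
f <- sum
df <- data.frame(
  group_id = c(1, 1, 1, 1, 1, 1, 2, 2),
  distance = c(1.1, 1.1, 1.7, 2.3, 6.3, 7.9, 2.5, 2.8),
  metric   = c(0.85, 0.37, 0.93, 0.45, 0.29, 0.12, 0.78, 0.32)
)
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~ f(metric[distance < .]))) %>%
  filter(!duplicated(distance))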

R says "Cannot take a sample larger than the population" -- but I am not taking a sample larger than the population

I am trying to pick 3500 random observations from a set of 5655 observations. But when I do so, R is throwing a strange error, saying that "cannot take a sample larger than the population when 'replace = FALSE'"
I am trying to take a sample smaller than the population. Why is R throwing this error?
nrow(males)
[1] 5655
m = sample(males, 3500, replace = FALSE, prob = NULL)
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
You need to sample the row indices, not the data frame itself, then use the result to pick out the sampled rows:
m <- males[sample(nrow(males), 3500, replace = FALSE, prob = NULL),]
You can also use $ to select the specific column within your data set you want to sample from.
Ex: m <- sample(dataframename$variable, 3500)
Another solution is to use dplyr
library(dplyr)
males %>% sample_n(3500, replace = FALSE, prob = NULL)
#if you don't like the pipe notation, this works equally well
sample_n(males, 3500, replace = FALSE, prob = NULL)
This can happen if you accidentally use sample() where you actually want to be using sample_n().
Example
What you don't want
iris %>%
sample(10)
# Error in sample.int(length(x), size, replace, prob) :
# cannot take a sample larger than the population when 'replace = FALSE'
Using sample_n() instead:
library(dplyr)
iris %>%
sample_n(10)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.6 3.6 1.0 0.2 setosa
# 2 5.4 3.7 1.5 0.2 setosa
# 3 5.0 3.6 1.4 0.2 setosa
# 4 6.7 3.3 5.7 2.1 virginica
# 5 6.2 3.4 5.4 2.3 virginica
# 6 4.3 3.0 1.1 0.1 setosa
# 7 5.8 2.7 5.1 1.9 virginica
# 8 5.8 2.8 5.1 2.4 virginica
# 9 6.8 3.2 5.9 2.3 virginica
# 10 7.6 3.0 6.6 2.1 virginica
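As a side note, in dplyr 1.0.0 and later sample_n() is superseded by slice_sample(), so an equivalent call would be:
library(dplyr)   # 1.0.0 or later
males %>% slice_sample(n = 3500, replace = FALSE)   # samples rows, like sample_n()
iris %>% slice_sample(n = 10)                       # same idea as the sample_n(10) example above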
Change replace from FALSE to TRUE:
nrow(males)
[1] 5655
m = sample(males, 3500, replace = TRUE, prob = NULL)
Note, however, that sample() applied to a data frame samples its columns, not its rows, so this only silences the error; if you want 3500 rows, sample the row indices as shown above.

Extracting block of m rows at regular interval from large dataset

I have a small problem. I have a dataset with 8208 rows of data in a single column. I want to take every n rows as a block and add each block as a column of a new data frame.
So, for example:
newdf has columns 1 to 23.
column 1 is composed of rows 289:528 from the original dataset
column 2 is composed of rows 625:864 from the original dataset
And so on. The "block" size is 239 rows, and the jump between blocks is 336 rows.
I can do this manually, but it just becomes tedious. I have to repeat this entire procedure for another 11 sets of data so obviously a more automated approach would be preferable.
The trick here is to create an index of integers that refer to the row numbers you want to keep. This is simple enough with some use of rep, sequences and R's recycling rule.
Let me demonstrate using iris. Say you want to skip 25 rows, then return 3 rows:
skip  <- 25
take  <- 3
total <- nrow(iris)
reps  <- total %/% (skip + take)   # number of complete skip+take cycles
index <- rep(0:(reps-1), each = take) * (skip + take) + (1:take) + skip
The index now is:
index
[1] 26 27 28 54 55 56 82 83 84 110 111 112 138 139 140
And the rows of iris:
iris[index, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica
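Applying the same indexing idea to the numbers in the question (blocks starting at row 289, repeating every 336 rows, and taking the 240-row blocks implied by the example rows 289:528), a sketch might look like this, with df standing in for the original single-column data:
df <- data.frame(A = runif(8208))                          # stand-in for the real data
starts <- seq(289, nrow(df) - 239, by = 336)               # first row of each block
index  <- rep(starts, each = 240) + 0:239                  # expand each start into a 240-row block
newdf  <- as.data.frame(matrix(df$A[index], nrow = 240))   # one column per block
dim(newdf)                                                 # 240 rows, 23 columns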
Update
Note the OP states the block size is 239 rows, but it is clear from the example rows quoted that the block size is actually 240:
> length(289:528)
[1] 240
I'll leave the example below at a block length of 239, but adjust if it is really 240.
It isn't clear from the Question, but assuming that you have something like this
df <- data.frame(A = runif(8208))
a data frame with 8208 rows.
First compute the indices of the elements of A that you need to keep. This is done via
want <- sapply(seq(289, nrow(df) - 239, by = 336),
               function(x) x + (seq_len(239) - 1))
Then we can use the fact that R fills matrices by columns and convert the required elements of A to a matrix with 239 rows
mat <- matrix(df$A[want], nrow = 239)
This works
> all.equal(mat[,1], df$A[289:527])
[1] TRUE
but do note that I have taken a block length of 239 here (289:527), not the rows the OP quotes (289:528), since those imply a block size of 240 (see the Update above).
If you want this is a data frame, just add
df2 <- as.data.frame(mat)
Try this:
1) Create a list of indices
lapply(seq(1, 8208 - 239, 336), function(X) X:(X + 239)) -> Indices
2) Select Data
Columns <- lapply(Indices, function(X) OldDF[X,])
3) Combine selected data in columns
NewDF <- do.call(cbind, Columns)
Why not just:
as.data.frame(matrix(orig, nrow = 528)[289:528, ])
Since 8208 is not an exact multiple of the row count, we need to determine the number of columns:
> 8208/528
[1] 15.54545 # so either 15 or 16
> 8208 - 15*528
[1] 288 # all in the to-be-discarded section
as.data.frame(matrix(orig, nrow = 528, ncol = 15)[289:528, ])
Or:
as.data.frame(matrix(orig, nrow = 528, ncol = 8208 %/% 528)[289:528, ])

How can I use functions returning vectors (like fivenum) with ddply or aggregate?

I would like to split my data frame using a couple of columns and call let's say fivenum on each group.
aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))
The returned value is a data.frame with only 2 columns, the second of which is a matrix. How can I turn it into normal columns of a data.frame?
Update
I want something like the following with less code using fivenum
ddply(iris, .(Species), summarise,
Min = min(Petal.Width),
Q1 = quantile(Petal.Width, .25),
Med = median(Petal.Width),
Q3 = quantile(Petal.Width, .75),
Max = max(Petal.Width)
)
Here is a solution using data.table (while not specifically requested, it is an obvious complement to, or replacement for, aggregate or ddply). As well as being slightly long to code, repeatedly calling quantile will be inefficient, since each call sorts the data.
library(data.table)
Tukeys_five <- c("Min","Q1","Med","Q3","Max")
IRIS <- data.table(iris)
# this will create the wide data.table
lengthBySpecies <- IRIS[,as.list(fivenum(Sepal.Length)), by = Species]
# and you can rename the columns from V1, ..., V5 to something nicer
setnames(lengthBySpecies, paste0('V',1:5), Tukeys_five)
lengthBySpecies
Species Min Q1 Med Q3 Max
1: setosa 4.3 4.8 5.0 5.2 5.8
2: versicolor 4.9 5.6 5.9 6.3 7.0
3: virginica 4.9 6.2 6.5 6.9 7.9
Or, using a single call to quantile using the appropriate prob argument.
IRIS[,as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25))), by = Species]
Species 0% 25% 50% 75% 100%
1: setosa 4.3 4.800 5.0 5.2 5.8
2: versicolor 4.9 5.600 5.9 6.3 7.0
3: virginica 4.9 6.225 6.5 6.9 7.9
Note that the names of the created columns are not syntactically valid, although you could go through a similar renaming using setnames
EDIT
Interestingly, quantile will set the names of the resulting vector if names = TRUE (the default), and this copying slows down the number crunching and consumes memory; the help page even warns about it, fancy that!
Thus, you should probably use
IRIS[,as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25), names = FALSE)), by = Species]
Or, if you wanted to return the named list, without R copying internally
IRIS[,{quant <- as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25), names = FALSE))
setattr(quant, 'names', Tukeys_five)
quant}, by = Species]
You can use do.call to call data.frame on each of the matrix elements recursively to get a data.frame with ordinary vector columns (dfr below is the data frame returned by your aggregate call):
dim(do.call("data.frame",dfr))
[1] 3 7
str(do.call("data.frame",dfr))
'data.frame': 3 obs. of 7 variables:
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
$ Petal.Width.Min. : num 0.1 1 1.4
$ Petal.Width.1st.Qu.: num 0.2 1.2 1.8
$ Petal.Width.Median : num 0.2 1.3 2
$ Petal.Width.Mean : num 0.28 1.36 2
$ Petal.Width.3rd.Qu.: num 0.3 1.5 2.3
$ Petal.Width.Max. : num 0.6 1.8 2.5
As far as I know, there isn't an exact way to do what you're asking, because the function you're using (fivenum) doesn't return data in a way that can be easily bound to columns from within the 'ddply' function. This is easy to clean up, though, in a programmatic way.
Step 1: Perform the fivenum function on each 'Species' value using the 'ddply' function.
data <- ddply(iris, .(Species), summarize, value=fivenum(Petal.Width))
# Species value
# 1 setosa 0.1
# 2 setosa 0.2
# 3 setosa 0.2
# 4 setosa 0.3
# 5 setosa 0.6
# 6 versicolor 1.0
# 7 versicolor 1.2
# 8 versicolor 1.3
# 9 versicolor 1.5
# 10 versicolor 1.8
# 11 virginica 1.4
# 12 virginica 1.8
# 13 virginica 2.0
# 14 virginica 2.3
# 15 virginica 2.5
Now, the 'fivenum' function returns five numbers, so we end up with 5 rows for each species. That's the part where the 'fivenum' function is fighting us.
Step 2: Add a label column. We know what Tukey's five numbers are, so we just call them out in the order that the 'fivenum' function returns them. The list will repeat until it hits the end of the data.
Tukeys_five <- c("Min","Q1","Med","Q3","Max")
data$label <- Tukeys_five
# Species value label
# 1 setosa 0.1 Min
# 2 setosa 0.2 Q1
# 3 setosa 0.2 Med
# 4 setosa 0.3 Q3
# 5 setosa 0.6 Max
# 6 versicolor 1.0 Min
# 7 versicolor 1.2 Q1
# 8 versicolor 1.3 Med
# 9 versicolor 1.5 Q3
# 10 versicolor 1.8 Max
# 11 virginica 1.4 Min
# 12 virginica 1.8 Q1
# 13 virginica 2.0 Med
# 14 virginica 2.3 Q3
# 15 virginica 2.5 Max
Step 3: With the labels in place, we can quickly cast this data into a new shape using the 'dcast' function from the 'reshape2' package.
library(reshape2)
dcast(data, Species ~ label)[,c("Species",Tukeys_five)]
# Species Min Q1 Med Q3 Max
# 1 setosa 0.1 0.2 0.2 0.3 0.6
# 2 versicolor 1.0 1.2 1.3 1.5 1.8
# 3 virginica 1.4 1.8 2.0 2.3 2.5
All that junk at the end just specifies the column order, since the 'dcast' function automatically puts the columns in alphabetical order.
Hope this helps.
Update: I decided to return, because I realized there is one other option available to you. You can always bind a matrix as part of a data frame definition, so you could resolve your 'aggregate' function like so:
data <- aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))
result <- data.frame(Species=data[,1],data[,2])
# Species Min. X1st.Qu. Median Mean X3rd.Qu. Max.
# 1 setosa 0.1 0.2 0.2 0.28 0.3 0.6
# 2 versicolor 1.0 1.2 1.3 1.36 1.5 1.8
# 3 virginica 1.4 1.8 2.0 2.00 2.3 2.5
This is my solution:
ddply(iris, .(Species), summarize, value=t(fivenum(Petal.Width)))
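For completeness, a modern dplyr sketch (assuming dplyr >= 1.1 for reframe() and tidyr for pivot_wider()) that gets the same wide layout in a couple of lines:
library(dplyr)
library(tidyr)
Tukeys_five <- c("Min", "Q1", "Med", "Q3", "Max")
iris %>%
  reframe(value = fivenum(Petal.Width), label = Tukeys_five, .by = Species) %>%
  pivot_wider(names_from = label, values_from = value)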
