I want to find rows in a dataset where the values in all columns, except for one, match. After much messing around trying unsuccessfully to get duplicated() to return all instances of the duplicate rows (not just the first instance), I figured out a way to do it (below).
For example, I want to identify all rows in the Iris dataset that are equal except for Petal.Width.
require(tidyverse)
x <- iris %>% select(-Petal.Width)
dups <- x[x %>% duplicated(), ]
answer <- iris %>% semi_join(dups)  # semi_join matches on all shared columns here
> answer
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.1 1.5 0.1 setosa
3 4.8 3.0 1.4 0.1 setosa
4 5.1 3.5 1.4 0.3 setosa
5 4.9 3.1 1.5 0.2 setosa
6 4.8 3.0 1.4 0.3 setosa
7 5.8 2.7 5.1 1.9 virginica
8 6.7 3.3 5.7 2.1 virginica
9 6.4 2.8 5.6 2.1 virginica
10 6.4 2.8 5.6 2.2 virginica
11 5.8 2.7 5.1 1.9 virginica
12 6.7 3.3 5.7 2.5 virginica
As you can see, that works, but this is one of those times when I'm almost certain that lots of other folks need this functionality, and that I'm ignorant of a single function that does it in fewer steps or a generally tidier way. Any suggestions?
An alternate approach, from at least two other posts, applied to this case would be:
answer = iris[duplicated(iris[-4]) | duplicated(iris[-4], fromLast = TRUE),]
But that also seems like just a different workaround instead of a single function. Both approaches take the same amount of time (0.08 sec on my system). Is there no neater/faster way of doing this?
e.g. something like
iris %>% duplicates(all = TRUE, ignore = Petal.Width)
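A minimal wrapper with roughly that signature is easy enough to write (duplicates() here is my own hypothetical helper, not an existing function), but I was hoping for a built-in:
duplicates <- function(df, ignore) {
  # compare on everything except the ignored column(s)
  keep <- df[, setdiff(names(df), ignore), drop = FALSE]
  df[duplicated(keep) | duplicated(keep, fromLast = TRUE), ]
}
iris %>% duplicates(ignore = "Petal.Width")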
iris[duplicated(iris[, -4]) | duplicated(iris[, -4], fromLast = TRUE), ]
Of the rows that are duplicated (disregarding column 4), duplicated(iris[, -4]) flags the second row of each duplicate set (rows 18, 35, 46, 133, 143 and 145), while duplicated(iris[, -4], fromLast = TRUE) flags the first row of each set (rows 1, 10, 13, 102, 125 and 129). Combining them with | gives 12 TRUEs, so the expression returns the expected output.
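To see the two halves separately (row numbers as listed above):
d_first <- duplicated(iris[, -4])                   # flags the later occurrence in each set
d_last  <- duplicated(iris[, -4], fromLast = TRUE)  # flags the earlier occurrence
which(d_first)  # 18  35  46 133 143 145
which(d_last)   #  1  10  13 102 125 129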
Or perhaps with dplyr: basically you group on all variables except Petal.Width, count how often each combination occurs, and filter those which occur more than once.
library(dplyr)
iris %>%
  group_by_at(vars(-Petal.Width)) %>%
  filter(n() > 1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.1 1.5 0.1 setosa
3 4.8 3.0 1.4 0.1 setosa
4 5.1 3.5 1.4 0.3 setosa
5 4.9 3.1 1.5 0.2 setosa
6 4.8 3.0 1.4 0.3 setosa
7 5.8 2.7 5.1 1.9 virginica
8 6.7 3.3 5.7 2.1 virginica
9 6.4 2.8 5.6 2.1 virginica
10 6.4 2.8 5.6 2.2 virginica
11 5.8 2.7 5.1 1.9 virginica
12 6.7 3.3 5.7 2.5 virginica
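Note that group_by_at() has since been superseded; with dplyr >= 1.0 the same idea can be written with across() (a sketch, assuming a recent dplyr version):
iris %>%
  group_by(across(-Petal.Width)) %>%  # group on every column except Petal.Width
  filter(n() > 1) %>%
  ungroup()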
I think janitor can do this somewhat directly.
library(janitor)
get_dupes(iris, !Petal.Width)
# get_dupes(iris, !Petal.Width)[,names(iris)] # alternative: no count column
Sepal.Length Sepal.Width Petal.Length Species dupe_count Petal.Width
1 4.8 3.0 1.4 setosa 2 0.1
2 4.8 3.0 1.4 setosa 2 0.3
3 4.9 3.1 1.5 setosa 2 0.1
4 4.9 3.1 1.5 setosa 2 0.2
5 5.1 3.5 1.4 setosa 2 0.2
6 5.1 3.5 1.4 setosa 2 0.3
7 5.8 2.7 5.1 virginica 2 1.9
8 5.8 2.7 5.1 virginica 2 1.9
9 6.4 2.8 5.6 virginica 2 2.1
10 6.4 2.8 5.6 virginica 2 2.2
11 6.7 3.3 5.7 virginica 2 2.1
12 6.7 3.3 5.7 virginica 2 2.5
I looked into the source of duplicated() but would be interested to see if anyone can find anything faster, though it might involve going to Rcpp or something similar. On my machine, the base method is the fastest, and your original method is actually better than the most readable dplyr method. I think that wrapping a function like this for your own purposes ought to be sufficient, since your run times don't seem excessively long anyway; you can simply do iris %>% opt2("Petal.Width") for pipeability if that's the main concern.
library(tidyverse)
library(microbenchmark)
opt1 <- function(df, ignore) {
  ignore <- enquo(ignore)
  x <- df %>% select(-!!ignore)
  dups <- x[x %>% duplicated(), ]
  df %>% semi_join(dups)
}
opt2 <- function(df, ignore) {
  index <- which(colnames(df) == ignore)
  df[duplicated(df[-index]) | duplicated(df[-index], fromLast = TRUE), ]
}
opt3 <- function(df, ignore) {
  ignore <- enquo(ignore)
  df %>%
    group_by_at(vars(-!!ignore)) %>%
    filter(n() > 1)
}
microbenchmark(
  opt1 = suppressMessages(opt1(iris, Petal.Width)),
  opt2 = opt2(iris, "Petal.Width"),
  opt3 = opt3(iris, Petal.Width)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> opt1 3.427753 4.024185 4.851445 4.464072 5.069216 12.800890 100 b
#> opt2 1.712975 1.908130 2.403859 2.133632 2.542871 7.557102 100 a
#> opt3 6.604614 7.334304 8.461424 7.920369 8.919128 24.255678 100 c
Created on 2018-07-12 by the reprex package (v0.2.0).
Related
Here's an example of keeping unique rows based on the Sepal.Length column only, using dplyr::distinct(). This already removes many rows: from 150 rows in the original iris data set down to 35 rows after applying dplyr::distinct().
Instead of having dplyr::distinct() make an exact comparison, I would like to allow a level of numerical tolerance in the comparison, let us say 10%.
For example, the first two observations shown below for the Sepal.Length variable, show 5.1 and 4.9, respectively. That means a difference of 0.2. If we make that relative to the highest value, i.e. 5.1, then we have a relative difference of 0.2 / 5.1 = 0.03921569 (3.9%).
Hence, according to the 10% criterion, these two observations should be considered equal, as 3.9% < 10%.
So, is there a simple and idiomatic way of doing this? I mean, without resorting to explicitly calculating distances between all observations for each variable of interest and then applying filters...?
Am I underestimating the problem, and this really requires, perhaps, clustering of the values to define "equal" groups, and is therefore a difficult problem that goes beyond the scope of dplyr, and distinct() particularly?
If I end up answering my own question with these suppositions, I think it might still be useful to leave it posted here for others who might be approaching this question as naively as me.
library(dplyr, warn.conflicts = FALSE)
distinct(iris, Sepal.Length, .keep_all = TRUE)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.4 2.9 1.4 0.2 setosa
#> 8 4.8 3.4 1.6 0.2 setosa
#> 9 4.3 3.0 1.1 0.1 setosa
#> 10 5.8 4.0 1.2 0.2 setosa
#> 11 5.7 4.4 1.5 0.4 setosa
#> 12 5.2 3.5 1.5 0.2 setosa
#> 13 5.5 4.2 1.4 0.2 setosa
#> 14 4.5 2.3 1.3 0.3 setosa
#> 15 5.3 3.7 1.5 0.2 setosa
#> 16 7.0 3.2 4.7 1.4 versicolor
#> 17 6.4 3.2 4.5 1.5 versicolor
#> 18 6.9 3.1 4.9 1.5 versicolor
#> 19 6.5 2.8 4.6 1.5 versicolor
#> 20 6.3 3.3 4.7 1.6 versicolor
#> 21 6.6 2.9 4.6 1.3 versicolor
#> 22 5.9 3.0 4.2 1.5 versicolor
#> 23 6.0 2.2 4.0 1.0 versicolor
#> 24 6.1 2.9 4.7 1.4 versicolor
#> 25 5.6 2.9 3.6 1.3 versicolor
#> 26 6.7 3.1 4.4 1.4 versicolor
#> 27 6.2 2.2 4.5 1.5 versicolor
#> 28 6.8 2.8 4.8 1.4 versicolor
#> 29 7.1 3.0 5.9 2.1 virginica
#> 30 7.6 3.0 6.6 2.1 virginica
#> 31 7.3 2.9 6.3 1.8 virginica
#> 32 7.2 3.6 6.1 2.5 virginica
#> 33 7.7 3.8 6.7 2.2 virginica
#> 34 7.4 2.8 6.1 1.9 virginica
#> 35 7.9 3.8 6.4 2.0 virginica
In terms of identifying the relative distance from the column max and using this as a sort of grouping, you could use the following approach for one column. The important part is to round the result of the calculation; one digit after the decimal separator corresponds to 10% steps:
iris %>%
  dplyr::transmute(Sepal.Length = round((max(Sepal.Length) - Sepal.Length) / max(Sepal.Length), 1)) %>%
  dplyr::distinct()
and if you would like to use it for all numerical columns, a generalization of the above solution looks like this:
iris %>%
  dplyr::mutate(across(where(is.numeric), ~ round((max(.x) - .x) / max(.x), 1))) %>%
  dplyr::distinct()
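If you would rather keep the original (unrounded) rows, a sketch is to compute the rounded relative distance as a helper column, deduplicate on it, then drop it (rel_bin is a hypothetical helper, not part of dplyr):
rel_bin <- function(x) round((max(x) - x) / max(x), 1)  # 10% bins relative to the column max
iris %>%
  dplyr::mutate(bin = rel_bin(Sepal.Length)) %>%
  dplyr::distinct(bin, .keep_all = TRUE) %>%
  dplyr::select(-bin)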
If you want to add new columns with the distance from the max of each column, this can be an option:
iris %>%
  mutate(across(where(is.numeric),
                .fns = list(dist_max = ~ round((max(.x) - .x) / max(.x), 1)),
                .names = "{.fn}_{.col}"))
Not sure if any of these answers your question.
I have read several other posts where people seem to have had the same problem. I am using the example of the iris dataset. I want to find the max by group but instead it is giving me the max across the whole dataset.
I tried detach(plyr) because it said to make sure you load dplyr after plyr. I also tried adding dplyr:: before the commands. But neither of those seems to make a difference. I am using dplyr version 1.0.2.
This is the code I am using. I am new to posting, so I am not sure how to show the mistake in the data or how to make the data show up correctly. This is what I get, but max_sepal should be 5.8 for the first group. Thank you for your help!
iris_1 <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::mutate(max_sepal = max(iris$Sepal.Length, na.rm = TRUE))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species max_sepal
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 7.9
2 4.9 3 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.9
5 5 3.6 1.4 0.2 setosa 7.9
6 5.4 3.9 1.7 0.4 setosa 7.9
7 4.6 3.4 1.4 0.3 setosa 7.9
8 5 3.4 1.5 0.2 setosa 7.9
9 4.4 2.9 1.4 0.2 setosa 7.9
10 4.9 3.1 1.5 0.1 setosa 7.9
What you are likely looking for is the summarise() function, which is also part of dplyr. The immediate bug in your code, though, is the iris$ prefix: max(iris$Sepal.Length) reaches back to the whole ungrouped data frame, so the grouping is ignored. A quick distinction between mutate and summarise is below.
mutate() either changes an existing column or adds a new one.
summarise() calculates a single value (per group).
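If you want to keep all rows and just add the per-group maximum, dropping that prefix from your mutate() call is enough:
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::mutate(max_sepal = max(Sepal.Length, na.rm = TRUE))  # bare column name respects the grouping
For a single row per group, summarise() does the job: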
iris %>%
  group_by(Species) %>%
  summarise(max_sepal = max(Sepal.Length, na.rm = TRUE))
# A tibble: 3 x 2
Species max_sepal
<fct> <dbl>
1 setosa 5.8
2 versicolor 7
3 virginica 7.9
You can see some more examples of this below:
https://community.rstudio.com/t/what-is-difference-between-mutate-and-summarise/23103/3
I am using the iris dataframe:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
7 virginica 6.7 3.3 5.7 2.5
8 virginica 6.7 3.0 5.2 2.3
9 virginica 6.3 2.5 5.0 1.9
10 virginica 6.5 3.0 5.2 2.0
11 virginica 6.2 3.4 5.4 2.3
12 virginica 5.9 3.0 5.1 1.8
I would like to group the dataframe by Species and summarise the data using a custom generic function
My proposed code is the following:
iris %>% group_by(Species) %>% summarise(MySummary= GenericSummaryFunction(.))
GenericSummaryFunction <- function (x){.....}
The problem I am facing is that the dataframe being passed to GenericSummaryFunction is ungrouped, thus the output is not group specific.
Species MySummary
<fct> <dbl>
1 setosa 5.80
2 virginica 5.80
I am not sure what to replace "." with in iris %>% group_by(Species) %>% summarise(MySummary = GenericSummaryFunction(.)) in order to pass the grouped data rather than the whole dataframe. I am using dplyr 0.8.3.
Try with this and tell me if you have problems.
It should work with older versions of dplyr.
library(dplyr)
library(tidyr)
library(purrr)
GenericSummaryFunction <- function(x) {
  # just something meaningless that returns a single number
  sum(x)
}
iris %>%
  nest(-Species) %>%                                    # tidyr < 1.0 syntax; newer: nest(data = -Species)
  mutate(data = map(data, GenericSummaryFunction)) %>%  # apply the function to each group's nested tibble
  unnest()                                              # tidyr < 1.0; newer: unnest(data)
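With a more recent dplyr (>= 1.1), a sketch of the same idea without nesting, using pick() to hand each group's data frame to the function (assuming GenericSummaryFunction accepts a data frame and returns a single value):
iris %>%
  group_by(Species) %>%
  summarise(MySummary = GenericSummaryFunction(pick(everything())))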
Given a data.frame and a set of columns, I'd like to write a csv file (or text file in general) for each of those columns: each file should contain all columns, but with the rows filtered based on the respective column.
For example, say I'd like to save a file each for Sepal.Width and Sepal.Length, containing the top 5 rows for each respectively:
top_n(iris, 5, Sepal.Width)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.4 3.9 1.7 0.4 setosa
# 2 5.8 4.0 1.2 0.2 setosa
# 3 5.7 4.4 1.5 0.4 setosa
# 4 5.4 3.9 1.3 0.4 setosa
# 5 5.2 4.1 1.5 0.1 setosa
# 6 5.5 4.2 1.4 0.2 setosa
# this should go in top5_Sepal.Width.csv
top_n(iris, 5, Sepal.Length)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 7.7 3.8 6.7 2.2 virginica
# 2 7.7 2.6 6.9 2.3 virginica
# 3 7.7 2.8 6.7 2.0 virginica
# 4 7.9 3.8 6.4 2.0 virginica
# 5 7.7 3.0 6.1 2.3 virginica
# this should go in top5_Sepal.Length.csv
I've tried something like the code below, however I don't know how to write the mywrite function, i.e., how to access the whole data.frame for filtering (.x only contains the column):
myvars <- c("Sepal.Width", "Sepal.Length")
tmp <- iris %>%
  map_at(myvars, ~mywrite(.x))
Alternatively, purrr::map2 allows tracking names, but doesn't come in an _at flavour (I guess the filtering could then be done inside mywrite).
However, there is again no access to the whole iris data.frame:
tmp <- iris %>%
  map2(., colnames(iris), ~mywrite(.x, .y))
As a third option, I think one could loop over the column names, possibly using tidy evaluation, i.e. mycol <- sym(myvars[i]) and !!mycol, but ideally I wanted to refrain from for loops.
Note, this is a somewhat silly toy example which only serves to illustrate the issue.
Note2, this answer is similar but based on groups within a column rather than individual columns:
We can use map with non-standard evaluation to get the top 5 values for each entry of myvars:
library(dplyr)
purrr::map(myvars, ~top_n(iris, 5, !!sym(.x)))
#[[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.4 3.9 1.7 0.4 setosa
#2 5.8 4.0 1.2 0.2 setosa
#3 5.7 4.4 1.5 0.4 setosa
#4 5.4 3.9 1.3 0.4 setosa
#5 5.2 4.1 1.5 0.1 setosa
#6 5.5 4.2 1.4 0.2 setosa
#[[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 7.7 3.8 6.7 2.2 virginica
#2 7.7 2.6 6.9 2.3 virginica
#3 7.7 2.8 6.7 2.0 virginica
#4 7.9 3.8 6.4 2.0 virginica
#5 7.7 3.0 6.1 2.3 virginica
If you want to write each part to csv, you can extend the pipe to
map(myvars, ~top_n(iris, 5, !!sym(.x)) %>% write.csv(paste0("top5_", .x, ".csv")))
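Side note: since writing files is a side effect, purrr::walk() is arguably more idiomatic than map() here; it performs the same calls but returns its input invisibly instead of building a list:
purrr::walk(myvars, ~top_n(iris, 5, !!sym(.x)) %>% write.csv(paste0("top5_", .x, ".csv")))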
In a data.table, if a certain column has identical values occurring consecutively over a certain number of times, I'd like to remove the corresponding rows. I also would like to do this by group.
For example, say dt is my data.table. I would like to remove rows if the same value occurs consecutively over 2 times in Petal.Width grouped by Species.
dt <- iris[c(1:3, 7, 51:53, 62:63), ]
setDT(dt)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 52 6.4 3.2 4.5 1.5 versicolor
# 53 6.9 3.1 4.9 1.5 versicolor
# 62 5.9 3.0 4.2 1.5 versicolor
# 63 6.0 2.2 4.0 1.0 versicolor
The desired outcome is a data.table with the following rows.
# 7 4.6 3.4 1.4 0.3 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 63 6.0 2.2 4.0 1.0 versicolor
Here is an option:
library(data.table)
library(data.table)
setDT(dt)[dt[, {
  rl <- rleid(Species, Petal.Width)   # run id: increments whenever Species or Petal.Width changes
  rw <- rowid(rl)                     # position within each run
  .I[!rl %in% rl[rw > 1]]             # keep only rows from runs of length 1
}]]
output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 4.6 3.4 1.4 0.3 setosa
2: 7.0 3.2 4.7 1.4 versicolor
3: 6.0 2.2 4.0 1.0 versicolor
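If it helps to see what the run ids look like, you can inspect them directly (this just prints the intermediate rleid values per row):
dt[, .(Species, Petal.Width, rl = rleid(Species, Petal.Width))]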
Here's an option:
library(data.table)
dt <- iris[c(1:3, 7:7, 51:53, 62:63), ]
setDT(dt)
dt[dt[, .I[.N < 3], by = .(rleid(Petal.Width), Species)]$V1]  # keep rows from runs shorter than 3
Thanks to @chinsoon12 for suggesting to wrap rleid() around Petal.Width to filter out consecutive values. Note that .N < 3 keeps runs of length 1 or 2, i.e. it drops values repeated more than 2 times consecutively; lower the threshold if runs of exactly 2 should also be dropped.