How to use the quantile function with dplyr summarize_at in R

I'm trying to calculate the 25th, 50th and 75th percentiles of all quantitative variables in the iris dataset, grouped by species. With dplyr::summarize_at it should be possible to do this in one step. I use the following code but I always get an error:
iris %>%
group_by(Species) %>%
summarize_at(dplyr::vars(c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")),
.funs=c("25%"=quantile(0.25),
"50%"=quantile(0.50),
"75%"=quantile(0.75)))
This is the error I get: "Error: expecting a one sided formula, a function, or a function name."
Thank you for your help.
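The error message points at the cause: every entry of .funs has to be a function, a function name or a one-sided formula, whereas quantile(0.25) is a call that gets evaluated straight away and returns numbers. A minimal sketch of one way around it, assuming dplyr 0.8 or later, is to wrap each call in a formula lambda. The labels q25, q50 and q75 are hypothetical names chosen here to avoid column names containing "%".

```r
library(dplyr)

# Each entry of .funs is a one-sided formula; .x stands for the column being summarised.
res_wide <- iris %>%
  group_by(Species) %>%
  summarise_at(
    vars(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
    list(
      q25 = ~ quantile(.x, 0.25),
      q50 = ~ quantile(.x, 0.50),
      q75 = ~ quantile(.x, 0.75)
    )
  )

res_wide
```

This returns one row per species, with one column per variable and quantile (for example Sepal.Length_q25), rather than one row per quantile.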

I can suggest a data.table solution. Unfortunately, I don't have a dplyr solution in mind.
dt <- data.table::as.data.table(iris)
dt <- dt[,lapply(.SD, quantile, probs = c(.25,.5,.75)),
.SDcols = c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"),
by = "Species"]
dt[,'quantile' := c("25%","50%","75%")]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width quantile
# 1: setosa 4.800 3.200 1.400 0.2 25%
# 2: setosa 5.000 3.400 1.500 0.2 50%
# 3: setosa 5.200 3.675 1.575 0.3 75%
# 4: versicolor 5.600 2.525 4.000 1.2 25%
# 5: versicolor 5.900 2.800 4.350 1.3 50%
# 6: versicolor 6.300 3.000 4.600 1.5 75%
# 7: virginica 6.225 2.800 5.100 1.8 25%
# 8: virginica 6.500 3.000 5.550 2.0 50%
# 9: virginica 6.900 3.175 5.875 2.3 75%
Hope that helps!

Using the development version of dplyr (0.8.9), we can use summarise with across. One drawback is that the quantile names are not returned, although we know them because the values come back in the order we requested:
iris %>%
group_by(Species) %>%
summarise(across(is.numeric,~c(`25%`=quantile(.x,0.25), `50%`=
quantile(.x,0.5),
`75%`= quantile(.x,0.75))))
The above is equivalent to:
iris %>%
group_by(Species) %>%
summarise_if(is.numeric,~c(`25%`=quantile(.x,0.25), `50%`=
quantile(.x,0.5),
`75%`= quantile(.x,0.75)))
Result:
# A tibble: 9 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.8 3.2 1.4 0.2
2 setosa 5 3.4 1.5 0.2
3 setosa 5.2 3.68 1.58 0.3
4 versicolor 5.6 2.52 4 1.2
5 versicolor 5.9 2.8 4.35 1.3
6 versicolor 6.3 3 4.6 1.5
7 virginica 6.22 2.8 5.1 1.8
8 virginica 6.5 3 5.55 2
9 virginica 6.9 3.18 5.88 2.3
One way to add the names of the quantiles is shown below. Note, however, that tibbles only recycle vectors of length one, which means we have to build the full-length column ourselves:
iris %>%
group_by(Species) %>%
summarise_if(is.numeric,~c(`25%`=quantile(.x,0.25), `50%`=
quantile(.x,0.5),
`75%`= quantile(.x,0.75))) %>%
mutate(quant= rep(c("25%","50%","75%"),nrow(.) / 3))
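You can also save the summarise result (res here) and fall back on good ol' base R recycling: res$quant <- c("25%","50%","75%"). With recent tibble versions this recycling is only allowed on a plain data frame, so convert first with res <- as.data.frame(res).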
# A tibble: 9 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width quant
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 setosa 4.8 3.2 1.4 0.2 25%
2 setosa 5 3.4 1.5 0.2 50%
3 setosa 5.2 3.68 1.58 0.3 75%
4 versicolor 5.6 2.52 4 1.2 25%
5 versicolor 5.9 2.8 4.35 1.3 50%
6 versicolor 6.3 3 4.6 1.5 75%
7 virginica 6.22 2.8 5.1 1.8 25%
8 virginica 6.5 3 5.55 2 50%
9 virginica 6.9 3.18 5.88 2.3 75%
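For newer installations, here is a hedged sketch of the same long result, assuming dplyr 1.1.0 or later, where reframe() is the verb intended for summaries that return several rows per group and where(is.numeric) is the current way to select columns by type. The object names probs and res_long are only illustrative.

```r
library(dplyr)

probs <- c(0.25, 0.50, 0.75)

res_long <- iris %>%
  group_by(Species) %>%
  reframe(
    # three values per numeric column and group, without the "25%" style names
    across(where(is.numeric), ~ quantile(.x, probs, names = FALSE)),
    # label column built from probs, so nothing has to be repeated by hand
    quant = paste0(probs * 100, "%")
  )

res_long
```

Because quant has the same length as the quantile vectors within each group, no manual rep() is needed.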

Related

Sample from groups, but n varies per group in R

I am trying to randomly sample n times a given grouped variable, but the n varies by the group. For example:
library(dplyr)
iris <- iris %>% mutate(len_bin=cut(Sepal.Length,seq(0,8,by=1)))
I have these factors, which are my grouped variable:
table(iris$len_bin)
(4,5] (5,6] (6,7] (7,8]
32 57 49 12
Is there a way to randomly sample only these groups n times, n being the number of times each element is present in this vector:
x <- c("(4,5]","(5,6]","(5,6]","(5,6]","(6,7]")
The result should look like:
# Groups: len_bin [4]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species len_bin
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 5 2 3.5 1 versicolor (4,5]
2 5.3 3.7 1.5 0.2 setosa (5,6]
2 5.3 3.7 1.5 0.2 setosa (5,6]
2 5.3 3.7 1.5 0.2 setosa (5,6]
3 6.5 3 5.8 2.2 virginica (6,7]
I managed to do this with a for loop and using sample_n() based on the vector. I am assuming there must be a faster way. Can I define n within sample_n() for example?
In base R you can do:
iris <- iris %>% mutate(len_bin = cut(Sepal.Length, seq(4, 8, by = 1)))
x <- c("(4,5]","(5,6]","(5,6]","(5,6]","(6,7]")
l <- mapply(\(x, y) x[sample(nrow(x), y), ],
split(iris, iris$len_bin),
c(table(factor(x, levels = levels(iris$len_bin)))),
SIMPLIFY = FALSE)
do.call(rbind.data.frame, l)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species len_bin
#(4,5] 5.0 3.2 1.2 0.2 setosa (4,5]
#(5,6].17 5.4 3.9 1.3 0.4 setosa (5,6]
#(5,6].63 6.0 2.2 4.0 1.0 versicolor (5,6]
#(5,6].97 5.7 2.9 4.2 1.3 versicolor (5,6]
#(6,7] 6.9 3.1 5.1 2.3 virginica (6,7]
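If you would rather stay inside a dplyr pipeline, the following is a rough sketch of the same idea: turn x into a small lookup table of draw counts, join it onto the data, and keep a random subset of row positions within each group. The names dat, wanted, n_wanted and sampled are hypothetical, and the draw is without replacement.

```r
library(dplyr)

set.seed(42)  # only to make the sketch reproducible

dat <- iris %>%
  mutate(len_bin = as.character(cut(Sepal.Length, seq(4, 8, by = 1))))

x <- c("(4,5]", "(5,6]", "(5,6]", "(5,6]", "(6,7]")

# one row per requested bin, with the number of draws for that bin
counts <- table(x)
wanted <- data.frame(len_bin = names(counts),
                     n_wanted = as.integer(counts),
                     stringsAsFactors = FALSE)

sampled <- dat %>%
  inner_join(wanted, by = "len_bin") %>%   # drops bins that were not requested
  group_by(len_bin) %>%
  # sample n_wanted row positions out of the n() rows in this group
  filter(row_number() %in% sample(n(), first(n_wanted))) %>%
  ungroup() %>%
  select(-n_wanted)

sampled
```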

How to get names inside enframe

I want to get a group-wise summary of multiple columns in a data frame. I'm using dplyr::group_by and dplyr::summarise_if to get the results, but I'm unable to name the resulting columns after the columns that are being summarised.
The following example illustrates this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
iris %>%
group_by(Species) %>%
summarise_if(.predicate = is.numeric,
.funs = ~ list(enframe(x = summary(object = .)))) %>%
unnest() %>%
select(which(x = !duplicated(x = lapply(X = .,
FUN = summary))))
#> # A tibble: 18 x 6
#> Species name value value1 value2 value3
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa Min. 4.3 2.3 1 0.1
#> 2 setosa 1st Qu. 4.8 3.2 1.4 0.2
#> 3 setosa Median 5 3.4 1.5 0.2
#> 4 setosa Mean 5.01 3.43 1.46 0.246
#> 5 setosa 3rd Qu. 5.2 3.68 1.58 0.3
#> 6 setosa Max. 5.8 4.4 1.9 0.6
#> 7 versicolor Min. 4.9 2 3 1
#> 8 versicolor 1st Qu. 5.6 2.52 4 1.2
#> 9 versicolor Median 5.9 2.8 4.35 1.3
#> 10 versicolor Mean 5.94 2.77 4.26 1.33
#> 11 versicolor 3rd Qu. 6.3 3 4.6 1.5
#> 12 versicolor Max. 7 3.4 5.1 1.8
#> 13 virginica Min. 4.9 2.2 4.5 1.4
#> 14 virginica 1st Qu. 6.22 2.8 5.1 1.8
#> 15 virginica Median 6.5 3 5.55 2
#> 16 virginica Mean 6.59 2.97 5.55 2.03
#> 17 virginica 3rd Qu. 6.9 3.18 5.88 2.3
#> 18 virginica Max. 7.9 3.8 6.9 2.5
Created on 2019-05-15 by the reprex package (v0.2.1)
As you can see, the columns are named value, value1, etc, whereas I'd like them to be Sepal.Length, Sepal.Width, etc. After I get this result, of course it is possible to name the columns manually, but I guess there's a better way to do it using the value argument of tibble::enframe.
As an alternative, I'm currently using the following method. It requires fake data, which is not ideal either.
iris %>%
group_by(Species) %>%
summarise_if(.predicate = is.numeric,
.funs = ~ list(summary(object = .))) %>%
unnest() %>%
group_by(Species) %>%
mutate(Statistic = names(x = summary(object = rnorm(n = 1)))) %>%
ungroup() %>%
select(Species, Statistic, everything())
Any help will be appreciated.
Maybe this way? I didn't sort the statistics within each Species, but I don't think that matters.
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise_if(is.numeric, ~ list(enframe(summary(.)))) %>%
gather('key', 'value', -Species) %>%
unnest() %>%
spread(key, value)
## A tibble: 18 x 6
# Species name Petal.Length Petal.Width Sepal.Length Sepal.Width
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 1st Qu. 1.4 0.2 4.8 3.2
# 2 setosa 3rd Qu. 1.58 0.3 5.2 3.68
# 3 setosa Max. 1.9 0.6 5.8 4.4
# 4 setosa Mean 1.46 0.246 5.01 3.43
# 5 setosa Median 1.5 0.2 5 3.4
# 6 setosa Min. 1 0.1 4.3 2.3
# 7 versicolor 1st Qu. 4 1.2 5.6 2.52
# 8 versicolor 3rd Qu. 4.6 1.5 6.3 3
# 9 versicolor Max. 5.1 1.8 7 3.4
#10 versicolor Mean 4.26 1.33 5.94 2.77
#11 versicolor Median 4.35 1.3 5.9 2.8
#12 versicolor Min. 3 1 4.9 2
#13 virginica 1st Qu. 5.1 1.8 6.22 2.8
#14 virginica 3rd Qu. 5.88 2.3 6.9 3.18
#15 virginica Max. 6.9 2.5 7.9 3.8
#16 virginica Mean 5.55 2.03 6.59 2.97
#17 virginica Median 5.55 2 6.5 3
#18 virginica Min. 4.5 1.4 4.9 2.2
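With tidyr 1.0 or later, a similar result can be sketched with pivot_longer and pivot_wider, which keeps the original column names without gather/spread and preserves the natural order of the statistics. The helper summary_tbl and the column names variable, obs and Statistic are made up for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# hypothetical helper: summary() of a numeric vector as a two-column tibble
summary_tbl <- function(x) {
  s <- summary(x)
  tibble(Statistic = names(s), value = as.numeric(s))
}

res <- iris %>%
  pivot_longer(-Species, names_to = "variable", values_to = "obs") %>%
  group_by(Species, variable) %>%
  summarise(stats = list(summary_tbl(obs)), .groups = "drop") %>%
  unnest(stats) %>%
  pivot_wider(names_from = variable, values_from = value)

res
```

The result has one row per Species and Statistic, with one column per original variable.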

Select all rows which are duplicates except for one column

I want to find rows in a dataset where the values in all columns, except for one, match. After much messing around trying unsuccessfully to get duplicated() to return all instances of the duplicate rows (not just the first instance), I figured out a way to do it (below).
For example, I want to identify all rows in the Iris dataset that are equal except for Petal.Width.
require(tidyverse)
x = iris%>%select(-Petal.Width)
dups = x[x%>%duplicated(),]
answer = iris%>%semi_join(dups)
> answer
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.1 1.5 0.1 setosa
3 4.8 3.0 1.4 0.1 setosa
4 5.1 3.5 1.4 0.3 setosa
5 4.9 3.1 1.5 0.2 setosa
6 4.8 3.0 1.4 0.3 setosa
7 5.8 2.7 5.1 1.9 virginica
8 6.7 3.3 5.7 2.1 virginica
9 6.4 2.8 5.6 2.1 virginica
10 6.4 2.8 5.6 2.2 virginica
11 5.8 2.7 5.1 1.9 virginica
12 6.7 3.3 5.7 2.5 virginica
As you can see, that works, but this is one of those times when I'm almost certain that lots of other folks need this functionality, and that I'm ignorant of a single function that does this in fewer steps or in a generally tidier way. Any suggestions?
An alternate approach, from at least two other posts, applied to this case would be:
answer = iris[duplicated(iris[-4]) | duplicated(iris[-4], fromLast = TRUE),]
But that also seems like just a different workaround instead of a single function. Both approaches take the same amount of time (0.08 sec on my system). Is there no neater/faster way of doing this?
e.g. something like
iris%>%duplicates(all=TRUE,ignore=Petal.Width)
iris[duplicated(iris[,-4]) | duplicated(iris[,-4], fromLast = TRUE),]
Among the duplicate rows (ignoring column 4), duplicated(iris[,-4]) flags the second row of each duplicate set (rows 18, 35, 46, 133, 143 and 145), while duplicated(iris[,-4], fromLast = TRUE) flags the first row of each set (rows 1, 10, 13, 102, 125 and 129). Combining the two with | gives 12 TRUEs, so the expected rows are returned.
Or perhaps with dplyr: basically you group on all variables except Petal.Width, count how often each combination occurs, and keep the combinations that occur more than once.
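If you want this as a single call, the two duplicated() passes can be wrapped in a small base R helper. This is only a sketch; the function name duplicates_ignoring and its ignore argument are made up here and do not come from any package.

```r
# Return every row that has at least one twin once the `ignore` columns are left out.
duplicates_ignoring <- function(df, ignore) {
  key <- df[, setdiff(names(df), ignore), drop = FALSE]
  # forward pass flags later copies, backward pass flags the earlier ones
  is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  df[is_dup, , drop = FALSE]
}

answer <- duplicates_ignoring(iris, "Petal.Width")
nrow(answer)
```

Taking the column by name rather than by position means the helper keeps working if the column order changes.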
library(dplyr)
iris %>%
group_by_at(vars(-Petal.Width)) %>%
filter(n() > 1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.1 1.5 0.1 setosa
3 4.8 3.0 1.4 0.1 setosa
4 5.1 3.5 1.4 0.3 setosa
5 4.9 3.1 1.5 0.2 setosa
6 4.8 3.0 1.4 0.3 setosa
7 5.8 2.7 5.1 1.9 virginica
8 6.7 3.3 5.7 2.1 virginica
9 6.4 2.8 5.6 2.1 virginica
10 6.4 2.8 5.6 2.2 virginica
11 5.8 2.7 5.1 1.9 virginica
12 6.7 3.3 5.7 2.5 virginica
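In dplyr 1.0.0 and later the scoped verbs such as group_by_at are superseded, so the same grouping can be sketched with across() inside group_by(). The object name dupes is just for illustration.

```r
library(dplyr)

dupes <- iris %>%
  group_by(across(-Petal.Width)) %>%  # group on every column except Petal.Width
  filter(n() > 1) %>%                 # keep groups with more than one row
  ungroup()

dupes
```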
I think janitor can do this somewhat directly.
library(janitor)
get_dupes(iris, !Petal.Width)
# get_dupes(iris, !Petal.Width)[,names(iris)] # alternative: no count column
Sepal.Length Sepal.Width Petal.Length Species dupe_count Petal.Width
1 4.8 3.0 1.4 setosa 2 0.1
2 4.8 3.0 1.4 setosa 2 0.3
3 4.9 3.1 1.5 setosa 2 0.1
4 4.9 3.1 1.5 setosa 2 0.2
5 5.1 3.5 1.4 setosa 2 0.2
6 5.1 3.5 1.4 setosa 2 0.3
7 5.8 2.7 5.1 virginica 2 1.9
8 5.8 2.7 5.1 virginica 2 1.9
9 6.4 2.8 5.6 virginica 2 2.1
10 6.4 2.8 5.6 virginica 2 2.2
11 6.7 3.3 5.7 virginica 2 2.1
12 6.7 3.3 5.7 virginica 2 2.5
I looked into the source of duplicated but would be interested to see if anyone can find anything faster; it might involve going to Rcpp or something similar, though. On my machine the base method is the fastest, and your original method is actually faster than the most readable dplyr method. I think wrapping a function like this for your own purposes ought to be sufficient, since your run times don't seem excessively long anyway. You can simply do iris %>% opt2("Petal.Width") for pipeability if that's the main concern.
library(tidyverse)
library(microbenchmark)
opt1 <- function(df, ignore) {
ignore = enquo(ignore)
x <- df %>% select(-!!ignore)
dups <- x[x %>% duplicated(), ]
df %>% semi_join(dups)  # use the df argument rather than the global iris
}
opt2 <- function(df, ignore) {
index <- which(colnames(df) == ignore)
df[duplicated(df[-index]) | duplicated(df[-index], fromLast = TRUE), ]
}
opt3 <- function(df, ignore){
ignore <- enquo(ignore)
df %>%
group_by_at(vars(-!!ignore)) %>%
filter(n() > 1)
}
microbenchmark(
opt1 = suppressMessages(opt1(iris, Petal.Width)),
opt2 = opt2(iris, "Petal.Width"),
opt3 = opt3(iris, Petal.Width)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> opt1 3.427753 4.024185 4.851445 4.464072 5.069216 12.800890 100 b
#> opt2 1.712975 1.908130 2.403859 2.133632 2.542871 7.557102 100 a
#> opt3 6.604614 7.334304 8.461424 7.920369 8.919128 24.255678 100 c
Created on 2018-07-12 by the reprex package (v0.2.0).

How to pass a dataframe and uneven vectors as parameters in purrr map

I have a function with mixed data types. It takes a data frame and string variable as the input parameter.
library(dplyr)
myfunc <- function (dat=NULL,species=NULL,sepal_thres=NULL) {
dat %>%
filter(Species==species & Sepal.Length <= sepal_thres)
}
myfunc(dat=iris,species="virginica",sepal_thres=5)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 4.9 2.5 4.5 1.7 virginica
But I want to apply it with a list of vectors:
species_vecs <- c("virginica","setosa")
sepal_thres_vecs <- c(5, 6)
purrr::pmap(list(dat=iris, species=species_vecs, sepal_thres=sepal_thres_vecs), myfunc)
I got this error:
Error: Element 2 has length 2, not 1 or 5.
What's the right way to do it?
Note that the species and sepal_thres parameters are taken from this combination:
> expand.grid(species_vecs,sepal_thres_vecs) %>% rename(species=Var1, sepal_thres=Var2)
species sepal_thres
1 virginica 5
2 setosa 5
3 virginica 6
4 setosa 6
but dat as parameter is fixed.
pmap will use recycling if you have a length-1 element as part of your bigger list. In this case, you can pass iris as a list element within the full list to use it for each species-sepal combination.
Note that pmap goes through list elements with multiple values in the order they appear. If you want every combination of the species and sepal vectors in pmap you would need to create and give the full vectors as list elements (i.e., you would have to do the crossing yourself).
purrr::pmap(list(dat = list(iris), species = rep(species_vecs, 2),
sepal_thres = rep(sepal_thres_vecs, each = 2) ), myfunc)
[[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 2.5 4.5 1.7 virginica
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 4.6 3.4 1.4 0.3 setosa
6 5.0 3.4 1.5 0.2 setosa
...
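To avoid writing the rep() calls by hand, one option is to let expand.grid build the crossing and hand the resulting data frame to pmap, which then iterates over its rows. This is a sketch under the assumption that dat stays fixed, so it is supplied inside an anonymous function; the names args_df and results are illustrative.

```r
library(dplyr)
library(purrr)

myfunc <- function(dat = NULL, species = NULL, sepal_thres = NULL) {
  dat %>%
    filter(Species == species & Sepal.Length <= sepal_thres)
}

species_vecs <- c("virginica", "setosa")
sepal_thres_vecs <- c(5, 6)

# every combination of species and threshold, one row per call
args_df <- expand.grid(species = species_vecs,
                       sepal_thres = sepal_thres_vecs,
                       stringsAsFactors = FALSE)

# pmap matches the data frame columns to the argument names of the function
results <- pmap(args_df, function(species, sepal_thres) {
  myfunc(dat = iris, species = species, sepal_thres = sepal_thres)
})

length(results)
```

The elements of results follow the row order of args_df, so results[[3]] corresponds to virginica with a threshold of 6.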
You can use this solution :
expand.grid(species_vecs,sepal_thres_vecs) %>%
rename(species=Var1, sepal_thres=Var2) %>%
as_tibble() %>%
mutate(sum = map2(as.character(species), sepal_thres,myfunc,dat = iris)) %>%
unnest(sum)
You could use Vectorize
input <- expand.grid(species_vecs,sepal_thres_vecs,stringsAsFactors = F) %>% rename(species=Var1, sepal_thres=Var2)
# species sepal_thres
# 1 virginica 5
# 2 setosa 5
# 3 virginica 6
# 4 setosa 6
output <- Vectorize(myfunc,c("species","sepal_thres"),SIMPLIFY=F)(dat=iris,species=input[[1]],sepal_thres=input[[2]])
output[[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.9 2.5 4.5 1.7 virginica
output[[3]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.8 2.7 5.1 1.9 virginica
# 2 4.9 2.5 4.5 1.7 virginica
# 3 5.7 2.5 5.0 2.0 virginica
# 4 5.8 2.8 5.1 2.4 virginica
# 5 6.0 2.2 5.0 1.5 virginica
# 6 5.6 2.8 4.9 2.0 virginica
# 7 6.0 3.0 4.8 1.8 virginica
# 8 5.8 2.7 5.1 1.9 virginica
# 9 5.9 3.0 5.1 1.8 virginica

How to replicate a ddply behavior that uses a custom function with dplyr?

I'm trying to replace all my plyr calls with dplyr. There are still a few snags and one of them is with the group_by function. I imagine it acts the same way as the second ddply argument and does a split, apply and combine based on the grouping variables I list. But that doesn't appear to be the case. Here is a rather trivial example.
Let's define a silly function
mm <- function(x) return(x[1:5, ])
Now we can split the iris dataset by species like so and apply this function to each piece.
ddply(iris, .(Species), mm)
This works as intended. However, when I try the same with dplyr, it doesn't work as expected.
iris %>% group_by(Species) %>% mm
What am I doing wrong?
As shown in ?do, you can refer to a group with . in your expression. The following will replicate your ddply output:
iris %>% group_by(Species) %>% do(.[1:5, ])
# Source: local data frame [15 x 5]
# Groups: Species
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 7.0 3.2 4.7 1.4 versicolor
# 7 6.4 3.2 4.5 1.5 versicolor
# 8 6.9 3.1 4.9 1.5 versicolor
# 9 5.5 2.3 4.0 1.3 versicolor
# 10 6.5 2.8 4.6 1.5 versicolor
# 11 6.3 3.3 6.0 2.5 virginica
# 12 5.8 2.7 5.1 1.9 virginica
# 13 7.1 3.0 5.9 2.1 virginica
# 14 6.3 2.9 5.6 1.8 virginica
# 15 6.5 3.0 5.8 2.2 virginica
More generally, to apply a custom function to groups with dplyr, you can do something like the following (thanks @docendodiscimus):
iris %>% group_by(Species) %>% do(mm(.))
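The reason the plain pipe falls short is that iris %>% group_by(Species) %>% mm passes the whole grouped data frame to mm once, so only its first five rows come back. As a hedged alternative to do(), assuming dplyr 0.8.1 or later, group_modify() calls a function once per group and binds the results. The names whole and per_group are illustrative.

```r
library(dplyr)

mm <- function(x) x[1:5, ]

# mm sees the entire grouped data frame, so this keeps 5 rows in total
whole <- iris %>% group_by(Species) %>% mm()

# group_modify calls mm on each group's rows (.x) and adds Species back
per_group <- iris %>%
  group_by(Species) %>%
  group_modify(~ mm(.x)) %>%
  ungroup()

nrow(whole)
nrow(per_group)
```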
slice() was created for exactly this:
library(dplyr)
iris %>% group_by(Species) %>% slice(1:5)
#> # A tibble: 15 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 7 3.2 4.7 1.4 versicolor
#> 7 6.4 3.2 4.5 1.5 versicolor
#> 8 6.9 3.1 4.9 1.5 versicolor
#> 9 5.5 2.3 4 1.3 versicolor
#> 10 6.5 2.8 4.6 1.5 versicolor
#> 11 6.3 3.3 6 2.5 virginica
#> 12 5.8 2.7 5.1 1.9 virginica
#> 13 7.1 3 5.9 2.1 virginica
#> 14 6.3 2.9 5.6 1.8 virginica
#> 15 6.5 3 5.8 2.2 virginica
