Creating multiple rank columns dplyr - r

I have a df with several columns, and I'd like to rank them. I can do them one at a time like this:
iris.ranked <-
iris %>%
arrange(Sepal.Length) %>%
mutate(Sepal.Length = rank(Sepal.Length))
But there are lots of columns, and this is clunky. I'd rather feed in a list of columns and rank them all in one code chunk. I was thinking of something like this, but it doesn't work:
iris.ranked.all <-
iris %>%
mutate_at(
c('Sepal.Length',
'Sepal.Width',
'Petal.Width',
'Petal.Length'),
function(x) arrange(x) %>% rank()
)

Use mutate(across()) from dplyr:
library(dplyr)
iris |>
mutate(
across(
Sepal.Length:Petal.Width,
rank,
.names = "rank_{.col}")
)
# # A tibble: 150 x 9
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species rank_Sepal.Length rank_Sepal.Width rank_Petal.Length rank_Petal.Width
# <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 37 128. 18 20
# 2 4.9 3 1.4 0.2 setosa 19.5 70.5 18 20
# 3 4.7 3.2 1.3 0.2 setosa 10.5 101 8 20
# 4 4.6 3.1 1.5 0.2 setosa 7.5 89 31 20
# 5 5 3.6 1.4 0.2 setosa 27.5 134. 18 20
# 6 5.4 3.9 1.7 0.4 setosa 49.5 146. 46.5 45
# 7 4.6 3.4 1.4 0.3 setosa 7.5 120. 18 38
# 8 5 3.4 1.5 0.2 setosa 27.5 120. 31 20
# 9 4.4 2.9 1.4 0.2 setosa 3 52.5 18 20
# 10 4.9 3.1 1.5 0.1 setosa 19.5 89 31 3
# # ... with 140 more rows
Or if in fact you want to overwrite the columns as your question suggests, omit the .names argument:
iris |>
mutate(
across(
Sepal.Length:Petal.Width,
rank)
)
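The attempt in the question fails because arrange() expects a data frame, not a single column vector, and sorting is not needed before rank() anyway. The fractional values in the output above (such as 19.5) come from rank()'s default of averaging tied values. A minimal sketch of the difference, assuming a small made-up tibble named `scores` and using dplyr's min_rank() as one alternative tie rule:

```r
library(dplyr)

# Hypothetical toy data with a tie in column `a`.
scores <- tibble(a = c(10, 20, 10), b = c(3, 1, 2))

ranked <- scores |>
  mutate(
    # base::rank() averages tied positions (its default ties.method)
    across(a:b, rank, .names = "avg_{.col}"),
    # dplyr::min_rank() gives every tied value the smallest rank
    across(a:b, min_rank, .names = "min_{.col}")
  )

ranked
```

Here the two tied values in `a` get 1.5 from rank() and 1 from min_rank(), so the choice depends on how ties should be reported.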

Related

How to detect if data.frame is grouped by dplyr from subfunction?

I have an R package where some functions are designed to be typically called within dplyr functions mutate or summarize.
newdata <- dplyr::mutate(group_by(olddata, col1), newcol = myfunc(col1))
However, sometimes users might forget to group their data before putting it into the mutate or summarize call.
newdata <- dplyr::mutate(olddata, newcol = myfunc(col1))
When the data frame is not grouped first, the package functions will produce largely nonsensical results. However, there won't be any errors or warnings per se, which could leave users uncertain about the cause of the issue.
I'd like to add a warning() within the myfunc code when myfunc detects that the input data isn't coming from a grouped data.frame. However, I can't figure out how myfunc could detect whether the data is coming from a grouped data.frame. It appears that mutate only passes a vector to myfunc, so both dplyr::is.grouped_df and inherits(x, "grouped_df") return false.
What I would like:
myfunc <- function(x) {if(comes.from.grouped.df) {print("grouped")} else {print("ungrouped")}}
mutate(olddata, newcol = myfunc(col1))
'ungrouped'
mutate(group_by(olddata, col1), newcol = myfunc(col1))
'grouped'
'grouped'
'grouped'
If you want your function used within a specific context, and emit a warning if the data frame is not grouped, then you can do:
library(tidyverse)
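myfunc <- function(x) {
  # Heuristic used here: treat a calling frame that contains only `~`
  # as dplyr's evaluation environment
  if(all(ls(envir = parent.frame()) == "~")) {
    ss <- sys.status()
    # Name of the function invoked by each call on the stack
    funcs <- sapply(ss$sys.calls, function(x) deparse(as.list(x)[[1]]))
    wf <- which(funcs == "mutate")
    if(length(wf) == 0) stop("`myfunc` must be called from inside `mutate`")
    # Use the innermost (most recent) mutate call
    wf <- max(wf)
    # Look up the `.data` argument in that mutate frame
    data <- eval(substitute(.data), ss$sys.frames[[wf]])
    if(!inherits(data, "grouped_df")) {
      warning("`myfunc` called on an ungrouped data frame / tibble.")
    }
    return(x^2)
  }
  stop("`myfunc` must be called from inside `mutate`")
}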
Used outside mutate, we get an error:
myfunc(1:10)
#> Error in myfunc(1:10): `myfunc` must be called from inside `mutate`
With an ungrouped data frame or tibble we get a warning:
tibble(iris) %>%
mutate(x = myfunc(Sepal.Length))
#> Warning in myfunc(Sepal.Length): `myfunc` called on an ungrouped data frame /
#> tibble.
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 26.0
#> 2 4.9 3 1.4 0.2 setosa 24.0
#> 3 4.7 3.2 1.3 0.2 setosa 22.1
#> 4 4.6 3.1 1.5 0.2 setosa 21.2
#> 5 5 3.6 1.4 0.2 setosa 25
#> 6 5.4 3.9 1.7 0.4 setosa 29.2
#> 7 4.6 3.4 1.4 0.3 setosa 21.2
#> 8 5 3.4 1.5 0.2 setosa 25
#> 9 4.4 2.9 1.4 0.2 setosa 19.4
#> 10 4.9 3.1 1.5 0.1 setosa 24.0
#> # ... with 140 more rows
And it runs without complaint if the tibble is grouped:
tibble(iris) %>%
group_by(Species) %>%
mutate(x = myfunc(Sepal.Length))
#> # A tibble: 150 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 26.0
#> 2 4.9 3 1.4 0.2 setosa 24.0
#> 3 4.7 3.2 1.3 0.2 setosa 22.1
#> 4 4.6 3.1 1.5 0.2 setosa 21.2
#> 5 5 3.6 1.4 0.2 setosa 25
#> 6 5.4 3.9 1.7 0.4 setosa 29.2
#> 7 4.6 3.4 1.4 0.3 setosa 21.2
#> 8 5 3.4 1.5 0.2 setosa 25
#> 9 4.4 2.9 1.4 0.2 setosa 19.4
#> 10 4.9 3.1 1.5 0.1 setosa 24.0
#> # ... with 140 more rows
Created on 2023-02-15 with reprex v2.0.2
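The answer above inspects the call stack, which ties it to the name `mutate` and to dplyr internals. A possible alternative, sketched here under the assumption that dplyr's context function cur_group() is available (dplyr 1.0.0 or later), is to ask dplyr for the current group keys: for ungrouped data the returned tibble has no columns. The names `in_grouped_context` and `myfunc2` are hypothetical and not part of the original answer.

```r
library(dplyr)

# Hypothetical helper: TRUE when the enclosing dplyr verb is working on
# grouped data, FALSE when the data is ungrouped. cur_group() raises an
# error when called outside a dplyr verb, so this helper does too.
in_grouped_context <- function() {
  ncol(dplyr::cur_group()) > 0
}

# Hypothetical variant of myfunc that warns instead of walking the stack.
myfunc2 <- function(x) {
  if (!in_grouped_context()) {
    warning("`myfunc2` called on an ungrouped data frame / tibble.")
  }
  x^2
}

# The helper can also be checked directly inside mutate():
ungrouped_flag <- mutate(iris, g = in_grouped_context())$g
grouped_flag   <- mutate(group_by(iris, Species), g = in_grouped_context())$g
```

This version also works inside summarise() and other data-masking verbs, at the cost of depending on a newer dplyr.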

Add a row to each dataframe in a list with the column median using map_dfr

I have this example list that contains 3 dataframes:
library(tidyverse)
list_df <- iris %>%
group_by(Species) %>%
slice(1:3) %>%
ungroup() %>%
group_split(Species)
I want to add a new row at the end of each dataframe that shows the column median.
My attempt so far (and I am sure it worked earlier today) is not working:
list_df %>%
map_dfr([,1:4], ~ .x %>%
add_row(!!!map(., median)))
I want to learn why my code is not working and what exactly !!! is for in this situation.
The [, 1:4] on its own doesn't include the data, i.e. it is only an index with nothing to subset, and thus the call fails. Subset inside the lambda instead:
list_df %>%
map_dfr(~ .x %>%
add_row(!!! map(.[1:4], median)))
-output
# A tibble: 12 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.9 3.2 1.4 0.2 <NA>
5 7 3.2 4.7 1.4 versicolor
6 6.4 3.2 4.5 1.5 versicolor
7 6.9 3.1 4.9 1.5 versicolor
8 6.9 3.2 4.7 1.5 <NA>
9 6.3 3.3 6 2.5 virginica
10 5.8 2.7 5.1 1.9 virginica
11 7.1 3 5.9 2.1 virginica
12 6.3 3 5.9 2.1 <NA>
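On the second part of the question: `!!!` is rlang's splice operator. It takes a list and passes its elements to the function as if each had been typed as a separate (named) argument, which is what add_row() needs here. A small sketch, assuming a made-up two-column tibble `df`:

```r
library(tibble)
library(purrr)

# Hypothetical toy data.
df <- tibble(a = c(1, 2, 3), b = c(10, 20, 60))

# map() returns a named list with one median per column.
meds <- map(df, median)

# `!!!` splices that list into the call as separate named arguments ...
spliced <- add_row(df, !!!meds)

# ... so it should match writing the arguments out by hand.
by_hand <- add_row(df, a = 2, b = 20)
```

Without `!!!`, add_row() would receive a single unnamed list argument rather than one value per column.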
If we want to add a row with the group information, another option is group_modify (without splitting)
iris %>%
group_by(Species) %>%
slice(1:3) %>%
group_modify(~ .x %>%
add_row(!!! map(.x, median))) %>%
ungroup
-output
# A tibble: 12 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.9 3.2 1.4 0.2
5 versicolor 7 3.2 4.7 1.4
6 versicolor 6.4 3.2 4.5 1.5
7 versicolor 6.9 3.1 4.9 1.5
8 versicolor 6.9 3.2 4.7 1.5
9 virginica 6.3 3.3 6 2.5
10 virginica 5.8 2.7 5.1 1.9
11 virginica 7.1 3 5.9 2.1
12 virginica 6.3 3 5.9 2.1
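If we want to sum the per-group median rows into a single "Total" row (Species shows as integer codes in the output below because c() combines the factor with a character string):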
iris %>%
group_by(Species) %>%
slice(1:3) %>%
group_modify(~ .x %>%
add_row(!!! map(.x, median))) %>%
mutate(rn = row_number()) %>%
ungroup %>%
summarise(across(2:5, ~ c(.[rn < max(rn)],
sum(.[rn == max(rn)]))), Species = c(Species[rn != max(rn)],
"Total"))
-output
# A tibble: 10 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 7 3.2 4.7 1.4 2
5 6.4 3.2 4.5 1.5 2
6 6.9 3.1 4.9 1.5 2
7 6.3 3.3 6 2.5 3
8 5.8 2.7 5.1 1.9 3
9 7.1 3 5.9 2.1 3
10 18.1 9.4 12 3.8 Total

Can you list an exception to tidyselect `everything()`

library(tidyverse)
iris %>% as_tibble() %>% select(everything())
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
Say I want to select everything in the iris data frame except Species. How do I list this one exception while utilizing tidyselect::everything()?
My actual pipe is below:
... %>%
group_by(`ID`) %>%
fill(everything(), .direction = "updown") %>%
... %>%
It gives the following error:
Error: Column ID can't be modified because it's a grouping variable
You would do
iris %>% as_tibble() %>% select(-Species)
but assuming you have good reason not to want that, here's a way using everything()
iris %>% as_tibble() %>% select(setdiff(everything(), one_of("Species")))
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#> 7 4.6 3.4 1.4 0.3
#> 8 5 3.4 1.5 0.2
#> 9 4.4 2.9 1.4 0.2
#> 10 4.9 3.1 1.5 0.1
#> # ... with 140 more rows
(or just iris %>% as_tibble() %>% select(setdiff(everything(), 5)) if it's acceptable)
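For reference, tidyselect also lets a negation be combined with everything() directly, which avoids setdiff(). The sketch below assumes dplyr 1.0.0 or later (for the `!` operator and any_of()); the object names are hypothetical.

```r
library(dplyr)

# Three ways to express "everything except Species".
sel_not   <- iris %>% select(!Species)
sel_minus <- iris %>% select(everything(), -Species)
# any_of() takes a character vector and ignores names that are absent.
sel_any   <- iris %>% select(-any_of("Species"))

names(sel_not)
```

The same exclusion idea presumably carries over to the grouped fill() call in the question, by excluding the grouping column from the selection, though that case is not shown in the original answer and is untested here.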

preserve dataframe name using dplyr group_split [duplicate]

I'm using group_split from dplyr, but I need every dataframe in the list to keep its name.
Example from the dplyr documentation (notice the dataframes are numbered; the ideal output would have every dataframe named after its value of the grouping variable: setosa, versicolor, ...):
ir <- iris %>%
group_by(Species)
group_split(ir)
#> [[1]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 40 more rows
#>
#> [[2]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.4 3.2 4.5 1.5 versicolor
#> 3 6.9 3.1 4.9 1.5 versicolor
#> 4 5.5 2.3 4 1.3 versicolor
#> 5 6.5 2.8 4.6 1.5 versicolor
#> 6 5.7 2.8 4.5 1.3 versicolor
#> 7 6.3 3.3 4.7 1.6 versicolor
#> 8 4.9 2.4 3.3 1 versicolor
#> 9 6.6 2.9 4.6 1.3 versicolor
#> 10 5.2 2.7 3.9 1.4 versicolor
#> # … with 40 more rows
#>
#> [[3]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.3 3.3 6 2.5 virginica
#> 2 5.8 2.7 5.1 1.9 virginica
#> 3 7.1 3 5.9 2.1 virginica
#> 4 6.3 2.9 5.6 1.8 virginica
#> 5 6.5 3 5.8 2.2 virginica
#> 6 7.6 3 6.6 2.1 virginica
#> 7 4.9 2.5 4.5 1.7 virginica
#> 8 7.3 2.9 6.3 1.8 virginica
#> 9 6.7 2.5 5.8 1.8 virginica
#> 10 7.2 3.6 6.1 2.5 virginica
#> # … with 40 more rows
#>
#> attr(,"ptype")
#> # A tibble: 0 x 5
#> # … with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
#> # Petal.Length <dbl>, Petal.Width <dbl>, Species <fct>
group_split does not preserve names. From ?group_split:
"it does not name the elements of the list based on the grouping as this typically loses information and is confusing."
You could use base::split for that:
split(iris, iris$Species)
Or name the list of tibbles separately using setNames.
library(dplyr)
group_split(ir) %>% setNames(unique(iris$Species))
group_split splits on the factor levels of the grouping column, not on the order in which values first appear, so names taken from unique() only line up when the two orders agree. In the iris dataset the factor levels are in the same order as they occur in the data, hence the above works.
More generally, we should reorder the levels first:
iris %>%
mutate(Species= factor(Species, levels = unique(Species))) %>%
group_split(Species) %>%
setNames(unique(iris$Species))
We can also use set_names from purrr (loaded with tidyverse):
library(tidyverse)
ir %>%
group_split() %>%
set_names(levels(iris$Species))
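To avoid reasoning about level order at all, the names can be taken from the group keys that dplyr itself reports, since group_keys() lists the groups in the same order that group_split() returns them. A minimal sketch, assuming a made-up tibble whose values appear in a different order than they sort:

```r
library(dplyr)

# Hypothetical toy data: "b" appears before "a".
df  <- tibble(g = c("b", "a", "b"), x = 1:3)
gdf <- group_by(df, g)

# group_keys() returns one row per group, in group_split() order.
named <- setNames(group_split(gdf), group_keys(gdf)$g)

names(named)
```

With unique(df$g) the names would have been "b", "a" and therefore attached to the wrong pieces.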

R - Selecting Top Records but With a Grouping

Using the Iris dataframe I can pretty easily pull the first n = 100 records with:
m_data<-iris
m_data[1:100,]
But I am also interested in pulling the first 100 records based on a nice split of the Species. Assume for the moment that the first 100 records are all the same species - I would like to pull the data with a "first sampling" based on the varying Species instead.
Any suggestions are welcome. Thank you.
You can also do this with dplyr, here selecting the first 10 from each species:
library(dplyr)
iris %>%
group_by(Species) %>%
filter(row_number() <= 10) # or slice(1:10)
#> # A tibble: 30 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 20 more rows
Created on 2018-08-13 by the reprex package (v0.2.0).
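If the goal is roughly 100 rows spread evenly over the species, the per-group count can be derived from the target total. The sketch below assumes dplyr 1.0.0 or later for slice_head(); `total_n` and `per_group` are hypothetical names, and with three species an exact total of 100 cannot be split evenly, so the quota is rounded up.

```r
library(dplyr)

total_n   <- 100
# Round up so that every species contributes the same number of rows.
per_group <- ceiling(total_n / n_distinct(iris$Species))

first_by_species <- iris %>%
  group_by(Species) %>%
  slice_head(n = per_group) %>%
  ungroup()

count(first_by_species, Species)
```

Trimming the result with head(total_n) afterwards would give exactly 100 rows, at the cost of the last species being two rows short.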
Here's an alternative:
do.call(rbind, lapply(split(iris, iris$Species), head, 100))
This pulls up to the first 100 records of each Species (iris has only 50 rows per species, so here it returns every row).
You can use by instead of lapply
do.call(rbind, by(iris, iris$Species, head, 100))
