I would like to select the top n rows in a data frame for which I
calculated a column n that represents the sum of a variable. For example,
using the mtcars data, I would like to filter to keep only the two cyl
with the greatest sum of mpg. In the following example, I was expecting
to select all rows where cyl == 4 or cyl == 8. It must be simple, but
I cannot figure out my mistake.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(sum(mpg))
#> # A tibble: 3 x 2
#> cyl `sum(mpg)`
#> <dbl> <dbl>
#> 1 4 293.
#> 2 6 138.
#> 3 8 211.
mtcars %>%
group_by(cyl) %>% # Calculate the sum of mpg for each cyl
add_tally(mpg, sort = TRUE) %>%
ungroup() %>%
top_n(2, n)
#> # A tibble: 11 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb n
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 293.
#> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 293.
#> 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 293.
#> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 293.
#> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 293.
#> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 293.
#> 7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 293.
#> 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 293.
#> 9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 293.
#> 10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 293.
#> 11 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 293.
Created on 2019-07-26 by the reprex package (v0.3.0)
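top_n() ranks rows by the weighting variable and keeps every row whose rank falls within the top n, so it returns more than n rows when there are ties. It does not return the rows belonging to the top n distinct values. Here all 11 rows with cyl == 4 tie for first place (n is about 293), so the next rank is 12 and no other rows qualify.
From the documentation: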
Usage
top_n(x, n, wt)
Arguments
x: a tbl() to filter
n: number of rows to return. If x is grouped,
this is the number of rows per group. Will include more than n rows if
there are ties. If n is positive, selects the top n rows. If negative,
selects the bottom n rows.
You need, as suggested by @tmfmnk:
mtcars %>%
group_by(cyl) %>%
add_tally(mpg, sort = TRUE) %>%
ungroup() %>%
filter(dense_rank(desc(n)) < 3)
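Another way to reach the same rows, sketched here on the assumption that dplyr 1.0.0 or later is installed (for slice_max()), is to work out the two top cyl values first and then keep the matching rows with semi_join(). The names top_cyl, total_mpg and result are only illustrative.

```r
library(dplyr)

# Sum mpg per cyl and keep the two groups with the largest totals
top_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(total_mpg = sum(mpg)) %>%
  slice_max(total_mpg, n = 2)

# Keep only the rows of mtcars whose cyl is one of those two groups
result <- mtcars %>%
  semi_join(top_cyl, by = "cyl")

sort(unique(result$cyl))
```

This avoids adding a helper column to every row, at the cost of one extra intermediate table.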
I'd like to add a row for each group, where the entry for a particular column is the mean of the values of that column for that group. It's easy to add a constant value:
library(dplyr)
mtcars %>% group_by(cyl) %>% group_modify(~add_row(.x, .before=0, carb=2))
# A tibble: 35 x 11
# Groups: cyl [3]
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 NA NA NA NA NA NA NA NA NA 2
2 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1
3 4 24.4 147. 62 3.69 3.19 20 1 0 4 2
4 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
But when I try to dynamically add e.g. the mean of all carbs for that group, it doesn't recognise carb as a column:
mtcars %>% group_by(cyl) %>% group_modify(~add_row(.x, .before=0, carb=mean(carb)))
Error in mean(carb) : object 'carb' not found
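The error arises because add_row() does not evaluate its arguments against the columns of the data it receives, so a bare carb is not found. A minimal sketch of one possible fix, assuming the column is referenced explicitly through .x (the current group's data inside group_modify()); the name result is illustrative, and .before = 1 is used here to place the new row first:

```r
library(dplyr)
library(tibble)

# Inside group_modify(), .x holds the rows of the current group,
# so the column is referenced explicitly as .x$carb.
result <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ add_row(.x, carb = mean(.x$carb), .before = 1))

head(result, 3)
```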
Alternatively:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(carb = mean(carb)) %>%
bind_rows(mtcars) %>%
arrange(cyl)
#> # A tibble: 35 x 11
#> cyl carb mpg disp hp drat wt qsec vs am gear
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1.55 NA NA NA NA NA NA NA NA NA
#> 2 4 1 22.8 108 93 3.85 2.32 18.6 1 1 4
#> 3 4 2 24.4 147. 62 3.69 3.19 20 1 0 4
#> 4 4 2 22.8 141. 95 3.92 3.15 22.9 1 0 4
#> 5 4 1 32.4 78.7 66 4.08 2.2 19.5 1 1 4
#> 6 4 2 30.4 75.7 52 4.93 1.62 18.5 1 1 4
#> 7 4 1 33.9 71.1 65 4.22 1.84 19.9 1 1 4
#> 8 4 1 21.5 120. 97 3.7 2.46 20.0 1 0 3
#> 9 4 1 27.3 79 66 4.08 1.94 18.9 1 1 4
#> 10 4 2 26 120. 91 4.43 2.14 16.7 0 1 5
#> # ... with 25 more rows
How does one use an external list of variables with the dplyr::distinct() command in R?
For example, I want to use an external list of the following variables as the basis of a distinct command for the mtcars dataset:
## creates external_list_of_vars_df
# ---- NOTE: creates object
external_list_of_vars_df <-
data.frame(
external_list_of_vars_df =
c("gear", "carb", "am")
)
# ---- NOTE: turns object into tibble
external_list_of_vars_df <-
as_tibble(external_list_of_vars_df)
# ---- NOTE: displays data
external_list_of_vars_df
> external_list_of_vars_df
# A tibble: 3 × 1
external_list_of_vars_df
<chr>
1 gear
2 carb
3 am
I can use the long way, which requires inputting the variables of interest manually, to accomplish this task:
> mtcars_distinct_df_long
# A tibble: 13 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
5 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
6 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
9 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
10 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
11 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
12 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
13 15 8 301 335 3.54 3.57 14.6 0 1 5 8
When I try to use the shortcut, it does not work:
## my short way to create mtcars_distinct_df_external, by using the external list of variables
# ---- NOTE: creates object
mtcars_distinct_df_external <-
as_tibble(
mtcars %>%
distinct(vars(external_list_of_vars_df$external_list_of_vars_df), .keep_all = TRUE)
)
# ---- NOTE: displays data
mtcars_distinct_df_external
> mtcars_distinct_df_external
# A tibble: 1 × 12
mpg cyl disp hp drat wt qsec vs am gear carb `vars(external_list_of_vars_df$external_li…`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <quos>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 external_list_of_vars_df$external_list_of_v…
> # ---- NOTE: does not work
Is this task possible? If so, how?
Thanks ahead of time.
Here is the code I used to generate the example:
# how to use external list of vars for dplyr::distinct() command
## loads package(s)
if(!require(tidyverse)){install.packages("tidyverse")}
## data for example
mtcars
## creates external_list_of_vars_df
# ---- NOTE: creates object
external_list_of_vars_df <-
data.frame(
external_list_of_vars_df =
c("gear", "carb", "am")
)
# ---- NOTE: turns object into tibble
external_list_of_vars_df <-
as_tibble(external_list_of_vars_df)
# ---- NOTE: displays data
external_list_of_vars_df
## long way to create mtcars_distinct_df, by inputting variables manually
# ---- NOTE: creates object
mtcars_distinct_df_long <-
as_tibble(
mtcars %>%
distinct(gear, carb, am, .keep_all = TRUE)
)
# ---- NOTE: displays data
mtcars_distinct_df_long
## my short way to create mtcars_distinct_df_external, by using the external list of variables
# ---- NOTE: creates object
mtcars_distinct_df_external <-
as_tibble(
mtcars %>%
distinct(vars(external_list_of_vars_df$external_list_of_vars_df), .keep_all = TRUE)
)
# ---- NOTE: displays data
mtcars_distinct_df_external
# ---- NOTE: does not work
There are two ways to use an external vector of variable names inside dplyr verbs.
Using across(all_of()):
library(dplyr)
external_list_of_vars <- c("gear", "carb", "am")
mtcars %>%
distinct(across(all_of(external_list_of_vars)), .keep_all = TRUE)
Using tidy evaluation — specifically, the unquote-splice operator !!!:
mtcars %>%
distinct(!!!syms(external_list_of_vars), .keep_all = TRUE)
It wasn’t clear to me if your names vector is already inside a dataframe, or if that was just part of your attempt to solve the problem. If the former, you can replace external_list_of_vars in my code with external_list_of_vars_df$external_list_of_vars_df.
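In the original attempt, distinct() appears to have treated the vars(...) call as an expression to compute, which is why the result gained a single extra column instead of being de-duplicated on three. As a quick check, the sketch below compares both suggestions against the hand-typed version; the names by_hand, by_across and by_splice are illustrative only.

```r
library(dplyr)

external_list_of_vars <- c("gear", "carb", "am")

# Variables typed by hand
by_hand <- mtcars %>%
  distinct(gear, carb, am, .keep_all = TRUE)

# Variables taken from the character vector, via across(all_of())
by_across <- mtcars %>%
  distinct(across(all_of(external_list_of_vars)), .keep_all = TRUE)

# Variables taken from the character vector, via !!!syms()
by_splice <- mtcars %>%
  distinct(!!!syms(external_list_of_vars), .keep_all = TRUE)

# Each approach should keep the same number of rows
c(nrow(by_hand), nrow(by_across), nrow(by_splice))
```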
library(dplyr)
mtcars %>%
as_tibble() %>%
distinct(gear, carb, am, .keep_all = TRUE)
#> # A tibble: 13 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 4 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 5 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 6 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
#> 9 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 10 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 11 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 12 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
#> 13 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Created on 2022-02-25 by the reprex package (v2.0.1)
I would like to replace duplicate row values in a given column by appending an underscore with an index based on their incidence. For example:
old_df_col new_df_col
object object_1
object object_2
object object_3
object object_4
Most other questions focus on deleting duplicate values or replacing them with NA, so I wasn't able to find an implementation using R and dplyr.
Here's what I've worked out so far:
# count duplicates
mtcars %>% group_by(carb) %>% summarize(n=n())
# filter duplicates
mtcars %>% group_by(carb) %>% filter(n()>1)
You can group by the target variable and use row_number() to create the sequence.
Sorting the data set beforehand (using arrange()) can give the sequence some meaning for your data, but it is not strictly necessary.
library(dplyr)
mtcars %>% group_by(carb) %>%
arrange(carb, cyl, mpg, hp) %>%
mutate(
carb_seq = paste("carb", carb, "seq", row_number(), sep = "_")
)
# A tibble: 32 x 12
# Groups: carb [6]
mpg cyl disp hp drat wt qsec vs am gear carb carb_seq
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 carb_1_seq_1
2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 carb_1_seq_2
3 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 carb_1_seq_3
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 carb_1_seq_4
5 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 carb_1_seq_5
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 carb_1_seq_6
7 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 carb_1_seq_7
8 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 carb_2_seq_1
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 carb_2_seq_2
10 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 carb_2_seq_3
# … with 22 more rows
Created on 2021-07-11 by the reprex package (v2.0.0)
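To get labels in exactly the object_1, object_2 form from the question, the same idea can be applied with paste0(). This is a minimal sketch on made-up data: the data frame df and its values are hypothetical, and the column names are taken from the question.

```r
library(dplyr)

# Hypothetical data: a column with repeated values
df <- tibble(old_df_col = c("object", "object", "thing", "object"))

out <- df %>%
  group_by(old_df_col) %>%
  # row_number() counts occurrences within each group, in row order
  mutate(new_df_col = paste0(old_df_col, "_", row_number())) %>%
  ungroup()

out
```

Values that occur only once also receive a suffix (thing_1 here); filtering on n() > 1 before pasting would be one way to leave them untouched.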
I would like to group data and then arrange the table so that groups with the highest values are shown first. E.g. in the mtcars dataset, I would like to group the cars by number of cylinders and then arrange the table so that the groups with the highest mean mpg are shown first:
mtcars %>% group_by (cyl) %>% arrange (desc(mean (mpg)))
This produces an error:
Error: incorrect size (1) at position 1, expecting : 32
The reason I am asking is that filter(), when applied after group_by(), is applied to the whole group rather than to individual rows.
A good way to do this is to turn the grouping variable into a factor and use reorder (or forcats::fct_reorder) to control the order of the levels. Then you can arrange by that column. (The grouping is implicit in the reorder functions.)
library(dplyr)
mtcars %>%
mutate(
cyl = reorder(factor(cyl), -mpg)
# stats::reorder, built-in, uses mean by default
# use -mpg to make it descending
) %>%
arrange(cyl)
# alternately
library(forcats)
mtcars %>%
mutate(
cyl = fct_reorder(factor(cyl), mpg, .fun = mean, .desc = TRUE)
# forcats::fct_reorder, uses median by default,
# takes a .desc argument to make it descending
) %>%
arrange(cyl)
Changing the data like this is nice because the order you specify will be remembered and used by other functions (like ordering bars or facets in a ggplot).
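To see the level order that reorder() produces, the levels can be inspected directly. This small sketch uses base R only; asc_levels and desc_levels are illustrative names.

```r
# Levels ordered by ascending mean mpg (the default direction of reorder())
asc_levels <- levels(reorder(factor(mtcars$cyl), mtcars$mpg))
asc_levels
# "8" "6" "4"

# Levels ordered by descending mean mpg, by negating mpg
desc_levels <- levels(reorder(factor(mtcars$cyl), -mtcars$mpg))
desc_levels
# "4" "6" "8"
```

In mtcars the descending order happens to coincide with the default numeric order of cyl, so the ascending call is the one that makes the reordering visible.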
Perhaps this? First, group by cyl, then fill a new column with mean(mpg), which you can then arrange by however you want, and finally remove the temporary mean(mpg) column.
mtcars %>%
group_by(cyl) %>%
mutate(mean_mpg = mean(mpg)) %>%
arrange(desc(mean_mpg)) %>%
select(-mean_mpg)
#> # A tibble: 32 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
#> 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
#> 9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> # ... with 22 more rows
I am trying to apply a sampling function in a grouped fashion to a data frame, where it should sample n samples from each group, or all group members if the group size is smaller than n.
Using dplyr, I first tried
library(dplyr)
mtcars %>% group_by(cyl) %>% sample_n(2)
This works when n is smaller than all the group sizes but does not take the full group when I choose n larger than the group size (note that there are 7 cars in one of the cyl groups):
mtcars %>% group_by(cyl) %>% sample_n(8)
Error: `size` must be less or equal than 7 (size of data),
set `replace` = TRUE to use sampling with replacement
I tried to solve this by creating an adapted sample_n function like so:
sample_n_or_all <- function(tbl, n) {
if (nrow(tbl) < n) return(tbl)
sample_n(tbl, n)
}
but using my custom function (mtcars %>% group_by(cyl) %>% sample_n_or_all(8)) generates the same error.
Any suggestions on how I can adapt my function so I can apply it to each of the groups? Or another solution to the problem?
We could check the number of rows in the group and pass the value to sample_n accordingly.
library(dplyr)
n <- 8
temp <- mtcars %>% group_by(cyl) %>% sample_n(if(n() < n) n() else n)
temp
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 4 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 7 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 8 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 9 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#10 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
# … with 13 more rows
We can check the number of rows in each group after that.
table(temp$cyl)
#4 6 8
#8 7 8
table(mtcars$cyl)
# 4 6 8
#11 7 14
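As for the custom function from the question: on a grouped data frame, nrow() returns the total number of rows (32 for mtcars), so the early return is never taken. One possible way to keep that function, sketched here on the assumption that group_modify() is available (dplyr 0.8.1 or later), is to call it once per group; the name sampled is illustrative.

```r
library(dplyr)

# The helper from the question, unchanged in behavior
sample_n_or_all <- function(tbl, n) {
  if (nrow(tbl) < n) return(tbl)
  sample_n(tbl, n)
}

# group_modify() passes each group's rows to the helper as .x,
# so nrow() now sees the group size rather than the full table
sampled <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ sample_n_or_all(.x, 8))

table(sampled$cyl)
```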
We can do this without an explicit if/else by using pmin (with n defined as before):
library(dplyr)
n <- 8
tmp <- mtcars %>%
group_by(cyl) %>%
sample_n(pmin(n(), n))
# A tibble: 23 x 11
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 5 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# 6 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
# 7 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 8 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 9 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#10 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# … with 13 more rows
Checking:
table(tmp$cyl)
# 4 6 8
# 8 7 8
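With dplyr 1.0.0 or later, sample_n() is superseded by slice_sample(), which silently truncates n to the group size when sampling without replacement. A minimal sketch under that version assumption (the name capped is illustrative):

```r
library(dplyr)

# slice_sample() takes at most 8 rows per group and returns the
# whole group when it has fewer than 8 rows
capped <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 8)

table(capped$cyl)
```

The counts should again be 8, 7 and 8 for cyl 4, 6 and 8, with no size check needed.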