Why do aggregate and summarise give answers in a different order? - r

If I calculate something with the aggregate function or with summarise from the dplyr package, why do they return the results in a different order?
Example:
a <- aggregate(hp~mpg+cyl+gear, mtcars, FUN = sum)
gives me
mpg cyl gear hp
1 21.5 4 3 97
2 18.1 6 3 105
3 21.4 6 3 110
4 10.4 8 3 420
5 13.3 8 3 245
and
b <- mtcars %>%
  group_by(mpg, cyl, gear) %>%
  summarise(hp = sum(hp))
gives me
mpg cyl gear hp
<dbl> <dbl> <dbl> <dbl>
1 10.4 8 3 420
2 13.3 8 3 245
3 14.3 8 3 245
4 14.7 8 3 230
5 15 8 5 335
Why is the order not the same?

As mentioned by @zx8754, tidyverse operations may re-order the rows. There is no guarantee that you will get a particular row order.
https://github.com/tidyverse/dplyr/issues/2192#issuecomment-281655703
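Looking more closely, aggregate sorted by gear, then cyl, then mpg, i.e. the last variable in the formula is the primary sort key, whereas summarise sorted by the grouping variables in the order they were given to group_by.
So the following tidyverse code gives the same row order as aggregate(hp~mpg+cyl+gear, mtcars, FUN = sum):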
library(tidyverse)
mtcars %>% group_by(gear, cyl, mpg) %>% summarise(hp = sum(hp)) %>% head()
#> # A tibble: 6 x 4
#> # Groups: gear, cyl [3]
#> gear cyl mpg hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 4 21.5 97
#> 2 3 6 18.1 105
#> 3 3 6 21.4 110
#> 4 3 8 10.4 420
#> 5 3 8 13.3 245
#> 6 3 8 14.3 245
Created on 2019-02-27 by the reprex package (v0.2.1)
and to get the same row order as mtcars %>% group_by(mpg, cyl, gear) %>% summarise(hp = sum(hp)):
library(tidyverse)
aggregate(hp~gear+cyl+mpg, mtcars, FUN = sum) %>% head()
#> gear cyl mpg hp
#> 1 3 8 10.4 420
#> 2 3 8 13.3 245
#> 3 3 8 14.3 245
#> 4 3 8 14.7 230
#> 5 5 8 15.0 335
#> 6 3 8 15.2 330
Created on 2019-02-27 by the reprex package (v0.2.1)
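If the row order matters, it is safer to sort both results explicitly than to rely on either function's default. The following is a minimal sketch, assuming dplyr 1.0 or later (for the .groups argument); the names a_sorted and b_sorted are only illustrative:

```r
library(dplyr)

a <- aggregate(hp ~ mpg + cyl + gear, mtcars, FUN = sum)
b <- mtcars %>%
  group_by(mpg, cyl, gear) %>%
  summarise(hp = sum(hp), .groups = "drop")

# Sort both results by the same keys so the comparison
# does not depend on either function's default ordering
a_sorted <- a[order(a$gear, a$cyl, a$mpg), ]
rownames(a_sorted) <- NULL
b_sorted <- b %>%
  arrange(gear, cyl, mpg) %>%
  as.data.frame()

# Once sorted the same way, the two results should hold the same values
all.equal(a_sorted, b_sorted)
```

The grouping itself is identical in both approaches; only the default sort keys differ.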

Related

Assign most common value of factor variable with summarize in R

R noob here, working in tidyverse / RStudio.
I have a categorical / factor variable that I'd like to retain in a group_by/summarize workflow. I'd like to summarize it using a summary function that returns the most common value of that factor within each group.
Is there a summary function I can use for this?
mean returns NA, median only works with numeric data, and summary gives me separate rows with counts of each factor level instead of the most common level.
Edit: example using subset of mtcars dataset:
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
24.4 4 147. 62 3.69 3.19 20 1 0 4 2
22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
Here I have converted carb into a factor variable. In this subset of the data, you can see that among 6-cylinder cars there are 3 with carb=4 and 1 with carb=1; similarly among 4-cylinder cars there are 2 with carb=2 and 1 with carb=1.
So if I do:
data %>% group_by(cyl) %>% summarise(modalcarb = FUNC(carb))
where FUNC is the function I'm looking for, I should get:
cyl carb
<dbl> <fct>
4 2
6 4
8 2 # there are multiple potential ways of handling multi-modal situations, but that's secondary here
Hope that makes sense!
You could use fmode() from the collapse package to calculate the mode. Here is a reproducible example using the mtcars dataset, with cyl converted to a factor and used as the grouping variable:
library(dplyr)
library(collapse)
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
group_by(cyl) %>%
summarise(mode = fmode(am))
#> # A tibble: 3 × 2
#> cyl mode
#> <fct> <dbl>
#> 1 4 1
#> 2 6 0
#> 3 8 0
Created on 2022-11-24 with reprex v2.0.2
We could use which.max after count:
library(dplyr)
# fake dataset
x <- mtcars %>%
mutate(cyl = factor(cyl)) %>%
select(cyl)
x %>%
count(cyl) %>%
slice(which.max(n))
cyl n
<fct> <int>
1 8 14
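You can use table to count and which.max to find the most frequent value. Note that which.max returns a position in the table, not in carb, so take the name of that table entry rather than indexing carb with it.
library(tidyverse)
mtcars |>
  group_by(cyl) |>
  # names() gives the most frequent value as text; convert back to numeric
  summarise(modalcarb = as.numeric(names(which.max(table(carb)))))
#> # A tibble: 3 x 2
#> cyl modalcarb
#> <dbl> <dbl>
#> 1 4 2
#> 2 6 4
#> 3 8 4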
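For the original question, where carb is a factor, a small type-preserving helper can stand in for FUNC. This is only a sketch: modal_value is a hypothetical name, not a function from dplyr or base R, and ties are resolved in favour of the value that appears first in the group.

```r
library(dplyr)

# Hypothetical helper: return the most frequent value of x,
# keeping its type (factor stays factor); first value wins on ties
modal_value <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

modal_by_cyl <- mtcars %>%
  mutate(carb = factor(carb)) %>%
  group_by(cyl) %>%
  summarise(modalcarb = modal_value(carb))

modal_by_cyl
```

Because the helper indexes into unique(x) rather than into x itself, the result is always one of the observed values, and a factor column comes back as a factor with its original levels.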

R DPLYR GROUPINGS

library(dplyr)
data(mtcars)
mtcars$FACTORA = sample(c("A", "b"), nrow(mtcars), replace = TRUE)
mtcars$FACTORB = sample(c("c", "e"), nrow(mtcars), replace = TRUE)
DATA = mtcars %>%
group_by(FACTORA, FACTORB) %>%
slice(which.min(wt)) &
group_by(FACTORA) %>%
slice(which.min(wt))
I wish to keep the rows that minimize wt by qsec and gear, and also the rows that minimize wt by qsec alone, all in one data frame.
Or do I have to do this:
DATA = mtcars %>%
group_by(FACTORA,FACTORB) %>%
slice(which.min(wt))
DATADATA = mtcars %>%
group_by(FACTORA) %>%
slice(which.min(wt))
and then do merge?
I think this is what you mean (replacing qsec with cyl, which is categorical). Note that with this set of groupings keep2 is somewhat redundant, since any row that minimizes wt within a cyl group is guaranteed to be among the rows that minimize wt within each cyl/gear group.
Also, this returns only one minimum per group and drops ties, though since you use which.min above I assume that isn't important.
library(dplyr)
mtcars %>%
  group_by(cyl, gear) %>%
  arrange(wt) %>%
  mutate(keep1 = row_number() == 1L) %>%
  group_by(cyl) %>%
  arrange(wt) %>%
  mutate(keep2 = row_number() == 1L) %>%
  filter(keep1 | keep2)
#> # A tibble: 8 × 13
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb keep1 keep2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 TRUE TRUE
#> 2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 TRUE FALSE
#> 3 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 TRUE FALSE
#> 4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 TRUE TRUE
#> 5 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 TRUE FALSE
#> 6 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 TRUE TRUE
#> 7 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 TRUE FALSE
#> 8 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 TRUE FALSE
Created on 2022-04-29 by the reprex package (v2.0.1)
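The "do it twice and then merge" route from the question also works. Here is a rough sketch using cyl and gear as in the answer above, assuming dplyr 1.0 or later for slice_min; the object names are illustrative only:

```r
library(dplyr)

# Lightest car within each cyl/gear combination (one row per group, ties dropped)
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  slice_min(wt, n = 1, with_ties = FALSE) %>%
  ungroup()

# Lightest car within each cyl group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  slice_min(wt, n = 1, with_ties = FALSE) %>%
  ungroup()

# Stack both results and drop rows that were selected twice
combined <- bind_rows(by_cyl_gear, by_cyl) %>%
  distinct()

combined
```

With these groupings the second selection adds nothing new, for the reason given in the answer, so distinct() leaves the same eight rows as the keep1/keep2 approach.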

How does one use an external list of variables for dplyr::distinct() command in r?

For example, I want to use an external list of the following variables as the basis of a distinct command for the mtcars dataset:
## creates external_list_of_vars_df
# ---- NOTE: creates object
external_list_of_vars_df <-
data.frame(
external_list_of_vars_df =
c("gear", "carb", "am")
)
# ---- NOTE: turns object into tibble
external_list_of_vars_df <-
as_tibble(external_list_of_vars_df)
# ---- NOTE: displays data
external_list_of_vars_df
> external_list_of_vars_df
# A tibble: 3 × 1
external_list_of_vars_df
<chr>
1 gear
2 carb
3 am
I can use the long way, which requires inputting the variables of interest manually, to accomplish this task:
> mtcars_distinct_df_long
# A tibble: 13 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
5 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
6 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
9 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
10 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
11 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
12 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
13 15 8 301 335 3.54 3.57 14.6 0 1 5 8
When I try to use the shortcut, it does not work:
## my short way to create mtcars_distinct_df_external, by using the external list of variables
# ---- NOTE: creates object
mtcars_distinct_df_external <-
as_tibble(
mtcars %>%
distinct(vars(external_list_of_vars_df$external_list_of_vars_df), .keep_all = TRUE)
)
# ---- NOTE: displays data
mtcars_distinct_df_external
> mtcars_distinct_df_external
# A tibble: 1 × 12
mpg cyl disp hp drat wt qsec vs am gear carb `vars(external_list_of_vars_df$external_li…`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <quos>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 external_list_of_vars_df$external_list_of_v…
> # ---- NOTE: does not work
Is this task possible? If so, how?
Thanks ahead of time.
Here is the code I used to generate the example:
# how to use external list of vars for dplyr::distinct() command
## loads package(s)
if(!require(tidyverse)){install.packages("tidyverse")}
## data for example
mtcars
## creates external_list_of_vars_df
# ---- NOTE: creates object
external_list_of_vars_df <-
data.frame(
external_list_of_vars_df =
c("gear", "carb", "am")
)
# ---- NOTE: turns object into tibble
external_list_of_vars_df <-
as_tibble(external_list_of_vars_df)
# ---- NOTE: displays data
external_list_of_vars_df
## long way to create mtcars_distinct_df, by inputting variables manually
# ---- NOTE: creates object
mtcars_distinct_df_long <-
as_tibble(
mtcars %>%
distinct(gear, carb, am, .keep_all = TRUE)
)
# ---- NOTE: displays data
mtcars_distinct_df_long
## my short way to create mtcars_distinct_df_external, by using the external list of variables
# ---- NOTE: creates object
mtcars_distinct_df_external <-
as_tibble(
mtcars %>%
distinct(vars(external_list_of_vars_df$external_list_of_vars_df), .keep_all = TRUE)
)
# ---- NOTE: displays data
mtcars_distinct_df_external
# ---- NOTE: does not work
There are two ways to use an external vector of variable names inside dplyr verbs.
Using across(all_of()):
library(dplyr)
external_list_of_vars <- c("gear", "carb", "am")
mtcars %>%
distinct(across(all_of(external_list_of_vars)), .keep_all = TRUE)
Using tidy evaluation — specifically, the unquote-splice operator !!!:
mtcars %>%
distinct(!!!syms(external_list_of_vars), .keep_all = TRUE)
It wasn’t clear to me if your names vector is already inside a dataframe, or if that was just part of your attempt to solve the problem. If the former, you can replace external_list_of_vars in my code with external_list_of_vars_df$external_list_of_vars_df.
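To tie this back to the question's setup, here is a small sketch that pulls the names out of the one-column tibble and checks the result against the hand-written version. The object names vars_vec, via_vector and by_hand are illustrative, and as_tibble() is used first only so that row names don't get in the way of the comparison:

```r
library(dplyr)

# Names stored in a one-column tibble, as in the question
external_list_of_vars_df <- tibble(
  external_list_of_vars_df = c("gear", "carb", "am")
)

# Extract the column as a plain character vector
vars_vec <- external_list_of_vars_df[[1]]

cars <- as_tibble(mtcars)

# distinct() driven by the external character vector
via_vector <- cars %>%
  distinct(across(all_of(vars_vec)), .keep_all = TRUE)

# The "long way" with the variables typed by hand
by_hand <- cars %>%
  distinct(gear, carb, am, .keep_all = TRUE)

nrow(via_vector)
nrow(by_hand)
```

Judging by the output in the question, the original attempt failed because distinct() treated the whole vars(...) call as a single computed column rather than as a selection of existing columns; all_of() is what turns a character vector into a column selection.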
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars %>%
as_tibble() %>%
distinct(gear, carb, am, .keep_all = TRUE)
#> # A tibble: 13 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 4 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 5 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 6 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
#> 9 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 10 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 11 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 12 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
#> 13 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Created on 2022-02-25 by the reprex package (v2.0.1)

How to sort a tibble by column order? (first column, second column, third...)

I have a tibble that contains an arbitrary number of columns.
These columns were inserted in order so their index (first, second, ...) is meaningful.
I'm trying to sort the tibble by the first column, then the second column, then the third, and so on.
I'd rather keep using dplyr::arrange() to be consistent with my framework, but if it cannot be done I'd gladly accept any solution.
Also, if missing values could be considered last that would be a great plus.
Here is a reproducible example with my expected output and some failed attempts:
library(tidyverse)
set.seed(0)
x=as_tibble(mtcars)[1:4] %>% sample_n(5)
x
#> # A tibble: 5 x 4
#> mpg cyl disp hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 15.2 8 276. 180
#> 2 19.2 8 400 175
#> 3 21.4 6 258 110
#> 4 14.3 8 360 245
#> 5 21 6 160 110
# **** EXPECTED OUTPUTS: ****
arrange(x, mpg, cyl, disp, hp)
#> # A tibble: 5 x 4
#> mpg cyl disp hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 14.3 8 360 245
#> 2 15.2 8 276. 180
#> 3 19.2 8 400 175
#> 4 21 6 160 110
#> 5 21.4 6 258 110
arrange(x, hp, disp, cyl, mpg)
#> # A tibble: 5 x 4
#> mpg cyl disp hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110
#> 2 21.4 6 258 110
#> 3 19.2 8 400 175
#> 4 15.2 8 276. 180
#> 5 14.3 8 360 245
# **** MY FAILED ATTEMPTS: ****
arrange(x, all_of(colnames(x)))
#> Error: arrange() failed at implicit mutate() step.
#> * Problem with `mutate()` column `..1`.
#> i `..1 = all_of(colnames(x))`.
#> i `..1` must be size 5 or 1, not 4.
arrange(x, !!all_of(colnames(x)))
#> Error: arrange() failed at implicit mutate() step.
#> * Problem with `mutate()` column `..1`.
#> i `..1 = <chr>`.
#> i `..1` must be size 5 or 1, not 4.
arrange(x, !!!all_of(names(x)))
#> # A tibble: 5 x 4
#> mpg cyl disp hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 15.2 8 276. 180
#> 2 19.2 8 400 175
#> 3 21.4 6 258 110
#> 4 14.3 8 360 245
#> 5 21 6 160 110
do.call(arrange, x, colnames(x))
#> Warning in if (quote) args <- lapply(args, enquote): the condition has length >
#> 1 and only the first element will be used
#> Error in if (quote) args <- lapply(args, enquote): argument is not interpretable as logical
do.call(arrange, x, list(colnames(x)))
#> Error in if (quote) args <- lapply(args, enquote): argument is not interpretable as logical
Created on 2021-08-10 by the reprex package (v2.0.0)
Since arrange_all has been superseded, you can use across in arrange.
library(dplyr)
x %>% arrange(across())
# A tibble: 5 x 4
# mpg cyl disp hp
# <dbl> <dbl> <dbl> <dbl>
#1 14.3 8 360 245
#2 15.2 8 276. 180
#3 19.2 8 400 175
#4 21 6 160 110
#5 21.4 6 258 110
For the reverse column order you can do:
x %>% arrange(across(.cols = ncol(.):1))
# A tibble: 5 x 4
# mpg cyl disp hp
# <dbl> <dbl> <dbl> <dbl>
#1 21 6 160 110
#2 21.4 6 258 110
#3 19.2 8 400 175
#4 15.2 8 276. 180
#5 14.3 8 360 245
In base R, you can use do.call with order -
#1.
x[do.call(order, x), ]
#2.
x[do.call(order, rev(x)), ]
# arrange by column left to right:
x %>% arrange(!!!syms(colnames(.)))
# arrange by column right to left:
x %>% arrange(!!!syms(rev(colnames(.))))
Output:
> x %>% arrange(!!!syms(colnames(.)))
# A tibble: 5 x 4
mpg cyl disp hp
<dbl> <dbl> <dbl> <dbl>
1 14.3 8 360 245
2 15.2 8 276. 180
3 19.2 8 400 175
4 21 6 160 110
5 21.4 6 258 110
> x %>% arrange(!!!syms(rev(colnames(.))))
# A tibble: 5 x 4
mpg cyl disp hp
<dbl> <dbl> <dbl> <dbl>
1 21 6 160 110
2 21.4 6 258 110
3 19.2 8 400 175
4 15.2 8 276. 180
5 14.3 8 360 245
We can use
library(dplyr)
x %>%
arrange(across(everything()))
Output:
# A tibble: 5 x 4
mpg cyl disp hp
<dbl> <dbl> <dbl> <dbl>
1 14.3 8 360 245
2 15.2 8 276. 180
3 19.2 8 400 175
4 21 6 160 110
5 21.4 6 258 110
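On the side question about missing values: arrange() sorts NA values to the end, so the across(everything()) form should already behave as requested. A small sketch with a made-up tibble y (not part of the original example):

```r
library(dplyr)

# Made-up data with missing values in both columns
y <- tibble(
  a = c(2, NA, 1, 1),
  b = c(NA, 1, 2, 1)
)

# Sort by the first column, then the second; NAs go last in each column
sorted <- y %>%
  arrange(across(everything()))

sorted
```

The base R equivalent, y[do.call(order, y), ], uses na.last = TRUE by default and so places missing values last as well.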

R split apply combine with dplyr - how to keep NA resulting from slice

mtcars %>% select(mpg, cyl) %>% group_by(cyl) %>% arrange(mpg) %>% slice(8)
outputs
mpg cyl
<dbl> <dbl>
1 30.4 4
2 15.2 8
As you can see, it does not produce a row for 6 cylinders - what is the recommended way to keep all the groups, even if combine is empty?
To quickly select a row from each group, keeping NAs, you can subset inside summarise_all:
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
summarise_all(funs(.[8]))
## # A tibble: 3 × 11
## cyl mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 2 6 NA NA NA NA NA NA NA NA NA NA
## 3 8 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
However, @Frank is right; this won't extend nicely to subsetting multiple rows, because summarise demands a single result row for each group. To subset, say, rows 7 and 8 of each group, use a list column and expand it with tidyr::unnest:
library(tidyverse)
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
summarise_all(funs(list(.[7:8]))) %>%
unnest()
## # A tibble: 6 × 11
## cyl mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 27.3 79.0 66 4.08 1.935 18.90 1 1 4 1
## 2 4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 3 6 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4 6 NA NA NA NA NA NA NA NA NA NA
## 5 8 15.2 275.8 180 3.07 3.780 18.00 0 0 3 3
## 6 8 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
A more concise version with purrr::dmap (since moved to the purrrlyr package) returns the same thing:
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
dmap(~.x[7:8])
Since dplyr 0.8 we can use group_map, so with the same idea as @alistaire we can do the following. (From dplyr 0.8.1 onward, group_map returns a list, and the tibble-returning version shown here is called group_modify.)
library(dplyr)
mtcars2 <- mtcars %>% select(mpg, cyl) %>% group_by(cyl) %>% arrange(mpg)
mtcars2 %>% group_modify(~.[8,])
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 30.4
#> 2 6 NA
#> 3 8 15.2
mtcars2 %>% group_modify(~.[7:8,])
#> # A tibble: 6 x 2
#> # Groups: cyl [3]
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 27.3
#> 2 4 30.4
#> 3 6 21.4
#> 4 6 NA
#> 5 8 15.2
#> 6 8 15.2
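The first answer's idea can also be written without funs(), which has been deprecated since dplyr 0.8. A minimal sketch, assuming dplyr 1.0 or later for across(); the name eighth is illustrative:

```r
library(dplyr)

eighth <- mtcars %>%
  select(mpg, cyl) %>%
  arrange(mpg) %>%
  group_by(cyl) %>%
  # Indexing a vector past its end yields NA, so short groups are kept
  summarise(across(everything(), ~ .x[8]))

eighth
```

This relies on the same trick as the original: out-of-range indexing on a plain vector gives NA instead of dropping the group, which is why the 6-cylinder row survives.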
