Using dplyr summarise_at with column index - r

I noticed that when supplying column indices to dplyr::summarize_at the column to be summarized is determined excluding the grouping column(s). I wonder if that is how it's supposed to be since by this design, using the correct column index depends on whether the summarising column(s) are positioned before or after the grouping columns.
Here's an example:
library(dplyr)
data("mtcars")
# grouping column after summarise columns
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
# gear disp hp
# <dbl> <dbl> <dbl>
#1 3 326.3000 176.1333
#2 4 123.0167 89.5000
#3 5 202.4800 195.6000
# grouping columns before summarise columns
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
# cyl hp drat
# <dbl> <dbl> <dbl>
#1 4 82.63636 4.070909
#2 6 122.28571 3.585714
#3 8 209.21429 3.229286
# no grouping columns
mtcars %>% summarise_at(3:4, mean)
# disp hp
#1 230.7219 146.6875
# actual third & fourth columns
names(mtcars)[3:4]
#[1] "disp" "hp"
packageVersion("dplyr")
#[1] ‘0.7.2’
Notice how the summarised columns change depending on grouping and position of the grouping column.
Is this the same on other platforms? Is it a bug or a feature?

with version 0.7.5 this behavior can't be reproduced anymore:
library(dplyr)
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
# # A tibble: 3 x 3
# gear disp hp
# <dbl> <dbl> <dbl>
# 1 3 326. 176.
# 2 4 123. 89.5
# 3 5 202. 196.
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
# # A tibble: 3 x 3
# cyl disp hp
# <dbl> <dbl> <dbl>
# 1 4 105. 82.6
# 2 6 183. 122.
# 3 8 353. 209.

#docendodiscimus thanks for pointing this out, because even if this feature was intentional, documentation doesn't explicitly explain this and in my case could be source of errors. Actually, this problem was solved before answering on the other question, and my comment above does it properly with the same logic.
At this moment, possible solution is to provide names instead of indexes. But one is still able to make it using indexes just by adding few symbols .vars = names(.)[3:4], like below:
mtcars %>%
group_by(cyl) %>%
summarise_at( .vars = colnames(.)[3:4] , mean)
mtcars %>%
group_by(cyl) %>%
summarise_at( .vars = names(.)[3:4] , mean)
## A tibble: 3 x 3
# cyl disp hp
# <dbl> <dbl> <dbl>
#1 4 105.1364 82.63636
#2 6 183.3143 122.28571
#3 8 353.1000 209.21429

Related

Add Another Column Info to Results of groupby r

Can someone help me please?
I Have Column A, Column B and Column C, I want to get the top value of column C, grouped by A, but also have the information of B for those top values
Max <-X %>% select(A,B,C) %>% group_by(A) %>% summarise(top = max(C))
But this code only show me the top values of each unique A data, so I dont know whats the B value assigned to that. (Important, making group_by(A,B) doesnt work, because it doesnt give the top values for each unique A value, it returns the same as the data base X)
This could be achieved via dplyr::top_n or ? dplyr::slice_max like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)

Can I use summarise_at for existing variables while adding other variables at the same time?

Suppose I have a grouped data frame:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise(blah = mean(disp))
# A tibble: 3 x 2
cyl blah
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Then suppose I want to sum some existing variables:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum)
# A tibble: 3 x 5
cyl vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 10 8 45 17
2 6 4 3 27 24
3 8 0 2 46 49
However, if I want to add both summarise commands together, I cannot:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum) %>%
+ summarise(blah = mean(disp))
Error in mean(disp) : object 'disp' not found
After using group_by() in a dplyr chain, Hhow can I add new features with summarise() as well as summing existing features as above with summarise_at(vars(vs:carb), sum)?
The only way I can think of (at the moment) is the store the data immediately before your first summary, then run two summary verbs, and join them on the grouped variable. For instance:
library(dplyr)
grouped_data <- group_by(mtcars, cyl)
left_join(
summarize(grouped_data, blah = mean(disp)),
summarize_at(grouped_data, vars(vs:carb), sum),
by = "cyl")
# # A tibble: 3 x 6
# cyl blah vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 105. 10 8 45 17
# 2 6 183. 4 3 27 24
# 3 8 353. 0 2 46 49
You can left_join with the dataframe resulting from the summarise.
library(dplyr)
data(mtcars)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(vs:carb), sum) %>%
left_join(mtcars %>% group_by(cyl) %>% summarise(blah = mean(disp)))
#Joining, by = "cyl"
## A tibble: 3 x 6
# cyl vs am gear carb blah
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
What I would do is use mutate_at for first step so that other columns are not collapsed and then use summarise_at with mean for all the columns together.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate_at(vars(vs:carb), sum) %>%
summarise_at(vars(vs:carb, disp), mean)
# cyl vs am gear carb disp
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
Here's a way, we need to define an helper function first and it works only in a pipe chain and uses unexported functions from dplyr though so might break one day.
.at <- function(.vars, .funs, ...) {
# make sure we are in a piped call
in_a_piped_fun <- exists(".",parent.frame()) &&
length(ls(envir=parent.frame(), all.names = TRUE)) == 1
if (!in_a_piped_fun)
stop(".at() must be called as an argument to a piped function")
# borrow code from summarize_at
.tbl <- try(eval.parent(quote(.)))
dplyr:::manip_at(
.tbl, .vars, .funs, rlang::enquo(.funs), rlang:::caller_env(),
.include_group_vars = TRUE, ...)
}
library(dplyr, warn.conflicts = FALSE)
mtcars %>%
summarize(!!!.at(vars(vs:carb), sum), blah = mean(disp))
#> vs am gear carb blah
#> 1 14 13 118 90 230.7219
Created on 2019-11-17 by the reprex package (v0.3.0)

Use group size (`group_size`) in `summarise` in `dplyr` [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 3 years ago.
I want to use the size of a group as part of a groupwise operation in dplyr::summarise.
E.g calculate the proportion of manuals by cylinder, by grouping the cars data by cyl and dividing the number of manuals by the size of the group:
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/group_size(.))
But, (I think), because group_size is after a grouped tbl_df and . is ungrouped, this returns
Error in mutate_impl(.data, dots) : basic_string::resize
Is there a way to do this?
You probably can use n() to get the number of rows for group
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/n())
# cyl zz
# <dbl> <dbl>
#1 4.00 0.727
#2 6.00 0.429
#3 8.00 0.143
It is just a group by mean
mtcars %>%
group_by(cyl) %>%
summarise(zz = mean(am))
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
If we need to use group_size
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest %>%
mutate(zz = map_dbl(data, ~ sum(.x$am)/group_size(.x))) %>%
arrange(cyl) %>%
select(-data)
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
Or using do
mtcars %>%
group_by(cyl) %>%
do(data.frame(zz = sum(.$am)/group_size(.)))
# A tibble: 3 x 2
# Groups: cyl [3]
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143

dplyr Summarise improperly excluding NA

We can group mtcars by cylinder and summarize miles per gallon with some simple code.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg))
This provides the correct output shown below.
cyl avg
1 4 26.66364
2 6 19.74286
3 8 15.10000
If I kindly ask dplyr to exclude NA I get some weird results.
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(!is.na(mpg)))
Since there are no NA in this data set the results should be the same as above. But it averages all mpg to exactly "1". A problem with my code or a bug in dplyr?
cyl avg
1 4 1
2 6 1
3 8 1
My actual data set does have some NA that I need to exclude only for this summarization, but exhibits the same behavior.
You want this:
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg, na.rm = T))
# A tibble: 3 x 2
cyl avg
<dbl> <dbl>
1 4 26.66364
2 6 19.74286
3 8 15.10000
Right now, you're returning a logical vector with !is.na(mpg). When you take the mean() of a logical vector, it'll be coerced to 1, not the numeric value you desire.
The way you have coded it, the input to the mean() function is a vector of TRUE and FALSE values. Use mean(mpg[!is.na(mpg)]) instead.
Consider using data.table which I have used for illustration purposes. The following all produce the same result.
library(data.table)
MT[, mean(mpg), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
MT[, mean(mpg, na.rm=TRUE), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
MT[, mean(mpg[!is.na(mpg)]), by = cyl]
cyl V1
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000

conditionally insert rows to a dataframe using dplyr

How can I insert a new row after each row where the hp exceeds 110 in the mtcars dataset, still keeping the cars with hp <=110?
library(tidyverse)
data(mtcars)
I´ve tried the following but without success--
mtcars %>% modify_if(.p = .$hp>110,.f = add_row(.after=1)) #noe equal lengths
mtcars %>% filter(hp>110) %>% add_row(.after=1) #only gives an extra row for the first row meeting condition
mtcars %>% rownames_to_column() %>%
group_by(rowname) %>% modify_if(.p=.$hp>110,.f=add_row(.after=1)) #not egual length
the following - using purrr- seems to work:
foo <- function(df){
if (df$hp>110) {df<-add_row(.data=df,.after=1)}
df
}
mtcars %>% rownames_to_column(var = "make") %>% nest(-make) %>%
mutate(new=map(data,~ foo(.x))) %>% select (-data) %>% unnest(new)
Any function called add_row_if ???
This inserts a new row of random numbers (runif(length(.)) beneath each row that matches your filter:
mtcars %>% filter(hp > 110) %>%
group_by(row_number()) %>%
do (
rbind(., runif(length(.)))
) %>%
ungroup()
# # A tibble: 36 × 12
# mpg cyl disp hp drat wt qsec vs
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 18.70000000 8.0000000 360.0000000 175.0000000 3.15000000 3.440000000 17.02000000 0.0000000
# 2 0.98189043 0.5940565 0.7401747 0.7300069 0.73460416 0.102802673 0.06632293 0.3771534
# 3 14.30000000 8.0000000 360.0000000 245.0000000 3.21000000 3.570000000 15.84000000 0.0000000
# 4 0.02192815 0.4811006 0.6456729 0.2900382 0.69145964 0.741733891 0.34932004 0.1356568
# 5 19.20000000 6.0000000 167.6000000 123.0000000 3.92000000 3.440000000 18.30000000 1.0000000
# 6 0.95568849 0.6532453 0.8744450 0.3346875 0.66217223 0.776532606 0.24731722 0.9633778
# 7 17.80000000 6.0000000 167.6000000 123.0000000 3.92000000 3.440000000 18.90000000 1.0000000
# 8 0.43592634 0.6036127 0.3185138 0.1035780 0.48505561 0.007380369 0.24154177 0.4978268
# 9 16.40000000 8.0000000 275.8000000 180.0000000 3.07000000 4.070000000 17.40000000 0.0000000
# 10 0.10970496 0.9063285 0.4659718 0.2793090 0.04670105 0.337342230 0.85691425 0.2758889
# # ... with 26 more rows, and 4 more variables: am <dbl>, gear <dbl>, carb <dbl>,
# # `row_number()` <dbl>

Resources