filter inside dplyr's summarise - r

I want to use filter or a similar function inside summarise from the dplyr package. I've got a dataframe (e.g. mtcars) where I need to group by a factor (e.g. cyl) and then calculate some statistics, including the percentage of total wt for every cyl type (wt.pc).
The question is: how can I subset/filter the wt column inside summarise to get that percentage, but without the last 10 rows?
I've tried this code, but it returns NA:
mtcars %>%
group_by(cyl) %>%
summarise(wt = round(sum(wt)),
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[1:22]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 NA 5
2 6 22 21.4 NA 4
3 8 56 54.4 NA 4
wt.pc.short is the % of sum(wt) for every cyl in the shorter dataframe mtcars[1:22,].

Something like this?
mtcars %>%
mutate(id = row_number()) %>%
group_by(cyl) %>%
summarise(wt_new = round(sum(wt)), # note the change in name here!
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[id<23]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt_new wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 22.7 5
2 6 22 21.4 25.8 4
3 8 56 54.4 51.6 4
The important part here is that once you assign wt in the call to summarise, all subsequent references to wt use the newly assigned (summarised) wt, not the original column. By that point wt is a single number per group, so wt[1:22] indexes positions 2 to 22 that do not exist, and the sum becomes NA. You can see the same effect here:
mean(mtcars[,"mpg"])
# [1] 20.09062
var(mtcars[,"mpg"])
# [1] 36.3241
mtcars %>% summarise(var_before = var(mpg),
mpg = mean(mpg),
var_after = var(mpg))
# var_before mpg var_after
# 1 36.3241 20.09062 NA
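If the column name really has to be reused, one way around this is to compute every summary that needs the original column first and overwrite the name last. A minimal sketch, assuming the default left-to-right evaluation of summarise arguments (the name res is only for illustration):

```r
library(dplyr)

# Summaries are evaluated left to right, so anything that needs the
# original mpg column must come before the line that overwrites mpg.
res <- mtcars %>%
  summarise(var_before = var(mpg),
            var_after  = var(mpg),   # still sees the original column
            mpg        = mean(mpg))  # overwrite the name last

res
```

Giving the summary a new name, as in wt_new above, is usually the less surprising choice.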

I think you can do it like this. First we calculate the row number within each group. If max(ID) > 10, the group has enough observations to remove the last 10 rows, so we keep only rows with ID < max(ID) - 9 (i.e. drop the last 10). Otherwise ID == ID is TRUE for every row and nothing is removed.
mtcars %>% group_by(cyl) %>%
mutate(ID = row_number()) %>%
filter(if (max(ID) > 10) ID < (max(ID) - 9) else ID == ID)
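The same per-group rule can probably be written without the helper ID column by using n() and row_number() directly inside filter. This is a sketch of an alternative, not the answer's original code, and trimmed is a hypothetical name:

```r
library(dplyr)

trimmed <- mtcars %>%
  group_by(cyl) %>%
  # keep a row if its group has 10 or fewer rows, or if it is not
  # among the last 10 rows of a larger group
  filter(n() <= 10 | row_number() <= n() - 10) %>%
  ungroup()

# rows kept per cyl group
count(trimmed, cyl)
```

Note that this drops the last 10 rows within each cyl group, which is not the same as dropping the last 10 rows of the whole data frame.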

Related

Add Another Column Info to Results of groupby r

Can someone help me please?
I have Column A, Column B and Column C. I want to get the top value of column C, grouped by A, but also keep the value of B for those top rows.
Max <-X %>% select(A,B,C) %>% group_by(A) %>% summarise(top = max(C))
But this code only shows me the top value for each unique A, so I don't know which B value belongs to it. (Important: group_by(A, B) doesn't work, because it doesn't give the top value for each unique A; it returns the same rows as the original data X.)
This could be achieved via dplyr::top_n or dplyr::slice_max, like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)
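By default slice_max keeps ties, so a group can return more than one row when several rows share the maximum. If exactly one row per group is wanted, a sketch along these lines may help (top_rows is a hypothetical name, and mtcars stands in for X):

```r
library(dplyr)

top_rows <- mtcars %>%
  select(cyl, mpg, hp) %>%
  group_by(cyl) %>%
  # n = 1 with with_ties = FALSE returns exactly one row per group
  slice_max(hp, n = 1, with_ties = FALSE) %>%
  ungroup()

top_rows
```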

group_by() level disappear after filter()/mutate()/count() without using ungroup

This problem has bothered me all day and I don't know why it happens.
The issue is that the group_by level disappears after a single verb such as filter(), mutate(), or count(), and to keep that level I need to add group_by() again after each of these calls.
Below I attach an example.
As you can see, if I add group_by after filter, it works fine.
data("mtcars")
> mtcars %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
However, if I use group_by before filter and then count the values, the group_by level is lost:
data("mtcars")
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
mpg n
1 21.0 2
2 21.4 1
In order to make it work, I need to change the code to:
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
This method also doesn't work:
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
mpg n
1 21.0 2
2 21.4 1
When I run the same code on another PC, it works as expected:
data("mtcars")
mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
# A tibble: 2 x 3
# Groups: cyl [1]
cyl mpg n
<dbl> <dbl> <int>
1 6 21 2
2 6 21.4 1
I have reinstalled the dplyr package many times and this keeps happening. I am using dplyr version 1.0.2.
I would really appreciate it if someone could help me with this issue!
Edit:
The problem was solved after I updated R to version 4.0.2 (my previous version was 3.6.3). I'm not sure why dplyr doesn't work properly under 3.6.3, but at least the problem is solved for now.
Try this:
data("mtcars")
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
There can be a masking problem. A function named filter exists in both the dplyr and stats packages. The same issue was discussed here. A similar problem occurs with the select function.
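To check whether masking is actually the cause in a given session, something like the following sketch can show which package a bare filter currently resolves to (filter_owner is a hypothetical name; the result depends on the order in which packages were attached):

```r
library(dplyr)

# Which package does a bare `filter` resolve to right now?
# If dplyr was attached last this should be "dplyr"; "stats" would
# mean stats::filter is masking it.
filter_owner <- environmentName(environment(filter))
filter_owner

# Names that exist in more than one attached package
conflicts()
```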
Also note in that context the difference between:
data("mtcars")
mtcars %>%
group_by(cyl,gear) %>%
summarize(
n=n()
) %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl [3]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 11
2 4 4 8 11
3 4 5 2 11
4 6 3 2 7
5 6 4 4 7
6 6 5 1 7
mtcars %>%
group_by(cyl,gear) %>%
count() %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl, gear [8]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 1
2 4 4 8 8
3 4 5 2 2
4 6 3 2 2
5 6 4 4 4
Summarise defaults to dropping the last grouping variable (.groups="drop_last"). And for a funny reason :)
https://twitter.com/hadleywickham/status/1254802700589555715
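The grouping left behind by summarise can be controlled explicitly with the .groups argument (available from dplyr 1.0.0). A small sketch comparing the options, with hypothetical object names:

```r
library(dplyr)

by_cyl_gear <- group_by(mtcars, cyl, gear)

# default behaviour made explicit: the last grouping variable is dropped
g_drop_last <- summarise(by_cyl_gear, n = n(), .groups = "drop_last")
# keep both grouping variables, like count() on a grouped data frame
g_keep <- summarise(by_cyl_gear, n = n(), .groups = "keep")
# return an ungrouped result
g_drop <- summarise(by_cyl_gear, n = n(), .groups = "drop")

group_vars(g_drop_last)
group_vars(g_keep)
group_vars(g_drop)
```

A later mutate(mysum = sum(n)) then sums within whatever grouping is left, which is why the two pipelines above give different mysum values.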

Can I use summarise_at for existing variables while adding other variables at the same time?

Suppose I have a grouped data frame:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise(blah = mean(disp))
# A tibble: 3 x 2
cyl blah
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Then suppose I want to sum some existing variables:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum)
# A tibble: 3 x 5
cyl vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 10 8 45 17
2 6 4 3 27 24
3 8 0 2 46 49
However, if I want to add both summarise commands together, I cannot:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum) %>%
+ summarise(blah = mean(disp))
Error in mean(disp) : object 'disp' not found
After using group_by() in a dplyr chain, how can I add new features with summarise() as well as summing existing features as above with summarise_at(vars(vs:carb), sum)?
The only way I can think of (at the moment) is to store the data immediately before your first summary, then run two summary verbs, and join them on the grouping variable. For instance:
library(dplyr)
grouped_data <- group_by(mtcars, cyl)
left_join(
summarize(grouped_data, blah = mean(disp)),
summarize_at(grouped_data, vars(vs:carb), sum),
by = "cyl")
# # A tibble: 3 x 6
# cyl blah vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 105. 10 8 45 17
# 2 6 183. 4 3 27 24
# 3 8 353. 0 2 46 49
You can left_join with the dataframe resulting from the summarise.
library(dplyr)
data(mtcars)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(vs:carb), sum) %>%
left_join(mtcars %>% group_by(cyl) %>% summarise(blah = mean(disp)))
#Joining, by = "cyl"
## A tibble: 3 x 6
# cyl vs am gear carb blah
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
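What I would do is use mutate_at for the first step, so that the other columns are not collapsed, and then use summarise_at with mean for all the columns together. This works because after mutate_at each summed column is constant within its group, so its mean equals the sum.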
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate_at(vars(vs:carb), sum) %>%
summarise_at(vars(vs:carb, disp), mean)
# cyl vs am gear carb disp
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
Here's a way. We need to define a helper function first; it works only in a pipe chain and uses unexported functions from dplyr, so it might break one day.
.at <- function(.vars, .funs, ...) {
# make sure we are in a piped call
in_a_piped_fun <- exists(".",parent.frame()) &&
length(ls(envir=parent.frame(), all.names = TRUE)) == 1
if (!in_a_piped_fun)
stop(".at() must be called as an argument to a piped function")
# borrow code from summarize_at
.tbl <- try(eval.parent(quote(.)))
dplyr:::manip_at(
.tbl, .vars, .funs, rlang::enquo(.funs), rlang:::caller_env(),
.include_group_vars = TRUE, ...)
}
library(dplyr, warn.conflicts = FALSE)
mtcars %>%
summarize(!!!.at(vars(vs:carb), sum), blah = mean(disp))
#> vs am gear carb blah
#> 1 14 13 118 90 230.7219
Created on 2019-11-17 by the reprex package (v0.3.0)
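In dplyr 1.0.0 and later, across() can be mixed with ordinary summaries inside a single summarise call, which may make the join and helper-function workarounds unnecessary. A minimal sketch, assuming a recent dplyr (combined is a hypothetical name):

```r
library(dplyr)

combined <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(vs:carb, sum),  # sum the existing columns
            blah = mean(disp))     # disp is untouched, so this still works

combined
```

If a column inside the across() range were also needed for a later summary, that summary would have to come first, because across() overwrites the columns it summarises.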

how to calculate proportion by another variable (not by frequency) in dplyr in R

Using mtcars data, I want to calculate proportion of mpg for each group of cyl and am. How to calc it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg / sum(mpg))
Thanks in advance!
If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, then I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n()) %>%
ungroup() %>%
mutate(mpg.gr = mpg / sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that thanks to ungroup(), the proportions are calculated using the counts of all records, not just those within the cyl group, as before.
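The title asks for a proportion "by another variable (not by frequency)", so the intent may instead have been each group's share of total mpg. That reading is an assumption; under it, a sketch could look like this (mpg_share is a hypothetical name):

```r
library(dplyr)

mpg_share <- mtcars %>%
  group_by(cyl, am) %>%
  # total mpg per cyl/am combination, returned ungrouped
  summarise(mpg = sum(mpg), .groups = "drop") %>%
  # each combination's share of the overall mpg total
  mutate(mpg.gr = mpg / sum(mpg))

mpg_share
```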

Use group size (`group_size`) in `summarise` in `dplyr` [duplicate]

I want to use the size of a group as part of a groupwise operation in dplyr::summarise.
E.g., calculate the proportion of manuals by cylinder, by grouping the cars data by cyl and dividing the number of manuals by the size of the group:
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/group_size(.))
But (I think) because group_size expects a grouped tbl_df and . is ungrouped here, this returns:
Error in mutate_impl(.data, dots) : basic_string::resize
Is there a way to do this?
You can probably use n() to get the number of rows in each group:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/n())
# cyl zz
# <dbl> <dbl>
#1 4.00 0.727
#2 6.00 0.429
#3 8.00 0.143
Since am is coded 0/1, this is just a group-wise mean:
mtcars %>%
group_by(cyl) %>%
summarise(zz = mean(am))
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
If we need to use group_size:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest %>%
mutate(zz = map_dbl(data, ~ sum(.x$am)/group_size(.x))) %>%
arrange(cyl) %>%
select(-data)
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
Or using do:
mtcars %>%
group_by(cyl) %>%
do(data.frame(zz = sum(.$am)/group_size(.)))
# A tibble: 3 x 2
# Groups: cyl [3]
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
