Count number of rows by group using dplyr

I am using the mtcars dataset and want to count the records for each combination of values, much like SELECT COUNT(*) ... GROUP BY in SQL. ddply() from plyr works for me:
library(plyr)
ddply(mtcars, .(cyl,gear),nrow)
has output
cyl gear V1
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Using this code
library(dplyr)
g <- group_by(mtcars, cyl, gear)
summarise(g, length(gear))
has output
length(cyl)
1 32
I found various functions to pass in to summarise() but none seem to work for me. One function I found is sum(G), which returned
Error in eval(expr, envir, enclos) : object 'G' not found
Tried using n(), which returned
Error in n() : This function should not be called directly
What am I doing wrong? How can I get group_by() / summarise() to work for me?

There's a special function n() in dplyr to count rows (potentially within groups):
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
summarise(n = n())
#Source: local data frame [8 x 3]
#Groups: cyl [?]
#
# cyl gear n
# (dbl) (dbl) (int)
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
But dplyr also offers a handy count function which does exactly the same with less typing:
count(mtcars, cyl, gear) # or mtcars %>% count(cyl, gear)
#Source: local data frame [8 x 3]
#Groups: cyl [?]
#
# cyl gear n
# (dbl) (dbl) (int)
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
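To check that the two forms agree, here is a minimal sketch that compares them directly. It assumes dplyr 1.0.0 or later for the .groups argument, and the object names are only illustrative:

```r
library(dplyr)

# Longhand: group, then count the rows of each group with n()
by_summarise <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "drop")

# Shorthand: count() groups, counts and ungroups in one call
by_count <- count(mtcars, cyl, gear)

# Both hold one row per (cyl, gear) combination with the same counts
identical(by_summarise$n, by_count$n)
sum(by_count$n) == nrow(mtcars)
```

Both comparisons should come back TRUE, since every row of mtcars falls into exactly one group.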

I think what you are looking for is as follows.
cars_by_cylinders_gears <- mtcars %>%
group_by(cyl, gear) %>%
summarise(count = n())
This uses the dplyr package. It is essentially the longhand version of the count() solution provided by docendo discimus.

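Another approach is to call the dplyr functions explicitly with the double colon operator, which helps when another attached package (such as plyr) masks them: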
mtcars %>%
dplyr::group_by(cyl, gear) %>%
dplyr::summarise(length(gear))
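If plyr is attached after dplyr, the bare name summarise() resolves to plyr::summarise(), which knows nothing about dplyr's grouping. That is a likely explanation for a single overall count such as the 32 in the question. A minimal sketch of the fully qualified form, written without any library() calls (the .groups argument assumes dplyr 1.0.0 or later):

```r
# Qualify every dplyr function so masking by other packages cannot matter
grouped <- dplyr::group_by(mtcars, cyl, gear)
counts <- dplyr::summarise(grouped, n = dplyr::n(), .groups = "drop")
counts
```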

Another option, not necessarily more elegant, which does not require referring to a specific column:
mtcars %>%
group_by(cyl, gear) %>%
do(data.frame(nrow=nrow(.)))
This is equivalent to using count():
library(dplyr, warn.conflicts = FALSE)
all.equal(mtcars %>%
group_by(cyl, gear) %>%
do(data.frame(n=nrow(.))) %>%
ungroup(),
count(mtcars, cyl, gear), check.attributes=FALSE)
#> [1] TRUE
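In current dplyr, do() is superseded. A rough equivalent, assuming a dplyr version recent enough to provide group_modify(), might look like this sketch:

```r
library(dplyr)

counts <- mtcars %>%
  group_by(cyl, gear) %>%
  # .x is the data of one group; return a one-row data frame per group
  group_modify(~ data.frame(n = nrow(.x))) %>%
  ungroup()

counts
```

As with do(), no particular column has to be named, because nrow() is applied to the whole group.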

Another option is using the function tally from dplyr. Here is a reproducible example:
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
tally()
#> # A tibble: 8 × 3
#> # Groups: cyl [3]
#> cyl gear n
#> <dbl> <dbl> <int>
#> 1 4 3 1
#> 2 4 4 8
#> 3 4 5 2
#> 4 6 3 2
#> 5 6 4 4
#> 6 6 5 1
#> 7 8 3 12
#> 8 8 5 2
Created on 2022-09-11 with reprex v2.0.2
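tally() and count() collapse the data to one row per group. When the count is wanted as an extra column on the original rows instead, dplyr's add_count() can be used. A small sketch (with_counts is just an illustrative name):

```r
library(dplyr)

# Adds a column n with the group size, keeping all rows of mtcars
with_counts <- mtcars %>%
  add_count(cyl, gear)

nrow(with_counts)   # same number of rows as mtcars
head(with_counts[, c("cyl", "gear", "n")])
```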

Related

Creating Crosstable with Multiple Variables Summarized by Row Categories

I am interested in summarizing several outcomes by sample categories and presenting it all in one table. Something with output that resembles:
     vs      am
cyl   0   1   0   1
4     1  10   3   8
6     3   4   4   3
8    14   0  12   2
were I able to combine ("cbind") the tables generated by:
ftable(mtcars$cyl, mtcars$vs)
and by:
ftable(mtcars$cyl, mtcars$am)
The crosstable() and CrossTable() functions showed promise but I couldn't see how to expand them out to multiple groups of columns without nesting them.
As demonstrated here, ftable can get close with:
ftable(vs + am ~ cyl, mtcars)
except for also nesting am within vs.
Similarly, dplyr gets close via, e.g.,
library(dplyr)
mtcars %>%
group_by(cyl, vs, am) %>%
summarize(count = n())
or something more complex like this
but I have several variables to present and this nesting defeats the ability to summarize in my case.
Perhaps aggregate could work in the hands of a cleverer person than I?
TYIA!

foo = function(df, grp, vars) {
lapply(vars, function(nm) {
tmp = as.data.frame(as.matrix(ftable(reformulate(grp, nm), df)))
names(tmp) = paste0(nm, "_", names(tmp))
tmp
})
}
do.call(cbind, foo(mtcars, "cyl", c("vs", "am", "gear")))
# vs_0 vs_1 am_0 am_1 gear_3 gear_4 gear_5
# 4 1 10 3 8 1 8 2
# 6 3 4 4 3 2 4 1
# 8 14 0 12 2 12 0 2
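The same idea can be sketched with plain table() calls instead of ftable() and a formula. This is only an alternative illustration in base R, and the names vars, tabs and combined are made up for the example:

```r
vars <- c("vs", "am")

# One two-way table per variable, converted to a data frame of counts
tabs <- lapply(vars, function(nm) {
  tab <- as.data.frame.matrix(table(mtcars$cyl, mtcars[[nm]]))
  names(tab) <- paste0(nm, "_", names(tab))  # e.g. vs_0, vs_1
  tab
})

# Row names (the cyl values) line up, so the pieces can be bound side by side
combined <- do.call(cbind, tabs)
combined
```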

A solution based on purrr::map_dfc and tidyr::pivot_wider:
library(tidyverse)
map_dfc(c("vs", "am", "gear"), ~ mtcars %>% pivot_wider(id_cols = cyl,
names_from = .x, values_from = .x, values_fn = length,
names_prefix = str_c(.x, "_"), names_sort = TRUE, values_fill = 0) %>%
{if (.x != "vs") select(.,-cyl) else .}) %>% arrange(cyl)
#> This message is displayed once per session.
#> # A tibble: 3 × 8
#> cyl vs_0 vs_1 am_0 am_1 gear_3 gear_4 gear_5
#> <dbl> <int> <int> <int> <int> <int> <int> <int>
#> 1 4 1 10 3 8 1 8 2
#> 2 6 3 4 4 3 2 4 1
#> 3 8 14 0 12 2 12 0 2

This was not really planned, but you can do this using the package crosstable, with the help of a simple left_join() call:
library(tidyverse)
library(crosstable)
ct1 = crosstable(mtcars, cyl, by=vs)
ct2 = crosstable(mtcars, cyl, by=am)
ct = left_join(ct1, ct2, by=c(".id", "label", "variable"),
suffix=c("_vs", "_am"))
ct
#> # A tibble: 3 × 7
#> .id label variable `0_vs` `1_vs` `0_am` `1_am`
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 cyl cyl 4 1 (9.09%) 10 (90.91%) 3 (27.27%) 8 (72.73%)
#> 2 cyl cyl 6 3 (42.86%) 4 (57.14%) 4 (57.14%) 3 (42.86%)
#> 3 cyl cyl 8 14 (100.00%) 0 (0%) 12 (85.71%) 2 (14.29%)
as_flextable(ct)
Created on 2022-06-16 by the reprex package (v2.0.1)
Maybe I will add a cbind() method for crosstables one day, so that the as_flextable() output looks better.

Passing multiple columns to a UDF as grouping variables in a tidy way

I want to pass multiple columns to one UDF argument in the tidy way (so as bare column names).
Example: I have a simple function which takes a column of the mtcars dataset as an input and uses that as the grouping variable to do an easy count operation with summarise.
library(tidyverse)
test_function <- function(grps){
grps <- enquo(grps)
mtcars %>%
group_by(!!grps) %>%
summarise(Count = n())
}
Result if I execute the function with "cyl" as the grouping variable:
test_function(grps = cyl)
-----------------
cyl Count
<dbl> <int>
1 4 11
2 6 7
3 8 14
Now imagine I want to pass multiple columns to the argument "grps" so that the dataset is grouped by more columns. Here is what I imagine some example function executions could look like:
test_function(grps = c(cyl, gear))
test_function(grps = list(cyl, gear))
Here is what the expected result would look like:
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Is there a way to pass multiple bare columns to one argument of a UDF? I know about the "..." operator already but since I have in reality 2 arguments where I want to possibly pass more than one bare column as an argument the "..." is not feasible.

You can use the across() function with embraced arguments for this, which works for most dplyr verbs. It will accept bare names or character strings:
test_function <- function(grps){
mtcars %>%
group_by(across({{ grps }})) %>%
summarise(Count = n())
}
test_function(grps = c(cyl, gear))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 8 x 3
# Groups: cyl [3]
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
test_function(grps = c("cyl", "gear"))
# Same output
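Because each embraced argument is handled on its own, the same pattern extends to the two-argument case mentioned in the question. The following is a sketch under that assumption; summarise_by, data and vars are hypothetical names, and dplyr 1.0.0 or later is assumed:

```r
library(dplyr)

# grps: columns to group by; vars: columns to average (both bare names)
summarise_by <- function(data, grps, vars) {
  data %>%
    group_by(across({{ grps }})) %>%
    summarise(across({{ vars }}, mean), .groups = "drop")
}

res <- summarise_by(mtcars, grps = c(cyl, gear), vars = c(mpg, wt))
res
```

The result has one row per cyl/gear combination and one mean column for each variable passed in vars.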

group_by() level disappear after filter()/mutate()/count() without using ungroup

This problem has bothered me all day and I don't know why it happens.
The group_by() level disappears after a single step such as filter(), mutate() or count(), so to keep it I have to call group_by() again after each of those steps.
Below I attach an example.
As you can see, if I add group_by after filter, it works fine.
data("mtcars")
> mtcars %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
However, if I call group_by() before filter() and then count the values, the grouping level is lost:
data("mtcars")
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
mpg n
1 21.0 2
2 21.4 1
To make it work, I need to change the code to
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
This method also doesn't work:
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
mpg n
1 21.0 2
2 21.4 1
When I run the same code on another PC, it works as expected:
data("mtcars")
mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
# A tibble: 2 x 3
# Groups: cyl [1]
cyl mpg n
<dbl> <dbl> <int>
1 6 21 2
2 6 21.4 1
I have reinstalled the dplyr package many times and this keeps happening. I am using dplyr version 1.0.2.
I would really appreciate it if someone could help me with this issue!
Edit:
The problem was solved after I updated R to 4.0.2 (my previous version was 3.6.3). I am not sure why dplyr doesn't work properly under 3.6.3, but at least the problem is solved for now.

Try this:
data("mtcars")
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
This can be a masking problem: a function named filter exists in both the dplyr and stats packages. The same issue was discussed here. A similar problem occurs with the select function.
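One way to see which package a bare function name currently resolves to is to look at the function's environment. A small diagnostic sketch (which_filter is an illustrative name):

```r
library(dplyr)

# Namespace that the bare name filter() resolves to in this session
which_filter <- environmentName(environment(filter))
which_filter

# For comparison, the stats version reports its own namespace
environmentName(environment(stats::filter))

# Names that are defined by more than one attached package
conflicts()
```

If which_filter is not "dplyr", some other package attached later is masking it, and the dplyr:: prefix is the safe way to call it.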

Also note in that context the difference between:
data("mtcars")
mtcars %>%
group_by(cyl,gear) %>%
summarize(
n=n()
) %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl [3]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 11
2 4 4 8 11
3 4 5 2 11
4 6 3 2 7
5 6 4 4 7
6 6 5 1 7
mtcars %>%
group_by(cyl,gear) %>%
count() %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl, gear [8]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 1
2 4 4 8 8
3 4 5 2 2
4 6 3 2 2
5 6 4 4 4
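summarise() defaults to dropping only the last grouping variable (.groups="drop_last"), so its result is still grouped by cyl, whereas count() after group_by() keeps the full grouping; that is why mysum differs. And for a funny reason :)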
https://twitter.com/hadleywickham/status/1254802700589555715
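The effect of the .groups argument can be made explicit. A minimal sketch, assuming dplyr 1.0.0 or later (by_drop_last and by_keep are illustrative names):

```r
library(dplyr)

# Default behaviour spelled out: the result stays grouped by cyl only
by_drop_last <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(mysum = sum(n))

# Keep both grouping variables: every group then has exactly one row
by_keep <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "keep") %>%
  mutate(mysum = sum(n))

group_vars(by_drop_last)  # grouping left after summarise()
group_vars(by_keep)
```

With "drop_last" the sum runs over all gears within a cyl value; with "keep" it runs over a single row, so mysum equals n.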

standard eval with `dplyr::count()` [duplicate]

How can I pass a character vector to dplyr::count()?
library(magrittr)
variables <- c("cyl", "vs")
mtcars %>%
dplyr::count_(variables)
This works well, but dplyr v0.8 throws the warning:
count_() is deprecated.
Please use count() instead
The 'programming' vignette or the tidyeval book can help you
to program with count() : https://tidyeval.tidyverse.org
I'm not seeing standard evaluation examples of quoted names or of dplyr::count() in https://tidyeval.tidyverse.org/dplyr.html or other chapters of the current versions of the tidyeval book and Programming with dplyr.
My two best guesses after reading this documentation and another SO question are
mtcars %>%
dplyr::count(!!variables)
mtcars %>%
dplyr::count(!!rlang::sym(variables))
which throw these two errors:
Error: Column <chr> must be length 32 (the number of rows) or one,
not 2
Error: Only strings can be converted to symbols

To create a list of symbols from strings, you want rlang::syms (not rlang::sym). For unquoting a list or a vector, you want to use !!! (not !!). The following will work:
library(magrittr)
variables <- c("cyl", "vs")
vars_sym <- rlang::syms(variables)
vars_sym
#> [[1]]
#> cyl
#>
#> [[2]]
#> vs
mtcars %>%
dplyr::count(!!! vars_sym)
#> # A tibble: 5 x 3
#> cyl vs n
#> <dbl> <dbl> <int>
#> 1 4 0 1
#> 2 4 1 10
#> 3 6 0 3
#> 4 6 1 4
#> 5 8 0 14
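With dplyr 1.0.0 or later, tidyselect helpers offer another route that avoids building symbols by hand. This is a sketch under that version assumption, not part of the answer above:

```r
library(dplyr)

variables <- c("cyl", "vs")

# all_of() selects the columns named in the character vector
counts <- mtcars %>%
  count(across(all_of(variables)))

counts
```

all_of() raises an error if a name in variables is not a column of the data, which tends to catch typos early.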

Maybe you can try
mtcars %>%
group_by(cyl, vs) %>%
tally()
This gives
# A tibble: 5 x 3
# Groups: cyl [3]
cyl vs n
<dbl> <dbl> <int>
1 4 0 1
2 4 1 10
3 6 0 3
4 6 1 4
5 8 0 14

Summarise for multiple group_by variables combined and individually

I am using dplyr's group_by and summarise to get a mean by each group_by variable combined, but also want to get the mean by each group_by variable individually.
For example if I run
mtcars %>%
group_by(cyl, vs) %>%
summarise(new = mean(wt))
I get
cyl vs new
<dbl> <dbl> <dbl>
4 0 2.140000
4 1 2.300300
6 0 2.755000
6 1 3.388750
8 0 3.999214
But I want to get
cyl vs new
<dbl> <dbl> <dbl>
4 0 2.140000
4 1 2.300300
4 NA 2.285727
6 0 2.755000
6 1 3.388750
6 NA 3.117143
8 0 3.999214
NA 0 3.688556
NA 1 2.611286
I.e. get the mean for the variables both combined and individually
Edit
Jaap marked this as duplicate and pointed me in the direction of Using aggregate to apply several functions on several variables in one call. I looked at jaap's answer there which referenced dplyr but I can't see how that answers my question? You say to use summarise_each, but I still don't see how I can use that to get the mean of each of my group by variables individually? Apologies if I am being stupid...

Here is an idea using bind_rows,
library(dplyr)
mtcars %>%
group_by(cyl, vs) %>%
summarise(new = mean(wt)) %>%
bind_rows(.,
mtcars %>% group_by(cyl) %>% summarise(new = mean(wt)) %>% mutate(vs = NA),
mtcars %>% group_by(vs) %>% summarise(new = mean(wt)) %>% mutate(cyl = NA)) %>%
arrange(cyl) %>%
ungroup()
# A tibble: 10 × 3
# cyl vs new
# <dbl> <dbl> <dbl>
#1 4 0 2.140000
#2 4 1 2.300300
#3 4 NA 2.285727
#4 6 0 2.755000
#5 6 1 3.388750
#6 6 NA 3.117143
#7 8 0 3.999214
#8 8 NA 3.999214
#9 NA 0 3.688556
#10 NA 1 2.611286
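bind_rows() fills columns that are missing from one of its inputs with NA, so the mutate() calls can be left out. A slightly shorter sketch of the same idea, assuming dplyr 1.0.0 or later for .groups (combined is an illustrative name):

```r
library(dplyr)

combined <- bind_rows(
  # Means for each cyl/vs combination
  mtcars %>% group_by(cyl, vs) %>% summarise(new = mean(wt), .groups = "drop"),
  # Means for each cyl alone; vs is filled with NA by bind_rows()
  mtcars %>% group_by(cyl) %>% summarise(new = mean(wt)),
  # Means for each vs alone; cyl is filled with NA by bind_rows()
  mtcars %>% group_by(vs) %>% summarise(new = mean(wt))
) %>%
  arrange(cyl, vs)

combined
```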
