count by all variables / count distinct with dplyr - r

Say I have this data.frame :
library(dplyr)
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
# x y
# 1 a a
# 2 b b
# 3 b b
# 4 c c
# 5 c c
# 6 c c
I can group and count easily by mentioning the names :
df1 %>%
count(x,y)
# A tibble: 3 x 3
# x y n
# <fctr> <fctr> <int>
# 1 a a 1
# 2 b b 2
# 3 c c 3
How do I group by every column without naming the columns individually, in the most compact and readable way?

We can pass the input itself to the ... argument and splice it with !!! :
df1 %>% count(., !!!.)
#> x y n
#> 1 a a 1
#> 2 b b 2
#> 3 c c 3
With base we could do : aggregate(setNames(df1[1],"n"), df1, length)

For those who wouldn't get the voodoo you are using in the accepted answer, if you don't need to use dplyr, you can do it with data.table:
library(data.table)
setDT(df1) # convert df1 to a data.table by reference
df1[, .N, names(df1)]
# x y N
# 1: a a 1
# 2: b b 2
# 3: c c 3

Have you considered the (now superseded) group_by_all()?
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
df1 %>% group_by_all() %>% count
df1 %>% group_by(across()) %>% count()
df1 %>% count(across()) # don't know why this returns a data.frame and not tibble
See the colwise vignette "other verbs" section for explanation... though honestly I get turned around myself sometimes.
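As a minimal sketch of the across() approach above, assuming dplyr >= 1.0.0: spelling out everything() avoids relying on the default column selection of a bare across(). The names by_all and by_name are only illustrative.

```r
library(dplyr)

df1 <- data.frame(x = rep(letters[1:3], 1:3),
                  y = rep(letters[1:3], 1:3))

# Group by every column without naming any of them.
by_all <- df1 %>% count(across(everything()))

# Naming the columns explicitly, for comparison.
by_name <- df1 %>% count(x, y)

print(by_all)
print(identical(by_all$n, by_name$n))
```

Both calls count the same groups; the only difference is whether the grouping columns are listed by hand.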

Related

Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a dataframe using dplyr's select, similar to dropping a single column?
df <- tibble(
a = LETTERS[1:5],
b = 1:5,
c = list('bob', 'cratchit', 'rules!','and', 'tiny tim too"')
)
df %>%
select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a workable workaround, but I would like to know whether it can be done with select_if.
df %>%
select(-which(map(df,class) == 'list'))
Use Negate
df %>%
select_if(Negate(is.list))
# A tibble: 5 x 2
a b
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
There is also purrr::negate that would give the same result.
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
# a b
# <chr> <int>
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
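For newer code, a possible alternative to select_if() is the where() selection helper. This is only a sketch, assuming dplyr >= 1.0.0 (where where() is available inside select()); the name no_lists is illustrative.

```r
library(dplyr)

df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list("bob", "cratchit", "rules!", "and", "tiny tim too")
)

# Keep every column that is not a list column.
no_lists <- df %>% select(!where(is.list))

print(names(no_lists))
```

The negation sits outside where(), so no Negate() wrapper is needed.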

R Show duplicates in dataframe

I am trying to "highlight" duplicates in my dataframe. I found various tutorials on dropping duplicates or on creating a new dataset containing only the duplicates. But since I suspect something went wrong in an earlier stage of my data work, I would (for now) just like to see which observations appear to be duplicates, in order to understand what went wrong. I would like R to create column c:
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
c <- c(2,1,2,1,2,2,1)
df <-data.frame(a,b,c)
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
df <-data.frame(a,b)
library(dplyr)
df %>%
group_by(a,b) %>% # for each combination of a and b
mutate(c = n()) %>% # count times they appear
ungroup()
# # A tibble: 7 x 3
# a b c
# <fct> <dbl> <int>
# 1 C 1 2
# 2 A 1 1
# 3 A 2 2
# 4 B 1 1
# 5 A 2 2
# 6 C 1 2
# 7 C 2 1
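The group_by() / mutate(n()) / ungroup() pattern above can probably be shortened with add_count(). A minimal sketch, assuming a dplyr version in which add_count() accepts the name argument; the variable flagged is illustrative.

```r
library(dplyr)

a <- c("C", "A", "A", "B", "A", "C", "C")
b <- c(1, 1, 2, 1, 2, 1, 2)
df <- data.frame(a, b)

# Add a column c holding how often each (a, b) combination occurs.
# Rows stay in their original order.
flagged <- df %>% add_count(a, b, name = "c")

print(flagged)

# Rows that are part of a duplicated combination:
print(flagged %>% filter(c > 1))
```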

dplyr: access current group variable

After using data.table for quite some time, I thought it was time to try dplyr. It's fun, but I wasn't able to figure out how to access the current grouping variable, or how to return multiple values per group.
The following example works fine with data.table. How would you write this with dplyr?
library(data.table)
foo <- matrix(c(1, 2, 3, 4), ncol = 2)
dt <- data.table(a = c(1, 1, 2), b = c(4, 5, 6))
# data.table (expected)
dt[, .(c = foo[, a]), by = a]
a c
1: 1 1
2: 1 2
3: 2 3
4: 2 4
# dplyr (?)
library(dplyr)
dt %>%
group_by(a) %>%
summarize(c = foo[a])
We can use do from dplyr. (No other packages used). The do is very handy for expanding rows. We only need to wrap with data.frame.
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, unique(.$a)]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or instead of unique we can subset by the 1st observation
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, .$a[1]]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or with dplyr >= 1.1.0, which introduced reframe() and the .by argument (EDIT: based on @Todd West's comments)
dt %>%
reframe(c = foo[, cur_group()$a], .by = 'a')
a c
1 1 1
2 1 2
3 2 3
4 2 4
This can be also done without using any packages
stack(lapply(split(dt$a, dt$a), function(x) foo[,unique(x)]))[2:1]
# ind values
#1 1 1
#2 1 2
#3 2 3
#4 2 4
You can still access the group variable, but it behaves like a normal vector that holds a single repeated value within each group, so wrapping it in unique() works. Also, dplyr does not expand rows automatically the way data.table does; you need unnest() from the tidyr package:
library(dplyr); library(tidyr)
dt %>%
group_by(a) %>%
summarize(c = list(foo[,unique(a)])) %>%
unnest(c) # name the list column; recent tidyr versions warn if it is omitted
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
Or we can use first() to speed things up, since we already know the group variable holds the same value throughout each group:
dt %>%
group_by(a) %>%
summarize(c = list(foo[,first(a)])) %>%
unnest(c) # name the list column; recent tidyr versions warn if it is omitted
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
To access a grouping variable in a grouped operation (group_map, group_modify, group_walk), we can refer to .y, which is exposed automatically within the evaluation context of the formula passed to these functions.
Example
> iris %>% group_by(Species) %>% group_walk(~{ print(.y) })
# A tibble: 1 x 1
Species
<fct>
1 setosa
# A tibble: 1 x 1
Species
<fct>
1 versicolor
# A tibble: 1 x 1
Species
<fct>
1 virginica
This is also documented with more details in https://dplyr.tidyverse.org/reference/group_map.html
The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
Regarding the other proposed solutions: Afaik do is not recommended any longer, and the other solution with unique is imho clumsy (as it requires another reference to the dataframe in question).
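Applied to the example from the question, the .y key could be used with group_modify() to return several rows per group. This is a sketch rather than a definitive answer: it assumes a dplyr version that provides group_modify(), and it uses a plain data frame named df in place of the data.table dt.

```r
library(dplyr)

foo <- matrix(c(1, 2, 3, 4), ncol = 2)
df <- data.frame(a = c(1, 1, 2), b = c(4, 5, 6))

# .y is a one-row tibble holding the current group's key, so .y$a is the
# current value of the grouping variable. The returned data frame may have
# any number of rows; the grouping column is added back automatically.
out <- df %>%
  group_by(a) %>%
  group_modify(~ tibble(c = foo[, .y$a])) %>%
  ungroup()

print(out)
```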

how can I mutate in dplyr without losing order?

Using data.table I can do the following:
library(data.table)
dt = data.table(a = 1:2, b = c(1,2,NA,NA))
# a b
#1: 1 1
#2: 2 2
#3: 1 NA
#4: 2 NA
dt[, b := b[1], by = a]
# a b
#1: 1 1
#2: 2 2
#3: 1 1
#4: 2 2
Attempting the same operation in dplyr, however, leaves the data scrambled/sorted by a:
library(dplyr)
dt = data.table(a = 1:2, b = c(1,2,NA,NA))
dt %.% group_by(a) %.% mutate(b = b[1])
# a b
#1 1 1
#2 1 1
#3 2 2
#4 2 2
(as an aside the above also sorts the original dt, which is somewhat confusing for me given dplyr's philosophy of not modifying in place - I'm guessing that's a bug with how dplyr interfaces with data.table)
What's the dplyr way of achieving the above?
In the current development version of dplyr (which will eventually
become dplyr 0.2) the behaviour differs between data frames and data
tables:
library(dplyr)
library(data.table)
df <- data.frame(a = 1:2, b = c(1,2,NA,NA))
dt <- data.table(df)
df %.% group_by(a) %.% mutate(b = b[1])
## Source: local data frame [4 x 2]
## Groups: a
##
## a b
## 1 1 1
## 2 2 2
## 3 1 1
## 4 2 2
dt %.% group_by(a) %.% mutate(b = b[1])
## Source: local data table [4 x 2]
## Groups: a
##
## a b
## 1 1 1
## 2 1 1
## 3 2 2
## 4 2 2
This happens because group_by() applied to a data.table
automatically does setkey() on the assumption that the index will make
future operations faster.
If there's a strong feeling that this is a bad default, I'm happy to change it.
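For comparison, here is a minimal sketch of the data-frame case with the %>% pipe, assuming a current dplyr release and a plain data frame rather than a data.table. The name filled is illustrative.

```r
library(dplyr)

df <- data.frame(a = rep(1:2, times = 2), b = c(1, 2, NA, NA))

# Replace b with the first value of b in each group of a.
# Grouping a data frame does not reorder its rows.
filled <- df %>%
  group_by(a) %>%
  mutate(b = first(b)) %>%
  ungroup()

print(filled)
```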

Create a variable capturing the most frequent occurrence by group

Define:
df1 <-data.frame(
id=c(rep(1,3),rep(2,3)),
v1=as.character(c("a","b","b",rep("c",3)))
)
s.t.
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id s.t.
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using ddply and a custom function to pick out the most frequent value:
library(plyr) # provides ddply() and .()
myFun <- function(x){
tbl <- table(x$v1)
x$freq <- rep(names(tbl)[which.max(tbl)],nrow(x))
x
}
ddply(df1,.(id),.fun=myFun)
Note that which.max will return the first occurrence of the maximum value, in the case of ties. See ??which.is.max in the nnet package for an option that breaks ties randomly.
Another way uses tidyverse functions: group with group_by() and count the occurrences of the second variable with tally(), sort by the number of occurrences with arrange(), then summarize and pick out the first row with summarize() and first().
Therefore:
df1 %>%
group_by(id, v1) %>%
tally() %>%
arrange(id, desc(n)) %>%
summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
mode <- function(x) names(table(x))[ which.max(table(x)) ]
df1$freq <- ave(df1$v1, df1$id, FUN=mode)
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
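The same idea can be sketched with a grouped mutate() in dplyr, which adds the column directly without a join. The helper most_frequent is hypothetical (not part of any package), and stringsAsFactors = FALSE is set explicitly so that v1 is a character vector regardless of the R version.

```r
library(dplyr)

df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = c("a", "b", "b", rep("c", 3)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most common value in x.
# On ties, the value that sorts first in table() wins.
most_frequent <- function(x) names(which.max(table(x)))

df2 <- df1 %>%
  group_by(id) %>%
  mutate(freq = most_frequent(v1)) %>%
  ungroup()

print(df2)
```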

Resources