Create subsets by group using dplyr - r

Here I have a data set with three fields: Dealer, Product, and Freq.
My aim is to create a data set containing the top 2 sales for each dealer.
I have done it using data.table as below:
library(data.table)
library(dplyr)
dt <- data.table(Dealer = c("A","B","A","A","B","A"),
Product = c("a","b","b","c","d","d"),
Freq = c(10,12,23,24,23,12))
dt[,.SD[order(Freq, decreasing = T)][seq_along(Freq) < 3], by = Dealer]
How can I do the same thing using the 'dplyr' package?

Here, I group by Dealer then find the top 2 values of Freq in each group.
dt %>% group_by(Dealer) %>% top_n(2, Freq) %>% ungroup
# # A tibble: 4 x 3
# Dealer Product Freq
# <fct> <fct> <dbl>
# 1 B b 12
# 2 A b 23
# 3 A c 24
# 4 B d 23

We can use slice or filter after doing the group_by and arrange (same methodology as in the OP's post)
library(dplyr)
dt %>%
group_by(Dealer) %>%
arrange(Dealer, desc(Freq)) %>%
slice(1:2)
# or with
# filter(row_number() < 3)
# A tibble: 4 x 3
# Groups: Dealer [2]
# Dealer Product Freq
# <chr> <chr> <dbl>
#1 A c 24
#2 A b 23
#3 B d 23
#4 B b 12
NOTE: In case of ties, this returns exactly the number of rows specified in slice or filter, whereas top_n keeps all tied rows.
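As a possible alternative, newer dplyr versions (1.0.0 and later) provide slice_max, which makes the tie handling explicit. The following is only a sketch on a plain data.frame copy of the question's data; the name top2 is just for illustration.

```r
library(dplyr)

# Same data as in the question, as a plain data.frame
dt <- data.frame(Dealer  = c("A", "B", "A", "A", "B", "A"),
                 Product = c("a", "b", "b", "c", "d", "d"),
                 Freq    = c(10, 12, 23, 24, 23, 12))

# Top 2 rows per dealer by Freq; with_ties = FALSE caps each group at 2 rows
top2 <- dt %>%
  group_by(Dealer) %>%
  slice_max(Freq, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```

With with_ties = TRUE (the default), tied rows would all be kept, similar to top_n.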

Related

How to filter rows according to the bigger value in another column?

I have a data frame like below
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
My goal is to keep, within every block of 3 consecutive rows, the row with the largest value of d2.
Thank you!
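We may use rollmax from zoo to filter the rows. Note that rollmax slides over overlapping windows rather than fixed blocks of 3, so it matches the desired output for this data but is not guaranteed to in general.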
library(dplyr)
library(zoo)
df1 %>%
filter(d2 == na.locf0(rollmax(d2, k = 3, fill = NA)))
d1 d2
1 b 5
2 e 13
3 g 32
4 l 5
You can create a grouping variable that puts observations into groups of 3. I first create a sequence from 1 to the total number of rows in steps of 3, repeat each element 3 times, and subset the result to the length of the data, in case the number of observations is not a multiple of 3. Then simply filter the rows that hold the largest d2 value in each group.
library(dplyr)
df1 %>%
mutate(group = rep(seq(1, n(), by = 3), each = 3)[1:n()]) %>%
group_by(group) %>%
filter(d2 == max(d2))
# A tibble: 4 x 3
# Groups: group [4]
# d1 d2 group
# <chr> <dbl> <dbl>
# 1 b 5 1
# 2 e 13 4
# 3 g 32 7
# 4 l 5 10
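The same block index can also be built with integer division on the row number, which avoids the rep/subset step. This is a sketch of that idea, not the answerer's code; the column name block is made up for the example.

```r
library(dplyr)

df1 <- data.frame(d1 = letters[1:12],
                  d2 = c(1, 5, 1, 2, 13, 2, 32, 2, 1, 2, 4, 5))

res <- df1 %>%
  # rows 1-3 -> block 0, rows 4-6 -> block 1, and so on;
  # a trailing incomplete block simply forms a smaller group
  group_by(block = (row_number() - 1) %/% 3) %>%
  slice_max(d2, n = 1) %>%   # keeps all rows tied for the maximum
  ungroup() %>%
  select(-block)

res
```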
Yet another solution:
library(tidyverse)
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
df1 %>%
mutate(id = rep(1:(n()/3), each=3)) %>%
group_by(id) %>%
slice_max(d2) %>%
ungroup %>% select(-id)
#> # A tibble: 4 × 2
#> d1 d2
#> <chr> <dbl>
#> 1 b 5
#> 2 e 13
#> 3 g 32
#> 4 l 5

How to Pass column name in group by from a variable

I want to extract the maximum value of a column for each group of a data frame.
I have the column names stored in variables, which I want to pass to group_by, but it is failing.
I have the data frame below:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
The column names are stored in the variables below:
columnselected <- c("Value")
groupbycol <- c("Gene")
My code is:
df %>% group_by(groupbycol) %>% top_n(1, columnselected)
This code gives an error. The expected output is:
Gene Value
A 12
B 6
C 1
D 4
You need to convert the column names to symbols using sym and then unquote them using !!
library(dplyr)
df %>% group_by(!!sym(groupbycol)) %>% top_n(1, !!sym(columnselected))
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
We can use group_by_at together with as.name, without needing an additional package
library(dplyr)
df %>%
group_by_at(groupbycol) %>%
top_n(1, !! as.name(columnselected))
# A tibble: 4 x 2
# Groups: Gene [4]
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
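In more recent dplyr versions, the same thing can probably be written with across(all_of()) for the grouping column and the .data pronoun for the ranking column. This is a sketch assuming dplyr 1.0.0 or later, with res as an illustrative name.

```r
library(dplyr)

df <- read.table(header = TRUE, text = "Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4")

columnselected <- "Value"
groupbycol <- "Gene"

res <- df %>%
  group_by(across(all_of(groupbycol))) %>%      # group by the column named in the string
  slice_max(.data[[columnselected]], n = 1) %>% # largest value per group
  ungroup()

res
```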
NOTE: There would be many dupes for this post :=)

Using dplyr's group_by and summarise to find number of intersections with a different vector

I have a situation where I am trying to find the number of intersections with a vector per group in another tibble.
Data example
a <- tibble(EXPERIMENT = rep(c("a","b","c"),each =4),
ECOTYPE = rep(1:12))
b <- tibble(ECOTYPE = c(1,1,5,4,8,7,6,1,4,4,2,5,6,7,1))
I want to find the number of intersections between ECOTYPE in b and ECOTYPE per EXPERIMENT in a.
I wonder if I can use dplyr to solve this, as the group_by function seems to fit this problem, but when I run:
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, .$ECOTYPE)))
I only get the total number of intersections between a and b.
Am I missing something?
Edit:
Sorry for not posting my desired output. I would like something like this:
# A tibble: 3 x 2
EXPERIMENT INTERSECTIONS
<chr> <dbl>
1 a 8
2 b 7
3 c 0
Depending how you want to count, this will give the number of rows in b matching a:
b %>% mutate(b_flag = 1) %>%
right_join(a) %>%
group_by(EXPERIMENT) %>%
summarize(INTERSECTIONS = sum(b_flag, na.rm = T))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <dbl>
# 1 a 8
# 2 b 7
# 3 c 0
I think the only problem with your code is the unnecessary .$, but it gives the counts of distinct ecotypes in b, ignoring the fact that b has three ECOTYPE = 1 rows, for example.
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, ECOTYPE)))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <int>
# 1 a 3
# 2 b 4
# 3 c 0
This is a result of how intersect works:
intersect(c(1, 2, 3), c(1, 1, 1))
# [1] 1
Join the two and count how many are left:
inner_join(a,b, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% count()
# A tibble: 2 x 2
# Groups: EXPERIMENT [2]
EXPERIMENT n
<chr> <int>
1 a 8
2 b 7
Now, if you add an indicator column to b, you can start to count absences as well:
b %>% mutate(present=TRUE) %>% right_join(a, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% summarise(n(), missing=sum(is.na(present)))
# A tibble: 3 x 3
EXPERIMENT `n()` missing
<chr> <int> <int>
1 a 9 1
2 b 7 0
3 c 4 4
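If the goal is the per-row count from the desired output (8, 7, 0), one more option that avoids a join is to count the rows of b whose ECOTYPE occurs in the current group. A sketch, not taken from any of the answers above:

```r
library(dplyr)

a <- tibble(EXPERIMENT = rep(c("a", "b", "c"), each = 4),
            ECOTYPE = 1:12)
b <- tibble(ECOTYPE = c(1, 1, 5, 4, 8, 7, 6, 1, 4, 4, 2, 5, 6, 7, 1))

res <- a %>%
  group_by(EXPERIMENT) %>%
  # inside summarise, ECOTYPE refers to the current group's values only;
  # %in% is evaluated once per row of b, so duplicates in b are counted
  summarise(INTERSECTIONS = sum(b$ECOTYPE %in% ECOTYPE))

res
```

Unlike the inner_join version, groups without any match (here c) stay in the result with a count of 0.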

Sum multiple variables by group and create new column with their sum

I have a data frame with a grouping variable and I want to sum the other variables by group. It's easy with dplyr.
library(dplyr)
library(magrittr)
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum)
# A tibble: 3 x 3
group n1 n2
<fctr> <int> <int>
1 a 3 5
2 b 3 4
3 c 9 11
But now I want a new column total with the sum of n1 and n2 by group. Like this:
# A tibble: 3 x 4
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
How can I do that with dplyr?
EDIT:
Actually, it's just an example, I have a lot of variables.
I tried these two snippets, but the result does not have the right dimensions...
data %>% group_by(group) %>%
summarise_all(sum) %>%
summarise_if(is.numeric, sum)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate_if(is.numeric, .funs = sum)
You can use mutate after summarize:
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = n1 + n2)
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <int>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
If you need to sum all numeric columns, you can use rowSums with select_if (to select the numeric columns):
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = rowSums(select_if(., is.numeric)))
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <dbl>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
We can use apply together with the dplyr functions.
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = apply(.[, 2:ncol(.)], 1, sum))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Or rowSums with the same strategy. The key is to use . to specify the data frame and [] with x:ncol(.) to keep the columns you want.
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = rowSums(.[, 2:ncol(.)]))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <dbl>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Base R
cbind(aggregate(.~group, data, sum), ttl = sapply(split(data[,-1], data$group), sum))
# group n1 n2 ttl
#a a 3 5 8
#b b 3 4 7
#c c 9 11 20
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)), group by 'group', get the sum of each column in the Subset of data.table (.SD), and then use Reduce to add up the columns of interest row by row
library(data.table)
setDT(data)[, lapply(.SD, sum) , group][, tt1 := Reduce(`+`, .SD),
.SDcols = names(data)[-1]][]
# group n1 n2 tt1
#1: a 3 5 8
#2: b 3 4 7
#3: c 9 11 20
Or with base R
addmargins(as.matrix(rowsum(data[-1], data$group)), 2)
# n1 n2 Sum
#a 3 5 8
#b 3 4 7
#c 9 11 20
Or with dplyr
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt = rowSums(.[-1]))
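Since summarise_all and select_if are superseded in current dplyr, the same idea can be sketched with across. This assumes dplyr 1.0.0 or later; res is just an illustrative name.

```r
library(dplyr)

data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)

res <- data %>%
  group_by(group) %>%
  summarise(across(where(is.numeric), sum)) %>%       # per-group column sums
  mutate(ttl = rowSums(across(where(is.numeric))))    # row total over numeric columns

res
```

Note that rowSums returns a double, so ttl is <dbl> even though n1 and n2 are integers.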

dplyr: access current group variable

After using data.table for quite some time I now thought it's time to try dplyr. It's fun, but I wasn't able to figure out how to
access the current grouping variable
return multiple values per group
The following example works fine with data.table. How would you write this with dplyr?
library(data.table)
foo <- matrix(c(1, 2, 3, 4), ncol = 2)
dt <- data.table(a = c(1, 1, 2), b = c(4, 5, 6))
# data.table (expected)
dt[, .(c = foo[, a]), by = a]
a c
1: 1 1
2: 1 2
3: 2 3
4: 2 4
# dplyr (?)
library(dplyr)
dt %>%
group_by(a) %>%
summarize(c = foo[a])
We can use do from dplyr (no other packages needed). do is very handy for expanding rows. We only need to wrap the result in data.frame.
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, unique(.$a)]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or instead of unique we can subset by the 1st observation
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, .$a[1]]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or with dplyr >= 1.1.0, where reframe and .by were introduced (EDIT: Based on @Todd West comments)
dt %>%
reframe(c = foo[, cur_group()$a], .by = 'a')
a c
1 1 1
2 1 2
3 2 3
4 2 4
This can be also done without using any packages
stack(lapply(split(dt$a, dt$a), function(x) foo[,unique(x)]))[2:1]
# ind values
#1 1 1
#2 1 2
#3 2 3
#4 2 4
You can still access the group variable, but it is a normal vector holding the same value for every row of the group, so wrapping it in unique makes it work. Also, dplyr does not seem to expand rows automatically like data.table does, so you will need unnest from the tidyr package:
library(dplyr); library(tidyr)
dt %>%
group_by(a) %>%
summarize(c = list(foo[,unique(a)])) %>%
unnest(cols = c)
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
Or we can use first to speed things up, since we already know the group variable is the same for every row in a group:
dt %>%
group_by(a) %>%
summarize(c = list(foo[,first(a)])) %>%
unnest(cols = c)
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
To access a grouping variable in a grouped operation (group_map, group_walk, group_modify), we can refer to .y, which is exposed automatically within the evaluation context.
Example
> iris %>% group_by(Species) %>% group_walk(~{ print(.y) })
# A tibble: 1 x 1
Species
<fct>
1 setosa
# A tibble: 1 x 1
Species
<fct>
1 versicolor
# A tibble: 1 x 1
Species
<fct>
1 virginica
This is also documented with more details in https://dplyr.tidyverse.org/reference/group_map.html
The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
Regarding the other proposed solutions: Afaik do is not recommended any longer, and the other solution with unique is imho clumsy (as it requires another reference to the dataframe in question).
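Applied to the question's example, .y could be used with group_modify roughly like this. It is a sketch on a plain data.frame copy of dt, assuming the function passed to group_modify returns a data frame, as its documentation requires.

```r
library(dplyr)

foo <- matrix(c(1, 2, 3, 4), ncol = 2)
dt <- data.frame(a = c(1, 1, 2), b = c(4, 5, 6))

res <- dt %>%
  group_by(a) %>%
  # .y is a one-row tibble holding the current group's key,
  # so .y$a picks the matching column of foo
  group_modify(~ tibble(c = foo[, .y$a])) %>%
  ungroup()

res
```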
