Sometimes it is desirable to have a complete dataframe with observations for all combinations of grouping factors, even when these are absent in the original data (i.e. by filling these gaps with NA data).
Consider the following example with mtcars:
mtcars %>% group_by(cyl, gear) %>% dplyr::summarise(N = n())
# A tibble: 8 x 3
# Groups: cyl [3]
cyl gear N
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
When grouping by cyl and gear, no row appears for the combination cyl = 8 and gear = 4, because no such cars exist in the data. Is there a straightforward, ideally tidyverse-based, way to obtain this summary table with an NA row for each missing combination of factors? For example, the desired output would be:
# A tibble: 9 x 3
# Groups: cyl [3]
cyl gear N
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 4 NA
9 8 5 2
We can use complete (from tidyr) after removing the grouping attributes with ungroup:
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl, gear) %>%
dplyr::summarise(N = n()) %>%
ungroup %>%
complete(cyl, gear)
# A tibble: 9 x 3
# cyl gear N
# <dbl> <dbl> <int>
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 4 NA
#9 8 5 2
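If a zero is preferred over NA for the missing combination, complete also takes a fill argument. The following is a minimal sketch under that assumption; the name counts_zero is only illustrative, and count() is used in place of group_by() plus summarise() so that no ungroup() is needed.

```r
library(dplyr)
library(tidyr)

# count() returns an ungrouped tibble here, so complete() sees all rows at once
counts_zero <- mtcars %>%
  count(cyl, gear, name = "N") %>%
  # fill replaces the NA that complete() would otherwise put in N
  complete(cyl, gear, fill = list(N = 0L))

counts_zero
```

The row for cyl = 8, gear = 4 should then carry N = 0 instead of NA.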
Another option is to build every combination of the unique values of the two columns with crossing and then left_join the summary onto it (less direct than the previous approach):
crossing(cyl = unique(mtcars$cyl), gear = unique(mtcars$gear)) %>%
  left_join(mtcars %>%
              group_by(cyl, gear) %>%
              dplyr::summarise(N = n(), .groups = "drop"),
            by = c("cyl", "gear"))
If you convert the grouping columns to factors and use count (an alternative to group_by followed by summarise with n()) with .drop = FALSE, the missing combinations are kept with a count of 0 rather than NA.
library(dplyr)
mtcars %>% mutate(across(c(cyl, gear), factor)) %>% count(cyl, gear, .drop = FALSE, name = "N")
# cyl gear N
# <fct> <fct> <int>
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 4 0
#9 8 5 2
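Since the question asked for NA rather than 0, the zero counts can be turned into NA afterwards. This is a small sketch, assuming that a zero count can only come from an empty factor combination; counts_na is an illustrative name.

```r
library(dplyr)

counts_na <- mtcars %>%
  mutate(across(c(cyl, gear), factor)) %>%
  count(cyl, gear, .drop = FALSE, name = "N") %>%
  # na_if() replaces every value equal to 0L with NA
  mutate(N = na_if(N, 0L))

counts_na
```

Note that cyl and gear are factors in this result, unlike in the complete-based approach, where they stay numeric.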
I used tidyeval to write a short function that takes grouping variables as input, groups the mtcars dataset, and counts the number of occurrences per group:
test_function <- function(grps){
mtcars %>%
group_by(across({{grps}})) %>%
summarise(Count = n())
}
test_function(grps = c(cyl, gear))
---
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Now imagine that, for this example, I want a subtotal row for each cyl group, i.e. how many cars have 4 (or 6, or 8) cylinders. This is what the result should look like:
test_function(grps = c(cyl, gear), subtotalrows = TRUE) ### example function execution
---
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 4 total 11
5 6 3 2
6 6 4 4
7 6 5 1
8 6 total 7
9 8 3 12
10 8 5 2
11 8 total 14
In this case the subtotal rows I am looking for can be produced with the same function, called with one grouping variable fewer:
test_function(grps = cyl)
---
cyl Count
<dbl> <int>
1 4 11
2 6 7
3 8 14
Since I would rather not call the function from within itself (I am not even sure whether that is possible in R), I would like a different approach. As far as I know, the best (and only) way to create subtotal rows is to calculate them separately and then bind them row-wise to the grouped table (e.g. with rbind or bind_rows). In my case that means taking only the first grouping variable, computing the subtotal rows, and then binding them to the table. This is where I run into problems with the tidyeval syntax. Here is pseudocode for what I would like the function to do:
test_function <- function(grps, subtotalrows = TRUE){
grouped_result <- mtcars %>%
group_by(across({{grps}})) %>%
summarise(Count = n())
if(subtotalrows == FALSE){
return(grouped_result)
} else {
#pseudocode
group_for_subcalculation <- grps[[1]] #I want the first element of the grps argument
subtotal_result <- mtcars %>%
group_by(across({{group_for_subcalculation}})) %>%
summarise(Count = n()) %>%
mutate(grps[[2]] := "total") %>%
arrange(grps[[1]], grps[[2]], Count)
return(rbind(grouped_result, subtotal_result))
}
}
So, two questions. First, how can I extract the first column name passed via grps and work with it in the subsequent code? Second, this pseudocode is specific to two columns passed via grps. How would you handle three or more (with loops)?
Try this function -
library(dplyr)
test_function <- function(grps, subtotalrows = TRUE){
grouped_data <- mtcars %>% group_by(across({{grps}}))
groups <- group_vars(grouped_data)
col_to_change <- groups[length(groups)] #Last value in grps
grouped_result <- grouped_data %>% summarise(Count = n())
if(!subtotalrows) return(grouped_result)
else {
result <- grouped_result %>%
summarise(Count = sum(Count),
!!col_to_change := 'Total') %>%
bind_rows(grouped_result %>%
mutate(!!col_to_change := as.character(.data[[col_to_change]]))) %>%
select(all_of(groups), Count) %>%
arrange(across(all_of(groups)))
}
return(result)
}
Test the function -
test_function(grps = c(cyl, gear))
# A tibble: 11 x 3
# cyl gear Count
# <dbl> <chr> <int>
# 1 4 3 1
# 2 4 4 8
# 3 4 5 2
# 4 4 Total 11
# 5 6 3 2
# 6 6 4 4
# 7 6 5 1
# 8 6 Total 7
# 9 8 3 12
#10 8 5 2
#11 8 Total 14
test_function(grps = c(cyl, gear), FALSE)
# cyl gear Count
# <dbl> <dbl> <int>
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
For 3 variables -
test_function(grps = c(cyl, gear, carb))
# cyl gear carb Count
# <dbl> <dbl> <chr> <int>
# 1 4 3 1 1
# 2 4 3 Total 1
# 3 4 4 1 4
# 4 4 4 2 4
# 5 4 4 Total 8
# 6 4 5 2 2
# 7 4 5 Total 2
# 8 6 3 1 2
# 9 6 3 Total 2
#10 6 4 4 4
#11 6 4 Total 4
#12 6 5 6 1
#13 6 5 Total 1
#14 8 3 2 4
#15 8 3 3 3
#16 8 3 4 5
#17 8 3 Total 12
#18 8 5 4 1
#19 8 5 8 1
#20 8 5 Total 2
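The function above relies on the fact that summarise drops the last grouping level of its input, so a second summarise on the grouped result yields the subtotals without naming any column. A minimal sketch of that behaviour in isolation (by_both and subtotals are illustrative names):

```r
library(dplyr)

# After the first summarise, the result is still grouped by cyl only
by_both <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(Count = n(), .groups = "drop_last")

group_vars(by_both)

# A second summarise therefore aggregates over gear within each cyl
subtotals <- by_both %>%
  summarise(Count = sum(Count))

subtotals
```

Here subtotals has one row per cyl, with the same counts that test_function(grps = cyl) returns in the question.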
I have data where each row represents one observation from one person. For example:
library(dplyr)
dat <- tibble(ID = rep(sample(1111:9999, 3), each = 3),
X = 1:9)
# A tibble: 9 x 2
ID X
<int> <int>
1 9573 1
2 9573 2
3 9573 3
4 7224 4
5 7224 5
6 7224 6
7 7917 7
8 7917 8
9 7917 9
I want to replace these IDs with different values. They can be anything, but the easiest (and preferred) solution is to replace them with group numbers 1:n. So the desired result would be:
# A tibble: 9 x 2
ID X
<int> <int>
1 1 1
2 1 2
3 1 3
4 2 4
5 2 5
6 2 6
7 3 7
8 3 8
9 3 9
Probably something that starts with:
dat %>%
group_by(ID) %>%
???
A fast option would be match
library(dplyr)
dat %>%
mutate(ID = match(ID, unique(ID)))
-output
# A tibble: 9 x 2
# ID X
# <int> <int>
#1 1 1
#2 1 2
#3 1 3
#4 2 4
#5 2 5
#6 2 6
#7 3 7
#8 3 8
#9 3 9
Or use as.integer on a factor
dat %>%
mutate(ID = as.integer(factor(ID, levels = unique(ID))))
In the tidyverse, we can also use cur_group_id
dat %>%
group_by(ID = factor(ID, levels = unique(ID))) %>%
mutate(ID = cur_group_id()) %>%
ungroup
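The options above number the IDs in order of first appearance, which is why levels = unique(ID) is passed to factor. For comparison, here is a small sketch using dense_rank, which numbers the groups by sorted ID value instead. The three IDs are fixed to the values printed in the question so the result is reproducible; by_appearance and by_value are illustrative names.

```r
library(dplyr)

# Fixed IDs taken from the printed example (the question used sample())
dat <- tibble(ID = rep(c(9573L, 7224L, 7917L), each = 3),
              X = 1:9)

# Numbered in order of first appearance: 9573 -> 1, 7224 -> 2, 7917 -> 3
by_appearance <- dat %>% mutate(ID = match(ID, unique(ID)))

# Numbered by sorted value: 7224 -> 1, 7917 -> 2, 9573 -> 3
by_value <- dat %>% mutate(ID = dense_rank(ID))
```

Either is a valid relabelling; which one fits depends on whether the original row order or the ID order should determine the new numbers.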
Somewhat hard to define this question without sounding like lots of similar questions!
I have a function for which I want one of the parameters to be a function name, that will be passed to dplyr::summarise, e.g. "mean" or "sum":
data(mtcars)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
zColquo = quo_name(zCol)
cellSummaries <- x %>%
group_by(gear, !!sym(groupcol)) %>% # 1 preset grouper, 1 user-defined
summarise(Count = n(), # 1 preset summary, 1 user defined
!!zColquo := mean(!!sym(zColquo))) %>% # mean should be zFun, user-defined
ungroup
}
(this groups by gear and cyl, then returns, per group, count and mean(disp))
Per my note, I'd like 'mean' to be dynamic, performing the function defined by zFun, but I can't for the life of me work out how to do it! Thanks in advance for any advice.
You can use match.fun to make the function dynamic. I also removed zColquo as it's not needed.
library(dplyr)
library(rlang)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
cellSummaries <- x %>%
group_by(gear, !!sym(groupcol)) %>%
summarise(Count = n(),
!!zCol := match.fun(zFun)(!!sym(zCol))) %>%
ungroup
return(cellSummaries)
}
You can then check output
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326
f(zFun = "sum")
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 483
#3 3 8 12 4291.
#4 4 4 8 821
#5 4 6 4 655.
#6 5 4 2 215.
#7 5 6 1 145
#8 5 8 2 652
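match.fun also accepts a function object, not only a function name, so the same idea works when the caller passes mean or sum directly. The variant below is a sketch under that assumption; f_fun is a hypothetical name, and across with all_of is used here in place of !!sym() to select the columns by string.

```r
library(dplyr)

f_fun <- function(x = mtcars,
                  groupcol = "cyl",
                  zCol = "disp",
                  zFun = mean) {
  # Resolves a name such as "sum" to a function; leaves a function unchanged
  zFun <- match.fun(zFun)
  x %>%
    group_by(gear, across(all_of(groupcol))) %>%
    summarise(Count = n(),
              across(all_of(zCol), zFun),
              .groups = "drop")
}

f_fun(zFun = "sum")  # function given by name
f_fun(zFun = sum)    # function given as an object
```

Both calls should return the same table as f(zFun = "sum") above.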
We can use get
library(dplyr)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
zColquo = quo_name(zCol)
x %>%
group_by(gear, !!sym(groupcol)) %>% # 1 preset grouper, 1 user-defined
summarise(Count = n(), # 1 preset summary, 1 user defined
!!zColquo := get(zFun)(!!sym(zCol))) %>%
ungroup
}
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326
f(zFun = "sum")
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 483
#3 3 8 12 4291.
#4 4 4 8 821
#5 4 6 4 655.
#6 5 4 2 215.
#7 5 6 1 145
#8 5 8 2 652
In addition, we can drop the sym calls in group_by and summarise by wrapping the column names in across. Character vectors holding column names should be passed through all_of, since using them bare inside a selection is deprecated
f <- function(x = mtcars,
              groupcol = "cyl",
              zCol = "disp",
              zFun = "mean") {
  x %>%
    group_by(across(c(gear, all_of(groupcol)))) %>% # 1 preset grouper, 1 user-defined
    summarise(Count = n(), # 1 preset summary, 1 user defined
              across(all_of(zCol), ~ get(zFun)(.))) %>%
    ungroup
}
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326
If I have a grouping:
mtcars %>% group_by(cyl,carb)
How can I add a column that counts the number of unique group combinations; so carb groups within cyl groups? This would be something like:
cyl carb combination
6 2 1
6 4 2
6 6 3
4 2 1
4 4 2
4 6 3
Maybe there's a better way to avoid the n column, but below should be a good start:
mtcars %>% count(cyl,carb) %>% group_by(cyl) %>% mutate(combination=1:n())
# A tibble: 9 x 4
# Groups: cyl [3]
cyl carb n combination
<dbl> <dbl> <int> <int>
1 4 1 5 1
2 4 2 6 2
3 6 1 2 1
4 6 4 4 2
5 6 6 1 3
6 8 2 4 1
7 8 3 3 2
8 8 4 6 3
9 8 8 1 4
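One way to avoid the n column, sketched here as an alternative rather than as the answerer's own method, is to take the distinct combinations first and number them within each cyl group with row_number (combos is an illustrative name):

```r
library(dplyr)

combos <- mtcars %>%
  distinct(cyl, carb) %>%      # one row per existing cyl/carb combination
  arrange(cyl, carb) %>%       # fix the order before numbering
  group_by(cyl) %>%
  mutate(combination = row_number()) %>%
  ungroup()

combos
```

The result has the same nine combinations as above, without the count column.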
There are many ways to do this. This is how I did it:
library(dplyr)
mtcars %>% group_by(cyl,carb) %>% summarize("count" = length(carb))
Similar to this question but I want to use tidy evaluation instead.
df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
speed = c(3,4,3,4,5,6,6,4,9))
> df
group date speed
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
The task is to create a new column (newValue) that, within each group, holds the value of the date column from the row where speed == 4. Example: group 1 has a newValue of 2 because date[speed == 4] is 2.
group date speed newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
It worked without tidy evaluation
df %>%
group_by(group) %>%
mutate(newValue=date[speed==4L])
#> # A tibble: 9 x 4
#> # Groups: group [3]
#> group date speed newValue
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 2
#> 2 1 2 4 2
#> 3 1 3 3 2
#> 4 2 4 4 4
#> 5 2 5 5 4
#> 6 2 6 6 4
#> 7 3 7 6 8
#> 8 3 8 4 8
#> 9 3 9 9 8
But it fails with tidy evaluation
my_fu <- function(df, filter_var){
filter_var <- sym(filter_var)
df <- df %>%
group_by(group) %>%
mutate(newValue=!!filter_var[speed==4L])
}
my_fu(df, "date")
#> Error in quos(..., .named = TRUE): object 'speed' not found
Thanks in advance.
We can wrap the unquoted symbol in parentheses. Otherwise !! is applied to the whole expression filter_var[speed == 4L] instead of to filter_var alone, because [ binds more tightly than !
library(rlang)
library(dplyr)
my_fu <- function(df, filter_var){
filter_var <- sym(filter_var)
df %>%
group_by(group) %>%
mutate(newValue=(!!filter_var)[speed==4L])
}
my_fu(df, "date")
# A tibble: 9 x 4
# Groups: group [3]
# group date speed newValue
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 3 2
#2 1 2 4 2
#3 1 3 3 2
#4 2 4 4 4
#5 2 5 5 4
#6 2 6 6 4
#7 3 7 6 8
#8 3 8 4 8
#9 3 9 9 8
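An alternative that sidesteps the precedence issue altogether is the .data pronoun, which looks a column up by a string. The following is a minimal sketch, with my_fu_data as a hypothetical name and the data frame from the question repeated so that the block runs on its own.

```r
library(dplyr)

df <- data.frame(group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 speed = c(3, 4, 3, 4, 5, 6, 6, 4, 9))

my_fu_data <- function(df, filter_var) {
  df %>%
    group_by(group) %>%
    # .data[[filter_var]] is the current group's slice of that column;
    # this assumes exactly one row per group has speed == 4
    mutate(newValue = .data[[filter_var]][speed == 4]) %>%
    ungroup()
}

res <- my_fu_data(df, "date")
res
```

No sym() or !! is needed in this form, since filter_var stays an ordinary string.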
You can also use sqldf: join df to the subset of its rows where speed = 4, matching on group:
library(sqldf)
df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
speed = c(3,4,3,4,5,6,6,4,9))
sqldf("SELECT df_origin.*, df4.`date` new_value FROM
df df_origin join (SELECT `group`, `date` FROM df WHERE speed = 4) df4
on (df_origin.`group` = df4.`group`)")