Maintain original column names with a dplyr function in a loop in R

Let's say I have a dataset like mtcars and I would like to loop over different columns to calculate counts (among other things):
library(dplyr)
df <- mtcars
groups <- c('cyl', 'hp')
for(g in groups) {
group_counts <- df %>%
group_by(get(g)) %>%
count()
print(group_counts)
}
Which gives me the following for the cyl column and something similar for the hp column:
`get(g)` n
4 11
6 7
8 14
How do I get the first column to keep the original column name instead of `get(g)`? Like this:
cyl n
4 11
6 7
8 14

You can use across(all_of()) like this:
for(g in groups) {
group_counts <- df %>%
group_by(across(all_of(g))) %>%
count()
print(group_counts)
}
Output:
# A tibble: 3 × 2
# Groups: cyl [3]
cyl n
<dbl> <int>
1 4 11
2 6 7
3 8 14
# A tibble: 22 × 2
# Groups: hp [22]
hp n
<dbl> <int>
1 52 1
2 62 1
3 65 1
4 66 2
5 91 1
6 93 1
7 95 1
8 97 1
9 105 1
10 109 1
# … with 12 more rows
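If the per-column counts are needed later rather than only printed, one option is to collect them in a named list. The following is a minimal sketch, assuming the same df and groups as above; the name counts is only illustrative:

```r
library(dplyr)

df <- mtcars
groups <- c("cyl", "hp")

# Build one count table per grouping column and keep them in a list
# named after the columns (setNames gives lapply the names to reuse).
counts <- lapply(setNames(groups, groups), function(g) {
  df %>%
    group_by(across(all_of(g))) %>%  # all_of() selects the column named in g
    count() %>%
    ungroup()
})

names(counts$cyl)  # the grouping column keeps its original name
```

Each element can then be accessed by column name, for example counts$hp.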

groups is a character vector. To reference the columns with the corresponding names, we can convert each string to a symbol with sym() and unquote it with !!:
for(g in groups) {
group_counts <- df %>%
group_by(!!sym(g)) %>%
count()
print(group_counts)
}
# A tibble: 3 × 2
# Groups: cyl [3]
cyl n
<dbl> <int>
1 4 11
2 6 7
3 8 14
# A tibble: 22 × 2
# Groups: hp [22]
hp n
<dbl> <int>
1 52 1
2 62 1
3 65 1
4 66 2
5 91 1
6 93 1
7 95 1
8 97 1
9 105 1
10 109 1
# … with 12 more rows
# ℹ Use `print(n = ...)` to see more rows
We can also convert groups into a list of symbols with syms() outside the loop, and unquote each one with !! inside the loop:
my_function <- function(df, groups) {
groups <- syms(groups)
for(g in groups) {
group_counts <- df %>%
group_by(!!g) %>%
count()
print(group_counts)
}
}
my_function(df, groups)
# A tibble: 3 × 2
# Groups: cyl [3]
cyl n
<dbl> <int>
1 4 11
2 6 7
3 8 14
# A tibble: 22 × 2
# Groups: hp [22]
hp n
<dbl> <int>
1 52 1
2 62 1
3 65 1
4 66 2
5 91 1
6 93 1
7 95 1
8 97 1
9 105 1
10 109 1
# … with 12 more rows
# ℹ Use `print(n = ...)` to see more rows

Related

How to get specific values out of a list of values passed to one argument of a UDF with tidyeval

I used tidyeval to write a short function which takes grouping variables as an input, groups the mtcars dataset and counts the number of occurrences per group:
test_function <- function(grps){
mtcars %>%
group_by(across({{grps}})) %>%
summarise(Count = n())
}
test_function(grps = c(cyl, gear))
---
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
Now imagine that for this example I want a subtotal row for each cyl group, i.e. how many cars have 4 (6, 8) cylinders? This is what the result should look like:
test_function(grps = c(cyl, gear), subtotalrows = TRUE) ### example function execution
---
cyl gear Count
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 4 total 11
5 6 3 2
6 6 4 4
7 6 5 1
8 6 total 7
9 8 3 12
10 8 5 2
11 8 total 14
In this case the subtotal rows I am looking for can simply be produced with the same function but with one less grouping variable:
test_function(grps = cyl)
---
cyl Count
<dbl> <int>
1 4 11
2 6 7
3 8 14
But since I don't want to call the function from within itself (I'm not even sure whether this is possible in R), I would like to go for a different approach. As far as I know, the usual way to create subtotal rows is to calculate them independently and then bind them row-wise to the grouped table (e.g. with rbind or bind_rows). In my case that means taking only the first grouping variable, creating the subtotal rows and later binding them to the table. But this is where I have problems with the tidyeval syntax. Here, in pseudocode, is what I would like to do in the function:
test_function <- function(grps, subtotalrows = TRUE){
grouped_result <- mtcars %>%
group_by(across({{grps}})) %>%
summarise(Count = n())
if(subtotalrows == FALSE){
return(grouped_result)
} else {
#pseudocode
group_for_subcalculation <- grps[[1]] #I want the first element of the grps argument
subtotal_result <- mtcars %>%
group_by(across({{group_for_subcalculation}})) %>%
summarise(Count = n()) %>%
mutate(grps[[2]] := "total") %>%
arrange(grps[[1]], grps[[2]], Count)
return(rbind(grouped_result, subtotal_result))
}
}
So, two questions. First, how can I extract the first column name passed via grps and work with it in the following code? Second, this pseudocode example is specific to 2 columns passed via grps. How would you handle 3 or more columns (with loops)?
Try this function -
library(dplyr)
test_function <- function(grps, subtotalrows = TRUE){
grouped_data <- mtcars %>% group_by(across({{grps}}))
groups <- group_vars(grouped_data)
col_to_change <- groups[length(groups)] #Last value in grps
grouped_result <- grouped_data %>% summarise(Count = n())
if(!subtotalrows) return(grouped_result)
else {
result <- grouped_result %>%
summarise(Count = sum(Count),
!!col_to_change := 'Total') %>%
bind_rows(grouped_result %>%
mutate(!!col_to_change := as.character(.data[[col_to_change]]))) %>%
select(all_of(groups), Count) %>%
arrange(across(all_of(groups)))
}
return(result)
}
Test the function -
test_function(grps = c(cyl, gear))
# A tibble: 11 x 3
# cyl gear Count
# <dbl> <chr> <int>
# 1 4 3 1
# 2 4 4 8
# 3 4 5 2
# 4 4 Total 11
# 5 6 3 2
# 6 6 4 4
# 7 6 5 1
# 8 6 Total 7
# 9 8 3 12
#10 8 5 2
#11 8 Total 14
test_function(grps = c(cyl, gear), FALSE)
# cyl gear Count
# <dbl> <dbl> <int>
#1 4 3 1
#2 4 4 8
#3 4 5 2
#4 6 3 2
#5 6 4 4
#6 6 5 1
#7 8 3 12
#8 8 5 2
For 3 variables -
test_function(grps = c(cyl, gear, carb))
# cyl gear carb Count
# <dbl> <dbl> <chr> <int>
# 1 4 3 1 1
# 2 4 3 Total 1
# 3 4 4 1 4
# 4 4 4 2 4
# 5 4 4 Total 8
# 6 4 5 2 2
# 7 4 5 Total 2
# 8 6 3 1 2
# 9 6 3 Total 2
#10 6 4 4 4
#11 6 4 Total 4
#12 6 5 6 1
#13 6 5 Total 1
#14 8 3 2 4
#15 8 3 3 3
#16 8 3 4 5
#17 8 3 Total 12
#18 8 5 4 1
#19 8 5 8 1
#20 8 5 Total 2
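The function above relies on the fact that summarise() drops the last grouping level, so summarising the grouped result a second time collapses it to subtotals. A minimal sketch of that behaviour on its own, assuming dplyr 1.0.0 or later for the .groups argument (the names by_cyl_gear and subtotals are only illustrative):

```r
library(dplyr)

# First summarise: one row per (cyl, gear); the result stays grouped by cyl only.
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(Count = n(), .groups = "drop_last")

group_vars(by_cyl_gear)  # only "cyl" remains as a grouping variable

# Second summarise: collapses gear, giving one subtotal per cyl.
subtotals <- by_cyl_gear %>%
  summarise(Count = sum(Count), .groups = "drop")

subtotals
```

The subtotals here match the counts from test_function(grps = cyl) in the question.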

R user-defined/dynamic summary function within dplyr::summarise

Somewhat hard to define this question without sounding like lots of similar questions!
I have a function for which I want one of the parameters to be a function name, that will be passed to dplyr::summarise, e.g. "mean" or "sum":
data(mtcars)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
zColquo = quo_name(zCol)
cellSummaries <- x %>%
group_by(gear, !!sym(groupcol)) %>% # 1 preset grouper, 1 user-defined
summarise(Count = n(), # 1 preset summary, 1 user defined
!!zColquo := mean(!!sym(zColquo))) %>% # mean should be zFun, user-defined
ungroup()
}
(this groups by gear and cyl, then returns, per group, count and mean(disp))
Per my note, I'd like 'mean' to be dynamic, performing the function defined by zFun, but I can't for the life of me work out how to do it! Thanks in advance for any advice.
You can use match.fun to make the function dynamic. I also removed zColquo as it's not needed.
library(dplyr)
library(rlang)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
cellSummaries <- x %>%
group_by(gear, !!sym(groupcol)) %>%
summarise(Count = n(),
!!zCol := match.fun(zFun)(!!sym(zCol))) %>%
ungroup
return(cellSummaries)
}
You can then check output
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326
f(zFun = "sum")
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 483
#3 3 8 12 4291.
#4 4 4 8 821
#5 4 6 4 655.
#6 5 4 2 215.
#7 5 6 1 145
#8 5 8 2 652
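Since match.fun() accepts either a function or a function name, the same idea also works when the caller passes the function object itself. The following is a sketch under that assumption; summarise_by is a hypothetical name, and the .data pronoun is used here in place of !!sym() for the column lookup:

```r
library(dplyr)

# Hypothetical variant: zFun may be a function (mean) or its name ("mean").
summarise_by <- function(x = mtcars,
                         groupcol = "cyl",
                         zCol = "disp",
                         zFun = mean) {
  zFun <- match.fun(zFun)  # resolves a name to a function, passes functions through
  x %>%
    group_by(gear, across(all_of(groupcol))) %>%
    summarise(Count = n(),
              !!zCol := zFun(.data[[zCol]]),  # .data[[...]] looks up the column by name
              .groups = "drop")
}

res_sum <- summarise_by(zFun = "sum")  # function given by name
res_max <- summarise_by(zFun = max)    # function given as an object
```

Both calls return one row per gear and cyl combination, with the summary stored under the original column name.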
We can also use get() to look up the function by name:
library(dplyr)
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
zColquo = quo_name(zCol)
x %>%
group_by(gear, !!sym(groupcol)) %>% # 1 preset grouper, 1 user-defined
summarise(Count = n(), # 1 preset summary, 1 user defined
!!zColquo := get(zFun)(!!sym(zCol))) %>%
ungroup
}
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326
f(zFun = "sum")
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 483
#3 3 8 12 4291.
#4 4 4 8 821
#5 4 6 4 655.
#6 5 4 2 215.
#7 5 6 1 145
#8 5 8 2 652
In addition, we could drop the sym() calls in group_by and summarise by wrapping the selections in across(). Because groupcol and zCol are character vectors defined outside the data, they are wrapped in all_of() so the selection is unambiguous:
f <- function(x = mtcars,
groupcol = "cyl",
zCol = "disp",
zFun = "mean") {
x %>%
group_by(across(c(gear, all_of(groupcol)))) %>% # 1 preset grouper, 1 user-defined
summarise(Count = n(), # 1 preset summary, 1 user defined
across(all_of(zCol), ~ get(zFun)(.))) %>%
ungroup
}
f()
# A tibble: 8 x 4
# gear cyl Count disp
# <dbl> <dbl> <int> <dbl>
#1 3 4 1 120.
#2 3 6 2 242.
#3 3 8 12 358.
#4 4 4 8 103.
#5 4 6 4 164.
#6 5 4 2 108.
#7 5 6 1 145
#8 5 8 2 326

Adding an incremental count of sub-groups using dplyr

If I have a grouping:
mtcars %>% group_by(cyl,carb)
How can I add a column that counts the number of unique group combinations; so carb groups within cyl groups? This would be something like:
cyl carb combination
6 2 1
6 4 2
6 6 3
4 2 1
4 4 2
4 6 3
Maybe there's a better way to avoid the n column, but below should be a good start:
mtcars %>% count(cyl,carb) %>% group_by(cyl) %>% mutate(combination=1:n())
# A tibble: 9 x 4
# Groups: cyl [3]
cyl carb n combination
<dbl> <dbl> <int> <int>
1 4 1 5 1
2 4 2 6 2
3 6 1 2 1
4 6 4 4 2
5 6 6 1 3
6 8 2 4 1
7 8 3 3 2
8 8 4 6 3
9 8 8 1 4
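If the n column is not wanted, one possible variation is to start from the distinct combinations instead of the counts. This is a sketch only, not the answerer's method; combos is an illustrative name:

```r
library(dplyr)

combos <- mtcars %>%
  distinct(cyl, carb) %>%                 # one row per (cyl, carb) combination
  arrange(cyl, carb) %>%                  # make the numbering order explicit
  group_by(cyl) %>%
  mutate(combination = row_number()) %>%  # 1, 2, ... within each cyl
  ungroup()

combos
```

row_number() is used here instead of 1:n(); both number the rows within each group.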
There are many ways to do this; this is the way I did it:
library(dplyr)
mtcars %>% group_by(cyl,carb) %>% summarize("count" = length(carb))

Standard Deviation coming up NA when using summarise() function

I am trying to calculate descriptive statistics for the birthweight data set (birthwt) from the MASS package. However, I'm only interested in a few variables: age, ftv, ptl and lwt.
This is the code I have so far:
library(MASS)
library(dplyr)
data("birthwt")
grouped <- group_by(birthwt, age, ftv, ptl, lwt)
summarise(grouped,
mean = mean(bwt),
median = median(bwt),
SD = sd(bwt))
It gives me a pretty-printed table, but only a few of the SD values are filled in and the rest are NA. I just can't work out why or how to fix it!
I landed here for a different reason; in my case, the answer comes from the docs:
# BEWARE: reusing variables may lead to unexpected results
mtcars %>%
group_by(cyl) %>%
summarise(disp = mean(disp), sd = sd(disp))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> cyl disp sd
#> <dbl> <dbl> <dbl>
#> 1 4 105. NA
#> 2 6 183. NA
#> 3 8 353. NA
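In that example disp has already been overwritten by its per-group mean, a single value, so sd(disp) returns NA. If you are hitting the same problem, create new variables instead of reusing an existing name: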
mtcars %>%
group_by(cyl) %>%
summarise(
disp_mean = mean(disp),
disp_sd = sd(disp)
)
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
cyl disp_mean disp_sd
<dbl> <dbl> <dbl>
1 4 105. 26.9
2 6 183. 41.6
3 8 353. 67.8
The number of rows for some of the groups is 1.
grouped %>%
summarise(n = n())
# A tibble: 179 x 5
# Groups: age, ftv, ptl [?]
# age ftv ptl lwt n
# <int> <int> <int> <int> <int>
# 1 14 0 0 135 1
# 2 14 0 1 101 1
# 3 14 2 0 100 1
# 4 15 0 0 98 1
# 5 15 0 0 110 1
# 6 15 0 0 115 1
# 7 16 0 0 110 1
# 8 16 0 0 112 1
# 9 16 0 0 135 2
#10 16 1 0 95 1
According to ?sd,
The standard deviation of a length-one vector is NA.
This results in NA for the sd wherever a group contains only one row.
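The effect can be reproduced without the birthwt data. The following minimal sketch uses mtcars as a stand-in (an assumption made only to keep the example self-contained) and reports the group size next to the standard deviation:

```r
library(dplyr)

sd(21)  # a length-one vector has no standard deviation, so this is NA

res <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(n = n(),        # group size, to see which groups have one row
            SD = sd(mpg),
            .groups = "drop")

res
```

Rows with n equal to 1 have NA in SD; grouping by fewer variables, or filtering out single-row groups, avoids them.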

select multiple rows from a dataframe with a value equal to highest value by group

I have a data frame that looks like this:
query <- c('a','a','a','b','b','b','c','c','c')
hit <- c(1,2,3,4,5,6,7,8,9)
score <- c(97,97,96,98,97,96,99,99,98)
df <- data.frame(query,hit,score)
df
query hit score
1 a 1 97
2 a 2 97
3 a 3 96
4 b 4 98
5 b 5 97
6 b 6 96
7 c 7 99
8 c 8 99
9 c 9 98
I want to group on the first column, and select all rows with a score equal to the highest score for that group. About the closest I can figure out is to use top_n like this:
df %>%
group_by(query) %>%
top_n(2,score)
# A tibble: 6 x 3
# Groups: query [3]
query hit score
<fctr> <dbl> <dbl>
1 a 1 97
2 a 2 97
3 b 4 98
4 b 5 97
5 c 7 99
6 c 8 99
But obviously all that's doing is giving me the top two (or whatever I specify). The result I want to end up with would look more like this:
query hit score
<fctr> <dbl> <dbl>
1 a 1 97
2 a 2 97
3 b 4 98
5 c 7 99
6 c 8 99
As usual, I assume I'm missing something pretty simple.
In dplyr, just filter on score == max(score):
group_by(df, query) %>%
filter(score == max(score))
# A tibble: 5 x 3
# Groups: query [3]
# query hit score
# <fctr> <dbl> <dbl>
# 1 a 1 97
# 2 a 2 97
# 3 b 4 98
# 4 c 7 99
# 5 c 8 99
You could also easily do this in base R, with ave():
df[with(df, ave(score, query, FUN = max) == score), ]
# query hit score
# 1 a 1 97
# 2 a 2 97
# 4 b 4 98
# 7 c 7 99
# 8 c 8 99
Your syntax is basically correct, just specify n = 1 instead of n = 2 in top_n.
query <- c('a','a','a','b','b','b','c','c','c')
hit <- c(1,2,3,4,5,6,7,8,9)
score <- c(97,97,96,98,97,96,99,99,98)
df <- data.frame(query,hit,score)
df %>%
group_by(query) %>%
top_n(n = 1, wt = score)
#> # A tibble: 5 x 3
#> # Groups: query [3]
#> query hit score
#> <fctr> <dbl> <dbl>
#> 1 a 1 97
#> 2 a 2 97
#> 3 b 4 98
#> 4 c 7 99
#> 5 c 8 99
When using top_n, if there is a tie, all observations with that score will be returned. So you can specify that you want the 1 highest score (n = 1), and then all observations with that score, within each group, will be returned.
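In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which also keeps ties by default (with_ties = TRUE). A sketch of the same selection, assuming the df defined above; best is an illustrative name:

```r
library(dplyr)

query <- c('a','a','a','b','b','b','c','c','c')
hit <- c(1,2,3,4,5,6,7,8,9)
score <- c(97,97,96,98,97,96,99,99,98)
df <- data.frame(query, hit, score)

best <- df %>%
  group_by(query) %>%
  slice_max(score, n = 1) %>%  # ties are kept because with_ties defaults to TRUE
  ungroup()

best
```

Setting with_ties = FALSE would instead return exactly one row per group.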

Resources