I have datasets that involve a large number of column joins (8-12), and depending on the circumstances, 1-3 of these columns may not be needed.
At present I write out these long group-bys using dplyr, but with so many columns and changing situations, it is easy to misspell or forget a column.
I'd like to store the grouping columns in a variable, but I haven't been able to figure out how, because of the quotes that appear when I try to use paste. Can anyone show me a quick example of how to do this?
For example:
library(dplyr)
# I want this group-list not to have quotes so I can drop in my group_by below
my_group_list = paste0("vs"," ","am") #quotes get in the way
mtcars %>% group_by(my_group_list) %>% summarise(countofvalues = n())
If there are many columns, we can specify the columns to group by directly, by subsetting the column names. In that case, use group_by_ (note that group_by_ has been deprecated since dplyr 0.7.0).
library(dplyr)
mtcars %>%
group_by_(.dots=names(.)[8:9]) %>%
summarise(countofvalues = n())
# vs am countofvalues
# (dbl) (dbl) (int)
#1 0 0 12
#2 0 1 6
#3 1 0 7
#4 1 1 7
The above also works if we have a vector of values
my_group_list <- c("vs", "am")
mtcars %>%
group_by_(.dots = my_group_list) %>%
summarise(countofvalues = n())
# vs am countofvalues
# (dbl) (dbl) (int)
#1 0 0 12
#2 0 1 6
#3 1 0 7
#4 1 1 7
As the OP mentioned that it is not doing the grouping, we can test it by uniting the 'vs' and 'am' columns, using the result as the grouping variable, and then calling n().
library(tidyr)
mtcars %>%
unite(vs_am, vs, am) %>%
group_by(vs_am) %>%
summarise(countofvalues = n())
# vs_am countofvalues
# (chr) (int)
#1 0_0 12
#2 0_1 6
#3 1_0 7
#4 1_1 7
I know this is a pretty stale thread, but I happened on to it and found a more recent answer. You can use group_by_at() and tidy select helpers (I found it on this dplyr issue). For example:
my_group_list <- c("vs", "am")
mtcars %>%
group_by_at(all_of(my_group_list)) %>%
summarise(countofvalues = n())
# `summarise()` regrouping output by 'vs' (override with `.groups` argument)
# A tibble: 4 x 3
# Groups: vs [2]
# vs am countofvalues
# <dbl> <dbl> <int>
# 1 0 0 12
# 2 0 1 6
# 3 1 0 7
# 4 1 1 7
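For newer setups, a minimal sketch of the same idea, assuming dplyr 1.0.0 or later (where across() is available and the scoped group_by_at() variant is superseded). The variable name reduced_group_list is only illustrative; it shows how a column could be dropped from the list with setdiff() when it is not needed.

```r
library(dplyr)

# Character vector of grouping columns; edit this in one place
my_group_list <- c("vs", "am")

# Group by every column named in the vector
counts <- mtcars %>%
  group_by(across(all_of(my_group_list))) %>%
  summarise(countofvalues = n(), .groups = "drop")

# Drop a column from the list when it is not needed (illustrative name)
reduced_group_list <- setdiff(my_group_list, "am")

counts_reduced <- mtcars %>%
  group_by(across(all_of(reduced_group_list))) %>%
  summarise(countofvalues = n(), .groups = "drop")
```

Keeping the columns in a plain character vector means the usual vector tools (setdiff(), c(), intersect()) can be used to adjust the grouping for each situation.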
I'm trying to apply a tidyverse-based approach, or at least a tidy solution, for applying custom functions over the levels of a factor in a dataframe.
Consider the following test dataset:
df <- tibble(LINE=rep(c(1,2),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))
# LINE FOUND
# <dbl> <dbl>
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 0
# 5 1 1
# 6 1 1
# 7 2 0
# 8 2 0
# 9 2 1
#10 2 0
#11 2 0
#12 2 1
I want to know for example the proportion of found results (eg. FOUND==1) by level of the LINE factor. Right now, I'm working with the following code, but I'm really trying to get to something cleaner.
# This is the function to calculate the proportion "found"
get_prop <- function (data) {
tot <- data %>% nrow()
found <- data %>% dplyr::filter(FOUND==1) %>% nrow
found / tot
}
# This is the code to generate the expected result
lines <- df$LINE %>% unique %>% sort
v_line <- vector()
v_prop <- vector()
for (i in 1:length(lines)) {
tot <- df %>% dplyr::filter(LINE==lines[i])
v_line[i] <- lines[i]
v_prop[i] <- get_prop(tot)
}
df_line = data.frame(LINE = v_line, CALL = v_prop)
I would expect the following to work, but it does not: it returns one row per level, but the value in each row is computed from the whole dataset rather than from that level's rows:
df %>% dplyr::group_by(LINE) %>% dplyr::summarise(get_prop(.))
EDIT: Please note that what I am looking for is a solution for applying a custom function over the levels of a factor in a dataframe. It is not necessarily the number or the proportion of occurrences of a particular value, as in the example illustrated.
EDIT 2: That is, I'm looking for a solution that makes use of the get_prop function above. This is not because it is the best way of solving this particular issue, but because it is more generalizable
If you want to apply a custom function group-wise, you can use the group_split command. This will split your data frame into elements of a list. Each list element being a subset of the df. You can then use map to apply your function to each level (note that you can group_split and map in one step by using group_map). I added the last line to get to the form of the original approach.
df %>%
group_by(LINE) %>%
group_split() %>%
map_dbl(get_prop) %>%
tibble(LINE = seq_along(.), CALL = .) # optional to get back to a df
#> # A tibble: 2 x 2
#> LINE CALL
#> <int> <dbl>
#> 1 1 0.833
#> 2 2 0.333
Created on 2020-01-20 by the reprex package (v0.3.0)
Now one thing I'm worried about with this solution is that group_split returns an unnamed list, so the group labels are not attached to the results (I would have preferred it if they were kept as the names of the list or an attribute). So if you want a tibble as the outcome it might make sense to save the grouping variable beforehand:
groups <- sort(unique(df$LINE)) # sorted, to match the key order used by group_split()
df %>%
group_by(LINE) %>%
group_split() %>%
map_dbl(get_prop) %>%
tibble(group = groups, result = .)
update
I think the overall cleanest approach would be this (using a more general example):
library(tidyverse)
df <- tibble(LINE=rep(c("a", "b"),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))
lvls <- sort(unique(df$LINE)) # sorted, to match the key order used by group_map()
df %>%
group_by(LINE) %>%
group_map(~ get_prop(.x)) %>%
setNames(lvls) %>%
unlist() %>%
enframe()
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 a 0.833
#> 2 b 0.333
Created on 2020-01-20 by the reprex package (v0.3.0)
Another option could be to use group_map and then tibble::enframe
library(dplyr)
df %>%
group_by(LINE) %>%
group_map(~get_prop(.)) %>%
unlist() %>%
tibble::enframe()
# name value
# <int> <dbl>
#1 1 0.833
#2 2 0.333
You could also use group_modify, which would keep the group names (using @JBGruber's data)
df %>%
group_by(LINE) %>%
group_modify(~ tibble::enframe(get_prop(.), name = NULL))
# LINE value
# <chr> <dbl>
#1 a 0.833
#2 b 0.333
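Another possibility, sketched here under the assumption that dplyr 1.1.0 or later is installed (pick() does not exist in earlier versions), is to stay inside summarise() and hand the current group's rows to the custom function. Unlike `.`, which refers to the whole piped data frame, pick(everything()) is evaluated once per group.

```r
library(dplyr)

df <- tibble(LINE = rep(c(1, 2), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

# The custom function from the question: proportion of rows with FOUND == 1
get_prop <- function(data) {
  tot <- nrow(data)
  found <- data %>% dplyr::filter(FOUND == 1) %>% nrow()
  found / tot
}

# pick(everything()) returns the non-grouping columns of the current group
res <- df %>%
  group_by(LINE) %>%
  summarise(CALL = get_prop(pick(everything())))
```

The result keeps LINE as a regular column, so no separate bookkeeping of the group labels is needed.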
I would like to make tables for publication that give the number of observations, grouped by two variables. The code for this works fine. However, I have run into problems when trying to turn this into a function.
I am using dplyr_0.7.2
Example using mtcars:
Code for table outside of function: this works
library(tidyverse)
tab1 <- mtcars %>% count(cyl) %>% rename(Total = n)
tab2 <- mtcars %>%
group_by(cyl, gear) %>% count %>%
spread(gear, n)
tab <- full_join(tab1, tab2, by = "cyl")
tab
# This is the output (which is what I want)
A tibble: 3 x 5
cyl Total `3` `4` `5`
<dbl> <int> <int> <int> <int>
1 4 11 1 8 2
2 6 7 2 4 1
3 8 14 12 NA 2
Trying to put this into a function
Function for tab1: this works
count_by_two_groups_A <- function(df, var1){
var1 <- enquo(var1)
tab1 <- df %>% count(!!var1) %>% rename(Total = n)
tab1
}
count_by_two_groups_A(mtcars, cyl)
A tibble: 3 x 2
cyl Total
<dbl> <int>
1 4 11
2 6 7
3 8 14
Function for first part of tab2: it works up to this point, but...
count_by_two_groups_B <- function(df, var1, var2){
var1 <- enquo(var1)
var2 <- enquo(var2)
tab2 <- df %>% group_by((!!var1), (!!var2)) %>% count
tab2
}
count_by_two_groups_B(mtcars, cyl, gear)
A tibble: 8 x 3
Groups: (cyl), (gear) [8]
`(cyl)` `(gear)` n
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
The column names have changed to (cyl) and (gear). I can't seem to figure out how to carry on with spread() and full_join() (or anything else using the new column names) now that the column names have changed. I.e. I can't figure out how to specify the new column names in the tidyeval way, to be able to carry on. I have tried various things, without success.
The usual way of setting names in a tidyeval context is to use the definition operator :=. It would look like this:
df %>%
group_by(
!! nm1 := !! var1,
!! nm2 := !! var2
) %>%
count()
For this you need to extract nm1 from var1. Unfortunately I don't have an easy way of stripping the enclosing parentheses yet. I think it'd make sense to do it in the forthcoming function ensym() (it captures symbols instead of quosures and issues an error if you supply a call). I have submitted a ticket here: https://github.com/tidyverse/rlang/issues/223
Fortunately we have two easy solutions here. First note that you don't need the enclosing parentheses. They are only needed when other operators are involved in the captured expression. E.g. in these situations:
(!! var) / avg
(!! var) < value
In these cases, if you omitted the parentheses, !! would try to unquote the whole expression instead of just the one symbol. In your function, on the other hand, there is no operator, so you can safely unquote without enclosing parentheses:
count_by_two_groups_B <- function(df, var1, var2) {
var1 <- enquo(var1)
var2 <- enquo(var2)
df %>%
group_by(!! var1, !! var2) %>%
count()
}
Finally, you could make your function more general by allowing a variable number of arguments. This is even easier to implement because dots are forwarded so there is no need to capture and unquote. Just pass them down to group_by():
count_by <- function(df, ...) {
df %>%
group_by(...) %>%
count()
}
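Putting the pieces together, here is a rough sketch of the whole table as one function. It is not the code from the answer above: it assumes rlang 0.4.0 or later for the `{{ }}` operator and tidyr 1.0.0 or later for pivot_wider() (used in place of spread()), and the function name count_by_two_groups is only illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: totals per var1, plus one column per level of var2
count_by_two_groups <- function(df, var1, var2) {
  # Overall count per level of var1
  totals <- df %>% count({{ var1 }}, name = "Total")

  # Count per combination, spread into one column per level of var2
  by_both <- df %>%
    count({{ var1 }}, {{ var2 }}) %>%
    pivot_wider(names_from = {{ var2 }}, values_from = n)

  # Recover the column name of var1 as a string for the join key
  key <- rlang::as_name(rlang::ensym(var1))
  full_join(totals, by_both, by = key)
}

tab <- count_by_two_groups(mtcars, cyl, gear)
```

Because the parentheses around !! are gone, the column names stay cyl and gear, which is what makes the later reshape and join straightforward.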
I can make it work with standard evaluation (SE), passing the column names as strings. I could not do it with tidyverse as I did not have that installed and did not bother installing it.
Here is a working code:
library(dplyr)
library(tidyr)
count_by_two_groups_B <- function(df, var1, var2){
# var1 <- enquo(var1)
# var2 <- enquo(var2)
# Note: the key column passed to spread() is hard-coded as gear here
tab2 <- df %>% group_by_(var1, var2) %>% summarise(n = n() ) %>% spread(gear, n)
tab2
}
count_by_two_groups_B(mtcars, 'cyl', 'gear')
Result:
# A tibble: 3 x 4
# Groups: cyl [3]
cyl `3` `4` `5`
* <dbl> <int> <int> <int>
1 4 1 8 2
2 6 2 4 1
3 8 12 NA 2
This is one of those situations where reaching for dplyr or tidyverse seems excessive. There are base functions to do this: table, and, to get the results in long form, as.data.frame:
as.data.frame( with(mtcars, table(cyl,gear)) , responseName="Total")
#--------
cyl gear Total
1 4 3 1
2 6 3 2
3 8 3 12
4 4 4 8
5 6 4 4
6 8 4 0
7 4 5 2
8 6 5 1
9 8 5 2
This would be one dplyr approach:
mtcars %>% group_by(cyl,gear) %>% summarise(Total=n())
#----
# A tibble: 8 x 3
# Groups: cyl [?]
cyl gear Total
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
And if the question was how to get this as a table object (thinking that might have been your goal with spread), then just:
with(mtcars, table(cyl,gear))
I need to rename the second columns for all the dataframes in a list. I'm trying to use purrr::walk.
Here is the code:
cyl.name<- c('4-cyl', '6-cyl', '8-cyl')
cyl<- c(4,6,8)
car <- map(cyl, ~mtcars %>% filter(cyl==.x) %>%
group_by(gear) %>%
summarise(mean=mean(hp)) )
walk (seq_along(cyl.name), function (x) names(car[[x]])[2]<- cyl.name[x])
When I check the column names, all the mean columns are still named 'mean'. What did I do wrong?
If you have the list of the column names like this, you could use map2 to simultaneously loop through the filter variable and the naming variable. This would allow you to name the columns as you go rather than renaming after making the list.
This does involve using some tidyeval operations from rlang for programming with dplyr.
map2(cyl, cyl.name, ~mtcars %>%
filter(cyl==.x) %>%
group_by(gear) %>%
summarise( !!.y := mean(hp)) )
[[1]]
# A tibble: 3 x 2
gear `4-cyl`
<dbl> <dbl>
1 3 97
2 4 76
3 5 102
[[2]]
# A tibble: 3 x 2
gear `6-cyl`
<dbl> <dbl>
1 3 107.5
2 4 116.5
3 5 175.0
[[3]]
# A tibble: 2 x 2
gear `8-cyl`
<dbl> <dbl>
1 3 194.1667
2 5 299.5000
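As for why the walk call in the question had no effect: the assignment to names(car[[x]]) happens on a local copy of car inside the anonymous function, and walk discards whatever the function returns, so the list in the calling environment is never changed. If the list already exists, one way to rename afterwards is to build a new list with map2 and assign it back. This is a sketch that reuses the names from the question:

```r
library(dplyr)
library(purrr)

cyl.name <- c("4-cyl", "6-cyl", "8-cyl")
cyl <- c(4, 6, 8)

# Build the list as in the question; every second column is named "mean"
car <- map(cyl, ~ mtcars %>%
  filter(cyl == .x) %>%
  group_by(gear) %>%
  summarise(mean = mean(hp)))

# Rename the "mean" column of each element and assign the result back
car <- map2(car, cyl.name, ~ rename(.x, !!.y := mean))
```

The key difference from walk is that the renamed data frames are returned and stored, rather than modified in place.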
I must be missing something with how group_by levels in dplyr get peeled off. In the example below, I group by 2 columns, summarize values into a single variable, then sort by that new variable:
mtcars %>% group_by( cyl, gear ) %>%
summarize( hp_range = max(hp) - min(mpg)) %>%
arrange( desc(hp_range) )
# Source: local data frame [8 x 3]
# Groups: cyl [3]
#
# cyl gear hp_range
# (dbl) (dbl) (dbl)
#1 4 4 87.6
#2 4 5 87.0
#3 4 3 75.5
#4 6 5 155.3
#5 6 4 105.2
#6 6 3 91.9
#7 8 5 320.0
#8 8 3 234.6
Obviously this is not sorted by hp_range as intended. What am I missing?
EDIT: The example works as expected without the call to desc in arrange. Still unclear why?
Ok, just got to the bottom of this:
The call to desc was not the cause; it was only by chance that the example appeared to work without it
The key is that when you group_by multiple columns, summarize peels off only the last grouping level, so the result is still grouped by cyl, and it seems that arrange then sorts within those groups. To get the intended sort of the entire data table, you must first ungroup and then arrange
mtcars %>% group_by( cyl, gear ) %>%
summarize( hp_range = max(hp) - min(mpg)) %>%
ungroup() %>%
arrange( hp_range )
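A variant of the same fix, assuming dplyr 1.0.0 or later, where summarise() accepts a .groups argument: dropping all grouping inside summarise() makes the separate ungroup() step unnecessary. The descending order from the original question is used here.

```r
library(dplyr)

sorted <- mtcars %>%
  group_by(cyl, gear) %>%
  # .groups = "drop" returns an ungrouped result
  summarise(hp_range = max(hp) - min(mpg), .groups = "drop") %>%
  arrange(desc(hp_range))
```

Either way, the point is that the data frame should no longer be grouped by cyl at the moment arrange() is called.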
When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?
Here's an example with fake data.
library(dplyr)
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
# Now add an extra level to df$b that has no corresponding value in df$a
df$b = factor(df$b, levels=1:3)
# Summarise with plyr, keeping categories with a count of zero
plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)
b count_a
1 1 6
2 2 6
3 3 0
# Now try it with dplyr
df %>%
group_by(b) %>%
summarise(count_a=length(a), .drop=FALSE)
b count_a .drop
1 1 6 FALSE
2 2 6 FALSE
Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?
The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:
library(tidyr)
df %>%
group_by(b) %>%
summarise(count_a=length(a)) %>%
complete(b)
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (int)
# 1 1 6
# 2 2 6
# 3 3 NA
If you wanted the replacement value to be zero, you need to specify that with fill:
df %>%
group_by(b) %>%
summarise(count_a=length(a)) %>%
complete(b, fill = list(count_a = 0))
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (dbl)
# 1 1 6
# 2 2 6
# 3 3 0
Since dplyr 0.8 group_by gained the .drop argument that does just what you asked for:
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)
df %>%
group_by(b, .drop=FALSE) %>%
summarise(count_a=length(a))
#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0
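For plain counts, the same argument can also be passed to count(), which should make the group_by() and summarise() pair unnecessary. A small sketch, assuming dplyr 0.8 or later and reusing the data from the question:

```r
library(dplyr)

df <- data.frame(a = rep(1:3, 4), b = rep(1:2, 6))
df$b <- factor(df$b, levels = 1:3)

# .drop = FALSE keeps the empty factor level 3 with a count of zero
res <- df %>% count(b, name = "count_a", .drop = FALSE)
```

As with group_by(), this relies on b being a factor, since the empty level is only known from the factor's levels.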
One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:
library(dplyr)
data(iris)
# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))
# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally
#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0
# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))
# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0
# Turn group2 into a factor
iris$group2 = factor(iris$group2)
# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0
Created on 2019-03-13 by the reprex package (v0.2.1)
dplyr solution:
First make grouped df
by_b <- tbl_df(df) %>% group_by(b)
then we summarise those levels that occur by counting with n()
res <- by_b %>% summarise( count_a = n() )
then we merge our results into a data frame that contains all factor levels:
expanded_res <- left_join(expand.grid(b = levels(df$b)),res)
finally, in this case since we are looking at counts the NA values are changed to 0.
expanded_res[is.na(expanded_res)] <- 0
final_counts <- expanded_res
This can also be implemented functionally, see answers:
Add rows to grouped data with dplyr?
A hack:
I thought I would post a terrible hack that works in this case, for interest's sake. I seriously doubt you should ever actually do this, but it shows how group_by() generates the attributes as if df$b were a character vector rather than a factor with levels. Also, I don't pretend to understand this properly, but I am hoping this helps me learn, which is the only reason I'm posting it!
by_b <- tbl_df(df) %>% group_by(b)
define an "out-of-bounds" value that cannot exist in dataset.
oob_val <- nrow(by_b)+1
modify attributes to "trick" summarise():
attr(by_b, "indices")[[3]] <- rep(NA,oob_val)
attr(by_b, "group_sizes")[3] <- 0
attr(by_b, "labels")[3,] <- 3
do the summary:
res <- by_b %>% summarise(count_a = n())
index and replace all occurrences of oob_val
res[res == oob_val] <- 0
which gives the intended:
> res
Source: local data frame [3 x 2]
b count_a
1 1 6
2 2 6
3 3 0
This is not exactly what was asked in the question, but at least for this simple example, you could get the same result using xtabs, for example:
using dplyr:
df %>%
xtabs(formula = ~ b) %>%
as.data.frame()
or shorter:
as.data.frame(xtabs( ~ b, df))
result (equal in both cases):
b Freq
1 1 6
2 2 6
3 3 0