I have a dataset that looks like the following:
id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5
I would like to perform several t.tests to compare the means for each factor in the S (samediff) condition to the means for that same factor in the D (samediff) condition.
I know I could do this in the following way:
library(dplyr)

dfgive <- filter(df, factor == "give")
t.test(value ~ samediff, dfgive)

dfimpact <- filter(df, factor == "impact")
t.test(value ~ samediff, dfimpact)
Is there a way to perform several t.tests in fewer lines? In the actual dataset, there are several more factors than are included here. I would like to be able to conduct all the t.tests necessary without creating separate dataframes in the same way I've shown above.
To augment existing answers, you can use broom::tidy to tidy the output from the t.test, e.g.
library(tidyverse)
library(broom)
df %>%
  group_by(factor) %>%
  summarise(ttest = list(t.test(value ~ samediff))) %>%
  mutate(ttest = map(ttest, tidy)) %>%
  unnest(ttest) %>%
  select(factor, estimate, estimate1, estimate2, p.value)
# # A tibble: 2 x 5
#   factor estimate estimate1 estimate2 p.value
#   <chr>     <dbl>     <dbl>     <dbl>   <dbl>
# 1 give       -0.5       2         2.5   0.712
# 2 impact      0         4.5       4.5   1
Here's a base-R approach:
results <- lapply(split(df, df$factor), function(X) {
  out <- t.test(value ~ samediff, X)
  data.frame(t_stat = out$statistic,   # the t statistic (not the difference in means)
             mean1  = out$estimate[1],
             mean2  = out$estimate[2],
             pval   = out$p.value)
})
do.call(rbind, results)
#            t_stat mean1 mean2      pval
# give   -0.4472136   2.0   2.5 0.7117228
# impact  0.0000000   4.5   4.5 1.0000000
We can split the data by factor and apply t.test to each piece. The final output is a list; each result can be accessed with lst$give or lst$impact.
library(tidyverse)
lst <- df %>%
  split(.$factor) %>%
  map(~ t.test(value ~ samediff, .x))
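For example, to inspect a single result:
lst$give
lst[["impact"]]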
DATA
df <- read.table(text = "id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5 ",
header = TRUE, stringsAsFactors = FALSE)
We can group by 'factor' and summarise the output of t.test in a list
library(dplyr)
out <- df %>%
  group_by(factor) %>%
  summarise(ttest = list(t.test(value ~ samediff)))
out
# A tibble: 2 x 2
# factor ttest
# <chr> <list>
#1 give <S3: htest>
#2 impact <S3: htest>
The output is stored in a list column which can be extracted with $ or [[
identical(out$ttest[[1]], t.test(value ~ samediff, dfgive))
#[1] TRUE
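To pull individual components out of the list column in bulk, here is a short sketch using purrr (assuming the out tibble above):
library(purrr)
out %>%
  mutate(p.value = map_dbl(ttest, "p.value"))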
Related
I have a data.frame (or tibble or whatever) with an id variable. Often I perform some operation for this id with dplyr::group_by, so
data %>%
group_by(id) %>%
summarise/mutate/...()
Often, I have other non-numeric variables that are unique for each id, such as the project or country to which the id belongs and other characteristics of the id (such as gender, etc.). When I use the summarise function above, these other variables are lost unless I specify either
data %>%
group_by(id) %>%
summarise(across(c(project, country, gender, ...), unique),...)
or
data %>%
group_by(id, project, country, gender, ...) %>%
summarise()
Is there a function which detects the variables that are unique for each id, so that one does not have to specify them?
Thank you!
PS: I am asking mainly about dplyr and group_by related functions, but other environments like base R or data.table are welcome as well.
I did not test it extensively, but it should do the job:
library(dplyr)
myData <- tibble(X = c(1, 1, 2, 2, 2, 3),
Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
R = rnorm(6))
myData
#> # A tibble: 6 x 3
#> X Y R
#> <dbl> <chr> <dbl>
#> 1 1 A 0.463
#> 2 1 A -0.965
#> 3 2 B -0.403
#> 4 2 B -0.417
#> 5 2 B -2.28
#> 6 3 C 0.423
group_by_id_vars <- function(.data, ...) {
  # group by the prespecified ID variables
  .data <- .data %>% group_by(...)
  # how many groups these IDs determine
  ID_groups <- .data %>% n_groups()
  # get the number of groups if the initial grouping variables are combined
  # with other variables
  groupVars <- sapply(substitute(list(...))[-1], deparse)  # specified grouping variables
  nms <- names(.data)  # all variables in .data
  res <- sapply(nms[!nms %in% groupVars],
                function(x) {
                  .data %>%
                    # important to specify .add = TRUE to combine the variable
                    # with the IDs
                    group_by(across(all_of(x)), .add = TRUE) %>%
                    n_groups()
                })
  # which combinations are identical, i.e. the variable does not increase the
  # number of groups in the data when combined with the ID vars
  v <- names(res)[which(res == ID_groups)]
  # group the data accordingly
  .data <- .data %>% ungroup() %>% group_by(across(all_of(c(groupVars, v))))
  return(.data)
}
myData %>%
  group_by_id_vars(X) %>%
  summarise(n = n())
#> `summarise()` regrouping output by 'X' (override with `.groups` argument)
#> # A tibble: 3 x 3
#> # Groups: X [3]
#> X Y n
#> <dbl> <chr> <int>
#> 1 1 A 2
#> 2 2 B 3
#> 3 3 C 1
This is a bit more advanced in application, but what you are looking for are linear combinations of your grouping variables. You can convert these to factors and then use some linear algebra.
You can use findLinearCombos() from caret to locate these. It takes a bit of work to get it all organized how I think you want it though.
Something like this may do the trick. I also have not extensively tested this.
Packages
library(dplyr)
library(caret)
library(purrr)
Function
group_by_lc <- function(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data)) {
  # capture the ... and convert to a character vector
  .groups <- rlang::ensyms(...)
  .groups_chr <- map_chr(.groups, rlang::as_name)
  # convert all character and factor variables to a numeric
  d <- .data %>%
    mutate(across(where(is.factor), as.character),
           across(where(is.character), as.factor),
           across(where(is.factor), as.integer))
  # find linear combinations of the character / factor variables
  lc <- caret::findLinearCombos(d)
  # see if any of your grouping variables have linear combinations
  find_group_match <- function(known_groups, lc_pair) {
    if (any(lc_pair %in% known_groups)) unique(c(lc_pair, known_groups)) else NULL
  }
  # convert column indices to names
  lc_pairs <- map(lc$linearCombos, ~ names(d)[.x])
  # iteratively look for linear combinations of known grouping variables
  lc_cols <- reduce(lc_pairs, find_group_match, .init = .groups_chr)
  # find new grouping variables
  added_groups <- rlang::syms(lc_cols[!(lc_cols %in% .groups_chr)])
  # apply the grouping to your groups and the linear combinations
  group_by(.data, !!!.groups, !!!added_groups, .add = .add, .drop = .drop)
}
Usage
data <- tibble(V = LETTERS[1:10], W = letters[1:10], X = paste0(V, W), Y = rep(LETTERS[1:5], each = 2), Z = runif(10))
group_by_lc(data, W)
Result
You can see how it added in all the other grouping variables. You can rework this in other ways; the key part is building the added_groups list to find them.
# A tibble: 10 x 5
# Groups: W, X, V [10]
V W X Y Z
<chr> <chr> <chr> <chr> <dbl>
1 A a Aa A 0.884
2 B b Bb A 0.133
3 C c Cc B 0.194
4 D d Dd B 0.407
5 E e Ee C 0.256
6 F f Ff C 0.0976
7 G g Gg D 0.635
8 H h Hh D 0.0542
9 I i Ii E 0.0104
10 J j Jj E 0.464
I have a data frame of 30 years of a response variable. I want to write a code that will subset that df into "x" number of years "n" times, and run a regression of the response in all of those subsets.
So if we started with 30 years, x=5 & n=2, we would end with 2 regressions, each using 5 random years out of the available 30. I wrote a function for that here:
library(dplyr)
library(tidyr)
library(purrr)

# build df
df = data.frame(year = c(1:30),
                response = runif(30, 1, 100))
# create function
subsample <- function(df, x, n) {
  df %>%
    # collapse the tibble
    nest(data = everything()) %>%
    # repeat the tibble for number of simulations
    slice(rep(1:n(), each = n)) %>%
    # add group number, which will be the "nth" trial
    mutate(group = c(1:n)) %>%
    # expand data
    unnest(cols = c(data)) %>%
    # group by group number, then subsample x rows from each group
    group_by(group) %>%
    group_map(~ sample_n(.x, x, replace = FALSE)) %>%
    # stitch back together and add group number col back
    bind_rows(.id = "trial") %>%
    # arrange by group and year
    mutate(trial = as.numeric(trial)) %>%
    arrange(trial, year) %>%
    # group by subsample and run regression
    group_by(trial) %>%
    do({
      mod = lm(response ~ year, data = .)
      data.frame(Intercept = coef(mod)[1],
                 Slope = coef(mod)[2])
    })
}
# test function
subsample(df, x=5, n=2)
# A tibble: 2 x 3
# Groups:   trial [2]
  trial Intercept  Slope
  <dbl>     <dbl>  <dbl>
1     1      48.5 -0.895
2     2      35.4 -0.275
Great, so that works, and we get two regressions (all I want is slope and intercept) each using a subset of 5 out of 30 years.
However, now I want to do this for every possible number of years (so x = c(2:30)), ending with a df that should look like this:
# A tibble: 58 x 4
number_of_years trial Intercept Slope
<dbl> <dbl> <dbl> <dbl>
1 2 1 48.5 -0.895
2 2 2 35.4 -0.275
3 3 1 55.2 0.333
4 3 2 34.1 0.224
5 4 1 63.2 -0.359
6 4 2 45.5 -0.241
7 5 1 43.1 0.257
8 5 2 37.9 -0.657
9 6 1 51.0 -0.456
10 6 2 65.6 0.126
This would be showing regression values of 2 trials ("n") each using 2 random years (number_of_years, "x"), then 2 trials using 3 random years, 4 random years, etc... all the way until 30.
So I tried to follow the same logic as above, but now using group_map() with the custom function that I built:
df %>%
  # collapse the tibble
  nest(data = everything()) %>%
  # repeat the tibble for the number of simulations we want to test (29, in this case)
  slice(rep(1:n(), each = (nrow(df) - 1))) %>%
  # add column for number out of total and unnest
  mutate(number_of_years = c(2:(nrow(.) + 1))) %>%
  select(number_of_years, data) %>% # reorder
  unnest(cols = c(data)) %>%
  # group by out of total
  group_by(number_of_years) %>%
  group_map(~ subsample(.x, x = 5, n = 2))
### this is the problematic line!
### this is giving us 2 trials (n=2) of a regression, each using
### x=5 years of sampling. but instead of x=5 years, I want x=number_of_years
### so x should be the same as the grouping variable.
So the problem here is: since my subsample() function needs 3 inputs (df, x, n), I need to figure out how to make x the same as the grouping variable for the dataset, i.e. x should be number_of_years. I've tried group_map(~ subsample(.x, .x$number_of_years, 2)) and variations like that, but I can't figure out how to make it return a tibble of 2 trials for each number of years, meaning 2 regressions of subsamples of the original df, each one using a different number of years to calculate the regression.
I would like to stay in the tidyverse/dplyr/purrr workflow if possible.
Thanks!
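A hedged sketch of one possible fix (untested; it assumes the subsample() function defined above): inside group_map() the grouping key for the current group is available as .y, so x can be read from it. Alternatively, purrr::map_dfr() can loop over the number of years directly, which avoids the nest/slice scaffolding entirely.
library(purrr)

# Option 1: read the grouping key via .y inside group_map()
df %>%
  nest(data = everything()) %>%
  slice(rep(1:n(), each = (nrow(df) - 1))) %>%
  mutate(number_of_years = c(2:(nrow(.) + 1))) %>%
  unnest(cols = c(data)) %>%
  group_by(number_of_years) %>%
  group_map(~ subsample(.x, x = .y$number_of_years, n = 2))

# Option 2: map over the subsample sizes directly
map_dfr(2:30, function(x) {
  subsample(df, x = x, n = 2) %>%
    mutate(number_of_years = x)
}) %>%
  select(number_of_years, trial, Intercept, Slope)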
I am trying to draw a stratified sample from a data set for which a variable exists that indicates how large the sample size per group should be.
library(dplyr)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
In this example, grp refers to the group I want to sample by and frq is the sample size specified for that group.
Using split, I came up with this possible solution, which gives the desired result but seems rather inefficient:
s <- split(df, df$grp)
lapply(s,function(x) sample_n(x, size = unique(x$frq))) %>%
do.call(what = rbind)
Is there a way using just dplyr's group_by and sample_n to do this?
My first thought was:
df %>% group_by(grp) %>% sample_n(size = frq)
but this gives the error:
Error in is_scalar_integerish(size) : object 'frq' not found
This works:
df %>% group_by(grp) %>% sample_n(frq[1])
# A tibble: 9 x 3
# Groups: grp [3]
id grp frq
<int> <int> <dbl>
1 3 1 3
2 4 1 3
3 2 1 3
4 6 2 2
5 8 2 2
6 13 3 4
7 14 3 4
8 12 3 4
9 11 3 4
Not sure why it didn't work when you tried it; the key point is that size must be a single value, and frq[1] is evaluated within each group's slice of the data, so it picks up that group's (constant) frequency.
library(tidyverse)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
set.seed(22)
df %>%
  group_by(grp) %>% # for each group
  nest() %>% # nest data
  mutate(v = map(data, ~ sample_n(data.frame(id = .$id), unique(.$frq)))) %>% # sample using id values and (unique) frq value
  unnest(v) # unnest the sampled values
# # A tibble: 9 x 2
# grp id
# <int> <int>
# 1 1 2
# 2 1 5
# 3 1 3
# 4 2 8
# 5 2 9
# 6 3 14
# 7 3 13
# 8 3 15
# 9 3 11
sample_n works here because it is passed a data frame of ids (not a vector of ids) and a single frq value per group.
An alternative version using map2 and generating the inputs for sample_n in advance:
df %>%
  group_by(grp) %>% # for every group
  summarise(d = list(data.frame(id = id)), # create a data frame of ids
            frq = unique(frq)) %>% # get the unique frq value
  mutate(v = map2(d, frq, ~ sample_n(.x, .y))) %>% # sample using data frame of ids and frq value
  unnest(v) %>% # unnest sampled values
  select(-frq) # remove frq column (if needed)
The following answer is not recommended; it just shows a different approach, without nests or maps, that some people might find easier to follow. It may be of use to someone working with a smallish data set who wants to do something slightly different from the original question, doesn't have time to play around with functions they don't fully understand, and isn't too worried about efficiency. You just need to recall the behaviour of the base R sample function: when provided with a (positive) integer argument x, it outputs a vector randomly permuting the integers from 1:x.
> sample(5)
[1] 5 1 4 2 3
If we had five elements, we could then obtain a random sample of size three by only selecting the positions where 1, 2 and 3 were permuted - in this case we'd pick the second, fourth and fifth elements. All clear? Then similarly we can just do that within each group, assigning random integers from 1 to the group size, and choosing as our sample the places where the random id is less than or equal to the desired sample size for that group.
library(tidyverse)
# The iris data set has three different species
# I want to sample 2, 5 and 3 flowers respectively from each
sample_sizes <- data.frame(
Species = unique(iris$Species),
n_to_sample = c(2, 5, 3)
)
iris %>%
  left_join(sample_sizes, by = "Species") %>% # adds column for how many to sample from this species
  group_by(Species) %>% # each species is a group; the size of the group is n()
  mutate(random_id = sample(n())) %>% # give each flower in the group a random id between 1 and n()
  ungroup() %>%
  filter(random_id <= n_to_sample)
Which gave me the output:
# A tibble: 10 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species n_to_sample random_id
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <int>
1 4.9 3.1 1.5 0.1 setosa 2 1
2 5.7 4.4 1.5 0.4 setosa 2 2
3 6.2 2.2 4.5 1.5 versicolor 5 3
4 6.3 2.5 4.9 1.5 versicolor 5 2
5 6.4 2.9 4.3 1.3 versicolor 5 5
6 6 2.9 4.5 1.5 versicolor 5 4
7 5.5 2.4 3.8 1.1 versicolor 5 1
8 7.3 2.9 6.3 1.8 virginica 3 1
9 7.2 3 5.8 1.6 virginica 3 3
10 6.2 3.4 5.4 2.3 virginica 3 2
You can of course pipe through to select(-random_id, -n_to_sample) if you no longer have any use for the final two columns, but I left them in so it's clearer from the output how the code worked.
For the example data given in the question:
library(dplyr)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
df %>%
  group_by(grp) %>%
  mutate(random_id = sample(n())) %>%
  ungroup() %>%
  filter(random_id <= frq) %>%
  select(-random_id)
# A tibble: 9 x 3
id grp frq
<int> <int> <dbl>
1 1 1 3
2 2 1 3
3 3 1 3
4 8 2 2
5 9 2 2
6 11 3 4
7 12 3 4
8 13 3 4
9 15 3 4
NB if you're a safety fanatic and x might be zero, and you want to guarantee the length of the output is definitely the same as x, you're better to do sample(seq_len(x)) than sample(x). That way you get the zero-length vector integer(0) rather than the length-one vector 0 in the case where x is zero. In my code, the mutate will never be working on a row for which n() is zero (if n() were zero then that group is empty so there couldn't be a row there) and this isn't a problem. Just something to be aware of if you're taking this approach somewhere else.
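A quick illustration of that edge case:
sample(5)           # permutes 1:5
sample(0)           # treated as the one-element vector c(0), so returns 0
sample(seq_len(0))  # integer(0): a genuinely empty result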
Benchmarks for comparison:
f1 <- function(df) { # #AntoniosK with nest and map
  df %>%
    group_by(grp) %>% # for each group
    nest() %>% # nest data
    mutate(v = map(data, ~ sample_n(data.frame(id = .$id), unique(.$frq)))) %>% # sample using id values and (unique) frq value
    unnest(v) # unnest the sampled values
}

f2 <- function(df) { # #AntoniosK with nest and map2
  df %>%
    group_by(grp) %>% # for every group
    summarise(d = list(data.frame(id = id)), # create a data frame of ids
              frq = unique(frq)) %>% # get the unique frq value
    mutate(v = map2(d, frq, ~ sample_n(.x, .y))) %>% # sample using data frame of ids and frq value
    unnest(v) %>% # unnest sampled values
    select(-frq) # remove frq column (if needed)
}

f3 <- function(df) { # #thc
  df %>% group_by(grp) %>% sample_n(frq[1])
}

f4 <- function(df) { # #Silverfish
  df %>%
    group_by(grp) %>%
    mutate(random_id = sample(n())) %>%
    ungroup() %>%
    filter(random_id <= frq) %>%
    select(-random_id)
}

# example data of variable size
df_n <- function(n) {
  data.frame(id = seq_len(3 * n),
             grp = rep(1:3, each = n),
             frq = rep(c(3, 2, 4), each = n))
}
library(microbenchmark)
microbenchmark(f1(df_n(1e3)), f2(df_n(1e3)), f3(df_n(1e3)), f4(df_n(1e3)),
               f1(df_n(1e6)), f2(df_n(1e6)), f3(df_n(1e6)), f4(df_n(1e6)),
               times = 20)
Results strongly favour #thc's df %>% group_by(grp) %>% sample_n(frq[1]) both for data frame with a couple of thousand or couple of million rows. My naive approach takes two or three times as long, and #AntoniosK's faster solution is the one with nest and map2 (worse than mine for smaller data frames but better for the larger ones).
Unit: milliseconds
expr min lq mean median uq max neval
f1(df_n(1000)) 12.0007 12.27295 12.479760 12.34190 12.46475 13.6403 20
f2(df_n(1000)) 9.5841 9.82185 9.905120 9.87820 9.98865 10.2993 20
f3(df_n(1000)) 1.3729 1.53470 1.593015 1.56755 1.68910 1.8456 20
f4(df_n(1000)) 3.1732 3.21600 3.558855 3.27500 3.57350 5.4715 20
f1(df_n(1e+06)) 1582.3807 1695.15655 1699.288195 1714.13435 1727.53300 1744.2654 20
f2(df_n(1e+06)) 323.3649 336.94280 407.581130 346.95390 463.69935 911.6647 20
f3(df_n(1e+06)) 216.3265 235.85830 268.756465 247.63620 259.02640 395.9372 20
f4(df_n(1e+06)) 641.5119 663.03510 737.089355 682.69730 803.98205 1132.6586 20
Reading the guide to programming with dplyr, I am able to refer to all ... variables at once. But how can I use them individually?
Here's a function that counts two variables. It succeeds using quos() and !!!:
library(dplyr) # version 0.6 or higher
library(tidyr)
# counts two variables
my_fun <- function(dat, ...) {
  cols <- quos(...)
  dat <- dat %>%
    count(!!!cols)
  dat
}
my_fun(mtcars, cyl, am)
#> # A tibble: 6 x 3
#> cyl am n
#> <dbl> <dbl> <int>
#> 1 4 0 3
#> 2 4 1 8
#> 3 6 0 4
#> 4 6 1 3
#> 5 8 0 12
#> 6 8 1 2
Now I want to tidyr::spread the second variable, in this case the am column. When I add to my function:
result <- dat %>%
  tidyr::spread(!!!cols[[2]], "n", fill = 0)
I get:
Error: Invalid column specification
How should I refer to just the 2nd variable of the cols <- quos(...) list?
It is not clear whether spread works with quosures or not. One option is to use spread_ with strings (note that the underscored verbs were deprecated in later tidyr versions):
my_fun <- function(dat, ...) {
  cols <- quos(...)
  dat %>%
    select(!!!cols) %>%
    count(!!!cols) %>%
    spread_(quo_name(cols[[2]]), "n", fill = 0)
}
my_fun(mtcars, cyl, am)
# A tibble: 3 x 3
# cyl `0` `1`
#* <dbl> <dbl> <dbl>
#1 4 3 8
#2 6 4 3
#3 8 12 2
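For what it's worth, a sketch for newer tidyr (assuming tidyr >= 1.1, where pivot_wider and a scalar values_fill are available); note the single !! to unquote one quosure:
my_fun <- function(dat, ...) {
  cols <- quos(...)
  dat %>%
    count(!!!cols) %>%
    tidyr::pivot_wider(names_from = !!cols[[2]], values_from = n,
                       values_fill = 0)
}
my_fun(mtcars, cyl, am)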
Use named parameters instead. If you're relying on doing different things to different elements of the ... list, it only makes sense to be explicit: it's easier to understand what each input is doing, and easier for you to manipulate each one. A minimal sketch follows.
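A hypothetical named-parameter version of the function above (assuming tidyr >= 0.7, where spread accepts unquoted quosures):
library(dplyr)
library(tidyr)

# each role is an explicit, named argument instead of ...
my_fun2 <- function(dat, count_var, spread_var) {
  count_var  <- enquo(count_var)
  spread_var <- enquo(spread_var)
  dat %>%
    count(!!count_var, !!spread_var) %>%
    spread(!!spread_var, n, fill = 0)
}

my_fun2(mtcars, cyl, am)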
When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?
Here's an example with fake data.
library(dplyr)
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
# Now add an extra level to df$b that has no corresponding value in df$a
df$b = factor(df$b, levels=1:3)
# Summarise with plyr, keeping categories with a count of zero
plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)
b count_a
1 1 6
2 2 6
3 3 0
# Now try it with dplyr
df %>%
  group_by(b) %>%
  summarise(count_a = length(a), .drop = FALSE)
b count_a .drop
1 1 6 FALSE
2 2 6 FALSE
Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?
The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:
library(tidyr)
df %>%
  group_by(b) %>%
  summarise(count_a = length(a)) %>%
  complete(b)
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (int)
# 1 1 6
# 2 2 6
# 3 3 NA
If you wanted the replacement value to be zero, you need to specify that with fill:
df %>%
  group_by(b) %>%
  summarise(count_a = length(a)) %>%
  complete(b, fill = list(count_a = 0))
# Source: local data frame [3 x 2]
#
# b count_a
# (fctr) (dbl)
# 1 1 6
# 2 2 6
# 3 3 0
Since dplyr 0.8, group_by has gained a .drop argument that does just what you asked for:
df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)
df %>%
  group_by(b, .drop = FALSE) %>%
  summarise(count_a = length(a))
#> # A tibble: 3 x 2
#> b count_a
#> <fct> <int>
#> 1 1 6
#> 2 2 6
#> 3 3 0
One additional note to go with #Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:
library(dplyr)
data(iris)
# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))
# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally
#> Species n
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
#> 4 empty_level 0
# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))
# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 versicolor A 25
#> 4 versicolor B 25
#> 5 virginica B 25
#> 6 virginica C 25
#> 7 empty_level <NA> 0
# Turn group2 into a factor
iris$group2 = factor(iris$group2)
# Now all possible combinations of Species and group2 are included in the output,
# whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
#> Species group2 n
#> 1 setosa A 25
#> 2 setosa B 25
#> 3 setosa C 0
#> 4 versicolor A 25
#> 5 versicolor B 25
#> 6 versicolor C 0
#> 7 virginica A 0
#> 8 virginica B 25
#> 9 virginica C 25
#> 10 empty_level A 0
#> 11 empty_level B 0
#> 12 empty_level C 0
Created on 2019-03-13 by the reprex package (v0.2.1)
dplyr solution:
First make a grouped df:
by_b <- tbl_df(df) %>% group_by(b)
then we summarise the levels that occur by counting with n():
res <- by_b %>% summarise( count_a = n() )
then we merge our results into a data frame that contains all factor levels:
expanded_res <- left_join(expand.grid(b = levels(df$b)),res)
finally, since we are looking at counts in this case, the NA values are changed to 0:
expanded_res[is.na(expanded_res)] <- 0
This can also be implemented functionally, see answers:
Add rows to grouped data with dplyr?
A hack:
I thought I would post a terrible hack that works in this case, for interest's sake. I seriously doubt you should ever actually do this, but it shows how group_by() generates the attributes as if df$b were a character vector rather than a factor with levels. Also, I don't pretend to understand this properly, but I am hoping this helps me learn, which is the only reason I'm posting it!
by_b <- tbl_df(df) %>% group_by(b)
define an "out-of-bounds" value that cannot exist in dataset.
oob_val <- nrow(by_b)+1
modify attributes to "trick" summarise():
attr(by_b, "indices")[[3]] <- rep(NA,oob_val)
attr(by_b, "group_sizes")[3] <- 0
attr(by_b, "labels")[3,] <- 3
do the summary:
res <- by_b %>% summarise(count_a = n())
index and replace all occurrences of oob_val:
res[res == oob_val] <- 0
which gives the intended:
> res
Source: local data frame [3 x 2]
b count_a
1 1 6
2 2 6
3 3 0
This is not exactly what was asked in the question, but at least for this simple example you can get the same result using xtabs, for example:
using dplyr:
df %>%
  xtabs(formula = ~ b) %>%
  as.data.frame()
or shorter:
as.data.frame(xtabs( ~ b, df))
result (equal in both cases):
b Freq
1 1 6
2 2 6
3 3 0
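For completeness, a hedged note: xtabs extends naturally to more than one variable, keeping all combinations (including empty ones), e.g.:
as.data.frame(xtabs(~ a + b, df))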