Find mean of counts within groups - r

I have a dataframe that looks like this:
library(tidyverse)
x <- tibble(
batch = rep(c(1,2), each=10),
exp_id = c(rep('a',3),rep('b',2),rep('c',5),rep('d',6),rep('e',4))
)
I can run the code below to get the count per exp_id:
x %>% group_by(batch,exp_id) %>%
summarise(count=n())
which generates:
batch exp_id count
<dbl> <chr> <dbl>
1 1 a 3
2 1 b 2
3 1 c 5
4 2 d 6
5 2 e 4
A really ugly way to generate the mean of these counts is:
x %>% group_by(batch,exp_id) %>%
summarise(count=n()) %>%
ungroup() %>%
group_by(batch) %>%
summarise(avg_exp = mean(count))
which generates:
batch avg_exp
<dbl> <dbl>
1 1 3.33
2 2 5
Is there a more succinct and "tidy" way to generate this?

library(dplyr)
group_by(x, batch) %>%
summarize(avg_exp = mean(table(exp_id)))
# # A tibble: 2 x 2
# batch avg_exp
# <dbl> <dbl>
# 1 1 3.33
# 2 2 5

Here's another way:
library(dplyr)
x %>%
count(batch, exp_id, name = "count") %>%
group_by(batch) %>%
summarise(count = mean(count))
# batch count
# <dbl> <dbl>
#1 1 3.33
#2 2 5
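Because the mean of the per-exp_id counts within a batch equals the number of rows in the batch divided by the number of distinct exp_id values, a single summarise() may be enough. This is a minimal sketch, assuming the x tibble from the question; the shortcut is not taken from the answers above:

```r
library(dplyr)

x <- tibble(
  batch  = rep(c(1, 2), each = 10),
  exp_id = c(rep("a", 3), rep("b", 2), rep("c", 5), rep("d", 6), rep("e", 4))
)

# mean count per exp_id = rows in the batch / distinct exp_ids in the batch
res <- x %>%
  group_by(batch) %>%
  summarise(avg_exp = n() / n_distinct(exp_id))

res
```

This only works for a plain mean of counts; for a median or other statistic the counts themselves are still needed, as in the count() approach.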

Related

How can a table be rearranged one step at a time so that two or more observations are listed in a row in successive columns?

So far I have done this to achieve the desired result:
# A tibble: 4 x 2
frag treat
<dbl> <dbl>
1 1 1
2 2 1
3 1 2
4 2 2
treat_1 <- tab_example %>% filter(treat == "1")
treat_2 <- tab_example %>% filter(treat == "2")
new_tab_example <- full_join(treat_1, treat_2, by = "frag")
> new_tab_example
# A tibble: 2 x 3
frag treat.x treat.y
<dbl> <dbl> <dbl>
1 1 1 2
2 2 1 2
Is there a way to do it in one step?
You can use pivot_wider:
tidyr::pivot_wider(tab_example, names_from = treat,
names_prefix = 'treat', values_from = treat)
# frag treat1 treat2
# <dbl> <dbl> <dbl>
#1 1 1 2
#2 2 1 2
There is also a way using the spread() function:
library(dplyr)
library(tidyr)
# Your data
df = tibble(frag = c(1, 2, 1, 2), treat = c(1,1,2,2) )
dfnew = df %>%
mutate(treat_name = case_when(treat==1 ~ 'treat.x', # Build names of columns
treat==2 ~ 'treat.y')
) %>%
spread(treat_name, treat) # Use spread function
If you print the result:
print(dfnew)
# A tibble: 2 x 3
frag treat.x treat.y
<dbl> <dbl> <dbl>
1 1 1 2
2 2 1 2

Writing function that calculates rowwise mean for subset of columns and creates column name

I want to turn this line of code into a function:
mutate(var_avg = rowMeans(select(., starts_with("var"))))
It works in the pipe:
df <- read_csv("var_one,var_two,var_three
1,1,1
2,2,2
3,3,3")
df %>% mutate(var_avg = rowMeans(select(., starts_with("var"))))
># A tibble: 3 x 4
> var_one var_two var_three var_avg
> <dbl> <dbl> <dbl> <dbl>
>1 1 1 1 1
>2 2 2 2 2
>3 3 3 3 3
Here's my attempt (I'm new at writing functions):
colnameMeans <- function(x) {
columnname <- paste0("avg_",x)
mutate(columnname <- rowMeans(select(., starts_with(x))))
}
It doesn't work.
df %>% colnameMeans("var")
>Error in colnameMeans(., "var") : unused argument ("var")
I have a lot to learn about functions and I'm not sure where to start with fixing this. Any help would be much appreciated. Note that this is a simplified example. In my real data, I have several column prefixes and I want to calculate a row-wise mean for each one. EDIT: Being able to run the function for multiple prefixes at once would be a bonus.
To set the column name dynamically on the left-hand side, use := and unquote the string with !!. The <- inside mutate won't work: mutate expects name = value pairs, and with = the left-hand side is taken literally as the column name instead of being evaluated. The function also needs the data as an explicit argument:
library(dplyr)
colnameMeans <- function(., x) {
columnname<- paste0("avg_", x)
mutate(., !! columnname := rowMeans(select(., starts_with(x))))
}
df %>%
colnameMeans('var')
# A tibble: 3 x 4
# var_one var_two var_three avg_var
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
If there are several prefixes, use map:
library(purrr)
library(stringr)
colnameMeans <- function(., x) {
columnname<- paste0("avg_", x)
transmute(., !! columnname := rowMeans(select(., starts_with(x))))
}
map_dfc(c('var', 'alt'), ~ df1 %>%
colnameMeans(.x)) %>%
bind_cols(df1, .)
# A tibble: 3 x 8
# var_one var_two var_three alt_var_one alt_var_two alt_var_three avg_var avg_alt
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 1 1 1 1
#2 2 2 2 2 2 2 2 2
#3 3 3 3 3 3 3 3 3
data
df1 <- bind_cols(df, df %>% rename_all(~ str_c('alt_', .)))
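With dplyr 1.0 or later, the same idea can be written with glue-style names ("avg_{prefix}" :=) instead of !!, and a loop handles several prefixes in one call. This is only a sketch: add_prefix_means and the demo tibble are hypothetical names made up for illustration, not part of the answer above.

```r
library(dplyr)

# Hypothetical helper: adds one avg_<prefix> column per prefix
add_prefix_means <- function(data, prefixes) {
  for (prefix in prefixes) {
    data <- mutate(
      data,
      # glue syntax builds the column name from the current prefix
      "avg_{prefix}" := rowMeans(select(data, starts_with(prefix)))
    )
  }
  data
}

# Made-up example data with two prefixes
demo <- tibble(
  var_one = 1:3, var_two = 1:3,
  alt_one = c(2, 4, 6), alt_two = c(4, 6, 8)
)

res <- add_prefix_means(demo, c("var", "alt"))
res
```

Because the helper takes the data frame as its first argument, it also works in a pipe: demo %>% add_prefix_means(c("var", "alt")).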

R dplyr group_by summarise keep last non missing

Consider the following dataset where id uniquely identifies a person, and name varies within id only to the extent of minor spelling issues. I want to aggregate to id level using dplyr:
df= data.frame(id=c(1,1,1,2,2,2),name=c('michael c.','mike', 'michael','','John',NA),var=1:6)
Using group_by(id) yields the correct computation, but I lose the name column:
df %>% group_by(id) %>% summarise(newvar=sum(var)) %>%ungroup()
# A tibble: 2 x 2
id newvar
<dbl> <int>
1 1 6
2 2 15
Using group_by(id,name) yields both name and id but obviously the "wrong" sums.
I would like to keep the last non-missing observation of the name within each group. I basically lack a dplyr version of Stata's lastnm() function:
df %>% group_by(id) %>% summarise(sum = sum(var), Name = lastnm(name))
id sum Name
1 1 6 michael
2 2 15 John
Is there a "keep last non missing"-option?
1) Use mutate like this:
df %>%
group_by(id) %>%
mutate(sum = sum(var)) %>%
ungroup
giving:
# A tibble: 6 x 4
id name var sum
<dbl> <fct> <int> <int>
1 1 michael c. 1 6
2 1 mike 2 6
3 1 michael 3 6
4 2 john 4 15
5 2 john 5 15
6 2 john 6 15
2) Another possibility is:
df %>%
group_by(id) %>%
summarize(name = name %>% unique %>% toString, sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <chr> <int>
1 1 michael c., mike, michael 6
2 2 john 15
3) Another variation is to only report the first name in each group:
df %>%
group_by(id) %>%
summarize(name = first(name), sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <fct> <int>
1 1 michael c. 6
2 2 john 15
I posted a feature request on dplyr's GitHub, and the response there is actually the best answer. For the sake of completeness I repost it here:
df %>%
group_by(id) %>%
summarise(sum=sum(var), Name=last(name[!is.na(name)]))
#> # A tibble: 2 x 3
#> id sum Name
#> <dbl> <int> <chr>
#> 1 1 6 michael
#> 2 2 15 John
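The one-liner above only skips NA. If empty strings (such as the '' in the example data) should also count as missing, a small wrapper can filter both. This is a sketch under that assumption; last_nm is a hypothetical helper name, not a dplyr function:

```r
library(dplyr)

# Hypothetical helper: last value that is neither NA nor an empty string
last_nm <- function(v) {
  keep <- v[!is.na(v) & v != ""]
  if (length(keep) == 0) NA_character_ else keep[length(keep)]
}

df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  name = c("michael c.", "mike", "michael", "", "John", NA),
  var  = 1:6,
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(id) %>%
  summarise(sum = sum(var), Name = last_nm(name))

res
```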

Winners within pairs; or vector-valued group_by mutate?

I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. In particular,
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1))) doesn't work.
The below does, but is clunky with an intermediate summary data frame. Can we improve this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
tibble(pair = rep(1:3, each = 2),
unit = rep(1:2, 3),
score = rnorm(6))
# figure out who won in each pair
summary_df <-
dat %>%
group_by(pair) %>%
summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
left_join(dat, summary_df, "pair") %>%
mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
Maybe related to: Weird group_by + mutate + which.max behavior
You could do:
dat %>%
group_by(pair) %>%
mutate(won = score == max(score),
winner = unit[won == TRUE])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1
Using rank:
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector with 'won alternatives' :
dat %>% group_by(pair) %>%
mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
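As for why the ifelse() attempt in the question fails: ifelse() returns a result shaped like its test, and score[1] > score[2] has length 1, so only the first element of c(1, 0) or c(0, 1) survives and is then recycled across the pair. A plain if/else returns the whole vector. A minimal sketch, using made-up scores in place of the seeded rnorm() draws:

```r
library(dplyr)

dat <- tibble(
  pair  = rep(1:3, each = 2),
  unit  = rep(1:2, 3),
  score = c(-1.4, 0.5, 0.1, -0.8, -0.4, -1.5)  # made-up scores for illustration
)

res <- dat %>%
  group_by(pair) %>%
  # if/else yields a length-2 vector, one value per unit in the pair
  mutate(won = if (score[1] > score[2]) c(1, 0) else c(0, 1)) %>%
  ungroup()

res
```

Like the indexing trick above, this assumes exactly two rows per pair, ordered by unit.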

R dplyr: summarise complete cases by group for all variables

I want to summarise variables by group for every variable in a dataset using dplyr. The summarised variables should be stored under a new name.
An example:
df <- data.frame(
group = c("A", "B", "A", "B"),
a = c(1,1,NA,2),
b = c(1,NA,1,1),
c = c(1,1,2,NA),
d = c(1,2,1,1)
)
df %>% group_by(group) %>%
mutate(complete_a = sum(complete.cases(a))) %>%
mutate(complete_b = sum(complete.cases(b))) %>%
mutate(complete_c = sum(complete.cases(c))) %>%
mutate(complete_d = sum(complete.cases(d))) %>%
group_by(group, complete_a, complete_b, complete_c, complete_d) %>% summarise()
results in my expected output:
# # A tibble: 2 x 5
# # Groups: group, complete_a, complete_b, complete_c [?]
# group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
# A 1 2 2 2
# B 2 1 1 2
How can I generate the same output without duplicating the mutate statements per variable?
I tried:
df %>% group_by(group) %>% summarise_all(funs(sum(complete.cases(.))))
which works but does not rename the variables.
You are almost there. You have to use rename_all:
library(dplyr)
df %>%
group_by(group) %>%
summarise_all(funs(sum(complete.cases(.)))) %>%
rename_all(~paste0("complete_", colnames(df)))
# A tibble: 2 x 5
# complete_group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
#1 A 1 2 2 2
#2 B 2 1 1 2
Edit
Or, as pointed out by @symbolrush, more directly without colnames:
df %>%
group_by(group) %>%
summarise_all(funs(sum(complete.cases(.)))) %>%
rename_all(~paste0("complete_", .))
## A tibble: 2 x 5
# complete_group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
#1 A 1 2 2 2
#2 B 2 1 1 2
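Note that both versions also rename the grouping column to complete_group, and that funs() has been deprecated since dplyr 0.8.0. With dplyr 1.0 or later, across() with a .names template may be a cleaner option, since it prefixes only the summarised columns. A sketch, assuming the df from the question and counting non-missing values with !is.na():

```r
library(dplyr)

df <- data.frame(
  group = c("A", "B", "A", "B"),
  a = c(1, 1, NA, 2),
  b = c(1, NA, 1, 1),
  c = c(1, 1, 2, NA),
  d = c(1, 2, 1, 1)
)

res <- df %>%
  group_by(group) %>%
  # everything() excludes the grouping column; .names adds the prefix
  summarise(across(everything(), ~ sum(!is.na(.x)), .names = "complete_{.col}"))

res
```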
