weighted mean in dplyr for multiple columns

weighted mean in dplyr for multiple columns - r

I'm trying to calculate the weighted mean for multiple columns using dplyr. at the moment I'm stuck with summarize_each which to me seems to be part of the solution. here's some example code:
library(dplyr)
f2a <- c(1,0,0,1)
f2b <- c(0,0,0,1)
f2c <- c(1,1,1,1)
clustervar <- c("A","B","B","A")
weight <- c(10,20,30,40)
df <- data.frame (f2a, f2b, f2c, clustervar, weight, stringsAsFactors=FALSE)
df
what I am looking for is something like
df %>%
group_by (clustervar) %>%
summarise_each(funs(weighted.mean(weight)), select=cbind(clustervar, f2a:f2c))
The result of this is only:
# A tibble: 2 × 4
clustervar select4 select5 select6
<chr> <dbl> <dbl> <dbl>
1 A 25 25 25
2 B 25 25 25
What am I missing here?

You can use summarise_at to specify which columns you want to operate on:
df %>% group_by(clustervar) %>%
summarise_at(vars(starts_with('f2')),
funs(weighted.mean(., weight)))
#> # A tibble: 2 × 4
#> clustervar f2a f2b f2c
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 0.8 1
#> 2 B 0 0.0 1

We can reshape it to 'long' format and then do this
library(tidyverse)
gather(df, Var, Val, f2a:f2c) %>%
group_by(clustervar, Var) %>%
summarise(wt =weighted.mean(Val, weight)) %>%
spread(Var, wt)
Or another option is
df %>%
group_by(clustervar) %>%
summarise_each(funs(weighted.mean(., weight)), matches("^f"))
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
# 1 A 1 0.8 1
# 2 B 0 0.0 1
Or with summarise_at and matches (another variation of another post - didn't see the other post while posting)
df %>%
group_by(clustervar) %>%
summarise_at(vars(matches('f2')), funs(weighted.mean(., weight)))
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
#1 A 1 0.8 1
#2 B 0 0.0 1
Or another option is data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) weighted.mean(x, weight)),
by = clustervar, .SDcols = f2a:f2c]
# clustervar f2a f2b f2c
#1: A 1 0.8 1
#2: B 0 0.0 1
NOTE: All four answers are based on legitimate tidyverse/data.table syntax and would get the expected output
We can also create a function that makes use of the syntax from devel version of dplyr (soon to be released 0.6.0). The enquo does the similar job of substitute by taking the input arguments and converting it to quosures. Within the group_by/summarise/mutate, we evalute the quosure by unquoting (UQ or !!) it
wtFun <- function(dat, pat, wtcol, grpcol){
wtcol <- enquo(wtcol)
grpcol <- enquo(grpcol)
dat %>%
group_by(!!grpcol) %>%
summarise_at(vars(matches(pat)), funs(weighted.mean(., !!wtcol)))
}
wtFun(df, "f2", weight, clustervar)
# A tibble: 2 × 4
# clustervar f2a f2b f2c
# <chr> <dbl> <dbl> <dbl>
#1 A 1 0.8 1
#2 B 0 0.0 1

Related

Find mean of counts within groups

I have a dataframe that looks like this:
library(tidyverse)
x <- tibble(
batch = rep(c(1,2), each=10),
exp_id = c(rep('a',3),rep('b',2),rep('c',5),rep('d',6),rep('e',4))
)
I can run the code below to get the count perexp_id:
x %>% group_by(batch,exp_id) %>%
summarise(count=n())
which generates:
batch exp_id count
<dbl> <chr> <dbl>
1 1 a 3
2 1 b 2
3 1 c 5
4 2 d 6
5 2 e 4
A really ugly way to generate the mean of these counts is:
x %>% group_by(batch,exp_id) %>%
summarise(count=n()) %>%
ungroup() %>%
group_by(batch) %>%
summarise(avg_exp = mean(count))
which generates:
batch avg_exp
<dbl> <dbl>
1 1 3.33
2 2 5
Is there a more succinct and "tidy" way generate this?

library(dplyr)
group_by(x, batch) %>%
summarize(avg_exp = mean(table(exp_id)))
# # A tibble: 2 x 2
# batch avg_exp
# <dbl> <dbl>
# 1 1 3.33
# 2 2 5

Here's another way -
library(dplyr)
x %>%
count(batch, exp_id, name = "count") %>%
group_by(batch) %>%
summarise(count = mean(count))
# batch count
# <dbl> <dbl>
#1 1 3.33
#2 2 5

Passing multiple columns from function's argument to group_by

Consider the following example:
library(tidyverse)
df <- tibble(
cat = rep(1:2, times = 4, each = 2),
loc = rep(c("a", "b"), each = 8),
value = rnorm(16)
)
df %>%
group_by(cat, loc) %>%
summarise(mean = mean(value), .groups = "drop")
# # A tibble: 4 x 3
# cat loc mean
# * <int> <chr> <dbl>
# 1 1 a -0.563
# 2 1 b -0.394
# 3 2 a 0.159
# 4 2 b 0.212
I would like to make a function of the last two lines that takes a group argument to pass multiple columns to group_by.
Here's a dummy function that computes the mean values by a group of columns as an example:
group_mean <- function(data, col_value, group) {
data %>%
group_by(across(all_of(group))) %>%
summarise(mean = mean({{col_value}}), .groups = "drop")
}
group_mean(df, value, c("cat", "loc"))
# # A tibble: 4 x 3
# cat loc mean
# * <int> <chr> <dbl>
# 1 1 a -0.563
# 2 1 b -0.394
# 3 2 a 0.159
# 4 2 b 0.212
The function works but I would prefer a tidyselect/rlang approach to avoid quoting column names, like so:
group_mean(df, value, c(cat, loc))
# Error: Problem adding computed columns in `group_by()`.
# x Problem with `mutate()` input `..1`.
# x object 'loc' not found
# ℹ Input `..1` is `across(all_of(c(cat, loc)))`.
Enclosing group in {{}} works for a single column but not for multiple columns. How can I do that?

Consider using ... and then we can have the option to use either quoted or unquoted after converting to symbol with ensym
group_mean <- function(data, col_value, ...) {
data %>%
group_by(!!! ensyms(...)) %>%
summarise(mean = mean({{col_value}}), .groups = "drop")
}
-testing
> group_mean(df, value, cat, loc)
# A tibble: 4 x 3
cat loc mean
<int> <chr> <dbl>
1 1 a 0.327
2 1 b -0.291
3 2 a -0.382
4 2 b -0.320
> group_mean(df, value, 'cat', 'loc')
# A tibble: 4 x 3
cat loc mean
<int> <chr> <dbl>
1 1 a 0.327
2 1 b -0.291
3 2 a -0.382
4 2 b -0.320
If we are already using ... as other arguments, then an option is
group_mean <- function(data, col_value, group) {
grp_lst <- as.list(substitute(group))
if(length(grp_lst)> 1) grp_lst <- grp_lst[-1]
grps <- purrr::map_chr(grp_lst, rlang::as_string)
data %>%
group_by(across(all_of(grps))) %>%
summarise(mean = mean({{col_value}}), .groups = "drop")
}
-testing
> group_mean(df, value, c(cat, loc))
# A tibble: 4 x 3
cat loc mean
<int> <chr> <dbl>
1 1 a 0.327
2 1 b -0.291
3 2 a -0.382
4 2 b -0.320

What is the best way to tidily append many columns?

I'm looking to append 30 columns which give values for gamma distributions by using the tidyverse. Here's an example of the data:
data.frame('rank'=1:3,'shape'=c(16,0.2,4),'rate'=c(13,0.4,0.2))
I'd like to use dgamma(1:30,shape,rate) to append 30 columns to the existing dataframe.

You can use map2() in purrr and unnest_wider() in tidyr.
library(tidyverse)
df %>%
mutate(density = map2(shape, rate, dgamma, x = 1:30)) %>%
unnest_wider(density, names_sep = "_")
Or use rowwise() at first and then mutate() with list().
df %>%
rowwise() %>%
mutate(density = list(dgamma(1:30, shape, rate))) %>%
unnest_wider(density, names_sep = "_")
Both of them give
# # A tibble: 3 x 33
# rank shape rate density_1 density_2 density_3 density_4 density_5 density_6 density_7
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 16 13 1.15 0.0852 0.0000843 1.43e-8 9.16e-13 3.19e-17 7.28e-22
# 2 2 0.2 0.4 0.122 0.0468 0.0227 1.21e-2 6.77e- 3 3.92e- 3 2.32e- 3
# 3 3 4 0.2 0.000218 0.00143 0.00395 7.67e-3 1.23e- 2 1.73e- 2 2.26e- 2
# # … with 23 more variables: density_8 <dbl>, density_9 <dbl>, density_10 <dbl>, ..., density_30 <dbl>

Writing function that calculates rowwise mean for subset of columns and creates column name

I want to turn this line of code into a function:
mutate(var_avg = rowMeans(select(., starts_with("var"))))
It works in the pipe:
df <- read_csv("var_one,var_two,var_three
1,1,1
2,2,2
3,3,3")
df %>% mutate(var_avg = rowMeans(select(., starts_with("var"))))
># A tibble: 3 x 4
> var_one var_two var_three var_avg
> <dbl> <dbl> <dbl> <dbl>
>1 1 1 1 1
>2 2 2 2 2
>3 3 3 3 3
Here's my attempt (I'm new at writing functions):
colnameMeans <- function(x) {
columnname <- paste0("avg_",x)
mutate(columnname <- rowMeans(select(., starts_with(x))))
}
It doesn't work.
df %>% colnameMeans("var")
>Error in colnameMeans(., "var") : unused argument ("var")
I have a lot to learn about functions and I'm not sure where to start with fixing this. Any help would be much appreciated. Note that this is a simplified example. In my real data, I have several column prefixes and I want to calculate a row-wise mean for each one. EDIT: Being able to run the function for multiple prefixes at once would be a bonus.

If we need to assign column name on the lhs of assignment, use := and evaluate (!!) the string. The <- inside mutate won't work as the default option is = and it would evaluate unquoted value on the lhs of = literally. In addition, we may need to specify the data as argument in the function
library(dplyr)
colnameMeans <- function(., x) {
columnname<- paste0("avg_", x)
mutate(., !! columnname := rowMeans(select(., starts_with(x))))
}
df %>%
colnameMeans('var')
# A tibble: 3 x 4
# var_one var_two var_three avg_var
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 2 2
#3 3 3 3 3
If there are several prefixes, use map
library(purrr)
library(stringr)
colnameMeans <- function(., x) {
columnname<- paste0("avg_", x)
transmute(., !! columnname := rowMeans(select(., starts_with(x))))
}
map_dfc(c('var', 'alt'), ~ df1 %>%
colnameMeans(.x)) %>%
bind_cols(df1, .)
# A tibble: 3 x 8
# var_one var_two var_three alt_var_one alt_var_two alt_var_three avg_var avg_alt
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 1 1 1 1
#2 2 2 2 2 2 2 2 2
#3 3 3 3 3 3 3 3 3
data
df1 <- bind_cols(df, df %>% rename_all(~ str_replace(., 'var_', 'new_')))

dplyr: Handing over multiple variables to group_by in a function [duplicate]

This question already has an answer here:
How to pass multiple group_by arguments and a dynamic variable argument to a dplyr function
(1 answer)
Closed 3 years ago.
I have a function with dplyr::summarize. How can I hand over more than one variable to it?
Example:
myfunction <- function(mydf, grp) {
library(dplyr)
grp <- enquo(grp)
result <- mydf %>%
group_by(!! grp) %>%
summarise(sum = sum(x))
result
}
# works
myfunction(df, grp1)
# doesn't work
myfunction(df, c(grp1, grp2))

If we pass multiple variables, pass that as a string and make use of group_by_at
myfunction <- function(mydf, grp, xvar) {
mydf %>%
group_by_at(grp) %>%
summarise(sum = sum({{xvar}}))
}
myfunction(mtcars, "am", mpg)
# A tibble: 2 x 2
# am sum
# <dbl> <dbl>
#1 0 326.
#2 1 317.
myfunction(mtcars, c("am", "gear"), mpg)
# A tibble: 4 x 3
# Groups: am [2]
# am gear sum
# <dbl> <dbl> <dbl>
#1 0 3 242.
#2 0 4 84.2
#3 1 4 210.
#4 1 5 107.
In case, we want to pass the groups as showed in the OP's post, one way is to convert with enexpr and evaluate (!!!)
myfunction <- function(mydf, grp, xvar) {
grp <- as.list(rlang::enexpr(grp))
grp <- if(length(grp) > 1) grp[-1] else grp
mydf %>%
group_by(!!! grp) %>%
summarise(sum = sum({{xvar}}))
}
myfunction(mtcars, am, mpg)
# A tibble: 2 x 2
# am sum
# <dbl> <dbl>
#1 0 326.
#2 1 317.
myfunction(mtcars, c(am, gear), mpg)
# A tibble: 4 x 3
# Groups: am [2]
# am gear sum
# <dbl> <dbl> <dbl>
#1 0 3 242.
#2 0 4 84.2
#3 1 4 210.
#4 1 5 107.

Categories

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

weighted mean in dplyr for multiple columns - r

You can use summarise_at to specify which columns you want to operate on: df %>% group_by(clustervar) %>% summarise_at(vars(starts_with('f2')), funs(weighted.mean(., weight))) #> # A tibble: 2 × 4 #> clustervar f2a f2b f2c #> <chr> <dbl> <dbl> <dbl> #> 1 A 1 0.8 1 #> 2 B 0 0.0 1

Related

Find mean of counts within groups

Passing multiple columns from function's argument to group_by

What is the best way to tidily append many columns?

Writing function that calculates rowwise mean for subset of columns and creates column name

dplyr: Handing over multiple variables to group_by in a function [duplicate]

Categories

Resources