I would like to aggregate the following dataframe (variables y and z) by number and weight it by "weight". This works as follows:
df = data.frame(number=c("a","a","a","b","c","c"), y=c(1,2,3,4,1,7),
z=c(2,2,6,8,9,1), weight =c(1,1,3,1,2,1))
aggregate = df %>%
group_by(number) %>%
summarise_at(vars(y,z), funs(weighted.mean(. , w=weight)))
Since summarise_at should not longer be used, I tried it with across. But I wasn't successful:
aggregate = df %>%
group_by(number) %>%
summarise(across(everything(), list( mean = mean, sd = sd)))
# this works for mean but I can't just change it with "weighted.mean" etc.
We can pass the anonymous function with ~. By checking the summarise_at, the OP wants to only return the summarisation of columns 'y', 'z', i.e. using everything() would also return the mean, sd and weighted.mean of 'weight' column as well which doesn't make much sense
library(dplyr)
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = mean, sd = sd,
weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
# A tibble: 3 x 7
# number y_mean y_sd y_weighted z_mean z_sd z_weighted
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 2 1 2.4 3.33 2.31 4.4
#2 b 4 NA 4 8 NA 8
#3 c 4 4.24 3 5 5.66 6.33
Often, the mean and sd works well when there are no NA elements. But if there are NA values, we may need to use na.rm = TRUE (by default it is FALSE. In that case, the lambda call would be useful to pass additional parameters
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE),
weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
Related
I would like to use group_by( ) function with my customised function but the column names that goes within group_by would be defined in my function argument.
See a hypothetical example of what my data would look like:
data <- data.frame(ind = rep(c("A", "B", "C"), 4),
gender = rep(c("F", "M"), each = 6),
value = sample(1:100, 12))
And this is the result I would like to have:
result <- data %>%
group_by(ind, gender) %>%
mutate(value = mean(value)) %>%
distinct()
This is how I was trying to make my function to work:
myFunction <- function(data, set_group, variable){
result <- data %>%
group_by(get(set_group)) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
distinct()
}
result3 <- myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
result3
I want to allow that the user define as many set_group as needed and as many variable as needed. I tried using get( ) function, all_of( ) function and mget( ) function within group_by but none worked.
Does anyone know how can I code it?
Thank you!
We could use across within group_by
myFunction <- function(data, set_group, variable){
data %>%
group_by(across(all_of(set_group))) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
ungroup %>%
distinct()
}
-testing
> myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
# A tibble: 6 × 3
ind gender value
<chr> <chr> <dbl>
1 A F 43.5
2 B F 87.5
3 C F 67.5
4 A M 13
5 B M 43.5
6 C M 37.5
Another option is to convert to symbols and evaluate (!!!)
myFunction <- function(data, set_group, variable){
data %>%
group_by(!!! rlang::syms(set_group)) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
ungroup %>%
distinct()
}
-testing
> myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
# A tibble: 6 × 3
ind gender value
<chr> <chr> <dbl>
1 A F 43.5
2 B F 87.5
3 C F 67.5
4 A M 13
5 B M 43.5
6 C M 37.5
NOTE: get is used when there is a single object, for multiple objects mget can be used. But, it is better to use tidyverse functions
Given a vector of names of numeric variables in a dataframe, I need to calculate mean and sd for each variable. For example, given the mtcars dataset and the following vector of variable names:
vars_to_transform <- c("mpg", "disp")
I'd like to have the following as result:
The first solution that came into my mind is the following:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } )
The result is the following:
As you can see, all the returned variables are characters, but I expected to have numbers for avg and sd.
Is there a way to fix this? Or is there any better solution than this?
P.S.
I'm using purr 0.3.4
Seems like an overcomplicated way of doing select->pivot->group->summarise.
mtcars %>%
select(all_of(vars_to_transform)) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(
mean = mean(value),
sd = sd(value)
)
# A tibble: 2 x 3
name mean sd
<chr> <dbl> <dbl>
1 disp 231. 124.
2 mpg 20.1 6.03
The following works (instead of using c() in your code, use tibble):
vars_to_transform %>%
map_dfr(~ tibble(variable = .x, avg = mean(mtcars[[.x]], na.rm = T),
sd = sd(mtcars[[.x]], na.rm = T)))
Explanation: With c(), you are using a vector, whose elements must have the same type (character in your case, because variable is character). With tibble, one can have a different type per element.
#Gwang-Jin Kim suggests, in a comment bellow that I thank, one could also have used list instead of tibble.
Or try with adding type.convert:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } ) %>%
type.convert(as.is=T)
#> # A tibble: 2 × 3
#> variable avg sd
#> <chr> <dbl> <dbl>
#> 1 mpg 20.1 6.03
#> 2 disp 231. 124.
Another option:
library(purrr)
library(dplyr)
vars_to_transform <- c("mpg", "disp")
funs <- lst(mean, sd)
mtcars %>%
select(all_of(vars_to_transform)) %>%
map_df(~ funs %>%
map(exec, .x), .id = "var")
# A tibble: 2 x 3
var mean sd
<chr> <dbl> <dbl>
1 mpg 20.1 6.03
2 disp 231. 124.
m <- mtcars[, vars_to_transform]
tibble(variable = names(m), avg = apply(m, 2, mean), sd = apply(m, 2, sd))
## A tibble: 2 × 3
# variable avg sd
# <chr> <dbl> <dbl>
#1 mpg 20.1 6.03
#2 disp 231. 124.
I have already asked a similar question to this here with the answer below. I wanted to aggregate my dataframe by "number" and calculate a weighted mean. Now I would like to do a weighted sum but somehow I cannot find out how to apply a weighted sum to my dataframe. The weighted.sum function doesn no longer work for my R version.
df = data.frame(number=c("a","a","a","b","c","c"), y=c(1,2,3,4,1,7),
z=c(2,2,6,8,9,1), weight =c(1,1,3,1,2,1))
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE),
weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
We could use
library(dplyr)
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
weighted = ~weighted.mean(., w = weight),
weightedsum = ~ sum(. * weight)), .groups = 'drop'))
# A tibble: 3 x 9
# number y_mean y_sd y_weighted y_weightedsum z_mean z_sd z_weighted z_weightedsum
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 2 1 2.4 12 3.33 2.31 4.4 22
#2 b 4 NA 4 4 8 NA 8 8
#3 c 4 4.24 3 9 5 5.66 6.33 19
I am having some trouble with the new dplyr::summarise() function
Here is the data
df <- data.frame(id = factor(1:10),
group = factor(rep(letters[1:2],each = 5)),
w1 = rnorm(10),
w2 = rnorm(10),
w3 = rnorm(10),
dummy = as.character(LETTERS[1:10]),
stringsAsFactors = F)
Now I want to get means and standard deviations for the numeric variables only. So I ran the following code
df %>%
dplyr::select(id, group, w1:w3) %>%
group_by(group) %>%
dplyr::summarise(across(where(is.numeric), ~ mean(.x, na.rm = T), .names = "mean_{col}"),
across(where(is.numeric), ~ sd(.x, na.rm = T), .names = "sd_{col}"),
count = n())
Which gives me the following output
# A tibble: 2 x 11
# group mean_w1 mean_w2 mean_w3 sd_w1 sd_w2 sd_w3 sd_mean_w1 sd_mean_w2 sd_mean_w3 count
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# a -0.399 0.152 -0.151 1.07 0.703 1.15 NA NA NA 5
# b 0.560 -0.107 -0.0439 1.18 0.612 0.862 NA NA NA 5
Now the columns starting with mean_ and sd_ are exactly what I want, but I'm also getting this set of sd_mean_ columns, I assume because it is trying to find the sd of the new mean_ columns.
How do I get the output without the superfluous columns?
The issue is when you go to second across the number of numeric columns have increased, so it applies sd function to the new columns as well. To avoid this apply multiple function in the same across using list().
library(dplyr)
df %>%
group_by(group) %>%
summarise(across(where(is.numeric), list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE)),
.names = "{fn}_{col}"),
count = n())
# group mean_w1 sd_w1 mean_w2 sd_w2 mean_w3 sd_w3 count
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 a 0.0746 0.696 0.760 1.39 0.0530 1.29 5
#2 b 0.522 0.686 0.0979 0.566 -0.0133 1.12 5
Also, your attempt would work as expected if you don't select columns by their type :
df %>%
group_by(group) %>%
summarise(across(w1:w3, ~ mean(.x, na.rm = T), .names = "mean_{col}"),
across(w1:w3, ~ sd(.x, na.rm = T), .names = "sd_{col}"),
count = n())
I want to use summarize_all on the following data and create my desired output, but I was curious how to do this the tidy way using some combination of mutate and summarize I think? Any help appreciated!!
dummy <- tibble(
a = 1:10,
b = 100:109,
c = 1000:1009
)
Desired Output
tibble(
Mean = colMeans(dummy[1:3]),
Variance = colVars(as.matrix(dummy[1:3])),
CV = Variance/Mean
)
Mean Variance CV
<dbl> <dbl> <dbl>
1 5.5 9.17 1.67
2 104. 9.17 0.0877
3 1004. 9.17 0.00913
It would be easier to reshape to 'long' format and then do it once after grouping by 'name'
library(dplyr)
library(tidyr)
pivot_longer(dummy, cols = everything()) %>%
group_by(name) %>%
summarise(Mean = mean(value), Variance = var(value), CV = Variance/Mean) %>%
select(-name)
# A tibble: 3 x 3
# Mean Variance CV
# <dbl> <dbl> <dbl>
#1 5.5 9.17 1.67
#2 104. 9.17 0.0877
#3 1004. 9.17 0.00913
Or either use summarise_all or summarise/across, but the output would be a single row, then do the reshaping
dummy %>%
summarise(across(everything(), list(Mean = mean,
Variance = var, CV = ~ mean(.)/var(.)))) %>%
pivot_longer(everything()) %>%
separate(name, into = c('name', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value)