Given a vector of names of numeric variables in a dataframe, I need to calculate mean and sd for each variable. For example, given the mtcars dataset and the following vector of variable names:
vars_to_transform <- c("mpg", "disp")
I'd like to have the following as result:
The first solution that came to mind is the following:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } )
The result is the following:
As you can see, all the returned variables are characters, but I expected to have numbers for avg and sd.
Is there a way to fix this? Or is there any better solution than this?
P.S.
I'm using purrr 0.3.4
Seems like an overcomplicated way of doing select->pivot->group->summarise.
library(tidyr)  # for pivot_longer(); dplyr is assumed to be loaded as in the question
mtcars %>%
select(all_of(vars_to_transform)) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(
mean = mean(value),
sd = sd(value)
)
# A tibble: 2 x 3
name mean sd
<chr> <dbl> <dbl>
1 disp 231. 124.
2 mpg 20.1 6.03
The following works (instead of using c() in your code, use tibble):
vars_to_transform %>%
map_dfr(~ tibble(variable = .x, avg = mean(mtcars[[.x]], na.rm = T),
sd = sd(mtcars[[.x]], na.rm = T)))
Explanation: c() builds an atomic vector, whose elements must all have the same type. Because variable is a character, the numeric avg and sd values are coerced to character as well. A tibble can hold a different type in each column.
As @Gwang-Jin Kim suggests in a comment below (thanks!), one could also have used list instead of tibble.
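To make the coercion concrete, here is a minimal sketch. The object names mixed and res are only for illustration, and it assumes dplyr is installed, since map_dfr() relies on it to bind rows:

```r
library(purrr)

# c() creates an atomic vector: mixing a string with a number
# silently converts the number to character
mixed <- c(variable = "mpg", avg = mean(mtcars$mpg))
class(mixed)

# Returning a list keeps each element's own type, so the
# resulting columns avg and sd stay numeric
vars_to_transform <- c("mpg", "disp")
res <- map_dfr(vars_to_transform, function(x) {
  list(
    variable = x,
    avg = mean(mtcars[[x]], na.rm = TRUE),
    sd = sd(mtcars[[x]], na.rm = TRUE)
  )
})
res
```

The only change from the question's code is list() in place of c(), which is why it is probably the smallest possible fix.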
Or add type.convert at the end of the pipeline:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } ) %>%
type.convert(as.is=T)
#> # A tibble: 2 × 3
#> variable avg sd
#> <chr> <dbl> <dbl>
#> 1 mpg 20.1 6.03
#> 2 disp 231. 124.
Another option:
library(purrr)
library(dplyr)
vars_to_transform <- c("mpg", "disp")
funs <- lst(mean, sd)
mtcars %>%
select(all_of(vars_to_transform)) %>%
map_df(~ funs %>%
map(exec, .x), .id = "var")
# A tibble: 2 x 3
var mean sd
<chr> <dbl> <dbl>
1 mpg 20.1 6.03
2 disp 231. 124.
m <- mtcars[, vars_to_transform]
tibble(variable = names(m), avg = apply(m, 2, mean), sd = apply(m, 2, sd))
## A tibble: 2 × 3
# variable avg sd
# <chr> <dbl> <dbl>
#1 mpg 20.1 6.03
#2 disp 231. 124.
Related
I wrote an R function to compute the median by group:
varA<-rep(c(1:2),times=30)
df1<-data.frame(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1 <- df1 %>% mutate(outcome=ifelse(varA==1, "Yes", "No"))
ctn_me<- function(df, var, group_var) {
df[[group_var]]<-as.character(df[[group_var]])
# df[[var]]<-as.numeric(df[[var]])
tbl1<-df %>%
bind_rows(mutate(., !!group_var := 'Total')) %>%
dplyr::group_by(gpvar=.[[group_var]])%>%
dplyr::summarise(
median=median(.[[var]], na.rm = TRUE),
N = n())
print(tbl1)
}
ctn_me(df1, "var1", "outcome")
It gave me results like this:
#### gpvar median N
#### <chr> <dbl> <int>
#### 1 No 734 30
#### 2 Total 734 60
#### 3 Yes 734 30
So it counts the number of rows within each group correctly, but for the median it returned the overall median instead of the per-group median.
This gave me the results I wanted:
df1 %>% bind_rows(mutate(., outcome := 'Total')) %>%
dplyr::group_by(outcome)%>%
dplyr::summarise(
median=median(var1, na.rm = TRUE),
N = n())
# A tibble: 3 x 3
# outcome median N
# <chr> <dbl> <int>
# 1 No 713 30
# 2 Total 734 60
# 3 Yes 788. 30
I was trying to figure out what was wrong with my R function. Can anyone let me know? Thanks!
The docs state that you need to specifically reference ".data" within the summarise() function:
"When you have an env-variable that is a character vector, you need to
index into the .data pronoun with [[, like summarise(df, mean =
mean(.data[[var]]))."
Inside summarise(), . still refers to the whole data frame passed through the pipe, so .[[var]] always returns the full column and every group gets the overall median. In this case, you need to change .[[var]] to .data[[var]], i.e.
library(tidyverse)
set.seed(123)
varA<-rep(c(1:2),times=30)
df1<-data.frame(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1 <- df1 %>% mutate(outcome=ifelse(varA==1, "Yes", "No"))
ctn_me <- function(df, var, group_var) {
df %>%
bind_rows(mutate(., !!group_var := "Total")) %>%
group_by(gpvar = .[[group_var]]) %>%
summarise(
median_group = median(.data[[var]], na.rm = TRUE),
N = n()
)
}
ctn_me(df1, "var1", "outcome")
#> # A tibble: 3 × 3
#> gpvar median_group N
#> <chr> <dbl> <int>
#> 1 No 740. 30
#> 2 Total 754 60
#> 3 Yes 776. 30
Created on 2022-07-19 by the reprex package (v2.0.1)
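The difference between . and .data can be seen side by side in a small sketch. The toy data frame and the column names whole and grouped are made up for illustration:

```r
library(dplyr)

df <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))
var <- "x"

res <- df %>%
  group_by(g) %>%
  summarise(
    # `.` is the entire data frame on the left of the pipe,
    # so this is the median of all four values in every group
    whole = median(.[[var]]),
    # `.data` is the slice belonging to the current group
    grouped = median(.data[[var]])
  )
res
```

Here whole is 6.5 for both groups, while grouped is 2 for "a" and 15 for "b".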
Original answer:
If you use a different syntax inside the summarise() function it works as expected, so I think it's something to do with the summarise() function:
library(tidyverse)
set.seed(123)
varA<-rep(c(1:2),times=30)
df1<-data.frame(varA)
df1$var1 <- sample(500:1000, length(df1$varA))
df1 <- df1 %>% mutate(outcome=ifelse(varA==1, "Yes", "No"))
ctn_me <- function(df, var, group_var) {
df %>%
bind_rows(mutate(., !!group_var := "Total")) %>%
group_by(gpvar = .[[group_var]]) %>%
summarise(
median_group = median(!!sym(var), na.rm = TRUE),
N = n()
)
}
ctn_me(df1, "var1", "outcome")
#> # A tibble: 3 × 3
#> gpvar median_group N
#> <chr> <dbl> <int>
#> 1 No 740. 30
#> 2 Total 754 60
#> 3 Yes 776. 30
Created on 2022-07-19 by the reprex package (v2.0.1)
Try this for non-standard evaluation.
ctn_me<- function(df, var, group_var) {
df[[group_var]]<-as.character(df[[group_var]])
# df[[var]]<-as.numeric(df[[var]])
tbl1<-df %>%
bind_rows(mutate(., !!group_var := 'Total')) %>%
dplyr::group_by(.data[[group_var]])%>%
dplyr::summarise(
median=median(.data[[var]], na.rm = TRUE),
N = n())
print(tbl1)
}
I would like to aggregate the following dataframe (variables y and z) by number and weight it by "weight". This works as follows:
df = data.frame(number=c("a","a","a","b","c","c"), y=c(1,2,3,4,1,7),
z=c(2,2,6,8,9,1), weight =c(1,1,3,1,2,1))
aggregate = df %>%
group_by(number) %>%
summarise_at(vars(y,z), funs(weighted.mean(. , w=weight)))
Since summarise_at should no longer be used, I tried it with across, but I wasn't successful:
aggregate = df %>%
group_by(number) %>%
summarise(across(everything(), list( mean = mean, sd = sd)))
# this works for mean but I can't just change it with "weighted.mean" etc.
We can pass an anonymous function with ~. Judging by the summarise_at call, the OP only wants to summarise columns 'y' and 'z'. Using everything() would also return the mean, sd and weighted.mean of the 'weight' column, which doesn't make much sense.
library(dplyr)
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = mean, sd = sd,
weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
# A tibble: 3 x 7
# number y_mean y_sd y_weighted z_mean z_sd z_weighted
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 2 1 2.4 3.33 2.31 4.4
#2 b 4 NA 4 8 NA 8
#3 c 4 4.24 3 5 5.66 6.33
Passing mean and sd by name works well when there are no NA elements. If there are NA values, we may need na.rm = TRUE (the default is FALSE). In that case, the lambda call is useful for passing additional parameters:
df %>%
group_by(number) %>%
summarise(across(c(y, z),
list( mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE),
weighted = ~weighted.mean(., w = weight))), .groups = 'drop')
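As a small sketch of how this plays out with missing values (the toy data below is invented for illustration and is not the OP's data frame): weighted.mean() also accepts na.rm = TRUE, in which case the NA values are dropped together with their weights.

```r
library(dplyr)

# Toy data with one missing value in group "a"
df_na <- data.frame(
  number = c("a", "a", "a", "b"),
  y      = c(1, NA, 3, 4),
  weight = c(1, 1, 3, 1)
)

res <- df_na %>%
  group_by(number) %>%
  summarise(across(y,
    list(
      mean     = ~ mean(.x, na.rm = TRUE),
      # `weight` is looked up in the current group, like any other column
      weighted = ~ weighted.mean(.x, w = weight, na.rm = TRUE)
    )),
    .groups = "drop")
res
```

For group "a" the weighted mean is (1*1 + 3*3) / (1 + 3) = 2.5, because the NA and its weight of 1 are both ignored.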
I want to use summarize_all on the following data to create my desired output, and I was curious how to do this the tidy way, perhaps with some combination of mutate and summarize. Any help appreciated!
dummy <- tibble(
a = 1:10,
b = 100:109,
c = 1000:1009
)
Desired Output
tibble(
Mean = colMeans(dummy[1:3]),
Variance = colVars(as.matrix(dummy[1:3])),
CV = Variance/Mean
)
Mean Variance CV
<dbl> <dbl> <dbl>
1 5.5 9.17 1.67
2 104. 9.17 0.0877
3 1004. 9.17 0.00913
It would be easier to reshape to 'long' format and then do it once after grouping by 'name'
library(dplyr)
library(tidyr)
pivot_longer(dummy, cols = everything()) %>%
group_by(name) %>%
summarise(Mean = mean(value), Variance = var(value), CV = Variance/Mean) %>%
select(-name)
# A tibble: 3 x 3
# Mean Variance CV
# <dbl> <dbl> <dbl>
#1 5.5 9.17 1.67
#2 104. 9.17 0.0877
#3 1004. 9.17 0.00913
Alternatively, use summarise_all or summarise/across. The output is then a single row, which has to be reshaped afterwards:
dummy %>%
summarise(across(everything(), list(Mean = mean,
Variance = var, CV = ~ var(.)/mean(.)))) %>%  # CV = Variance/Mean, as in the desired output
pivot_longer(everything()) %>%
separate(name, into = c('name', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value)
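The reshaping step can probably be shortened. A sketch, assuming the summary columns are named like a_Mean and a_Variance (the default naming of across()), uses the ".value" sentinel of pivot_longer() so that separate() and pivot_wider() are not needed:

```r
library(dplyr)
library(tidyr)

dummy <- tibble(
  a = 1:10,
  b = 100:109,
  c = 1000:1009
)

res <- dummy %>%
  summarise(across(everything(), list(Mean = mean, Variance = var))) %>%
  # "a_Mean" is split at "_": the first part goes to `name`,
  # the second part becomes the name of an output column
  pivot_longer(everything(),
               names_to = c("name", ".value"),
               names_sep = "_") %>%
  mutate(CV = Variance / Mean)
res
```

Computing CV after the reshape keeps its definition (Variance/Mean) in one place instead of repeating mean() and var() inside a lambda.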
I have the quantile information of a data frame column stored in a named vector, created with the following code:
library(tidyverse)
quant_mpg <- mtcars %>%
pull(mpg) %>%
quantile(probs = seq(0, 1, 0.1))
I then want to use these quantiles to bin the values of a summary data frame created afterwards:
grouped_mtcars <- mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg)) %>%
ungroup() %>%
mutate(quantile = cut(mpg, quant_mpg, labels = FALSE))
This gives the following output:
# A tibble: 3 x 3
cyl mpg quantile
<dbl> <dbl> <int>
1 4 26.7 9
2 6 19.7 6
3 8 15.1 2
Is there a way to do this directly for the grouped variable, without defining the quant_mpg vector? I need it this way because I have several grouping variables and grouped data frames, and I need to obtain the quantiles without much processing.
We can extract the column from the original data and compute the quantiles inline:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(mpg = mean(mpg)) %>%
mutate(quantile = cut(mpg, quantile(mtcars[['mpg']], #####
probs = seq(0, 1, 0.1)), labels = FALSE))
# A tibble: 3 x 3
# cyl mpg quantile
# <dbl> <dbl> <int>
#1 4 26.7 9
#2 6 19.7 6
#3 8 15.1 2
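Since the question mentions several grouping variables, the same idea could be wrapped in a function. This is only a sketch: the helper name mean_decile and its arguments are hypothetical, and include.lowest = TRUE is an addition so that a group mean equal to the overall minimum is not returned as NA.

```r
library(dplyr)

# Hypothetical helper: group mean of `value_var`, binned by the
# deciles of `value_var` in the full (ungrouped) data
mean_decile <- function(data, group_var, value_var) {
  breaks <- quantile(data[[value_var]], probs = seq(0, 1, 0.1), na.rm = TRUE)
  data %>%
    group_by(.data[[group_var]]) %>%
    summarise(mean = mean(.data[[value_var]], na.rm = TRUE), .groups = "drop") %>%
    mutate(quantile = cut(mean, breaks, labels = FALSE, include.lowest = TRUE))
}

res <- mean_decile(mtcars, "cyl", "mpg")
res
```

The column names are passed as strings, so the function can be called in a loop, e.g. over c("cyl", "gear", "carb").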
df1 <- mtcars %>%
group_by(gear) %>%
summarise(Mittelwert = mean(mpg, na.rm = TRUE))
df1
df2 <- mtcars %>%
group_by(mtcars[[10]]) %>%
summarise(Mittelwert = mean(mtcars[[1]]), na.rm = TRUE)
df2
The last code gives me the mean of the whole data.frame. Since this code is used in a loop, I need to refer to the columns by position with brackets. Can you help me write dynamic code that gives valid results?
We can use group_by_at and summarise_at to specify column numbers if we want to avoid using names.
library(dplyr)
mtcars %>%
group_by_at(10) %>%
summarise_at(1, mean, na.rm = TRUE)
# A tibble: 3 x 2
# gear mpg
# <dbl> <dbl>
#1 3.00 16.1
#2 4.00 24.5
#3 5.00 21.4
which is equivalent to
mtcars %>%
group_by(gear) %>%
summarise(Mittelwert = mean(mpg, na.rm = TRUE))
# gear Mittelwert
# <dbl> <dbl>
#1 3.00 16.1
#2 4.00 24.5
#3 5.00 21.4
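In current dplyr the _at variants are superseded by across(). A possible equivalent, sketched here by first translating the positions into names (group_col and value_col are illustrative variable names), would be:

```r
library(dplyr)

# Look up the column names once from their positions
group_col <- names(mtcars)[10]  # "gear"
value_col <- names(mtcars)[1]   # "mpg"

res <- mtcars %>%
  group_by(across(all_of(group_col))) %>%
  summarise(across(all_of(value_col), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
res
```

Translating positions to names up front avoids any ambiguity about how positions are counted once grouping columns are excluded from the selection.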