Dynamic groupby function with brackets - r

df1 <- mtcars %>%
group_by(gear) %>%
summarise(Mittelwert = mean(mpg, na.rm = TRUE))
df1
df2 <- mtcars %>%
group_by(mtcars[[10]]) %>%
summarise(Mittelwert = mean(mtcars[[1]]), na.rm = TRUE)
df2
The last code gives me the mean of the whole data.frame. Since this code is used in a loop, I need to use brackets. Can you help me get dynamic code that returns valid results?

We can use group_by_at and summarise_at to specify columns by position if we want to avoid using names.
library(dplyr)
mtcars %>%
group_by_at(10) %>%
summarise_at(1, mean, na.rm = TRUE)
# A tibble: 3 x 2
# gear mpg
# <dbl> <dbl>
#1 3.00 16.1
#2 4.00 24.5
#3 5.00 21.4
which is equivalent to
mtcars %>%
group_by(gear) %>%
summarise(Mittelwert = mean(mpg, na.rm = TRUE))
# gear Mittelwert
# <dbl> <dbl>
#1 3.00 16.1
#2 4.00 24.5
#3 5.00 21.4
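The attempt in the question fails because mtcars[[1]] always refers to the full column rather than the current group's slice, and because na.rm = TRUE is passed to summarise() instead of mean(). The _at verbs are superseded in dplyr 1.0 and later, so here is a minimal sketch of the same position-based lookup with across() and the .data pronoun. The names group_col and value_col are illustrative and do not come from the question.

```r
library(dplyr)

# Column positions, e.g. supplied by a loop (illustrative names)
group_col <- names(mtcars)[10]  # "gear"
value_col <- names(mtcars)[1]   # "mpg"

res <- mtcars %>%
  # all_of() selects the grouping column by its name stored in a string
  group_by(across(all_of(group_col))) %>%
  # .data[[...]] looks the column up inside the current group
  summarise(Mittelwert = mean(.data[[value_col]], na.rm = TRUE))
res
```

Converting the positions to names first keeps the result columns readable (gear rather than mtcars[[10]]).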

Related

Calculate mean and sd for given variables in a dataframe

Given a vector of names of numeric variables in a dataframe, I need to calculate mean and sd for each variable. For example, given the mtcars dataset and the following vector of variable names:
vars_to_transform <- c("mpg", "disp")
I'd like to have the following as result:
The first solution that came into my mind is the following:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } )
The result is the following:
As you can see, all the returned variables are characters, but I expected to have numbers for avg and sd.
Is there a way to fix this? Or is there any better solution than this?
P.S.
I'm using purrr 0.3.4
Seems like an overcomplicated way of doing select->pivot->group->summarise.
library(dplyr)
library(tidyr)
mtcars %>%
select(all_of(vars_to_transform)) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(
mean = mean(value),
sd = sd(value)
)
# A tibble: 2 x 3
name mean sd
<chr> <dbl> <dbl>
1 disp 231. 124.
2 mpg 20.1 6.03
The following works (instead of using c() in your code, use tibble):
vars_to_transform %>%
map_dfr(~ tibble(variable = .x, avg = mean(mtcars[[.x]], na.rm = T),
sd = sd(mtcars[[.x]], na.rm = T)))
Explanation: With c(), you are using a vector, whose elements must have the same type (character in your case, because variable is character). With tibble, one can have a different type per element.
As @Gwang-Jin Kim suggests in a comment below (thank you), one could also use list instead of tibble.
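To make the coercion visible, here is a small sketch that first shows what c() does to mixed inputs and then uses the list variant mentioned in the comment. The name as_vector is illustrative only.

```r
library(purrr)
library(dplyr)  # map_dfr() relies on dplyr::bind_rows()

vars_to_transform <- c("mpg", "disp")

# c() coerces mixed inputs to a single type, here character
as_vector <- c(variable = "mpg", avg = mean(mtcars[["mpg"]]))
class(as_vector)

# A list keeps each element's own type, so avg and sd stay numeric
res <- vars_to_transform %>%
  map_dfr(~ list(variable = .x,
                 avg = mean(mtcars[[.x]], na.rm = TRUE),
                 sd = sd(mtcars[[.x]], na.rm = TRUE)))
res
```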
Or try with adding type.convert:
library(dplyr)
library(purrr)
data("mtcars")
vars_to_transform <- c("mpg", "disp")
vars_to_transform %>%
map_dfr( function(x) { c(variable = x, avg = mean(mtcars[[x]], na.rm = T), sd = sd(mtcars[[x]], na.rm = T)) } ) %>%
type.convert(as.is=T)
#> # A tibble: 2 × 3
#> variable avg sd
#> <chr> <dbl> <dbl>
#> 1 mpg 20.1 6.03
#> 2 disp 231. 124.
Another option:
library(purrr)
library(dplyr)
vars_to_transform <- c("mpg", "disp")
funs <- lst(mean, sd)
mtcars %>%
select(all_of(vars_to_transform)) %>%
map_df(~ funs %>%
map(exec, .x), .id = "var")
# A tibble: 2 x 3
var mean sd
<chr> <dbl> <dbl>
1 mpg 20.1 6.03
2 disp 231. 124.
Or with apply on the selected columns:
m <- mtcars[, vars_to_transform]
tibble(variable = names(m), avg = apply(m, 2, mean), sd = apply(m, 2, sd))
## A tibble: 2 × 3
# variable avg sd
# <chr> <dbl> <dbl>
#1 mpg 20.1 6.03
#2 disp 231. 124.

Summary statistics of numeric variables in data frame in specific format

I have a data frame with 3 numeric variables. I need to calculate some statistics for these variables, such as mean, median, standard deviation and kurtosis, and then arrange them in a data frame. The first column of this data frame should contain the variable names, the second column the means, the third column the medians, and so on. How can this be achieved? I am familiar with the dplyr package. Any suggestions?
You can use summarise with across:
library(dplyr)
library(tidyr)
mtcars %>%
select(1:3) %>%
summarise(across(where(is.numeric), list(mean = mean, std = sd, med = median)))
# mpg_mean mpg_std mpg_med cyl_mean cyl_std cyl_med disp_mean disp_std disp_med
#1 20.09062 6.026948 19.2 6.1875 1.785922 6 230.7219 123.9387 196.3
In older versions of dplyr (before 1.0), you can use summarise_if:
mtcars %>%
select(1:3) %>%
summarise_if(is.numeric, list(mean = mean, std = sd, med = median))
You can add pivot_longer to the above answer to get the data in the required format.
mtcars %>%
select(1:3) %>%
summarise(across(where(is.numeric),list(mean=mean,std=sd,med = median))) %>%
pivot_longer(cols = everything(),
names_to = c('col', '.value'),
names_sep = '_')
# A tibble: 3 x 4
# col mean std med
# <chr> <dbl> <dbl> <dbl>
#1 mpg 20.1 6.03 19.2
#2 cyl 6.19 1.79 6
#3 disp 231. 124. 196.
Or you can pivot first and then do the calculation:
mtcars %>%
select(1:3) %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(mean = mean(value), std = sd(value), med = median(value))
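The question also asks for kurtosis, which none of the answers cover and which base R does not provide. A minimal sketch, assuming the plain moment definition (fourth central moment divided by the squared second central moment, with no bias correction). The helper kurt is hypothetical; packages such as moments or e1071 offer maintained versions.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: moment-based kurtosis (a normal distribution gives about 3)
kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

res <- mtcars %>%
  select(1:3) %>%
  pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarise(mean = mean(value), med = median(value),
            std = sd(value), kurtosis = kurt(value))
res
```

Pivoting before summarising avoids having to split combined column names afterwards, so adding another statistic is a one-line change.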

Using summarize_all with colMeans and colVar to create pivoted table in R

I want to use summarize_all on the following data and create my desired output, but I was curious how to do this the tidy way using some combination of mutate and summarize I think? Any help appreciated!!
dummy <- tibble(
a = 1:10,
b = 100:109,
c = 1000:1009
)
Desired Output
tibble(
Mean = colMeans(dummy[1:3]),
Variance = colVars(as.matrix(dummy[1:3])),
CV = Variance/Mean
)
Mean Variance CV
<dbl> <dbl> <dbl>
1 5.5 9.17 1.67
2 104. 9.17 0.0877
3 1004. 9.17 0.00913
It would be easier to reshape to 'long' format and then compute everything once after grouping by 'name':
library(dplyr)
library(tidyr)
pivot_longer(dummy, cols = everything()) %>%
group_by(name) %>%
summarise(Mean = mean(value), Variance = var(value), CV = Variance/Mean) %>%
select(-name)
# A tibble: 3 x 3
# Mean Variance CV
# <dbl> <dbl> <dbl>
#1 5.5 9.17 1.67
#2 104. 9.17 0.0877
#3 1004. 9.17 0.00913
Alternatively, use summarise_all or summarise with across. The output is then a single row, which needs to be reshaped afterwards:
dummy %>%
summarise(across(everything(), list(Mean = mean,
Variance = var, CV = ~ var(.)/mean(.)))) %>%
pivot_longer(everything()) %>%
separate(name, into = c('name', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value)

Define a quantile group in a dataframe with the data source in another dataframe in R

I have the quantiles of a dataframe column stored in a named vector, created with the following code:
library(tidyverse)
quant_mpg <- mtcars %>%
pull(mpg) %>%
quantile(probs = seq(0, 1, 0.1))
I want to use these quantiles to cut a column of a summary dataframe created afterwards:
grouped_mtcars <- mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg)) %>%
ungroup() %>%
mutate(quantile = cut(mpg, quant_mpg, labels = FALSE))
This gives the following output:
# A tibble: 3 x 3
cyl mpg quantile
<dbl> <dbl> <int>
1 4 26.7 9
2 6 19.7 6
3 8 15.1 2
Is there a way to do this directly for the grouped variable without defining the quant_mpg vector? I need it this way because I have several grouping variables and grouped dataframes, and I need to obtain the quantiles without much processing.
We can extract the column from the original data:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(mpg = mean(mpg)) %>%
mutate(quantile = cut(mpg, quantile(mtcars[['mpg']], # deciles of the full, ungrouped column
probs = seq(0, 1, 0.1)), labels = FALSE))
# A tibble: 3 x 3
# cyl mpg quantile
# <dbl> <dbl> <int>
#1 4 26.7 9
#2 6 19.7 6
#3 8 15.1 2
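Since the question mentions several grouping variables, the same idea can be wrapped in a function. This is only a sketch: decile_of_group_mean and the mean_value column are hypothetical names, and include.lowest = TRUE is an added assumption so that a value equal to the overall minimum still falls into the first bin.

```r
library(dplyr)

# Hypothetical helper: bin each group's mean into the deciles of the full column
decile_of_group_mean <- function(data, group, value) {
  # Breaks come from the ungrouped data, as in the answer above
  breaks <- quantile(pull(data, {{ value }}),
                     probs = seq(0, 1, 0.1), na.rm = TRUE)
  data %>%
    group_by({{ group }}) %>%
    summarise(mean_value = mean({{ value }}, na.rm = TRUE),
              .groups = "drop") %>%
    mutate(quantile = cut(mean_value, breaks,
                          labels = FALSE, include.lowest = TRUE))
}

res <- decile_of_group_mean(mtcars, cyl, mpg)
res
```

Other combinations then need only a different call, for example decile_of_group_mean(mtcars, gear, hp).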

Pass a vector of column names into dplyr::summarize to get max/min

I would like to do this:
data %>%
group_by(ID) %>%
summarize(maxVal = max(Val),
maxVal2 = max(Val2))
However, I have a lot of columns I would like to get the Max of. I would like to pass in a vector of columns like this:
cols <- c("Val", "Val2")
data %>%
group_by(ID) %>%
summarize(max(cols))
However, this does not work. How do I fix the syntax to do this easily?
Use summarise_at to apply max to all the columns named in cols. If we also want a prefix on the summarised columns, add rename_at:
library(tidyverse)
data %>%
group_by(ID) %>%
summarise_at(vars(all_of(cols)), max) %>%
rename_at(-1, ~ paste0('max', .))
As a reproducible example, using the mtcars data:
mtcars %>%
group_by(gear) %>%
summarise_at(vars(mpg, disp), max) %>%
rename_at(-1, ~ paste0('max', .))
# A tibble: 3 x 3
# gear maxmpg maxdisp
# <dbl> <dbl> <dbl>
#1 3 21.5 472
#2 4 33.9 168.
#3 5 30.4 351
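In dplyr 1.0 and later, the _at verbs are superseded by across(), which can also name the output columns in the same step. A sketch of the equivalent call on mtcars, with cols holding the column names as in the question:

```r
library(dplyr)

cols <- c("mpg", "disp")

res <- mtcars %>%
  group_by(gear) %>%
  # all_of() reads the column names from the character vector;
  # .names builds each output name from the input column name
  summarise(across(all_of(cols), max, .names = "max{.col}"))
res
```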
