How to create a function to get summary statistics as columns? - r

I have three workflows to get Mean, Standard Deviation, and Variance. Would it be possible to simplify this by creating one function with one table with all the summaries as the result?
Mean
iris %>%
select(-Species) %>%
summarise_all(., mean, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(Mean = V1)
Standard Deviation
iris %>%
select(-Species) %>%
summarise_all(., sd, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(SD = V1)
Variance
iris %>%
select(-Species) %>%
summarise_all(., var, na.rm = TRUE) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("Name") %>%
rename(Variance = V1)

We could reshape to 'long' format and then do a group-by operation to create the three summary columns:
library(dplyr)
library(tidyr)
iris %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Name") %>%
group_by(Name) %>%
summarise(Mean = mean(value, na.rm = TRUE),
SD = sd(value, na.rm = TRUE),
Variance = var(value, na.rm = TRUE))
Output:
# A tibble: 4 × 4
Name Mean SD Variance
<chr> <dbl> <dbl> <dbl>
1 Petal.Length 3.76 1.77 3.12
2 Petal.Width 1.20 0.762 0.581
3 Sepal.Length 5.84 0.828 0.686
4 Sepal.Width 3.06 0.436 0.190

iris %>%
select(-Species) %>%
summarise_all(list(mean = mean,sd = sd, var = var), na.rm = TRUE)%>%
pivot_longer(everything(), names_sep = '_', names_to = c('Name','.value'))
# A tibble: 4 x 4
Name mean sd var
<chr> <dbl> <dbl> <dbl>
1 Sepal.Length 5.84 0.828 0.686
2 Sepal.Width 3.06 0.436 0.190
3 Petal.Length 3.76 1.77 3.12
4 Petal.Width 1.20 0.762 0.581
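Since the question asks for a single function, the long-format pipeline above can be wrapped in one. This is only a minimal sketch: the name `summary_stats` is hypothetical and not part of either answer, and it assumes dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not from the original answers): returns one row per
# numeric column, with mean, SD and variance as columns.
summary_stats <- function(data) {
  data %>%
    select(where(is.numeric)) %>%                          # drop non-numeric columns
    pivot_longer(cols = everything(), names_to = "Name") %>%
    group_by(Name) %>%
    summarise(
      Mean     = mean(value, na.rm = TRUE),
      SD       = sd(value, na.rm = TRUE),
      Variance = var(value, na.rm = TRUE)
    )
}

stats <- summary_stats(iris)
stats
```

Because the column names travel through `names_to` rather than being split on `_`, this variant should not depend on the column names being free of underscores, which the `names_sep = '_'` approach does.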

Related

R create multiple columns in one mutate command

I have a data.frame like this.
library(tidyverse)
df <- tibble(
name = rep(c("a", "b"), each = 100),
value = runif(100*2),
date = rep(Sys.Date() + days(1:100), 2)
)
I would like to do something very similar to the code below. Is there a way to create these columns in one go? Basically, I am trying to find out how much the 99th percentile changes if we remove one observation, then two, then three, and so on.
df %>%
nest_by(name) %>%
mutate(
q99_lag_0 = data %>% pull(value) %>% quantile(.99),
q99_lag_1 = data %>% pull(value) %>% tail(-1) %>% quantile(.99),
q99_lag_2 = data %>% pull(value) %>% tail(-2) %>% quantile(.99),
q99_lag_3 = data %>% pull(value) %>% tail(-3) %>% quantile(.99),
q99_lag_4 = data %>% pull(value) %>% tail(-4) %>% quantile(.99),
q99_lag_5 = data %>% pull(value) %>% tail(-5) %>% quantile(.99),
q99_lag_6 = data %>% pull(value) %>% tail(-6) %>% quantile(.99),
q99_lag_7 = data %>% pull(value) %>% tail(-7) %>% quantile(.99),
q99_lag_8 = data %>% pull(value) %>% tail(-8) %>% quantile(.99),
q99_lag_9 = data %>% pull(value) %>% tail(-9) %>% quantile(.99),
q99_lag_10 = data %>% pull(value) %>% tail(-10) %>% quantile(.99)
)
First, reproducible random data:
library(dplyr)
library(purrr) # map_dfx
set.seed(42)
df <- tibble(
name = rep(c("a", "b"), each = 100),
value = runif(100*2),
date = rep(Sys.Date() + 1:100, 2)
)
head(df)
# # A tibble: 6 x 3
# name value date
# <chr> <dbl> <date>
# 1 a 0.915 2021-12-14
# 2 a 0.937 2021-12-15
# 3 a 0.286 2021-12-16
# 4 a 0.830 2021-12-17
# 5 a 0.642 2021-12-18
# 6 a 0.519 2021-12-19
Then the call:
df %>%
nest_by(name) %>%
mutate(
q99_lag_0 = quantile(data$value, 0.99),
map_dfc(-1:-10, ~ tibble("q99_lag_{-.x}" := quantile(tail(data$value, .x), 0.99)))
) %>%
ungroup()
# # A tibble: 2 x 13
# name data q99_lag_0 q99_lag_1 q99_lag_2 q99_lag_3 q99_lag_4 q99_lag_5 q99_lag_6 q99_lag_7 q99_lag_8 q99_lag_9 q99_lag_10
# <chr> <list<tbl_df[,2]>> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a [100 x 2] 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983 0.983
# 2 b [100 x 2] 0.963 0.963 0.963 0.963 0.963 0.963 0.946 0.946 0.946 0.947 0.947
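The call above computes `q99_lag_0` separately, presumably because `tail(x, -0)` is the same as `tail(x, 0)` and returns an empty vector. A possible variant that treats lag 0 like every other lag is sketched below. The helper name `q99_drop` is hypothetical, the `date` column is left out for brevity, and the sketch assumes dplyr >= 1.0.0, where an unnamed data frame returned inside `summarise()` is unpacked into columns.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(42)
df <- tibble(
  name  = rep(c("a", "b"), each = 100),
  value = runif(100 * 2)
)

# Hypothetical helper: 99% quantile after dropping the first k observations.
# Logical indexing keeps every element when k = 0, unlike tail(x, -0).
q99_drop <- function(x, k) unname(quantile(x[seq_along(x) > k], 0.99))

lags <- 0:10

res <- df %>%
  group_by(name) %>%
  summarise(
    # An unnamed one-row tibble is unpacked into separate columns.
    as_tibble(set_names(
      map(lags, function(k) q99_drop(value, k)),
      paste0("q99_lag_", lags)
    ))
  )
res
```

Changing `lags` is then the only edit needed to add or remove columns.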

Summary statistics of numeric variables in data frame in specific format

I have a data frame with 3 numeric variables. I need to calculate some statistics of these variables, such as mean, median, standard deviation, and kurtosis, and then arrange them in a data frame. The first column of this data frame should contain the numeric variable names, the second column the mean values, the third column the median values, and so on. How can this be achieved? I am familiar with the dplyr package, so any suggestions are welcome.
You can use summarise with across:
library(dplyr)
library(tidyr)
mtcars %>%
select(1:3) %>%
summarise(across(where(is.numeric), list(mean = mean, std = sd, med = median)))
# mpg_mean mpg_std mpg_med cyl_mean cyl_std cyl_med disp_mean disp_std disp_med
#1 20.09062 6.026948 19.2 6.1875 1.785922 6 230.7219 123.9387 196.3
In older versions of dplyr (before 1.0.0, which introduced across), you can use summarise_if:
mtcars %>%
select(1:3) %>%
summarise_if(is.numeric, list(mean = mean, std = sd, med = median))
You can add pivot_longer to the code above to get the data in the required format:
mtcars %>%
select(1:3) %>%
summarise(across(where(is.numeric),list(mean=mean,std=sd,med = median))) %>%
pivot_longer(cols = everything(),
names_to = c('col', '.value'),
names_sep = '_')
# A tibble: 3 x 4
# col mean std med
# <chr> <dbl> <dbl> <dbl>
#1 mpg 20.1 6.03 19.2
#2 cyl 6.19 1.79 6
#3 disp 231. 124. 196.
Or you can pivot first and then do the calculation:
mtcars %>%
select(1:3) %>%
pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(mean = mean(value), std = sd(value), med = median(value))
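The question also asks for kurtosis, which the snippets above do not compute. A minimal sketch follows, assuming the moment-based (non-excess) definition, i.e. the fourth central moment divided by the squared second central moment. The helper `kurt` is hypothetical and written by hand here. Packages such as moments provide a `kurtosis()` function, and their definitions may differ (for example, excess kurtosis).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: moment-based kurtosis (assumed definition, not excess).
kurt <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

res <- mtcars %>%
  select(1:3) %>%
  pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarise(
    mean = mean(value),
    med  = median(value),
    std  = sd(value),
    kurt = kurt(value)
  )
res
```

Note that `group_by(name)` sorts the rows alphabetically by variable name, so the row order differs from the column order in `mtcars`.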

Using summarize_all with colMeans and colVar to create pivoted table in R

I want to use summarize_all on the following data to create my desired output, and I was curious how to do this the tidy way, perhaps using some combination of mutate and summarize. Any help appreciated!
dummy <- tibble(
a = 1:10,
b = 100:109,
c = 1000:1009
)
Desired Output
tibble(
Mean = colMeans(dummy[1:3]),
Variance = matrixStats::colVars(as.matrix(dummy[1:3])), # colVars comes from the matrixStats package
CV = Variance/Mean
)
Mean Variance CV
<dbl> <dbl> <dbl>
1 5.5 9.17 1.67
2 104. 9.17 0.0877
3 1004. 9.17 0.00913
It would be easier to reshape to 'long' format and then compute all the statistics at once after grouping by 'name':
library(dplyr)
library(tidyr)
pivot_longer(dummy, cols = everything()) %>%
group_by(name) %>%
summarise(Mean = mean(value), Variance = var(value), CV = Variance/Mean) %>%
select(-name)
# A tibble: 3 x 3
# Mean Variance CV
# <dbl> <dbl> <dbl>
#1 5.5 9.17 1.67
#2 104. 9.17 0.0877
#3 1004. 9.17 0.00913
Alternatively, use summarise_all or summarise/across; the output is then a single row, which needs reshaping afterwards:
dummy %>%
summarise(across(everything(), list(Mean = mean,
Variance = var, CV = ~ var(.)/mean(.)))) %>%
pivot_longer(everything()) %>%
separate(name, into = c('name', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value)

Error in calculating standard error: (list) object cannot be coerced to type 'double'

I have calculated the mean of the observations for each treatment using
Data %>% group_by(Treatment, Rep) %>% summarise(Mean = mean(Nitrogen, na.rm = TRUE))
I have got a single value for each treatment. Now I want to calculate the standard deviation and standard error of the mean. For which I have used,
Data %>% group_by(Treatment, Rep) %>% summarise_each(funs = mean, sd, se=sd(.)/sqrt(n()), na.rm = TRUE)
But it gives an error. I am not sure what is my mistake. Thank you!
summarise_each is deprecated, and funs() has been replaced by list():
library(dplyr)
Data %>%
group_by(Treatment) %>%
summarise_at(vars(-group_cols()), list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
se=~ sd(., na.rm = TRUE)/sqrt(n())))
If we are not sure about the column types, check
str(Data)
and apply the functions only to the numeric columns. Without changing much in the previous code, replace summarise_at with summarise_if to select the numeric columns:
Data %>%
group_by(Treatment, Rep) %>%
summarise_if(is.numeric, list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
se=~ sd(., na.rm = TRUE)/sqrt(n())))
If some columns have class factor and need to be used for the mean/sd, first convert those columns to numeric with as.numeric(as.character(Data[[yourcolumn]])).
It can be reproduced with iris data
data(iris)
iris %>%
group_by(Species) %>%
summarise_at(vars(-group_cols()), list(mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
se=~ sd(., na.rm = TRUE)/sqrt(n())))
# A tibble: 3 x 13
# Species Sepal.Length_me… Sepal.Width_mean Petal.Length_me… Petal.Width_mean Sepal.Length_sd Sepal.Width_sd Petal.Length_sd
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 setosa 5.01 3.43 1.46 0.246 0.352 0.379 0.174
#2 versic… 5.94 2.77 4.26 1.33 0.516 0.314 0.470
#3 virgin… 6.59 2.97 5.55 2.03 0.636 0.322 0.552
# … with 5 more variables: Petal.Width_sd <dbl>, Sepal.Length_se <dbl>, Sepal.Width_se <dbl>, Petal.Length_se <dbl>,
# Petal.Width_se <dbl>
In the OP's code, only some of the functions are written as anonymous functions, and na.rm = TRUE seems to be intended for mean (this is not clear).
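With dplyr >= 1.0.0, the same summary can likely be written with across(), which supersedes the scoped summarise_at/summarise_if variants. The sketch below is one possible version, not the answerer's code. The helper `se` is hypothetical, and it divides by the number of non-missing values, whereas `n()` in the code above counts missing values as well.

```r
library(dplyr)

# Hypothetical helper: standard error of the mean, ignoring missing values.
se <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

res <- iris %>%
  group_by(Species) %>%
  summarise(
    across(where(is.numeric),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE),
                se   = se)),
    .groups = "drop"
  )
res
```

The output columns are grouped by variable (Sepal.Length_mean, Sepal.Length_sd, Sepal.Length_se, and so on) rather than by statistic.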

How to get a summary table with totals

I wonder if there is a more efficient way to get a summary table including totals.
I made a four step procedure here.
data<-iris %>% group_by(Species) %>%
summarise(
Sepal.Len = paste(format(round(median(Sepal.Length),2),nsmall=2) ),
P.len = paste(format(round(median(Petal.Length),2),nsmall=2) ) ,
counts=n() )
datatotal<-iris %>% group_by(.) %>%
summarize(
Sepal.Len = paste(format(round(median(Sepal.Length),2),nsmall=2) ),
P.len = paste(format(round(median(Petal.Length),2),nsmall=2) ) ,
counts=n() )
datatotal<-cbind(Species="Total",datatotal)
final<-rbind(data,datatotal)
final
# A tibble: 4 × 4
Species Sepal.Len P.len counts
* <fctr> <chr> <chr> <int>
1 setosa 5.00 1.50 50
2 versicolor 5.90 4.35 50
3 virginica 6.50 5.55 50
4 Total 5.80 4.35 150
A further improvement on @Richard's answer where everything is in one chain:
iris %>%
group_by(Species) %>%
summarise(
Sepal.Len = median(Sepal.Length),
P.len = median(Petal.Length) ,
counts = n()
) %>%
bind_rows(., iris %>%
summarize(
Sepal.Len = median(Sepal.Length),
P.len = median(Petal.Length) ,
counts = n()
) %>%
mutate(Species = "Total")
) %>%
mutate_each(funs(format(., nsmall = 2, digits = 2)), 2:3)
the result:
# A tibble: 4 × 4
Species Sepal.Len P.len counts
<chr> <chr> <chr> <int>
1 setosa 5.00 1.50 50
2 versicolor 5.90 4.35 50
3 virginica 6.50 5.55 50
4 Total 5.80 4.35 150
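mutate_each() and funs() are deprecated in current dplyr releases. A sketch of the same chain using across() is shown below, assuming dplyr >= 1.0.0. The helper `summarise_medians` is a hypothetical name introduced only to avoid repeating the summary code.

```r
library(dplyr)

# Hypothetical helper: the summary shared by the per-species and total rows.
summarise_medians <- function(df) {
  summarise(df,
            Sepal.Len = median(Sepal.Length),
            P.len     = median(Petal.Length),
            counts    = n())
}

final <- bind_rows(
  iris %>%
    group_by(Species) %>%
    summarise_medians() %>%
    mutate(Species = as.character(Species)),   # match the type of the "Total" label
  iris %>%
    summarise_medians() %>%
    mutate(Species = "Total")
) %>%
  mutate(across(c(Sepal.Len, P.len), ~ format(.x, nsmall = 2, digits = 2)))
final
```

bind_rows() matches columns by name, so the total row lines up even though its Species column is created last.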
Another alternative is using the margins parameter of dcast from the reshape2 package:
dcast(transform(melt(iris, id.vars = 'Species', measure.vars = c('Sepal.Length','Petal.Length')),
counts = ave(value, variable, Species, FUN = length)),
Species + counts ~ variable,
fun.aggregate = median,
margins = 'Species')
the result (unfortunately not exactly as described):
Species counts Sepal.Length Petal.Length
1 setosa 50 5.0 1.50
2 versicolor 50 5.9 4.35
3 virginica 50 6.5 5.55
4 (all) (all) 5.8 4.35
You can simplify the code by moving the formatting to the final object, but it won't make it much faster:
data <- iris %>% group_by(Species) %>%
summarise(
Sepal.Len = median(Sepal.Length),
P.len = median(Petal.Length) ,
counts = n()
)
datatotal <- iris %>%
summarize(
Sepal.Len = median(Sepal.Length),
P.len = median(Petal.Length) ,
counts = n()
) %>%
mutate(Species = "Total")
final <- rbind(data, datatotal)
format(final, nsmall = 2, digits = 2)
