Summarise with mean, median, range and quantiles in R

I am working with the Palmer penguins data set in R and want a summary that combines mean, median, range and quantiles, grouped by sex.
My current solution keeps the quantiles separate from the other summary statistics. Is there a way to do this in one go? If not, how do I combine the two data sets? group_quant is currently in long format, and I am not sure how to combine them.
group_summary <- penguins %>%
  group_by(sex) %>%
  summarize(mean = mean(bill_length_mm, na.rm = TRUE),
            median = median(bill_length_mm, na.rm = TRUE),
            range = max(bill_length_mm, na.rm = TRUE) - min(bill_length_mm, na.rm = TRUE))
group_quant <- penguins %>%
  group_by(sex) %>%
  summarize(quantile(bill_length_mm, probs = seq(.1, 1, by = .1), na.rm = TRUE),
            .groups = 'drop')
I had the following solution, but it drops the NA values of sex and I am not sure why.
group_summary <- do.call(data.frame, aggregate(bill_length_mm ~ sex, penguins,
  function(x) c(mean = mean(x, na.rm = TRUE),
                median = median(x, na.rm = TRUE),
                range = max(x, na.rm = TRUE) - min(x, na.rm = TRUE),
                quantile(x, probs = seq(.1, 1, by = .1), na.rm = TRUE))))
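On why the aggregate() version loses the NA group: as far as I can tell, the formula method defaults to na.action = na.omit, which removes every row with a missing value in any variable of the formula, including the grouping variable. A minimal base R sketch of one possible workaround, using a made-up data frame (toy, g and x are hypothetical names, not part of the penguins data): turn NA into an explicit factor level with addNA() before aggregating.

```r
# Toy data (hypothetical): two rows have a missing grouping value
toy <- data.frame(g = c("a", "a", "b", NA, NA),
                  x = c(1, 3, 10, 5, 7))

# Default behaviour: rows with g = NA are dropped before aggregating
dropped <- aggregate(x ~ g, data = toy, FUN = mean)

# addNA() makes NA a regular factor level, so it is kept as its own group
toy$g <- addNA(factor(toy$g))
kept <- aggregate(x ~ g, data = toy, FUN = mean)

nrow(dropped)  # groups a and b only
nrow(kept)     # groups a, b and NA
```

The same idea should carry over to the question's call by applying addNA() to penguins$sex first, though I have not run it on that data.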

You may save the quantiles in a list and then use unnest_wider to create new columns from them. To calculate range I used diff(range(...)) instead of max(...) - min(...). Both of them are fine but I included it to show an alternative.
library(palmerpenguins)
library(dplyr)
library(tidyr)
penguins %>%
  group_by(sex) %>%
  summarize(mean = mean(bill_length_mm, na.rm = TRUE),
            median = median(bill_length_mm, na.rm = TRUE),
            range = diff(range(bill_length_mm, na.rm = TRUE)),
            quantile = list(quantile(bill_length_mm, probs = seq(.1, 1, by = .1), na.rm = TRUE))) %>%
  unnest_wider(quantile)
# sex mean median range `10%` `20%` `30%` `40%` `50%` `60%` `70%` `80%` `90%` `100%`
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 female 42.1 42.8 25.9 35.8 36.7 38.2 40 42.8 45.1 45.7 46.5 47.5 58
#2 male 45.9 46.8 25 38.8 40.5 41.3 43.2 46.8 49.0 50.0 50.8 51.9 59.6
#3 NA 41.3 42 13.2 36.8 37.7 37.8 38.6 42 44 44.5 45.2 46.4 47.3
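To see the list-column trick in isolation, here is a small sketch on a made-up data frame (toy, g and x are hypothetical names). Each group's quantiles are stored as one named vector inside a list column, and unnest_wider() then spreads the names into separate columns.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): two groups of three values each
toy <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                  x = c(1, 2, 3, 10, 20, 30))

res <- toy %>%
  group_by(g) %>%
  summarise(range = diff(range(x)),   # same as max(x) - min(x)
            q = list(quantile(x, probs = c(0.25, 0.5, 0.75)))) %>%
  unnest_wider(q)                     # one column per quantile name

res
```

The resulting columns are g, range, `25%`, `50%` and `75%`, so the quantile names produced by quantile() become the column names.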

Related

how to apply a list of functions on a specific column in a data frame in r

I have a data frame like
river    discharge
river1   500
river1   450
river1   200
river1   250
river2   375
river2   235
river2   130
river2   250
I want to apply the following list of functions to the column discharge:
f <- list(
  mean = function(x, ...) mean(x),
  Q50 = function(x, ...) lfquantile(x, exc.freq = 0.5),
  Q95 = function(x, ...) lfquantile(x, exc.freq = 0.95),
  Q90 = function(x, ...) lfquantile(x, exc.freq = 0.9),
  Q70 = function(x, ...) lfquantile(x, exc.freq = 0.7)
)
In the end I am supposed to have a table like this:
river    mean  Q50  Q95  Q90  Q70
river1
river2
rivern
I do not have any idea how to do that :(
If we have all the functions available, then use
library(dplyr)
library(purrr)
imap_dfc(f, ~ df1 %>%
group_by(river) %>%
reframe(!! .y := .x(discharge)))
You could use group_by() and compute each statistic directly inside summarize(), with no need to write wrapper functions:
library(dplyr)
df %>%
group_by(river)%>%
summarize(
mean = mean(discharge),
q50 = quantile(discharge, 0.50),
q95 = quantile(discharge, 0.95),
q90 = quantile(discharge, 0.90),
q70 = quantile(discharge, 0.70)
)
and the output is:
river mean q50 q95 q90 q70
river1 350 350 492 485 455
river2 248 242 356 338 262
A base R approach. Replacing lfquantile with quantile for this example.
func <- list(mean = function (x, ...) mean(x),
Q50 = function (x, ...) quantile(x, probs = 0.5),
Q95 = function (x, ...) quantile(x, probs = 0.95),
Q90 = function (x, ...) quantile(x, probs = 0.9),
Q70 = function (x, ...) quantile(x, probs = 0.7))
setNames(aggregate(discharge ~ river, df, function(x)
setNames(sapply(names(func), function(nm)
func[[nm]](x)), names(func))), c("river", ""))
river mean Q50 Q95 Q90 Q70
1 river1 350.00 350.00 492.50 485.00 455.00
2 river2 247.50 242.50 356.25 337.50 262.50
Data
df <- structure(list(river = c("river1", "river1", "river1", "river1",
"river2", "river2", "river2", "river2"), discharge = c(500L,
450L, 200L, 250L, 375L, 235L, 130L, 250L)), class = "data.frame",
row.names = c(NA, -8L))
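A shorter base R variant of the same idea, as a sketch: loop over the function list with sapply() and let tapply() do the per-river grouping. As in the answer above, quantile() stands in for lfquantile(); the data frame is rebuilt here so the snippet runs on its own.

```r
df <- data.frame(
  river = rep(c("river1", "river2"), each = 4),
  discharge = c(500, 450, 200, 250, 375, 235, 130, 250)
)

# quantile() is used as a stand-in for lfquantile()
f <- list(
  mean = function(x, ...) mean(x),
  Q50  = function(x, ...) quantile(x, probs = 0.5),
  Q95  = function(x, ...) quantile(x, probs = 0.95),
  Q90  = function(x, ...) quantile(x, probs = 0.9),
  Q70  = function(x, ...) quantile(x, probs = 0.7)
)

# One column per function, one row per river
res <- sapply(f, function(fn) tapply(df$discharge, df$river, fn))
res
```

The result is a numeric matrix with the rivers as row names; wrap it in data.frame(river = rownames(res), res) if a data frame is needed.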
library(tidyverse)
func <- list(mean = function (x, ...) mean(x),
Q50 = function (x, ...) quantile(x, probs = 0.5),
Q95 = function (x, ...) quantile(x, probs = 0.95),
Q90 = function (x, ...) quantile(x, probs = 0.9),
Q70 = function (x, ...) quantile(x, probs = 0.7))
df %>%
group_by(river) %>%
summarise_at(vars(discharge), func)
#> # A tibble: 2 × 6
#> river mean Q50 Q95 Q90 Q70
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 river1 350 350 492. 485 455
#> 2 river2 248. 242. 356. 338. 262.
df %>%
group_by(river) %>%
summarise(across(discharge, func))
#> # A tibble: 2 × 6
#> river discharge_mean discharge_Q50 discharge_Q95 discharge_Q90 discharge_Q70
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 river1 350 350 492. 485 455
#> 2 river2 248. 242. 356. 338. 262.
Created on 2023-02-08 with reprex v2.0.2
EDIT
since the quantile function is vectorized, you could do:
library(tidyverse)
func1 <- function(x){
qnts <- quantile(x, probs = c(0.5,0.95,0.9,0.7))
qnts <- setNames(qnts, paste0('Q', c(50,95,90,70)))
data.frame(mean = mean(x), as.list(qnts))
}
df %>%
summarise(across(discharge, func1, .unpack = TRUE), .by = river)
#> river discharge_mean discharge_Q50 discharge_Q95 discharge_Q90 discharge_Q70
#> 1 river1 350.0 350.0 492.50 485.0 455.0
#> 2 river2 247.5 242.5 356.25 337.5 262.5
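What "vectorized" means here can be checked on a single river with plain base R: one quantile() call with several probabilities returns all quantiles at once, so only the names need adjusting. A small sketch using the river1 discharge values:

```r
x <- c(500, 450, 200, 250)   # discharge values of river1

# One call, four quantiles
q <- quantile(x, probs = c(0.5, 0.95, 0.9, 0.7))
names(q) <- paste0("Q", c(50, 95, 90, 70))

stats <- c(mean = mean(x), q)
stats
```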

How to cut a dataframe within a list after a certain marker in R?

I would like to cut my data frames after a certain marker. The marker is the first run of three or more consecutive TRUE values for V1 (see the V1_logical column created below). For each data frame in the list, I want to drop everything up to and including the marker and keep the next 4 rows as the new data frame.
library(dplyr)
library(tibble)      # as_tibble(), add_column()
library(purrr)       # map()
library(data.table)  # rbindlist(), rleid()
set.seed(94756)
mat1 <- matrix(sample(seq(-1,100, 0.11),70, replace = TRUE),ncol = 5)
mat1 <- as_tibble(mat1)
mat2 <- matrix(sample(seq(-1,100, 0.11),70, replace = TRUE),ncol = 5)
mat2 <- as_tibble(mat2)
mat2[3,1] <- NA
mat2[6,1] <- NA
mat3 <- matrix(sample(seq(-1,100, 0.11), 70,replace = TRUE),ncol = 5)
mat3 <- as_tibble(mat3)
mat3[4,1] <- NA
data <- list(mat1, mat2, mat3)
data1 <- map(data, ~add_column(., V1_logical = between(.$V1, 20, 80), .after = 'V1'))
r_pre <- lapply(data1, "[", 2)
Maybe it is helpful to add an ID column for each dataframe within the list
r_pre1 <- rbindlist(r_pre, idcol = "ID")
r_pre1 <- split(r_pre1, r_pre1$ID)
So the result should be like:
mat1re <- data.frame(V1 = c(93.16, 47.18, 12.86, 38.71),
V2 = c(56.75, 57.85, 18.69, 3.18),
V3 = c(-0.01, 14.95, 46.08, 96.46),
V4 = c(20.89, 32.55, 91.73, 58.73),
V5 = c(66.54, 56.75, 92.94, 77.54))
mat2re <- data.frame(V1 = c(87.99, 53.23, 40.36, 0.65),
V2 = c(89.42, 81.28, 36.84, 73.58),
V3 = c(89.86, 78.75, 76.77, 61.81),
V4 = c(47.18, 22.98, 34.64, 25.18),
V5 = c(18.69, 77.21, 58.29, 94.04))
mat3re <- data.frame(V1 = c(81.50, 43.55, 54.55, 9.45),
V2 = c(33.21, 70.83, 21.66, 88.10),
V3 = c(72.15, -0.45, 11.65, 15.06),
V4 = c(47.07, 47.95, 88.10, 81.50),
V5 = c(80.07, 67.75, 14.84, 10.33))
result <- list(mat1re, mat2re, mat3re)
What I've tried already:
data2 <- lapply(data1, function(x) {x$V1_logical[x$V1_logical== TRUE] <- 1; x})
data3 <- lapply(data2, function(x) {x$V1_logical[x$V1_logical== FALSE] <- 0; x})
data4 <- map(data3, ~add_column(., ind = rleid(.$V1_logical), .after = "V1_logical"))
So in data4 the task is to find the marker, i.e. the first place where V1_logical is 1 and the same ind value appears at least 3 times consecutively (e.g. 5, 5, 5), and to cut away everything before and including the marker. In other words, each new data frame should start right after the marker.
The following code is close, but it does not cut out the beginning and the marker when the data contain NAs. Have a look at the second list element: the beginning and the marker are still there.
matrix_final <- map(data, ~ .x %>%
mutate(V1_logical = between(V1, 20, 80), ind = rleid(V1_logical), .after = "V1") %>%
group_by(ind) %>%
mutate(rn = if(n() >=3 && first(V1_logical)) row_number() else NA_integer_) %>%
ungroup %>%
slice(seq(max(which.max(rn) + 1, 1, replace_na = TRUE), length.out = 4)) %>%
select(-ind, -rn) %>%
mutate(across(everything(), round, digits = 2)))
print(matrix_final[[2]])
Thanks in advance!
We may loop over the list with map, create the logical column on 'V1' with between, create a grouping column with rleid (returns a sequence column that increments when there is a change in value in adjacent elements) and slice the rows based on the condition
library(dplyr)
library(purrr)
library(data.table)
library(tidyr)
map(data, ~ .x %>%
      mutate(V1_logical = replace_na(between(V1, 20, 80), FALSE),
             ind = rleid(V1_logical), .after = "V1") %>%
      group_by(ind) %>%
      mutate(rn = if(n() >= 3 && first(V1_logical)) row_number() else
             NA_integer_) %>%
      ungroup %>%
      slice(seq(max(which.max(rn) + 1, 1, na.rm = TRUE), length.out = 4)) %>%
      select(-ind, -rn, -V1_logical) %>%
      mutate(across(everything(), round, digits = 2)))
-output
[[1]]
# A tibble: 4 × 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 93.2 56.8 -0.0100 20.9 66.5
2 47.2 57.8 15.0 32.6 56.8
3 12.9 18.7 46.1 91.7 92.9
4 38.7 3.18 96.5 58.7 77.5
[[2]]
# A tibble: 4 × 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 88.0 89.4 89.9 47.2 18.7
2 53.2 81.3 78.8 23.0 77.2
3 40.4 36.8 76.8 34.6 58.3
4 0.65 73.6 61.8 25.2 94.0
[[3]]
# A tibble: 4 × 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 81.5 33.2 72.2 47.1 80.1
2 43.6 70.8 -0.45 48.0 67.8
3 54.6 21.7 11.6 88.1 14.8
4 9.45 88.1 15.1 81.5 10.3
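The marker logic can also be sketched in base R with rle(), which may make the steps easier to follow. This is an alternative to the rleid() approach above, not the same code, and rows_after_first_run is a hypothetical helper name. It treats NA as FALSE, finds the first run of at least three TRUEs, and returns the indices of the rows that follow it.

```r
# Hypothetical helper: indices of the rows following the first run
# of at least `min_run` consecutive TRUE values in `flag`
rows_after_first_run <- function(flag, min_run = 3, n_after = 4) {
  flag[is.na(flag)] <- FALSE            # NA counts as "not in range"
  runs <- rle(flag)
  ends <- cumsum(runs$lengths)          # last index of each run
  hit  <- which(runs$values & runs$lengths >= min_run)
  if (length(hit) == 0) return(integer(0))
  idx <- seq(ends[hit[1]] + 1, length.out = n_after)
  idx[idx <= length(flag)]              # do not run past the end
}

flag <- c(FALSE, TRUE, TRUE, NA, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
res <- rows_after_first_run(flag)
res
```

In this example the NA at position 4 breaks the first TRUE run, so the marker is rows 5 to 7 and the helper returns rows 8 to 11. For a data frame x, something like x[rows_after_first_run(x$V1_logical), ] would then pick the rows.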

Multidimensional Crosstable (median values)

I have the following sample data:
samplesize=100
df <- data.frame(sex = sample(c("M", "F"), size = samplesize, replace = TRUE),
agegrp = sample(c("old", "middle", "young"), size = samplesize, replace = TRUE),
duration1 = runif(samplesize, min = 1, max = 100),
duration2 = runif(samplesize, min = 1, max = 100),
country = sample(c("USA", "CAN"), size = samplesize, replace = TRUE))
df
My goal is to plot a table like this that displays the median values [median(na.rm = TRUE) as there might be missing values]
USA CAN
total old middle young M F total old middle young M F
duration1 10.2 12.2 13.1 10.2 13.0 13.9 ... ... ... ... ... ...
duration2 10.4 13.2 13.2 10.0 13.1 14.0 ... ... ... ... ... ...
The way I would usually calculate such a table is to calculate the median values columnwise:
df %>%
  group_by(country, agegrp) %>%
  summarise(dur1 = median(duration1, na.rm = TRUE),
            dur2 = median(duration2, na.rm = TRUE))
And finally I put all the columns together. Unfortunately, as the number of combinations gets bigger, this method becomes very cumbersome. So my question is:
Is there any function like table() that lets me calculate means or medians (instead of frequencies) using specific combinations of variables?
It would also be fine if it was just a two-dimensional table with multi-dimensional variable names like:
USA_total USA_old USA_middle USA_young USA_m USA_f CAN_total ...
duration1
duration2
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(sex, agegrp, country), names_to = "parameters") %>%
group_by(agegrp, country, parameters) %>%
summarise(mean = mean(value, na.rm=TRUE)) %>%
pivot_wider(names_from = c(country, agegrp), values_from = mean)
Returns:
parameters CAN_middle USA_middle CAN_old USA_old CAN_young USA_young
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 duration1 48.6 62.6 31.5 40.0 43.0 50.5
2 duration2 60.9 54.0 53.1 58.9 45.1 55.6
Edit
Including M and F:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = c(sex, agegrp), names_to = "groupings_names", values_to="groupings") %>%
select(-groupings_names) %>%
pivot_longer(cols = -c(groupings, country), names_to = "parameters") %>%
group_by(groupings, country, parameters) %>%
summarise(mean = mean(value, na.rm=TRUE)) %>%
pivot_wider(names_from = c(country, groupings), values_from = mean)
parameters CAN_F USA_F CAN_M USA_M CAN_middle USA_middle CAN_old USA_old CAN_young USA_young
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 duration1 63.3 59.0 50.9 47.7 57.9 46.1 56.8 60.6 59.5 49.1
2 duration2 60.6 59.0 54.9 48.3 65.0 45.6 48.5 49.5 55.8 62.4
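Since the question asks for medians and a "total" column, here is a sketch of how the same pivoting idea could cover both. It assumes a constant helper column named total (a hypothetical name) and uses a deterministic toy data frame instead of the random sample so that the result is reproducible.

```r
library(dplyr)
library(tidyr)

# Deterministic toy data (hypothetical) in which every combination occurs
n <- 100
df <- data.frame(
  sex       = rep(c("M", "F"), length.out = n),
  agegrp    = rep(c("old", "middle", "young"), length.out = n),
  country   = rep(c("USA", "CAN"), each = n / 2),
  duration1 = as.numeric(1:n),
  duration2 = as.numeric(n:1),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(total = "total") %>%   # constant column gives the overall median
  pivot_longer(c(total, agegrp, sex),
               names_to = "grouping", values_to = "level") %>%
  pivot_longer(c(duration1, duration2), names_to = "parameter") %>%
  group_by(country, level, parameter) %>%
  summarise(median = median(value, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = c(country, level), values_from = median)

res
```

This gives one row per duration variable and one column per country/level pair, such as USA_total, USA_old or CAN_F.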

calculate summary for more than one column [duplicate]

I am trying to calculate the quartiles, median and mean for a group of columns, but it only calculates them for one column. What am I doing wrong here?
df <- data.frame(Name = c("ABC", "DCA", "GOL",NA, "MNA",NA, "VAN"),
Goal =c("published", "pending", "not designed",NA, "pending", "pending", "not designed"),
Target_1 = c(3734, 2639, 2604, NA, 2793, 2688, 2403),
Target_2 = c(3322, 2016, 2310, NA, 3236, 3898, 2309),
Target_3 = c(3785, 2585, 3750, NA, 2781, 3589, 2830))
df_summary <- df %>% select(contains("Target")) %>% summarise(
q25 = round(quantile(., type=6, probs = seq(0, 1, 0.25), na.rm=TRUE)[2],digits = 0),
Median = round(quantile(., type=6, probs = seq(0, 1, 0.25), na.rm=TRUE)[3],digits = 0),
Mean = round( mean(., na.rm=TRUE),digits = 0),
q75 = round(quantile(., type=6, probs = seq(0, 1, 0.25), na.rm=TRUE)[4],digits = 0),
N = sum(!is.na(.)))
Use across to apply a function to multiple columns.
library(dplyr)
library(tidyr)
df %>%
summarise(across(contains("Target"), list(
q25 = ~round(quantile(., type=6, probs = 0.25, na.rm=TRUE),digits = 0),
Median = ~round(quantile(., type=6, probs = 0.5, na.rm=TRUE),digits = 0),
Mean = ~round( mean(., na.rm=TRUE),digits = 0),
q75 = ~round(quantile(., type=6, probs = 0.75, na.rm=TRUE),digits = 0),
N = ~sum(!is.na(.)))))
# Target_1_q25 Target_1_Median Target_1_Mean Target_1_q75 Target_1_N Target_2_q25
#1 2554 2664 2810 3028 6 2236
# Target_2_Median Target_2_Mean Target_2_q75 Target_2_N Target_3_q25 Target_3_Median
#1 2773 2848 3466 6 2732 3210
# Target_3_Mean Target_3_q75 Target_3_N
#1 3220 3759 6
Or maybe long format is a better way to display the values.
df %>%
pivot_longer(cols = contains("Target")) %>%
group_by(name) %>%
summarise( q25 = round(quantile(value, type=6, probs = 0.25, na.rm=TRUE),digits = 0),
Median = round(quantile(value, type=6, probs = 0.5, na.rm=TRUE),digits = 0),
Mean = round( mean(value, na.rm=TRUE),digits = 0),
q75 = round(quantile(value, type=6, probs = 0.75, na.rm=TRUE),digits = 0),
N = sum(!is.na(value)))
# name q25 Median Mean q75 N
# <chr> <dbl> <dbl> <dbl> <dbl> <int>
#1 Target_1 2554 2664 2810 3028 6
#2 Target_2 2236 2773 2848 3466 6
#3 Target_3 2732 3210 3220 3759 6
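For comparison, the same long-format table can be sketched in base R by applying one helper function to every Target column with sapply(). The helper name target_stats is hypothetical, and the data frame is cut down to the three Target columns so the snippet runs on its own.

```r
df <- data.frame(
  Target_1 = c(3734, 2639, 2604, NA, 2793, 2688, 2403),
  Target_2 = c(3322, 2016, 2310, NA, 3236, 3898, 2309),
  Target_3 = c(3785, 2585, 3750, NA, 2781, 3589, 2830)
)

# Hypothetical helper: all statistics for one numeric column
target_stats <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 6,
                na.rm = TRUE, names = FALSE)
  c(q25 = q[1], Median = q[2], Mean = mean(x, na.rm = TRUE),
    q75 = q[3], N = sum(!is.na(x)))
}

targets <- df[grep("Target", names(df))]
# sapply() returns statistics in rows, so transpose to one row per column
res <- round(t(sapply(targets, target_stats)))
res
```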
Using map:
df %>%
select(contains('Target'))%>%
map_dfr(~c(quantile(.x, type=6, probs = c(.25, .5,.75), na.rm = TRUE),
mean = mean(.x, na.rm = TRUE),
N = length(na.omit(.x))), .id = 'grp')
grp `25%` `50%` `75%` mean N
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Target_1 2554. 2664. 3028. 2810. 6
2 Target_2 2236. 2773 3466 2848. 6
3 Target_3 2732 3210. 3759. 3220 6
Whatever you are doing seems like a summary:
df %>%
select(contains('Target'))%>%
summary()
Another way could be:
df %>%
summarise(across(contains('Target'),
~list(quantile(.x, type=6, probs = c(.25, .5,.75), na.rm = TRUE),
mean(.x, na.rm = TRUE),
length(na.omit(.x))))
)%>%
unnest(everything())
A tibble: 5 x 3
Target_1 Target_2 Target_3
<dbl> <dbl> <dbl>
1 2554. 2236. 2732
2 2664. 2773 3210.
3 3028. 3466 3759.
4 2810. 2848. 3220
5 6 6 6
If you were to include pivoting:
df %>%
pivot_longer(contains('Target')) %>%
group_by(name) %>%
summarise(a = list(quantile(value, type=6, probs = c(.25, .5,.75), na.rm = TRUE)),
mean = mean(value, na.rm = TRUE), N = length(na.omit(value)))%>%
unnest_wider(a)
# A tibble: 3 x 6
name `25%` `50%` `75%` mean N
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 Target_1 2554. 2664. 3028. 2810. 6
2 Target_2 2236. 2773 3466 2848. 6
3 Target_3 2732 3210. 3759. 3220 6

Using dplyr summarise() for specific columns within purrr map() with grouped data

I have a problem I'm trying to solve, and I can't seem to find a succinct solution. There are a few similar questions on SO, but nothing that quite fits.
Take some sample data:
library(dplyr)
dat <- tibble(
group1 = factor(sample(c("one", "two"), 10, replace = T)),
group2 = factor(sample(c("alpha", "beta"), 10, replace = T)),
var1 = rnorm(10, 20, 2),
var2 = rnorm(10, 20, 2),
var3 = rnorm(10, 20, 2),
other1 = sample(c("a", "b", "c"), 10, replace = T),
other2 = sample(c("a", "b", "c"), 10, replace = T),
)
I would like to summarise just the numeric variables (i.e. ignoring other1 and other2), but have the output grouped by group1 and group2.
I have tried something like this, but it returns an error as it attempts to apply my summarise() functions to the grouping variables too.
dat %>%
group_by(group1, group2) %>%
select(where(is.numeric)) %>%
map(~ .x %>%
filter(!is.na(.x)) %>%
summarise(mean = mean(.x),
sd = sd(.x),
median = median(.x),
q1 = quantile(.x, p = .25),
q3 = quantile(.x, p = .75))
)
My expected output would be something like
group1 group2 mean sd median q1 q3
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 one alpha ? ? ? ? ?
2 one beta ? ? ? ? ?
3 two alpha ? ? ? ? ?
4 two beta ? ? ? ? ?
Any solutions would be greatly appreciated.
Thanks,
Sam
Try:
dat %>%
  group_by(group1, group2) %>%
  summarize(across(where(is.numeric),
                   list(sd = sd,
                        mean = mean,
                        median = median,
                        q1 = function(x) quantile(x, .25),
                        q3 = function(x) quantile(x, .75))))
group1 group2 var1_sd var1_mean var1_median var1_q1 var1_q3 var2_sd var2_mean var2_median var2_q1 var2_q3 var3_sd
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 one alpha 4.06 20.6 19.3 18.3 22.2 1.12 17.9 17.3 17.2 18.2 1.09
2 one beta 0.726 18.7 18.7 18.4 18.9 0.348 18.8 18.8 18.7 18.9 0.604
3 two alpha 1.31 19.9 20.0 19.3 20.6 1.10 17.8 18.3 17.4 18.5 0.624
4 two beta 0.777 21.2 21.2 21.0 21.5 1.13 19.6 19.6 19.2 20.0 0.0161
You can also pool several columns inside summarise. Note that var1:var3 is not a column selection there; it is evaluated as a numeric sequence running from the first value of var1 to the first value of var3. To summarise the three variables together, combine them with c():
dat %>%
  group_by(group1, group2) %>%
  summarise(mean = mean(c(var1, var2, var3)),
            sd = sd(c(var1, var2, var3)),
            median = median(c(var1, var2, var3)),
            q1 = quantile(c(var1, var2, var3), probs = .25),
            q3 = quantile(c(var1, var2, var3), probs = .75))
# Returns one row per group1/group2 combination with the columns
# group1, group2, mean, sd, median, q1, q3
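If one row per group and variable is preferred, with the statistics as plain columns as in the expected output, a long-format sketch could look like this. The toy tibble and its values are made up so the result is reproducible; only the numeric columns are pivoted, and the grouping columns stay untouched.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical), shaped like dat but deterministic
toy <- tibble(
  group1 = c("one", "one", "two", "two"),
  group2 = c("alpha", "alpha", "beta", "beta"),
  var1   = c(1, 3, 5, 9),
  var2   = c(2, 4, 6, 8),
  other1 = c("a", "b", "c", "a")
)

res <- toy %>%
  select(group1, group2, where(is.numeric)) %>%   # drop non-numeric extras
  pivot_longer(where(is.numeric), names_to = "variable") %>%
  group_by(group1, group2, variable) %>%
  summarise(mean   = mean(value),
            sd     = sd(value),
            median = median(value),
            q1     = quantile(value, 0.25, names = FALSE),
            q3     = quantile(value, 0.75, names = FALSE),
            .groups = "drop")

res
```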
