I have the following dataset:
df <- structure(list(var = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"),
beta_2 = c(-0.0441739987111475, -0.237256549142376, -0.167105040977351,
-0.140660549127359, -0.0623609020878716, -0.279740636040755,
-0.0211523654970921, 0.135368375550385, -0.0612770247281429,
-0.13183964102725, 0.363736380163624, -0.0134490092107583,
-0.0179957210095045, -0.00897746346470879, -0.0588242539401108,
-0.0571976057977875, -0.0290052449275881, 0.263181562031473,
0.00398338217426211, 0.0945495450635497), beta_3 = c(8.54560737016843e-05,
-0.0375859675101865, -0.0334219898732454, 0.0332275634691021,
6.41499442849741e-05, -0.0200724300602369, 8.046644459034e-05,
0.0626880671346749, 0.066218613897726, 0.0101268565262127,
0.44671567722757, 0.180543425234781, 0.526177616390516, 0.281245231195401,
-0.0362628519010746, 0.0609803646123324, 0.104137160504616,
0.804375133555955, 0.211218123083386, 0.824756942938928),
beta_4 = c(-8.50289708803184e-06, 0.0376601781861706, 0.104418586040791,
-0.0949557776511923, 2.11896613386966e-05, 0.0969765824620132,
4.95280289930771e-06, -0.0967836292162074, -0.132623370126544,
0.0579395551175153, -0.140392004360494, 0.00950912868877355,
-0.388317615535003, -0.0282634228070272, 0.0547116932731301,
0.0119441792873249, -0.0413015877795695, -0.720387490330028,
-0.0321860166581817, -0.627489324697221)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L))
df
# # A tibble: 20 × 4
# var beta_2 beta_3 beta_4
# <chr> <dbl> <dbl> <dbl>
# 1 a -0.0442 0.0000855 -0.00000850
# 2 a -0.237 -0.0376 0.0377
# 3 a -0.167 -0.0334 0.104
# 4 a -0.141 0.0332 -0.0950
# 5 a -0.0624 0.0000641 0.0000212
# ...
I would like to summarise each beta_ column, grouped by var, so that the means of beta_2, beta_3 and beta_4 end up together in a single cell per group.
I can do it with the following code:
df %>%
pivot_longer(!var) %>%
group_by(var, name) %>%
summarise(mean_beta = mean(value) %>% round(2), .groups = "drop") %>%
aggregate(mean_beta ~ var, ., function(x) paste0(x, collapse = ", ")) %>%
as_tibble()
# # A tibble: 2 × 2
# var mean_beta
# <chr> <chr>
# 1 a -0.1, 0.01, 0
# 2 b 0.05, 0.34, -0.19
I'm looking for a more straightforward, tidyverse-only solution. I have tried using map inside summarise but couldn't get what I wanted. Any idea?
You may do the following:
library(dplyr)
df %>%
group_by(var) %>%
summarise(mean_beta = cur_data() %>%
summarise(across(.fns =
~.x %>% mean(na.rm = TRUE) %>% round(2))) %>%
unlist() %>% toString())
# var mean_beta
# <chr> <chr>
#1 a -0.1, 0.01, 0
#2 b 0.05, 0.34, -0.19
cur_data() provides the sub-data within each group as a data frame that can be summarised column by column and concatenated together.
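If you are on dplyr >= 1.1.0, where pick() supersedes cur_data(), a minimal sketch of the same idea would be:
df %>%
  group_by(var) %>%
  summarise(mean_beta = pick(everything()) %>%   # assumes dplyr >= 1.1.0; pick() replaces cur_data()
              summarise(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2))) %>%
              unlist() %>%
              toString())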
Another possible solution:
library(tidyverse)
df %>%
group_by(var) %>%
summarise(across(everything(), mean)) %>%
{bind_cols(var=.$var, mean_betas=apply(., 1, \(x) str_c(x[-1], collapse = ", ")))}
#> # A tibble: 2 × 2
#> var mean_betas
#> <chr> <chr>
#> 1 a "-0.10101983, 0.008141079, -0.002735024"
#> 2 b " 0.05400016, 0.340388682, -0.190217246"
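Note that the means are unrounded here, and the matrix conversion inside apply() pads the values with spaces; wrapping mean in round() inside across() would match the first answer's output.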
I have the following data frame:
df <- data.frame(
timestamp = c(1675930826.3839524, 1675930826.3839593, 1675930826.3839765, 1675930826.388385, 1675930826.3884094, 1675930826.3884153),
label = c("A", "B", "C", "A", "B", "C"),
value = c(1.996, 0.404, 4.941, 1.996, 0.404, 4.941)
)
Basically, the data are in cycles: first A, then B and finally C. So instead of having them in three separate rows, I want to produce this output:
timestamp A B C
1675930826.3839524 1.996 0.404 4.941
1675930826.388385 1.996 0.404 4.941
I would like to have the timestamp of A and then add the A, B, and C values. I tried this to solve my problem:
df %>%
pivot_wider(names_from = label, values_from = value) %>%
pivot_longer(cols = c("A", "B", "C"), names_to = "label", values_to = "value") %>%
arrange(timestamp) %>%
select(timestamp, A, B, C)
library(tidyverse)
df %>%
group_by(grp = cumsum(label == 'A')) %>%
mutate(timestamp = timestamp[label == 'A']) %>%
ungroup() %>%
pivot_wider(id_cols = timestamp, names_from = label, values_from = value)
# # A tibble: 2 × 4
# timestamp A B C
# <dbl> <dbl> <dbl> <dbl>
# 1 1675930826. 2.00 0.404 4.94
# 2 1675930826. 2.00 0.404 4.94
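An alternative sketch, assuming the rows always arrive in complete A, B, C cycles of three, is to index each cycle directly rather than matching label == 'A':
df %>%
  mutate(cycle = (row_number() - 1) %/% 3) %>%   # assumes complete cycles: 0,0,0,1,1,1,...
  group_by(cycle) %>%
  mutate(timestamp = first(timestamp)) %>%       # keep the timestamp of the A row
  ungroup() %>%
  pivot_wider(id_cols = timestamp, names_from = label, values_from = value)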
I am trying to count the number of species per region which have missing data (NA) for a selection of variables.
Here is an example of my dataframe:
library(tidyverse)
df <- structure(
list(
ID = c("AL01", "AL01", "AL02", "AL02", "AL03", "AL03"),
Species = c("Sp1",
"Sp2",
"Sp3",
"Sp4",
"Sp5",
"Sp6"),
Var1 = c("A", NA, NA, NA, "B", "B"),
Var2 = c(NA,
"A",
"B",
"C",
"B",
"C"),
Var3 = c(NA,
2.71, 2.86, 3.21, 2.87, 3.05),
Var4 = c("S", NA,
"C", NA, "S",
"C")
),
class = "data.frame",
row.names = c(NA,
6L)
)
I can get the count of species with NA for any of Var2, Var3 or Var4 by running:
df %>%
filter_at(
vars(
Var2,
Var3,
Var4
),
any_vars(is.na(.))
) %>%
group_by(ID) %>%
count()
# A tibble: 2 × 2
# Groups: ID [2]
ID n
<chr> <int>
1 AL01 2
2 AL02 1
However, this only shows me AL01 and AL02, and I would also like to include AL03, for which the count is 0. I have tried this code, which I thought should work:
df %>%
group_by(ID) %>%
summarise_at(vars(
Var2,
Var3,
Var4
), ~ sum(any_vars(is.na(.))))
But I get this error:
Error in `summarise()`:
! Problem while computing `Var2 = (structure(function (..., .x = ..1, .y = ..2, . =
..1) ...`.
ℹ The error occurred in group 1: ID = "AL01".
Caused by error in `abort_quosure_op()`:
! Summary operations are not defined for quosures. Do you need to unquote the
quosure?
# Bad: sum(myquosure)
# Good: sum(!!myquosure)
Run `rlang::last_error()` to see where the error occurred.
I realise I am not sure exactly how any_vars() works and am unclear on how to continue. The output I would like is:
# A tibble: 3 × 2
# Groups: ID [3]
ID n
<chr> <int>
1 AL01 2
2 AL02 1
3 AL03 0
You can do:
library(tidyverse)
df %>%
mutate(missing = apply(across(num_range('Var', 2:4)), 1, function(x) any(is.na(x)))) %>%
group_by(ID) %>%
summarize(n = sum(missing))
# A tibble: 3 x 2
ID n
<chr> <int>
1 AL01 2
2 AL02 1
3 AL03 0
An alternative with rowwise():
df %>%
rowwise() %>%
mutate(across(num_range('Var', 2:4), is.na),
x = any(c_across(num_range('Var', 2:4)))) %>%
group_by(ID) %>%
summarise(n = sum(x))
# A tibble: 3 × 2
ID n
<chr> <int>
1 AL01 2
2 AL02 1
3 AL03 0
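For completeness, a minimal sketch with if_any() (assuming dplyr >= 1.0.4), which supersedes the filter_at()/any_vars() pattern and keeps the zero-count group:
df %>%
  group_by(ID) %>%
  # if_any() is TRUE for a row when any of Var2:Var4 is NA; summing counts those rows per ID
  summarise(n = sum(if_any(Var2:Var4, is.na)))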
I would like to reassign given records to a single group if the records are duplicated. In the dataset below I would like to have 12-4 assigned entirely to group A or to group B, but not both. Is there a way to go about it?
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c("12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8")
)
# Attempts to tease out records for each group
dat %>% pivot_wider(names_from = group, values_from = assigned)
You can group by record and reassign all to the same group, chosen at random from the available groups:
dat %>%
group_by(assigned) %>%
mutate(group = nth(group, sample(n())[1])) %>%
ungroup()
#> # A tibble: 9 x 2
#> group assigned
#> <chr> <chr>
#> 1 A 12-1
#> 2 A 12-2
#> 3 A 12-3
#> 4 A 12-4
#> 5 A 12-4
#> 6 B 12-5
#> 7 B 12-6
#> 8 B 12-7
#> 9 B 12-8
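If the assignment does not need to be random, a simpler deterministic sketch keeps each record's first-seen group:
dat %>%
  group_by(assigned) %>%
  mutate(group = first(group)) %>%   # every "12-4" row gets group "A"
  ungroup()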
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c(
"12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8"
)
)
dat %>%
select(-group) %>%
left_join(
dat %>%
left_join(dat %>% count(group)) %>%
# reassign to the smallest group
arrange(n) %>%
select(-n) %>%
distinct(assigned, .keep_all = TRUE)
)
#> Joining, by = "group"
#> Joining, by = "assigned"
#> # A tibble: 9 × 2
#> assigned group
#> <chr> <chr>
#> 1 12-1 A
#> 2 12-2 A
#> 3 12-3 A
#> 4 12-4 A
#> 5 12-4 A
#> 6 12-5 B
#> 7 12-6 B
#> 8 12-7 B
#> 9 12-8 B
Created on 2022-04-04 by the reprex package (v2.0.0)
I'm taking the mean of every three rows within each group. For that, I'm using the summarise function. In this context I would like to select the last date of the three rows that make up each average.
I tried to select the maximum, but this way I'm just selecting the highest date for the whole group.
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A"),
measure = c(10, 20, 5, 2, 62 ,2, 5, 4, 6, 7, 25),
time= c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021",
"11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"))
# > test
# my_groups measure time
# 1 A 10 20-09-2020
# 2 A 20 25-09-2020
# 3 A 5 19-09-2020
# 4 B 2 20-05-2020
# 5 B 62 20-06-2021
# 6 C 2 11-01-2021
# 7 C 5 13-01-2021
# 8 C 4 13-01-2021
# 9 A 6 15-01-2021
# 10 A 7 15-01-2021
# 11 A 25 19-01-2021
library(dplyr)
library(zoo) # for rollapply()
test %>%
arrange(time) %>%
group_by(my_groups) %>%
summarise(mean_3 = rollapply(measure, 3, mean, by = 3, align = "left", partial = F),
final_data = max(time))
# my_groups mean_3 final_data
# <chr> <dbl> <chr>
# 1 A 12.7 25-09-2020
# 2 A 11.7 25-09-2020
# 3 C 3.67 13-01-2021
In the second line I wish the date was 19-01-2021, and not the global maximum of group A, (25-09-2020).
Any hint on how I could do that?
I have two dplyr ways for you. I'm not happy with them because, when rollapply() with max finds no complete window for the dates in group B, the fill value defaults to a double, which doesn't match the character dates from groups A and C.
Mutate:
test %>%
arrange(time) %>%
group_by(my_groups) %>%
mutate(final = rollapply(time, 3, max, by = 3, fill = NA, align = "left", partial = F),
mean_3 = rollapply(measure, 3, mean, by = 3, fill = NA, align = "left", partial = F)) %>%
filter(!is.na(final)) %>%
select(my_groups, final, mean_3) %>%
arrange(my_groups)
# A tibble: 3 x 3
# Groups: my_groups [2]
my_groups final mean_3
<chr> <chr> <dbl>
1 A 19-01-2021 12.7
2 A 25-09-2020 11.7
3 C 13-01-2021 3.67
A summarise() that doesn't really summarise, but the code is a bit cleaner:
test %>%
arrange(time) %>%
group_by(my_groups) %>%
summarise(final = rollapply(time, 3, max, by = 3, fill = NA, align = "left", partial = F),
mean_3 = rollapply(measure, 3, mean, by = 3, fill = NA, align = "left", partial = F)) %>%
filter(!is.na(final))
`summarise()` has grouped output by 'my_groups'. You can override using the `.groups` argument.
# A tibble: 3 x 3
# Groups: my_groups [2]
my_groups final mean_3
<chr> <chr> <dbl>
1 A 19-01-2021 12.7
2 A 25-09-2020 11.7
3 C 13-01-2021 3.67
Edit:
Added isa's solution from the comments. partial = TRUE does the trick:
test %>%
arrange(time) %>%
group_by(my_groups) %>%
summarise(mean_3 = rollapply(measure, 3, mean, by = 3, align = "left", partial = F),
final_data = rollapply(time, 3, max, by = 3, align = "left", partial = T))
`summarise()` has grouped output by 'my_groups'. You can override using the `.groups` argument.
# A tibble: 3 x 3
# Groups: my_groups [2]
my_groups mean_3 final_data
<chr> <dbl> <chr>
1 A 12.7 19-01-2021
2 A 11.7 25-09-2020
3 C 3.67 13-01-2021
Another possible solution:
library(tidyverse)
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A"),
measure = c(10, 20, 5, 2, 62 ,2, 5, 4, 6, 7, 25),
time= c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021",
"11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"))
test %>%
group_by(data.table::rleid(my_groups)) %>%
filter(n() == 3) %>%
summarise(
groups = unique(my_groups),
mean_3 = mean(measure), final_data = max(time), .groups = "drop") %>%
select(-1)
#> # A tibble: 3 × 3
#> groups mean_3 final_data
#> <chr> <dbl> <chr>
#> 1 A 11.7 25-09-2020
#> 2 C 3.67 13-01-2021
#> 3 A 12.7 19-01-2021
EDIT
To allow for calculation of the mean of 2 values, as asked for by the OP in a comment below, I revised my code using data.table::frollmean and data.table::frollapply:
library(tidyverse)
library(lubridate)
library(data.table)
n <- 2 # choose the number with which to calculate the mean
test %>%
group_by(rleid(my_groups)) %>%
summarise(
groups = unique(my_groups),
mean_n = frollmean(measure, n), final_data = frollapply(dmy(time), n, max) %>%
as_date(origin = lubridate::origin), .groups = "drop") %>%
drop_na(mean_n) %>% select(-1)
#> # A tibble: 7 × 3
#> groups mean_n final_data
#> <chr> <dbl> <date>
#> 1 A 15 2020-09-25
#> 2 A 12.5 2020-09-25
#> 3 B 32 2021-06-20
#> 4 C 3.5 2021-01-13
#> 5 C 4.5 2021-01-13
#> 6 A 6.5 2021-01-15
#> 7 A 16 2021-01-19
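A further sketch that stays within dplyr/lubridate and avoids zoo entirely: parse the dd-mm-yyyy strings with dmy() so max() compares real dates, index each group into non-overlapping windows of three rows, and keep only complete windows:
library(tidyverse)
library(lubridate)
test %>%
  mutate(time = dmy(time)) %>%
  arrange(my_groups, time) %>%
  group_by(my_groups) %>%
  mutate(window = (row_number() - 1) %/% 3) %>%   # non-overlapping windows of 3
  group_by(my_groups, window) %>%
  filter(n() == 3) %>%                            # assumes incomplete windows (group B) should be dropped
  summarise(mean_3 = mean(measure), final_data = max(time), .groups = "drop") %>%
  select(-window)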
The code below creates a simplified version of the dataframe and illustrates my desired end result (df_wider) based on the unnested version. My question is: How can I achieve the same end result (df_wider) from the nested version (nested_df), using purrr?
library(tidyverse)
df <- tibble(id_01 = c(rep("01", 3), rep("02", 3)),
a = (c("a", "a", "b", "c", "c", "d")),
b = letters[7:12],
id_02 = rep(c(1, 2, 1), 2)
)
df_wider <- pivot_wider(df,
id_cols = c(id_01, a),
names_from = id_02,
values_from = b,
names_sep = "_"
)
nested_df <- nest(df, data = -id_01)
To be clear, I am trying to pivot while the dataframes are nested (i.e., before unnesting).
We can use purrr::map() within dplyr::mutate():
library(tidyverse)
df <- tibble(
id_01 = c(rep("01", 3), rep("02", 3)),
a = (c("a", "a", "b", "c", "c", "d")),
b = letters[7:12],
id_02 = rep(c(1, 2, 1), 2)
)
nested_df <- df %>%
nest(data = -id_01) %>%
mutate(data = map(data, ~ .x %>%
pivot_wider(
id_cols = a,
names_from = id_02,
values_from = b
)))
nested_df
#> # A tibble: 2 x 2
#> id_01 data
#> <chr> <list>
#> 1 01 <tibble [2 x 3]>
#> 2 02 <tibble [2 x 3]>
nested_df %>%
unnest(data)
#> # A tibble: 4 x 4
#> id_01 a `1` `2`
#> <chr> <chr> <chr> <chr>
#> 1 01 a g h
#> 2 01 b i <NA>
#> 3 02 c j k
#> 4 02 d l <NA>
Created on 2021-03-26 by the reprex package (v1.0.0)
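As a quick check (a sketch, not part of the answer above), unnesting should reproduce df_wider from the question, since both keep the same row and column order:
all.equal(
  as.data.frame(unnest(nested_df, data)),
  as.data.frame(df_wider)
)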