How to summarise with sum dependent on another column - using dplyr - r

I am looking to perform a group by on id and code1 and then summarise. I want the summarise to do several conditional sums, for example the sum of the count column when code2 == "B". I know how to do this by creating an intermediary binary column, but I was wondering if there is a quicker method where this can all be performed in the summarise statement.
Here is some test data:
id <- c(1,1,1)
code1 <- c("M", "M", "M")
code2 <- c("B", "B", "U")
code3 <- c("H", "N", "N")
count <- c(15, 2, 1)
x <- data.frame(id, code1, code2, code3, count)
Desired output:
id | code1 | Total | B_count | U_count | H_count | N_count
1  | M     | 18    | 17      | 1       | 15      | 3

We can use the conditions inside the summarise call:
library(dplyr)
x %>%
group_by(id, code1) %>%
summarise(total = sum(count),
B_count = sum(count[code2 == "B"]),
U_count = sum(count[code2 == "U"]),
H_count = sum(count[code3 == "H"]),
N_count = sum(count[code3 == "N"]))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: id [1]
id code1 total B_count U_count H_count N_count
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 M 18 17 1 15 3
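The idiom works because `count[code2 == "B"]` keeps only the elements of count where the condition is TRUE, and summarise evaluates it once per group. A minimal base R sketch on the test data's columns (the data has a single group, so no grouping is needed here). The `code2_na` vector is a made-up variant, not part of the question, to show how a missing code would behave:

```r
count <- c(15, 2, 1)
code2 <- c("B", "B", "U")

# Logical subsetting: keep the counts where code2 is "B", then sum them
b_count <- sum(count[code2 == "B"])

# Equivalent form: TRUE/FALSE are coerced to 1/0 when multiplied
b_count_alt <- sum(count * (code2 == "B"))

# Hypothetical variant with a missing code: an NA condition selects an NA
# element, so the sum is NA unless na.rm = TRUE is passed
code2_na <- c("B", NA, "U")
b_count_na <- sum(count[code2_na == "B"], na.rm = TRUE)
```

If the condition columns may contain NA, adding `na.rm = TRUE` inside each `sum()` in the summarise call is probably the simplest safeguard.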

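This solution is more involved, but it gets the job done without naming each code value: pivot the code columns longer, sum by code, pivot wider again and join the totals.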
library(dplyr)
library(tidyr)
x %>%
pivot_longer(
cols = matches('code[2-9]'),
names_to = 'vars',
values_to = 'code'
) %>%
dplyr::select(-vars) %>%
group_by(id, code1, code) %>%
summarise(count = sum(count), .groups = "rowwise") %>%
pivot_wider(
id_cols = c(id, code1),
names_from = code,
values_from = count
) %>%
left_join(
x %>%
group_by(id, code1) %>%
summarise(Total = sum(count), .groups = "rowwise"),
by = c("id", "code1")
) %>%
select(id, code1, Total, everything())
## A tibble: 1 x 7
# id code1 Total B H N U
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 M 18 17 15 3 1

Related

Extract digits from strings in R

I have a dataframe which contains a text string like the one below that shows the ingredients and the proportion of each ingredient. What I would like to achieve is to extract the proportion of each ingredient as a separate variable:
What I have:
given <- tibble(
ingredients =c("1.5BZ+1FZ+2HT","2FZ","0.5HT+2BZ")
)
What I want to achieve:
to_achieve <- tibble(
ingredients =c("1.5BZ+1FZ+2HT","2FZ","0.5HT+2BZ"),
proportion_bz = c(1.5,0,2),
proportion_fz = c(1,2,0),
proportion_ht = c(2,0,0.5)
)
Please note there might be more than a dozen different ingredients and tidyverse methods are preferred.
Thanks in advance,
Felix
Making heavy use of tidyr you could first split your strings into rows per ingredient using separate_rows, afterwards extract the numeric proportion and the type of ingredient and finally use pivot_wider to reshape into your desired format:
library(dplyr)
library(tidyr)
given %>%
mutate(ingredients_split = ingredients) |>
tidyr::separate_rows(ingredients_split, sep = "\\+") |>
tidyr::extract(
ingredients_split,
into = c("proportion", "ingredient"),
regex = "^([\\d+\\.]+)(.*)$"
) |>
mutate(
proportion = as.numeric(proportion),
ingredient = tolower(ingredient)
) |>
pivot_wider(
names_from = ingredient,
names_prefix = "proportion_",
values_from = proportion,
values_fill = 0
)
#> # A tibble: 3 × 4
#> ingredients proportion_bz proportion_fz proportion_ht
#> <chr> <dbl> <dbl> <dbl>
#> 1 1.5BZ+1FZ+2HT 1.5 1 2
#> 2 2FZ 0 2 0
#> 3 0.5HT+2BZ 2 0 0.5
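To see what the split and the regular expression do on a single string, here is a rough base R sketch. It assumes, as in the example data, that every piece starts with a number followed by the ingredient code; the variable names are only illustrative:

```r
# Split one ingredient string on the literal "+" sign
parts <- strsplit("1.5BZ+1FZ+2HT", "+", fixed = TRUE)[[1]]

# Leading digits and dots form the proportion, the rest is the ingredient code
pattern <- "^([0-9.]+)(.*)$"
proportion <- as.numeric(sub(pattern, "\\1", parts))
ingredient <- tolower(sub(pattern, "\\2", parts))

# Named vector: one proportion per ingredient
setNames(proportion, ingredient)
```

separate_rows and extract apply the same two steps to every row of the tibble, which is why the pipeline scales to more than a dozen ingredients.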
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(janitor)
# SOLUTION -----
given %>%
separate(ingredients, into = c("a", "b", "c"), sep = "\\+", remove = F) %>%
pivot_longer(a:c) %>%
select(-name) %>%
mutate(name = str_remove_all(value, "[0-9]|\\."),
value = parse_number(value)) %>%
na.omit() %>%
pivot_wider(names_prefix = "proportion_", values_fill = 0) %>%
clean_names()
# OUTPUT ----
#># A tibble: 3 × 4
#> ingredients proportion_bz proportion_fz proportion_ht
#> <chr> <dbl> <dbl> <dbl>
#>1 1.5BZ+1FZ+2HT 1.5 1 2
#>2 2FZ 0 2 0
#>3 0.5HT+2BZ 2 0 0.5

R: Reshape rows to columns and fill with NA

I have a dataframe with 0-3 rows depending on the underlying data. Here is an example with 2 rows:
df <- tibble(ID = c(1, 1), v = c(1, 2))
ID v
<dbl> <dbl>
1 1 1
2 1 2
I now want to convert each row of v into a separate column. As I have 3 rows at maximum, the result should look like this:
ID v1 v2 v3
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 2
What's the best way to achieve this? Thanks!
Perhaps this helps
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(nm = str_c("v", 2:3)) %>%
complete(ID, nm = str_c("v", 1:3)) %>%
pivot_wider(names_from = nm, values_from = v)
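Update (at the OP's request, see comments): this version places the NA values first. It uses max_n as defined in the first answer below.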
df %>%
group_by(ID) %>%
summarise(cur_data()[seq(max_n),]) %>%
arrange(!is.na(v), v) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = row,
values_from = v,
names_glue = "v_{.name}")
ID v_1 v_2 v_3
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 2
First answer:
Maybe something like this. What we are doing here is:
define the maximum number of rows per group (in this case it is 3)
then pad each group up to that maximum of 3 by adding NA rows
for naming, add a row_number() column and use pivot_wider with its arguments:
library(dplyr)
library(tidyr)
max_n <- 3
df %>%
group_by(ID) %>%
summarise(cur_data()[seq(max_n),]) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = row,
values_from = v,
names_glue = "v_{.name}")
ID v_1 v_2 v_3
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 NA
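As far as I know, cur_data() is deprecated in recent dplyr releases (1.1.0 and later), so here is a sketch of the same padding idea that avoids it by using tidyr's complete() instead. It assumes the same df and max_n as above and names the columns v1 to v3 as in the question:

```r
library(dplyr)
library(tidyr)

df <- tibble(ID = c(1, 1), v = c(1, 2))
max_n <- 3

wide <- df %>%
  group_by(ID) %>%
  mutate(row = row_number()) %>%        # position of each value within its ID
  ungroup() %>%
  complete(ID, row = seq_len(max_n)) %>% # add the missing positions, v becomes NA
  pivot_wider(names_from = row, values_from = v, names_prefix = "v")

wide
```

This gives the NA-last layout (v1 = 1, v2 = 2, v3 = NA); for the NA-first layout, the arrange() step from the update would still be needed before numbering the rows.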

Dplyr Summarise Groups as Column Names

I got a data frame with a lot of columns and want to summarise them with multiple functions.
test_df <- data.frame(Group = sample(c("A", "B", "C"), 10, T), var1 = sample(1:5, 10, T), var2 = sample(3:7, 10, T))
test_df %>%
group_by(Group) %>%
summarise_all(c(Mean = mean, Sum = sum))
# A tibble: 3 x 5
Group var1_Mean var2_Mean var1_Sum var2_Sum
<chr> <dbl> <dbl> <int> <int>
1 A 3.14 5.14 22 36
2 B 4.5 4.5 9 9
3 C 4 6 4 6
This results in a tibble with Group as the first column and column names that combine the previous column name with the function name.
The desired result is a table with the previous column names in the first column and the groups and functions in the column names.
I can achieve this with
test_longer <- test_df %>% pivot_longer(cols = starts_with("var"), names_to = "var", values_to = "val")
# Add row number because spread needs unique identifiers for rows
test_longer <- test_longer %>%
group_by(Group) %>%
mutate(grouped_id = row_number())
spread(test_longer, Group, val) %>%
select(-grouped_id) %>%
group_by(var) %>%
summarise_all(c(Mean = mean, Sum = sum), na.rm = T)
# A tibble: 2 x 7
var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
<chr> <dbl> <dbl> <dbl> <int> <int> <int>
1 var1 3.14 4.5 4 22 9 4
2 var2 5.14 4.5 6 36 9 6
But this seems to be a rather long detour... There probably is a better way, but I could not find it. Any suggestions? Thank you
There are lots of ways to go about it, but I would simplify it by pivoting to a longer data frame initially, and then grouping by var and group. Then you can just pivot wider to get the final result you want. Note that I used summarize(across()), which replaces the superseded summarize_all(), even though with a single value column I could have just specified Mean = ... and Sum = ... manually.
set.seed(123)
test_df %>%
pivot_longer(
var1:var2,
names_to = "var"
) %>%
group_by(Group, var) %>%
summarize(
across(
everything(),
list(Mean = mean, Sum = sum),
.names = "{.fn}"
),
.groups = "drop"
) %>%
pivot_wider(
names_from = "Group",
values_from = c(Mean, Sum),
names_glue = "{Group}_{.value}"
)
#> # A tibble: 2 × 7
#> var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 var1 1 2.5 3.2 1 10 16
#> 2 var2 5 4.5 4.4 5 18 22

how to get the output of proc tabulate (SAS) in R

Is there an R function that directly gives the same output as proc tabulate?
var1<-c(rep("A",4),rep("B",4))
var2<-c(rep("C",4),rep("D",4))
var3<-c(rep("E",2),rep("F",4),rep("G",2))
dataset<-data.frame(var1,var2,var3)
proc tabulate data=dataset;
class var1 var2 var3;
table var1*var2 ,var3 all (n rowpctn);
run;
The output that I want is like this:
Here is a way with R -
Create a column of 1s - n
Expand the data to fill the missing combinations - complete
Reshape to 'wide' format - pivot_wider
Create the 'Total' column by getting the row wise sum - rowSums
Add the percentage by looping across the 'var3' columns
library(dplyr)
library(tidyr)
library(stringr)
dataset %>%
mutate(n = 1, var3 = str_c('var3_', var3)) %>%
complete(var1, var2, var3, fill = list(n = 0)) %>%
pivot_wider(names_from = var3, values_from = n, values_fn = sum) %>%
mutate(Total = rowSums(across(where(is.numeric)))) %>%
group_by(var1) %>%
mutate(across(starts_with('var3'),
~ case_when(. == 0 ~ '0(0%)',
TRUE ~ sprintf('%d(%d%%)', ., 100 * mean(. != 0))))) %>%
ungroup
-output
# A tibble: 4 × 6
var1 var2 var3_E var3_F var3_G Total
<chr> <chr> <chr> <chr> <chr> <dbl>
1 A C 2(50%) 2(50%) 0(0%) 4
2 A D 0(0%) 0(0%) 0(0%) 0
3 B C 0(0%) 0(0%) 0(0%) 0
4 B D 0(0%) 2(50%) 2(50%) 4
Update
Based on the comments by @IceCreamToucan, there was a bug, which is corrected in the code below
dataset %>%
mutate(n = 1, var3 = str_c('var3_', var3)) %>%
complete(var1, var2, var3, fill = list(n = 0)) %>%
pivot_wider(names_from = var3, values_from = n, values_fn = sum) %>%
mutate(Total = rowSums(across(where(is.numeric))),
100 * across(starts_with('var3'), ~ . != 0,
.names = "{.col}_perc")/rowSums(across(starts_with('var3'), ~ .!= 0)),
across(matches('var3_[A-Z]$'), ~ case_when(. == 0 ~ '0(0%)',
TRUE ~ sprintf('%d(%.f%%)', ., get(str_c(cur_column(), '_perc')))))) %>%
select(-ends_with('perc'))
Here's a more generic version, where I define a function.
var1<-c(rep("A",4),rep("B",4))
var2<-c(rep("C",4),rep("D",4))
var3<-c(rep("E",2),rep("F",4),rep("G",2))
df<-data.frame(var1,var2,var3)
df_tabulate(df, id_cols = c(var1, var2), names_from = var3)
#> # A tibble: 4 × 6
#> var1 var2 var3_E var3_F var3_G Total
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 A C 2(50.0%) 2(50.0%) 0(0.0%) 4
#> 2 A D 0(0%) 0(0%) 0(0%) 0
#> 3 B C 0(0%) 0(0%) 0(0%) 0
#> 4 B D 0(0.0%) 2(50.0%) 2(50.0%) 4
You can define the function using janitor
library(janitor, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(rlang)
library(tidyr)
df_tabulate <- function(df, id_cols, names_from){
id_cols <- enquo(id_cols)
if (quo_is_call(id_cols, 'c'))
id_cols <- call_args(id_cols)
else
id_cols <- ensym(id_cols)
names_from_chr <- as_label(enquo(names_from))
counts <- df %>%
mutate(g = eval(call2(paste, !!!id_cols, sep = ',')),
col = paste0(names_from_chr, '_', {{ names_from }})) %>%
tabyl(g, col) %>%
adorn_totals('col')
percs <- adorn_percentages(counts) %>%
adorn_pct_formatting()
rbind(counts, percs) %>%
group_by(g) %>%
summarise(across(-Total, ~ paste0(first(.), '(', last(.), ')')),
Total = as.numeric(first(Total))) %>%
separate(g, into = as.character(id_cols)) %>%
complete(!!!id_cols) %>%
mutate(across(starts_with(names_from_chr), ~ coalesce(., '0(0%)')),
across(Total, ~ coalesce(., 0)))
}
Here it is as a single pipeline with discrete simple steps. Long, to be sure, but if you wanted many tables like this you could store it as a function.
library(tidyverse)
library(janitor)
dataset %>%
mutate(across(var1:var2, as.factor)) %>%
count(var1, var2, var3, .drop = FALSE) %>%
unite(vars, var1, var2) %>%
pivot_wider(names_from = var3, values_from = n) %>%
select(-`NA`) %>%
replace(is.na(.), 0) %>%
adorn_totals("col") %>%
adorn_percentages(,,,,-c(vars, Total)) %>%
adorn_pct_formatting(digits = 0,,,,-c(vars, Total)) %>%
adorn_ns(position = "front",,,-c(vars, Total)) %>%
separate(vars, into = c("var1", "var2"))
#> # A tibble: 4 x 6
#> var1 var2 E F G Total
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 A C 2 (50%) 2 (50%) 0 (0%) 4
#> 2 A D 0 (-) 0 (-) 0 (-) 0
#> 3 B C 0 (-) 0 (-) 0 (-) 0
#> 4 B D 0 (0%) 2 (50%) 2 (50%) 4
This replaces the questionable 0/0 = 0% with a simple - for a cleaner (IMO) result.

dplyr number of rows across groups after filtering

I want the count and proportion (of all of elements) of each group in a data frame (after filtering). This code produces the desired output:
library(dplyr)
# tibble() replaces the deprecated data_frame()
df <- tibble(id = sample(letters[1:3], 100, replace = TRUE),
value = rnorm(100))
summary <- filter(df, value > 0) %>%
group_by(id) %>%
summarize(count = n()) %>%
ungroup() %>%
mutate(proportion = count / sum(count))
> summary
# A tibble: 3 x 3
id count proportion
<chr> <int> <dbl>
1 a 17 0.3695652
2 b 13 0.2826087
3 c 16 0.3478261
Is there an elegant solution that avoids the ungroup() and mutate() steps? Something like:
summary <- filter(df, value > 0) %>%
group_by(id) %>%
summarize(count = n(),
proportion = n() / [?TOTAL_ROWS()?])
I couldn't find such a function in the documentation, but I must be missing something obvious. Thanks!
You can use nrow on . which refers to the entire data frame piped in:
df %>%
filter(value > 0) %>%
group_by(id) %>%
summarise(count = n(), proportion = count / nrow(.))
# A tibble: 3 x 3
# id count proportion
# <chr> <int> <dbl>
#1 a 14 0.2592593
#2 b 22 0.4074074
#3 c 18 0.3333333
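Note that the `.` placeholder belongs to the magrittr pipe `%>%`, so this would not carry over as written to the native `|>` pipe. A sketch of an alternative that does not rely on `.`, using count() followed by mutate(); the seed and the name summary_tbl are my own choices, so the numbers will differ from the output above:

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- tibble(id = sample(letters[1:3], 100, replace = TRUE),
             value = rnorm(100))

summary_tbl <- df %>%
  filter(value > 0) %>%
  count(id, name = "count") %>%              # one row per id, result is ungrouped
  mutate(proportion = count / sum(count))    # share of all filtered rows

summary_tbl
```

Because count() returns an ungrouped data frame, sum(count) is the total number of rows after filtering, and no explicit ungroup() is needed.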
