Sum multiple variables by group and create new column with their sum - r

I have a data frame with a grouping variable and I want to sum the other variables by group. That's easy with dplyr.
library(dplyr)
library(magrittr)
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum)
# A tibble: 3 x 3
group n1 n2
<fctr> <int> <int>
1 a 3 5
2 b 3 4
3 c 9 11
But now I want a new column total with the sum of n1 and n2 by group. Like this:
# A tibble: 3 x 4
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
How can I do that with dplyr?
EDIT:
Actually, it's just an example, I have a lot of variables.
I tried these two snippets, but the results don't have the right dimensions...
data %>% group_by(group) %>%
summarise_all(sum) %>%
summarise_if(is.numeric, sum)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate_if(is.numeric, .funs = sum)

You can use mutate after summarize:
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = n1 + n2)
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <int>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
If you need to sum all numeric columns, you can use rowSums with select_if (to select the numeric columns):
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = rowSums(select_if(., is.numeric)))
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <dbl>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
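In dplyr 1.0.0 and later, summarise_all() and select_if() are superseded by across(). The following is a minimal sketch of the same idea under that assumption; the column name ttl is taken from the question, and the rest mirrors the answer above rather than replacing it.

```r
library(dplyr)

data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)

result <- data %>%
  group_by(group) %>%
  # sum every non-grouping column within each group
  summarise(across(everything(), sum), .groups = "drop") %>%
  # add up all numeric columns row by row
  mutate(ttl = rowSums(across(where(is.numeric))))

result
```

Because rowSums() returns doubles, ttl is a double column even though n1 and n2 are integers.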

We can use apply together with the dplyr functions.
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = apply(.[, 2:ncol(.)], 1, sum))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Or rowSums with the same strategy. The key is to use . to specify the data frame and [] with x:ncol(.) to keep the columns you want.
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = rowSums(.[, 2:ncol(.)]))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <dbl>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20

Base R
cbind(aggregate(.~group, data, sum), ttl = sapply(split(data[,-1], data$group), sum))
# group n1 n2 ttl
#a a 3 5 8
#b b 3 4 7
#c c 9 11 20

We can use data.table. Convert the data.frame to a data.table with setDT(data), sum each column of .SD (the Subset of Data) by 'group', and then use Reduce to add up the columns of interest row by row:
library(data.table)
setDT(data)[, lapply(.SD, sum) , group][, tt1 := Reduce(`+`, .SD),
.SDcols = names(data)[-1]][]
# group n1 n2 tt1
#1: a 3 5 8
#2: b 3 4 7
#3: c 9 11 20
Or with base R
addmargins(as.matrix(rowsum(data[-1], data$group)), 2)
# n1 n2 Sum
#a 3 5 8
#b 3 4 7
#c 9 11 20
Or with dplyr
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt = rowSums(.[-1]))

Related

Re-ordering a factor (in a table) with hierarchical groups in R

Suppose the following table with two factor variables and one numerical variable:
df <- tibble(
x = as_factor(c("a", "a", "a", "b", "b", "b")),
y = as_factor(1:6),
val = c(10, 3, 8, 2, 6, 1)
)
> df
# A tibble: 6 x 3
x y val
<fct> <fct> <dbl>
1 a 1 10
2 a 2 3
3 a 3 8
4 b 4 2
5 b 5 6
6 b 6 1
I would like to re-order y such that the sum of val, when grouped by x, takes precedence, but y is still ordered by val within each group. To illustrate the goal:
# A tibble: 6 x 4
# Groups: x [2]
x y val sum
<fct> <fct> <dbl> <dbl>
1 a 1 10 21 # all y for which x=="a" come first, because
2 a 3 8 21 # the sum of val for x=="a" is greater than
3 a 2 3 21 # for x=="b"
4 b 5 6 9 # within each group, y is ordered by val
5 b 4 2 9
6 b 6 1 9
But how do I get there? Within tidyverse, I tried to solve it with forcats::fct_reorder(), thinking that grouping might help (df |> group_by(x) |> mutate(y = fct_reorder(y, val))), but it doesn't.
Can fct_reorder() do that at all? What other approaches could work?
Edit: I have found a solution, but it feels rather hacky:
df |>
group_by(x) |>
mutate(sum = sum(val)) |>
arrange(desc(sum), desc(val)) |> ungroup() |>
tibble::rowid_to_column() |>
mutate(across(c(x, y), \(x) fct_reorder(x, rowid)))
Perhaps we need to arrange first:
library(dplyr)
library(forcats)
df %>%
arrange(desc(ave(val, x, FUN = sum)), desc(val)) %>%
mutate(across(where(is.factor), fct_inorder))
-output
# A tibble: 6 × 3
x y val
<fct> <fct> <dbl>
1 a 1 10
2 a 3 8
3 a 2 3
4 b 5 6
5 b 4 2
6 b 6 1
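The printed tibble shows the row order but not the factor levels, which are what the question is really about. A small sketch to check them, rebuilding the example data with base factor() (an assumption made here to keep the block self-contained):

```r
library(dplyr)
library(forcats)
library(tibble)

df <- tibble(
  x = factor(c("a", "a", "a", "b", "b", "b")),
  y = factor(1:6),
  val = c(10, 3, 8, 2, 6, 1)
)

out <- df %>%
  # order rows by group total (descending), then by val (descending)
  arrange(desc(ave(val, x, FUN = sum)), desc(val)) %>%
  # make the factor levels follow the new row order
  mutate(across(where(is.factor), fct_inorder))

levels(out$x)
levels(out$y)
```

If the approach works as intended, the levels of y follow the arranged row order (1, 3, 2, 5, 4, 6) rather than the original 1 to 6.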
Or use fct_reorder/reorder in arrange
df %>%
arrange(desc(fct_reorder(x, val, .fun = sum)), desc(val)) %>%
mutate(across(where(is.factor), fct_inorder))
Probably we can use the following data.table option along with fct_inorder
setorder(
setDT(df)[
,
sum := sum(val), x
],
-sum, -val
)[
,
lapply(
.SD,
function(x) ifelse(is.factor(x), fct_inorder, c)(x)
)
]
and you will obtain
x y val sum
1: a 1 10 21
2: a 3 8 21
3: a 2 3 21
4: b 5 6 9
5: b 4 2 9
6: b 6 1 9

Pivot longer with multiple data points in a single column

I have a data frame containing a varying number of data points in the same column:
library(tidyverse)
df <- tribble(~id, ~data,
"A", "a;b;c",
"B", "e;f")
I want to obtain one row per data point, separating the content of column data and distributing it on rows. This code gives the expected result, but is clumsy:
df %>%
separate(data,
into = paste0("dat_",1:5),
sep = ";",
fill = "right") %>%
pivot_longer(starts_with("dat_"),
names_to = "data_number",
names_pattern = "dat_(\\d+)") %>%
filter(!is.na(value))
#> # A tibble: 5 x 3
#> id data_number value
#> <chr> <chr> <chr>
#> 1 A 1 a
#> 2 A 2 b
#> 3 A 3 c
#> 4 B 1 e
#> 5 B 2 f
Tidyverse solutions preferred.
Here is one way
library(dplyr)
library(tidyr)
library(data.table)
df %>%
separate_rows(data) %>%
mutate(data_number = rowid(id), .before = 2)
-output
# A tibble: 5 x 3
id data_number data
<chr> <int> <chr>
1 A 1 a
2 A 2 b
3 A 3 c
4 B 1 e
5 B 2 f
library(dplyr)
library(tidyr)
df %>%
separate_rows(data)
output:
# A tibble: 5 x 2
id data
<chr> <chr>
1 A a
2 A b
3 A c
4 B e
5 B f
Using str_split and unnest -
library(tidyverse)
df %>%
mutate(data = str_split(data, ';'),
data_number = map(data, seq_along)) %>%
unnest(c(data, data_number))
# id data data_number
# <chr> <chr> <int>
#1 A a 1
#2 A b 2
#3 A c 3
#4 B e 1
#5 B f 2
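Since the question prefers tidyverse solutions, here is a sketch that avoids data.table::rowid() and numbers the data points with a grouped row_number() instead. The explicit sep = ";" is an assumption added here so the split does not depend on the default separator pattern.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(~id, ~data,
              "A", "a;b;c",
              "B", "e;f")

out <- df %>%
  # one row per data point
  separate_rows(data, sep = ";") %>%
  # number the data points within each id
  group_by(id) %>%
  mutate(data_number = row_number(), .before = data) %>%
  ungroup()

out
```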

a beautiful solution to decode a table with dplyr and mutate

Dear dplyr/tidyverse companions, I am looking for a nice solution to the following problem. I only get my solutions in base R with a loop. How do you solve this cleanly in tidyverse?
I have a dataset called data, which has unhelpful column names and unhelpful (integer) values.
data <- tibble(var1 = rep(c(1:3), 2),
var2 = rep(c(1:3), 2))
# A tibble: 6 x 2
var1 var2
<int> <int>
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
Additionally, I have a coding table, which gives every column a better name (var1 -> variable1) and every value a better label (1 -> "a").
coding <- tibble(variable = c(rep("var1", 3),rep("var2", 3)),
name = c(rep("variable1", 3),rep("variable2", 3)),
code = rep(c(1:3), 2),
value = rep(c("a", "b", "c"), 2))
# A tibble: 6 x 4
variable name code value
<chr> <chr> <int> <chr>
1 var1 variable1 1 a
2 var1 variable1 2 b
3 var1 variable1 3 c
4 var2 variable2 1 a
5 var2 variable2 2 b
6 var2 variable2 3 c
I'm looking for a result, which has transformed names of the columns and the real values as factors in the dataset, compare:
result <- tibble(variable1 = factor(rep(c("a", "b", "c"), 2)),
variable2 = factor(rep(c("a", "b", "c"), 2)))
# A tibble: 6 x 2
variable1 variable2
<fct> <fct>
1 a a
2 b b
3 c c
4 a a
5 b b
6 c c
Thank you for your commitment :) :) :) :)
library(dplyr)
library(tidyr)
data %>%
stack() %>%
left_join(coding, by = c(ind = "variable", values = "code")) %>%
group_by(name) %>%
mutate(j = row_number()) %>%
pivot_wider(id_cols = j, values_from = value) %>%
select(-j)
# # A tibble: 6 x 2
# variable1 variable2
# <chr> <chr>
# 1 a a
# 2 b b
# 3 c c
# 4 a a
# 5 b b
# 6 c c
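A general solution for any number of columns:
create a row number column to identify each row
get the data in long format
join it with coding for each value
keep only unique rows and get it back in wide format.
Note that the join matches on code alone, so it relies on every variable sharing the same code-to-value mapping, as in this example.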
library(dplyr)
library(tidyr)
data %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, values_to = 'code') %>%
left_join(coding, by = 'code') %>%
select(row, name = name.y, value) %>%
distinct() %>%
pivot_wider() %>%
select(-row)
# variable1 variable2
# <chr> <chr>
#1 a a
#2 b b
#3 c c
#4 a a
#5 b b
#6 c c
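The join above matches on code alone, which is fine here because var1 and var2 share one coding. If the codings differ per variable, a possible variant is to join on the variable name as well. The sketch below uses made-up data (different codes per row and different labels for var2, both assumptions for illustration) and also converts the result to factors, as the question asked.

```r
library(dplyr)
library(tidyr)
library(tibble)

# hypothetical data where var1 and var2 are coded differently
data <- tibble(var1 = c(1L, 2L, 3L), var2 = c(3L, 1L, 2L))
coding <- tibble(
  variable = rep(c("var1", "var2"), each = 3),
  name     = rep(c("variable1", "variable2"), each = 3),
  code     = rep(1:3, 2),
  value    = c("a", "b", "c", "x", "y", "z")
)

decoded <- data %>%
  mutate(row = row_number()) %>%
  # long format: one row per (row, variable, code)
  pivot_longer(cols = -row, names_to = "variable", values_to = "code") %>%
  # match on both the variable and the code
  left_join(coding, by = c("variable", "code")) %>%
  select(row, name, value) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-row) %>%
  mutate(across(everything(), factor))

decoded
```

Naming the long-format column variable (via names_to) avoids a clash with the name column of coding, so no distinct() step is needed.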

R how to 'spread' data with no key-value pair

I have data:
rowID incidentID participant.type
1 1 A
2 1 B
3 2 A
4 3 A
5 3 B
6 3 C
7 4 B
8 4 C
And I would like to end up with:
rowID incident participant.type participant.type.1 participant.type.2
1 1 A B
2 2 A
3 3 A B C
4 4 B C
I tried spread() but can't get one line per incident. I don't think I have a way of creating a key-value pair, so I wonder if there is some other method for doing this.
Before using spread(), you need to create a proper key argument.
df %>% select(-rowID) %>%
group_by(incidentID) %>%
mutate(id = 1:n()) %>%
spread(id, participant.type)
# incidentID `1` `2` `3`
# <int> <fct> <fct> <fct>
# 1 1 A B NA
# 2 2 A NA NA
# 3 3 A B C
# 4 4 B C NA
Since your grouping is based on the row order within the incidentID column, the following simple solution will also work.
It just filters the data frame and then merges the pieces at the end.
It is probably not the most efficient solution in terms of computing power, but it is easy to understand.
library(tidyverse)
df <-
tribble(
~rowID, ~incidentID, ~participant.type,
1, 1, "A",
2, 1, "B",
3, 2, "A",
4, 3, "A",
5, 3, "B",
6, 3, "C",
7, 4, "B",
8, 4, "C")
df_1 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==1)
df_2 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==2) %>%
rename(participant.type.1 = participant.type)
df_3 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==3) %>%
rename(participant.type.2 = participant.type)
full_join(df_1, full_join(df_2, df_3))
Result:
Joining, by = "incidentID"
Joining, by = "incidentID"
# A tibble: 4 x 4
# Groups: incidentID [?]
incidentID participant.type participant.type.1 participant.type.2
<dbl> <chr> <chr> <chr>
1 1 A B NA
2 2 A NA NA
3 3 A B C
4 4 B C NA
Here's my solution:
df %>%
select(-rowID) %>%
group_by(incidentID) %>%
nest() %>%
mutate(data = map_chr(data, ~str_c(.x$participant.type, collapse = '_'))) %>%
separate(data, paste0('participant.type.', 0:2)) %>%
mutate_at(2:4, ~replace_na(.x, ''))
We can use reshape2::dcast for this
reshape2::dcast(df, insidentID ~ participant.type)
# insidentID A B C
# 1 1 <NA> B <NA>
# 2 8 <NA> B <NA>
# 3 12 <NA> <NA> C
# 4 16 A <NA> <NA>
# 5 24 <NA> B <NA>
# 6 27 <NA> B C
# 7 29 <NA> <NA> C
with the data
set.seed(123)
df <- data.frame(insidentID = sample(0:30, 8L, replace = TRUE),
participant.type = sample(LETTERS[1:3], 8L, replace = TRUE),
stringsAsFactors = FALSE)
df
# insidentID participant.type
# 1 8 B
# 2 24 B
# 3 12 C
# 4 27 B
# 5 29 C
# 6 1 B
# 7 16 A
# 8 27 C
The 'related question' link provided by @markus shows a variety of other solutions, including what appears to be the most concise one in tidyverse style:
df %>%
group_by(incidentID) %>%
mutate(rn = paste0("newcolumn",row_number())) %>%
spread(rn, participant.type)
gives:
incidentID newcolumn1 newcolumn2 newcolumn3
<int> <fct> <fct> <fct>
1 1 A B NA
2 2 A NA NA
3 3 A B C
4 4 B C NA
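spread() has since been superseded by pivot_wider() in tidyr. A minimal sketch of the same idea, assuming tidyr 1.0.0 or later; the names_prefix value is an illustrative choice and the data is rebuilt from the question.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tibble(
  rowID = 1:8,
  incidentID = c(1, 1, 2, 3, 3, 3, 4, 4),
  participant.type = c("A", "B", "A", "A", "B", "C", "B", "C")
)

wide <- df %>%
  select(-rowID) %>%
  # number the participants within each incident to act as the key
  group_by(incidentID) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    names_from = rn,
    values_from = participant.type,
    names_prefix = "participant.type."
  )

wide
```

Incidents with fewer participants than the maximum get NA in the unused columns.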

How to summarize value not matching the group using dplyr

I want to sum the values of rows that belong to groups other than the row's own group. For example, using this sample data:
> df <- data.frame(id=1:5, group=c("A", "A", "B", "B", "A"), val=seq(9, 1, -2))
> df
id group val
1 1 A 9
2 2 A 7
3 3 B 5
4 4 B 3
5 5 A 1
Summarizing with dplyr by group
> df %>% group_by(group) %>% summarize(sumval = sum(val))
Source: local data frame [2 x 2]
group sumval
(fctr) (dbl)
1 A 17
2 B 8
What I want is for each row to get the summed val of all the other groups; for example, rows in group A should get the sumval of everything that is not in group A. The final result would be:
id group val notval
1 1 A 9 8
2 2 A 7 8
3 3 B 5 17
4 4 B 3 17
5 5 A 1 8
Is there a way to do this in dplyr? Preferably in a single chain?
We can do this with base R
s1 <- sapply(unique(df$group), function(x) sum(df$val[df$group !=x]))
s1[with(df, match(group, unique(group)))]
#[1] 8 8 17 17 8
Or using data.table
library(data.table)
setDT(df)[,notval := sum(df$val[df$group!=group]) ,group]
@akrun's answers are best. But if you want to do it in dplyr, this is a roundabout way.
df <- data.frame(id=1:5, group=c("A", "A", "B", "B", "A"), val=seq(9, 1, -2))
df %>% mutate(TotalSum = sum(val)) %>% group_by(group) %>%
mutate(valsumval = TotalSum - sum(val))
Source: local data frame [5 x 5]
Groups: group [2]
id group val TotalSum valsumval
(int) (fctr) (dbl) (dbl) (dbl)
1 1 A 9 25 8
2 2 A 7 25 8
3 3 B 5 25 17
4 4 B 3 25 17
5 5 A 1 25 8
This also works even if there are more than two groups.
This alone also works:
df %>% group_by(group) %>% mutate(notval = sum(df$val)- sum(val))
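To illustrate the claim that the "grand total minus own group total" trick extends beyond two groups, here is a small sketch with a third group. The data is made up for illustration, and the helper column total is a name chosen here.

```r
library(dplyr)

# hypothetical data with three groups
df3 <- data.frame(
  id = 1:6,
  group = c("A", "A", "B", "B", "C", "A"),
  val = c(9, 7, 5, 3, 2, 1)
)

res <- df3 %>%
  mutate(total = sum(val)) %>%          # grand total over all rows
  group_by(group) %>%
  mutate(notval = total - sum(val)) %>% # everything outside the row's group
  ungroup() %>%
  select(-total)

res
```

Computing the grand total before group_by() keeps the chain from referring back to the original data frame by name.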
