R: How to summarize and group by variables as column names

I have a wide dataframe with about 200 columns and want to summarize it over various columns. I cannot figure out the syntax for this; I think it should work with .data$ and .env$ but I can't get it to work. Here's an example:
> library(dplyr)
> df = data.frame('A'= c('X','X','X','Y','Y'), 'B'= 1:5, 'C' = 6:10)
> df
A B C
1 X 1 6
2 X 2 7
3 X 3 8
4 Y 4 9
5 Y 5 10
> df %>% group_by(A) %>% summarise(sum(B), sum(C))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
A `sum(B)` `sum(C)`
<chr> <int> <int>
1 X 6 21
2 Y 9 19
But I want to be able to do something like this:
columns_to_sum = c('B','C')
columns_to_group = c('A')
df %>% group_by(columns_to_group) %>% summarise(sum(columns_to_sum))

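We can use across, available since dplyr 1.0.0. Wrapping the character vectors in all_of() tells dplyr to treat them as column names:
library(dplyr)
df %>%
group_by(across(all_of(columns_to_group))) %>%
summarise(across(all_of(columns_to_sum), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')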
# A tibble: 2 x 3
# A B C
# <chr> <int> <int>
#1 X 6 21
#2 Y 9 19
In earlier versions, we could use group_by_at along with summarise_at:
df %>%
group_by_at(columns_to_group) %>%
summarise_at(vars(columns_to_sum), sum, na.rm = TRUE)
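The same across() pattern extends to several summary functions at once. The following is a minimal sketch, assuming dplyr 1.0.0 or later; the choice of sum and mean and the `.names` pattern are illustrative and not part of the original answer.

```r
library(dplyr)

df <- data.frame(A = c("X", "X", "X", "Y", "Y"), B = 1:5, C = 6:10)
columns_to_sum   <- c("B", "C")
columns_to_group <- c("A")

# all_of() turns a character vector into a column selection and
# raises an error if any of the named columns is missing.
res <- df %>%
  group_by(across(all_of(columns_to_group))) %>%
  summarise(
    across(all_of(columns_to_sum),
           list(sum  = ~ sum(.x, na.rm = TRUE),
                mean = ~ mean(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}"),   # illustrative naming scheme
    .groups = "drop"
  )
res
```

The result has one row per group and the columns A, B_sum, B_mean, C_sum and C_mean.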

Related

How to filter out groups empty for 1 column in Tidyverse

tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
Desired output:
tibble(
A = c("B","B"),
x = c(NA,1),
y = c(3,4)
)
I want to find all groups in which every element of x (and of x only) is NA, then remove those groups. "B" is kept because it has at least one non-NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but that seems to filter out any group containing at least one NA; I need a different condition, since all is not the right one here.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both values of x in that group are NA.
We can use any with complete.cases:
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
Output:
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
Since dplyr 1.1.0, we can use .by in filter, so we don't need group_by/ungroup:
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
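The difference between the asker's condition and the working ones comes down to where the negation sits. The sketch below evaluates each condition per group on the same data; the column names in `checks` are hypothetical labels chosen for illustration.

```r
library(dplyr)

df <- tibble(
  A = c("A", "A", "B", "B"),
  x = c(NA, NA, NA, 1),
  y = c(1, 2, 3, 4)
)

# One row per group, one column per candidate condition.
checks <- df %>%
  group_by(A) %>%
  summarise(
    no_na_at_all = all(!is.na(x)),  # asker's attempt: TRUE only if x has no NA
    not_all_na   = !all(is.na(x)),  # TRUE if at least one x is non-NA
    any_non_na   = any(!is.na(x)),  # same as not_all_na (De Morgan's law)
    .groups = "drop"
  )
checks

# Keep only groups with at least one non-NA value of x.
kept <- df %>%
  group_by(A) %>%
  filter(any(!is.na(x))) %>%
  ungroup()
kept
```

Group "B" contains one NA, so `all(!is.na(x))` is FALSE for it and the group is dropped, while `!all(is.na(x))` and `any(!is.na(x))` are both TRUE and keep it.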

Applying mutate across columns called using paste in R

I would like to apply a function across columns in a data frame using mutate in dplyr. I would like to reference the columns using paste.
Here's example data, but the actual data set has many columns making the paste functionality key:
data <- data.frame(var1 = c(1:4), var2 = (5:8))
data
var1 var2
1 1 5
2 2 6
3 3 7
4 4 8
I've got it working when the columns are called separately without quotes:
data <- data %>%
rowwise() %>%
mutate(
total = sum(var1,var2)
)
data
# A tibble: 4 x 3
# Rowwise:
var1 var2 total
<int> <int> <int>
1 1 5 6
2 2 6 8
3 3 7 10
4 4 8 12
But, I'd like to be able to call columns with paste:
data <- data %>%
rowwise() %>%
mutate(
total = sum(paste("var",c(1:2),sep=""))
)
This returns the following error:
Error: Problem with `mutate()` input `total`.
x invalid 'type' (character) of argument
ℹ Input `total` is `sum(paste("var", c(1:2), sep = ""))`.
ℹ The error occurred in row 1.
Run `rlang::last_error()` to see where the error occurred.
Here, we don't need rowwise, as rowSums is more efficient:
library(dplyr)
data %>%
mutate(total = rowSums(.))
Or for a subset of columns (using paste), we select them and use rowSums
data %>%
mutate(total = select(., paste0('var', 1:2)) %>%
rowSums)
If we need to use column names, select the dataset columns within c_across and get the sum (after rowwise)
data %>%
rowwise %>%
mutate(total = sum(c_across(c(var1, var2)))) %>%
ungroup
Or use paste to select columns in c_across
data %>%
rowwise %>%
mutate(total = sum(c_across(paste0('var', 1:2)))) %>%
ungroup
# A tibble: 4 x 3
# var1 var2 total
# <int> <int> <int>
#1 1 5 6
#2 2 6 8
#3 3 7 10
#4 4 8 12
Or extract the selected columns ([) with cur_data() (deprecated in favour of pick() since dplyr 1.1.0)
data %>%
rowwise %>%
mutate(total = sum(cur_data()[paste0('var', 1:2)])) %>%
ungroup
# A tibble: 4 x 3
# var1 var2 total
# <int> <int> <int>
#1 1 5 6
#2 2 6 8
#3 3 7 10
#4 4 8 12
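The rowSums and paste0 ideas above can also be combined through across(), which avoids both rowwise() and the `.` pronoun. This is a minimal sketch assuming dplyr 1.0.0 or later; `cols` is a helper variable introduced here for illustration.

```r
library(dplyr)

data <- data.frame(var1 = 1:4, var2 = 5:8)

# Build the column names as strings, then select them with all_of().
cols <- paste0("var", 1:2)

res <- data %>%
  mutate(total = rowSums(across(all_of(cols))))
res
```

Inside mutate(), across() without a function returns the selected columns as a data frame, which rowSums() then sums row by row in a single vectorised call.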

mutate or summarise across rows by variable containing string

I'd like to create a new data table which is the sum across rows from variables which contain a string. I have been trying to keep this within the tidyverse as a noob using new dplyr across. Help much appreciated.
dat<- data.frame("Image" = c(1,2,3,4),
"A" = c(1,2,3,4),
"A:B"= c(5,6,7,8),
"A:B:C"= c(9,10,11,12))
to obtain the sums across the rows of variables containing "A", "B", or "C".
datsums<- data.frame("Image" = c(1,2,3,4),
"Asum"= c(15,18,21,24),
"Bsum"=c(14,16,18,20),
"Csum"=c(9,10,11,12))
I have been unsuccessful using the newer dplyr verbs:
datsums<- dat %>% summarise(across(str_detect("A")), sum, .names ="Asum",
across(str_detect("B")), sum, .names="Bsum",
across(str_detect("C")), sum, .names"Csum")
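Use rowwise and c_across. Note that contains() ignores case by default, so contains("A") would also match the Image column; pass ignore.case = FALSE to avoid that:
library(tidyverse)
dat %>%
rowwise() %>%
summarise(
Asum = sum(c_across(contains("A", ignore.case = FALSE))),
Bsum = sum(c_across(contains("B", ignore.case = FALSE))),
Csum = sum(c_across(contains("C", ignore.case = FALSE)))
)
Returns:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 3
Asum Bsum Csum
<dbl> <dbl> <dbl>
1 15 14 9
2 18 16 10
3 21 18 11
4 24 20 12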
To add columns to the original data.frame, use mutate instead of summarise (with ignore.case = FALSE so that the Image column is not matched by "A"):
dat %>%
rowwise() %>%
mutate(
Asum = sum(c_across(contains("A", ignore.case = FALSE))),
Bsum = sum(c_across(contains("B", ignore.case = FALSE))),
Csum = sum(c_across(contains("C", ignore.case = FALSE)))
)
# A tibble: 4 x 7
# Rowwise:
Image A A.B A.B.C Asum Bsum Csum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 9 15 14 9
2 2 2 6 10 18 16 10
3 3 3 7 11 21 18 11
4 4 4 8 12 24 20 12
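Whether the Image column is counted towards Asum depends on how contains() matches names: it ignores case by default, and "Image" contains a lowercase "a". The short check below, a sketch using the question's data, lists the columns each selector picks up.

```r
library(dplyr)

# check.names = TRUE (the default) turns "A:B" into A.B and "A:B:C" into A.B.C.
dat <- data.frame("Image" = c(1, 2, 3, 4),
                  "A"     = c(1, 2, 3, 4),
                  "A:B"   = c(5, 6, 7, 8),
                  "A:B:C" = c(9, 10, 11, 12))

# Default, case-insensitive matching: "Image" is selected as well.
loose <- names(select(dat, contains("A")))

# Case-sensitive matching: only the columns with a capital "A".
strict <- names(select(dat, contains("A", ignore.case = FALSE)))

loose
strict
```

With the default matching, the row sums for "A" include the Image values, which is why they come out as 16, 20, 24, 28 instead of the desired 15, 18, 21, 24.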
Since you want a row-wise sum, you could use:
library(dplyr)
dat %>%
transmute(Asum = rowSums(select(., contains('A', ignore.case = FALSE))),
Bsum = rowSums(select(., contains('B', ignore.case = FALSE))),
Csum = rowSums(select(., contains('C', ignore.case = FALSE))))
Or for many variables use :
cols <- c('A', 'B', 'C')
purrr::map_dfc(cols, ~dat %>%
transmute(!!paste0(.x, 'sum') :=
rowSums(select(., contains(.x, ignore.case = FALSE)))))
# Asum Bsum Csum
#1 15 14 9
#2 18 16 10
#3 21 18 11
#4 24 20 12
Use pivot_longer and pivot_wider:
library(tidyverse)
dat %>%
pivot_longer(-Image) %>%
separate_rows(name, sep = "\\.") %>%
pivot_wider(id_cols = Image,
names_from = name,
values_from = value,
values_fn = sum,
names_prefix = "sum")
#> # A tibble: 4 x 4
#> Image sumA sumB sumC
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 15 14 9
#> 2 2 18 16 10
#> 3 3 21 18 11
#> 4 4 24 20 12
Created on 2020-12-07 by the reprex package (v0.3.0)

When I use diff(), it breaks summarise()

One of the columns I want to have after using summarise() is the difference between the (two) values. Each group is ALWAYS going to have two or fewer rows, in my case. The function I found online was diff(). However, I ran into a problem.
Look at this example:
df <- data.frame(value = runif(198),
id = c(
sample(1:100, 99),
sample(1:100, 99))
)
find <- df %>%
group_by(id) %>%
summarise(count = n()) %>%
filter(count != 2)
find
In this case, since I'm not using diff(), I get this:
> find
# A tibble: 2 x 2
id count
<int> <int>
1 14 1
2 39 1
It works fine. Now, if I include diff():
> find <- df %>%
+ group_by(id) %>%
+ summarise(diference = diff(value), count = n()) %>%
+ filter(count != 2)
`summarise()` regrouping output by 'id' (override with `.groups` argument)
> find
# A tibble: 0 x 3
# Groups: id [0]
# … with 3 variables: id <int>, diference <dbl>, count <int>
It comes up with nothing. If I don't filter (it was a relatively short data frame, so I went one by one), I see those rows disappear. In a shorter example, it would be:
> df <- data.frame(value = runif(10),
+ id = c(
+ sample(1:6, 5),
+ sample(1:6, 5))
+ )
> find <- df %>%
+ group_by(id) %>%
+ summarise(diference = diff(value), count = n())
`summarise()` regrouping output by 'id' (override with `.groups` argument)
> find
# A tibble: 4 x 3
# Groups: id [4]
id diference count
<int> <dbl> <int>
1 2 -0.309 2
2 3 0.474 2
3 4 -0.148 2
4 6 0.291 2
As you can see, the rows for id 1 and 5 disappeared. I believe applying diff() causes it, since without it, that doesn't happen:
> find <- df %>%
+ group_by(id) %>%
+ summarise(count = n())
`summarise()` ungrouping output (override with `.groups` argument)
> find
# A tibble: 6 x 2
id count
<int> <int>
1 1 1
2 2 2
3 3 2
4 4 2
5 5 1
6 6 2
That was the exact same data.
However, if I do it manually, diff() gives me an output, though in a slightly different way:
> diff(5)
numeric(0)
> diff(c(5, 4))
[1] -1
My question, then, is whether or not there is a better function to do this, or just some way for me to get the output without it erasing the one-item groups, like this:
id count diference
1 1 1 1
2 58 1 58
I know the difference will be the same as the id, but the reason I'm interested in this is that it is just one of the conditions I will put in filter(). It will be: filter(diference != 0 | count != 2) (once again, this isn't my original data).
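Maybe this is what the question wants. It uses ifelse so that groups with a single row return value unchanged. Note that because count is a scalar here, ifelse returns only the first element of c(0, diff(value)), so groups with 2 rows get a diference of 0 rather than the actual difference.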
library(dplyr)
set.seed(2020)
df <- data.frame(value = runif(10),
id = c(
sample(1:6, 5),
sample(1:6, 5))
)
find <- df %>%
group_by(id) %>%
summarise(count = n(),
diference = ifelse(count > 1, c(0, diff(value)), value),
.groups = 'drop') %>%
filter(count != 2 | diference != 0)
find
## A tibble: 2 x 3
# id count diference
# <int> <int> <dbl>
#1 1 1 0.647
#2 6 1 0.0674
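If the actual difference is wanted for the two-row groups, a plain if/else is one option, because ifelse() returns a result of the same length as its test and therefore only the first element of its `yes` argument when the test is a scalar. The following is a sketch on small made-up data (the values are chosen for illustration, not taken from the question), keeping the asker's column name diference.

```r
library(dplyr)

# Hypothetical data: id 1 has one row, ids 2 and 3 have two rows each.
df <- data.frame(
  id    = c(1, 2, 2, 3, 3),
  value = c(10, 4, 7, 5, 5)
)

res <- df %>%
  group_by(id) %>%
  summarise(
    count = n(),
    # if/else is evaluated once per group, so diff() only runs on pairs;
    # single-row groups fall back to the value itself.
    diference = if (n() > 1) diff(value) else value,
    .groups = "drop"
  ) %>%
  filter(count != 2 | diference != 0)
res
```

Here id 1 is kept because it has a single row, id 2 is kept because its values differ (7 - 4 = 3), and id 3 is dropped because it has two equal values.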

dplyr: getting grouped min and max of columns in a for loop [duplicate]

I am trying to get the grouped min and max of several columns using a for loop:
My data:
df <- data.frame(a=c(1:5, NA), b=c(6:10, NA), c=c(11:15, NA), group=c(1,1,1,2,2,2))
> df
a b c group
1 1 6 11 1
2 2 7 12 1
3 3 8 13 1
4 4 9 14 2
5 5 10 15 2
6 NA NA NA 2
My attempt:
cols <- df %>% select(a,b) %>% names()
for(i in seq_along(cols)) {
output <- df %>% dplyr::group_by(group) %>%
dplyr::summarise_(min=min(.dots=i, na.rm=T), max=max(.dots=i, na.rm=T))
print(output)
}
Desired output for column a:
group min max
<dbl> <int> <int>
1 1 1 3
2 2 4 5
Using dplyr and tidyr (which provides pivot_longer), you can get:
library(dplyr)
library(tidyr)
df %>%
na.omit() %>%
pivot_longer(-group) %>%
group_by(group, name) %>%
summarise(min = min(value),
max = max(value)) %>%
arrange(name, group)
# group name min max
# <dbl> <chr> <int> <int>
# 1 1 a 1 3
# 2 2 a 4 5
# 3 1 b 6 8
# 4 2 b 9 10
# 5 1 c 11 13
# 6 2 c 14 15
We can use summarise_all after grouping by 'group' and if it needs to be in a particular order, then use select to select based on the column names
library(dplyr)
library(stringr)
df %>%
group_by(group) %>%
summarise_all(list(min = ~ min(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE))) %>%
select(group, order(str_remove(names(.), "_.*")))
# A tibble: 2 x 7
# group a_min a_max b_min b_max c_min c_max
# <dbl> <int> <int> <int> <int> <int> <int>
#1 1 1 3 6 8 11 13
#2 2 4 5 9 10 14 15
Without a for loop, using dplyr and tidyr from the tidyverse, you can get the min and max of each column by 1) pivoting the dataframe into a longer format, 2) getting the min and max value per group, and then 3) pivoting the dataframe wider to get the expected output:
library(tidyverse)
df %>% pivot_longer(., cols = c(a,b,c), names_to = "Names",values_to = "Value") %>%
group_by(group,Names) %>% summarise(Min = min(Value, na.rm =TRUE), Max = max(Value,na.rm = TRUE)) %>%
pivot_wider(., names_from = Names, values_from = c(Min,Max)) %>%
select(group,contains("_a"),contains("_b"),contains("_c"))
# A tibble: 2 x 7
# Groups: group [2]
group Min_a Max_a Min_b Max_b Min_c Max_c
<dbl> <int> <int> <int> <int> <int> <int>
1 1 1 3 6 8 11 13
2 2 4 5 9 10 14 15
Is this what you are looking for?
In base R, we can use aggregate and get min and max for multiple columns by group.
aggregate(.~group, df, function(x)
c(min = min(x, na.rm = TRUE),max= max(x, na.rm = TRUE)))
# group a.min a.max b.min b.max c.min c.max
#1 1 1 3 6 8 11 13
#2 2 4 5 9 10 14 15
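The for loop in the question can also be replaced by across() with a named list of functions. This is a minimal sketch assuming dplyr 1.0.0 or later; it reuses the question's `cols` idea but builds the vector directly rather than via select() and names().

```r
library(dplyr)

df <- data.frame(a = c(1:5, NA), b = c(6:10, NA), c = c(11:15, NA),
                 group = c(1, 1, 1, 2, 2, 2))

# Columns to summarise, given as strings.
cols <- c("a", "b")

res <- df %>%
  group_by(group) %>%
  summarise(
    across(all_of(cols),
           list(min = ~ min(.x, na.rm = TRUE),
                max = ~ max(.x, na.rm = TRUE))),
    .groups = "drop"
  )
res
```

With a named list, across() names the output columns as column_function by default, giving a_min, a_max, b_min and b_max next to group.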
