Aggregate string variable using summarise and across function - r

df_input is the input file, and the ideal output file is df_output.
df_input <- data.frame(id = c(1,2,3,4,4,5,5,5,6,7,8,9,10),
party = c("A","B","C","D","E","F","G","H","I","J","K","L","M"),
winner= c(1,1,1,1,1,1,1,1,1,1,1,1,1))
df_output <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
party = c("A","B","C","D,E","F,G,H","I","J","K","L","M"),
winner_sum = c(1,1,1,2,3,1,1,1,1,1))
Previously the code worked using the "summarise_at" function as follows:
df_output <- df_input %>%
dplyr::group_by_at(.vars = vars(id)) %>%
{left_join(
dplyr::summarise_at(., vars(party), ~ str_c(., collapse = ",")),
dplyr::summarise_at(., vars(winner), funs(sum))
)}
But it no longer works, as it seems both "summarise_at" and "funs" have been deprecated.
I am trying to replicate using across with dplyr (1.0.10), but I am getting an error. Here is my attempt:
df_output <- df_input %>%
group_by(id) %>%
summarise(across(winner, sum, na.rm=T)) %>%
summarise(across(party, str_c(., collapse = ",")))
I have multiple numeric and character variables, not just one as in the example. Thanks a lot.

We don't need across when a different function is applied to each single column:
library(dplyr)
library(stringr)
df_input %>%
group_by(id) %>%
summarise(party = str_c(party, collapse = ","),
winner_sum = sum(winner))
-output
# A tibble: 10 × 3
id party winner_sum
<dbl> <chr> <dbl>
1 1 A 1
2 2 B 1
3 3 C 1
4 4 D,E 2
5 5 F,G,H 3
6 6 I 1
7 7 J 1
8 8 K 1
9 9 L 1
10 10 M 1
If there are multiple 'party' and 'winner' columns, loop across them in a single summarise. Chaining a second summarise fails because, after the first one, only the grouping column and the summarised columns remain.
df_input %>%
group_by(id) %>%
summarise(across(winner, sum, na.rm=TRUE),
across(party, ~ str_c(.x, collapse = ",")), .groups = "drop")
-output
# A tibble: 10 × 3
id winner party
<dbl> <dbl> <chr>
1 1 1 A
2 2 1 B
3 3 1 C
4 4 2 D,E
5 5 3 F,G,H
6 6 1 I
7 7 1 J
8 8 1 K
9 9 1 L
10 10 1 M
NOTE: If the columns share a similar prefix, use starts_with to select all of them, i.e. across(starts_with("party"), ...); if the column names differ, use across(c(party, othercol), ...); or, if the function to apply depends on the column type, use across(where(is.numeric), sum, na.rm = TRUE):
df_input %>%
group_by(id) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE),
across(where(is.character), str_c, collapse = ","),
.groups = 'drop')
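On dplyr 1.1.0 and later, passing extra arguments such as na.rm or collapse through across() is deprecated, so the type-based version above can also be written with lambdas. This is a minimal sketch that rebuilds df_input from the question; the result name res is only illustrative.

```r
library(dplyr)
library(stringr)

# Input data from the question
df_input <- data.frame(
  id = c(1, 2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 9, 10),
  party = LETTERS[1:13],
  winner = rep(1, 13)
)

# One summarise call: sum every numeric column and paste every character column.
# The grouping column (id) is excluded from across() automatically.
res <- df_input %>%
  group_by(id) %>%
  summarise(
    across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
    across(where(is.character), ~ str_c(.x, collapse = ",")),
    .groups = "drop"
  )

res
```

The lambda form keeps each extra argument next to the function it belongs to, which also makes it easier to mix functions with different arguments.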

Related

Replacing NA values with mode from multiple imputation in R

I ran 5 imputations on a data set with missing values. For my purposes, I want to replace missing values with the mode from the 5 imputations. Let's say I have the following data sets, where df is my original data, ID is a grouping variable to identify each case, and imp is my imputed data:
df <- data.frame(ID = c(1,2,3,4,5),
var1 = c(1,NA,3,6,NA),
var2 = c(NA,1,2,6,6),
var3 = c(NA,2,NA,4,3))
imp <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
I have a method that works, but it involves a ton of manual coding as I have ~200 variables total (I'm doing this on 3 different data sets with different variables). My code looks like this for one variable:
library(dplyr)
mode <- function(codes){
which.max(tabulate(codes))
}
var1 <- imp %>% group_by(ID) %>% summarise(var1 = mode(var1))
df3 <- df %>%
left_join(var1, by = "ID") %>%
mutate(var1 = coalesce(var1.x, var1.y)) %>%
select(-var1.x, -var1.y)
Thus, the original value in df is replaced with the mode only if the value was NA.
It is taking forever to keep manually coding this for every variable. I'm hoping there is an easier way of calculating the mode from the imputed data set for each variable by ID and then replacing the NAs with that mode in the original data. I thought maybe I could put the variable names in a vector and somehow iterate through them with one code where i changes to each variable name, but I didn't know where to go with that idea.
x <- colnames(df)
# Attempting to iterate through variables names using i
i = as.factor(x[[2]])
This is where I am stuck. Any help is much appreciated!
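Here is one option using tidyverse. Essentially, we can pivot both data frames to long format, then join them and coalesce in one step rather than column by column. The Mode function (borrowed from another answer) returns the most frequent value; on ties it returns the value that appears first.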
library(tidyverse)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
imp_long <- imp %>%
group_by(ID) %>%
summarise(across(everything(), Mode)) %>%
pivot_longer(-ID)
df %>%
pivot_longer(-ID) %>%
left_join(imp_long, by = c("ID", "name")) %>%
mutate(value = coalesce(value.x, value.y)) %>%
select(-c(value.x, value.y)) %>%
pivot_wider(names_from = "name", values_from = "value")
Output
# A tibble: 5 × 4
ID var1 var2 var3
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 6
2 2 5 1 2
3 3 3 2 5
4 4 6 6 4
5 5 3 6 3
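Because the imputed values often tie, it may help to see how this Mode helper resolves ties. The following is a small check using the var2 imputations for ID 1 from the question; the vector name ties is only illustrative.

```r
# Mode helper as defined in the answer
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# imp$var2 for ID 1: both 3 and 2 occur twice
ties <- c(4, 3, 2, 3, 2)

# which.max() picks the first maximum, so the value seen first (3) wins
Mode(ties)
# → 3
```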
You can use -
library(dplyr)
mode_data <- imp %>%
group_by(ID) %>%
summarise(across(starts_with('var'), Mode))
df %>%
left_join(mode_data, by = 'ID') %>%
transmute(ID,
across(matches('\\.x$'),
function(x) coalesce(x, .[[sub('x$', 'y', cur_column())]]),
.names = '{sub(".x$", "", .col)}'))
# ID var1 var2 var3
#1 1 1 3 6
#2 2 5 1 2
#3 3 3 2 5
#4 4 6 6 4
#5 5 3 6 3
mode_data has Mode value for each of the var columns.
Join df and mode_data by ID.
Since each pair of columns is named name.x and name.y, we can take every name.x column and replace x with y to get its partner. (.[[sub('x$', 'y', cur_column())]])
Use coalesce to select the non-NA value in each pair.
Change the column name by removing .x from the name ({sub(".x$", "", .col)}), so var1.x becomes just var1.
where the Mode function is borrowed from another answer:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr, warn.conflicts = FALSE)
imp %>%
group_by(ID) %>%
summarise(across(everything(), Mode)) %>%
bind_rows(df) %>%
group_by(ID) %>%
summarise(across(everything(), ~ coalesce(last(.x), first(.x))))
#> # A tibble: 5 × 4
#> ID var1 var2 var3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 6
#> 2 2 5 1 2
#> 3 3 3 2 5
#> 4 4 6 6 4
#> 5 5 3 6 3
Created on 2022-01-03 by the reprex package (v2.0.1)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
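As a further alternative, not used in the answers above, dplyr also provides rows_patch(), which overwrites only the NA values of one data frame with values from another, matched by key. The sketch below assumes the df and imp data from the question and that every ID in the mode table exists in df; patched is an illustrative name.

```r
library(dplyr)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Data from the question
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 var1 = c(1, NA, 3, 6, NA),
                 var2 = c(NA, 1, 2, 6, 6),
                 var3 = c(NA, 2, NA, 4, 3))
imp <- data.frame(ID = rep(1:5, each = 5),
                  var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
                  var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
                  var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
imp$ID <- as.numeric(imp$ID)  # same key type as df$ID

# One row per ID holding the mode of every imputed variable
mode_data <- imp %>%
  group_by(ID) %>%
  summarise(across(everything(), Mode))

# Fill only the NA cells of df with the matching mode
patched <- rows_patch(df, mode_data, by = "ID")
patched
```

This avoids the join suffixes entirely, at the cost of requiring that both tables share the same column names.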

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way on achieving this instead of using rep() on first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
as.integer(str_extract_all(first(first), '\\d+')[[1]]))) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can try the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr) # needed for str_c
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
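The same idea (repeat each label by its count, then pair it with the split values) can also be sketched with tidyr::uncount(). This is a minimal sketch that assumes a single input row and that the counts add up to the number of values in second; first_long, second_long and result are illustrative names.

```r
library(dplyr)
library(tidyr)

separate_on_condition <- data.frame(first = "a3,b1,c2", second = "1,2,3,4,5,6")

# Split "a3,b1,c2" into rows, separate letters from counts,
# then repeat each letter as many times as its count
first_long <- separate_on_condition %>%
  select(first) %>%
  separate_rows(first, sep = ",") %>%
  tidyr::extract(first, into = c("first", "n"),
                 regex = "([a-zA-Z]+)(\\d+)", convert = TRUE) %>%
  uncount(n)

# Split the values into one row each
second_long <- separate_on_condition %>%
  select(second) %>%
  separate_rows(second, sep = ",")

# Pair the repeated labels with the values by position
result <- bind_cols(first_long, second_long)
result
```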

Why does this (grouped) mutate_at syntax work but summarise_at fails? [duplicate]

Example data:
(tmp_df <-
expand.grid(id = letters[1:3], y = 1:3))
# id y
# 1 a 1
# 2 b 1
# 3 c 1
# 4 a 2
# 5 b 2
# 6 c 2
# 7 a 3
# 8 b 3
# 9 c 3
The following works:
tmp_df %>%
group_by(id) %>%
mutate_at(which(colnames(.) %in% c("y")),
sum)
# id y
# <fct> <int>
# 1 a 6
# 2 b 6
# 3 c 6
# 4 a 6
# 5 b 6
# 6 c 6
# 7 a 6
# 8 b 6
# 9 c 6
but the following throws the error Error: Only strings can be converted to symbols:
tmp_df %>%
group_by(id) %>%
summarise_at(which(colnames(.) %in% c("y")),
sum)
Note that the following code snippets are alternative ways that successfully generate the expected result:
tmp_df %>%
group_by(id) %>%
summarise_at(vars(y),
sum)
tmp_df %>%
group_by(id) %>%
summarise_at("y",
sum)
EDIT: following akrun's answer I should note that the dplyr version I am using is dplyr_0.8.4
It seems that in mutate_at the column numbers include the grouping variables but in summarize_at they do not as both of the lines of code below work. You could report this bug although given that the _at functions have been superseded by across I don't know whether it would be fixed.
tmp_df %>% group_by(id) %>% mutate_at(2, sum)
tmp_df %>% group_by(id) %>% summarize_at(1, sum)
This is further reinforced by the fact that if we swap the columns then they both work consistently since the grouping variable no longer affects the position of the y column.
tmp_df[2:1] %>% group_by(id) %>% mutate_at(1, sum)
tmp_df[2:1] %>% group_by(id) %>% summarize_at(1, sum)
which(colnames(.) %in% c("y")) returns you the index 2.
which(colnames(tmp_df) %in% c("y"))
#[1] 2
This is fine when you use mutate_at.
library(dplyr)
tmp_df %>% group_by(id) %>% mutate_at(2,sum)
# id y
# <fct> <int>
#1 a 6
#2 b 6
#3 c 6
#4 a 6
#5 b 6
#6 c 6
#7 a 6
#8 b 6
#9 c 6
However, when you use summarise_at it does not count the grouped column. So you get an error when you do :
tmp_df %>% group_by(id) %>% summarise_at(2,sum)
Error: Only strings can be converted to symbols
What you actually needed here is
tmp_df %>% group_by(id) %>% summarise_at(1,sum)
# id y
#* <fct> <int>
#1 a 6
#2 b 6
#3 c 6
However, the column position passed to summarise_at would have to shift with the number of grouping columns in group_by, which is fragile, so a better option is to pass column names in vars instead of column numbers.
tmp_df %>% group_by(id) %>% mutate_at(vars('y'),sum)
# id y
# <fct> <int>
#1 a 6
#2 b 6
#3 c 6
#4 a 6
#5 b 6
#6 c 6
#7 a 6
#8 b 6
#9 c 6
tmp_df %>% group_by(id) %>% summarise_at(vars('y'),sum)
# id y
#* <fct> <int>
#1 a 6
#2 b 6
#3 c 6
A good thing about across is that it behaves consistently for mutate as well as summarise: position 2 fails in both, because grouping columns are excluded from the selection.
tmp_df %>% group_by(id) %>% mutate(across(2,sum))
x Can't subset columns that don't exist.
x Location 2 doesn't exist.
tmp_df %>% group_by(id) %>% summarise(across(2,sum))
x Can't subset columns that don't exist.
x Location 2 doesn't exist.
Even with across it is better to use column name rather than position.
tmp_df %>% group_by(id) %>% mutate(across(y,sum))
tmp_df %>% group_by(id) %>% summarise(across(y,sum))
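When the column names arrive as a character vector, which is what the which(colnames(.) %in% ...) idiom in the question was working around, all_of() selects them by name inside across(). A minimal sketch, assuming the tmp_df from the question; cols and res are illustrative names.

```r
library(dplyr)

tmp_df <- expand.grid(id = letters[1:3], y = 1:3)

# Column names held in a character vector (hypothetical variable)
cols <- c("y")

# all_of() selects by name, so the result does not depend on column positions
# or on how many grouping columns there are
res <- tmp_df %>%
  group_by(id) %>%
  summarise(across(all_of(cols), sum), .groups = "drop")

res
```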
We can use contains
library(dplyr)
tmp_df %>%
group_by(id) %>%
summarise(across(contains('y'), sum), .groups = 'drop')
The _at and _all suffixed functions have been superseded; across is now used in their place.

Flip group_by variable to columns, and flip columns to rows dplyr

thank you in advance for your response! I am working in Rstudio, trying to create a specific table format that my customer is looking for. Specifically, I would like to show each metric as a row and the group_by variable, in this case application type, as a column. I'm using group_by to consolidate all my data by application type, and I'm using the summarise function to create the new variables.
subs <- data.frame(
App_type = c('A','A','A','B','B','B','C','C','C','C'),
Has_error = c(1,1,1,0,0,1,1,0,1,1),
Has_critical_error = c(1,0,1,0,0,1,0,0,1,1)
)
I'm able to group the submissions together by application type to see total submissions with errors and total with critical errors. Here's what I've done -
subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)
)
# A tibble: 3 x 4
App_type total_sub total_error total_critical_error
<fct> <int> <dbl> <dbl>
1 A 3 3 2
2 B 3 1 1
3 C 4 3 2
However, my customer would like to see it this way with application totals.
A B C TOTAL
1 total_sub 3 3 4 10
2 total_error 3 1 3 7
3 total_critical_error 2 1 2 5
We can pivot to 'wide' format after reshaping to 'long' and then change the column name 'name' to rowname
library(dplyr)
library(tidyr)
library(tibble)
subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)) %>%
pivot_longer(cols = -App_type) %>%
pivot_wider(names_from = App_type, values_from = value) %>%
mutate(TOTAL = A + B + C) %>%
column_to_rownames("name")
# A B C TOTAL
#total_sub 3 3 4 10
#total_error 3 1 3 7
#total_critical_error 2 1 2 5
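The TOTAL column above hard-codes A + B + C. If the set of application types can change, the total can instead be computed over every column except name. A minimal sketch using the subs data from the question; wide is an illustrative name.

```r
library(dplyr)
library(tidyr)

subs <- data.frame(
  App_type = c('A','A','A','B','B','B','C','C','C','C'),
  Has_error = c(1,1,1,0,0,1,1,0,1,1),
  Has_critical_error = c(1,0,1,0,0,1,0,0,1,1)
)

# Summarise per application type, then flip metrics to rows and types to columns
wide <- subs %>%
  group_by(App_type) %>%
  summarise(
    total_sub = n(),
    total_error = sum(Has_error),
    total_critical_error = sum(Has_critical_error)
  ) %>%
  pivot_longer(cols = -App_type) %>%
  pivot_wider(names_from = App_type, values_from = value)

# Row total over all application-type columns, whatever they are called
wide$TOTAL <- rowSums(wide[setdiff(names(wide), "name")])
wide
```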
Or another option is transpose from data.table
library(data.table)
data.table::transpose(setDT(out), make.names = 'App_type',
keep.names = 'name')[, TOTAL := A + B + C][]
where out is the OP's summarised output
out <- subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)
)
Or with base R
addmargins(t(cbind(total_sub = as.integer(table(subs$App_type)),
rowsum(subs[-1], subs$App_type))), 2)
# A B C Sum
#total_sub 3 3 4 10
#Has_error 3 1 3 7
#Has_critical_error 2 1 2 5

R: Break a data.frame according to value of column with dplyr

I have this data.frame
MWE <- data.frame(x = c("a", "a", "a", "b", "b", "b"), y = c(1,2,3,4,5,6))
and what I want to obtain is this data.frame
data.frame(a = c(1,2,3), b = c(4,5,6))
Actually, what I originally want is to sum the 2 vectors a and b (well, I have in reality many more vectors, but it is easier to explain with only 2), so that's why I thought about this transformation. I can do a rowSums then, or something equivalent.
I tried to use pivot_wider from tidyr but I had an error.
Any idea of how to do this with dplyr or tidyr?
Continuing from @Mr.Flick's attempt in tidyverse, you could create an id column and, grouped on that id column, calculate the sum like
library(dplyr)
MWE %>%
group_by(x) %>%
mutate(row = row_number()) %>%
group_by(row) %>%
mutate(total_sum = sum(y)) %>%
tidyr::pivot_wider(names_from = x, values_from = y) %>%
ungroup() %>%
select(-row)
# A tibble: 3 x 3
# total_sum a b
# <dbl> <dbl> <dbl>
#1 5 1 4
#2 7 2 5
#3 9 3 6
We can use unstack from base R
unstack(MWE, y ~ x)
# a b
#1 1 4
#2 2 5
#3 3 6
Or using rowid from data.table with pivot_wider from tidyr
library(dplyr)
library(data.table)
library(tidyr)
MWE %>%
mutate(rn = rowid(x)) %>%
pivot_wider(names_from = x, values_from = y) %>%
select(-rn)
# A tibble: 3 x 2
# a b
# <dbl> <dbl>
#1 1 4
#2 2 5
#3 3 6
Using base R:
data.frame(with(MWE, split(y, x)))
a b
1 1 4
2 2 5
3 3 6
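Since the underlying goal is to add the per-group vectors elementwise, the reshaping step can be skipped altogether. A minimal base R sketch, assuming every group has the same number of rows; by_group and total are illustrative names.

```r
MWE <- data.frame(x = c("a", "a", "a", "b", "b", "b"), y = c(1, 2, 3, 4, 5, 6))

# One numeric vector per level of x
by_group <- split(MWE$y, MWE$x)

# Add the vectors elementwise; works for any number of groups
total <- Reduce(`+`, by_group)
total
# → 5 7 9
```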
