How can I make a custom aggregation of a dataframe in R?

I have a dataframe such as
group <- c("A", "A", "B", "C", "C")
tx <- c("A-201", "A-202", "B-201", "C-205", "C-206")
feature <- c("coding", "decay", "pending", "coding", "coding")
df <- data.frame(group, tx, feature)
I want to generate a new df with the entries in tx "listed" for each feature. I want the output to look like
group <- c("A", "B", "C")
coding <- c("A-201", NA, "C-205|C-206")
decay <- c("A-202", NA, NA)
pending <- c(NA, "B-201", NA)
df.out <- data.frame(group, coding, decay, pending)
So far I did not find a means to achieve this via a dplyr function. Do I have to loop through my initial df?

You may get the data in wide format using tidyr::pivot_wider and use a function in values_fn -
df.out <- tidyr::pivot_wider(df, names_from = feature, values_from = tx,
values_fn = function(x) paste0(x, collapse = '|'))
df.out
# group coding decay pending
# <chr> <chr> <chr> <chr>
#1 A A-201 A-202 NA
#2 B NA NA B-201
#3 C C-205|C-206 NA NA

Here is an alternative way:
library(dplyr)
library(tidyr)
df %>%
group_by(group, feature) %>%
mutate(tx = paste(tx, collapse = "|")) %>%
distinct() %>%
pivot_wider(
names_from = feature,
values_from = tx
)
group coding decay pending
<chr> <chr> <chr> <chr>
1 A A-201 A-202 NA
2 B NA NA B-201
3 C C-205|C-206 NA NA
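A variation on the approach above, offered as a sketch rather than a correction: summarise() collapses each group/feature pair to a single row directly, so the distinct() step is not needed. The object name out is only illustrative.

```r
library(dplyr)
library(tidyr)

group <- c("A", "A", "B", "C", "C")
tx <- c("A-201", "A-202", "B-201", "C-205", "C-206")
feature <- c("coding", "decay", "pending", "coding", "coding")
df <- data.frame(group, tx, feature)

out <- df %>%
  group_by(group, feature) %>%
  # one row per group/feature pair, with all tx values joined by "|"
  summarise(tx = paste(tx, collapse = "|"), .groups = "drop") %>%
  pivot_wider(names_from = feature, values_from = tx)

out
```

Combinations that never occur (for example group B with feature coding) are filled with NA by pivot_wider.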

Using dcast from data.table
library(data.table)
dcast(setDT(df), group ~ feature, value.var = 'tx',
function(x) paste(x, collapse = "|"), fill = NA)
group coding decay pending
1: A A-201 A-202 <NA>
2: B <NA> <NA> B-201
3: C C-205|C-206 <NA> <NA>
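For completeness, a minimal base R sketch of the same reshaping, assuming no extra packages are wanted. tapply() applies paste() to every group/feature cell and leaves empty cells as NA; the intermediate name wide is only illustrative.

```r
group <- c("A", "A", "B", "C", "C")
tx <- c("A-201", "A-202", "B-201", "C-205", "C-206")
feature <- c("coding", "decay", "pending", "coding", "coding")
df <- data.frame(group, tx, feature)

# character matrix: rows are groups, columns are features
wide <- tapply(df$tx, list(df$group, df$feature), paste, collapse = "|")

# turn the matrix back into a data frame with group as a regular column
df.out <- data.frame(group = rownames(wide), wide, row.names = NULL)
df.out
```

No explicit loop over the rows of df is required here either.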

Related

Collapsing Columns in R using tidyverse with mutate, replace, and unite. Writing a function to reuse?

Data:
ID  B   C
1   NA  x
2   x   NA
3   x   x
Results:
ID  Unified
1   C
2   B
3   B_C
I'm trying to combine columns B and C using mutate and unite, but how would I scale this up so that I can reuse it for many columns (think 100+) instead of having to write out the variables each time? Or is there a built-in function that already does this?
My current solution is this:
library(tidyverse)
Data %>%
mutate(B = replace(B, B == 'x', 'B'), C = replace(C, C == 'x', 'C')) %>%
unite("Unified", B:C, na.rm = TRUE, remove= TRUE)
We may use across to loop over the columns and replace each value equal to 'x' with the name of its column (cur_column())
library(dplyr)
library(tidyr)
Data %>%
mutate(across(B:C, ~ replace(., .== 'x', cur_column()))) %>%
unite(Unified, B:C, na.rm = TRUE, remove = TRUE)
-output
ID Unified
1 1 C
2 2 B
3 3 B_C
data
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA,
"x")), class = "data.frame", row.names = c(NA, -3L))
Here are a couple of options.
Using dplyr -
library(dplyr)
cols <- names(Data)[-1]
Data %>%
rowwise() %>%
mutate(Unified = paste0(cols[!is.na(c_across(B:C))], collapse = '_')) %>%
ungroup -> Data
Data
# ID B C Unified
# <int> <chr> <chr> <chr>
#1 1 NA x C
#2 2 x NA B
#3 3 x x B_C
Base R
Data$Unified <- apply(Data[cols], 1, function(x)
paste0(cols[!is.na(x)], collapse = '_'))
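Since the question asks about reuse, the base R line above could be wrapped in a small helper. This is only a sketch: the function name unify_cols and its arguments are hypothetical, and it assumes that any non-NA value in a column counts as a flag.

```r
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA, "x")),
                  class = "data.frame", row.names = c(NA, -3L))

# hypothetical helper: for each row, join the names of the columns
# in `cols` whose value is not NA
unify_cols <- function(data, cols, sep = "_") {
  apply(data[cols], 1, function(x) paste(cols[!is.na(x)], collapse = sep))
}

# works for any number of columns; here every column except ID
Data$Unified <- unify_cols(Data, setdiff(names(Data), "ID"))
Data
```

Passing the column names as a character vector means the same call scales to 100+ columns without listing them by hand.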

Taking ... (three dots) argument for grouping variables in dplyr and use ... as names of a new data frame in a function

Purpose
I would like to take ... (three dots) argument for grouping variables in dplyr and use ... as names of a new data frame in a function. Question section includes details about what I want to achieve.
Sample Data
library(tidyverse)
library(tibble)
library(data.table)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
id <- sample(1:20, 1000, replace=T)
set.seed(10003)
group1 <- sample(letters, 1000, replace=T)
set.seed(10004)
group2 <- sample(LETTERS, 1000, replace=T)
df <-
data.frame(id, group1, group2)
Question
fn <- function(df, ...){
group_ <- enquos(...)
# First I will use this as grouping variables in dplyr
df %>%
group_by(!!!group_) %>%
summarise(obs = n())
# The question is the second operation.
# I would like to create a data frame with NAs here so that I can rbind using for loop later
# for example, if ... = group1
# f <- data.frame(id = NA, group1 = NA, output = NA)
# for example, if ... = group1, group2
# f <- data.frame(id = NA, group1 = NA, group2 = NA, output = NA)
# Is there a way to take the ... argument and use them as column names in a new data frame f in a function?
}
After creating the grouped summary, get the grouping column names directly with group_vars, then build the new data frame dynamically from those names:
fn <- function(df, ...){
group_ <- enquos(...)
tmp <- df %>%
group_by(!!!group_) %>%
summarise(obs = n(), .groups = 'keep')
nm1 <- group_vars(tmp)
tibble::as_tibble(setNames(rep(list(NA), length(nm1) + 2),
c('id', nm1, 'output')))
}
-testing
> fn(df, group1)
# A tibble: 1 x 3
id group1 output
<lgl> <lgl> <lgl>
1 NA NA NA
> fn(df, group1, group2)
# A tibble: 1 x 4
id group1 group2 output
<lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA
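If only the column names are needed, the grouping step can be skipped. The following base R sketch captures the unevaluated ... arguments with substitute() and deparses them to strings; the function name template_from_dots is hypothetical, and it assumes each argument is a bare column name.

```r
# hypothetical helper: build a one-row NA data frame whose middle
# columns are named after the bare names passed through ...
template_from_dots <- function(...) {
  # substitute(list(...)) yields the call list(group1, group2);
  # dropping the first element leaves the argument symbols
  dots <- as.list(substitute(list(...)))[-1]
  nm <- vapply(dots, deparse, character(1))
  cols <- c("id", nm, "output")
  as.data.frame(setNames(rep(list(NA), length(cols)), cols))
}

template_from_dots(group1)
template_from_dots(group1, group2)
```

The columns named here do not have to exist anywhere, because the arguments are never evaluated.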

Fill a new column from multiple columns if they exist [duplicate]

Example data.frame:
df <- data.frame(col_1=c("A", NA, NA), col_2=c(NA, "B", NA), col_3=c(NA, NA, "C"), other_col=rep("x", 3), stringsAsFactors=F)
df
col_1 col_2 col_3 other_col
1 A <NA> <NA> x
2 <NA> B <NA> x
3 <NA> <NA> C x
I can create a new column new_col filled with non-NA values from the 3 columns col_1, col_2 and col_3:
df %>%
mutate(new_col = case_when(
!is.na(col_1) ~ col_1,
!is.na(col_2) ~ col_2,
!is.na(col_3) ~ col_3,
TRUE ~ "none"))
col_1 col_2 col_3 other_col new_col
1 A <NA> <NA> x A
2 <NA> B <NA> x B
3 <NA> <NA> C x C
However, sometimes the number of columns from which I pick the new_col value can vary.
How could I check that the columns exist before applying the previous case_when command?
The following triggers an error:
df %>%
select(-col_3) %>%
mutate(new_col = case_when(
!is.null(.$col_1) & !is.na(col_1) ~ col_1,
!is.null(.$col_2) & !is.na(col_2) ~ col_2,
!is.null(.$col_3) & !is.na(col_3) ~ col_3,
TRUE ~ "none"))
Error: Problem with `mutate()` input `new_col`.
x object 'col_3' not found
ℹ Input `new_col` is `case_when(...)`.
I like Adam's answer, but if you want to be able to combine from col_1 and col_2 (assuming they both have values), you should use unite()
library(tidyverse)
df %>%
unite(new_col, starts_with("col"), remove = FALSE, na.rm = TRUE)
Edit to respond to: "How could I check that the columns exist before applying the previous case_when command?"
You won't need to check with this command. And if your columns to unite aren't named consistently, replace starts_with("col") with c("your_name_1", "your_name_2", "etc.")
You can use coalesce.
library(dplyr)
# vector of all the columns you might want
candidate_cols <- paste("col", 1:3, sep = "_")
# convert to symbol only the ones in the dataframe
check_cols <- syms(intersect(candidate_cols, names(df)))
# coalesce over the columns to check
df %>%
mutate(new_col = coalesce(!!!check_cols))
# col_1 col_2 col_3 other_col new_col
#1 A <NA> <NA> x A
#2 <NA> B <NA> x B
#3 <NA> <NA> C x C
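To illustrate the case from the question where col_3 is missing, here is a sketch of the same coalesce idea applied to the reduced data frame. It also passes "none" as a final fallback to mirror the TRUE ~ "none" branch of the original case_when; the names df_small and present are only illustrative.

```r
library(dplyr)

df <- data.frame(col_1 = c("A", NA, NA), col_2 = c(NA, "B", NA),
                 col_3 = c(NA, NA, "C"), other_col = rep("x", 3),
                 stringsAsFactors = FALSE)

# drop col_3 to simulate a data frame where it does not exist
df_small <- select(df, -col_3)

# keep only the candidate columns that are actually present
candidate_cols <- paste("col", 1:3, sep = "_")
present <- syms(intersect(candidate_cols, names(df_small)))

out <- df_small %>%
  mutate(new_col = coalesce(!!!present, "none"))
out
```

Because the list of symbols is built from names(df_small), no reference to the missing column ever reaches mutate.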

Elegant solution for casting (spreading) multiple columns of character vectors

I want to transform a data frame of contact information for a list of municipalities, in which similar information (e.g. phone numbers) appears in multiple columns.
I have tried both reshape2::dcast() and tidyr::spread(), neither of which solves my problem. I have also checked other posts on Stack Overflow, e.g.
Multiple column spread
I have yet to find a solution that works. It seems to me that the problem should be fairly straightforward (and solvable with spread or dcast).
tmp <- tibble(municipality = c("M1", "M2"),
name1 = c("n1", "n2"), name2 = c("n3", "n4"), name3 = c(NA, "n5"), # placeholder names
phone1 = c("p1", "p2"), phone2 = c("p3", "p4"), phone3 = c(NA, "p5")) # placeholder phone numbers
#solution 1
tmp %>% gather("colname", "value", -municipality) %>%
filter(municipality == "M1") %>% # to simplify; should be replaced with group_by(municipality)
na.omit() %>% mutate(colname = str_replace(colname, "\\d", replacement = "")) %>%
spread(., key = "colname", value = "value")
#Solution 2
tmp %>% gather("colname", "value", -municipality) %>%
filter(municipality == "M1") %>% # same as above
na.omit() %>% mutate(colname = str_replace(colname, "\\d", replacement = "")) %>%
dcast(municipality + value ~colname)
Solution 1 results in the following error:
Error: Each row of output must be identified by a unique combination of keys.
Solution 2 results in the following data frame (which is the desired result except it needs to be collapsed):
municipality value name phone
1 M1 n1 n1 <NA>
2 M1 n3 n3 <NA>
3 M1 p1 <NA> p1
4 M1 p3 <NA> p3
Are you looking for this?
library(dplyr)
library(tidyr)
tmp %>%
gather(key, value, -municipality, na.rm = TRUE) %>%
mutate(key = gsub("\\d+", "", key)) %>%
group_by(municipality, key) %>%
mutate(row = row_number()) %>%
spread(key, value) %>%
select(-row)
# municipality name phone
# <chr> <chr> <chr>
#1 M1 n1 p1
#2 M1 n3 p3
#3 M2 n2 p2
#4 M2 n4 p4
#5 M2 n5 p5
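We can use gather to bring the data into long format, dropping NA values. We then remove the numbers from the column names so that they share the same key, group_by municipality and key, and create a row-number column so that each output row has a unique identifier before spreading the data back into wide format.
We can do this elegantly with pivot_longer, which was in the dev version of tidyr when this was written and has been part of the released package since tidyr 1.0.0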
library(dplyr)
library(tidyr)# 0.8.3.9000
library(stringr)
tmp %>%
rename_at(-1, ~str_replace(., "(\\d+$)", "_\\1")) %>%
pivot_longer(cols = -municipality, names_to = c(".value", "group"),
names_sep="_", values_drop_na = TRUE) %>%
select(-group)
# A tibble: 5 x 3
# municipality name phone
# <chr> <chr> <chr>
#1 M1 n1 p1
#2 M1 n3 p3
#3 M2 n2 p2
#4 M2 n4 p4
#5 M2 n5 p5
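With a released version of tidyr (1.0.0 or later), the renaming step can be avoided by splitting the column names with names_pattern. This is a sketch under that assumption; the helper column name idx is arbitrary, and a plain data frame stands in for the tibble from the question.

```r
library(tidyr)

tmp <- data.frame(municipality = c("M1", "M2"),
                  name1 = c("n1", "n2"), name2 = c("n3", "n4"), name3 = c(NA, "n5"),
                  phone1 = c("p1", "p2"), phone2 = c("p3", "p4"), phone3 = c(NA, "p5"),
                  stringsAsFactors = FALSE)

long <- pivot_longer(tmp,
                     cols = -municipality,
                     # first capture group becomes the output column (name / phone),
                     # second capture group is the numeric suffix
                     names_to = c(".value", "idx"),
                     names_pattern = "(\\D+)(\\d+)",
                     values_drop_na = TRUE)

long$idx <- NULL  # the suffix is only needed to pair name and phone
long
```

The result has one row per contact, with the all-NA third entry for M1 dropped.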
Or another option is melt from data.table
library(data.table)
melt(setDT(tmp), measure = patterns("^name", "^phone"),
value.name = c("name", "phone"), na.rm = TRUE)[, variable := NULL][]
#   municipality name phone
#1: M1 n1 p1
#2: M2 n2 p2
#3: M1 n3 p3
#4: M2 n4 p4
#5: M2 n5 p5

Error when duplicating a row conditionally - R

I have a data frame with columns A, B, C as follows:
A <- c("NX300", "BT400", "GD200")
B <- c("M0102", "N0703", "M0405")
C <- c(NA, "M0104", "N0404")
df <- data.frame (A,B,C)
I would like to duplicate a row whenever the value in C is not NA, and replace the value of B with NA in the duplicated row. This is the desired output:
A1 <- c("NX300", "BT400", "BT400", "GD200", "GD200")
B1 <- c("M0102", "N0703", NA, "M0405", NA)
C1 <- c(NA, NA, "M0104", NA, "N0404")
df1 <- data.frame(A1,B1,C1)
To achieve this, I tried duplicating the row, without replacing B with NA just yet, but I get the following error code:
rbind(df, df[,is.na(C)==FALSE])
Error: object "C" not found
Can anyone help please?
Define a function newrows which accepts a row x and returns it or the duplicated rows and then apply it to each row. No packages are used.
newrows <- function(x) {
if (is.na(x$C)) x
else rbind(replace(x, "C", NA), replace(x, "B", NA))
}
do.call("rbind", by(df, 1:nrow(df), newrows))
giving:
A B C
1 NX300 M0102 <NA>
2.2 BT400 N0703 <NA>
2.21 BT400 <NA> M0104
3.3 GD200 M0405 <NA>
3.31 GD200 <NA> N0404
An option would be
library(dplyr)
library(tidyr) # uncount() comes from tidyr
df %>%
mutate(i1 = 1 + !is.na(C)) %>%
uncount(i1) %>%
mutate(B = replace(B, duplicated(B), NA)) %>%
group_by(A) %>%
mutate(C = replace(C, duplicated(C, fromLast = TRUE), NA))
If sorting does not matter, and continuing your first steps you can try:
x <- rbind(df, cbind(df[!is.na(df$C),1:2], C=NA))
x$B[!is.na(x$C)] <- NA
x
# A B C
#1 NX300 M0102 <NA>
#2 BT400 <NA> M0104
#3 GD200 <NA> N0404
#21 BT400 N0703 <NA>
#31 GD200 M0405 <NA>
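The error in the question arises because C on its own is looked up as a standalone object rather than as the column df$C; in addition, the condition is placed after the comma, where it would select columns instead of rows. If the original row order should be kept, one possible base R sketch repeats row indices with rep() and then blanks B or C depending on whether a row is the first or second copy. The names idx, dup and out are only illustrative.

```r
A <- c("NX300", "BT400", "GD200")
B <- c("M0102", "N0703", "M0405")
C <- c(NA, "M0104", "N0404")
df <- data.frame(A, B, C, stringsAsFactors = FALSE)

# repeat each row index once, or twice when C is not NA
idx <- rep(seq_len(nrow(df)), times = 1 + !is.na(df$C))
out <- df[idx, ]

# TRUE for the second copy of a row
dup <- duplicated(idx)
out$B[dup] <- NA    # second copy keeps C only
out$C[!dup] <- NA   # first copy keeps B only
rownames(out) <- NULL
out
```

This yields the five rows of the desired output in the order given in the question.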
