This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
Closed 2 years ago.
Example data.frame:
df <- data.frame(col_1 = c("A", NA, NA),
                 col_2 = c(NA, "B", NA),
                 col_3 = c(NA, NA, "C"),
                 other_col = rep("x", 3),
                 stringsAsFactors = FALSE)
df
col_1 col_2 col_3 other_col
1 A <NA> <NA> x
2 <NA> B <NA> x
3 <NA> <NA> C x
I can create a new column new_col filled with non-NA values from the 3 columns col_1, col_2 and col_3:
library(dplyr)

df %>%
  mutate(new_col = case_when(
    !is.na(col_1) ~ col_1,
    !is.na(col_2) ~ col_2,
    !is.na(col_3) ~ col_3,
    TRUE ~ "none"))
col_1 col_2 col_3 other_col new_col
1 A <NA> <NA> x A
2 <NA> B <NA> x B
3 <NA> <NA> C x C
However, sometimes the number of columns from which I pick the new_col value can vary.
How could I check that the columns exist before applying the previous case_when command?
The following triggers an error:
df %>%
  select(-col_3) %>%
  mutate(new_col = case_when(
    !is.null(.$col_1) & !is.na(col_1) ~ col_1,
    !is.null(.$col_2) & !is.na(col_2) ~ col_2,
    !is.null(.$col_3) & !is.na(col_3) ~ col_3,
    TRUE ~ "none"))
Error: Problem with `mutate()` input `new_col`.
x object 'col_3' not found
ℹ Input `new_col` is `case_when(...)`.
I like Adam's answer, but if you want to be able to combine from col_1 and col_2 (assuming they both have values), you should use unite()
library(tidyverse)
df %>%
  unite(new_col, starts_with("col"), remove = FALSE, na.rm = TRUE)
Edit to respond to: "How could I check that the columns exist before applying the previous case_when command?"
You won't need to check with this command. And if your columns to unite aren't named consistently, replace starts_with("col") with c("your_name_1", "your_name_2", "etc.")
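If the set of candidate columns varies, tidyselect's any_of() selects only the names that actually exist, so the unite() approach needs no explicit existence check. A sketch (candidate_cols is a helper name introduced here):

```r
library(dplyr)
library(tidyr)

# same example data as above
df <- data.frame(col_1 = c("A", NA, NA),
                 col_2 = c(NA, "B", NA),
                 col_3 = c(NA, NA, "C"),
                 other_col = rep("x", 3),
                 stringsAsFactors = FALSE)

# columns we might want; any_of() silently skips names that do not exist
candidate_cols <- c("col_1", "col_2", "col_3")

res <- df %>%
  select(-col_3) %>%                  # simulate a missing column
  unite(new_col, any_of(candidate_cols), remove = FALSE, na.rm = TRUE)
res
```

Note that rows whose candidates are all NA come out as an empty string rather than "none"; follow with a replace() if you need the literal "none".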
You can use coalesce.
library(dplyr)
# vector of all the columns you might want
candidate_cols <- paste("col", 1:3, sep = "_")
# convert to symbol only the ones in the dataframe
check_cols <- syms(intersect(candidate_cols, names(df)))
# coalesce over the columns to check
df %>%
  mutate(new_col = coalesce(!!!check_cols))
# col_1 col_2 col_3 other_col new_col
#1 A <NA> <NA> x A
#2 <NA> B <NA> x B
#3 <NA> <NA> C x C
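A variant of the same idea that avoids the !!! splice: inside mutate(), across(any_of(...)) returns a tibble of just the columns that exist, and do.call() can pass that list to coalesce(). A sketch, assuming dplyr >= 1.0:

```r
library(dplyr)

df <- data.frame(col_1 = c("A", NA, NA),
                 col_2 = c(NA, "B", NA),
                 col_3 = c(NA, NA, "C"),
                 other_col = rep("x", 3),
                 stringsAsFactors = FALSE)

candidate_cols <- paste("col", 1:3, sep = "_")

# across() keeps only the candidate columns present in df;
# do.call() splices them into coalesce() as separate arguments
res <- df %>%
  mutate(new_col = do.call(coalesce, across(any_of(candidate_cols))))
res
```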
I have a dataframe such as
group <- c("A", "A", "B", "C", "C")
tx <- c("A-201", "A-202", "B-201", "C-205", "C-206")
feature <- c("coding", "decay", "pending", "coding", "coding")
df <- data.frame(group, tx, feature)
I want to generate a new df with the entries in tx "listed" for each feature. I want the output to look like
group <- c("A", "B", "C")
coding <- c("A-201", NA, "C-205|C-206")
decay <- c("A-202", NA, NA)
pending <- c(NA, "B-201", NA)
df.out <- data.frame(group, coding, decay, pending)
So far I did not find a means to achieve this via a dplyr function. Do I have to loop through my initial df?
You may get the data in wide format using tidyr::pivot_wider, supplying a collapsing function via values_fn -
df.out <- tidyr::pivot_wider(df, names_from = feature, values_from = tx,
values_fn = function(x) paste0(x, collapse = '|'))
df.out
# group coding decay pending
# <chr> <chr> <chr> <chr>
#1 A A-201 A-202 NA
#2 B NA NA B-201
#3 C C-205|C-206 NA NA
Here is an alternative way:
library(dplyr)
library(tidyr)
df %>%
  group_by(group, feature) %>%
  mutate(tx = paste(tx, collapse = "|")) %>%
  distinct() %>%
  pivot_wider(
    names_from = feature,
    values_from = tx
  )
group coding decay pending
<chr> <chr> <chr> <chr>
1 A A-201 A-202 NA
2 B NA NA B-201
3 C C-205|C-206 NA NA
Using dcast from data.table
library(data.table)
dcast(setDT(df), group ~ feature, value.var = 'tx',
function(x) paste(x, collapse = "|"), fill = NA)
group coding decay pending
1: A A-201 A-202 <NA>
2: B <NA> <NA> B-201
3: C C-205|C-206 <NA> <NA>
Data:

ID  B   C
1   NA  x
2   x   NA
3   x   x

Results:

ID  Unified
1   C
2   B
3   B_C
I'm trying to combine columns B and C using mutate and unite, but how would I scale this up so I can reuse it for multiple columns (think 100+), instead of having to write out the variables each time? Or is there a built-in function that does this?
My current solution is this:
library(tidyverse)
Data %>%
  mutate(B = replace(B, B == 'x', 'B'), C = replace(C, C == 'x', 'C')) %>%
  unite("Unified", B:C, na.rm = TRUE, remove = TRUE)
We may use across to loop over the columns, replacing each value equal to 'x' with the corresponding column name (cur_column())
library(dplyr)
library(tidyr)
Data %>%
  mutate(across(B:C, ~ replace(., . == 'x', cur_column()))) %>%
  unite(Unified, B:C, na.rm = TRUE, remove = TRUE)
Output:
ID Unified
1 1 C
2 2 B
3 3 B_C
data
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA,
"x")), class = "data.frame", row.names = c(NA, -3L))
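For the 100+ column case in the question, nothing ties the across() answer to B:C; selecting every column except ID scales the same pipeline. A sketch, assuming all non-ID columns use the same 'x' marker:

```r
library(dplyr)
library(tidyr)

Data <- data.frame(ID = 1:3,
                   B = c(NA, "x", "x"),
                   C = c("x", NA, "x"))

# replace each 'x' with its column's name, then unite all non-ID columns
res <- Data %>%
  mutate(across(-ID, ~ replace(.x, .x == "x", cur_column()))) %>%
  unite("Unified", -ID, na.rm = TRUE, remove = TRUE)
res
```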
Here are a couple of options.
Using dplyr -
library(dplyr)
cols <- names(Data)[-1]
Data %>%
  rowwise() %>%
  mutate(Unified = paste0(cols[!is.na(c_across(B:C))], collapse = '_')) %>%
  ungroup() -> Data
Data
# ID B C Unified
# <int> <chr> <chr> <chr>
#1 1 NA x C
#2 2 x NA B
#3 3 x x B_C
Base R
Data$Unified <- apply(Data[cols], 1, function(x)
paste0(cols[!is.na(x)], collapse = '_'))
How do I filter a dataframe df for all rows where one or more of columns_to_check meets a condition? As an example: where is at least one cell not NA?
df <- tibble(a = c('x', 'x', 'x'),
b = c(NA, 'x', 'x'),
c = c(NA, NA, 'x'))
columns_to_check <- c('b', 'c')
Keeping rows where all of the checked columns are non-NA is straightforward, because filter() combines the across() results with AND:
library(tidyverse)
df %>%
  filter(across(all_of(columns_to_check), ~ !is.na(.x)))
#> # A tibble: 1 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 x x x
But (how) can I combine the filter() statements created with across() using OR?
Here's an approach with reduce from purrr:
df %>%
  filter(reduce(.x = across(all_of(columns_to_check), ~ !is.na(.x)), .f = `|`))
This works because across returns a list of logical vectors that are length nrow(df).
You can see that behavior when you execute it in mutate:
df %>%
  mutate(across(all_of(columns_to_check), ~ !is.na(.x)))
# A tibble: 3 x 3
a b c
<chr> <lgl> <lgl>
1 x FALSE FALSE
2 x TRUE FALSE
3 x TRUE TRUE
Therefore, you can reduce them together with | to get one logical vector. The .x and .f argument names aren't required; they are spelled out here for clarity.
My mistake, this is documented in vignette("rowwise"):
df %>%
  filter(rowSums(across(all_of(columns_to_check), ~ !is.na(.x))) > 0)
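Since dplyr 1.0.4 there are also if_any() and if_all(), which express this OR/AND distinction directly and are the idiomatic replacement for across() inside filter(). A sketch, assuming a recent dplyr:

```r
library(dplyr)

df <- tibble(a = c("x", "x", "x"),
             b = c(NA, "x", "x"),
             c = c(NA, NA, "x"))
columns_to_check <- c("b", "c")

# keep rows where at least one checked column is non-NA (OR);
# if_all() would reproduce the original AND behaviour
res <- df %>%
  filter(if_any(all_of(columns_to_check), ~ !is.na(.x)))
res
```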
Another solution could be:
df %>%
filter(across(all_of(columns_to_check), ~ !is.na(.x)) == TRUE)
a b c
<chr> <chr> <chr>
1 x x <NA>
2 x x x
I have a data frame with columns A, B, C as follows:
A <- c("NX300", "BT400", "GD200")
B <- c("M0102", "N0703", "M0405")
C <- c(NA, "M0104", "N0404")
df <- data.frame (A,B,C)
Whenever a value in C is not NA, I would like to duplicate that row and replace the value of B with NA in the duplicated row. This is the desired output:
A1 <- c("NX300", "BT400", "BT400", "GD200", "GD200")
B1 <- c("M0102", "N0703", NA, "M0405", NA)
C1 <- c(NA, NA, "M0104", NA, "N0404")
df1 <- data.frame(A1,B1,C1)
To achieve this, I tried duplicating the row, without replacing B with NA just yet, but I get the following error code:
rbind(df, df[,is.na(C)==FALSE])
Error: object "C" not found
Can anyone help please?
Define a function newrows which accepts a row x and returns either the row unchanged or the duplicated pair, and then apply it to each row. No packages are used.
newrows <- function(x) {
  if (is.na(x$C)) x
  else rbind(replace(x, "C", NA), replace(x, "B", NA))
}

do.call("rbind", by(df, 1:nrow(df), newrows))
giving:
A B C
1 NX300 M0102 <NA>
2.2 BT400 N0703 <NA>
2.21 BT400 <NA> M0104
3.3 GD200 M0405 <NA>
3.31 GD200 <NA> N0404
An option would be
library(dplyr)
library(tidyr)  # for uncount()

df %>%
  mutate(i1 = 1 + !is.na(C)) %>%
  uncount(i1) %>%
  mutate(B = replace(B, duplicated(B), NA)) %>%
  group_by(A) %>%
  mutate(C = replace(C, duplicated(C, fromLast = TRUE), NA))
If sorting does not matter, and continuing your first steps you can try:
x <- rbind(df, cbind(df[!is.na(df$C),1:2], C=NA))
x$B[!is.na(x$C)] <- NA
x
# A B C
#1 NX300 M0102 <NA>
#2 BT400 <NA> M0104
#3 GD200 <NA> N0404
#21 BT400 N0703 <NA>
#31 GD200 M0405 <NA>
I have a dataframe containing columns named Q1 through Q98. These columns contain strings ("This is a string"), yet some entries only contain a varying number of blanks (" ", " "). I would like to replace all entries containing only blanks with NA.
Consider the dataframe created by the following code:
df <- data.frame(Q1 = c("Test test", "Test", " ", " "),
                 Q2 = c("Sample sample", " ", "Sample", "Sample"))
The solution would modify the above dataframe df such that df$Q1[3:4]==NA and df$Q2[2]==NA.
I have already tried using grepl(" ", df), but this lets me replace every entry that contains blanks, not only those which consist purely of blanks.
One dplyr possibility could be:
df %>%
  mutate_all(~ ifelse(nchar(trimws(.)) == 0, NA_character_, .))
Q1 Q2
1 Test test Sample sample
2 Test <NA>
3 <NA> Sample
4 <NA> Sample
Or the same with base R:
df[] <- lapply(df, function(x) ifelse(nchar(trimws(x)) == 0, NA_character_, x))
Or:
df %>%
mutate_all(~ trimws(.)) %>%
na_if(., "")
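mutate_all() is superseded; with dplyr >= 1.0 the same trim-then-NA idea can be written with across(). A sketch:

```r
library(dplyr)

df <- data.frame(Q1 = c("Test test", "Test", " ", " "),
                 Q2 = c("Sample sample", " ", "Sample", "Sample"))

# trim every string, then convert empty results to NA
res <- df %>%
  mutate(across(everything(), ~ na_if(trimws(.x), "")))
res
```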
A dplyr+stringr option
library(dplyr)
library(stringr)
df %>% mutate_all(~str_replace(., "^\\s+$", NA_character_))
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample
You can search for strings with a start ^, then one or more spaces +, then an end $.
df[sapply(df, function(x) grepl('^ +$', x))] <- NA
# Q1 Q2
# 1 Test test Sample sample
# 2 Test <NA>
# 3 <NA> Sample
# 4 <NA> Sample
Some other possibilities
df[] <- lapply(df, function(x) replace(x, grep('^ +$', x), NA))
#or
replace(df, sapply(df, function(x) grepl('^ +$', x)), NA)
Apply sub to all columns, replacing whitespace-only strings with NA (per ?sub, a replacement of NA sets matching elements to NA):
df[] <- lapply(df, FUN = sub, pattern = "^\\s*$", replacement = NA)
We can do this in base R
df[trimws(as.matrix(df)) == ''] <- NA
df
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample
Or with replace
library(dplyr)
df %>%
mutate_all(list(~ replace(., trimws(.)=="", NA)))
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample