How do I remove NAs with the tidyr::unite function? - r

After combining several columns with tidyr::unite(), NAs from missing data remain in my character vector, which I do not want.
I have a series of medical diagnoses per row (one per column) and would like to benchmark searching for a series of codes via %in% and grepl().
There is an open issue on GitHub about this problem. Is there any movement on it, or are there any workarounds? I would like to keep the vector comma-separated.
Here is a representative example:
library(dplyr)
library(tidyr)
df <- data_frame(a = paste0("A.", rep(1, 3)), b = " ", c = c("C.1", "C.3", " "), d = "D.4", e = "E.5")
cols <- letters[2:4]
df[, cols] <- gsub(" ", NA_character_, as.matrix(df[, cols]))
tidyr::unite(df, new, cols, sep = ",")
Current output:
# # A tibble: 3 x 3
# a new e
# <chr> <chr> <chr>
# 1 A.1 NA,C.1,D.4 E.5
# 2 A.1 NA,C.3,D.4 E.5
# 3 A.1 NA,NA,D.4 E.5
Desired output:
# # A tibble: 3 x 3
# a new e
# <chr> <chr> <chr>
# 1 A.1 C.1,D.4 E.5
# 2 A.1 C.3,D.4 E.5
# 3 A.1 D.4 E.5

Since tidyr 1.0.0, unite() has an na.rm argument that drops NA values before the columns are combined.
library(tidyr)
library(dplyr)
df %>% unite(new, cols, sep = ",", na.rm = TRUE)
# a new e
# <chr> <chr> <chr>
#1 A.1 C.1,D.4 E.5
#2 A.1 C.3,D.4 E.5
#3 A.1 D.4 E.5
However, NAs are not removed if the columns are factors, so convert them to character before calling unite.
df %>%
mutate_all(as.character) %>%
unite(new, cols, sep = ",", na.rm = TRUE)
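Note that mutate_all(as.character) converts every column, including numeric ones. If only the factor columns should change, a minimal base R sketch could look like the following (the data frame `df_fct` is made up purely for illustration):

```r
# Hypothetical example data: one factor column and one numeric column.
df_fct <- data.frame(
  code = factor(c("C.1", NA, "C.3")),
  n = c(1, 2, 3)
)

# Convert only the factor columns; every other column is left untouched.
df_fct[] <- lapply(df_fct, function(col) {
  if (is.factor(col)) as.character(col) else col
})

str(df_fct)
```

The `df_fct[] <-` form keeps the object a data frame while replacing its columns one by one.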
You could also use apply() from base R to do the same. Note that toString() separates the values with ", " (a comma followed by a space).
apply(df[cols], 1, function(x) toString(na.omit(x)))
#[1] "C.1, D.4" "C.3, D.4" "D.4"
data
df <- data_frame(
a = c("A.1", "A.1", "A.1"),
b = c(NA_character_, NA_character_, NA_character_),
c = c("C.1", "C.3", NA),
d = c("D.4", "D.4", "D.4"),
e = c("E.5", "E.5", "E.5")
)
cols <- letters[2:4]

You could use regex to remove the NAs after they are created:
library(dplyr)
library(tidyr)
df <- data_frame(a = paste0("A.", rep(1, 3)),
b = " ",
c = c("C.1", "C.3", " "),
d = "D.4", e = "E.5")
cols <- letters[2:4]
df[, cols] <- gsub(" ", NA_character_, as.matrix(df[, cols]))
tidyr::unite(df, new, cols, sep = ",") %>%
dplyr::mutate(new = stringr::str_replace_all(new, 'NA,?', '')) # New line
Output:
# A tibble: 3 x 3
a new e
<chr> <chr> <chr>
1 A.1 C.1,D.4 E.5
2 A.1 C.3,D.4 E.5
3 A.1 D.4 E.5
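One caveat with the pattern 'NA,?': it also matches "NA" inside longer values, and it leaves a trailing comma behind when the last column is the missing one. A more defensive sketch in base R only removes whole comma-delimited tokens. It assumes that no genuine code is literally "NA", and the helper name `drop_na_tokens` is made up for this example:

```r
drop_na_tokens <- function(x) {
  # Remove "NA" only when it is a complete token: preceded by the start of
  # the string or a comma, and followed by a comma or the end of the string.
  x <- gsub("(^|,)NA(?=,|$)", "", x, perl = TRUE)
  # A leading NA leaves its following comma behind, so strip that as well.
  sub("^,", "", x)
}

cleaned <- drop_na_tokens(c("NA,C.1,D.4", "NA,NA,D.4", "C.1,NA", "NAB,NA,C"))
cleaned
# "C.1,D.4" "D.4" "C.1" "NAB,C"
```

The lookahead `(?=,|$)` needs `perl = TRUE`; it checks for the delimiter without consuming it, so consecutive NA tokens are all removed.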

You can avoid inserting them by iterating over the rows:
library(tidyverse)
df <- data_frame(
a = c("A.1", "A.1", "A.1"),
b = c(NA_character_, NA_character_, NA_character_),
c = c("C.1", "C.3", NA),
d = c("D.4", "D.4", "D.4"),
e = c("E.5", "E.5", "E.5")
)
cols <- letters[2:4]
df %>% mutate(x = pmap_chr(.[cols], ~paste(na.omit(c(...)), collapse = ',')))
#> # A tibble: 3 x 6
#> a b c d e x
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A.1 <NA> C.1 D.4 E.5 C.1,D.4
#> 2 A.1 <NA> C.3 D.4 E.5 C.3,D.4
#> 3 A.1 <NA> <NA> D.4 E.5 D.4
or using tidyr's underlying stringi package,
df %>% mutate(x = pmap_chr(.[cols], ~stringi::stri_flatten(
c(...), collapse = ",",
na_empty = TRUE, omit_empty = TRUE
)))
#> # A tibble: 3 x 6
#> a b c d e x
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A.1 <NA> C.1 D.4 E.5 C.1,D.4
#> 2 A.1 <NA> C.3 D.4 E.5 C.3,D.4
#> 3 A.1 <NA> <NA> D.4 E.5 D.4
The problem is that iterating over rows usually entails making a lot of calls, and can therefore be quite slow at scale. Unfortunately, there doesn't appear to be a great vectorized alternative for removing NAs before joining the strings.
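For what it's worth, one vectorized workaround is to paste everything first and tidy up the separators afterwards. The sketch below uses base R only. It assumes that empty strings may be treated like NA and that the values never contain a comma; `unite_no_na` and `df_ex` are made-up names:

```r
unite_no_na <- function(data, cols) {
  parts <- data[cols]
  # Turn NA into "" so that paste() does not emit the literal string "NA".
  parts[] <- lapply(parts, function(col) {
    col <- as.character(col)
    col[is.na(col)] <- ""
    col
  })
  # One vectorized paste() call over whole columns, not one call per row.
  out <- do.call(paste, c(unname(as.list(parts)), sep = ","))
  # Collapse runs of commas left by the blanks, then trim both ends.
  out <- gsub(",+", ",", out)
  gsub("^,|,$", "", out)
}

df_ex <- data.frame(
  a = "A.1",
  b = NA_character_,
  c = c("C.1", "C.3", NA),
  d = "D.4",
  stringsAsFactors = FALSE
)
united <- unite_no_na(df_ex, c("b", "c", "d"))
united
# "C.1,D.4" "C.3,D.4" "D.4"
```

The trade-off is that a value that is legitimately an empty string disappears too, which the row-wise versions above would keep.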

Thanks all, I've put together a summary of the solutions and benchmarked them on my data:
library(microbenchmark)
library(dplyr)
library(stringr)
library(tidyr)
library(purrr) # provides pmap_chr(), used in the row-iteration variants below
library(biometrics) # has my helper function for column selection
cols <- biometrics::variables(c("diagnosis", "dagger", "ediag"), 20)
system.time({
df <- dat[, cols]
df <- gsub(" ", NA_character_, as.matrix(df)) %>% tbl_df()
})
microbenchmark(
## search by base R `match()` function
match_spaces = apply(dat, 1, function(x) any(c("A37.0","A37.1","A37.8","A37.9") %in% x[cols])), # original search (match)
match_NAs = apply(df, 1, function(x) any(c("A37.0","A37.1","A37.8","A37.9") %in% x[cols])), # matching with " " replaced by NAs with gsub
## search by base R 'grep()' function - the same regex is used in each case
regex_str_replace_all = tidyr::unite(df, new, cols, sep = ",") %>% # grepl search with NAs removed with `stringr::str_replace_all()`
mutate(new = str_replace_all(new, "NA,?", "")) %>%
apply(1, function(x) grepl("A37.*", x, ignore.case = T)),
regex_toString = tidyr::unite(df, new, cols, sep = ",") %>% # grepl search with NAs removed with `apply()` & `toString()`
mutate(new = apply(df[cols], 1, function(x) toString(na.omit(x)))) %>%
apply(1, function(x) grepl("A37.*", x, ignore.case = T)),
regex_row_iteration = df %>% # grepl search after iterating over rows (using syntax I'm not familiar with and need to learn!)
mutate(new = pmap_chr(.[cols], ~paste(na.omit(c(...)), collapse = ','))) %>%
select(new) %>%
apply(1, function(x) grepl("A37.*", x, ignore.case = T)),
regex_stringi = df %>% mutate(new = pmap_chr(.[cols], ~stringi::stri_flatten( # grepl after stringi
c(...), collapse = ",",
na_empty = TRUE, omit_empty = TRUE
))) %>%
select(new) %>%
apply(1, function(x) grepl("A37.*", x, ignore.case = T)),
times = 10L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# match_spaces 14820.2076 15060.045 15558.092 15573.885 15901.015 16521.855 10
# match_NAs 998.3184 1061.973 1191.691 1203.849 1301.511 1378.314 10
# regex_str_replace_all 1464.4502 1487.473 1637.832 1596.522 1701.718 2114.055 10
# regex_toString 4324.0914 4341.725 4631.998 4487.373 4977.603 5439.026 10
# regex_row_iteration 5794.5994 6107.475 6458.339 6436.273 6720.185 7256.980 10
# regex_stringi 4772.3859 5267.456 5466.510 5436.804 5806.272 6011.713 10
It looks like %in% is the winner, after replacing empty values (" ") with NAs. If I go with regular expressions, then removing NAs with stringr::str_replace_all() is the quickest.
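As an aside, the %in% search does not strictly need a per-row apply(). A minimal sketch, using a small made-up data frame (`diag`) in place of the real data, tests one whole column at a time and then combines the results:

```r
codes <- c("A37.0", "A37.1", "A37.8", "A37.9")

# Hypothetical stand-in for the diagnosis columns.
diag <- data.frame(
  diagnosis1 = c("A37.0", "J45", NA),
  diagnosis2 = c(NA, "A37.9", "I10"),
  stringsAsFactors = FALSE
)

# One vectorized %in% per column, then an element-wise OR across columns.
hits_by_col <- lapply(diag, function(col) col %in% codes)
any_hit <- Reduce(`|`, hits_by_col)
any_hit
# TRUE TRUE FALSE
```

This has not been timed against the benchmark above; it only illustrates that the lookup can run column-wise, so the number of calls grows with the number of columns and not with the number of rows.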

You might get some errors if you remove them while you use the unite function. I would just remove them from the column after the fact.
df <- data_frame(a = paste0("A.", rep(1, 3)), b = " ", c = c("C.1", "C.3", " "), d = "D.4", e = "E.5")
cols <- letters[2:4]
df[, cols] <- gsub(" ", NA_character_, as.matrix(df[, cols]))
df <- tidyr::unite(df, new, cols, sep = ",")
df$new <- gsub("NA,","",df$new)

Related

Collapsing Columns in R using tidyverse with mutate, replace, and unite. Writing a function to reuse?

Data:
ID  B   C
1   NA  x
2   x   NA
3   x   x
Results:
ID  Unified
1   C
2   B
3   B_C
I'm trying to combine columns B and C using mutate and unite, but how would I scale this up so that I can reuse it for many columns (think 100+) instead of having to write out the variables each time? Or is there a built-in function that already does this?
My current solution is this:
library(tidyverse)
Data %>%
mutate(B = replace(B, B == 'x', 'B'), C = replace(C, C == 'x', 'C')) %>%
unite("Unified", B:C, na.rm = TRUE, remove= TRUE)
We may use across to loop over the columns and replace each value equal to 'x' with the column name (cur_column()):
library(dplyr)
library(tidyr)
Data %>%
mutate(across(B:C, ~ replace(., .== 'x', cur_column()))) %>%
unite(Unified, B:C, na.rm = TRUE, remove = TRUE)
-output
ID Unified
1 1 C
2 2 B
3 3 B_C
data
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA,
"x")), class = "data.frame", row.names = c(NA, -3L))
Here are a couple of options.
Using dplyr -
library(dplyr)
cols <- names(Data)[-1]
Data %>%
rowwise() %>%
mutate(Unified = paste0(cols[!is.na(c_across(B:C))], collapse = '_')) %>%
ungroup -> Data
Data
# ID B C Unified
# <int> <chr> <chr> <chr>
#1 1 NA x C
#2 2 x NA B
#3 3 x x B_C
Base R
Data$Unified <- apply(Data[cols], 1, function(x)
paste0(cols[!is.na(x)], collapse = '_'))
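Since the question asks about reuse, the base R version can be wrapped in a small function. This is only a sketch: `unify_cols` is a made-up name, and it assumes that any non-NA value counts as "present", whatever the value is:

```r
unify_cols <- function(data, cols, sep = "_") {
  # Logical matrix: TRUE where a cell holds a value, one column per name.
  present <- !is.na(data[cols])
  # For each row, keep the names of the columns that hold a value.
  unname(apply(present, 1, function(keep) paste(cols[keep], collapse = sep)))
}

Data <- data.frame(
  ID = 1:3,
  B = c(NA, "x", "x"),
  C = c("x", NA, "x"),
  stringsAsFactors = FALSE
)
Data$Unified <- unify_cols(Data, c("B", "C"))
Data$Unified
# "C" "B" "B_C"
```

For 100+ columns, the `cols` argument can be built programmatically, for example with `names(Data)[-1]` as in the answer above.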

Creating a new variable in a dataset from data within the same dataset using ifelse statements [duplicate]

For example if I have this:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
n s b
1 2 aa TRUE
2 3 bb FALSE
3 5 cc TRUE
Then how do I combine the two columns n and s into a new column named x such that it looks like this:
n s b x
1 2 aa TRUE 2 aa
2 3 bb FALSE 3 bb
3 5 cc TRUE 5 cc
Use paste.
df$x <- paste(df$n,df$s)
df
# n s b x
# 1 2 aa TRUE 2 aa
# 2 3 bb FALSE 3 bb
# 3 5 cc TRUE 5 cc
For inserting a separator:
df$x <- paste(df$n, "-", df$s)
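Note that this puts a space on each side of the dash, because paste() still inserts its default separator (a single space) between all three pieces. To get the separator on its own, pass it as sep, as in this small sketch:

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")

# "-" passed as a value: the default sep = " " is added around it.
spaced <- paste(n, "-", s)
spaced
# "2 - aa" "3 - bb" "5 - cc"

# "-" passed as the separator itself.
tight <- paste(n, s, sep = "-")
tight
# "2-aa" "3-bb" "5-cc"
```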
As already mentioned in comments by Uwe and UseR, a general solution in the tidyverse format would be to use the command unite:
library(tidyverse)
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b) %>%
unite(x, c(n, s), sep = " ", remove = FALSE)
Using dplyr::mutate:
library(dplyr)
df <- mutate(df, x = paste(n, s))
df
n s b x
1 2 aa TRUE 2 aa
2 3 bb FALSE 3 bb
3 5 cc TRUE 5 cc
Some examples with NAs and their removal using apply
n = c(2, NA, NA)
s = c("aa", "bb", NA)
b = c(TRUE, FALSE, NA)
c = c(2, 3, 5)
d = c("aa", NA, "cc")
e = c(TRUE, NA, TRUE)
df = data.frame(n, s, b, c, d, e)
paste_noNA <- function(x,sep=", ") {
gsub(", " ,sep, toString(x[!is.na(x) & x!="" & x!="NA"] ) ) }
sep=" "
df$x <- apply( df[ , c(1:6) ] , 1 , paste_noNA , sep=sep)
df
If you do not want any padding space introduced in the concatenated field, we can use paste0:
df$combField <- paste0(df$x, df$y)
This is most useful if you are planning to use the combined field as a unique ID that represents combinations of the two fields.
Instead of
paste (inserts a space between values by default),
paste0 (includes missing values as the literal string "NA") or
unite (applies one separator across all the columns),
I'd suggest an alternative that is as flexible as paste0 but more careful with NA: stringr::str_c
library(tidyverse)
# check the missing value!!
df <- tibble(
n = c(2, 2, 8),
s = c("aa", "aa", NA_character_),
b = c(TRUE, FALSE, TRUE)
)
df %>%
mutate(
paste = paste(n,"-",s,".",b),
paste0 = paste0(n,"-",s,".",b),
str_c = str_c(n,"-",s,".",b)
) %>%
# convert missing value to ""
mutate(
s_2=str_replace_na(s,replacement = "")
) %>%
mutate(
str_c_2 = str_c(n,"-",s_2,".",b)
)
#> # A tibble: 3 x 8
#> n s b paste paste0 str_c s_2 str_c_2
#> <dbl> <chr> <lgl> <chr> <chr> <chr> <chr> <chr>
#> 1 2 aa TRUE 2 - aa . TRUE 2-aa.TRUE 2-aa.TRUE "aa" 2-aa.TRUE
#> 2 2 aa FALSE 2 - aa . FALSE 2-aa.FALSE 2-aa.FALSE "aa" 2-aa.FALSE
#> 3 8 <NA> TRUE 8 - NA . TRUE 8-NA.TRUE <NA> "" 8-.TRUE
Created on 2020-04-10 by the reprex package (v0.3.0)
extra note from str_c documentation
Like most other R functions, missing values are "infectious": whenever a missing value is combined with another string the result will always be missing. Use str_replace_na() to convert NA to "NA"
There are other great answers, but in the case where you don't know the column names or the number of columns you want to concatenate beforehand, the following is useful.
df = data.frame(x = letters[1:5], y = letters[6:10], z = letters[11:15])
colNames = colnames(df) # could be any number of column names here
df$newColumn = apply(df[, colNames, drop = F], MARGIN = 1, FUN = function(i) paste(i, collapse = ""))
I'd like to also propose a method for concatenating a large/unknown number of columns. The solution proposed by Ben Ernest can be pretty slow on large datasets.
Below is my proposed solution:
# setup data.frame - Making it large for the time benchmarking
n = rep(c(2, 3, 5), 1000000)
s = rep(c("aa", "bb", "cc"), 1000000)
b = rep(c(TRUE, FALSE, TRUE), 1000000)
df = data.frame(n, s, b)
# The proposed solution:
colNames = c("n", "s") # could be any number of column names here
df$x <- do.call(paste, c(df[, colNames], sep = " ")) # paste(), not paste0(): paste0() has no sep argument and would append " " as an extra string
# running system.time on this yields:
# user system elapsed
# 1.861 0.005 1.865
# compare with alternative method:
df$x <- apply(df[, colNames, drop = F], MARGIN = 1,
FUN = function(i) paste(i, collapse = ""))
# running system.time on this yields:
# user system elapsed
# 16.127 0.147 16.304

Separate rows by matching two columns in similar pattern

I have data like this:
df1 <- data.frame(A = c("P,Q","X,Y"), B = c("P1,Q1",""), C = c("P2,Q2","X2,Y2"))
I am looking for output like this:
output <- data.frame(A = c("P","Q","X","Y"), B = c("P1","Q1","",""), C = c("P2","Q2","X2","Y2"))
I tried using separate_rows as shown below, but it does not match up the strings separated by commas across the columns.
separate_rows(df1, A, sep=",") %>%
separate_rows(B) %>%
separate_rows(C)
I like splitstackshape package for such operations,
library(splitstackshape)
cSplit(df1, splitCols = names(df1), sep = ',', direction = 'long')
# A B C
#1: P P1 P2
#2: Q Q1 Q2
You simply have to do:
library(tidyr)
separate_rows(df1, A, B, C, convert = TRUE)
Output :
A B C
1 P P1 P2
2 Q Q1 Q2
Edit if you have NA and empty strings :
data:
df1 <- data.frame(A = c("P,Q","X,Y"), B = c("P1,Q1",""), C =
c("P2,Q2","X2,Y2"))
Code:
df1 <- data.frame(lapply(df1, as.character), stringsAsFactors=FALSE)
df1[df1 == ""] <- "0,0"
df1 <- separate_rows(df1, A, B, C, convert = TRUE)
df1[df1 == "0"] <- ""
Output :
A B C
1 P P1 P2
2 Q Q1 Q2
3 X X2
4 Y Y2
An option using base R with strsplit
data.frame(lapply(df1, function(x) strsplit(as.character(x), ",")[[1]]))
# A B C
#1 P P1 P2
#2 Q Q1 Q2
Or with scan
data.frame(lapply(df1, function(x)
scan(text = as.character(x), what = "", sep=",", quiet = TRUE)))
As suggested by Gainz's answer, separate_rows(df1, A, B, C, convert = T) works really well.
However, if you do have blank cells in the dataframe then it does become harder to use, since it will give you an error about all the columns not having the same number of rows.
I suggest using a column that you know will have no blank values. Let's assume it is column A.
I would first then convert the dataframe to a tibble, and all factor columns to character columns. Then I would replace the blank cells with a string with the correct number of commas. Then separate_rows() should be able to work correctly.
Then the code will look as follows:
df1_tibble <- df1 %>%
as_tibble() %>%
mutate_if(is.factor, as.character)
df1_clean <- df1_tibble %>%
mutate(count = str_count(A, ",") + 1) %>%
mutate(temp_str = map_chr(count, ~ rep("", .x) %>% paste0(collapse = ","))) %>%
mutate_at(vars(B, C), funs(ifelse(str_length(.) == 0, temp_str, .))) %>%
select(A, B, C)
df1_clean
#> # A tibble: 2 x 3
#> A B C
#> <chr> <chr> <chr>
#> 1 P,Q P1,Q1 P2,Q2
#> 2 X,Y , X2,Y2
df1_clean %>% separate_rows(A, B, C)
#> # A tibble: 4 x 3
#> A B C
#> <chr> <chr> <chr>
#> 1 P P1 P2
#> 2 Q Q1 Q2
#> 3 X "" X2
#> 4 Y "" Y2
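The same padding idea can also be sketched in base R without choosing a reference column up front. The helper name `split_rows_padded` is made up, and the sketch assumes every column should expand to the longest split found in that row:

```r
split_rows_padded <- function(data, sep = ",") {
  # Split every cell; an empty string yields a zero-length piece.
  pieces <- lapply(data, function(col) {
    strsplit(as.character(col), sep, fixed = TRUE)
  })
  # Number of output rows each input row expands to (at least one).
  n <- pmax(1L, Reduce(pmax, lapply(pieces, lengths)))
  # Pad the shorter splits with "" so all columns line up.
  padded <- lapply(pieces, function(p) {
    unlist(Map(function(v, k) c(v, rep("", k - length(v))), p, n))
  })
  as.data.frame(padded, stringsAsFactors = FALSE)
}

df1 <- data.frame(
  A = c("P,Q", "X,Y"),
  B = c("P1,Q1", ""),
  C = c("P2,Q2", "X2,Y2"),
  stringsAsFactors = FALSE
)
res <- split_rows_padded(df1)
res
```

For the example data this gives four rows, with B empty in the rows that come from "X,Y".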

Is there an R function for appending the lists with unequal columns

I have the list below (with sublists), but the sublists are of unequal length: list "a" has 2 columns and list "b" has 3 columns.
f <- list(a=list(1,2.5,9.5),b=list("2","-true","3",4))
I need to append this list keeping references like below. For example,
COl1 COl2 COl3 Col4
a 1 false NA
b 2 true 3
As you can see above, column 1 holds a reference to the list that each row was taken from. Please advise.
1) data.table Set names on the list giving the new list fnam and then use rbindlist from data.table:
library(data.table)
fnam <- lapply(f, function(x) setNames(x, paste0("COL", seq(2, length = length(x)))))
cbind(COL1 = names(f), rbindlist(fnam , fill = TRUE))
giving:
COL1 COL2 COL3 COL4
1: a 1 false <NA>
2: b 2 true 3
2) base R This alternative uses no packages. We create a character vector out of f and then read it in using read.table.
Lines <- paste(names(f), sapply(f, paste, collapse = " "))
nc <- max(lengths(f)) + 1
col.names <- paste0("COL", seq_len(nc))
read.table(text = Lines, header = FALSE, fill = TRUE, col.names = col.names)
giving:
COL1 COL2 COL3 COL4
1 a 1 false NA
2 b 2 true 3
Use some separator not appearing in the data if the data can contain spaces.
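As a sketch of that last point, the same approach works with a tab as the separator when a value contains a space. The list `f_sp` below is made up for illustration:

```r
# Hypothetical input where one value contains a space.
f_sp <- list(a = list(1, "not true"), b = list(2, "true", "3"))

# Join the name and the values with tabs, not spaces.
Lines <- paste(names(f_sp), sapply(f_sp, paste, collapse = "\t"), sep = "\t")
nc <- max(lengths(f_sp)) + 1
col.names <- paste0("COL", seq_len(nc))

# Tell read.table() to split on tabs only; fill = TRUE pads short rows.
out <- read.table(text = Lines, header = FALSE, sep = "\t", fill = TRUE,
                  col.names = col.names, stringsAsFactors = FALSE)
out
```

With `sep = "\t"` the value "not true" stays in a single column, and the short row is filled with NA in COL4.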
One option would be to set the names of the list elements using map and specify .id as 'COL1' to create a new column from the names of 'f'. Note that map returns a list, while map_df returns a tbl_df/data.frame
1)
library(tidyverse)
f %>%
map_df(~ set_names(., paste0("COL", seq_along(.)+1)), .id = 'COL1')
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <dbl> <chr> <chr>
#1 a 1 false <NA>
#2 b 2 true 3
2) If the types are different, retype (from hablar) and then do
library(hablar)
f1 %>%
map_df(~ set_names(.x, paste0("COL", seq_along(.)+1)) %>%
map(retype), .id = 'COL1')
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <int> <chr> <int>
#1 a 1 false NA
#2 b 2 true 3
3) Or with type.convert
f1 %>%
map_df(~ map(.x, type.convert, as.is = TRUE) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1")
# A tibble: 2 x 4
# COL1 COL2 COL3 COL4
# <chr> <int> <chr> <int>
#1 a 1 false NA
#2 b 2 true 3
4) If the integer/numeric mix is causing an issue, convert to a common type, i.e. numeric
f1 %>%
map_df(~ map(.x, type.convert, as.is = TRUE) %>%
map_if(is.integer, as.numeric) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1")
5) As the types are mixed up, it may be better to do the retype after converting to a single data.frame
f %>%
map_df(~ map(.x, as.character) %>%
set_names(paste0("COL", seq_along(.x) + 1)), .id = "COL1") %>%
retype
data
f <- list(a = list(1, "false"), b = list(2, "true", "3"))
f1 <- list(a=list(1,"false"),b=list("2","true","3"))
How about another simple base R solution?
f <- list(a=list(1,2.5,9.5),b=list("2","-true","3",4))
m = matrix(NA,ncol=max(sapply(f,length)),nrow=length(f))
for(i in 1:nrow(m)) {
u = unlist(f[[i]])
m[i,1:length(u)] = u
}
your_data_frame = as.data.frame(m)

