I want to paste together multiple columns but ignore NAs.
Here's a basic working example of what the df looks like and what I'd like it to look like. Does anyone have any tips?
df <- data.frame("col1" = c("A", NA, "B", "C"),
"col2" = c(NA, NA, NA, "E"),
"col3" = c(NA, "D", NA, NA),
"col4" = c(NA, NA, NA, NA))
df_fixed <- data.frame("col" = c("A", "D", "B", "C,E"))
Using paste.
data.frame(col1=sapply(apply(df, 1, \(x) x[!is.na(x)]), paste, collapse=','))
# col1
# 1 A
# 2 D
# 3 B
# 4 C,E
Or without apply:
data.frame(col1=unname(as.list(as.data.frame(t(df))) |>
(\(x) sapply(x, \(x) paste(x[!is.na(x)], collapse=',')))()))
# col1
# 1 A
# 2 D
# 3 B
# 4 C,E
To add as a column use transform.
transform(df, colX=sapply(apply(df, 1, \(x) x[!is.na(x)]), paste, collapse=','))
# col1 col2 col3 col4 colX
# 1 A <NA> <NA> NA A
# 2 <NA> <NA> D NA D
# 3 B <NA> <NA> NA B
# 4 C E <NA> NA C,E
Note: You could also replace \(x) x[!is.na(x)] with na.omit, since its attributes vanish; see e.g. G. Grothendieck's answer.
A possible base R solution:
df2 <- data.frame(col=apply(df,1, function(x) paste0(na.omit(x), collapse = ",")))
df2
#> col
#> 1 A
#> 2 D
#> 3 B
#> 4 C,E
Use na.omit and toString. No packages are used.
data.frame(col = apply(df, 1, function(x) toString(na.omit(x))))
## col
## 1 A
## 2 D
## 3 B
## 4 C, E
Use one of these instead of the anonymous function shown if spaces in the output are a problem:
function(x) paste(na.omit(x), collapse = ",")
function(x) gsub(", ", ",", toString(na.omit(x)))
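The row-wise idea used in the answers above can be wrapped in a small reusable helper. This is only a minimal base R sketch; the name paste_na_rm and its sep argument are hypothetical and do not come from any package.

```r
# Hypothetical helper: paste the non-NA values of each row together.
paste_na_rm <- function(data, sep = ",") {
  # Convert once to a character matrix so every row is handled the same way.
  m <- as.matrix(data)
  vapply(
    seq_len(nrow(m)),
    function(i) paste(m[i, !is.na(m[i, ])], collapse = sep),
    character(1)
  )
}

df <- data.frame(col1 = c("A", NA, "B", "C"),
                 col2 = c(NA, NA, NA, "E"),
                 col3 = c(NA, "D", NA, NA),
                 col4 = c(NA, NA, NA, NA))

result <- data.frame(col = paste_na_rm(df))
result$col
# [1] "A"   "D"   "B"   "C,E"
```

A row that contains only NAs comes back as an empty string in this sketch, which may or may not be what you want.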
We may use unite, which has an na.rm argument:
library(tidyr)
library(dplyr)
df %>%
unite(col, everything(), na.rm = TRUE, sep=",")
-output
col
1 A
2 D
3 B
4 C,E
Or using base R with do.call and trimws
data.frame(col = trimws(do.call(paste, c(df, sep = ",")),
whitespace = "(?:,?NA,?)+"))
-output
col
1 A
2 D
3 B
4 C,E
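One thing worth checking with the trimws approach: the pattern is applied only at the start and the end of each pasted string, so an NA sitting between two real values appears to survive. A small sketch with a made-up one-row data frame (df_mid is a hypothetical name) compares it with row-wise removal.

```r
# Made-up example: the NA is in the middle of the row.
df_mid <- data.frame(col1 = "A", col2 = NA, col3 = "B")

pasted  <- do.call(paste, c(df_mid, sep = ","))
trimmed <- trimws(pasted, whitespace = "(?:,?NA,?)+")
trimmed
# [1] "A,NA,B"

# Row-wise removal drops the NA wherever it occurs.
rowwise <- apply(df_mid, 1, function(x) paste(na.omit(x), collapse = ","))
unname(rowwise)
# [1] "A,B"
```

If NAs can occur between values in your data, one of the na.omit based answers is probably the safer choice.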
Related
For example if I have this:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
n s b
1 2 aa TRUE
2 3 bb FALSE
3 5 cc TRUE
Then how do I combine the two columns n and s into a new column named x such that it looks like this:
n s b x
1 2 aa TRUE 2 aa
2 3 bb FALSE 3 bb
3 5 cc TRUE 5 cc
Use paste.
df$x <- paste(df$n,df$s)
df
# n s b x
# 1 2 aa TRUE 2 aa
# 2 3 bb FALSE 3 bb
# 3 5 cc TRUE 5 cc
For inserting a separator:
df$x <- paste(df$n, "-", df$s)
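Note that passing "-" as an extra value and passing it through sep give different strings. A short sketch of the difference, using the same n and s as in the question:

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")

# "-" is pasted as a value, so the default " " separator surrounds it.
as_value <- paste(n, "-", s)
as_value
# [1] "2 - aa" "3 - bb" "5 - cc"

# "-" replaces the default separator.
as_sep <- paste(n, s, sep = "-")
as_sep
# [1] "2-aa" "3-bb" "5-cc"
```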
As already mentioned in comments by Uwe and UseR, a general solution in the tidyverse format would be to use the command unite:
library(tidyverse)
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b) %>%
unite(x, c(n, s), sep = " ", remove = FALSE)
Using dplyr::mutate:
library(dplyr)
df <- mutate(df, x = paste(n, s))
df
n s b x
1 2 aa TRUE 2 aa
2 3 bb FALSE 3 bb
3 5 cc TRUE 5 cc
Some examples with NAs and their removal using apply
n = c(2, NA, NA)
s = c("aa", "bb", NA)
b = c(TRUE, FALSE, NA)
c = c(2, 3, 5)
d = c("aa", NA, "cc")
e = c(TRUE, NA, TRUE)
df = data.frame(n, s, b, c, d, e)
paste_noNA <- function(x,sep=", ") {
gsub(", " ,sep, toString(x[!is.na(x) & x!="" & x!="NA"] ) ) }
sep=" "
df$x <- apply( df[ , c(1:6) ] , 1 , paste_noNA , sep=sep)
df
We can use paste0:
df$combField <- paste0(df$x, df$y)
Use this if you do not want any padding space introduced in the concatenated field. It is especially useful if you plan to use the combined field as a unique ID that represents combinations of two fields.
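A caveat if the pasted field is meant to be a unique ID: without any separator, different pairs can produce the same string. The sketch below uses made-up values to show this; the "_" separator is an assumption and only helps if it never occurs in the data itself.

```r
x <- c("ab", "a")
y <- c("c", "bc")

# No separator: two different pairs collapse into the same ID.
id_plain <- paste0(x, y)
id_plain
# [1] "abc" "abc"

# With a separator that does not appear in the data, the pairs stay distinct.
id_sep <- paste(x, y, sep = "_")
id_sep
# [1] "ab_c" "a_bc"
```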
Instead of
paste (default spaces),
paste0 (force the inclusion of missing NA as character) or
unite (any number of columns, but a single separator for all of them),
I'd suggest an alternative as flexible as paste0 but more careful with NA: stringr::str_c
library(tidyverse)
# check the missing value!!
df <- tibble(
n = c(2, 2, 8),
s = c("aa", "aa", NA_character_),
b = c(TRUE, FALSE, TRUE)
)
df %>%
mutate(
paste = paste(n,"-",s,".",b),
paste0 = paste0(n,"-",s,".",b),
str_c = str_c(n,"-",s,".",b)
) %>%
# convert missing value to ""
mutate(
s_2=str_replace_na(s,replacement = "")
) %>%
mutate(
str_c_2 = str_c(n,"-",s_2,".",b)
)
#> # A tibble: 3 x 8
#> n s b paste paste0 str_c s_2 str_c_2
#> <dbl> <chr> <lgl> <chr> <chr> <chr> <chr> <chr>
#> 1 2 aa TRUE 2 - aa . TRUE 2-aa.TRUE 2-aa.TRUE "aa" 2-aa.TRUE
#> 2 2 aa FALSE 2 - aa . FALSE 2-aa.FALSE 2-aa.FALSE "aa" 2-aa.FALSE
#> 3 8 <NA> TRUE 8 - NA . TRUE 8-NA.TRUE <NA> "" 8-.TRUE
Created on 2020-04-10 by the reprex package (v0.3.0)
Extra note from the str_c documentation:
Like most other R functions, missing values are "infectious": whenever a missing value is combined with another string the result will always be missing. Use str_replace_na() to convert NA to "NA"
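For comparison, here is a rough base R sketch of the same two behaviours without stringr: paste0 turns a missing value into the text "NA", and the "infectious" behaviour can be imitated by blanking out any result that involved a missing input. The variable names are only illustrative.

```r
n <- c(2, 2, 8)
s <- c("aa", "aa", NA)
b <- c(TRUE, FALSE, TRUE)

# paste0() converts the missing value to the string "NA".
pasted <- paste0(n, "-", s, ".", b)
pasted
# [1] "2-aa.TRUE"  "2-aa.FALSE" "8-NA.TRUE"

# Imitate str_c(): if any input is missing, the result is missing.
any_na <- is.na(n) | is.na(s) | is.na(b)
propagated <- pasted
propagated[any_na] <- NA
propagated
# [1] "2-aa.TRUE"  "2-aa.FALSE" NA
```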
There are other great answers, but in the case where you don't know the column names or the number of columns you want to concatenate beforehand, the following is useful.
df = data.frame(x = letters[1:5], y = letters[6:10], z = letters[11:15])
colNames = colnames(df) # could be any number of column names here
df$newColumn = apply(df[, colNames, drop = FALSE], MARGIN = 1, FUN = function(i) paste(i, collapse = ""))
I'd like to also propose a method for concatenating a large/unknown number of columns. The solution proposed by Ben Ernest can be pretty slow on large datasets.
Below is my proposed solution:
# setup data.frame - Making it large for the time benchmarking
n = rep(c(2, 3, 5), 1000000)
s = rep(c("aa", "bb", "cc"), 1000000)
b = rep(c(TRUE, FALSE, TRUE), 1000000)
df = data.frame(n, s, b)
# The proposed solution:
colNames = c("n", "s") # could be any number of column names here
df$x <- do.call(paste, c(df[, colNames], sep = " ")) # paste0() has no sep argument, so paste() is used here
# running system.time on this yields:
# user system elapsed
# 1.861 0.005 1.865
# compare with alternative method:
df$x <- apply(df[, colNames, drop = FALSE], MARGIN = 1,
FUN = function(i) paste(i, collapse = ""))
# running system.time on this yields:
# user system elapsed
# 16.127 0.147 16.304
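To check that the vectorised call and the row-wise call agree, a small sketch on the three-row data frame from the question is enough. It uses paste with sep in both cases so that the outputs are comparable; timings are left out on purpose because they depend on the machine.

```r
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE))
colNames <- c("n", "s") # could be any number of column names here

# One vectorised paste() call over whole columns.
vectorised <- do.call(paste, c(df[colNames], sep = " "))

# One paste() call per row.
rowwise <- apply(df[, colNames, drop = FALSE], 1, paste, collapse = " ")

vectorised
# [1] "2 aa" "3 bb" "5 cc"
identical(vectorised, unname(rowwise))
# [1] TRUE
```

The speed difference comes from calling paste once on whole columns instead of once per row.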
I want to see whether the text column has elements outside the specified values of "a" and "b".
specified_value=c("a","b")
df=data.frame(key=c(1,2,3,4),text=c("a,b,c","a,d","1,2","a,b"))
df_out=data.frame(key=c(1,2,3,4),text=c("c","d","1,2",NA))
This is what I have tried:
df=df%>%mutate(text_vector=strsplit(text, split=","),
extra=text_vector[which(!text_vector %in% specified_value)])
But this doesn't work, any suggestions?
We can split the 'text' by the delimiter , with separate_rows, grouped by 'key', get the elements that are not in 'specified_value' with setdiff and paste them together (toString), then do a join to get the other columns in the original dataset
library(dplyr) # >= 1.0.0
library(tidyr)
df %>%
separate_rows(text) %>%
group_by(key) %>%
summarise(extra = toString(setdiff(text, specified_value))) %>%
left_join(df) %>%
mutate(extra = na_if(extra, ""))
# A tibble: 4 x 3
# key extra text
# <dbl> <chr> <chr>
#1 1 c a,b,c
#2 2 d a,d
#3 3 1, 2 1,2
#4 4 <NA> a,b
Using setdiff.
df$outside <- sapply({
x <- lapply(strsplit(df$text, ","), setdiff, specified_value)
replace(x, lengths(x) == 0, NA)},
paste, collapse=",")
df
# key text outside
# 1 1 a,b,c c
# 2 2 a,d d
# 3 3 1,2 1,2
# 4 4 a,b NA
Data:
df <- structure(list(key = c(1, 2, 3, 4), text = c("a,b,c", "a,d",
"1,2", "a,b")), class = "data.frame", row.names = c(NA, -4L))
specified_value <- c("a", "b")
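A small variation on the setdiff idea, in case a real missing value is needed for rows with nothing outside specified_value. Since paste converts NA to the string "NA", this sketch decides per row whether to paste at all. It is only one possible way to write it.

```r
df <- data.frame(key = c(1, 2, 3, 4),
                 text = c("a,b,c", "a,d", "1,2", "a,b"),
                 stringsAsFactors = FALSE)
specified_value <- c("a", "b")

df$outside <- vapply(strsplit(df$text, ",", fixed = TRUE), function(parts) {
  extra <- setdiff(parts, specified_value)
  # Return a true NA when nothing is left, otherwise the pasted leftovers.
  if (length(extra) == 0) NA_character_ else paste(extra, collapse = ",")
}, character(1))

df$outside
# [1] "c"   "d"   "1,2" NA
is.na(df$outside)
# [1] FALSE FALSE FALSE  TRUE
```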
Use stringi::stri_split_fixed:
library(stringi)
!all(stri_split_fixed("a,b", ",", simplify=TRUE) %in% specified_value) #FALSE
!all(stri_split_fixed("a,b,c", ",", simplify=TRUE) %in% specified_value) #TRUE
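The same check can be sketched in base R for the whole column at once, using strsplit with fixed = TRUE instead of stringi. The name has_outside is just illustrative.

```r
specified_value <- c("a", "b")
text <- c("a,b,c", "a,d", "1,2", "a,b")

# TRUE where at least one element is not among the specified values.
has_outside <- vapply(
  strsplit(text, ",", fixed = TRUE),
  function(parts) !all(parts %in% specified_value),
  logical(1)
)
has_outside
# [1]  TRUE  TRUE  TRUE FALSE
```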
An option using regex without splitting the data on comma :
#Collapse the specified_value in one string and remove from text
df$text1 <- gsub(paste0(specified_value, collapse = "|"), '', df$text)
#Remove extra commas
df$text1 <- gsub('(?<![a-z0-9]),', '', df$text1, perl = TRUE)
df
# key text text1
#1 1 a,b,c c
#2 2 a,d d
#3 3 1,2 1,2
#4 4 a,b
I have a dataset that looks like this:
Col1 Col2 Col3 Col4 Col5
A B 4 5 7
G H 5 6 NA
H I NA 9 8
K F 9 NA NA
E L NA 8 9
H I 1 0 10
How do I apply the na.fill() function to all the columns after Col2?
If I were to do it individually, it would be something like this:
df$Col3<-na.fill(df$Col3, c(NA, "extend", NA))
df$Col4<-na.fill(df$Col4, c(NA, "extend", NA))
df$Col5<-na.fill(df$Col5, c(NA, "extend", NA))
The problem is that my actual dataframe has over 100 columns. Is there a quick way to apply this function to all the columns after the first 2?
na.fill does handle multiple columns. Really no need to use lapply, mutate, etc. Just replace the relevant columns with the result of running na.fill on those same columns. If you know what ix is then you could replace the first line with it so that in this example we could alternately use ix <- 3:5 or ix <- -(1:2) .
library(zoo)
ix <- sapply(DF, is.numeric)
replace(DF, ix, na.fill(DF[ix], c(NA, "extend", NA)))
giving:
Col1 Col2 Col3 Col4 Col5
1 A B 4 5.0 7.0
2 G H 5 6.0 7.5
3 H I 7 9.0 8.0
4 K F 9 8.5 8.5
5 E L 5 8.0 9.0
6 H I 1 0.0 10.0
Note that you could alternately use na.approx:
replace(DF, ix, na.approx(DF[ix], na.rm = FALSE))
Note
Lines <- "Col1 Col2 Col3 Col4 Col5
A B 4 5 7
G H 5 6 NA
H I NA 9 8
K F 9 NA NA
E L NA 8 9
H I 1 0 10"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)
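To illustrate what the interpolation does to the interior NAs, here is a rough base R sketch without zoo. It uses approx from the stats package; interpolate_na is a hypothetical helper name, and the sketch assumes every column has at least two non-missing values.

```r
DF <- data.frame(Col1 = c("A", "G", "H", "K", "E", "H"),
                 Col2 = c("B", "H", "I", "F", "L", "I"),
                 Col3 = c(4, 5, NA, 9, NA, 1),
                 Col4 = c(5, 6, 9, NA, 8, 0),
                 Col5 = c(7, NA, 8, NA, 9, 10))

# Hypothetical helper: linearly interpolate NAs between known values.
interpolate_na <- function(v) {
  ok <- !is.na(v)
  # rule = 1 keeps NA for positions before the first or after the last known value.
  approx(x = which(ok), y = v[ok], xout = seq_along(v), rule = 1)$y
}

ix <- 3:5 # all columns after the first two
DF[ix] <- lapply(DF[ix], interpolate_na)

DF$Col3
# [1] 4 5 7 9 5 1
DF$Col5
# [1]  7.0  7.5  8.0  8.5  9.0 10.0
```

For this input the interpolated values match the output shown above; for anything beyond a quick check, na.fill or na.approx is the more convenient tool.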
The mutate_-family of functions in the dplyr package would do the trick.
There are a few ways to do this. Some may work better than others depending on what your other columns look like. Here are three versions that would work better in different circumstances.
# Load the packages used below (zoo is called via zoo::).
library(dplyr)
library(stringr)
# Make dummy data.
df <- data.frame(
Col1 = LETTERS[1:6],
Col2 = LETTERS[7:12],
Col3 = c(4, 5, NA, 9, NA, 1),
Col4 = c(5,6,9,NA,8,0),
Col5 = c(7,NA,8,NA,9,10)
)
You can apply the na.fill function to columns specified by a vector of names. This is useful if you want to use a regular expression to select columns whose names contain certain parts.
cn <- names(df) %>%
str_subset("[345]") # Column names with 3, 4 or 5 in them.
result_1 <- df %>%
mutate_at(vars(cn),
zoo::na.fill, c(NA, 'extend', NA)
)
You can apply the na.fill function to any numeric column.
result_2 <- df %>%
mutate_if(is.numeric, # First argument is function that returns a logical vector.
zoo::na.fill, c(NA, 'extend', NA)
)
You can apply the function to columns specified in a numeric index vector.
result_3 <- df
result_3[ , 3:5] <- result_3[ , 3:5] %>% # Just replace columns 3 through 5
mutate_all(
zoo::na.fill, c(NA, 'extend', NA)
)
In this case, all three versions should have done the same thing.
all.equal(result_1, result_2) # TRUE
all.equal(result_1, result_3) # TRUE
I have the following data frame (a small subset taken from a big data frame):
gene counts
a 1,4,5
b 2,1
c 9,2,4,5
d 1,2,3
I want to get the mean of column 2 and then output it as the 3rd column. So I want something like this as my output:
gene counts avg
a 1,4,5 3.33
b 2,1 1.5
c 9,2,4,5 5
d 1,2,3 2
I tried something like this:
df <- read.table("test.txt",header=TRUE,sep="\t")
s <- strsplit(df$counts,split=",") # This creates a list with 4 elements in this case
Does this convert the values into character? How can I get the average?
Thanks
This will work:
df$mean <- sapply(strsplit(df$counts, ','), function(x) mean(as.numeric(x)))
We can loop through the list and get the mean
df$avg <- sapply(s, function(x) mean(as.numeric(x)))
df$avg
#[1] 3.333333 1.500000 5.000000 2.000000
Or using tidyverse
library(tidyverse)
df %>%
separate_rows(counts, sep = ",", convert = TRUE) %>%
group_by(gene) %>%
summarise(avg = mean(counts), counts = toString(counts))
# A tibble: 4 x 3
# gene avg counts
# <chr> <dbl> <chr>
#1 a 3.33 1, 4, 5
#2 b 1.5 2, 1
#3 c 5 9, 2, 4, 5
#4 d 2 1, 2, 3
data
df <- structure(list(gene = c("a", "b", "c", "d"), counts = c("1,4,5",
"2,1", "9,2,4,5", "1,2,3")), class = "data.frame", row.names = c(NA,
-4L))
s <- strsplit(df$counts,split=",")
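A slightly more defensive sketch of the same split-and-average idea. It wraps counts in as.character, because read.table created factors by default before R 4.0.0 and strsplit needs character input, and it uses vapply so the result is guaranteed to be one number per row. Rounding to two decimals is only there to match the desired output in the question.

```r
df <- data.frame(gene = c("a", "b", "c", "d"),
                 counts = c("1,4,5", "2,1", "9,2,4,5", "1,2,3"),
                 stringsAsFactors = FALSE)

# Split each string on commas, convert to numeric, then average.
parts <- strsplit(as.character(df$counts), ",", fixed = TRUE)
df$avg <- round(vapply(parts, function(x) mean(as.numeric(x)), numeric(1)), 2)

df$avg
# [1] 3.33 1.50 5.00 2.00
```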
A lazy approach using eval and parse
sapply(paste0("mean(c(", df$counts, "))"), function(x) eval(parse(text=x)))
# mean(c(1,4,5)) mean(c(2,1)) mean(c(9,2,4,5)) mean(c(1,2,3))
# 3.333333 1.500000 5.000000 2.000000
data
df <- read.table(text=
"gene counts
a 1,4,5
b 2,1
c 9,2,4,5
d 1,2,3",header=TRUE, stringsAsFactors=FALSE)
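Because eval(parse()) runs whatever text happens to be in the column, a narrower alternative is to let scan read the numbers only. This is a sketch of that idea rather than a drop-in replacement.

```r
counts <- c("1,4,5", "2,1", "9,2,4,5", "1,2,3")

# scan() reads the comma-separated text as numbers without evaluating it as code.
avg <- vapply(
  counts,
  function(x) mean(scan(text = x, sep = ",", quiet = TRUE)),
  numeric(1),
  USE.NAMES = FALSE
)
round(avg, 2)
# [1] 3.33 1.50 5.00 2.00
```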