Concatenate character vectors in R where some elements have 0 characters - r

I have character vectors, where some elements have 0 characters. I want to concatenate them, but ignoring these 0 elements:
x <- c("a", "b", "", "d", "", "f")
y <- c("a", "", "c", "", "e", "f")
z <- c("a", "", "c", "d", "", "f")
paste(x, y, z, sep = ":")
# This gives:
# [1] "a:a:a" "b::" ":c:c" "d::d" ":e:" "f:f:f"
# But I want this:
# "a:a:a" "b" "c:c" "d:d" "e" "f:f:f"
EDIT: The above was a simplified example; this is a better approximation (I'm concatenating comments into a single field):
x <- c("alpah beta", "better", "", "delta", "")
y <- c("alpha", "", "c", "", "fox, one")
z <- c("alpha", "", "can of worms", "delta", "")
paste(x, y, z, sep = "; ")
# Gives:
# "alpha beta; alpha; alpha" "better; ; " "; c; can of worms" "delta; ; delta" "; fox, one; "
# required
# "alpha beta; alpha; alpha" "better" "c; can of worms" "delta; delta" "fox, one"
I'd also be interested in a solution that works where the "" are replaced with NAs, but gives the same result.

You can paste0 them together which will ignore the blanks, then strsplit each character and paste them back together, collapsing with :.
sapply(strsplit(paste0(x,y,z),""),paste,collapse=":")
[1] "a:a:a" "b" "c:c" "d:d" "e" "f:f:f"
Updated example
Another approach is to use Reduce and a custom function to check for the blank elements:
Reduce(function (x,y) ifelse(x==""|y=="",paste0(x,y),paste(x,y,sep=":")),list(x,y,z))
[1] "alpah beta:alpha:alpha" "better" "c:can of worms"
[4] "delta:delta" "fox, one"

Here is an option using gsub
gsub("^:+|:+$|:(?=:)", "", paste(x, y, z, sep = ":"), perl = TRUE)
#[1] "a:a:a" "b" "c:c" "d:d" "e" "f:f:f"
Update
The above code should also work for the updated example (as the OP changed the delimiter, we are also changing it)
gsub("^; |; $|; (?=;)", "", paste(x, y, z, sep = "; "), perl = TRUE)
#[1] "alpah beta; alpha; alpha" "better"
#[3] "c; can of worms" "delta; delta"
#[5] "fox, one"
NOTE: The OP's input string in 'x' is alpah.
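The question also asks for a variant that treats NA the same as "". As a rough sketch (the helper name paste_nonempty is made up for illustration and does not come from any package), one way is to bind the vectors into a matrix and paste only the non-empty, non-NA entries of each row:

```r
# Hypothetical helper (the name is an assumption, not from any package):
# paste the non-empty, non-NA elements of each "row" across the inputs.
paste_nonempty <- function(..., sep = ":") {
  m <- cbind(...)  # character matrix, one column per input vector
  res <- apply(m, 1, function(row) {
    keep <- !is.na(row) & nzchar(row)  # drop NA and "" alike
    paste(row[keep], collapse = sep)
  })
  unname(res)
}

x <- c("a", "b", NA, "d", "", "f")
y <- c("a", "", "c", NA, "e", "f")
z <- c("a", NA, "c", "d", "", "f")
paste_nonempty(x, y, z)
# "a:a:a" "b" "c:c" "d:d" "e" "f:f:f"
```

A row in which every input is "" or NA comes back as "" with this sketch; whether that should be NA instead depends on the use case.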

Related

how to replace multiple spaces in a single string

I have a column of data with strings that have spaces - some strings have a single space and some have multiple (eg. X Y vs X Y Z). The code segment below worked well for a single space removal (making X Y to X_Y) but doesn't seem to work for multiple spaces - X Y Z becomes X_Y Z.
all_data$`Facility Name` <- str_replace(all_data$`Facility Name`, pattern = " ", replacement = "_")
You can try this in PHP:
$a = 'X Y Z T';
$r = preg_replace('/\s/', '_', $a);
var_dump($r);
And the result will be :
string(7) "X_Y_Z_T"
Use str_replace_all:
library(stringr)
all_data$`Facility Name` <- str_replace_all(all_data$`Facility Name`, pattern = " ", replacement = "_")
Example:
x <- c("A B", "A B A", "A", "A B A C D")
str_replace_all(x, pattern = " ", replacement = "_")
[1] "A_B" "A_B_A" "A" "A_B_A_C_D"
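The same first-match versus all-matches distinction exists in base R, which may help if stringr is not available. The sketch below is only an illustration; the last call, which collapses a run of spaces into one underscore, goes beyond what the question asked for.

```r
x <- c("A B", "A B A", "A", "A  B")  # the last element contains two spaces

# sub() replaces only the first match, like str_replace()
first_only <- sub(" ", "_", x, fixed = TRUE)
first_only
# "A_B" "A_B A" "A" "A_ B"

# gsub() replaces every match, like str_replace_all()
all_matches <- gsub(" ", "_", x, fixed = TRUE)
all_matches
# "A_B" "A_B_A" "A" "A__B"

# Optional: treat a run of spaces as a single separator
collapsed <- gsub(" +", "_", x)
collapsed
# "A_B" "A_B_A" "A" "A_B"
```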

For loop: How to print the remaining sequence?

I have a vector:
vector <- c("A", "B", "C")
And I want to print the following:
[1] A then B and C
[1] B then A and C
[1] C then A and B
I have been working with a for loop. However, I can't figure out how to print the sequence separated by 'and'.
for(i in vector){
print(paste(i, "then", XXX))
}
I guess something needs to be added where I wrote XXX?
You can use paste with collapse = " and " and drop the current element from vector using negative indexing with [ in your for loop.
for(i in seq_along(vector)) {
print(paste0(vector[i], " then ", paste(vector[-i], collapse = " and ")))
}
#[1] "A then B and C"
#[1] "B then A and C"
#[1] "C then A and B"
You can use setdiff to find the remaining vector and then paste with collapse= to put that whole vector together into some text:
for(i in vector){
remaining.elements.vector <- setdiff(vector, i)
remaining.elements.text <- paste(remaining.elements.vector, collapse=' and ')
print(paste(i, "then", remaining.elements.text))
}
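The two answers behave differently when the vector contains duplicates: vector[-i] removes only position i, while setdiff removes every element equal to the current one and de-duplicates the rest. A small sketch of both points, using a made-up helper name (describe_rest) that builds all lines at once instead of printing inside a loop:

```r
# Hypothetical helper (name is an assumption): one line of text per element.
describe_rest <- function(v) {
  vapply(seq_along(v), function(i) {
    paste(v[i], "then", paste(v[-i], collapse = " and "))
  }, character(1))
}

describe_rest(c("A", "B", "C"))
# "A then B and C" "B then A and C" "C then A and B"

# With duplicates the two approaches diverge:
v <- c("A", "A", "B")
by_index <- paste(v[-1], collapse = " and ")             # "A and B"
by_setdiff <- paste(setdiff(v, v[1]), collapse = " and ")  # "B"
```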

Subset string by counting specific characters

I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off each string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.
You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple of ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
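These variants differ for strings with fewer matches than requested: str_extract returns NA, regmatches drops the element entirely, and sub returns the input unchanged. The following is a sketch of a base R wrapper that returns NA in that case; the function name cut_after_n and its arguments are assumptions made for illustration.

```r
# Hypothetical wrapper (name and arguments are assumptions for illustration).
# `chars` is pasted into a character class, so it should hold plain letters.
cut_after_n <- function(s, chars = "AGN", n = 3) {
  pattern <- sprintf("^((?:[^%s]*[%s]){%d}).*$", chars, chars, n)
  hit <- grepl(pattern, s, perl = TRUE)
  out <- rep(NA_character_, length(s))  # NA where there are too few matches
  out[hit] <- sub(pattern, "\\1", s[hit], perl = TRUE)
  out
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_after_n(strings)
# "ABBSDGN" "AABSDG" "AGN" "GGG"
cut_after_n(strings, n = 4)
# "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
```

Keeping one output element per input element makes the result safe to assign back into a data frame column.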
Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
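One caveat with the which.max approach: when a string has fewer than three matching characters, the comparison is FALSE everywhere and which.max returns 1, so the string is silently cut to its first character. A possible guard, sketched here with match() and a hypothetical function name (cut_at_count), returns NA instead:

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# which.max() falls back to position 1 when no element is TRUE
fallback <- which.max(c(FALSE, FALSE, FALSE))
fallback
# 1

# Hypothetical guarded variant (name and arguments are assumptions)
cut_at_count <- function(s, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(s, ""), function(x) {
    idx <- match(n, cumsum(x %in% chars))  # first position where the count hits n
    if (is.na(idx)) NA_character_ else paste(x[seq_len(idx)], collapse = "")
  }, character(1))
}

cut_at_count(strings)
# "ABBSDGN" "AABSDG" "AGN" "GGG"
cut_at_count(strings, n = 4)
# "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
```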
Identify the positions of the pattern using gregexpr, then extract the n-th position (3) and take the substring from 1 to this n-th position using substr.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.
This is just a version of Maurits Evers' neat solution without strsplit.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})
Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, "character")
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

Handling string search and substitution in R

I am a beginner in R, used Matlab before and I have been searching around for a solution to my problem but I do not appear to find one.
I have a very large vector with text entries. Something like
CAT06
6CAT
CAT 6
DOG3
3DOG
I would like to be able to find a function such that: If an entry is found and it contains "CAT" & "6" (no matter position), substitute cat6. If an entry is found and it contains "DOG" & "3" (no matter position) substitute dog3. So the outcome should be:
cat6 cat6 cat6 dog3 dog3
Can anybody help on this? Thank you very much, find myself a bit lost!
First remove blank spaces, i.e. turn elements like "CAT 6" into "CAT6":
sp = gsub(" ", "", c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG"))
Then use some regex magic to find any combination of "CAT", "0", "6" and replace these matches with "cat6" as follows:
sp = gsub("^(?:CAT|0|6)*$", "cat6", sp)
Same here with DOG case:
sp = gsub("^(?:DOG|0|3)*$", "dog3", sp)
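A more literal reading of the question ("contains CAT and 6, no matter the position") can be sketched by testing for each substring separately and combining the results. The variable names below are illustrative only.

```r
entries <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG")

# TRUE for each entry that contains the literal substring
has <- function(pattern) grepl(pattern, entries, fixed = TRUE)

out <- entries
out[has("CAT") & has("6")] <- "cat6"
out[has("DOG") & has("3")] <- "dog3"
out
# "cat6" "cat6" "cat6" "dog3" "dog3"
```

This checks only for the presence of the two substrings, so an entry such as "CAT16" would also become "cat6"; a stricter rule would need a more specific pattern.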
The input shown in the question is ambiguous as per my comment under the question. We show how to calculate it depending on which of three assumptions was intended.
1) vector input with embedded spaces. Remove the digits and spaces ("[0-9 ]") in the first gsub, remove the non-digits ("\\D") in the second gsub, convert the latter to numeric to avoid leading zeros, and then paste the two together:
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG") # test input
paste0(gsub("[0-9 ]", "", x1), as.numeric(gsub("\\D", "", x1)))
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
2) single string. Form chars by removing all digits and scanning the result in. Then form nums by removing everything except digits and spaces and scanning the result. Finally paste these together.
x2 <- "CAT06 6CAT CAT 6 DOG3 3DOG" # test input
chars <- scan(textConnection(gsub("\\d", "", x2)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", x2)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
or if a single output string is wanted add this:
paste(y, collapse = " ")
3) vector input without embedded spaces. Reduce this to case (2) and then apply (2).
x3 <- c("CAT06", "6CAT", "CAT", "6", "DOG3", "3DOG") # test input
xx <- paste(x3, collapse = " ")
chars <- scan(textConnection(gsub("\\d", "", xx)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", xx)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
Note that this actually works for all three inputs: if we replace x3 with x1 or x2 it still works. As with (2), if a single output string is wanted, add paste(y, collapse = " ").
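The outputs above are uppercase ("CAT6"), whereas the question asked for lowercase ("cat6"). A minimal adjustment of approach (1), wrapping the letter part in tolower(), might look like this:

```r
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG")

labels <- tolower(gsub("[0-9 ]", "", x1))   # "cat" "cat" "cat" "dog" "dog"
numbers <- as.numeric(gsub("\\D", "", x1))  # 6 6 6 3 3 (leading zero dropped)
result <- paste0(labels, numbers)
result
# "cat6" "cat6" "cat6" "dog3" "dog3"
```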

suppress NAs in paste()

Regarding the bounty
Ben Bolker's paste2 solution produces a "" when the strings that are pasted contain NAs in the same position. Like this,
> paste2(c("a","b", "c", NA), c("A","B", NA, NA))
[1] "a, A" "b, B" "c" ""
The fourth element is a "" instead of an NA. I would like this,
[1] "a, A" "b, B" "c" NA
I'm offering up this small bounty for anyone who can fix this.
Original question
I've read the help page ?paste, but I don't understand how to have R ignore NAs. I do the following,
foo <- LETTERS[1:4]
foo[4] <- NA
foo
[1] "A" "B" "C" NA
paste(1:4, foo, sep = ", ")
and get
[1] "1, A" "2, B" "3, C" "4, NA"
What I would like to get,
[1] "1, A" "2, B" "3, C" "4"
I could do like this,
sub(', NA$', '', paste(1:4, foo, sep = ", "))
[1] "1, A" "2, B" "3, C" "4"
but that seems like a detour.
I know this question is many years old, but it's still the top google result for r paste na. I was looking for a quick solution to what I assumed was a simple problem, and was somewhat taken aback by the complexity of the answers. I opted for a different solution, and am posting it here in case anyone else is interested.
bar <- apply(cbind(1:4, foo), 1,
function(x) paste(x[!is.na(x)], collapse = ", "))
bar
[1] "1, A" "2, B" "3, C" "4"
In case it isn't obvious, this will work on any number of vectors with NAs in any positions.
IMHO, the advantage of this over the existing answers is legibility. It's a one-liner, which is always nice, and it doesn't rely on a bunch of regexes and if/else statements which may trip up your colleagues or future self. Erik Shilts' answer mostly shares these advantages, but assumes there are only two vectors and that only the last of them contains NAs.
My solution doesn't satisfy the requirement in your edit, because my project has the opposite requirement. However, you can easily solve this by adding a second line borrowed from 42-'s answer:
is.na(bar) <- bar == ""
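Putting the two lines together on the vectors from the bounty text gives a short end-to-end check. This is only a sketch; the variable names a, b and res are arbitrary.

```r
a <- c("a", "b", "c", NA)
b <- c("A", "B", NA, NA)

# Paste the non-NA entries of each row; an all-NA row yields ""
res <- apply(cbind(a, b), 1, function(x) paste(x[!is.na(x)], collapse = ", "))
res
# "a, A" "b, B" "c" ""

# Turn the empty results into real NAs
is.na(res) <- res == ""
res
# "a, A" "b, B" "c" NA
```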
For the purpose of a "true-NA": Seems the most direct route is just to modify the value returned by paste2 to be NA when the value is ""
paste3 <- function(...,sep=", ") {
L <- list(...)
L <- lapply(L,function(x) {x[is.na(x)] <- ""; x})
ret <-gsub(paste0("(^",sep,"|",sep,"$)"),"",
gsub(paste0(sep,sep),sep,
do.call(paste,c(L,list(sep=sep)))))
is.na(ret) <- ret==""
ret
}
val<- paste3(c("a","b", "c", NA), c("A","B", NA, NA))
val
#[1] "a, A" "b, B" "c" NA
I found a dplyr/tidyverse solution to that question, which is rather elegant in my opinion.
library(tidyr)
foo <- LETTERS[1:4]
foo[4] <- NA
df <- data.frame(foo, num = 1:4)
df %>% unite(., col = "New.Col", num, foo, na.rm=TRUE, sep = ",")
> New.Col
1: 1,A
2: 2,B
3: 3,C
4: 4
A function that follows up on Erik Shilts' answer and agstudy's comment. It generalizes the situation slightly by allowing sep to be specified and handling cases where any element (first, last, or intermediate) is NA. (It might break if there are multiple NA values in a row, or in other tricky cases ...) By the way, note that this situation is described exactly in the second paragraph of the Details section of ?paste, which indicates that at least the R authors are aware of the situation (although no solution is offered).
paste2 <- function(...,sep=", ") {
L <- list(...)
L <- lapply(L,function(x) {x[is.na(x)] <- ""; x})
gsub(paste0("(^",sep,"|",sep,"$)"),"",
gsub(paste0(sep,sep),sep,
do.call(paste,c(L,list(sep=sep)))))
}
foo <- c(LETTERS[1:3],NA)
bar <- c(NA,2:4)
baz <- c("a",NA,"c","d")
paste2(foo,bar,baz)
# [1] "A, a" "B, 2" "C, 3, c" "4, d"
This doesn't handle agstudy's suggestions of (1) incorporating the optional collapse argument; (2) making NA-removal optional by adding an na.rm argument (and setting the default to FALSE to make paste2 backward compatible with paste). If one wanted to make this more sophisticated (i.e. remove multiple sequential NAs) or faster it might make sense to write it in C++ via Rcpp (I don't know much about C++'s string-handling, but it might not be too hard -- see "convert Rcpp::CharacterVector to std::string" and "Concatenating strings doesn't work as expected" for a start ...)
As Ben Bolker mentioned the above approaches may fall over if there are multiple NAs in a row. I tried a different approach that seems to overcome this.
paste4 <- function(x, sep = ", ") {
x <- gsub("^\\s+|\\s+$", "", x)
ret <- paste(x[!is.na(x) & !(x %in% "")], collapse = sep)
is.na(ret) <- ret == ""
return(ret)
}
The second line strips out extra whitespace introduced when concatenating text and numbers.
The above code can be used to concatenate multiple columns (or rows) of a dataframe using the apply command, or repackaged to first coerce the data into a dataframe if needed.
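As an illustration of the apply usage mentioned above, here is a small sketch; the data frame and its column names are invented for the example. When apply converts a data frame with a numeric column to a character matrix, the numbers are padded to a common width, which is the whitespace the gsub line removes.

```r
# paste4() repeated from the answer above so the example runs on its own
paste4 <- function(x, sep = ", ") {
  x <- gsub("^\\s+|\\s+$", "", x)
  ret <- paste(x[!is.na(x) & !(x %in% "")], collapse = sep)
  is.na(ret) <- ret == ""
  return(ret)
}

# Invented example data: two character columns and one numeric column
df <- data.frame(a = c("a", "b", "c", NA, NA),
                 b = c("A", "B", NA, NA, NA),
                 n = c(5, 10, NA, 7, NA),
                 stringsAsFactors = FALSE)

res <- unname(apply(df, 1, paste4))  # one pasted string per row
res
# "a, A, 5" "b, B, 10" "c" "7" NA
```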
EDIT
After a few more hours thought I think the following code incorporates the suggestions above to allow specification of the collapse and na.rm options.
paste5 <- function(..., sep = " ", collapse = NULL, na.rm = F) {
if (na.rm == F)
paste(..., sep = sep, collapse = collapse)
else
if (na.rm == T) {
paste.na <- function(x, sep) {
x <- gsub("^\\s+|\\s+$", "", x)
ret <- paste(na.omit(x), collapse = sep)
is.na(ret) <- ret == ""
return(ret)
}
df <- data.frame(..., stringsAsFactors = F)
ret <- apply(df, 1, FUN = function(x) paste.na(x, sep))
if (is.null(collapse))
ret
else {
paste.na(ret, sep = collapse)
}
}
}
As above, na.omit(x) can be replaced with x[!is.na(x) & !(x %in% "")] to also drop empty strings if desired. Note that using collapse with na.rm = T returns a string without any "NA", though this could be changed by replacing the last line of code with paste(ret, collapse = collapse).
nth <- paste0(1:12, c("st", "nd", "rd", rep("th", 9)))
mnth <- month.abb
nth[4:5] <- NA
mnth[5:6] <- NA
paste5(mnth, nth)
[1] "Jan 1st" "Feb 2nd" "Mar 3rd" "Apr NA" "NA NA" "NA 6th" "Jul 7th" "Aug 8th" "Sep 9th" "Oct 10th" "Nov 11th" "Dec 12th"
paste5(mnth, nth, sep = ": ", collapse = "; ", na.rm = T)
[1] "Jan: 1st; Feb: 2nd; Mar: 3rd; Apr; 6th; Jul: 7th; Aug: 8th; Sep: 9th; Oct: 10th; Nov: 11th; Dec: 12th"
paste3(c("a","b", "c", NA), c("A","B", NA, NA), c(1,2,NA,4), c(5,6,7,8))
[1] "a, A, 1, 5" "b, B, 2, 6" "c, , 7" "4, 8"
paste5(c("a","b", "c", NA), c("A","B", NA, NA), c(1,2,NA,4), c(5,6,7,8), sep = ", ", na.rm = T)
[1] "a, A, 1, 5" "b, B, 2, 6" "c, 7" "4, 8"
You can use ifelse, a vectorized if-else construct to determine if a value is NA and substitute a blank. You'll then use gsub to strip out the trailing ", " if it isn't followed by any other string.
gsub(", $", "", paste(1:4, ifelse(is.na(foo), "", foo), sep = ", "))
Your answer is correct. There isn't a better way to do it. This issue is explicitly mentioned in the paste documentation in the Details section.
If working with df or tibbles using tidyverse, I use mutate_all or mutate_at with str_replace_na before paste or unite to avoid pasting NAs.
library(tidyverse)
new_df <- df %>%
mutate_all(~str_replace_na(., "")) %>%
mutate(combo_var = paste0(var1, var2, var3))
OR
new_df <- df %>%
mutate_at(c('var1', 'var2'), ~str_replace_na(., "")) %>%
mutate(combo_var = paste0(var1, var2))
This can be achieved in a single line.
For example:
vec<-c("A","B",NA,"D","E")
res<-paste(vec[!is.na(vec)], collapse=',' )
print(res)
[1] "A,B,D,E"
Or remove the NAs after paste with str_replace_all
data$`1` <- str_replace_all(data$`1`, "NA", "")
A variant of Joe's solution (https://stackoverflow.com/a/49201394/3831096) that respects both sep and collapse and returns NA when all values are NA is:
paste_missing <- function(..., sep=" ", collapse=NULL) {
ret <-
apply(
X=cbind(...),
MARGIN=1,
FUN=function(x) {
if (all(is.na(x))) {
NA_character_
} else {
paste(x[!is.na(x)], collapse = sep)
}
}
)
if (!is.null(collapse)) {
paste(ret, collapse=collapse)
} else {
ret
}
}
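A short usage sketch for the function above, with the definition repeated in compact form so the example runs on its own. The input vectors are the ones from the bounty text.

```r
# paste_missing() as in the answer above, condensed
paste_missing <- function(..., sep = " ", collapse = NULL) {
  ret <- apply(cbind(...), 1, function(x) {
    if (all(is.na(x))) NA_character_ else paste(x[!is.na(x)], collapse = sep)
  })
  if (!is.null(collapse)) paste(ret, collapse = collapse) else ret
}

a <- c("a", "b", "c", NA)
b <- c("A", "B", NA, NA)

by_row <- paste_missing(a, b, sep = ", ")
by_row
# "a, A" "b, B" "c" NA

collapsed <- paste_missing(a, b, sep = ", ", collapse = "; ")
collapsed
# "a, A; b, B; c; NA"
```

Note that when collapse is set, the all-NA row reappears as the literal text "NA", because the final paste converts NA to a string. Filtering the row results with na.omit before collapsing would avoid that if it is not wanted.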
Here is a solution that behaves more like paste and handles more edge cases than current solutions (empty strings, "NA" strings, more than 2 arguments, use of collapse argument...).
paste2 <- function(..., sep = " ", collapse = NULL, na.rm = FALSE){
# in default case, use paste
if(!na.rm) return(paste(..., sep = sep, collapse = collapse))
# cbind is convenient to recycle, it warns though so use suppressWarnings
dots <- suppressWarnings(cbind(...))
res <- apply(dots, 1, function(...) {
if(all(is.na(c(...)))) return(NA)
do.call(paste, as.list(c(na.omit(c(...)), sep = sep)))
})
if(is.null(collapse)) res else
paste(na.omit(res), collapse = collapse)
}
# behaves like `paste()` by default
paste2(c("a","b", "c", NA), c("A","B", NA, NA))
#> [1] "a A" "b B" "c NA" "NA NA"
# trigger desired behavior by setting `na.rm = TRUE` and `sep = ","`
paste2(c("a","b", "c", NA), c("A","B", NA, NA), sep = ",", na.rm = TRUE)
#> [1] "a,A" "b,B" "c" NA
# handles edge cases
paste2(c("a","b", "c", NA, "", "", ""),
c("a","b", "c", NA, "", "", "NA"),
c("A","B", NA, NA, NA, "", ""),
sep = ",", na.rm = TRUE)
#> [1] "a,a,A" "b,b,B" "c,c" NA "," ",," ",NA,"
Created on 2019-10-01 by the reprex package (v0.3.0)
This works for me
library(stringr)
library(dplyr) # if_else() comes from dplyr, not stringr
foo <- LETTERS[1:4]
foo[4] <- NA
foo
# [1] "A" "B" "C" NA
if_else(!is.na(foo),
str_c(1:4, str_replace_na(foo, ""), sep = ", "),
str_c(1:4, str_replace_na(foo, ""), sep = "")
)
# [1] "1, A" "2, B" "3, C" "4"
Updating Erik Shilts' solution in order to get rid of the trailing comma:
x = gsub(",$", "", paste(1:4, ifelse(is.na(foo), "", foo), sep = ","))
Then, to get rid of any remaining trailing ",", just repeat it once more:
x <- gsub(",$", "", x)

Resources