Insert vertical bar between each character of a string in R

How would I be able to insert a vertical bar in between every character of a string in R? For example, say I have a string "ABC123". How could I obtain the output to be "A|B|C|1|2|3"? If anyone could vectorize this idea for a vector of character strings, that would be great.

First, split the string into individual characters, then collapse them with | as the separator:
paste(unlist(strsplit("ABC123", "")), collapse = "|")
#[1] "A|B|C|1|2|3"
For a vector of strings, use sapply to loop through them:
mystrings = c("ABC123", "PASDP")
sapply(strsplit(mystrings, ""), paste, collapse = "|")
#[1] "A|B|C|1|2|3" "P|A|S|D|P"
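The split-and-collapse idea can be wrapped in a small helper. This is only a minimal sketch: the function name insert_bars and its sep argument are made up for illustration and are not part of base R. It uses vapply rather than sapply so that the result is always a character vector.

```r
# Hypothetical helper: insert `sep` between every character of each string.
insert_bars <- function(x, sep = "|") {
  chars <- strsplit(x, "")  # list with one vector of single characters per string
  vapply(chars, paste, character(1), collapse = sep)
}

insert_bars(c("ABC123", "PASDP"))
#[1] "A|B|C|1|2|3" "P|A|S|D|P"
```

Passing a different sep, for example insert_bars("ABC", sep = "-"), would give "A-B-C".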

Here is an option using regex
gsub("(?<=.)(?=.)", "|", "ABC123", perl = TRUE)
#[1] "A|B|C|1|2|3"
Or with more than one string
mystrings <- c("ABC123", "PASDP")
gsub("(?<=.)(?=.)", "|", mystrings, perl = TRUE)
#[1] "A|B|C|1|2|3" "P|A|S|D|P"

Related

Formatting UK Postcodes in R

I am trying to format UK postcodes that come in as a vector of different input in R.
For example, I have the following postcodes:
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
How do I write a generic code that would convert entries of the above vector into:
c("IV41 8PW","IV40 8BU","KY11 4HJ","KY1 1UU","KY4 9RW","G32 7EJ")
That is the first part of the postcode is separated from the second part of the postcode by one space and all letters are capitals.
EDIT: the second part of the postcode is always the 3 last characters (combination of a number followed by letters)
I couldn't come up with a smart regex solution so here is a split-apply-combine approach.
sapply(strsplit(sub('^(.*?)(...)$', '\\1:\\2', postcodes), ':', fixed = TRUE), function(x) {
paste0(toupper(trimws(x, whitespace = '[.\\s-]')), collapse = ' ')
})
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
The logic here is that we insert a : (or any character that is not in the data) in the string between the 1st and 2nd part. Split the string on :, remove unnecessary characters, get it in upper case and combine it in one string.
One approach:
Convert to uppercase
Extract the alphanumeric characters
Paste back together with a space before the last three characters
The code would then be:
library(stringr)
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
postcodes <- str_to_upper(postcodes)
sapply(str_extract_all(postcodes, "[:alnum:]"), function(x)paste(paste0(head(x,-3), collapse = ""), paste0(tail(x,3), collapse = "")))
# [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
You can remove everything that is not a word character with \\W (or [^[:alnum:]_]) and then insert a space before the last 3 characters by matching (.{3})$ and replacing it with a space followed by \\1.
sub("(.{3})$", " \\1", toupper(gsub("\\W+", "", postcodes)))
#sub("(...)$", " \\1", toupper(gsub("\\W+", "", postcodes))) #Alternative
#sub("(?=.{3}$)", " ", toupper(gsub("\\W+", "", postcodes)), perl=TRUE) #Alternative
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
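The same clean-then-insert idea could be packaged as a reusable function. The sketch below is an assumption-laden illustration, not the answerer's code: the name format_postcode is hypothetical, it strips everything that is not alphanumeric (so underscores are removed too, unlike \\W), and it assumes every cleaned entry has more than three characters.

```r
# Hypothetical helper: normalise postcodes to "<first part> <last three>" in upper case.
# Assumes each cleaned entry has more than three alphanumeric characters.
format_postcode <- function(x) {
  cleaned <- toupper(gsub("[^[:alnum:]]+", "", x))  # drop spaces, dots, dashes
  sub("(.{3})$", " \\1", cleaned)                   # space before the last three characters
}

postcodes <- c("IV41 8PW", "IV408BU", "kY11..4hJ", "KY1.1UU", "KY4 9RW", "G32-7EJ")
format_postcode(postcodes)
```

For the example vector this should reproduce the desired result from the question, for instance "IV40 8BU" for "IV408BU" and "KY11 4HJ" for "kY11..4hJ".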
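# Option 1 using regex:
# upper-case first, replace runs of non-word characters with one space,
# then insert a space where the two parts are still joined
res1 <- gsub(
  "(\\w+)(\\d[[:upper:]]\\w+$)",
  "\\1 \\2",
  gsub("\\W+", " ", toupper(postcodes))
)
# Option 2 using substrings:
# same clean-up, then paste everything but the last three characters
# to the last three characters
res2 <- vapply(
  trimws(gsub("\\W+", " ", toupper(postcodes))),
  function(ir) {
    paste(
      trimws(substr(ir, 1, nchar(ir) - 3)),
      substr(ir, nchar(ir) - 2, nchar(ir))
    )
  },
  character(1),
  USE.NAMES = FALSE
)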

string split and interchange the position of string in R

I have a vector called myvec. I would like to split it at _ and interchange the position. What would be the simplest way to do this?
myvec <- c("08AD09144_NACC022453", "08AD8245_NACC657970")
Result I want:
NACC022453_08AD09144, NACC657970_08AD8245
You can do this with regex capturing data in two groups and interchanging them using back reference.
myvec <- c("A1_B1", "B2_C1", "D1_A2")
sub('(\\w+)_(\\w+)', '\\2_\\1', myvec)
#[1] "B1_A1" "C1_B2" "A2_D1"
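One detail worth noting: \\w also matches the underscore, so with more than one underscore the greedy pattern above splits at the last one. If splitting at the first underscore is wanted instead, a variant like the following could be used. This is a sketch under that assumption; swap_parts is a hypothetical name.

```r
# Hypothetical helper: swap the text before and after the FIRST underscore.
# Strings without an underscore are returned unchanged.
swap_parts <- function(x) {
  sub("^([^_]*)_(.*)$", "\\2_\\1", x)
}

myvec <- c("08AD09144_NACC022453", "08AD8245_NACC657970")
swap_parts(myvec)
#[1] "NACC022453_08AD09144" "NACC657970_08AD8245"
```

With several underscores, swap_parts("a_b_c") would return "b_c_a".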
We can use strsplit from base R
sapply(strsplit(myvec, "_"), function(x) paste(x[2], x[1], sep = "_"))
#[1] "NACC022453_08AD09144" "NACC657970_08AD8245"

How to find if a string contain certain characters without considering sequence?

I'm trying to match a name using elements from another vector with R. But I don't know how to escape sequence when using grep() in R.
name <- "Cry River"
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
grep(name, string, value = TRUE)
I expect the output to be "Cry Me A River", but I don't know how to do it.
Use .* in the pattern
grep("Cry.*River", string, value = TRUE)
#[1] "Cry Me A River"
Or, if the names arrive as they are and you can't change them, you can split on whitespace and insert .* between the words:
grep(paste(strsplit(name, "\\s+")[[1]], collapse = ".*"), string, value = TRUE)
where the regex is constructed in the below fashion
strsplit(name, "\\s+")[[1]]
#[1] "Cry" "River"
paste(strsplit(name, "\\s+")[[1]], collapse = ".*")
#[1] "Cry.*River"
Here is a base R option, using grepl:
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
result <- sapply(parts, function(x) { grepl(x, string) })
string[rowSums(result) == length(parts)]
#[1] "Cry Me A River"
The strategy here is to first split the string containing the various search terms and generate an individual regex pattern for each term. In this case, we generate:
\bCry\b and \bRiver\b
Then, we iterate over each term, and using grepl we check that the term appears in each of the strings. Finally, we retain only those matches which contained all terms.
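The strategy just described could be wrapped in a function along these lines. It is a minimal sketch: match_all_words and its ignore_case argument are hypothetical names, and it assumes the search words contain no regex metacharacters.

```r
# Hypothetical helper: keep the strings that contain every word of `name`
# as a whole word, in any order.
# Assumes the words in `name` contain no regex metacharacters.
match_all_words <- function(name, strings, ignore_case = FALSE) {
  words <- strsplit(name, "\\s+")[[1]]
  patterns <- paste0("\\b", words, "\\b")
  # one logical vector per word, TRUE where that word occurs
  hits <- lapply(patterns, function(p) grepl(p, strings, ignore.case = ignore_case))
  strings[Reduce(`&`, hits)]  # TRUE only where every word matched
}

string <- c("Yesterday Once More", "Are You happy", "Cry Me A River")
match_all_words("Cry River", string)
#[1] "Cry Me A River"
```

Because each word is tested separately, a string such as "Take me to the River or I'll Cry" would also match, while "The Cryogenic River Rag" would not, since "Cry" is not a whole word there.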
We can run grepl on each word of the split string, Reduce the resulting list of logical vectors to a single logical vector, and use it to extract the matching element of 'string':
string[Reduce(`&`, lapply(strsplit(name, " ")[[1]], grepl, string))]
#[1] "Cry Me A River"
Also, instead of strsplit, we can insert the .* with sub
grep(sub(" ", ".*", name), string, value = TRUE)
#[1] "Cry Me A River"
Here's an approach using stringr. It depends on a few questions: Is order important? Is case important? Is it important to match whole words? The code below matches 'Cry' and 'River' as whole words, in any order, ignoring case.
library(stringr)
name <- "Cry River"
string <- c("Yesterday Once More",
"Are You happy",
"Cry Me A River",
"Take me to the River or I'll Cry",
"The Cryogenic River Rag",
"Crying on the Riverside")
string[str_detect(string, pattern = regex('\\bcry\\b', ignore_case = TRUE)) &
str_detect(string, regex('\\bRiver\\b', ignore_case = TRUE))]

convert digits to special format

In my data processing, I need to do the following:
#convert '7-25' to '0007 0025'
#pad 0's to make each four-digit number
digits.formatter <- function ('7-25'){.......?}
I have no clue how to do that in R. Can anyone help?
In base R, split the character string (or vector of strings) at -, convert its parts to numeric, format the parts using sprintf, and then paste them back together.
sapply(strsplit(c("7-25", "20-13"), "-"), function(x)
paste(sprintf("%04d", as.numeric(x)), collapse = " "))
#[1] "0007 0025" "0020 0013"
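For reuse, the base R steps above could be collected into a function like the one below. This is a sketch only: the name format_digit_pairs and the width argument are hypothetical, and it assumes every part between the dashes is a non-negative whole number.

```r
# Hypothetical helper: zero-pad each dash-separated number to `width` digits
# and join the parts with a space.
# Assumes every part is a non-negative whole number.
format_digit_pairs <- function(x, width = 4) {
  parts <- strsplit(x, "-", fixed = TRUE)  # split each string at the literal "-"
  fmt <- paste0("%0", width, "d")          # e.g. "%04d"
  vapply(parts,
         function(p) paste(sprintf(fmt, as.integer(p)), collapse = " "),
         character(1))
}

format_digit_pairs(c("7-25", "20-13"))
#[1] "0007 0025" "0020 0013"
```

Using as.integer here keeps the %d format valid; a different width, such as format_digit_pairs("7-25", width = 2), would give "07 25".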
A solution with stringr:
library(stringr)
digits.formatter <- function(string){
str_vec = str_split(string, "-")
output = sapply(str_vec, function(x){
str_padded = str_pad(x, width = 4, pad = "0")
paste(str_padded, collapse = " ")
})
return(output)
}
digits.formatter(c('7-25', '8-30'))
# [1] "0007 0025" "0008 0030"
The pad= argument in str_pad specifies the character to pad with, whereas width= specifies the minimum width of the padded string. The optional side= argument specifies which side to pad (it defaults to side = "left"). For example:
str_pad(1:5, width = 4, pad = "0", side = "right")
# [1] "1000" "2000" "3000" "4000" "5000"
We could do this with gsubfn
library(gsubfn)
gsubfn("(\\d+)", ~sprintf("%04d", as.numeric(x)), v1)
#[1] "0007-0025" "0020-0013"
If we don't need the -,
either use sub after the gsubfn
sub("-", " ", gsubfn("(\\d+)", ~sprintf("%04d", as.numeric(x)), v1))
#[1] "0007 0025" "0020 0013"
or directly use two capture groups in gsubfn
gsubfn("(\\d+)-(\\d+)", ~sprintf("%04d %04d", as.numeric(x), as.numeric(y)), v1)
#[1] "0007 0025" "0020 0013"
data
v1 <- c("7-25", "20-13")

Removing a group of words from a character vector

Let's say that I have a character vector of random names. I also have another character vector with a number of car makes, and I want to remove any occurrence of a car make from the original vector.
So given the vectors:
dat = c("Tonyhonda","DaveFord","Alextoyota")
car = c("Honda","Ford","Toyota","honda","ford","toyota")
I want to end up with something like below:
dat = c("Tony","Dave","Alex")
How can I remove part of a string in R?
gsub(x = dat, pattern = paste(car, collapse = "|"), replacement = "")
#[1] "Tony" "Dave" "Alex"
Just formalizing 42-'s comment above. Rather than using
car = c("Honda","Ford","Toyota","honda","ford","toyota")
you can just use:
carlist = c("Honda","Ford","Toyota")
gsub(x = dat, pattern = paste(carlist, collapse = "|"), replacement = "", ignore.case = TRUE)
#[1] "Tony" "Dave" "Alex"
That allows you to only put each word you want to exclude in the list one time.
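Put together as a function, the case-insensitive removal might look like the sketch below. The name remove_words is hypothetical, and the sketch assumes the words contain no regex metacharacters, since they are pasted straight into the pattern.

```r
# Hypothetical helper: delete every occurrence of any of `words` from `x`,
# ignoring case. Assumes the words contain no regex metacharacters.
remove_words <- function(x, words) {
  pattern <- paste(words, collapse = "|")   # e.g. "Honda|Ford|Toyota"
  gsub(pattern, "", x, ignore.case = TRUE)
}

dat <- c("Tonyhonda", "DaveFord", "Alextoyota")
remove_words(dat, c("Honda", "Ford", "Toyota"))
#[1] "Tony" "Dave" "Alex"
```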
