Split character string by forward slash or nothing - r

I want to split this vector
c("CC", "C/C")
to
[[1]]
[1] "C" "C"
[[2]]
[1] "C" "C"
My final data should look like:
c("C_C", "C_C")
Thus, I need some regex, but I haven't figured out how to handle the "no separator" case:
strsplit(c("CC", "C/C"),"|/")

You can use sub (or gsub if the pattern occurs more than once in a string) to replace either nothing or a forward slash with an underscore directly, capturing the single word characters on either side:
sub("(\\w)(|/)(\\w)", "\\1_\\3", c("CC", "C/C"))
#[1] "C_C" "C_C"

We can split the string at every character, drop the "/", and paste the remaining characters together with "_".
sapply(strsplit(x, ""), function(v) paste0(v[v!= "/"], collapse = "_"))
#[1] "C_C" "C_C"
data
x <- c("CC", "C/C")
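The split-and-paste idea above can be wrapped in a small helper. This is only a sketch: the name join_chars and its drop/sep arguments are made up for illustration, and vapply is used in place of sapply so that the result is always a character vector.

```r
# Hypothetical helper (not part of the original answer):
# split into single characters, drop the separator, re-join with "_".
join_chars <- function(x, drop = "/", sep = "_") {
  vapply(
    strsplit(x, ""),
    function(v) paste(v[v != drop], collapse = sep),
    character(1)
  )
}

x <- c("CC", "C/C")
join_chars(x)
#[1] "C_C" "C_C"

# Longer inputs work the same way
join_chars("C/CC")
#[1] "C_C_C"
```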

We can use
v1 <- c("CC", "C/C")
lapply(strsplit(v1, "/|"), function(x) x[nzchar(x)])
Or use a regex lookaround
strsplit(v1, "(?<=[^/])(/|)", perl = TRUE)
#[[1]]
#[1] "C" "C"
#[[2]]
#[1] "C" "C"
If the final output should be a vector, then
gsub("(?<=[^/])(/|)(?=[^/])", "_", v1, perl = TRUE)
#[1] "C_C" "C_C"

Related

Remove duplicates within consecutive runs of characters

I have strings containing lots of duplicates, like this:
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
"A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")
I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:
[1] "CBC*C" "A" "*B" "A*A*A" "*C" "A"
I've successfully extracted the duplicated capitals:
library(stringr)
unlist(str_extract_all(gsub(">", "", tst), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"
but I am somewhat stuck here. My hunch is that the function which, which returns indices, might help, but I don't know how to apply it in this case.
Any ideas?
EDIT:
I wasn't that far from the solution myself - just using a negative lookahead (instead of the positive lookahead) does the trick:
str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"
[[2]]
[1] "A"
[[3]]
[1] "*" "B"
[[4]]
[1] "A" "*" "A" "*" "A"
[[5]]
[1] "*" "C"
[[6]]
[1] "A"
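To get from that list back to the expected character vector, each element can be pasted together. The sketch below swaps stringr's str_extract_all for base R's regmatches/gregexpr (my substitution, to avoid the package dependency); the pattern is the same negative lookahead.

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Drop the separators first
s <- gsub(">", "", tst, fixed = TRUE)

# Keep every character that is NOT followed by a copy of itself,
# i.e. the last character of each run
kept <- regmatches(s, gregexpr("(.)(?!\\1)", s, perl = TRUE))

# Collapse each list element into one string
res <- vapply(kept, paste, character(1), collapse = "")
res
#[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
```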
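We can use gsub to collapse each run of a repeated "X>" unit (output shown for the first element of tst):
gsub("([A-Z*]>)\\1+", "\\1", tst)
#[1] "C>B>C>*>C"
To get the second result, remove the > afterwards. Note that > is left unescaped in the inner pattern, because \\> is a word-boundary assertion in R's default regex engine:
gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst), fixed = TRUE)
#[1] "CBC*C"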
Based on the OP's comments below, maybe remove the > first and then collapse the runs:
gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
Another way to get CBC*C could be using 2 groups and using group 2 in the replacement.
((.)>)\1+
Example
tst <- "C>C>C>B>B>B>B>C>C>*>*>*>*>*>C"
gsub("((.)>)\\1+", "\\2", tst)
Output
[1] "CBC*C"
For those of us allergic to regex:
paste(rle(strsplit(tst, ">")[[1]])$values, collapse = ">") # or collapse = ""
[1] "C>B>C>*>C"
...which of course fails if runs of lowercase letters are supposed to be kept, as in "A>A>a>a>A>A", because rle collapses those runs too.
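The rle one-liner above only handles the first string (note the [[1]]). A vectorised variant might look like the following; dedupe_runs is a hypothetical name, and like the original it collapses runs of any token, not just upper-case letters and "*".

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Hypothetical helper: split on the separator, keep one value per run,
# then paste the run values back together.
dedupe_runs <- function(x, sep = ">", collapse = "") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(rle(p)$values, collapse = collapse),
         character(1))
}

dedupe_runs(tst)
#[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"

# Keep the separators instead
dedupe_runs(tst[1], collapse = ">")
#[1] "C>B>C>*>C"
```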
A somewhat universal base R approach without regexps.
The idea is to reduce the string to its characters and then keep a character only when it differs from the one that follows it (which is what makes this different from unique):
tst <- "C>C>C>B>B>B>B>C>C>*>*>*>*>*>C"
st <- paste(unlist(strsplit(tst,">")),collapse="")
#[1] "CCCBBBBCC*****C"
paste(unlist(sapply(seq_len(nchar(st)), function(x) {
  # keep a character only if the next one differs
  if (substr(st, x, x) != substr(st, x + 1, x + 1)) substr(st, x, x)
})), collapse = "")
#[1] "CBC*C"
Oh, and if lowercase letters should be exempt from removal, use this instead:
paste(unlist(sapply(seq_len(nchar(st)), function(x) {
  a <- substr(st, x, x); b <- substr(st, x + 1, x + 1)
  # always keep lowercase letters; keep anything else only at the end of its run
  if (toupper(a) != a || a != b) a
})), collapse = "")

R: strsplit on negative lookaround

Say I need to strsplit caabacb into individual letters, except when a letter is followed by a b, thus resulting in "c" "a" "ab" "a" "cb". I tried the following line, which looks OK in a regex tester but does not work in R. What did I do wrong?
strsplit('caabacb','(?!b)',perl=TRUE)
[[1]]
[1] "c" "a" "a" "b" "a" "c" "b"
You could also add a prefix positive lookbehind that matches any character (?<=.). The positive lookbehind (?<=.) would split the string at every character (without removal of characters), but the negative lookahead (?!b) excludes splits where a character is followed by a b:
strsplit('caabacb', '(?<=.)(?!b)', perl = TRUE)
#> [[1]]
#> [1] "c" "a" "ab" "a" "cb"
strsplit() probably needs something to split. You could insert e.g. a ";" with gsub().
strsplit(gsub("(?!^.|b|\\b)", ";", "caabacb", perl=TRUE), ";", perl=TRUE)
# [[1]]
# [1] "c" "a" "ab" "a" "cb"
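The original attempt probably fails because (?!b) already matches with zero length at the very start of the input, and strsplit appears to consume a single character whenever the match is empty. A different technique that avoids this is to match the tokens instead of splitting between them. The sketch below uses base R's regmatches/gregexpr with the pattern ".b?" (any character, optionally followed by a b); this is my own alternative, not taken from the answers above.

```r
x <- "caabacb"

# Match tokens directly: one character plus an optional trailing "b"
tokens <- regmatches(x, gregexpr(".b?", x, perl = TRUE))
tokens
#[[1]]
#[1] "c"  "a"  "ab" "a"  "cb"
```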

Extract substring after the final dot

I want to implement a regex to extract the substring after the final dot.
For example,
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
gsub(".*(\\..*)$", "\\1", a)
The code returns
".d" ".e" "c" ".e" ".z"
How do I modify the code to get
"d" "e" "" "e" "z"
That is to say, if the string contains a dot, return the part after the last dot (without the dot itself); if the string doesn't contain a dot, return "".
Here is a way to do this using sub without capture groups. We can try replacing all content up to and including the final dot with empty string.
a = c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")
sub(".*\\.", "", a)
[1] "d" "e" "c" "e" "z"
If you want to return empty string should the input have no dot, then we can use ifelse with grepl:
input <- "Hello World!"
output <- ifelse(grepl("\\.", input), sub(".*\\.", "", input), "")
The reason for the verbose code above is that sub, by default, returns the original string when no match is found. In your case, you want different behavior.
Put the \\. outside the capture group, since you don't want the dot in the result:
sub(".*\\.(.*)", "\\1", a)
#[1] "d" "e" "c" "e" "z"
This will capture everything after the last dot.
For strings where we have no dots, we could check for it using grepl and then extract
ifelse(grepl("\\.", a), sub(".*\\.(.*)", "\\1", a), "")
#[1] "d" "e" "" "e" "z"
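If a single call is preferred over ifelse plus grepl, one possible alternative is to add a second branch to the pattern that matches a whole string containing no dot. This is a sketch of a different technique, not what the answers above use: the pattern removes either an entire dot-free string or everything up to and including the last dot.

```r
a <- c("a.b.c.d", "e.b.e", "c", "f.d.e", "a.e.b.g.z")

# "^[^.]*$" : the whole string, if it contains no dot  -> replaced by ""
# ".*\\."   : otherwise, everything up to the last dot -> replaced by ""
res <- sub("^[^.]*$|.*\\.", "", a)
res
#[1] "d" "e" ""  "e" "z"
```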

Why does asterisk wildcard fail with sub() command? [r]

When using the sub() function in R, how do we use an asterisk wildcard to replace all characters after (or before) an indicator?
If we want to remove an underscore and all arbitrary text afterward:
x <- c("a_101", "a_275", "b_133", "b_277")
The following code removes nothing:
sub(pattern = "_*", replacement = "", x = x)
[1] "a_101" "a_275" "b_133" "b_277"
Desired output:
"a" "a" "b" "b"
Why does the wildcard fail?
If using sub, you have to specify everything you want to replace, and what you want to replace it with. Here I've done that using a capture group for the letter of interest.
sub('([a-z])_\\d+', replacement = '\\1', x)
[1] "a" "a" "b" "b"
Using the wild card will work too.
sub('([a-z])_.*', replacement = '\\1', x)
[1] "a" "a" "b" "b"
And finally more along the lines of what you were thinking:
sub('_.*', replacement = "", x)
[1] "a" "a" "b" "b"
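As for why the original pattern removes nothing: in a regular expression, * is a quantifier ("zero or more of the preceding item"), not a shell-style wildcard. So "_*" means "zero or more underscores", and it is satisfied by the empty string at the very start of each element. The short check below illustrates this with regexpr; the variable m is just for illustration.

```r
x <- c("a_101", "a_275", "b_133", "b_277")

# Where does "_*" match, and how long is the match?
m <- regexpr("_*", x)
as.integer(m)             # start position of the match in each string
#[1] 1 1 1 1
attr(m, "match.length")   # the match is empty, so sub() replaces nothing
#[1] 0 0 0 0

# "." is the "any character" item; ".*" is the regex analogue of a wildcard
sub("_.*", "", x)
#[1] "a" "a" "b" "b"
```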

extract text from alphanumeric vector in R

I have data like the below and need to extract the text that comes before any number. It would be even better if the text and the numbers could be separated.
df<-c("axz123","bww2","c334")
output
"axz", "bww", "c"
or
"axz","bww","c"
"123","2","334"
We can do:
df <- c("axz123","bww2","c334")
gsub("\\d+", "", df)
#[1] "axz" "bww" "c"
gsub("(\\D+)", "", df)
#[1] "123" "2" "334"
For your other example:
df <- "BAILEYS IRISH CREAM 1.75 LITERS REGULAR_NOT FLAVORED"
gsub("\\d.*", "", df)
#[1] "BAILEYS IRISH CREAM "
gsub("[A-Z_ ]*", "", df)
#[1] "1.75"
We can use [:alpha:] to match the alphabetic characters, and combine this with gsub() and a negation to remove all characters that are not alphabetic:
gsub("[^[:alpha:]]", "", df)
#[1] "axz" "bww" "c"
To obtain only the non-alphabetic characters we can drop the negation ^:
gsub("[[:alpha:]]", "", df)
#[1] "123" "2" "334"
Using str_extract and a regex lookahead: we match one or more alphabetic characters that are followed by a digit ((?=\\d)) and extract them.
library(stringr)
str_extract(df, "[[:alpha:]]+(?=\\d)")
#[1] "axz" "bww" "c"
If we need to separate the numeric and non-numeric, strsplit can be used
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl=TRUE)
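The strsplit call above returns a list with one letters/digits pair per input. A minimal sketch of pulling that list apart into two vectors follows; the names letters_part and digits_part are made up here, and the indexing assumes every input has exactly one letters-then-digits boundary.

```r
df <- c("axz123", "bww2", "c334")

# Split at the zero-width boundary between a non-digit and a digit
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl = TRUE)
lst
#[[1]]
#[1] "axz" "123"
#
#[[2]]
#[1] "bww" "2"
#
#[[3]]
#[1] "c"   "334"

# First piece of each element is the text, second piece is the number
letters_part <- vapply(lst, `[`, character(1), 1)
digits_part  <- vapply(lst, `[`, character(1), 2)
letters_part
#[1] "axz" "bww" "c"
digits_part
#[1] "123" "2"   "334"
```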

Resources