Character "|" in strsplit function (vertical bar / pipe) - r

I was curious about:
> strsplit("ty,rr", split = ",")
[[1]]
[1] "ty" "rr"
> strsplit("ty|rr", split = "|")
[[1]]
[1] "t" "y" "|" "r" "r"
Why don't I get c("ty","rr") from strsplit("ty|rr", split="|")?

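It's because the split argument is interpreted as a regular expression, and | is a special character in a regex: it is the alternation operator. The pattern "|" therefore means "the empty string or the empty string", which matches at every position, so strsplit breaks the input into single characters.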
To get round this, you have two options:
Option 1: Escape the |, i.e. split = "\\|"
strsplit("ty|rr", split = "\\|")
[[1]]
[1] "ty" "rr"
Option 2: Specify fixed = TRUE:
strsplit("ty|rr", split = "|", fixed = TRUE)
[[1]]
[1] "ty" "rr"
Please also note the See Also section of ?strsplit, which tells you to read ?"regular expression" for details of the pattern specification.
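When the delimiter is only known at run time, fixed = TRUE avoids having to escape it at all. A minimal sketch, assuming a hypothetical helper named split_literal (it is not part of base R):

```r
# Hypothetical helper: always treat the delimiter as a literal string,
# never as a regular expression.
split_literal <- function(x, delim) {
  strsplit(x, split = delim, fixed = TRUE)
}

split_literal("ty|rr", "|")[[1]]
# [1] "ty" "rr"
split_literal("1+2+3", "+")[[1]]
# [1] "1" "2" "3"
```

The same call works for any other regex metacharacter, because the regex engine is bypassed entirely.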

Related

find the longest consecutive set of vowels in a string

Trying to find the longest consecutive set of vowels in a string.
I have a very long string.
random_letters <- readr::read_lines("https://byuistats.github.io/M335/data/randomletters.txt")
I have stripped the string of all spaces and periods.
random_letters %>% str_replace_all(., fixed(" "), "") %>% str_replace_all('\\.', '')
I am now trying to find every single time a vowel, or combination of vowels occurs, and then identify the longest one.
so if the string looked like
string <- c("abcnduakngoaibhui")
the output would be
"a" "ua" "oai" "ui"
In base R you could use strsplit:
strsplit(string, '[^aeiou]+')
[[1]]
[1] "a" "ua" "oai" "ui"
You could also use str_extract_all from the stringr package:
stringr::str_extract_all(string, "[aeiou]+")
[[1]]
[1] "a" "ua" "oai" "ui"
Both return a list because they are vectorized, meaning that string can be a vector:
string <- c("abcnduakngoaibhui", "aeityuioaiii")
strsplit(string, '[^aeiou]+')
[[1]]
[1] "a" "ua" "oai" "ui"
[[2]]
[1] "aei" "uioaiii"
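The question also asks for the longest run, which the calls above do not pick out. A minimal base R sketch, assuming that ties go to the first run found and that a string with no vowels should give NA (the names runs and longest are only illustrative):

```r
string <- c("abcnduakngoaibhui", "aeityuioaiii")

# Extract every run of consecutive vowels, one list element per input string.
runs <- regmatches(string, gregexpr("[aeiou]+", string))

# Keep the longest run of each string; which.max() returns the first maximum.
longest <- vapply(runs, function(r) {
  if (length(r) == 0) return(NA_character_)
  r[which.max(nchar(r))]
}, character(1), USE.NAMES = FALSE)

longest
# [1] "oai"     "uioaiii"
```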

Split string after comma without trailing whitespace

As the title already says, I want to split this string
strsplit(c("aaa,aaa", "bbb, bbb", "ddd , ddd"), ",")
to that
[[1]]
[1] "aaa" "aaa"
[[2]]
[1] "bbb, bbb"
[[3]]
[1] "ddd , ddd"
Thus, the regular expression has to consider that no whitespace should occur after the comma. Could be a dupe, but was not able to find a solution by googling.
Since no whitespace may occur after the comma, use a negative lookahead assertion:
> strsplit(c("aaa,aaa", "bbb, bbb", "ddd , ddd"), ",(?!\\s)", perl = TRUE)
[[1]]
[1] "aaa" "aaa"
[[2]]
[1] "bbb, bbb"
[[3]]
[1] "ddd , ddd"
,(?!\\s) matches a comma only if it is not followed by a whitespace character. Lookaheads need perl = TRUE.
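A quick check of the lookahead, as a sketch: \\s covers tabs as well as spaces, so a comma followed by a tab is left alone too. The input vector x is made up for illustration.

```r
x <- c("a,b", "a, b", "a,\tb", "a,b, c")

# Split only on commas that are not followed by whitespace.
parts <- strsplit(x, ",(?!\\s)", perl = TRUE)

# Number of pieces per input string.
lengths(parts)
# [1] 2 1 1 2
```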
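Just to provide an alternative using (*SKIP)(*FAIL):
pattern <- ",\\s(*SKIP)(*FAIL)|,"
data <- c("aaa,aaa", "bbb, bbb", "ddd , ddd")
strsplit(data, pattern, perl = TRUE)
The first alternative consumes any comma that is followed by whitespace and then discards that match, so only the remaining commas are split on. This yields the same as above.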

Split character string by forward slash or nothing

I want to split this vector
c("CC", "C/C")
to
[[1]]
[1] "C" "C"
[[2]]
[1] "C" "C"
My final data should look like:
c("C_C", "C_C")
So I need some regex, but I haven't found how to handle the "nothing" part:
strsplit(c("CC", "C/C"),"|/")
You can use sub (or gsub if the pattern occurs more than once in your string) to replace either nothing or a forward slash with an underscore directly, capturing the word character on either side:
sub("(\\w)(|/)(\\w)", "\\1_\\3", c("CC", "C/C"))
#[1] "C_C" "C_C"
We can split each string into single characters, drop the "/", and paste the rest together with underscores:
x <- c("CC", "C/C")
sapply(strsplit(x, ""), function(v) paste0(v[v != "/"], collapse = "_"))
#[1] "C_C" "C_C"
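With the input stored as v1 <- c("CC", "C/C"), we can split on a slash or on nothing and then drop any empty strings:
lapply(strsplit(v1, "/|"), function(x) x[nzchar(x)])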
Or use a regex lookaround
strsplit(v1, "(?<=[^/])(/|)", perl = TRUE)
#[[1]]
#[1] "C" "C"
#[[2]]
#[1] "C" "C"
If the final output should be a vector, then
gsub("(?<=[^/])(/|)(?=[^/])", "_", v1, perl = TRUE)
#[1] "C_C" "C_C"

Delete pattern in string and semicolon before and/or after (R)

In genomics, we often have to work with many strings of gene names that are separated by semicolons. I want to do pattern matching (find a specific gene name in a string), and then remove that from the string. I also need to remove any semicolon before or after the gene name. This toy example illustrates the problem.
s <- c("a;b;x", "a;x;b", "x;b", "x")
library(stringr)
str_replace(s, "x", "")
#[1] "a;b;" "a;;b" ";b" ""
The desired output should be.
#[1] "a;b" "a;b" "b" ""
I could do pattern matching for ;x and x; as well and that would give me the output; but that wouldn't be very efficient. We can also use gsub or the stringi package and that would be fine as well.
If x is at the start of the string, remove x and an optional ; after it; otherwise remove x and an optional ; before it. That covers all the cases listed:
str_replace(s, "^x(;?)|(;?)x", "")
# [1] "a;b" "a;b" "b" ""
We can use gsub from base R
gsub("^x;|;?x", "", s)
#[1] "a;b" "a;b" "b" ""
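Both patterns above match x wherever it appears, so a longer gene name that merely contains x would be altered as well. A minimal sketch of an exact-match alternative, assuming a hypothetical helper remove_gene that splits on the semicolons, drops the matching entries, and joins the rest (unlike str_replace, it removes every occurrence, not just the first):

```r
# Hypothetical helper: remove entries that are exactly equal to `gene`.
remove_gene <- function(s, gene) {
  parts <- strsplit(s, ";", fixed = TRUE)
  vapply(parts, function(p) paste(p[p != gene], collapse = ";"),
         character(1), USE.NAMES = FALSE)
}

# "ax;x" is an extra made-up case: "ax" contains x but is a different name.
s <- c("a;b;x", "a;x;b", "x;b", "x", "ax;x")
remove_gene(s, "x")
# [1] "a;b" "a;b" "b"   ""    "ax"
```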

How to use the strsplit function with a period

I would like to split the following string by its periods. I tried strsplit() with "." in the split argument, but did not get the result I want.
s <- "I.want.to.split"
strsplit(s, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
The output I want is to split s into 4 elements in a list, as follows.
[[1]]
[1] "I" "want" "to" "split"
What should I do?
When using a regular expression in the split argument of strsplit(), you have to escape the . as \\. or put it in a character class, [.]. Otherwise the . keeps its special meaning, "any single character".
s <- "I.want.to.split"
strsplit(s, "[.]")
# [[1]]
# [1] "I" "want" "to" "split"
But the more efficient method here is to use the fixed argument in strsplit(). Using this argument will bypass the regex engine and search for an exact match of ".".
strsplit(s, ".", fixed = TRUE)
# [[1]]
# [1] "I" "want" "to" "split"
And of course, you can see help(strsplit) for more.
You need to either place the dot inside a character class or precede it with two backslashes to escape it, since the dot has a special meaning in regex: "match any single character".
s <- 'I.want.to.split'
strsplit(s, '\\.')
# [[1]]
# [1] "I" "want" "to" "split"
Besides strsplit(), you can also use scan(). Try:
scan(what = "", text = s, sep = ".")
# Read 4 items
# [1] "I" "want" "to" "split"
