Split string after comma without trailing whitespace - r

As the title already says, I want to split this string
strsplit(c("aaa,aaa", "bbb, bbb", "ddd , ddd"), ",")
to that
[[1]]
[1] "aaa" "aaa"
[[2]]
[1] "bbb, bbb"
[[3]]
[1] "ddd , ddd"
Thus, the regular expression has to ensure that no whitespace follows the comma. This could be a dupe, but I was not able to find a solution by googling.

regular expression has to consider that no whitespace should occur after the comma
Use negative lookahead assertion:
> strsplit(c("aaa,aaa", "bbb, bbb", "ddd , ddd"), ",(?!\\s)", perl = TRUE)
[[1]]
[1] "aaa" "aaa"
[[2]]
[1] "bbb, bbb"
[[3]]
[1] "ddd , ddd"
,(?!\\s) matches , only if it's not followed by a space
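The lookahead is zero-width: it only inspects the next character and does not consume it, which is why nothing besides the comma is removed. A minimal sketch contrasting it with a consuming pattern (the variable names `consuming` and `lookahead` are only for illustration):

```r
x <- c("aaa,aaa", "bbb, bbb", "ddd , ddd")

# Consuming pattern: the non-space character after the comma is part of the
# match, so it is removed together with the comma.
consuming <- strsplit(x, ",\\S")

# Zero-width negative lookahead: only the comma itself is matched.
lookahead <- strsplit(x, ",(?!\\s)", perl = TRUE)

consuming[[1]]  # "aaa" "aa"
lookahead[[1]]  # "aaa" "aaa"
```

In both cases the second and third strings are left unsplit, because their commas are followed by a space.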

Just to provide an alternative using (*SKIP)(*FAIL):
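pattern <- ",\\s(*SKIP)(*FAIL)|,"
data <- c("aaa,aaa", "bbb, bbb", "ddd , ddd")
strsplit(data, pattern, perl = TRUE)
This yields the same as above: a comma followed by whitespace is matched and then discarded by (*SKIP)(*FAIL), so only the remaining commas are used as split points.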

Related

How to extract words with exactly one vowel

I have strings like these:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
What I'm trying to do is extract those words that have exactly one vowel. I do get the correct result with this:
library(stringr)
str_extract_all(turns, "\\b[b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\\b")
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "i" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
However, it feels cumbersome to define a consonant class. Is there a more elegant and more concise way?
We can use str_count on the words after splitting the 'turns' at the spaces
library(stringr)
lapply(strsplit(turns, "\\s+"), function(x) x[str_count(x, '[aeiou]') == 1])
Output:
#[[1]]
#[1] "him" "to" "stir" "him" "up" "now" "and"
#[[2]]
#[1] "when" "when" "him" "he" "on" "the"
#[[3]]
#[1] "yes" "it" "for" "a" "long"
#[[4]]
#[1] "it" "was"
You can use a PCRE regex with character classes containing double negation:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
rx <- "\\b[^[:^alpha:]aeiou]*[aeiou][^[:^alpha:]aeiou]*\\b"
regmatches(turns, gregexpr(rx, turns, perl=TRUE, ignore.case=TRUE))
The result is as in the question. Details:
\b - word boundary
[^[:^alpha:]aeiou]* - zero or more chars other than non-letters and aeiou chars, i.e. zero or more consonant letters
[aeiou] - a vowel
[^[:^alpha:]aeiou]* - zero or more consonant letters, as above
\b - word boundary.
An equivalent expression:
(?i)\b[^\P{L}aeiou]*[aeiou][^\P{L}aeiou]*\b
\P{L} matches any char but a letter, and (?i) is the equivalent of ignore.case=TRUE.
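To see what the double negation actually matches, here is a small sketch in base R. The test string and the variable names `double_neg` and `explicit` are made up for illustration; the second pattern is the explicit consonant class from the question.

```r
x <- "stir-2 up!"

# Double negation: "not a non-letter and not a vowel", i.e. a consonant.
double_neg <- regmatches(x, gregexpr("[^[:^alpha:]aeiou]", x, perl = TRUE))[[1]]

# The explicit consonant class from the question, for comparison.
explicit <- regmatches(x, gregexpr("[b-df-hj-np-tv-z]", x, perl = TRUE))[[1]]

double_neg                       # "s" "t" "r" "p"
identical(double_neg, explicit)  # TRUE
```

The hyphen, digit, space and exclamation mark are rejected by [:^alpha:], and the vowels by aeiou, so only consonants remain.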
Here is a base R option using strsplit + nchar + gsub
lapply(
strsplit(turns, "\\s"),
function(v) v[nchar(gsub("[^aeiou]", "", v)) == 1]
)
which gives
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
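Note that the output above has no "I" in the second element, because the class [aeiou] is lower-case only. If upper-case vowels should count too, one possible variant is to lower-case the tokens before counting. This is only a sketch, and `one_vowel_words` is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: keep whitespace-separated tokens with exactly one vowel,
# counting upper-case vowels as well.
one_vowel_words <- function(x) {
  lapply(strsplit(x, "\\s+"), function(v) {
    n_vowels <- nchar(gsub("[^aeiou]", "", tolower(v)))
    v[n_vowels == 1]   # the original (not lower-cased) tokens are returned
  })
}

res <- one_vowel_words("when , when I see him")
res[[1]]  # "when" "when" "I" "him"
```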

How to split string in R with regular expression when parts of the regular expression are to be kept in the resulting split strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
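The lookahead (?=[a-z0-9]) is what keeps the first character of the second part: it is checked but not consumed. Base strsplit has no counterpart to n = 2, so a base R version has to locate the first match itself. A rough sketch, where `split_first` is a hypothetical helper written for this example:

```r
# Hypothetical helper: split each string once, at the first match of `pattern`.
split_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # position of the first match, or -1
  len <- attr(m, "match.length")          # length of the consumed separator
  lapply(seq_along(x), function(i) {
    if (m[i] == -1) return(x[i])          # no match: keep the string whole
    c(substr(x[i], 1, m[i] - 1),          # text before the separator
      substr(x[i], m[i] + len[i], nchar(x[i])))  # text after the separator
  })
}

x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")
res <- split_first(x, ", (?=[a-z0-9])")
res[[3]]  # "ABC, DEF" "2 stems"
res[[4]]  # "DE" "other comments, and stuff"
```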

Extract string between spaces

I have this character vector:
df <-c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
[1] "AA AAAA 1B" "A BBB 1" "CC RR 1W3" "SS RGTYC 0"
and I want to extract what is between spaces.
Desired result:
[1] "AAAA" "BBB" "RR" "RGTYC"
df <- c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
lst <- strsplit(df," ")
sapply(lst, '[[', 2)
# [1] "AAAA" "BBB" "RR" "RGTYC"
Instead of splitting it first and then selecting the relevant split, you can also extract it straight away using the stringr-package:
library(stringr)
str_extract(df, "(?<=\\s)(.*)(?=\\s)")
# [1] "AAAA" "BBB" "RR" "RGTYC"
This solution uses regular expressions, and this pattern is built up like this:
(?<=\\s) checks whether there is whitespace before
(?=\\s) checks whether there is a whitespace after
(.*) extracts everything in between the white spaces
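One thing to keep in mind is that .* is greedy, so the pattern returns everything between the first and the last whitespace. That is fine for the three-field strings here. The sketch below shows the same pattern in base R and what changes with a lazy quantifier; the string `y` and the variable names are invented for the example:

```r
df <- c("AA AAAA 1B", "A BBB 1", "CC RR 1W3", "SS RGTYC 0")

# Base R counterpart of the str_extract() call (lookarounds need perl = TRUE).
middle <- regmatches(df, regexpr("(?<=\\s).*(?=\\s)", df, perl = TRUE))
middle  # "AAAA" "BBB" "RR" "RGTYC"

# With more than three fields the greedy .* runs up to the last space ...
y <- "A B C D"
greedy <- regmatches(y, regexpr("(?<=\\s).*(?=\\s)", y, perl = TRUE))   # "B C"
# ... while the lazy .*? stops at the next one.
lazy <- regmatches(y, regexpr("(?<=\\s).*?(?=\\s)", y, perl = TRUE))    # "B"
```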
Here is a gsub-based approach (from base R). We match one or more non-whitespace characters from the start (^) of the string followed by one or more whitespace characters, or (|) one or more whitespace characters followed by one or more non-whitespace characters at the end ($) of the string, and replace the match with the empty string ("").
gsub("^\\S+\\s+|\\s+\\S+$", "", df)
#[1] "AAAA" "BBB" "RR" "RGTYC"
There is also a convenient function word from stringr
stringr::word(df, 2)
#[1] "AAAA" "BBB" "RR" "RGTYC"

Extract a fixed-length character in R

I have an attribute consisting of DNA sequences and would like to translate each sequence to its amino acid names.
So I need to split each sequence into fixed-length pieces of 3 characters.
Here is the sample of the data
data=c("AATAGACGT","TGACCC","AAATCACTCTTT")
How can I extract it into:
[1] "AAT" "AGA" "CGT"
[2] "TGA" "CCC"
[3] "AAA" "TCA" "CTC" "TTT"
So far I can only find how to split a string given a certain regex as the separator
Try
strsplit(data, '(?<=.{3})', perl=TRUE)
Or
library(stringi)
stri_extract_all_regex(data, '.{1,3}')
Another solution, still one liner, but less elegant than the other ones (using lapply):
lapply(data, function(u) substring(u, seq(1, nchar(u), 3), seq(3, nchar(u),3)))
#[[1]]
#[1] "AAT" "AGA" "CGT"
#[[2]]
#[1] "TGA" "CCC"
#[[3]]
#[1] "AAA" "TCA" "CTC" "TTT"
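The substring one-liner relies on every sequence length being a multiple of 3; otherwise the start and end vectors have different lengths. If trailing leftovers can occur, a variant along these lines may be safer. It is a sketch only: `split_fixed` is a hypothetical helper, and it assumes every input string is non-empty.

```r
# Hypothetical helper: cut each string into pieces of `size` characters;
# a shorter final piece is kept. Assumes every string is non-empty.
split_fixed <- function(x, size = 3) {
  lapply(x, function(u) {
    starts <- seq(1, nchar(u), by = size)
    # cap each end position at the string length
    substring(u, starts, pmin(starts + size - 1, nchar(u)))
  })
}

res <- split_fixed(c("AATAGACGT", "TGACCC", "AATAGAC"))
res[[1]]  # "AAT" "AGA" "CGT"
res[[3]]  # "AAT" "AGA" "C"
```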
as.list(gsub("(.{3})", "\\1 ", data))
[[1]]
[1] "AAT AGA CGT "
[[2]]
[1] "TGA CCC "
[[3]]
[1] "AAA TCA CTC TTT "
or
regmatches(data, gregexpr(".{3}", data))
[[1]]
[1] "AAT" "AGA" "CGT"
[[2]]
[1] "TGA" "CCC"
[[3]]
[1] "AAA" "TCA" "CTC" "TTT"
Another:
library(gsubfn)
strapply(data, "...")

Character "|" in strsplit function (vertical bar / pipe)

I was curious about:
> strsplit("ty,rr", split = ",")
[[1]]
[1] "ty" "rr"
> strsplit("ty|rr", split = "|")
[[1]]
[1] "t" "y" "|" "r" "r"
Why don't I get c("ty","rr") from strsplit("ty|rr", split="|")?
It's because the split argument is interpreted as a regular expression, and | is a special character in a regex.
To get round this, you have two options:
Option 1: Escape the |, i.e. split = "\\|"
strsplit("ty|rr", split = "\\|")
[[1]]
[1] "ty" "rr"
Option 2: Specify fixed = TRUE:
strsplit("ty|rr", split = "|", fixed = TRUE)
[[1]]
[1] "ty" "rr"
Please also note the See Also section of ?strsplit, which tells you to read ?"regular expression" for details of the pattern specification.
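The same applies to other regex metacharacters such as ".", and a bracket expression is a third way to make "|" literal. A short sketch (the variable names are only for illustration):

```r
# "." is a metacharacter too, so it needs the same treatment as "|".
by_fixed  <- strsplit("a.b", ".", fixed = TRUE)[[1]]   # "a" "b"
by_escape <- strsplit("a.b", "\\.")[[1]]               # "a" "b"

# Inside a bracket expression "|" is an ordinary character.
by_class  <- strsplit("ty|rr", "[|]")[[1]]             # "ty" "rr"
```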
