I would like to split the following string by its periods. I tried strsplit() with "." in the split argument, but did not get the result I want.
s <- "I.want.to.split"
strsplit(s, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
The output I want is to split s into 4 elements in a list, as follows.
[[1]]
[1] "I" "want" "to" "split"
What should I do?
When using a regular expression in the split argument of strsplit(), you've got to escape the . with \\., or use a character class [.]. Otherwise . keeps its special regex meaning, "any single character".
s <- "I.want.to.split"
strsplit(s, "[.]")
# [[1]]
# [1] "I" "want" "to" "split"
But the more efficient method here is to use the fixed argument in strsplit(). Using this argument will bypass the regex engine and search for an exact match of ".".
strsplit(s, ".", fixed = TRUE)
# [[1]]
# [1] "I" "want" "to" "split"
And of course, you can see help(strsplit) for more.
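As a quick check of the points above, here is a minimal sketch in base R comparing the three working forms with the unescaped one. The variable names are only illustrative.

```r
s <- "I.want.to.split"

# Three equivalent ways to split on a literal period
by_class  <- strsplit(s, "[.]")[[1]]              # character class
by_escape <- strsplit(s, "\\.")[[1]]              # escaped dot
by_fixed  <- strsplit(s, ".", fixed = TRUE)[[1]]  # no regex at all

print(by_fixed)
print(identical(by_class, by_escape) && identical(by_escape, by_fixed))

# Unescaped, "." treats every character as a delimiter,
# so only empty strings are left over
unescaped <- strsplit(s, ".")[[1]]
print(all(unescaped == ""))

# strsplit() is vectorised: one list element per input string
many <- strsplit(c("a.b", "c.d.e"), ".", fixed = TRUE)
print(many)
```

If the delimiter is always a literal string, fixed = TRUE is usually the least error-prone choice because nothing needs escaping.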
You need to either place the dot . inside a character class or precede it with two backslashes to escape it, since in regex the dot is a special character meaning "match any single character (except newline)".
s <- 'I.want.to.split'
strsplit(s, '\\.')
# [[1]]
# [1] "I" "want" "to" "split"
Besides strsplit(), you can also use scan(). Try:
scan(what = "", text = s, sep = ".")
# Read 4 items
# [1] "I" "want" "to" "split"
I have sentences from spoken conversation and would like to identify the words that are repeated from sentence to sentence; here's some illustrative data (in reproducible format below)
df
# A tibble: 10 x 1
Orthographic
<chr>
1 "like I don't understand sorry like how old's your mom"
2 "eh sixty-one"
3 "yeah (...) yeah yeah like I mean she's not like in the risk age group but still"
4 "yeah"
5 "HH"
6 "I don't know"
7 "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks…
8 "yeah"
9 "she said you should come home probably "
10 "no and like why would you go to the airport where people have corona sit in the plane where peop…
I've had some success extracting the repeated words with a for loop, but I also get some strange results. Here's what I've been doing so far:
library(stringr)
# initialize pattern and new column `rept` in `df`:
pattern1 <- c()
df$rept <- NA
# for loop: build a pattern from the previous row and extract matches from the current row
for(i in 2:nrow(df)){
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
  df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
The results are these; result # 10 is strange/incorrect - it should be character(0). How can the code be improved so that no such strange results are obtained?
df$rept
[[1]]
[1] NA
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "" "" "" "" "" "" "" "" "" "" "you" "" "" "" "" ""
[17] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[33] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[49] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[65] "" "" "" "" "" "" "" "" "" "" "" ""
Reproducible data:
structure(list(Orthographic = c("like I don't understand sorry like how old's your mom",
"eh sixty-one", "yeah (...) yeah yeah like I mean she's not like in the risk age group but still",
"yeah", "HH", "I don't know", "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks ago and they at that time they were already like maybe you should just get on a plane and come home and like you can't just be here and and then last night they were like are you sure you don't wanna come home and I was I don't think I can and my mom said the same thing",
"yeah", "she said you should come home probably ", "no and like why would you go to the airport where people have corona sit in the plane where people have corona to get there where people have corona and then go and take it to your family"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
When you debug such regex issues concerning dynamic patterns with word boundaries, there are a lot of things to keep in mind (so as to understand how to best approach the whole issue).
First, check the patterns you get:
for(i in 2:nrow(df)) {
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
  print(pattern1[i-1])
}
Here is the list of regexps:
[1] "\\b(like|I|don't|understand|sorry|like|how|old's|your|mom)\\b"
[1] "\\b(eh|sixty-one)\\b"
[1] "\\b(yeah|(...)|yeah|yeah|like|I|mean|she's|not|like|in|the|risk|age|group|but|still)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(HH)\\b"
[1] "\\b(I|don't|know)\\b"
[1] "\\b(yeah|I|talked|to|my|grandparents|last|night|and|last|time|I|talked|to|them|it|was|like|two|weeks|ago|and|they|at|that|time|they|were|already|like|maybe|you|should|just|get|on|a|plane|and|come|home|and|like|you|can't|just|be|here|and|and|then|last|night|they|were|like|are|you|sure|you|don't|wanna|come|home|and|I|was|I|don't|think|I|can|and|my|mom|said|the|same|thing)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(she|said|you|should|come|home|probably|)\\b"
Look at the second pattern: \b(eh|sixty-one)\b. What if the first word were sixty? The \b(sixty|sixty-one)\b regex would never match sixty-one, because sixty would match first and the other alternative would not even be considered. When you use word boundaries and the alternatives may contain more than one word, always sort them by length in descending order to ensure the longest alternative is tried first. Here, you do not need to sort the alternatives because you only have single-word alternatives.
See the next pattern, containing the |(...)| alternative. It matches any three chars other than line break chars and captures them into a group. However, the string contained a (...) substring where the parentheses and dots are literal chars. To match them with a regex, you need to escape all special chars.
Next, you consider "words" to be non-whitespace chunks of chars because you use str_split(df$Orthographic[i-1], " "). This invalidates the approach with \b altogether: you need to use whitespace boundaries, (?<!\S) at the start and (?!\S) at the end, instead of \b. Moreover, since you only split with a single space, you may get empty alternatives if there are two or more consecutive spaces in the input string. You need to use the \s+ pattern here to split by one or more whitespace chars.
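The ordering issue can be reproduced with a small sketch. It uses base R with perl = TRUE so that alternatives are tried left to right; first_match is a hypothetical helper written for this illustration, not part of the question's code.

```r
words <- c("sixty", "sixty-one")

# Hypothetical helper: build a word-boundary alternation and
# return the first match in `text`
first_match <- function(alts, text) {
  pattern <- paste0("\\b(?:", paste(alts, collapse = "|"), ")\\b")
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

# Shorter alternative first: it wins, the longer one is never tried
unsorted <- first_match(words, "sixty-one")

# Longest alternative first
sorted_alts <- words[order(nchar(words), decreasing = TRUE)]
sorted <- first_match(sorted_alts, "sixty-one")

print(unsorted)
print(sorted)
```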
Next, there is a trailing space in the last but one string, and it creates an empty alternative. You need to trimws your input before splitting into tokens/words.
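To see where empty alternatives come from, here is a minimal sketch using base strsplit() rather than str_split(). The input string is made up for illustration. Note that base strsplit() drops a trailing empty token, so in this sketch it is the doubled space that produces the empty alternative.

```r
x <- "come  home probably "   # doubled space inside, trailing space at the end

# Splitting on a single space leaves an empty token between the two spaces
raw_tokens  <- strsplit(x, " ")[[1]]
raw_pattern <- paste0("\\b(", paste(raw_tokens, collapse = "|"), ")\\b")

# Trimming first and splitting on runs of whitespace gives clean tokens
clean_tokens <- strsplit(trimws(x), "\\s+")[[1]]

print(raw_tokens)
print(raw_pattern)    # note the "||": an alternative that matches the empty string
print(clean_tokens)
```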
This is what you need to do with the regex solution: add the escape.for.regex function:
## Escape for regex
escape.for.regex <- function(string) {
gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
and then use it to escape the tokens that you obtain by splitting the trimmed df$Orthographic[i-1] with the \s+ regex, apply unique to remove duplicates (which makes the pattern shorter and more efficient), and add the whitespace boundaries:
for(i in 2:nrow(df)){
pattern1[i-1] <- paste0("(?<!\\S)(?:", paste0(escape.for.regex(unique(unlist(str_split(trimws(df$Orthographic[i-1]), "\\s+")))), collapse = "|"), ")(?!\\S)")
df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
See the list of regexps:
[1] "(?<!\\S)(?:like|I|don't|understand|sorry|how|old's|your|mom)(?!\\S)"
[1] "(?<!\\S)(?:eh|sixty-one)(?!\\S)"
[1] "(?<!\\S)(?:yeah|\\(\\.\\.\\.\\)|like|I|mean|she's|not|in|the|risk|age|group|but|still)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:HH)(?!\\S)"
[1] "(?<!\\S)(?:I|don't|know)(?!\\S)"
[1] "(?<!\\S)(?:yeah|I|talked|to|my|grandparents|last|night|and|time|them|it|was|like|two|weeks|ago|they|at|that|were|already|maybe|you|should|just|get|on|a|plane|come|home|can't|be|here|then|are|sure|don't|wanna|think|can|mom|said|the|same|thing)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:she|said|you|should|come|home|probably)(?!\\S)"
Output:
> df$rept
[[1]]
NULL
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "you"
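For reference, the same steps (trim, split on whitespace, de-duplicate, escape, wrap in whitespace boundaries) can be bundled into one function. This is only a sketch in base R under stated assumptions: repeated_words and escape_regex are hypothetical names, and the escaping helper is a simplified variant rather than the exact escape.for.regex above.

```r
# Hypothetical helper: backslash-escape regex metacharacters in each token
escape_regex <- function(x) {
  gsub("([.|()\\\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

# Hypothetical helper: words of `prev` that reappear in `cur` as
# whitespace-delimited tokens
repeated_words <- function(prev, cur) {
  tokens  <- unique(strsplit(trimws(prev), "\\s+")[[1]])
  pattern <- paste0("(?<!\\S)(?:",
                    paste(escape_regex(tokens), collapse = "|"),
                    ")(?!\\S)")
  regmatches(cur, gregexpr(pattern, cur, perl = TRUE))[[1]]
}

r1 <- repeated_words("she said you should come home probably ",
                     "no and like why would you go to the airport")
r2 <- repeated_words("yeah (...) yeah", "well (...) ok")
r3 <- repeated_words("HH", "I don't know")

print(r1)
print(r2)
print(r3)
```

It could then be applied row by row, for example with mapply over df$Orthographic and a shifted copy of it.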
Depending on whether it is sufficient to identify repeated words, or also their repeat frequencies, you might want to modify the function, but here is one approach using the dplyr::lead function:
library(stringr)
library(dplyr)
# general function that identifies intersecting words from multiple strings
getRpt <- function(...){
l <- lapply(list(...), function(x) unlist(unique(
str_split(as.character(x), pattern=boundary(type="word")))))
Reduce(intersect, l)
}
df$rept <- mapply(getRpt, df$Orthographic, lead(df$Orthographic), USE.NAMES=FALSE)
I'm trying to split a string in R using strsplit and a perl regex. The string consists of various alphanumeric tokens separated by periods or hyphens, e.g. "WXYZ-AB-A4K7-01A-13B-J29Q-10". I want to split the string:
wherever a hyphen appears.
wherever a period appears.
between the second and third character of a token that is exactly 3 characters long and consists of 2 digits followed by 1 capital letter, e.g "01A" produces ["01", "A"] (but "012A", "B1A", "0A1", and "01A2" are not split).
For example, "WXYZ-AB-A4K7-01A-13B-J29Q-10" should produce ["WXYZ", "AB", "A4K7", "01", "A", "13", "B", "J29Q", "10"].
My current regex is ((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-] and it works perfectly in this online regex tester.
Furthermore, the two parts of the alternative, ((?<=[-.]\\d{2})(?=[A-Z][-.])) and [.-], both serve to split the string as intended in R, when they are used separately:
#correctly splits on periods and hyphens
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
#correctly splits tokens where a letter follows two digits
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))", perl=T)
[[1]]
[1] "WXYZ-AB-A4K7-01" "A-13" "B-J29Q-10"
But when I try and combine them using an alternative, the second regex stops working, and the string is only split on periods and hyphens:
#only second alternative is used
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
Why is this happening? Is it a problem with my regex, or with strsplit? How can I achieve the desired behavior?
Desired output:
## [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
An alternative that saves you from having to consider how the strsplit algorithm works is to use your original regex with gsub to insert a simple splitting character in all the right places, then use strsplit to do the straightforward splitting.
x <- "XYZ-02-01C-33D-2285"
strsplit(
  gsub("((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", "-", x, perl = TRUE),
  "-",
  fixed = TRUE)
#[[1]]
#[1] "XYZ" "02" "01" "C" "33" "D" "2285"
Of course, RichScriven's answer and Wiktor Stribiżew's comment are probably better since they only have one function call.
You may use a consuming version of a positive lookbehind (a match reset operator \K) to make sure strsplit works correctly in R and avoid the problem of using a negative lookbehind inside a positive one.
"(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]"
See the R demo online (and a regex demo here).
strsplit("XYZ-02-01C-33D-2285", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "XYZ" "02" "01" "C" "33" "D" "2285"
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
Here, the pattern matches:
(?<![^.-])\d{2}\K(?=[A-Z](?:[.-]|$)) - a sequence of:
(?<![^.-])\d{2} - 2 digits (\d{2}) that are not preceded with a char other than . and - (i.e. that are preceded with . or - or start of string, it is a common trick to avoid alternation inside a lookaround)
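\K - the match reset operator that makes the regex engine discard the text matched so far and go on matching the subsequent subpatterns if any
(?=[A-Z](?:[.-]|$)) - a positive lookahead that requires an uppercase letter followed by . or - or the end of string immediately to the right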
| - or
[.-] - matches . or -.
Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regexes that rely on lookbehinds may not function as expected when the lookbehind overlaps with a previous match. In my case, the hyphens between tokens were removed upon being matched, meaning that the second regex could not use them to detect the beginning of the token:
#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
^
#match + left removed
"AB-A4K7-01A-13B-J29Q-10"
#further matches found and removed
"01A-13B-J29Q-10"
#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"
#algorithm continues
"13B-J29Q-10"
This was fixed by replacing the [.-] class to detect the edges of the token in the lookbehind with a boundary anchor, as per Jota's suggestion:
> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
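The effect described above can be checked in isolation. The sketch below is only illustrative: it tests the lookbehind part on a token with and without its leading hyphen, then compares the two combined patterns from this thread.

```r
s <- "WXYZ-AB-A4K7-01A-13B-J29Q-10"
lookbehind_part <- "(?<=[-.]\\d{2})(?=[A-Z][-.])"

# With the leading hyphen still present the lookbehind can succeed ...
with_hyphen <- grepl(lookbehind_part, "-01A-", perl = TRUE)
# ... but once strsplit has consumed it, nothing is left for [-.] to match
without_hyphen <- grepl(lookbehind_part, "01A-", perl = TRUE)

print(with_hyphen)
print(without_hyphen)

# Original combined pattern: tokens like "01A" stay whole
original <- strsplit(s, "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", perl = TRUE)[[1]]
# \b does not need the consumed hyphen, so the split happens
fixed_up <- strsplit(s, "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl = TRUE)[[1]]

print(original)
print(fixed_up)
```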
In genomics, we often have to work with many strings of gene names that are separated by semicolons. I want to do pattern matching (find a specific gene name in a string), and then remove that from the string. I also need to remove any semicolon before or after the gene name. This toy example illustrates the problem.
s <- c("a;b;x", "a;x;b", "x;b", "x")
library(stringr)
str_replace(s, "x", "")
#[1] "a;b;" "a;;b" ";b" ""
The desired output should be.
#[1] "a;b" "a;b" "b" ""
I could do pattern matching for ;x and x; as well and that would give me the output; but that wouldn't be very efficient. We can also use gsub or the stringi package and that would be fine as well.
If x is at the start of the string, remove x and an optional ; after it; otherwise, remove x and an optional ; before it. That covers all the cases listed:
str_replace(s, "^x(;?)|(;?)x", "")
# [1] "a;b" "a;b" "b" ""
We can use gsub from base R
gsub("^x;|;?x", "", s)
#[1] "a;b" "a;b" "b" ""
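One caveat with both patterns is that a bare x would also match inside a longer gene name such as "ax". If that matters, a possible alternative is to split on the semicolons, drop exact matches, and paste the rest back together. This is a sketch in base R; remove_gene is a hypothetical helper name.

```r
# Hypothetical helper: drop exact occurrences of `gene` from
# semicolon-separated strings
remove_gene <- function(s, gene) {
  vapply(
    strsplit(s, ";", fixed = TRUE),
    function(parts) paste(parts[parts != gene], collapse = ";"),
    character(1)
  )
}

s <- c("a;b;x", "a;x;b", "x;b", "x")
res <- remove_gene(s, "x")
print(res)

# Names that merely contain "x" are left alone
res2 <- remove_gene("ax;x;b", "x")
print(res2)
```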
I was curious about:
> strsplit("ty,rr", split = ",")
[[1]]
[1] "ty" "rr"
> strsplit("ty|rr", split = "|")
[[1]]
[1] "t" "y" "|" "r" "r"
Why don't I get c("ty","rr") from strsplit("ty|rr", split="|")?
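It's because the split argument is interpreted as a regular expression, and | is a special character in a regex: it is the alternation operator. On its own it offers a choice between two empty patterns, which match the empty string, so the input is split between every character.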
To get round this, you have two options:
Option 1: Escape the |, i.e. split = "\\|"
strsplit("ty|rr", split = "\\|")
[[1]]
[1] "ty" "rr"
Option 2: Specify fixed = TRUE:
strsplit("ty|rr", split = "|", fixed = TRUE)
[[1]]
[1] "ty" "rr"
Please also note the See Also section of ?strsplit, which tells you to read ?"regular expression" for details of the pattern specification.
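A short sketch of the behaviour, assuming base R only: a bare "|" behaves like an empty split pattern, and a character class is a third way to get a literal pipe.

```r
x <- "ty|rr"

# A bare "|" splits into single characters, just like split = ""
same_as_empty <- identical(strsplit(x, "|"), strsplit(x, ""))
print(same_as_empty)

# A character class also makes the pipe literal
by_class <- strsplit(x, "[|]")[[1]]
print(by_class)
```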
The strsplit function in R matches a given regular expression, deletes the matches, and returns the remaining pieces of the string as a vector.
>strsplit("abc123def", "[0-9]+")
[[1]]
[1] "abc" "def"
But how should I split the string the same way using regular expression, but also retain the matches? I need something like the following.
>FUNCTION("abc123def", "[0-9]+")
[[1]]
[1] "abc" "123" "def"
Using strapply("abc123def", "[0-9]+|[a-z]+") works here, but what if the rest of the string other than the matches cannot be captured by a regular expression?
Fundamentally, it seems to me that what you want is not to split on [0-9]+ but to split on the transition between [0-9]+ and everything else. In your string, that transition is not pre-existing. To insert it, you could pre-process with gsub and back-referencing:
test <- "abc123def"
strsplit( gsub("([0-9]+)","~\\1~",test), "~" )
[[1]]
[1] "abc" "123" "def"
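The same pre-processing idea can be wrapped in a small helper. This is a sketch under stated assumptions: split_keep is a hypothetical name, the marker defaults to a control character assumed not to occur in the data, and the pattern is assumed not to rely on its own numbered back-references.

```r
# Hypothetical helper: split on `pattern` but keep the matches as pieces
split_keep <- function(x, pattern, marker = "\001") {
  # Wrap every match in the marker, then split on the marker literally
  marked <- gsub(paste0("(", pattern, ")"), paste0(marker, "\\1", marker), x)
  parts  <- strsplit(marked, marker, fixed = TRUE)[[1]]
  parts[nzchar(parts)]   # drop empty pieces from leading or adjacent matches
}

a <- split_keep("abc123def", "[0-9]+")
b <- split_keep("123abc", "[0-9]+")
print(a)
print(b)
```

Using a marker that cannot appear in the input avoids the problem of "~" already being present in the string.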
You could use lookaround assertions.
> test <- "abc123def"
> strsplit(test, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl=T)
[[1]]
[1] "abc" "123" "def"
You can use strapply from the gsubfn package.
library(gsubfn)
test <- "abc123def"
strapply(X=test,
pattern="([^[:digit:]]*)(\\d+)(.+)",
FUN=c,
simplify=FALSE)
[[1]]
[1] "abc" "123" "def"