If I have a character vector:
links <- c("http://fdsfdsfdsfsdaaa.com/t5/this/bd-p/fdsfsdfdsfscshdad/dasd",
"http://ffdsfdddddfdf.com/t5/that/bd-p/fdsfdsfsddfjfsd")
I want to extract "this" and "that" knowing that they are between "t5" and "bd-p." Totally lost on this one.
Using sub:
sub(".*t5/(.*)/bd-p.*","\\1",links)
[1] "this" "that"
Try this:
lapply(regmatches(links, regexec("t5/(.*)/bd-p", links)), '[', 2)
[[1]]
[1] "this"
[[2]]
[1] "that"
regexec combined with regmatches is good for getting subexpressions (i.e. the stuff in the parentheses). regmatches will return the whole search string and the subexpression, which is why I extract only the second element, which is the subexpression.
Related
I'm trying to split a string in R using strsplit and a perl regex. The string consists of various alphanumeric tokens separated by periods or hyphens, e.g "WXYZ-AB-A4K7-01A-13B-J29Q-10". I want to split the string:
wherever a hyphen appears.
wherever a period appears.
between the second and third character of a token that is exactly 3 characters long and consists of 2 digits followed by 1 capital letter, e.g "01A" produces ["01", "A"] (but "012A", "B1A", "0A1", and "01A2" are not split).
For example, "WXYZ-AB-A4K7-01A-13B-J29Q-10" should produce ["WXYZ", "AB", "01", "A", "13", "B", "J29Q", "10"].
My current regex is ((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-] and it works perfectly in this online regex tester.
Furthermore, the two parts of the alternative, ((?<=[-.]\\d{2})(?=[A-Z][-.])) and [.-], both serve to split the string as intended in R, when they are used separately:
#correctly splits on periods and hyphens
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
#correctly splits tokens where a letter follows two digits
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))", perl=T)
[[1]]
[1] "WXYZ-AB-A4K7-01" "A-13" "B-J29Q-10"
But when I try and combine them using an alternative, the second regex stops working, and the string is only split on periods and hyphens:
#only second alternative is used
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
Why is this happening? Is it a problem with my regex, or with strsplit? How can I achieve the desired behavior?
Desired output:
## [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
An alternative that prevents you from having to consider how the strsplit algorithm works, is to use your original regex with gsub to insert a simple splitting character in all the right places, then do use strsplit to do the straightforward splitting.
strsplit(
gsub("((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", "-", x, perl = TRUE),
"-",
fixed = TRUE)
#[[1]]
#[1] "XYZ" "02" "01" "C" "33" "D" "2285"
Of course, RichScriven's answer and Wiktor Stribiżew's comment are probably better since they only have one function call.
You may use a consuming version of a positive lookahead (a match reset operator \K) to make sure strsplit works correctly in R and avoid the problem of using a negative lookbehind inside a positive one.
"(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]"
See the R demo online (and a regex demo here).
strsplit("XYZ-02-01C-33D-2285", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "XYZ" "02" "01" "C" "33" "D" "2285"
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
Here, the pattern matches:
(?<![^.-])\d{2}\K(?=[A-Z](?:[.-]|$)) - a sequence of:
(?<![^.-])\d{2} - 2 digits (\d{2}) that are not preceded with a char other than . and - (i.e. that are preceded with . or - or start of string, it is a common trick to avoid alternation inside a lookaround)
\K - the match reset operator that makes the regex engine discard the text matched so far and go on matching the subsequent subpatterns if any
| - or
[.-] - matches . or -.
Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regex's that rely on lookbehinds may not function as expected when the lookbehind overlaps with a previous match. In my case, the hyphens between tokens were removed upon being matched, meaning that the second regex could not use them to detect the beginning of the token:
#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
^
#match + left removed
"AB-A4K7-01A-13B-J29Q-10"
#further matches found and removed
"01A-13B-J29Q-10"
#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"
#algorithm continues
"13B-J29Q-10"
This was fixed by replacing the [.-] class to detect the edges of the token in the lookbehind with a boundary anchor, as per Jota's suggestion:
> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
Say I have a character vector like:
x <- c('A__B__Mike','A__Paul','Daniel','A__B__C__Martha','A__John','A__B__C__D__Laura')
I want a vector of only the names in the last position; I guess I could do it removing the first chunk using regular expressions, but say I want to use strsplit() to split by '__':
x.list <- strsplit(x, '__')
How would I access the last subelement (the names) of each element in this list? I only know how to do it if I know the position:
sapply(x.list, "[[", 1)
But how to access the last when the position is variable? Thanks!
In any case, what would be the fastest way to extract the names out of x in the first place? Anything faster than the strsplit approach?
We can do this with base R. Either using sub
sub(".*__", "", x)
#[1] "Mike" "Paul" "Daniel" "Martha" "John" "Laura"
or with strsplit, we get the last element with tail
sapply(strsplit(x, '__'), tail, 1)
#[1] "Mike" "Paul" "Daniel" "Martha" "John" "Laura"
Or to find the position, we can use gregexpr and then extract using substring
substring(x, sapply(gregexpr("[^__]+", x), tail, 1))
#[1] "Mike" "Paul" "Daniel" "Martha" "John" "Laura"
Or with stri_extract_last
library(stringi)
stri_extract_last(x, regex="[^__]+")
#[1] "Mike" "Paul" "Daniel" "Martha" "John" "Laura"
Use word function of stringr package
library(stringr)
word(x,start = -1,sep = "\\_+")
In genomics, we often have to work with many strings of gene names that are separated by semicolons. I want to do pattern matching (find a specific gene name in a string), and then remove that from the string. I also need to remove any semicolon before or after the gene name. This toy example illustrates the problem.
s <- c("a;b;x", "a;x;b", "x;b", "x")
library(stringr)
str_replace(s, "x", "")
#[1] "a;b;" "a;;b" ";b" ""
The desired output should be.
#[1] "a;b" "a;b" "b" ""
I could do pattern matching for ;x and x; as well and that would give me the output; but that wouldn't be very efficient. We can also use gsub or the stringi package and that would be fine as well.
Remove x and optional ; after it if x is the starting character of the string otherwise remove x and optional ; before it which should cover all the cases as listed:
str_replace(s, "^x(;?)|(;?)x", "")
# [1] "a;b" "a;b" "b" ""
We can use gsub from base R
gsub("^x;|;?x", "", s)
#[1] "a;b" "a;b" "b" ""
I would like to split the following string by its periods. I tried strsplit() with "." in the split argument, but did not get the result I want.
s <- "I.want.to.split"
strsplit(s, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
The output I want is to split s into 4 elements in a list, as follows.
[[1]]
[1] "I" "want" "to" "split"
What should I do?
When using a regular expression in the split argument of strsplit(), you've got to escape the . with \\., or use a charclass [.]. Otherwise you use . as its special character meaning, "any single character".
s <- "I.want.to.split"
strsplit(s, "[.]")
# [[1]]
# [1] "I" "want" "to" "split"
But the more efficient method here is to use the fixed argument in strsplit(). Using this argument will bypass the regex engine and search for an exact match of ".".
strsplit(s, ".", fixed = TRUE)
# [[1]]
# [1] "I" "want" "to" "split"
And of course, you can see help(strsplit) for more.
You need to either place the dot . inside of a character class or precede it with two backslashes to escape it since the dot is a character of special meaning in regex meaning "match any single character (except newline)"
s <- 'I.want.to.split'
strsplit(s, '\\.')
# [[1]]
# [1] "I" "want" "to" "split"
Besides strsplit(), you can also use scan(). Try:
scan(what = "", text = s, sep = ".")
# Read 4 items
# [1] "I" "want" "to" "split"
What a strsplit function in R does is, match and delete a given regular expression to split the rest of the string into vectors.
>strsplit("abc123def", "[0-9]+")
[[1]]
[1] "abc" "" "" "def"
But how should I split the string the same way using regular expression, but also retain the matches? I need something like the following.
>FUNCTION("abc123def", "[0-9]+")
[[1]]
[1] "abc" "123" "def"
Using strapply("abc123def", "[0-9]+|[a-z]+") works here, but what if the rest of the string other than the matches cannot be captured by a regular expression?
Fundamentally, it seems to me that what you want is not to split on [0-9]+ but to split on the transition between [0-9]+ and everything else. In your string, that transition is not pre-existing. To insert it, you could pre-process with gsub and back-referencing:
test <- "abc123def"
strsplit( gsub("([0-9]+)","~\\1~",test), "~" )
[[1]]
[1] "abc" "123" "def"
You could use lookaround assertions.
> test <- "abc123def"
> strsplit(test, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl=T)
[[1]]
[1] "abc" "123" "def"
You can use strapply from gsubfn package.
test <- "abc123def"
strapply(X=test,
pattern="([^[:digit:]]*)(\\d+)(.+)",
FUN=c,
simplify=FALSE)
[[1]]
[1] "abc" "123" "def"