How should I split and retain elements using strsplit? - r

The strsplit function in R matches a given regular expression, deletes the matches, and splits the rest of the string into a vector.
>strsplit("abc123def", "[0-9]+")
[[1]]
[1] "abc" "def"
But how can I split the string the same way using a regular expression, while also retaining the matches? I need something like the following.
>FUNCTION("abc123def", "[0-9]+")
[[1]]
[1] "abc" "123" "def"
Using strapply("abc123def", "[0-9]+|[a-z]+") from the gsubfn package works here, but what if the rest of the string, other than the matches, cannot be captured by a regular expression?

Fundamentally, it seems to me that what you want is not to split on [0-9]+ but to split on the transition between [0-9]+ and everything else. In your string, that transition is not pre-existing. To insert it, you could pre-process with gsub and back-referencing:
test <- "abc123def"
strsplit( gsub("([0-9]+)","~\\1~",test), "~" )
[[1]]
[1] "abc" "123" "def"
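The `~` delimiter only works if the input never contains a `~` itself, and a string that starts with digits produces a leading empty element. A minimal sketch of the same idea, assuming the control character `"\001"` never occurs in your data (the names `sep`, `marked`, and `pieces` are just illustrative):

```r
test <- c("abc123def", "42abc")

# Assumption: the control character \001 never appears in the input.
sep <- "\001"

# Wrap every run of digits in the sentinel, then split on it literally.
marked <- gsub("([0-9]+)", paste0(sep, "\\1", sep), test)
pieces <- strsplit(marked, sep, fixed = TRUE)

# Drop the empty strings produced when a match sits at the start of the input.
pieces <- lapply(pieces, function(p) p[nzchar(p)])
print(pieces)
```

Splitting with fixed = TRUE keeps the sentinel from being interpreted as a regular expression.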

You could use lookaround assertions.
> test <- "abc123def"
> strsplit(test, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl=TRUE)
[[1]]
[1] "abc" "123" "def"

You can use strapply from the gsubfn package.
library(gsubfn)
test <- "abc123def"
strapply(X = test,
         pattern = "([^[:digit:]]*)(\\d+)(.+)",
         FUN = c,
         simplify = FALSE)
[[1]]
[1] "abc" "123" "def"
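If the text between the matches cannot be described by a pattern of its own, another base R option is to take the matches and the non-matched pieces separately and interleave them. This is only a sketch: `split_keep` is a hypothetical helper name, and it handles a single string at a time.

```r
# Hypothetical helper: split one string on `pattern` but keep the matches.
split_keep <- function(x, pattern) {
  stopifnot(is.character(x), length(x) == 1L)
  m <- gregexpr(pattern, x)
  matched <- regmatches(x, m)[[1]]                 # the matches themselves
  between <- regmatches(x, m, invert = TRUE)[[1]]  # the text around the matches
  # Interleave: odd positions hold the in-between text, even positions the matches.
  out <- character(length(matched) + length(between))
  out[2L * seq_along(between) - 1L] <- between
  out[2L * seq_along(matched)] <- matched
  out[nzchar(out)]                                 # drop empty pieces
}

result <- split_keep("abc123def", "[0-9]+")
print(result)
# [1] "abc" "123" "def"
```

Only the pattern for the matches is needed here; whatever lies between them is returned as is.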

Related

Extracting every nth element of vector of lists

I have the following ids.
ids <- c('a-000', 'b-001', 'c-002')
I want to extract the numeric part of them (000, 001, 002).
I tried this :
(str_split(ids, '-', n=2))[2]
returns the following :
[[1]]
[1] "b" "001"
I don't want the second element of the list. I want the second element of every element in the list. I know this is definitely a basic question, but how do I resolve the syntax conflict? Do I need to go through a lambda function?
The same can be done in base R with strsplit.
sapply(strsplit(ids, "-"), `[`, 2)
# [1] "000" "001" "002"
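If you want the result type to be checked, a similar sketch with vapply makes the expected return type explicit (the variable names `parts` and `second` are only illustrative):

```r
ids <- c("a-000", "b-001", "c-002")

# Split each id on the literal "-" and keep the pieces as a list.
parts <- strsplit(ids, "-", fixed = TRUE)

# Take the second piece of every element; vapply checks that each
# result is a single character string.
second <- vapply(parts, function(p) p[2], character(1))
print(second)
# [1] "000" "001" "002"
```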
You can also try gsub and substring.
gsub("\\D+", "", ids)
# [1] "000" "001" "002"
substring(ids, 3)
# [1] "000" "001" "002"
To continue with your attempt, you can use sapply:
sapply(stringr::str_split(ids, '-', n=2), `[`, 2)
#[1] "000" "001" "002"
Here, though, str_split_fixed is a better fit, because it returns a character matrix that you can index by column.
stringr::str_split_fixed(ids, '-', n=2)[, 2]
#[1] "000" "001" "002"
Or in base R:
sub('.*?-(.*)-?.*', '\\1', ids)
You could try str_remove(ids, "\\D+")
With base R you can remove all the characters that are not digits:
ids <- c('a-000', 'b-001', 'c-002')
gsub("[^[:digit:]]", "", ids)
#> [1] "000" "001" "002"
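[:digit:] is the POSIX character class for digits, and a leading ^ inside the brackets negates the class, so [^[:digit:]] matches any character that is not a digit. gsub() therefore replaces every non-digit character with the empty string "".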
For more information see documentation for gsub() and regex in R.
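To see the effect of the negation, here is a small sketch that runs the class with and without the leading ^ (the variable names are illustrative):

```r
ids <- c("a-000", "b-001", "c-002")

# Negated class: remove everything that is NOT a digit.
digits_only <- gsub("[^[:digit:]]", "", ids)

# Plain class: remove the digits themselves instead.
non_digits <- gsub("[[:digit:]]", "", ids)

print(digits_only)
# [1] "000" "001" "002"
print(non_digits)
# [1] "a-" "b-" "c-"
```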
An option with str_replace
library(stringr)
str_replace(ids, "\\D+", "")
#[1] "000" "001" "002"

Delete pattern in string and semicolon before and/or after (R)

In genomics, we often have to work with many strings of gene names that are separated by semicolons. I want to do pattern matching (find a specific gene name in a string), and then remove that from the string. I also need to remove any semicolon before or after the gene name. This toy example illustrates the problem.
s <- c("a;b;x", "a;x;b", "x;b", "x")
library(stringr)
str_replace(s, "x", "")
#[1] "a;b;" "a;;b" ";b" ""
The desired output should be.
#[1] "a;b" "a;b" "b" ""
I could do pattern matching for ;x and x; as well and that would give me the output; but that wouldn't be very efficient. We can also use gsub or the stringi package and that would be fine as well.
If x is at the start of the string, remove x and an optional ; after it; otherwise remove x and an optional ; before it. That covers all the cases listed:
str_replace(s, "^x(;?)|(;?)x", "")
# [1] "a;b" "a;b" "b" ""
We can use gsub from base R
gsub("^x;|;?x", "", s)
#[1] "a;b" "a;b" "b" ""
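Note that the patterns above also match an x that is part of a longer gene name. A different technique, not regex based, is to split on the semicolon, drop the exact name, and paste the rest back together. This is a sketch; `remove_gene` is a hypothetical helper name.

```r
# Hypothetical helper: drop one exact gene name from semicolon-separated strings.
remove_gene <- function(s, gene) {
  vapply(
    strsplit(s, ";", fixed = TRUE),
    function(genes) paste(genes[genes != gene], collapse = ";"),
    character(1)
  )
}

s <- c("a;b;x", "a;x;b", "x;b", "x")
cleaned <- remove_gene(s, "x")
print(cleaned)
# [1] "a;b" "a;b" "b"   ""
```

Because the comparison is on whole elements, a name such as "ax" is left untouched.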

How to use the strsplit function with a period

I would like to split the following string by its periods. I tried strsplit() with "." in the split argument, but did not get the result I want.
s <- "I.want.to.split"
strsplit(s, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
The output I want is to split s into 4 elements in a list, as follows.
[[1]]
[1] "I" "want" "to" "split"
What should I do?
When using a regular expression in the split argument of strsplit(), you have to escape the . as \\. or put it in a character class, [.]. Otherwise . keeps its special meaning, "any single character".
s <- "I.want.to.split"
strsplit(s, "[.]")
# [[1]]
# [1] "I" "want" "to" "split"
But the more efficient method here is to use the fixed argument in strsplit(). Using this argument will bypass the regex engine and search for an exact match of ".".
strsplit(s, ".", fixed = TRUE)
# [[1]]
# [1] "I" "want" "to" "split"
And of course, you can see help(strsplit) for more.
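As a quick check, the three ways of treating the period literally can be compared side by side (a sketch; the variable names are illustrative):

```r
s <- "I.want.to.split"

by_class  <- strsplit(s, "[.]")[[1]]              # character class
by_escape <- strsplit(s, "\\.")[[1]]              # escaped dot
by_fixed  <- strsplit(s, ".", fixed = TRUE)[[1]]  # literal match, no regex

print(by_fixed)
# [1] "I"     "want"  "to"    "split"

# All three approaches give the same pieces.
stopifnot(identical(by_class, by_escape), identical(by_escape, by_fixed))
```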
You need to either place the dot . inside a character class or precede it with two backslashes to escape it, since the dot has a special meaning in regex: "match any single character (except newline)".
s <- 'I.want.to.split'
strsplit(s, '\\.')
# [[1]]
# [1] "I" "want" "to" "split"
Besides strsplit(), you can also use scan(). Try:
scan(what = "", text = s, sep = ".")
# Read 4 items
# [1] "I" "want" "to" "split"

Return only matched pattern from grep

Given the following example in R:
my.list<-list(a='ivw_2014_abc.pdf',b='ivw_2014_def.pdf',c='ivw_2014_ghi.pdf')
grep('(?<=ivw_2014_)[a-z]*',my.list,perl=T,value=T)
returns
a b c
"ivw_2014_abc.pdf" "ivw_2014_def.pdf" "ivw_2014_ghi.pdf"
I would like to make it return only
[1] 'abc' 'def' 'ghi'
In bash I would use grep's -o option. How do I achieve this in R?
Without using any capturing groups,
> my.list<-list(a='ivw_2014_abc.pdf',b='ivw_2014_def.pdf',c='ivw_2014_ghi.pdf')
> gsub("^.*_|\\..*$", "", my.list, perl=TRUE)
[1] "abc" "def" "ghi"
For example :
sub('.*_(.*)[.].*','\\1',my.list)
[1] "abc" "def" "ghi"
The following may be of interest:
as.character(unlist(data.frame(strsplit(as.character(unlist(data.frame(strsplit(as.character(my.list),'\\.'))[1,])), '_'))[3,]))
[1] "abc" "def" "ghi"
The following is easier to read:
as.character(
  unlist(data.frame(strsplit(as.character(
    unlist(data.frame(strsplit(as.character(
      my.list),'\\.'))[1,])), '_'))[3,]))
[1] "abc" "def" "ghi"
Another option would be:
library(stringi)
stri_extract_first_regex(unlist(my.list), "[A-Za-z]+(?=\\.)")
#[1] "abc" "def" "ghi"
Look at the regmatches function. It works with regexpr rather than grep, but returns just the matched part of the string.
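A minimal sketch of that approach, reusing the lookbehind pattern from the question (the names `files` and `found` are illustrative):

```r
my.list <- list(a = 'ivw_2014_abc.pdf', b = 'ivw_2014_def.pdf', c = 'ivw_2014_ghi.pdf')
files <- unlist(my.list)

# regexpr finds the first match in each string; the lookbehind needs perl = TRUE.
m <- regexpr("(?<=ivw_2014_)[a-z]*", files, perl = TRUE)

# regmatches returns only the matched text; unname drops any names.
found <- unname(regmatches(files, m))
print(found)
# [1] "abc" "def" "ghi"
```

Keep in mind that regmatches with regexpr silently drops elements that do not match, so the result can be shorter than the input.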

Extracting string inbetween two string patterns in R

If I have a character vector:
links <- c("http://fdsfdsfdsfsdaaa.com/t5/this/bd-p/fdsfsdfdsfscshdad/dasd",
"http://ffdsfdddddfdf.com/t5/that/bd-p/fdsfdsfsddfjfsd")
I want to extract "this" and "that" knowing that they are between "t5" and "bd-p." Totally lost on this one.
Using sub:
sub(".*t5/(.*)/bd-p.*","\\1",links)
[1] "this" "that"
Try this:
lapply(regmatches(links, regexec("t5/(.*)/bd-p", links)), '[', 2)
[[1]]
[1] "this"
[[2]]
[1] "that"
regexec combined with regmatches is good for getting subexpressions (i.e. the parts in parentheses). For each input, regmatches returns the whole match followed by each subexpression, which is why I extract only the second element, which is the subexpression.
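If you would rather get a plain character vector than a list, a similar sketch with vapply works (the names `m` and `middle` are illustrative):

```r
links <- c("http://fdsfdsfdsfsdaaa.com/t5/this/bd-p/fdsfsdfdsfscshdad/dasd",
           "http://ffdsfdddddfdf.com/t5/that/bd-p/fdsfdsfsddfjfsd")

# Each list element holds the whole match first, then the captured group.
m <- regmatches(links, regexec("t5/(.*)/bd-p", links))

# Keep only the captured group; an input without a match yields NA.
middle <- vapply(m, function(g) g[2], character(1))
print(middle)
# [1] "this" "that"
```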
