How to extract the unique letters from runs of consecutive letters in R?

For example, take the character string x = "AAATTTGGAA".
I want to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".
The unique letter of each chunk is then "A", "T", "G", "A", so the expected output is ATGA.
How can I get this?

Here is an approach that uses a regex trick:
x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out
[1] "AAA" "TTT" "GG" "AA"
The regex pattern used here says to split at any boundary where the preceding and following characters are different.
(?<=(.)) is a lookbehind that also captures the preceding character in \1
(?!\\1) is a lookahead that asserts the following character is different
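To get from these chunks to the requested ATGA, one option is to keep the first character of each chunk and paste the results together. This is only a minimal sketch that reuses the x from above; the names chunks and result are illustrative.

```r
x <- "AAATTTGGAA"

# Split at every boundary between two different characters.
chunks <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# Each chunk is a run of a single letter, so its first character
# is the unique letter of that run.
result <- paste(substr(chunks, 1, 1), collapse = "")
result
#[1] "ATGA"
```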

You can split the string into individual characters, then use rle to find the consecutive runs and keep one value per run.
x <- "AAATTTGGAA"
vec <- unlist(strsplit(x, ''))
rle(vec)$values
#[1] "A" "T" "G" "A"
paste0(rle(vec)$values, collapse = '')
#[1] "ATGA"
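If several strings need the same treatment, the rle idea can be wrapped in a small helper. This is a sketch under the assumption that the input is a character vector of non-empty strings; collapse_runs is a made-up name, not an existing function.

```r
# Hypothetical helper: collapse each run of identical letters to one letter.
collapse_runs <- function(strings) {
  chars_per_string <- strsplit(strings, "")
  vapply(
    chars_per_string,
    function(chars) paste(rle(chars)$values, collapse = ""),
    character(1)
  )
}

collapsed <- collapse_runs(c("AAATTTGGAA", "CCGGC", "A"))
collapsed
#[1] "ATGA" "CGC"  "A"
```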

We can use regmatches with the pattern (.)\\1+ as below
> regmatches(x,gregexpr("(.)\\1+",x))[[1]]
[1] "AAA" "TTT" "GG" "AA"
or if you need the unique letters only
> gsub("(.)\\1+", "\\1", x)
[1] "ATGA"
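One thing to watch with (.)\\1+ is that it needs at least two identical letters, so regmatches skips runs of length one. The gsub call is not affected, because single letters are simply left in place. A small sketch on a different, made-up input that contains a single "T"; swapping + for * is one possible fix, and perl = TRUE is used here only for consistency.

```r
x <- "AATGG"

# "+" requires a repeat, so the lone "T" is not returned as a chunk.
runs_plus <- regmatches(x, gregexpr("(.)\\1+", x, perl = TRUE))[[1]]
runs_plus
#[1] "AA" "GG"

# "*" also accepts runs of length one.
runs_star <- regmatches(x, gregexpr("(.)\\1*", x, perl = TRUE))[[1]]
runs_star
#[1] "AA" "T"  "GG"

# gsub leaves the single "T" untouched, so the collapsed result is complete.
collapsed <- gsub("(.)\\1+", "\\1", x, perl = TRUE)
collapsed
#[1] "ATG"
```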

Related

R: strsplit on negative lookaround

Say I need to strsplit caabacb into individual letters, except when a letter is followed by a b, thus resulting in "c" "a" "ab" "a" "cb". I tried the following line, which looks OK in a regex tester but does not work in R. What did I do wrong?
strsplit('caabacb','(?!b)',perl=TRUE)
[[1]]
[1] "c" "a" "a" "b" "a" "c" "b"
You could add a positive lookbehind that matches any character, (?<=.), as a prefix. On its own, the positive lookbehind (?<=.) would split the string after every character (without removing any characters), but the negative lookahead (?!b) excludes the splits where a character is followed by a b:
strsplit('caabacb', '(?<=.)(?!b)', perl = TRUE)
#> [[1]]
#> [1] "c" "a" "ab" "a" "cb"
strsplit() probably needs something to split on. You could insert a delimiter, e.g. a ";", with gsub() and then split on that.
strsplit(gsub("(?!^.|b|\\b)", ";", "caabacb", perl=TRUE), ";", perl=TRUE)
# [[1]]
# [1] "c" "a" "ab" "a" "cb"
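Since zero-width patterns in strsplit can be surprising, as the question shows, another option is to match the pieces directly instead of splitting. This is a different technique from the two answers above: the sketch below matches "any character, optionally followed by one b" with regmatches and gregexpr. The name pieces is illustrative.

```r
x <- "caabacb"

# Each match is one character plus a directly following "b", if there is one.
pieces <- regmatches(x, gregexpr(".b?", x, perl = TRUE))[[1]]
pieces
#[1] "c"  "a"  "ab" "a"  "cb"
```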

Why does this grep exclusion fail to work in R?

I am trying to exclude certain characters when using grep in R, but I cannot get the result that I expect.
Here is the code:
x <- c("a", "ab", "b", "abc")
grep("[^b]", x, value=T)
> [1] "a" "ab" "abc"
I want to grab anything in vector x that does not contain b. It should not return "ab" or "abc".
Ultimately I want to pick up any element that contains "a" but not "b".
This is the result that I would expect:
grep("a[^b]", x, value=T)
> [1] "a"
How can I do that?
Try this:
grep("^[^b]*a[^b]*$", x, value=TRUE)
# [1] "a"
It looks for the start of the string, then allows any number of characters that are not "b", then an "a", then any number of characters that are not "b" again and then the end of the string is reached.
We can use the invert argument of grep, which returns the values that do not match. Here it returns the values that do not contain "b".
grep("b", x, value = TRUE, invert = TRUE)
#[1] "a"
I got the result you are looking for using this regular expression in grep:
grep("^[^b]*$", x, value=TRUE)
[1] "a"
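For the stated end goal (contains "a" but not "b"), it may help to see why "[^b]" alone is not enough and how the two conditions can be combined. The sketch below uses a slightly extended, made-up vector so that the difference is visible; the variable names are illustrative.

```r
x <- c("a", "ab", "b", "abc", "ca", "c")

# "[^b]" only asks for at least one character that is not "b",
# so "ab" and "abc" still match.
any_non_b <- grep("[^b]", x, value = TRUE)

# Option 1: combine two fixed-string tests.
has_a_not_b <- grepl("a", x, fixed = TRUE) & !grepl("b", x, fixed = TRUE)
x[has_a_not_b]
#[1] "a"  "ca"

# Option 2: a negative lookahead rejects any string containing "b",
# then ".*a" requires an "a" somewhere.
via_lookahead <- grep("^(?!.*b).*a", x, value = TRUE, perl = TRUE)
via_lookahead
#[1] "a"  "ca"
```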

Delete pattern in string and semicolon before and/or after (R)

In genomics, we often have to work with many strings of gene names that are separated by semicolons. I want to do pattern matching (find a specific gene name in a string), and then remove that from the string. I also need to remove any semicolon before or after the gene name. This toy example illustrates the problem.
s <- c("a;b;x", "a;x;b", "x;b", "x")
library(stringr)
str_replace(s, "x", "")
#[1] "a;b;" "a;;b" ";b" ""
The desired output should be.
#[1] "a;b" "a;b" "b" ""
I could do pattern matching for ;x and x; as well, and that would give me the output, but it wouldn't be very efficient. Solutions using gsub or the stringi package would be fine as well.
Remove x and an optional ; after it if x is at the start of the string; otherwise remove x and an optional ; before it. This covers all the cases listed:
str_replace(s, "^x(;?)|(;?)x", "")
# [1] "a;b" "a;b" "b" ""
We can use gsub from base R
gsub("^x;|;?x", "", s)
#[1] "a;b" "a;b" "b" ""
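Both patterns above match the letter x wherever it occurs, so with real gene names they would presumably also cut into a longer name that merely contains the target. One way around that, sketched here as an alternative technique rather than a regex fix, is to split on ";" and drop exact matches. drop_gene is a hypothetical helper name, and the extra element "a;xy;b" is made up to show the difference.

```r
# Hypothetical helper: remove one exact gene name from ";"-separated strings.
drop_gene <- function(strings, gene) {
  vapply(
    strsplit(strings, ";", fixed = TRUE),
    function(genes) paste(genes[genes != gene], collapse = ";"),
    character(1)
  )
}

s <- c("a;b;x", "a;x;b", "x;b", "x", "a;xy;b")
cleaned <- drop_gene(s, "x")
cleaned
#[1] "a;b"    "a;b"    "b"      ""       "a;xy;b"
```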

extract text from alphanumeric vector in R

I have data like the example below and need to extract the text that comes before any number. It would also be great if the text and the number could be separated.
df<-c("axz123","bww2","c334")
output
"axz", "bww", "c"
or
"axz","bww","c"
"123","2","334"
We can do:
df <- c("axz123","bww2","c334")
gsub("\\d+", "", df)
#[1] "axz" "bww" "c"
gsub("(\\D+)", "", df)
#[1] "123" "2" "334"
For your other example:
df <- "BAILEYS IRISH CREAM 1.75 LITERS REGULAR_NOT FLAVORED"
gsub("\\d.*", "", df)
#[1] "BAILEYS IRISH CREAM "
gsub("[A-Z_ ]*", "", df)
#[1] "1.75"
We can use [:alpha:] to match the alphabetic characters, and combine this with gsub() and a negation to remove all characters that are not alphabetic:
gsub("[^[:alpha:]]", "", df)
#[1] "axz" "bww" "c"
To obtain only the non-alphabetic characters we can drop the negation ^:
gsub("[[:alpha:]]", "", df)
#[1] "123" "2" "334"
Using str_extract and a regex lookahead: we match one or more alphabetic characters that are followed by a digit ((?=\\d)) and extract them.
library(stringr)
str_extract(df, "[[:alpha:]]+(?=\\d)")
#[1] "axz" "bww" "c"
If we need to separate the numeric and non-numeric parts, strsplit can be used:
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl=TRUE)
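The split list can then be turned into two columns. This is a sketch that assumes every element is text followed by digits, so each split yields exactly two pieces; parts and separated are illustrative names.

```r
df <- c("axz123", "bww2", "c334")

# Split at the zero-width boundary between a non-digit and a digit.
lst <- strsplit(df, "(?<=[^0-9])(?=[0-9])", perl = TRUE)

# Assumption: each element of lst has length 2 (text, then number).
parts <- do.call(rbind, lst)
separated <- data.frame(
  text = parts[, 1],
  number = as.integer(parts[, 2]),
  stringsAsFactors = FALSE
)
separated
#  text number
#1  axz    123
#2  bww      2
#3    c    334
```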

R: trim consecutive trailing and leading special characters from set of strings

I have a list of character vectors, all equal lengths. Example data:
> a = list('**aaa', 'bb*bb', 'cccc*')
> a = sapply(a, strsplit, '')
> a
[[1]]
[1] "*" "*" "a" "a" "a"
[[2]]
[1] "b" "b" "*" "b" "b"
[[3]]
[1] "c" "c" "c" "c" "*"
I would like to identify the indices of all leading and trailing consecutive occurrences of the character *. Then I would like to remove these indices from all three vectors in the list. By trailing and leading consecutive characters I mean e.g. either only a single occurrence as in the third one (cccc*) or multiple consecutive ones as in the first one (**aaa).
After the removal, all three character vectors should still have the same length.
So the first two and the last character should be removed from all three vectors.
[[1]]
[1] "a" "a"
[[2]]
[1] "*" "b"
[[3]]
[1] "c" "c"
Note that the second vector of the desired result will still have a leading *, but it only became the first character after the operation, so it should stay.
I tried using which to identify the indices (sapply(a, function(x)which(x=='*'))) but this would still require some code to detect the trailing ones.
Any ideas for a simple solution?
I would replace the leading and trailing stars with NA:
aa <- lapply(setNames(a,seq_along(a)), function(x) {
  star = x=="*"
  toNA = cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
  replace(x, toNA, NA)
})
Store in a data.frame:
DF <- do.call(data.frame, c(aa, list(stringsAsFactors=FALSE)) )
Omit all rows with NA:
res <- na.omit(DF)
# X1 X2 X3
# 3 a * c
# 4 a b c
If you hate data.frames and want your list back: lapply(res,I) or c(unclass(res)), which gives
$X1
[1] "a" "a"
$X2
[1] "*" "b"
$X3
[1] "c" "c"
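The toNA line in this answer is compact, so here is a step-by-step sketch of the cumsum trick on a single made-up vector. A position belongs to the leading block as long as no non-star has been seen yet, and the same test on the reversed vector finds the trailing block. The names leading and trailing are illustrative.

```r
x <- c("*", "*", "a", "*", "a", "*")
star <- x == "*"

# Running count of non-stars from the left; it stays 0 only inside the leading stars.
leading <- cumsum(!star) == 0
leading
#[1]  TRUE  TRUE FALSE FALSE FALSE FALSE

# Same count from the right, flipped back into the original order.
trailing <- rev(cumsum(rev(!star))) == 0
trailing
#[1] FALSE FALSE FALSE FALSE FALSE  TRUE

# The star in the middle is kept because neither count is 0 there.
kept <- x[!(leading | trailing)]
kept
#[1] "a" "*" "a"
```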
First off, as Richard Scriven asked in his comment on your question, your output is not the same as what you asked for. You ask for removal of leading and trailing characters, but your ideal output is just the 3rd and 4th elements of the character vectors.
This would be easily achievable by something like
a <- list('**aaa', 'bb*bb', 'cccc*')
alist = sapply(a, strsplit, '')
lapply(alist, function(x) x[3:4])
Now for an answer as you asked it:
IMHO, sapply() isn't necessary here.
You need a function from the grep family, which all share a help page opened by ?grep, to operate directly on your strings.
I would propose gsub() and a bit of Regular Expressions for your problem:
a <- list('**aaa', 'bb*bb', 'cccc*')
b <- gsub(pattern = "^(\\*)*", x = a, replacement = "")
c <- gsub(pattern = "(\\*)*$", x = b, replacement = "")
> c
[1] "aaa" "bb*bb" "cccc"
This is doable in one regex, but then I think you need a backreference for the part in between, and I didn't get that to work.
If you are familiar with the magrittr package and its excellent pipe operator, you can do this more elegantly:
library(magrittr)
gsub(pattern = "^(\\*)*", x = a, replacement = "") %>%
gsub(pattern = "(\\*)*$", x = ., replacement = "")
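The two gsub() calls can probably be merged by using alternation instead of a backreference: match either a run of stars at the start or a run of stars at the end, and replace both with nothing. A minimal sketch on the same input, with unlist() added so that gsub receives a plain character vector; trimmed is an illustrative name.

```r
a <- list('**aaa', 'bb*bb', 'cccc*')

# "^\\*+" matches leading stars, "\\*+$" matches trailing stars;
# gsub replaces every match, so both ends are trimmed in one call.
trimmed <- gsub("^\\*+|\\*+$", "", unlist(a))
trimmed
#[1] "aaa"   "bb*bb" "cccc"
```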
