Unable to extract postfixes using regular expression in R

I am working on single-cell RNA data and trying to demultiplex a raw count matrix. I am following a tutorial in which the barcode is in this format:
"BIOKEY_33_Pre_AAACCTGAGAGACTTA-1", where BIOKEY_33_Pre is the prefix and AAACCTGAGAGACTTA-1 is the sequence of bases. The prefixes are sample names, so they will be used to demultiplex the data.
Using this regular expression, I can extract the prefixes.
data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)
The problem is, in my data, the barcode is in this format:
AAACCTGAGAAACCGC_LN_05 where the sequence is first, and the sample name is last. I need to extract postfixes. If I run the above regular expression on my data, I get the following output:
data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)
sample.names <- unique(data.pfx)
head(sample.names)
"AAACCTGAGAAACCGC_LN_05"
"AAACCTGAGAAACGCC_NS_13"
"AAACCTGAGCAATATG_LUNG_N34"
The desired output:
"LN_05"
"NS_13"
"LUNG_N34"

You can use
sub(".*_([A-Z]+_[0-9A-Z]+)$", "\\1", sample.names)
See the regex demo.
Details:
.* - any zero or more chars as many as possible
_ - an underscore
([A-Z]+_[0-9A-Z]+) - Group 1 (\1): one or more uppercase ASCII letters, _ and one or more uppercase ASCII letters or digits
$ - end of string.
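As a quick check, the pattern can be applied to the three barcodes shown in the question. This is only a minimal sketch: the barcodes vector and the names sample.ids and barcodes.by.sample are made up for illustration, and with real data the input would be colnames(data.count).

```r
# Barcodes copied from the question; in practice use colnames(data.count)
barcodes <- c("AAACCTGAGAAACCGC_LN_05",
              "AAACCTGAGAAACGCC_NS_13",
              "AAACCTGAGCAATATG_LUNG_N34")

# Keep only the trailing "<letters>_<letters or digits>" part (Group 1)
sample.ids <- sub(".*_([A-Z]+_[0-9A-Z]+)$", "\\1", barcodes)
sample.ids
#> [1] "LN_05"    "NS_13"    "LUNG_N34"

# Group the barcodes by sample name (hypothetical next step for demultiplexing)
barcodes.by.sample <- split(barcodes, sample.ids)
```

Note that sub returns a string unchanged when the pattern does not match, so it may be worth checking unique(sample.ids) for leftover full barcodes before splitting the matrix.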

It is a bit easier to just remove all leading capital letters up to and including the first underscore:
sample.names <- c("AAACCTGAGAAACCGC_LN_05" ,
"AAACCTGAGAAACGCC_NS_13")
sub("^[A-Z]+_", "", sample.names)
#> [1] "LN_05" "NS_13"

Related

Add symbol between the letter S and any number in a column dataframe

I am trying to add a - between letter S and any number in a column of a data frame. So, this is an example:
VariableA
TRS34
MMH22
GFSR104
GS23
RRTM55
P3
S4
My desired output is:
VariableA
TRS-34
MMH22
GFSR104
GS-23
RRTM55
P3
S-4
I was trying to use gsub:
gsub('^([a-z])-([0-9]+)$','\\1d\\2',myDF$VariableA)
but this is not working.
How can I solve this?
Thanks!
Your ^([a-z])-([0-9]+)$ regex attempts to match strings that start with a lowercase letter, then have a - and then one or more digits. This can't work because there are no hyphens in the strings: the hyphen is what you want to introduce.
You can use
gsub('(S)([0-9])', '\\1-\\2', myDF$VariableA)
The (S)([0-9]) regex matches and captures S into Group 1 (\1) and then any digit is captured into Group 2 (\2) and the replacement pattern is a concatenation of group values with a hyphen in between.
If there is only one substitution expected, replace gsub with sub.
See the regex demo and the online R demo.
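For a self-contained check, the data frame from the question can be rebuilt and run through the same call. This is a sketch that assumes VariableA is a character column; myDF is reconstructed here from the values listed in the question.

```r
# Rebuild the example data from the question (assumed to be a character column)
myDF <- data.frame(
  VariableA = c("TRS34", "MMH22", "GFSR104", "GS23", "RRTM55", "P3", "S4"),
  stringsAsFactors = FALSE
)

# Insert a hyphen between every "S" and the digit that directly follows it
myDF$VariableA <- gsub("(S)([0-9])", "\\1-\\2", myDF$VariableA)
myDF$VariableA
#> [1] "TRS-34"  "MMH22"   "GFSR104" "GS-23"   "RRTM55"  "P3"      "S-4"
```

GFSR104 is left alone because its S is followed by R rather than by a digit.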
Other variations:
gsub('(S)(\\d)', '\\1-\\2', myDF$VariableA) # \d also matches digits
gsub('(?<=S)(?=\\d)', '-', myDF$VariableA, perl=TRUE) # Lookarounds make backreferences redundant
Here is the version I like using gsub:
myDF$VariableA <- gsub('S(\\d)', 'S-\\1', myDF$VariableA)
This requires using only one capture group.
Using the stringr package:
library(stringr)
str_replace_all(myDF$VariableA, 'S(\\d)', 'S-\\1')
You could also use lookbehinds if you set perl=TRUE:
> gsub('(?<=S)([0-9]+)', '-\\1', myDF$VariableA, perl=TRUE)
[1] "TRS-34" "MMH22" "GFSR104" "GS-23" "RRTM55" "P3" "S-4"

Negative lookahead in R to match delimited chunks in a string that do not contain a specific character

I am trying to extract (from a string) all the chunks of characters between two \r\n expressions that do not contain a white space. To do so, I am using the negative lookahead operator.
This is my string:
my_string <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
And this is what I've tried:
pat <- "\\r\\n+(?! )\\r\\n.*"
out <- unlist(regmatches(my_string,
regexpr(pat, my_string, perl=TRUE)))
This is what I got in R:
> out
[1] "\r\n\r\nDBhHB\r\n"
As you can see, it stops on the first match.
EDIT
My expected output, in this case, would be the final part of the string.
> out
[1] "DBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
I would like to be able to retrieve multiple parts if there is another one or two white spaces in other chunks in the middle of the string.
my_string <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"
Suggestions under the base R approach would be greatly appreciated.
Thanks in advance.
I suggest using
(?m)^\S+(?:\R\S+)*$
See the regex demo. Details:
(?m) - multiline mode on
^ - this anchor now matches all line start positions
\S+ - one or more non-whitespace chars
(?:\R\S+)* - zero or more repetitions of a line break sequence and then one or more non-whitespace chars
$ - end of a line.
R demo:
library(stringr)
my_string <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
pat <- "(?m)^\\S+(?:\\R\\S+)*$"
unlist(str_extract_all(my_string, pat))
## => [1] "DBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU"
my_string <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"
unlist(str_extract_all(my_string, pat))
## => [1] "KeepThis\r\nKeepThis" "KeepThis"
Base R usage
Note that base R uses the PCRE engine when perl=TRUE is set, and in multiline mode (when (?m) is used) $ only matches right before \n. Since you have \r\n line endings, you cannot use plain $ to mark the line end. Consuming \r (\r$) is not a good idea, as you do not want \r in the output. You can tell PCRE to treat CRLF, CR or LF as a line ending sequence with the (*ANYCRLF) PCRE verb:
unlist(regmatches(my_string, gregexpr("(*ANYCRLF)(?m)^\\S+(?:\\R\\S+)*$", my_string, perl=TRUE)))
Note that the (*ANYCRLF) PCRE verb must be at the start of the regex pattern.
See this R demo online.
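The base R call can be wrapped in a small helper so that both strings from the question are handled the same way. This is a sketch only: the function name extract_chunks is made up for illustration, and it relies on R's PCRE build supporting the (*ANYCRLF) verb and \R, as the answer above assumes.

```r
# Hypothetical helper: return all runs of consecutive lines without whitespace
extract_chunks <- function(x) {
  pat <- "(*ANYCRLF)(?m)^\\S+(?:\\R\\S+)*$"
  unlist(regmatches(x, gregexpr(pat, x, perl = TRUE)))
}

s1 <- "\r\nContent: base64\r\n\r\nDBhHB\r\nDGlV\r\nPAAHJ\r\nAwQU\r\n"
s2 <- "\r\nNot This\r\n\r\nKeepThis\r\nKeepThis\r\nNot This\r\nKeepThis\r\n"

chunks1 <- extract_chunks(s1)  # one chunk: the four base64 lines
chunks2 <- extract_chunks(s2)  # two chunks, split by the "Not This" line
```

Because gregexpr is used instead of regexpr, every matching chunk is returned rather than only the first one, which was the issue with the original attempt.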

Regex - Best way to match all values between two two-digit numbers?

Let's say I want a Regex expression that will only match numbers between 18 and 31. What is the right way to do this?
I have a set of strings that look like this:
"quiz.18.player.total_score"
"quiz.19.player.total_score"
"quiz.20.player.total_score"
"quiz.21.player.total_score"
I am trying to match only the strings that contain the numbers 18-31, and am currently trying something like this
(quiz.)[1-3]{1}[1-9]{1}.player.total_score
This obviously won't work because it will actually match all numbers between 11-39. What is the right way to do this?
Regex: 1[89]|2\d|3[01]
To match the full strings, add the surrounding text and escape the dots:
quiz\.(?:1[89]|2\d|3[01])\.player\.total_score
Details:
(?:) - non-capturing group
[] - matches a single character present in the list
| - or (alternation)
\d - matches a digit (equal to [0-9])
\. - matches a literal dot
. - matches any character, which is why the dots in the string must be escaped
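In R the backslashes have to be doubled inside the string literal. The following sketch filters a character vector with grepl; the vector s is generated here purely for illustration and covers quiz numbers 15 to 35.

```r
# Illustrative input: quiz numbers 15..35 (not from the original question)
s <- paste0("quiz.", 15:35, ".player.total_score")

# 1[89] -> 18-19, 2\d -> 20-29, 3[01] -> 30-31; perl = TRUE for the (?:...) group
pat <- "quiz\\.(?:1[89]|2\\d|3[01])\\.player\\.total_score"
matched <- s[grepl(pat, s, perl = TRUE)]

head(matched, 2)
#> [1] "quiz.18.player.total_score" "quiz.19.player.total_score"
```

Only the strings for quizzes 18 through 31 are kept; add ^ and $ around the pattern if the whole string must match exactly.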
1) If s is the character vector, read the fields into a data frame, pick off the second field and check whether it is in the desired range. Put the result in the logical vector ok and get those elements from s. This uses no regular expressions and only base R.
digits <- read.table(text = s, sep = ".")$V2
s[digits %in% 18:31]
2) Another approach based on the pattern "\\D" matching any non-digit is to remove all such characters and then check if what is left is in the desired range:
digits <- gsub("\\D", "", s)
s[digits %in% 18:31]
2a) In R 3.6.0 and later we could alternatively use the whitespace argument of trimws like this:
digits <- trimws(s, whitespace = "\\D")
s[digits %in% 18:31]
3) Another alternative is to simply construct the boundary strings and compare s to them. This will work only if all the number parts in s are exactly the same number of digits (which for the sample shown in the question is the case).
ok <- s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"
s[ok]
This is done using character ranges and alternations. For your range
3[10]|[2][0-9]|1[8-9]
Demo

replace positions of elements in a string using R

I have a string:
str = 'Mr[5]'
I want to switch the positions of Mr and 5 in str, and get a result like this:
result = '[5]Mr'
How can I do this in R?
You can use a regex with two capture groups and swap their positions in the replacement.
The stringr package helps with character manipulation.
s <- c("Mr[5]", "Mr[3245]", "Mrs[98j]")
stringr::str_replace_all(s, "^(.*)(\\[.*\\])$", "\\2\\1")
#> [1] "[5]Mr" "[3245]Mr" "[98j]Mrs"
About the regex:
^ is the beginning of the string and $ the end
.* matches any character, zero or more times
( and ) define a capture group
\\[ and \\] match literal brackets
together you have a simple regex that matches, for example, Mr followed by [5]: "(.*)(\\[.*\\])"
\\1 refers to the first capture group and \\2 to the second, so \\2\\1 swaps the groups
Obviously, you can create a stricter regex that fits your needs more precisely. The mechanism with capture groups will remain the same. regex101 is a good site to help you with regex.
For R, the stringr website has a nice introduction to regex.
You can use gsub:
values <- c("Mr[5]","Mr[1234]", "Mrs[456]")
values2 <- gsub("^(.+)(\\[[0-9]+\\])$", "\\2\\1", values)
# > values2
# [1] "[5]Mr" "[1234]Mr" "[456]Mrs"

How can I replace part of a string if it is included in a pattern

I am looking for a way to replace all _ (by say '') in each of the following characters
x <- c('test_(match)','test_xMatchToo','test_a','test_b')
if and only if _ is followed by ( or x. So the output wanted is:
x <- c('test(match)','testxMatchToo','test_a','test_b')
How can this be done (using any package is fine)?
Using a lookahead:
_(?=[(x])
A lookahead asserts that its pattern matches at the current position without consuming any characters, so the text it looks at is not part of the match. Here, the final match text consists of only the underscore, but the lookahead asserts that it is followed by an x or (.
Demo on Regex101
Your R code would look a bit like this (one arg per line for clarity):
gsub(
"_(?=[(x])", # The regex
"", # Replacement text
c("your_string", "your_(other)_string"), # Vector of strings
perl=TRUE # Make sure to use PCRE
)
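Applied to the vector from the question, the same call can be checked against the wanted output. This is a small sketch; the name cleaned is introduced here only for illustration.

```r
# Input vector from the question
x <- c("test_(match)", "test_xMatchToo", "test_a", "test_b")

# Remove "_" only when it is directly followed by "(" or "x"
cleaned <- gsub("_(?=[(x])", "", x, perl = TRUE)
cleaned
#> [1] "test(match)"   "testxMatchToo" "test_a"        "test_b"
```

perl = TRUE is required here because lookaheads are not supported by R's default regex engine.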
