Easy way to extract uppercase letters from a string in R

I am a beginner R programmer.
I have the string "cCt/cGt" and I want to extract the uppercase C and G and write them as C>G.
test ="cCt/cGt"
str_extract(test, "[A-Z]+$")

Try this:
gsub(".*([A-Z]).*([A-Z]).*", "\\1>\\2", test )
[1] "C>G"
Here, we capture the two occurrences of the upper case letters in capturing groups given in parentheses (...). This enables us to refer to them (and only to them but not the rest of the string!) in gsub's replacement clause using backreferences \\1 and \\2. In the replacement clause we also include the desired >.
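If the number of uppercase letters is not known in advance, the same idea can be sketched in base R without capture groups. This is only an illustration, assuming every uppercase letter in the string should be kept, in order; the variable names upper and mutation are made up for the example.

```r
test <- "cCt/cGt"

# gregexpr() finds every uppercase letter; regmatches() pulls them out
upper <- regmatches(test, gregexpr("[A-Z]", test))[[1]]

# join the extracted letters with ">"
mutation <- paste(upper, collapse = ">")
print(mutation)
```

With more than two uppercase letters this would chain them all (for example "A>B>C"), which may or may not be what you want.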

You seem to be looking for a mutation between two concatenated strings. This function should solve your problem:
extract_mutation <- function(text){
  splitted <- strsplit(text, split = "/")[[1]]
  pos <- regexpr("[[:upper:]]", splitted)
  uppercases <- regmatches(splitted, pos)
  mutation <- paste0(uppercases, collapse = ">")
  return(mutation)
}
If the two exchanged bases are always at the same index, you could also return the position if you're interested. Replace return(mutation) with:
position <- pos[1]
return(list(mutation, position))
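Put together, the variant that also reports the position might look like the sketch below. The function name extract_mutation_with_position and the list element names are assumptions chosen for this example, and it still assumes exactly one uppercase letter per half of the string.

```r
extract_mutation_with_position <- function(text) {
  # split "cCt/cGt" into c("cCt", "cGt")
  parts <- strsplit(text, split = "/", fixed = TRUE)[[1]]
  # position of the first uppercase letter in each part
  pos <- regexpr("[[:upper:]]", parts)
  uppercases <- regmatches(parts, pos)
  list(
    mutation = paste0(uppercases, collapse = ">"),
    # as.integer() drops the match attributes that regexpr() attaches
    position = as.integer(pos)[1]
  )
}

res <- extract_mutation_with_position("cCt/cGt")
print(res)
```

Naming the list elements lets you write res$mutation and res$position instead of indexing by number.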

You might also capture the two uppercase characters, allowing optional lowercase characters around them and a / in between.
library(stringr)
test ="cCt/cGt"
res = str_match(test, "([A-Z])[a-z]*/[a-z]*([A-Z])")
sprintf("%s>%s", res[2], res[3])
Output
[1] "C>G"
An exact match for the whole string could be:
^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$
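One way the exact-match pattern could be used in R is sketched below, under the assumption that strings not matching the strict layout should yield NA. The vector codons and its second element are invented for the example, and perl = TRUE is used so that the character ranges do not depend on the locale.

```r
pattern <- "^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$"
codons <- c("cCt/cGt", "cct/cGt")  # the second entry does not fit the pattern

# only rewrite strings that match the whole pattern; others become NA
result <- ifelse(
  grepl(pattern, codons, perl = TRUE),
  sub(pattern, "\\1>\\2", codons, perl = TRUE),
  NA_character_
)
print(result)
```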

Related

Splitting sequence of letters, whilst retaining original sequence position

I need to split the following sequence of letters into distinct chunks
SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC
I have used the following code, provided by a previous user, to achieve what I initially wanted, which was to split the sequence after every C.
library(dplyr)
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
Test <- strsplit(TestSequence, "(?<=[C])", perl = TRUE) %>% unlist
df <- data.frame(Fragment = Test) %>%
mutate("position" = cumsum(nchar(Test)))
This allowed me to split the sequence after every C and retain its position in the sequence, for example C at position 2, 11, etc.
Now I need to split the same sequence at different locations, which I can do using the following to split after P,A,G or S:
Test2 <- strsplit(TestSequence, "(?<=[P,A,G,S])", perl = TRUE) %>% unlist
This is fine if I want it to split after a given character, but if I try to split it before a character for example D, I cannot seem to retain the D in the fragment. I can only have it retained if it is split after the D.
I have tried every combination of lookbehind and lookahead I can think of. The following cuts both before and after every D, which isn't that useful.
Test3 <- strsplit(TestSequence, "(?=[D])", perl = TRUE) %>% unlist
Also, is there a way to retain the exact position of every C in the original sequence?
So if I were to split the test sequence after the initial K, I'd have a fragment that was SCDK, could I have a separate column that tells me where the C was in the original sequence. Just as a second example, the next fragment would be SFNRGECSCDK and in that separate column it would say the C was originally in position 11.
Zero-length matches that result from the use of lookahead only patterns used in strsplit are not handled properly.
In this case, you need to "anchor" the matches on the left, too. Either use a non-word boundary, or a lookbehind that disallows the match at the start of string:
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
strsplit(TestSequence, "\\B(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC" "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"
strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC" "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"
The \B(?=D) pattern matches a location that is immediately preceded by a word char and is immediately followed by D.
The (?<!^)(?=D) pattern matches a location that is not immediately preceded by a start of string location (i.e. not at the start of string) and is immediately followed by D.
Also, note that [P,A,G,S] matches P, A, G, S and a comma. You should use [PAGS] to match one of the letters.
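For the second part of the question (keeping the original position of every C), one possible sketch in base R is to compute each fragment's start and end from the fragment lengths and then look up which C positions fall inside that range. The column names start, end and C_positions are assumptions made for this example.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# split before every D (but not at the start of the string)
fragments <- strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)[[1]]

# start and end of each fragment in the original sequence
end_pos <- cumsum(nchar(fragments))
start_pos <- end_pos - nchar(fragments) + 1

# positions of every C in the original sequence
c_pos <- as.integer(gregexpr("C", TestSequence, fixed = TRUE)[[1]])

df <- data.frame(Fragment = fragments, start = start_pos, end = end_pos)

# for each fragment, list the original C positions that fall inside it
df$C_positions <- sapply(seq_along(fragments), function(i) {
  inside <- c_pos[c_pos >= start_pos[i] & c_pos <= end_pos[i]]
  paste(inside, collapse = ",")
})
print(df)
```

Storing the positions as a comma-separated string keeps one row per fragment; a list column would work as well if the positions are needed as numbers later.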

How to find repeating pattern in R string?

Suppose I have the following string:
v = c("fam gen geo gen")
I need a regular expression, which will find a repeating pattern in this string. For example, if I go with:
str_extract(v, "*regular expression*")
The output should be:
"gen"
Can you please come up with a regular expression for this case?
You can use a regex with a backreference:
sub(".*?(\\w+).+\\1", "\\1", v)
The pattern looks for a group of word characters (\\w+), followed by at least one other character (.+), followed by the same group again (the backreference \\1). sub then replaces the whole match with that captured group (the second argument to sub).
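A variation on the same backreference idea, sketched here as an assumption rather than as the answer's own method, is to extract the repeated word directly instead of replacing the whole string. The word boundaries (\\b) are added so that only whole words count as repeats, and perl = TRUE is needed for the lookahead.

```r
v <- "fam gen geo gen"

# a whole word that appears again, as a whole word, later in the string
m <- regexpr("\\b(\\w+)\\b(?=.*\\b\\1\\b)", v, perl = TRUE)
repeated <- regmatches(v, m)
print(repeated)
```

Unlike sub, regmatches returns character(0) when nothing repeats, instead of returning the input unchanged.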
OK, I'm going to assume that you are trying to create a character vector with 4 elements.
Instead of
v = c("fam gen geo gen")
It should be
v = c("fam", "gen", "geo", "gen")
Then
v[duplicated(v)]
In the case that you have more than two repetitions of an element and you only want the duplicated element to be returned once, you can use anyDuplicated, which returns the index of the first duplicated entry:
v <- c("fam", "gen", "geo", "gen", "gen")
v[duplicated(v)]
>"gen" "gen"
v[anyDuplicated(v)]
>"gen"
I'm going to make a different assumption than Nomad: that you have a single string. I then split it by the assumed delimiter (space).
v = c("fam gen geo gen")
vec <- unlist(
strsplit(v, " ")
)
vec[anyDuplicated(vec)]
[1] "gen"
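Because anyDuplicated returns only the index of the first duplicate, indexing with it yields at most one element. If several different words may repeat, a sketch that reports each repeated word once could combine duplicated with unique. The example string is made up to include two repeated words.

```r
v <- "fam gen geo gen gen fam"

# split on spaces, then keep each value that occurs more than once, once
vec <- unlist(strsplit(v, " ", fixed = TRUE))
repeated <- unique(vec[duplicated(vec)])
print(repeated)
```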

How to extract bracket from string into new columns

I need to export information from a string into different columns.
More specifically, the content of the brackets within the string.
Let's say I have a string
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
What I am trying to output is a vector with the contents of the brackets. If a bracket contains a semicolon, the terms should be saved as separate bracketed strings, and the parentheses (with their content) should be removed.
e.g.
tmp <- function(a)
Result
tmp
"[K89]" , "[K96]", "[N-Term]", "[S87]", "[S93]"
My approach so far:
pattern <- "(\\[.*?\\])"
hits <- gregexpr(pattern, a)
matches <- regmatches(a, hits)
unlisted_matches <- unlist(matches)
Results
"[K89; K96]" "[N-Term]" "[S87(100); S93(100)]"
This does give me the brackets but still doesn't split the terms. For some reason I am not able to efficiently separate the ";" terms.
Here's a way using the tidyverse :
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
library(tidyverse)
a %>%
# extract the text between square brackets, not keeping the brackets, and unlist
str_extract_all("(?<=\\[).*?(?=\\])") %>%
unlist() %>%
# remove round brackets and content
str_replace_all("\\(.*?\\)", "") %>%
# split by ";" and unlist
str_split("; ") %>%
unlist() %>%
# put the brackets back
str_c("[",.,"]")
#> [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
You may use
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
pattern <- "(?:\\G(?!^)(?:\\([^()]*\\))?\\s*;\\s*|\\[)\\K[^][;()]+"
matches <- regmatches(a, gregexpr(pattern, a, perl=TRUE))
unlisted_matches <- paste0("[", unlist(matches),"]")
unlisted_matches
## => [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
Pattern details
(?:\G(?!^)(?:\([^()]*\))?\s*;\s*|\[) - either the end of the previous successful match (\G(?!^)) followed by an optional substring inside round parentheses (see (?:\([^()]*\))?) and then a ; surrounded by zero or more whitespace characters (see \s*;\s*), or a [ char
\K - match reset operator discarding all text matched so far
[^][;()]+ - one or more chars other than [, ], ;, ( and ).
The paste0("[", unlist(matches),"]") part wraps the matches with square brackets.
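If the single-regex solution is hard to maintain, the same steps can be sketched in base R one at a time. This is an illustrative alternative, assuming the terms inside a bracket are always separated by semicolons; the intermediate names inside and terms are made up.

```r
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"

# 1. text between square brackets, without the brackets
inside <- regmatches(a, gregexpr("(?<=\\[)[^]]*(?=\\])", a, perl = TRUE))[[1]]

# 2. drop round parentheses and their content
inside <- gsub("\\([^()]*\\)", "", inside)

# 3. split on ";" and trim surrounding whitespace
terms <- trimws(unlist(strsplit(inside, ";", fixed = TRUE)))

# 4. put the square brackets back
result <- paste0("[", terms, "]")
print(result)
```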

RegEx for a conditional pattern in a string

I need to extract substrings from some strings. For example, my data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
If the string starts with "PREDICTED:", I want to extract the word that follows it (here "Ceratitis"); if the string doesn't start with "PREDICTED:", I want to extract the first word (here "Shigella").
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
This looks like a typical use case for a conditional regular expression. I tried, but always failed.
I know R supports Perl-compatible regular expressions, so I tried to use two functions, regexpr and regmatches, to extract the substrings that I want.
The code is:
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,b,perl = TRUE)
result_b = regmatches(b,m_b)
Finally, the result is:
# result_a = "Shigella"
# result_b = "PREDICTED"
This is not the result I expected: result_a is right, but result_b is wrong.
Why? It seems that the condition didn't work.
PS:
I tried to read up on conditional regular expressions at https://www.regular-expressions.info/conditional.html and imitated the pattern from that page. I also tried the RegexBuddy software to find the reason.
EDIT:
To use the function below on a vector, one can do:
Vector: myvec<-c("Shigella dysenteriae","PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor <- function(string){
  if(grepl("^PREDICTED",string)){
    strsplit(string,": ")[[1]][2]
  }
  else{
    strsplit(string," ")[[1]][1]
  }
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"
I think the reason it does not work is that (1) checks whether capture group 1 has been set, but at that point no first capturing group has been set yet; the positive lookbehind (?<=PREDICTED:)? does not contain one either.
The first and second capturing groups only appear in the parts that follow. The if clause checks group 1, finds it unset, and so matches the else branch, which is group 2.
If you made it the only capturing group, (?<=(PREDICTED: )?), and omitted the other two, the if clause could be true, but you would get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, to get both words you might use a capturing group and make PREDICTED: optional:
^(?:PREDICTED: )?(\w+)
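In R, this pattern could be applied with sub, for example as in the sketch below. The trailing .* is an addition for this example so that sub replaces the whole string with the captured word, and perl = TRUE is used for the non-capturing group.

```r
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")

# optional "PREDICTED: " prefix, then capture the first word; drop the rest
result <- sub("^(?:PREDICTED: )?(\\w+).*", "\\1", inp, perl = TRUE)
print(result)
```

Because sub is vectorised, this works on the whole vector at once without lapply.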
If I understand correctly, the OP wants to extract
the first word after "PREDICTED:" if the strings starts with "PREDICTED:"
the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
Remove any leading "PREDICTED:" (if any)
Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")

How to do a replace with backreferences, when the number of occurrences is unknown?

In order to make a few corrections to a .tex file generated by Bookdown, I need to replace occurrences of }{ with , when it is used in a citation, i.e.
s <- "Text.\\autocites{REF1}{REF2}{REF3}. More text \\autocites{REF4}{REF5} and \\begin{tabular}{ll}"
Should become
"Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
Because I need to keep the references I tried to look into backreferences, but I cannot seem to get it right, because the number of groups to match is unknown beforehand. Also, I cannot do stringr::str_replace_all(s, "\\}\\{", ","), because }{ occurs in other places in the document as well.
My best approach so far is to use a look-behind so that the replacement only happens after \\autocites, but then I cannot get the backreferences and grouping right:
stringr::str_replace_all(s, "(?<=\\\\autocites\\{)([:alnum:]+)(\\}\\{)", "\\1,")
[1] "Text.\\autocites{REF1,REF2}{REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
stringr::str_replace_all(s, "(?<=\\\\autocites\\{)([:alnum:]+)((\\}\\{)([:alnum:]+))*", "\\1,\\4")
[1] "Text.\\autocites{REF1,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
I might be missing some completely obvious approach, so I hope someone can help.
pat matches autocites followed by the shortest string that ends in } and is followed by end of string or a non-{.
gsubfn is then used to replace each occurrence of }{ in that match with a comma. The replacement function is written in formula notation: the body of the function is on the RHS of the ~, and because the body contains ..1 the arguments are taken to be ... . This approach does not use zero-width lookahead or lookbehind.
library(gsubfn)
pat <- "(autocites.*?\\}($|[^{]))"
gsubfn(pat, ~ gsub("}{", ",", ..1, fixed = TRUE), s)
giving:
[1] "Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
Variation
One minor simplification of the regular expression shown above is to remove the outer parentheses from pat and instead specify backref = 0 in gsubfn. That tells it to pass the entire match to the function. We could use ..1 to specify the argument as above, but since we know that only one argument is passed we can specify it as x in the body of the function. Any variable name would do, as gsubfn assumes that any free variable is an argument. The output would be the same as above.
pat2 <- "autocites.*?\\}($|[^{])"
gsubfn(pat2, ~ gsub("}{", ",", x, fixed = TRUE), s, backref = 0)
Cool problem - I got to learn a new trick with str_replace_all. You can pass a function as the replacement, and it is applied to each string the pattern picks out.
replace_brakets <- function(str) {
str_replace_all(str, "\\}\\{", ",")
}
s %>% str_replace_all("(?<=\\\\autocites\\{)([:alnum:]+\\}\\{)+", replace_brakets)
# [1] "Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
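If you would rather avoid extra packages, a base R sketch of the same idea is to locate each complete \\autocites{...}{...} run with gregexpr and rewrite only those matches through the regmatches<- replacement form. The pattern assumes the references themselves contain no braces.

```r
s <- "Text.\\autocites{REF1}{REF2}{REF3}. More text \\autocites{REF4}{REF5} and \\begin{tabular}{ll}"

# \autocites followed by one or more {...} groups
pat <- "\\\\autocites(?:\\{[^{}]*\\})+"
m <- gregexpr(pat, s, perl = TRUE)

# replace "}{" with "," inside the matched runs only
regmatches(s, m) <- lapply(
  regmatches(s, m),
  function(x) gsub("}{", ",", x, fixed = TRUE)
)
print(s)
```

The \\begin{tabular}{ll} part is left alone because it is never inside a match.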
