Regular expression - find a letter by a specific template - nsregularexpression

I need to write code that works with DNA.
I need to find, within a string of multiple letters, a sequence of 6 letters that matches a template.
For example: at index 0 the letter can be A or T, and only those letters. What should I use to do this? Which kind of regular expression construct?
the_str = "AAATAAAATAAATAATAAAGAGCCAGAGGCCCTTGAAGAATGGATGGAAT\
TTGGACTTTAGCGGGGCTGGGGGACCCCGGAAATGGACGAGAAGCAGAAC\
CGAGGCCCTTTAGGGCTCAGCGGAGGCCTGCCTGTCTCTCTAAGGTCCCT\
CTTGGAGCAACTGAAGAAACTCCAGGCCATTGTGGTGCAGTCCACCAGCA\
AGTCAGCCCAGACAGGCACCTGTGTCGCAGTGAGTCCTGGTGCCCCCAGG\
CAAGCCGGGGACCTAGGCTTCTGTAGAGGGGCCCATAGGGAGGTGACAAT\
GAGTCCAAGCTCTCCTTGTGCCCCAGCTCAAGTATGATCCAGTCTGGTCT\
TTGGGGCCTCAGTTTCCCTGCCTGTGGGATGGAGATGCTTGCAGGGGAGG\
GGAGGGAGGGGGTGACTCTGCCGCTGTCTCCACCAGGTCCTGTTGCTGTC"

It sounds like maybe you need something like this?
(A|T)AGCGG
That would match either AAGCGG or TAGCGG.
If the fourth character could be any of C, A, G then it might look like this:
(A|T)AG(C|A|G)GG
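As a rough illustration of the alternation above, here is a minimal Python sketch. Python's re module stands in for NSRegularExpression here, on the assumption that this simple pattern reads the same in both. The name find_motifs and the short sample sequence are made up for the example. For single letters, a character class such as [AT] is equivalent to (A|T).

```python
import re

# [AT] is the character-class form of (A|T); [CAG] is the form of (C|A|G).
MOTIF = re.compile(r"[AT]AG[CAG]GG")

def find_motifs(dna):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(dna)]

# Hypothetical short sequence, not the one from the question.
sample = "CCAAGCGGTTTAGAGGAATAGGGG"
for start, text in find_motifs(sample):
    print(start, text)
```

The start index is zero-based, so a match reported at 2 begins at the third letter of the sequence.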

Related

String Matching in R - Problem with pattern

I have a small problem. I want to extract a special pattern like this:
v-97bcer
or b-chyfvg or ghd6db
I tried this:
identifier_1 <- "([:alnum:]{6})" # for things like this ghd6db
identifier_2 <- "([:lower:]{1})[- ][:alnum:]{6})" # for things like this v-97bcer or b-chyfvg
The problem is that the first "identifier" works reasonably well, but it also extracts names, for example. In an example like GHD6D8, the numbers have no fixed place and can occur anywhere. I only know that the length is 6.
The second problem is that, for example, v-97bcer can also occur as v97bcer, but I need the format v-97bcer. Here too the numbers are in random positions.
If somebody could help or point me to a good source for better understanding how to do this, I would be grateful. I do not have much experience with string matching. Thank you
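This should work:
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer")
grep("^([a-z].)?[a-z0-9]{6}$", x)
Note that, to fix the length of the match, I anchor the pattern with ^ (start of string) and $ (end of string). The . in the optional group matches any single character, so replace it with - if only a hyphen should be accepted there.
This pattern matches v-97bcer and b-chyfvg and ghd6db but not v97bcer.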
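The same idea can be sketched in Python, purely as an illustration alongside the R answer. The name is_identifier is hypothetical, and this variant assumes that only a literal hyphen is allowed after the leading letter, which is stricter than the . used in the grep call.

```python
import re

# Optional "letter + hyphen" prefix, then exactly six lowercase letters or digits.
# fullmatch plays the role of the ^ and $ anchors.
IDENTIFIER = re.compile(r"([a-z]-)?[a-z0-9]{6}")

def is_identifier(text):
    """True when the whole string has the form xxxxxx or x-xxxxxx."""
    return IDENTIFIER.fullmatch(text) is not None

candidates = ["v-97bcer", "b-chyfvg", "ghd6db", "v97bcer"]
print([c for c in candidates if is_identifier(c)])
```

Because the check covers the whole string, a seven-character value such as v97bcer fails both the six-character form and the prefixed form.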

Removing part of strings within a column

I have a column within a data frame containing a series of identifiers, each a letter followed by 8 digits, e.g. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there are thousands of variations of numbers within each category) but keep B16.... etc?
I know if there was just one thing I wanted to remove, like the B15, I could do:
sub("B15", "", df$col)
But I'm not sure how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)
Welcome to SO! This is a case for regex. You can use base R as I show here, or look into the stringr package for handy tools that are easier to understand. You can also look up regex rules to help define what you want to match. For what you ask, the following code example should help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function to replace a pattern with something else in one or a series of inputs. To test our regex I created the testStrings vector for different examples.
Breaking down the regex, "B15" is the literal text you're looking for. The "." means any single character, and "{2}" says how many of those characters to grab after "B15" (here exactly two). You can change it as you need. If you want to remove everything after "B15", replace the pattern with "B15.*"; the "*" means zero or more of the preceding item, so ".*" matches everything to the end.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern as so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has info on the different regexes you can write to be more particular.
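For comparison, here is a small Python sketch of the "blank out everything that starts with B15" case from the question. It is only an illustration next to the R answer; the function name blank_b15 and the sample identifiers other than B15006788 are invented for the example.

```python
import re

def blank_b15(identifiers):
    """Replace any identifier that starts with B15 by an empty string."""
    # ^ anchors at the start; .* then consumes the rest of the identifier.
    return [re.sub(r"^B15.*", "", ident) for ident in identifiers]

ids = ["B15006788", "B16001234", "AB1500000"]
print(blank_b15(ids))
```

Identifiers that merely contain B15 somewhere in the middle are left alone because of the ^ anchor.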

Using str_detect to select items between curly brackets

I have a column of items like this
{apple}
{orange}>s>
{pine--apple}
{kiwi}
{strawberry}>s>
I would like to filter it so that I only get items that are NOT just a word between brackets (but have other stuff before or after the bracket), so in this example I would like to select these two:
{orange}>s>
{strawberry}>s>
I have tried the following code using dplyr and stringr, but even though the regular expression works as expected on https://regexr.com/, in R it does not (it just selects rows in which the var column is empty). What am I doing wrong?
d_filtered <- d %>%
filter(!str_detect(var, "\\{(.*?)\\}"))
Your pattern is saying "match anything where there are brackets, with or without stuff between them". Then you negate it with !, so filtering out anything that has a { followed by a } anywhere in the string.
It sounds like you want to keep strings if there is something before or after the brackets, so let's match that. A . matches any (single) character, so a pattern for "something before open bracket" is ".\\{". Similarly, a pattern for "something after closing bracket" is "\\}.". We can connect them with | for "or". In your filter, use
filter(str_detect(var, ".\\{|\\}."))
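The same "something before { or something after }" test can be sketched in Python as an illustration. The helper name has_extra_text is hypothetical; the items are the ones from the question.

```python
import re

# ".\{" = any character directly before "{"; "\}." = any character directly after "}".
EXTRA = re.compile(r".\{|\}.")

def has_extra_text(item):
    """True when something precedes the opening brace or follows the closing one."""
    return EXTRA.search(item) is not None

items = ["{apple}", "{orange}>s>", "{pine--apple}", "{kiwi}", "{strawberry}>s>"]
print([item for item in items if has_extra_text(item)])
```

Note that "{pine--apple}" is not selected: the dashes sit inside the braces, and nothing precedes { or follows }.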
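This will solve your problem by testing whether every character of each string is in [a-zA-Z], { or }:
cl <- c("{apple}",
        "{orange}>s>",
        "{pine--apple}",
        "{kiwi}",
        "{strawberry}>s>")
# TRUE when x consists only of letters and curly brackets
find <- function(x) {
  chars <- unlist(strsplit(x, ""))
  allowed <- c(letters, LETTERS, "{", "}")
  all(chars %in% allowed)
}
# keep the strings that contain any other character
# note: "-" is not in allowed, so "{pine--apple}" is kept as well
cl <- cl[!sapply(cl, find)]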
One can also use grep of base R:
> d = c("<s{apple}", "{orange}>s>", "{pine--apple}", "{kiwi}", "{strawberry}>s>")
# I have added "<s" before {apple} in above vector
> d[grep(".\\{|}.", d)]
[1] "<s{apple}" "{orange}>s>" "{strawberry}>s>"

Pattern lookup within a string in R using regular expression matching

I am trying to pick patterns within a specific string and their respective location. I have explained below with an example:
String = "Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-
Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-"
pattern = "Web_797-*Web_797" (Web_797 followed by Web_797 with anything in between)
I used the following function:
str_locate_all(String,pattern)[[1]]
I am getting the following result:
start end
[1,] 1 15
[2,] 17 31
which is partially what I need. However, the pattern is not able to pick up the following combination (highlighted in black).
String = "Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-
Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-"
I would appreciate if anyone could help with this. I believe there is something wrong with the way I am defining the pattern but not able to fix it.
The problem with your pattern = "Web_797-*Web_797" is the -* part. That means zero or more dashes (-). I believe what you wanted was a dash followed by any characters. So a first (incorrect) attempt would be
pattern = "Web_797-.*Web_797"
where the . means "any character". But that is not quite right. You only want to collect characters until the next time you see Web_797, not all the way until the last time you see Web_797. By default, matches are "greedy", taking the biggest possible match. If we use
pattern = "Web_797-.*?Web_797"
the ? turns off greedy matching so that it only matches up to the next Web_797.
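To make the greedy versus lazy difference concrete, here is a small Python sketch (an illustration only, the original answer uses stringr). The string is the one from the question written on a single line, and match_spans is a hypothetical helper. Python spans are zero-based with an exclusive end, so (0, 15) corresponds to start 1, end 15 in the R output.

```python
import re

text = ("Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-"
        "Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-")

def match_spans(pattern, s):
    """Return the (start, end) span of every non-overlapping match."""
    return [m.span() for m in re.finditer(pattern, s)]

# Greedy: .* runs to the LAST Web_797, giving one long match.
print(match_spans(r"Web_797-.*Web_797", text))   # [(0, 94)]

# Lazy: .*? stops at the NEXT Web_797, giving three short matches.
print(match_spans(r"Web_797-.*?Web_797", text))  # [(0, 15), (16, 31), (59, 94)]
```

The third lazy match is the Web_797-PCP_IM_PAR-Prior_OP-Web_797 stretch that the original -* pattern could not reach.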

regex to match page[0-9] and nothing before or after

I have a regex but it's not quite working the way I want
page[0-9]*
/pages/search.aspx?pageno=3&pg=232323&hdhdhd/page73733/xyz
In the above example, the only thing I want to match is page73733. But my regex matches the page in /pages and it matches page in pageno=3
I also tried page[0-9].*, which matches page73733 but also everything that comes after it, so it actually matches page73733/xyz
page[0-9].*[^a-zA-Z&?/=]
That seems to do what I want, but it also seems like an ugly way to do it. Plus, if I had something like /page123/xyz/page456 it would match that whole string.
So is there a better way to do this? I want to match ONLY the string page when it is followed by any number of digits, and if anything comes after the digits it should stop.
* means 0 or more occurrences. + means 1 or more occurrences.
page[0-9]+ should work.
page[0-9]*
Will match page followed by zero or more numbers. What you want is:
page[0-9]+
Which will match page followed by one or more numbers.
You almost got it. Just use + instead of * as that will force a match that has numbers after it.
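A quick way to see the difference between * and + is to run both against the URL from the question. The sketch below uses Python's re module as an assumption, since the question does not name a language.

```python
import re

url = "/pages/search.aspx?pageno=3&pg=232323&hdhdhd/page73733/xyz"

# * allows zero digits, so the bare "page" in "/pages" and "pageno" also matches.
star_matches = re.findall(r"page[0-9]*", url)

# + requires at least one digit, so only page73733 qualifies.
plus_matches = re.findall(r"page[0-9]+", url)

print(star_matches)  # ['page', 'page', 'page73733']
print(plus_matches)  # ['page73733']
```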
Another way to write that expression would be
/page[0-9]+
Note the leading /. It helps because without it you might get a match inside something like "notApage123".
The regex page[0-9]* will match [0-9] 0 or more times. + would match it 1 or more times, and ? would match it 0 or 1 times. The quantifiers ?, * and + can equivalently be written with braces:
?={0,1}
*={0,}
+={1,}
The brace form may be helpful if you wanted to match a date: \\d{4}(-\\d{1,2}){2} would match 2013-5-31.
That said, the resulting Regex for your particular problem would be:
page\\d+
page\\d{1,}
page[0-9]+
or page[0-9]{1,}
In your example "/page123/xyz/page456" you may want to match all occurrences, so don't forget the g or global modifier.
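The brace quantifiers and the "match all occurrences" point can be checked with a short Python sketch. This is only an illustration: in Python, re.findall already returns every match, so no separate g flag is needed, and the helper name is_date is hypothetical.

```python
import re

path = "/page123/xyz/page456"

# {1,} is the brace spelling of +, so both patterns find the same matches.
with_plus = re.findall(r"page\d+", path)
with_braces = re.findall(r"page\d{1,}", path)
print(with_plus, with_braces)

def is_date(text):
    """True for four digits followed by two '-' groups of one or two digits."""
    return re.fullmatch(r"\d{4}(-\d{1,2}){2}", text) is not None

print(is_date("2013-5-31"))
```

The date pattern only checks the shape of the string; it does not verify that the month and day are valid.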
If I understand your problem correctly, you only need to add $ to your original regex to specify that the string must end right after page and its digits. So the regex would be
page[0-9]*$
This will also match strings that end in page alone. If you want only strings that end in page followed by at least one digit, use this regex
page[0-9]+$
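A brief Python sketch shows what the $ anchor does and does not match. It is an illustration under the assumption that the input is a single-line string; ends_with_page is a hypothetical helper name.

```python
import re

def ends_with_page(text, pattern=r"page[0-9]+$"):
    """Return the matched suffix, or None when the string does not end that way."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(ends_with_page("/xyz/page456"))                # matches at the end
print(ends_with_page("/hdhdhd/page73733/xyz"))       # None: /xyz follows the digits
print(ends_with_page("/some/page", r"page[0-9]*$"))  # * also accepts a bare "page"
```

So the anchored form fits strings that end with the page number, but not the question's example URL, where /xyz comes after page73733.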
