Pattern lookup within a string in R using regular expression matching - r

I am trying to pick patterns within a specific string and their respective location. I have explained below with an example:
String = "Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-"
pattern = "Web_797-*Web_797" (Web_797 followed by Web_797 with anything in between)
I used the following function:
str_locate_all(String,pattern)[[1]]
I am getting the following result:
start end
[1,] 1 15
[2,] 17 31
which is partially what I need. However, the pattern does not pick up the following combination (highlighted in the original post):
String = "Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-"
I would appreciate it if anyone could help with this. I believe there is something wrong with the way I am defining the pattern, but I am not able to fix it.

The problem with your pattern
pattern = "Web_797-*Web_797"
is the -* part, which means zero or more dashes (-). I believe what you wanted was a dash followed by any characters. So a first (incorrect) attempt would be
pattern = "Web_797-.*Web_797"
where the . means "any character" and .* means "any number of them". But that is not quite right: you only want to collect characters until the next time you see Web_797, not all the way to the last time you see Web_797. By default, quantifiers are "greedy" and take the biggest possible match. If we use
pattern = "Web_797-.*?Web_797"
the ? makes the .* lazy (non-greedy), so it only matches up to the next Web_797.
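The difference between the three patterns can be checked directly. The following is a minimal sketch that uses base R's gregexpr() (with perl = TRUE) as a stand-in for str_locate_all(), and assumes the string from the question is a single line with no embedded newline. The helper name locate_all is made up for this illustration.

```r
# The string from the question, assembled without a line break (assumption).
s <- paste0(
  "Web_797-Web_797-Web_797-Web_797-PCP_IM_PAR-Pharm_1-Pharm_1-",
  "Web_797-PCP_IM_PAR-Prior_OP-Web_797-Prior_OP-Event_0-"
)

# Hypothetical helper: start/end positions of all non-overlapping matches.
locate_all <- function(pattern, x) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  start <- as.integer(m)
  cbind(start = start, end = start + attr(m, "match.length") - 1L)
}

original <- locate_all("Web_797-*Web_797", s)    # "-*" = zero or more dashes
greedy   <- locate_all("Web_797-.*Web_797", s)   # runs to the last Web_797
lazy     <- locate_all("Web_797-.*?Web_797", s)  # stops at the next Web_797

print(original)
print(greedy)
print(lazy)
```

Matches never overlap, so the Web_797 that ends one match cannot also start the next one. With the lazy pattern the third match should cover the Web_797-PCP_IM_PAR-Prior_OP-Web_797 stretch that the original pattern misses.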

Related

Extracting a specific identifier from a column containing excess information

I'm trying to use stringr/dplyr to extract a pathway name from a table cell containing excess information. All cells in this table follow the same general format. Some examples are:
(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.
(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.
3,4',5-trihydroxystilbene biosynthesis
From these examples, I want to extract "(R)-lactate from methylglyoxal", "(S)-dihydroorotate from bicarbonate", and "3,4',5-trihydroxystilbene biosynthesis" respectively. I'm struggling to figure out which combination of regular expressions to use in order to accomplish this. I've been trying to use the positive look behind assertion ?<=... along with str_extract to extract all information preceding the first ":", but I can't get it to work. Any help would be appreciated!
please try the following pattern:
(?<=^)(.+?)(:|$)
(?<=^) the first part anchors the match at the beginning of the string
(.+?)(:|$) the second part matches at least one character, as few as possible, up to the first ":" or the end of the string
You don't need any lookarounds, you can match the values using:
^[^\r\n:]+
The pattern matches:
^ Start of string
[^\r\n:]+ Match 1+ chars other than newlines or :
library(stringr)
s <- c("(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.",
"(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
"3,4',5-trihydroxystilbene biosynthesis")
str_extract(s, "^[^\\r\\n:]+")
Output
[1] "(R)-lactate from methylglyoxal"
[2] "(S)-dihydroorotate from bicarbonate"
[3] "3,4',5-trihydroxystilbene biosynthesis"

I need help figuring out why my regex does not match with what I am looking for

I am working on an R script that checks whether a data.frame is correctly built and contains the right information in the right place.
I need to make sure a row contains the right information, so I want to use a regular expression to check each cell of that row.
At first I thought it was failing because I compared the regex to a value taken directly from the table.
I used regex101.com to make sure my regular expression was correct, and it matched when the test string was put between quotes.
Then I wrapped the value in as.character(), but the result was still FALSE.
To sum up, the regex works on regex101.com but never in my R script:
test = c("b40", "b40")
".[ab][0-8]{2}." == test[1]
FALSE
I expect the output to be TRUE, but it is always FALSE
The == operator tests whether two strings are exactly equal; it does not do regex matching, so here it just compares the pattern text itself with "b40". For regex matching, we can use grepl
grepl("^[ab][0-8]{2}", test[1])
#[1] TRUE
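Here, we match either 'a' or 'b' at the start (^) of the string, followed by two digits in the range 0 to 8 (to anchor the match at the end of the string as well, add $). Note that the leading and trailing . in the original pattern each require one extra character, so that pattern cannot match the three-character string "b40" even with grepl.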
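To make the difference concrete, here is a small sketch that compares == with grepl() on the values from the question. The variable names and the longer test value "xb40y" are made up for illustration.

```r
test <- c("b40", "b40")

# == compares the two strings as-is; no regex matching is involved.
equal_result <- ".[ab][0-8]{2}." == test[1]

# grepl() treats its first argument as a regular expression.
answer_result <- grepl("^[ab][0-8]{2}", test[1])

# The question's pattern needs one extra character on each side of "b40".
question_result <- grepl(".[ab][0-8]{2}.", test[1])
longer_result   <- grepl(".[ab][0-8]{2}.", "xb40y")  # "xb40y" is a made-up value

# Anchoring both ends requires each whole value to fit the pattern.
full_result <- grepl("^[ab][0-8]{2}$", test)

print(c(equal_result, answer_result, question_result, longer_result))
print(full_result)
```

grepl() is vectorised over its second argument, so a whole row or column can be checked in one call instead of cell by cell.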

can someone explain this regular expression inside gsub()? [duplicate]

I'm trying to understand a regular expression someone has written in the gsub() function.
I had never used regular expressions before seeing this code. I have tried to work out how it gets the final result with some googling, but I have hit a wall.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters of the given string; in the example above it returns "at". This is the expected result, but from my brief foray into regular expressions I don't understand why the code does what it does.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets were in place of the '\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.
All that does, though, is output the literal pattern text as the replacement, e.g. ".{2}$".
Finally, I don't understand why '\1' is in the replacement part of the function. To me this just says: replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is only there to make the second backslash a non-special character.
There are two ways of using gsub. The most common way is probably:
gsub("-","TEST","This is a - ")
which would return
This is a TEST
This simply finds the matches of the regular expression and replaces each one with the replacement string.
The second way to use gsub is the one you described, using \\1, \\2, \\3, ...
These refer to the first, second, or third capture group in your regular expression.
A capture group is defined by anything inside parentheses, e.g. (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string
The last two characters are placed in a capture group, and we simply replace the whole match with this capture group. The two characters themselves are not replaced by anything else.
If it helps, check out the result of this expression:
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
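Running both calls side by side shows what each capture group holds. This is just a sketch using the string from the question; the variable names are made up.

```r
x <- "my big fluffy cat"

# One group: the whole match is replaced by the last two characters.
last_two <- gsub('.*(.{2}$)', '\\1', x)

# Two groups: everything before the last two characters, then those two.
both <- gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2', x)

print(last_two)
print(both)
```

The greedy .* first takes the whole string and then gives back just enough characters (two) for .{2}$ to match.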
Hopefully these examples help you understand it better:
Say we have the string foobarabcabcdef
.* matches the whole string.
.*abc matches from the beginning through the last abc (greedy matching), so it matches foobarabcabc
.*(...)$ matches the whole string as well, but the last 3 characters are grouped. Every match has a default group, group 0, which is the whole match; each () adds group 1, 2, 3, ... Consider .*(...)(...)(...)$, for which we have:
group 0 : whole string
group 1 : "abc" the first "abc"
group 2 : "abc" the 2nd "abc"
group 3 : "def" the last 3 chars
So, back to your example: \\1 is a reference to a group. What it does is "replace the whole match with the text matched by group 1". That is, the text matched by .{2}$ is the replacement.
As for the backslashes, it is all about escaping: in an R string literal, "\\1" stands for the two characters \1, which is what the regex engine receives as the backreference. See the R documentation on string escapes for details.
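The groups listed above can be inspected with base R's regexec() and regmatches(), which return the whole match followed by each capture group. The snippet below is a sketch with made-up variable names; it also counts the characters in "\\1" to illustrate the escaping point.

```r
x <- "foobarabcabcdef"

# Element 1 is the whole match (group 0); elements 2-4 are groups 1-3.
groups <- regmatches(x, regexec(".*(...)(...)(...)$", x))[[1]]

# Greedy matching: .*abc runs through the last "abc" in the string.
greedy <- regmatches(x, regexpr(".*abc", x))

# The R literal "\\1" holds two characters: a backslash and the digit 1.
n_chars <- nchar("\\1")

print(groups)
print(greedy)
print(n_chars)
```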
The important part of that regular expression is the parentheses, which form what is called a "capturing group".
The regular expression .*(.{2}$) says: match anything, and capture the last 2 characters of the line. The replacement \\1 refers to that group, so the whole match is replaced with the captured group, which is the last two characters in this case.

How do I extract a section number and the text after it?

I have a question.
My text file contains lines such as:
1.1        Description.
This is the description.
1.1.1      Quality Assurance
Random sentence.
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1        Description
1.1.1      Quality Assurance
1.6.1    Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2<-grep("^[0-9.]+", txt1, value = TRUE)
file<-write(txt2, "text3.txt")
which results in:
1.1        Description.
1.1.1      Quality Assurance
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means that if your regular expression matches anything in the line, the whole line is returned. Your regular expression matches numbers at the beginning of the line, so all the lines that begin with numbers get selected.
It seems that your goal is not to select the whole line, but only the part up to a line break or a period.
So, you need to adjust the regular expression to be more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It selects numbers with dots, followed by a space, followed by anything, and stops matching when a . comes or the line ends (the period itself, if present, is part of the match). I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is extracting only the matching portion of each line, not the whole line in which the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which helps us extract those matches:
txt1 <- readLines("text1.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
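The extraction step can be tried without a file. The sketch below puts the sample lines from the question into a character vector (an assumption standing in for readLines()) and runs the regex above with perl = TRUE. It also shows a variant, not part of the original answer, that replaces the final group with a lookahead so the trailing period is left out, which seems closer to the output the question asks for.

```r
# Sample lines from the question, used in place of readLines("text1.txt").
txt1 <- c(
  "1.1        Description.",
  "This is the description.",
  "1.1.1      Quality Assurance",
  "Random sentence.",
  "1.6.1    Quality Control. Quality Control is the responsibility of the contractor."
)

# Regex from the answer: the match includes the closing period when present.
with_period <- regmatches(
  txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1, perl = TRUE)
)

# Variant (assumption): a lookahead stops before the period instead of
# consuming it. Lookaheads need perl = TRUE in base R.
no_period <- regmatches(
  txt1, regexpr("^([0-9]\\.?)+ .+?(?=\\.|$)", txt1, perl = TRUE)
)

print(with_period)
print(no_period)
```

regmatches() drops the lines that have no match, so only the three numbered headings come back.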

regex to match page[0-9] and nothing before or after

I have a regex, but it's not quite working the way I want:
page[0-9]*
/pages/search.aspx?pageno=3&pg=232323&hdhdhd/page73733/xyz
In the above example, the only thing I want to match is page73733. But my regex matches the page in /pages and it matches page in pageno=3
I also tried page[0-9].*, which matches page73733 but also everything that comes after it, so it actually matches page73733/xyz
page[0-9].*[^a-zA-Z&?/=]
That seems to do what I want, but it also seems like an ugly way to do it. Plus, if I had something like /page123/xyz/page456, it would match that whole string.
So is there a better way to do this? I want to match ONLY the string page when it is followed by any number of digits, and if anything comes after the digits it should stop.
* means 0 or more occurrences. + means 1 or more occurrences.
page[0-9]+ should work.
page[0-9]*
Will match page followed by zero or more numbers. What you want is:
page[0-9]+
Which will match page followed by one or more numbers.
You almost got it. Just use + instead of * as that will force a match that has numbers after it.
Another way to type that expression would be
/page[0-9]+
Note the leading /: it is helpful because without it you might get a match inside something like "notApage123".
The regex page[0-9]* will match [0-9] 0 or more times. + would match it 1 or more times, and ? would match it 0 or 1 times. The quantifiers ?, * and + can equivalently be written with braces:
?={0,1}
*={0,}
+={1,}
This may be helpful if you want to match a date: \\d{4}(-\\d{1,2}){2} would match 2013-5-31
That said, the resulting Regex for your particular problem would be:
page\\d+
page\\d{1,}
page[0-9]+
or page[0-9]{1,}
In your example "/page123/xyz/page456" you may want to match all occurrences, so don't forget the g or global modifier.
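The effect of * versus + is easy to see on the URL from the question. The question does not name a language, so the following is only a sketch in R, where gregexpr() plays the role of the global modifier by returning every match; the variable names are made up.

```r
url <- "/pages/search.aspx?pageno=3&pg=232323&hdhdhd/page73733/xyz"

# "*" allows zero digits, so the bare "page" in /pages and pageno matches too.
star <- regmatches(url, gregexpr("page[0-9]*", url))[[1]]

# "+" requires at least one digit after "page".
plus <- regmatches(url, gregexpr("page[0-9]+", url))[[1]]

# All occurrences in the second example from the question.
path <- "/page123/xyz/page456"
both <- regmatches(path, gregexpr("page[0-9]+", path))[[1]]

print(star)
print(plus)
print(both)
```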
If I understand your problem correctly, you only need to add $ to your original regex to specify that the string must end right after the digits. So the regex would be
page[0-9]*$
Note that this will also match strings that end in page with no digits. If you want only strings that end in page followed by at least one digit, use this regex
page[0-9]+$

Resources