I'm working with long strings in R such as:
string <- "end of section. 3. LESSONS. The previous LESSONS are very important as seen in Figure 1. This text is also important. Figure 1: Blah blah blah"
I would like to extract the substring between the first occurrence of 'LESSONS' and the last occurrence of 'Figure 1', as follows:
"The previous LESSONS are very important as seen in Figure 1. This text is also important."
I tried the following, but it returns the substring after the last occurrence of 'LESSONS', not the first:
gsub(".*LESSONS (.*) Figure 1.*", "\\1", string)
#[1] "are very important as seen in Figure 1. This text is also important."
Also tried the following but it cuts the string after the first occurrence of 'Figure 1', not the last:
library(qdapRegex)
ex_between(string, "LESSONS", "Figure 1")
#[[1]]
#[1] ". The previous LESSONS are very important as seen in"
I'd appreciate any help!
You were very close. Make the quantifier before "LESSONS" non-greedy (.*?) so that it stops at the first occurrence.
Also, sub is enough here instead of gsub, since only one replacement is needed.
sub(".*?LESSONS\\.\\s*(.*) Figure 1.*", "\\1", string)
#[1] "The previous LESSONS are very important as seen in Figure 1. This text is also important."
You can use str_extract from the package stringr together with a positive lookbehind, (?<=...), and a positive lookahead, (?=...), to define the parts of the string that delimit the part you want to extract:
str_extract(string, "(?<=LESSONS\\.\\s).*(?=\\sFigure 1)")
[1] "The previous LESSONS are very important as seen in Figure 1. This text is also important."
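For readers without stringr, the same lookaround pattern can be sketched in base R. This assumes perl = TRUE, because R's default regex engine does not support lookarounds, and it reuses the string from the question. The variable names m and extracted are only illustrative.

```r
string <- "end of section. 3. LESSONS. The previous LESSONS are very important as seen in Figure 1. This text is also important. Figure 1: Blah blah blah"

# regexpr() locates the first match; perl = TRUE enables lookbehind/lookahead.
m <- regexpr("(?<=LESSONS\\.\\s).*(?=\\sFigure 1)", string, perl = TRUE)

# regmatches() pulls the matched text out of the original string.
extracted <- regmatches(string, m)
print(extracted)
```

The greedy .* between the two lookarounds is what makes the match run up to the last " Figure 1" rather than the first.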
I'm trying to use stringr/dplyr to extract a pathway name from a table cell containing excess information. All cells in this table follow the same general format. Some examples are:
(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.
(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.
3,4',5-trihydroxystilbene biosynthesis
From these examples, I want to extract "(R)-lactate from methylglyoxal", "(S)-dihydroorotate from bicarbonate", and "3,4',5-trihydroxystilbene biosynthesis" respectively. I'm struggling to figure out which combination of regular expressions to use in order to accomplish this. I've been trying to use the positive look behind assertion ?<=... along with str_extract to extract all information preceding the first ":", but I can't get it to work. Any help would be appreciated!
Please try the following pattern:
(?<=^)(.+?)(:|$)
(?<=^) the first part anchors the match at the beginning of the string
(.+?)(:|$) the second part lazily captures at least one character up to the first ":" or the end of the string; the wanted text is in capture group 1
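A minimal sketch of how this pattern might be applied in base R, reading the result from capture group 1 so that the trailing ":" is not included. It assumes an R version in which regexec() accepts perl = TRUE. The vector s holds the examples from the question, and the name pathway is hypothetical.

```r
s <- c(
  "(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.",
  "(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
  "3,4',5-trihydroxystilbene biosynthesis"
)

# regexec() records the full match plus each capture group;
# perl = TRUE is needed for the lookbehind.
m <- regmatches(s, regexec("(?<=^)(.+?)(:|$)", s, perl = TRUE))

# Element 1 is the full match (which may end in ":"), element 2 is group 1.
pathway <- vapply(m, function(groups) groups[2], character(1))
print(pathway)
```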
You don't need any lookarounds, you can match the values using:
^[^\r\n:]+
The pattern matches:
^ Start of string
[^\r\n:]+ Match 1+ chars other than newlines or :
library(stringr)
s <- c("(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.",
"(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
"3,4',5-trihydroxystilbene biosynthesis")
str_extract(s, "^[^\\r\\n:]+")
Output
[1] "(R)-lactate from methylglyoxal"
[2] "(S)-dihydroorotate from bicarbonate"
[3] "3,4',5-trihydroxystilbene biosynthesis"
I'm pretty new to regex and am trying to detect a word with the "+" symbol when surrounded by "\\b" in long strings of words but both stringr and grepl are giving me the wrong result.
This is the code that I have written:
library(stringr)
str_detect("coversyl +", "\\bcoversyl(plus| plus|\\+| \\+)\\b")
The output is FALSE which is wrong.
What would be the right way to do it?
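My guess is that your expression is almost fine. The trailing \\b is what fails: "+" is not a word character, so there is no word boundary between it and the end of the string. Dropping that \\b and matching the space explicitly gives: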
\\bcoversyl\\b\\s(\\bplus\\b|\\+)
If more than one space is possible, change \\s to \\s+:
\\bcoversyl\\b\\s+(\\bplus\\b|\\+)
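The behaviour of the trailing word boundary can be checked with a short sketch. It uses base R's grepl with perl = TRUE as an assumption (the question used stringr), and the test strings other than "coversyl +" are made up for illustration.

```r
# "+" is not a word character, so there is no word boundary between
# "+" and the end of the string: the trailing \b cannot match here.
at_end <- grepl("\\bcoversyl \\+\\b", "coversyl +", perl = TRUE)

# With a word character right after the "+", the same trailing \b matches.
before_word <- grepl("\\bcoversyl \\+\\b", "coversyl +a", perl = TRUE)

# The pattern from this answer, without the trailing \b.
inputs <- c("coversyl +", "coversyl plus", "coversyl   +", "coversyl")
hits <- grepl("\\bcoversyl\\b\\s+(\\bplus\\b|\\+)", inputs, perl = TRUE)

print(c(at_end = at_end, before_word = before_word))
print(hits)
```

Note that this pattern requires at least one space, so it does not cover the "coversylplus" form from the original expression.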
This is a silly question, but I can't seem to find a solution in R online. I am trying to remove an isolated number from a long string. For example, I would like to remove the number 27198 from the sentence below.
x <- "hello3 my name 27198 is 5joey"
I tried the following:
gsub("[0-9]","",x)
Which results in:
"hello my name is joey"
But I want:
"hello3 my name is 5joey"
This seems really simple, but I am not well versed with regular expressions. Thanks for your help!
We can specify a word boundary (\\b) on both sides of one or more digits ([0-9]+):
gsub("\\b[0-9]+\\b", "", x)
#[1] "hello3 my name  is 5joey"
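One detail worth checking: removing only the digits leaves the spaces on both sides of the number in place. A possible variant, offered as a sketch rather than as part of the original answer, also consumes the whitespace before the number. perl = TRUE is an assumption made here for predictable \\b and \\s handling.

```r
x <- "hello3 my name 27198 is 5joey"

# Removing only the digits keeps both surrounding spaces.
digits_only <- gsub("\\b[0-9]+\\b", "", x, perl = TRUE)

# Also consuming the whitespace before the number avoids the double space.
with_space <- gsub("\\s*\\b[0-9]+\\b", "", x, perl = TRUE)

print(digits_only)
print(with_space)
```

If the isolated number could appear at the very start of the string, a leading space would remain, and trimws() could be applied afterwards.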
I'm trying to understand a regular expression someone has written in the gsub() function.
I've never used regular expressions before seeing this code, and I have tried to work out how it's getting the final result with some googling, but I have hit a wall, so to speak.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters in the given string. In the above example it would return "at". This is the expected result, but from my brief foray into regular expressions I don't understand why this code does what it does.
What I understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets was in place of the '\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.
All that does, though, is output the actual code as the replacement, e.g. ".{2}$".
Finally, I don't understand why '\1' is in the replacement part of the function. To me this is just saying replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is just there to make the second backslash a non-special character.
There are two ways of using gsub. The most common way is probably:
gsub("-","TEST","This is a - ")
which would return
This is a TEST
This simply finds the matches of the regular expression and replaces them with the replacement string.
The second way to use gsub is the one you described, using \\1, \\2, \\3, ...
These refer to the first, second, or third capture group in your regular expression.
A capture group is anything inside parentheses, e.g. (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct.
What I understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string
The last two characters are placed in a capture group, and we simply replace the whole string with this capture group. We are not replacing those two characters with anything.
If it helps, check out the result of this expression:
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
Hope these examples help you understand it better:
Say we have a string foobarabcabcdef
.* matches whole string.
.*abc matches any characters from the beginning up to the last abc (greedy matching), thus it matches foobarabcabc
.*(...)$ matches the whole string as well, but the last 3 characters are grouped. Even without (), the match has a default group, group 0; each () adds group 1, 2, 3, ... Consider .*(...)(...)(...)$, which gives:
group 0 : whole string
group 1 : "abc" the first "abc"
group 2 : "abc" the 2nd "abc"
group 3 : "def" the last 3 chars
So, back to your example, \\1 is a reference to group 1. What it does is: "replace the whole match with the text matched by group 1". That is, the text matched by .{2}$ is the replacement.
If you don't understand the backslashes, check the R syntax for string literals. It is all about escaping: in an R string, "\\1" is the two characters \1.
The important part of that regular expression is the parentheses, which form a "capturing group".
The regular expression .*(.{2}$) says: match anything and capture the last 2 characters of the line. The replacement \\1 refers to that group, so the whole match is replaced with the captured group, which is the last two characters in this case.
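The group numbering and the backslash escaping described in these answers can be inspected directly. The following is a small sketch using base R's regexec() and regmatches(); the variable names groups, n_chars and last_two are illustrative only.

```r
x <- "foobarabcabcdef"

# regexec() returns the whole match (group 0) followed by groups 1, 2, 3.
groups <- regmatches(x, regexec(".*(...)(...)(...)$", x))[[1]]
print(groups)

# In an R string literal, "\\1" is two characters: a backslash and "1".
# The regex engine receives \1 and reads it as "the text of group 1".
n_chars <- nchar("\\1")
print(n_chars)

# So the whole match is replaced by what group 1 captured.
last_two <- sub(".*(.{2}$)", "\\1", "my big fluffy cat")
print(last_two)
```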
I'm trying to fix a dataset that has some errors of decimal numbers wrongly typed. For example, some entries were typed as ".15" instead of "0.15". Currently this column is chr but later I need to convert it to numeric.
I'm trying to select all of those "words" that start with a period "." and replace the period with "0." but it seems that the "^" used to anchor the start of the string doesn't work nicely with the period.
I tried with:
dataIMN$precip <- str_replace (dataIMN$precip, "^.", "0.")
But it puts a 0 at the beginning of all the entries, including the ones that are correctly typed (those that don't start with a period).
If you need to do as you've stated, a period inside brackets [] is matched literally, or you can use '\\' to escape the period:
Option 1:
gsub("^[.]","0.",".54")
[1] "0.54"
Option 2:
gsub("^\\.","0.",".54")
[1] "0.54"
Otherwise, as.numeric should also take care of it automatically.
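To see the difference on a whole column, here is a sketch with a made-up vector precip standing in for dataIMN$precip. The values are hypothetical, and sub() is used in place of str_replace() purely to stay in base R.

```r
precip <- c(".15", "0.20", "3.5", ".7")  # hypothetical sample values

# Unescaped "." matches any character, so the first character of every
# entry is replaced, including the correctly typed ones.
wrong <- sub("^.", "0.", precip)

# Escaped period: only entries starting with a literal "." are changed.
fixed <- sub("^\\.", "0.", precip)

# as.numeric() already parses a leading "." as a decimal point.
nums <- as.numeric(precip)

print(wrong)
print(fixed)
print(nums)
```

If the column is only needed as numeric, the conversion alone may be enough and the string fix can be skipped.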