Is there a simple way to get substring in R? - r

i get the substring of word in the following way:
word="xyz9874"
pattern="[0-9]+"
x=gregexpr(pattern,word)
substr(word,start=x[[1]],stop=x[[1]]+attr(x[[1]],"match.length")-1)
[1] "9874"
Is there a more simple way to get the result in R?

Sure, use gsub and backreferencing:
gsub( ".*?([0-9]+).*", "\\1", word )
Explanation: in most regex implementations, \1 is the back reference to the first subpattern matched. The subpattern is enclosed in parentheses. In R, you need to escape the backslash irrespective of the type of quotation marks you are using.
The question mark, an idiom of the "extended" regular expressions means that the given regex pattern should not be greedy, in other words -- it should take as little of the string as possible. Othrewise, the .* in the pattern .*([0-9]+) would match xyz987 and ([0-9]+) would match 4. Alternatively, we can write
gsub( ".*[^0-9]+([0-9]+).*", "\\1", word )
but then we have a problem with strings that start with a number.
By the way, note that instead of [0-9] you can write \d, or, actually, \\d:
gsub( ".*?(\\d+).*", "\\1", word )

Related

What is the regex pattern for extracting the substring to the left of four numbers attached to an uppercase word?

I have a string ARC GUNNA SPARKYA 2011QUARTER HORSE.
I'd like to extract only the ARC GUNNA SPARKYA part. I.e., everything to the left of the "2011QUARTER."
I will also have valid strings which I want the pattern NOT to match. Valid strings would be "10RUNS FAST" or "QUICKER 1".
Note that the above means I need a pattern which can explicitly pick up just any four numbers followed by the uppercase word "QUARTER."
I tried ([0-9A-Za-z]+( [0-9A-Za-z]+)+) but that pattern matches the part I want to keep too, so I can't use it to do something like gsub.
Can you please help me understand what regex pattern will accomplish this--particularly in R?
Thank you!
You could use sub with a capture group, and use that group in the replacement.
(.*?)\s+\d{4}QUARTER\b.*
Explanation
(.*?) Capture group 1, match any character, as few as possible
\s+ Match 1+ whitespace characters
\d{4}QUARTER\b Match 4 digits followed by the word QUARTER
.* Match the rest of the line
See a regex101 demo.
text <- "ARC GUNNA SPARKYA 2011QUARTER HORSE"
result = sub("(.*?)\\s+\\d{4}QUARTER\\b.*", "\\1", text)
result
Output
[1] "ARC GUNNA SPARKYA"

How do I add a space between two characters using regex in R?

I want to add a space between two punctuation characters (+ and -).
I have this code:
s <- "-+"
str_replace(s, "([:punct:])([:punct:])", "\\1\\s\\2")
It does not work.
May I have some help?
There are several issues here:
[:punct:] pattern in an ICU regex flavor does not match math symbols (\p{S}), it only matches punctuation proper (\p{P}), if you still want to match all of them, combine the two classes, [\p{P}\p{S}]
"\\1\\s\\2" replacement contains a \s regex escape sequence, and these are not supported in the replacement patterns, you need to use a literal space
str_replace only replaces one, first occurrence, use str_replace_all to handle all matches
Even if you use all the above suggestions, it still won't work for strings like -+?/. You need to make the second part of the regex a zero-width assertion, a positive lookahead, in order not to consume the second punctuation.
So, you can use
library(stringr)
s <- "-+?="
str_replace_all(s, "([\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", "\\1 ")
str_replace_all(s, "(?<=[\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", " ")
gsub("(?<=[[:punct:]])(?=[[:punct:]])", " ", s, perl=TRUE)
See the R demo online, all three lines yield [1] "- + ? =" output.
Note that in PCRE regex flavor (used with gsub and per=TRUE) the POSIX character class must be put inside a bracket expression, hence the use of double brackets in [[:punct:]].
Also, (?<=[[:punct:]]) is a positive lookbehind that checks for the presence of its pattern immediately on the left, and since it is non-consuming there is no need of any backreference in the replacement.

Pattern to match only characters within parentheses

I have looked at lots of posts here on SO with suggestions on REGEX patterns to grab texts from parentheses. However, from what I have looked into I cannot find a solution that works.
For example, I have had a look at the following:
R - Regular Expression to Extract Text Between Parentheses That Contain Keyword, Extract text in parentheses in R, regex to pickout some text between parenthesis [duplicate]
In the following order, here were the top answers solutions (with some amendments):
pattern1= '\\([^()]*[^()]*\\)'
pattern2= '(?<=\\()[^()]*(?=\\))'
pattern3= '.*\\((.*)\\).*'
all_patterns = c(pattern1, pattern2, pattern3)
I have used the following:
sapply(all_patterns , function(x)stringr::str_extract('I(data^2)', x))
\\([^()]*[^()]*\\) (?<=\\()[^()]*(?=\\)) .*\\((.*)\\).*
"(data^2)" "data^2" "I(data^2)"
None of these seem to only grab the characters within the brackets, so how can I just grab the characters inside brackets?
Expected output:
data
With str_extract, it would extract all those characters matched in the patterns. Instead, use a regex lookaround to match one or more characters that are not a ^ or the closing bracket ()) ([^\\^\\)]+) that succeeds an opening bracket ((?<=\\() - these are escaped (\\) as they are metacharacters
library(stringr)
str_extract('I(data^2)', '(?<=\\()[^\\^\\)]+')
# [1] "data"
Here is combinations of str_extract and str_remove
library(stringr)
str_extract(str_remove('I(data^2)', '.\\('), '\\w*')
[1] "data"

R - replace last instance of a regex match and everything afterwards

I'm trying to use a regex to replace the last instance of a phrase (and everything after that phrase, which could be any character):
stringi::stri_replace_last_regex("_AB:C-_ABCDEF_ABC:45_ABC:454:", "_ABC.*$", "CBA")
However, I can't seem to get the refex to function properly:
Input: "_AB:C-_ABCDEF_ABC:45_ABC:454:"
Actual output: "_AB:C-CBA"
Desired output: "_AB:C-_ABCDEF_ABC:45_CBA"
I have tried gsub() as well but that hasn't worked.
Any ideas where I'm going wrong?
One solution is:
sub("(.*)_ABC.*", "\\1_CBA", Input)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Have a look at what stringi::stri_replace_last_regex does:
Replaces with the given replacement string last substring of the input that matches a regular expression
What does your _ABC.*$ pattern match inside _AB:C-_ABCDEF_ABC:45_ABC:454:? It matches the first _ABC (that is right after C-) and all the text after to the end of the line (.*$ grabs 0+ chars other than line break chars to the end of the line). Hence, you only have 1 match, and it is the last.
Solutions can be many:
1) Capturing all text before the last occurrence of the pattern and insert the captured value with a replacement backreference (this pattern does not have to be anchored at the end of the string with $):
sub("(.*)_ABC.*", "\\1_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
2) Using a tempered greedy token to make sure you only match any char that does not start your pattern up to the end of the string after matching it (this pattern must be anchored at the end of the string with $):
sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
Note that this pattern will require perl=TRUE argument to be parsed with a PCRE engine with sub (or you may use stringr::str_replace that is ICU regex library powered and supports lookaheads)
3) A negative lookahead may be used to make sure your pattern does not appear anywhere to the right of your pattern (this pattern does not have to be anchored at the end of the string with $):
sub("(?s)_ABC(?!.*_ABC).*", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
See the R demo online, all these three lines of code returning _AB:C-_ABCDEF_ABC:45_CBA.
Note that (?s) in the PCRE patterns is necessary in case your strings may contain a newline (and . in a PCRE pattern does not match newline chars by default).
Arguably the safest thing to do is using a negative lookahead to find the last occurrence:
_ABC(?:(?!_ABC).)+$
Demo
gsub("_ABC(?:(?!_ABC).)+$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Using gsub and back referencing
gsub("(.*)ABC.*$", "\\1CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
[1] "_AB:C-_ABCDEF_ABC:45_CBA"

Regex in R to extract words before a special character

I having a dataframe of part of speech tagged strings
Example:
best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ
I want to remove the tags after/and the '_' so that I have the output
best phone only issue camera sensor have mind own
I am using R and I couldn't find an appropriate regex for the gsub function.
I tried this.
sentence= c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
o1=gsub("\\_.*","",sentence, perl = T)
But This removes entire string after the first underscore. Thanks in Advance
You may use _[A-Z]+ TRE pattern with gsub:
sentence <- c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
gsub("_[A-Z]+","",sentence)
[1] "best phone only issue camera sensor have mind own"
See the R demo
The _[A-Z]+ pattern matches an underscore (_, note it does not have to be escaped in a regex pattern) and one or more (+) uppercase ASCII letters ([A-Z]).
You may further precise the pattern, say, to only match the _ if it is preceded with a word char and match uppercase letters only when followed with a word boundary:
"\\B_[A-Z]+\\b
In case you want to create a very specific regex for the POS values, you may use alternation:
"\\B_(JJ|NN|CC|[VR]B)\\b"
And continue adding |<code> to the regex pattern.

Resources