What is the regex pattern for extracting the substring to the left of four numbers attached to an uppercase word? - r

I have a string ARC GUNNA SPARKYA 2011QUARTER HORSE.
I'd like to extract only the ARC GUNNA SPARKYA part. I.e., everything to the left of the "2011QUARTER."
I will also have valid strings which I want the pattern NOT to match. Valid strings would be "10RUNS FAST" or "QUICKER 1".
Note that the above means I need a pattern which can explicitly pick up just any four numbers followed by the uppercase word "QUARTER."
I tried ([0-9A-Za-z]+( [0-9A-Za-z]+)+) but that pattern matches the part I want to keep too, so I can't use it to do something like gsub.
Can you please help me understand what regex pattern will accomplish this--particularly in R?
Thank you!

You could use sub with a capture group, and use that group in the replacement.
(.*?)\s+\d{4}QUARTER\b.*
Explanation
(.*?) Capture group 1, match any character, as few as possible
\s+ Match 1+ whitespace characters
\d{4}QUARTER\b Match 4 digits followed by the word QUARTER
.* Match the rest of the line
See a regex101 demo.
text <- "ARC GUNNA SPARKYA 2011QUARTER HORSE"
result = sub("(.*?)\\s+\\d{4}QUARTER\\b.*", "\\1", text)
result
Output
[1] "ARC GUNNA SPARKYA"
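Because sub returns its input unchanged when the pattern finds nothing, the same call should leave the "valid" strings from the question alone. A minimal sketch, assuming the three strings below as test input; perl = TRUE is my own addition here so that \\d, \\s and the lazy quantifier are handled by PCRE, and is not part of the answer above:

```r
# Test vector: the first element is the string to trim,
# the other two are the strings from the question that must not change.
texts <- c("ARC GUNNA SPARKYA 2011QUARTER HORSE", "10RUNS FAST", "QUICKER 1")

# Keep only capture group 1 (everything before the 4-digit QUARTER token).
result <- sub("(.*?)\\s+\\d{4}QUARTER\\b.*", "\\1", texts, perl = TRUE)

print(result)
# Elements: "ARC GUNNA SPARKYA", "10RUNS FAST", "QUICKER 1"
```

Since sub is vectorised over its input, the same line can be applied directly to a data frame column.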

Related

How do I add a space between two characters using regex in R?

I want to add a space between two punctuation characters (+ and -).
I have this code:
s <- "-+"
str_replace(s, "([:punct:])([:punct:])", "\\1\\s\\2")
It does not work.
May I have some help?
There are several issues here:
The [:punct:] pattern in the ICU regex flavor does not match math symbols (\p{S}); it only matches punctuation proper (\p{P}). If you want to match both, combine the two classes: [\p{P}\p{S}]
The "\\1\\s\\2" replacement contains a \s regex escape sequence, which is not supported in replacement patterns; you need to use a literal space
str_replace only replaces the first occurrence; use str_replace_all to handle all matches
Even if you use all the above suggestions, it still won't work for strings like -+?/. You need to make the second part of the regex a zero-width assertion, a positive lookahead, in order not to consume the second punctuation.
So, you can use
library(stringr)
s <- "-+?="
str_replace_all(s, "([\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", "\\1 ")
str_replace_all(s, "(?<=[\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", " ")
gsub("(?<=[[:punct:]])(?=[[:punct:]])", " ", s, perl=TRUE)
See the R demo online; all three lines yield the output [1] "- + ? =".
Note that in the PCRE regex flavor (used with gsub and perl=TRUE) a POSIX character class must be put inside a bracket expression, hence the use of double brackets in [[:punct:]].
Also, (?<=[[:punct:]]) is a positive lookbehind that checks for the presence of its pattern immediately on the left, and since it is non-consuming there is no need of any backreference in the replacement.
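To see why the second character must not be consumed, it can help to compare a two-group pattern with the lookahead version side by side. This is a base R sketch using gsub with perl = TRUE and [[:punct:]]; it illustrates the idea and is not the stringr code from the answer.

```r
s <- "-+?="

# Both characters are consumed: the matches are "-+" and "?=",
# so the pair "+?" in the middle is never examined.
consumed <- gsub("([[:punct:]])([[:punct:]])", "\\1 \\2", s, perl = TRUE)

# Only the first character is consumed; the lookahead merely checks the next one,
# which stays available as the start of the following match.
peeked <- gsub("([[:punct:]])(?=[[:punct:]])", "\\1 ", s, perl = TRUE)

print(consumed)  # "- +? ="
print(peeked)    # "- + ? ="
```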

R: How to use stringr to extract the substring as the output to mutate a column of strings that begins with a string pattern and end with a number?

I'm creating a small example to be put into mutate(). Not sure why this doesn't work.
> str_extract("rs1234-<b>C</b>","^rs*\\d$")
[1] NA
It'd be great if you could point out my misunderstanding of the language instead of merely providing a solution. I expect to get "rs1234".
The ^rs*\d$ regex matches
^ - start of string
rs* - r and zero or more occurrences of s char
\d - a digit
$ - end of string.
So, your pattern matches strings like rsssss1, r3, etc.
You need
str_extract("rs1234-<b>C</b>", "^rs\\d+")
where ^rs\d+ matches rs at the start of string and then one or more digits. See this regex demo.
But if I just want the substring in between "rs" and the last number. What should I do?
You would use rs.*\d:
str_extract("rs1234-<b>C</b>", "rs.*\\d")
where rs.*\d matches rs, then any zero or more chars other than line break chars as many as possible and then a digit.
NOTE: If you need to match line endings, too, you need to prepend the last pattern with (?s) inline DOTALL modifier.
See this regex demo.
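The three patterns discussed above can be compared without stringr. The sketch below uses base R's regexpr and regmatches; first_match is a hypothetical helper name introduced only for this illustration, and the string "rs1234-x9y" is made-up input to show how the greedy .* reaches the last digit.

```r
# Hypothetical helper: return the first match of `pattern` in `text`, or NA.
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (m[[1]] == -1) NA_character_ else regmatches(text, m)
}

x <- "rs1234-<b>C</b>"

a <- first_match("^rs*\\d$", x)            # NA: needs exactly one digit, then end of string
b <- first_match("^rs\\d+", x)             # "rs1234"
c1 <- first_match("rs.*\\d", x)            # "rs1234": the last digit in x is the 4
c2 <- first_match("rs.*\\d", "rs1234-x9y") # "rs1234-x9": greedy .* runs to the last digit

print(c(a, b, c1, c2))
```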

Find occurrences with regex and then only remove first character in matched expression

Surprisingly I haven't found a satisfactory answer to this regex problem. I have the following vector:
row1
[1] "AA.8.BB.CCCC" "2017" "3.166.5" "3.080.2" "68" "162.6"
[7] "185.223.632.4" "500.332.1"
My end result should look like this:
row1
[1] "AA.8.BB.CCCC" "2017" "3,166.5" "3,080.2" "68" "162.6"
[7] "185,223,632.4" "500,332.1"
The last period in each of the numeric values is the decimal point and the other periods should be converted to commas. I want this done without affecting the value with letters ([1]). I tried the following:
gsub("[.]\\d{3}[.]", ",", row1)
This regex sort of works but doesn't quite do what I want. Additionally it removes the numbers, which is problematic. Is there a way to find the regex and then only remove the first character and not the entire matched values? If there is a better way of approaching this I welcome those responses as well.
You can use the following:
See code in use here
gsub("\\G\\d+\\K\\.(?=\\d+(?!$))", ",", row1, perl = TRUE)
See regex in use here
Note: The regex at the URL above is changed to (?:\G|^) for display purposes (\G matches the start of the string \A, but not the start of the line).
\G\d+\K\.(?=\d+(?!$))
How it works:
\G asserts position either at the end of the previous match or at the start of the string
\d+\K\. matches a digit one or more times, then resets the match (previously consumed characters are no longer included in the final match), then match a dot . literally
(?=\d+(?!$)) positive lookahead ensuring what follows is one or more digits, but not followed by the end of the line
One option is to use a combination of a lookbehind and a lookahead to match only a dot when what is on the left is a digit and on the right are 3 digits followed by a dot.
With gsub, set perl = TRUE so that the lookarounds are supported.
In the replacement use a comma.
(?<=\d)[.](?=\d{3}[.])
Regex demo | R demo
Double escaped as noted by @r2evans
(?<=\\d)[.](?=\\d{3}[.])
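Applied to the full vector from the question, the lookaround pattern can be checked like this. A minimal sketch, assuming row1 is rebuilt by hand from the printed output in the question:

```r
# row1 as printed in the question.
row1 <- c("AA.8.BB.CCCC", "2017", "3.166.5", "3.080.2", "68", "162.6",
          "185.223.632.4", "500.332.1")

# Replace a dot only when a digit precedes it and exactly the sequence
# "three digits + another dot" follows, i.e. a thousands separator.
fixed <- gsub("(?<=\\d)[.](?=\\d{3}[.])", ",", row1, perl = TRUE)

print(fixed)
# Elements: "AA.8.BB.CCCC", "2017", "3,166.5", "3,080.2", "68", "162.6",
#           "185,223,632.4", "500,332.1"
```

The element with letters is left alone because none of its dots has a digit on the left and three digits plus a dot on the right.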

How to add the removed space in a sentence?

I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitespace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl= T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
#[1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't achieve your replacement using a single gsub, because you have two separate requirements: first, you want to remove all lowercase letters at the beginning of the string; second, you want to introduce a space before every capital letter except the first capital letter of the string that remains after that removal.
Doing it in a single gsub call would be possible only if some of the existing characters could somehow be reused to make the replacement conditional, which is not the case here. So, as a first step, you can use the ^[a-z]+ regex to get rid of the lowercase letters at the beginning of the string only,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
As the next step, you can use the (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one, as you might not want an extra space before your sentence. You can combine both steps and write them like this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see in the demo, every capital letter is preceded by a pink mark, which is the place where a space will get inserted. But we don't want a space to be inserted before the very first letter, as that is not needed. Hence we need a condition in the regex that will not select the first capital letter when it appears at the start of the string. For that, we use a negative lookbehind, (?<!^), which means "do not select a position that is preceded by the start of the string". This (?<!^) therefore discards the uppercase letter that sits right at the start of the string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any further queries.
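The two steps can also be written out with intermediate values, which makes each requirement easy to inspect. This sketch is a variation on the answer above rather than its exact code: instead of the zero-width (?<!^)(?=[A-Z]) pattern it captures the capital letter and uses a (?<=.) lookbehind to skip a capital at the very start of the string.

```r
x <- "marchTextIWantToDisplayWithSpacesmarch"

# Step 1: drop the leading run of lowercase letters only.
stripped <- sub("^[a-z]+", "", x)

# Step 2: put a space before each capital letter that has some character before it.
# The first capital has nothing to its left, so the lookbehind rejects it.
spaced <- gsub("(?<=.)([A-Z])", " \\1", stripped, perl = TRUE)

print(stripped)  # "TextIWantToDisplayWithSpacesmarch"
print(spaced)    # "Text I Want To Display With Spacesmarch"
```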
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
Here is an alternative with a single call to gsubfn regex with some ifelse logic:
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the next letter into Group 1, which is passed to the anonymous function as the n argument. If the length of n is non-zero, this alternative matched and we need to replace with this value. Otherwise, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what @pushpesh-kumar-rajwanshi and @akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't think perl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?

Regex in R to extract words before a special character

I have a data frame of part-of-speech tagged strings
Example:
best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ
I want to remove the '_' and the tags after it so that I have the output
best phone only issue camera sensor have mind own
I am using R and I couldn't find an appropriate regex for the gsub function.
I tried this.
sentence= c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
o1=gsub("\\_.*","",sentence, perl = T)
But this removes the entire string after the first underscore. Thanks in advance
You may use _[A-Z]+ TRE pattern with gsub:
sentence <- c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
gsub("_[A-Z]+","",sentence)
[1] "best phone only issue camera sensor have mind own"
See the R demo
The _[A-Z]+ pattern matches an underscore (_, note it does not have to be escaped in a regex pattern) and one or more (+) uppercase ASCII letters ([A-Z]).
You may make the pattern more precise, say, to match the _ only if it is preceded by a word char and to match the uppercase letters only when they are followed by a word boundary:
"\\B_[A-Z]+\\b"
In case you want to create a very specific regex for the POS values, you may use alternation:
"\\B_(JJ|NN|CC|[VR]B)\\b"
And continue adding |<code> to the regex pattern.
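The difference between the generic and the tag-specific patterns shows up on the sample sentence itself. A small sketch, with perl = TRUE added as my own choice so that \\B and \\b are interpreted by PCRE; the extended pattern with JJS is only an example of "adding another code":

```r
sentence <- "best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ"

# Generic: any run of uppercase letters after the underscore.
generic <- gsub("\\B_[A-Z]+\\b", "", sentence, perl = TRUE)

# Specific list without JJS: "_JJS" survives, because "_JJ" is followed by "S"
# and therefore not by a word boundary.
partial <- gsub("\\B_(JJ|NN|CC|[VR]B)\\b", "", sentence, perl = TRUE)

# Specific list extended with JJS.
extended <- gsub("\\B_(JJS|JJ|NN|CC|[VR]B)\\b", "", sentence, perl = TRUE)

print(generic)   # "best phone only issue camera sensor have mind own"
print(partial)   # "best_JJS phone only issue camera sensor have mind own"
print(extended)  # "best phone only issue camera sensor have mind own"
```

The alternation is stricter, but every tag that can occur in the data has to be listed explicitly.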

Resources