Find occurrences with regex and then only remove first character in matched expression - r

Surprisingly I haven't found a satisfactory answer to this regex problem. I have the following vector:
row1
[1] "AA.8.BB.CCCC" "2017" "3.166.5" "3.080.2" "68" "162.6"
[7] "185.223.632.4" "500.332.1"
My end result should look like this:
row1
[1] "AA.8.BB.CCCC" "2017" "3,166.5" "3,080.2" "68" "162.6"
[7] "185,223,632.4" "500,332.1"
The last period in each of the numeric values is the decimal point and the other periods should be converted to commas. I want this done without affecting the value with letters ([1]). I tried the following:
gsub("[.]\\d{3}[.]", ",", row1)
This regex sort of works but doesn't quite do what I want. Additionally it removes the numbers, which is problematic. Is there a way to find the regex and then only remove the first character and not the entire matched values? If there is a better way of approaching this I welcome those responses as well.

You can use the following:
gsub("\\G\\d+\\K\\.(?=\\d+(?!$))", ",", row1, perl = TRUE)
Note: \G matches only at the start of the string (like \A), not at the start of each line, so an online tester that receives the values one per line needs (?:\G|^) in place of \G.
\G\d+\K\.(?=\d+(?!$))
How it works:
\G asserts position either at the end of the previous match or at the start of the string
\d+\K\. matches one or more digits, then resets the match (previously consumed characters are no longer included in the final match), then matches a dot . literally
(?=\d+(?!$)) positive lookahead ensuring what follows is one or more digits that are not followed by the end of the line. Because \d+ can backtrack, this relies on there being exactly one digit after the decimal point, as in the sample data.

One option is to use a combination of a lookbehind and a lookahead to match only a dot when what is on the left is a digit and on the right are 3 digits followed by a dot.
With gsub, set perl = TRUE, since lookarounds require the PCRE engine.
In the replacement use a comma.
(?<=\d)[.](?=\d{3}[.])
Double escaped for use in an R string, as noted by @r2evans:
(?<=\\d)[.](?=\\d{3}[.])
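As a quick check of the lookaround approach, the sketch below applies the double-escaped pattern to the vector from the question. The variable name fixed is introduced here purely for illustration.

```r
# Vector from the question
row1 <- c("AA.8.BB.CCCC", "2017", "3.166.5", "3.080.2", "68", "162.6",
          "185.223.632.4", "500.332.1")

# Replace a dot only when a digit precedes it and three digits plus
# another dot follow it; lookarounds need perl = TRUE.
fixed <- gsub("(?<=\\d)[.](?=\\d{3}[.])", ",", row1, perl = TRUE)
print(fixed)
```

Because lookarounds consume no characters, only the dot itself is replaced and the surrounding digits stay in place, which is what the attempt in the question was missing.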

Related

What is the regex pattern for extracting the substring to the left of four numbers attached to an uppercase word?

I have a string ARC GUNNA SPARKYA 2011QUARTER HORSE.
I'd like to extract only the ARC GUNNA SPARKYA part. I.e., everything to the left of the "2011QUARTER."
I will also have valid strings which I want the pattern NOT to match. Valid strings would be "10RUNS FAST" or "QUICKER 1".
Note that the above means I need a pattern which can explicitly pick up just any four numbers followed by the uppercase word "QUARTER."
I tried ([0-9A-Za-z]+( [0-9A-Za-z]+)+) but that pattern matches the part I want to keep too, so I can't use it to do something like gsub.
Can you please help me understand what regex pattern will accomplish this--particularly in R?
Thank you!
You could use sub with a capture group, and use that group in the replacement.
(.*?)\s+\d{4}QUARTER\b.*
Explanation
(.*?) Capture group 1, match any character, as few as possible
\s+ Match 1+ whitespace characters
\d{4}QUARTER\b Match 4 digits followed by the word QUARTER
.* Match the rest of the line
text <- "ARC GUNNA SPARKYA 2011QUARTER HORSE"
result = sub("(.*?)\\s+\\d{4}QUARTER\\b.*", "\\1", text)
result
Output
[1] "ARC GUNNA SPARKYA"
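The question also lists strings that must be left alone. A minimal sketch, assuming the three example strings from the question are collected in a vector (texts and cleaned are names made up here, and perl = TRUE is an addition so that the lazy quantifier is handled by PCRE):

```r
# The target string plus the two strings from the question that must not change
texts <- c("ARC GUNNA SPARKYA 2011QUARTER HORSE", "10RUNS FAST", "QUICKER 1")

# sub() returns an element unchanged when the pattern does not match it
cleaned <- sub("(.*?)\\s+\\d{4}QUARTER\\b.*", "\\1", texts, perl = TRUE)
print(cleaned)
```

Only the first element contains four digits followed by QUARTER, so only that element is shortened.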

How to remove a certain portion of the column name in a dataframe?

I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without everything that is after ...
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character in a regular expression, it has to be escaped with a backslash to match a literal period: \.. In an R string the backslash is itself an escape character, so it must be doubled: \\.. In the pattern \\...* only the first period is escaped: \\. matches a literal period, the next . matches any single character, and .* matches zero or more further characters. The pattern therefore removes everything from the first period that is followed by at least one character. To require three literal periods, use \\.{3}.* instead.
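To see what the pattern \\...* actually matches, the sketch below compares it with a stricter variant on made-up column names. The vector cols and the name "Q1.5...20" are hypothetical and do not come from the question.

```r
# Hypothetical column names; "Q1.5...20" has a single dot before the suffix
cols <- c("Q21...20", "Q1.5...20")

# Pattern from the answer: cuts at the first dot that has any character after it
loose <- sub("\\...*", "", cols)

# Stricter variant: cuts only at three consecutive literal dots
strict <- sub("\\.{3}.*", "", cols)

print(loose)   # the single dot in "Q1.5" is treated as the cut point
print(strict)  # the single dot in "Q1.5" is preserved
```

For the names in the question both patterns give the same result; the difference only matters if a name can contain a single period.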
You could sub and capture the first word in each column:
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.

R: How to use stringr to extract the substring as the output to mutate a column of strings that begins with a string pattern and end with a number?

I'm creating a small example to be put into mutate(). Not sure why this doesn't work.
> str_extract("rs1234-<b>C</b>","^rs*\\d$")
[1] NA
It'd be great if you could point out my misunderstanding of the language instead of merely providing a solution. I expect to get "rs1234".
The ^rs*\d$ regex matches
^ - start of string
rs* - r and zero or more occurrences of s char
\d - a digit
$ - end of string.
So, your pattern matches strings like rsssss1, r3, etc.
You need
str_extract("rs1234-<b>C</b>", "^rs\\d+")
where ^rs\d+ matches rs at the start of the string and then one or more digits.
But if I just want the substring in between "rs" and the last number. What should I do?
You would use rs.*\d:
str_extract("rs1234-<b>C</b>", "rs.*\\d")
where rs.*\d matches rs, then any zero or more chars other than line break chars as many as possible and then a digit.
NOTE: If you need to match line endings, too, you need to prepend the last pattern with (?s) inline DOTALL modifier.
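The difference between the three patterns can be checked with a short sketch. It uses base R (grepl, regexpr, regmatches) rather than stringr so that it runs without extra packages, and the string y is a hypothetical example added to show where the last two patterns diverge.

```r
x <- "rs1234-<b>C</b>"
y <- "rs1234-A5"  # hypothetical string with a digit after the ID

# Original pattern: "r", zero or more "s", exactly one digit, then end of string
matches_x <- grepl("^rs*\\d$", x)
matches_other <- grepl("^rs*\\d$", c("rsssss1", "r3"))

# Corrected pattern: "rs" at the start followed by one or more digits
prefix <- regmatches(y, regexpr("^rs\\d+", y))

# Greedy variant: from "rs" up to the last digit anywhere in the string
greedy <- regmatches(y, regexpr("rs.*\\d", y))

print(c(matches_x, matches_other))
print(c(prefix, greedy))
```

On the string from the question both working patterns return the same value, because its last digit is also the end of the first run of digits.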

Add a white-space between number and special character condition R

I'm trying to use stringr or base R calls to conditionally add a white-space in a large vector wherever a numeric value is immediately followed by a special character, in this case a $ sign, with no space in between. str_pad doesn't appear to allow for a reference vector.
For example, for:
$6.88$7.34
I'd like to add a whitespace after the last number and before the next dollar sign:
$6.88 $7.34
Thanks!
If there is only one instance, use sub to capture the digit and the $ separately, and in the replacement put a space between the backreferences to the two captured groups:
sub("([0-9])([$])", "\\1 \\2", v1)
#[1] "$6.88 $7.34"
Or with a regex lookaround
gsub("(?<=[0-9])(?=[$])", " ", v1, perl = TRUE)
data
v1 <- "$6.88$7.34"
This also works when a string contains several occurrences:
mystring <- '$6.88$7.34 $8.34$4.31'
gsub("(?<=\\d)\\$", " $", mystring, perl = TRUE)
[1] "$6.88 $7.34 $8.34 $4.31"
Places where a space is already present are left unchanged.
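The difference between sub and gsub in the answers above can be seen in a small sketch. The vector prices and its first element are hypothetical, chosen so that one string contains more than one missing space.

```r
# Hypothetical vector: the first element has three prices run together
prices <- c("$1.00$2.00$3.00", "$6.88$7.34")

# sub() stops after the first digit-dollar boundary in each element
first_only <- sub("([0-9])([$])", "\\1 \\2", prices)

# gsub() repeats the replacement for every boundary
all_gaps <- gsub("([0-9])([$])", "\\1 \\2", prices)

print(first_only)
print(all_gaps)
```

Both functions are vectorised over their input, so neither needs a loop; the choice only depends on whether an element can contain several occurrences.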
Regarding the question asked in the comments:
mystring2<-as.vector('Regular_Distribution_Type† Income Only" "Distribution_Rate 5.34%" "Distribution_Amount $0.0295" "Distribution_Frequency Monthly')
gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]]+)", "_", mystring2, perl=T)
[1] "Regular_Distribution_Type<U+2020> Income_Only\" \"Distribution_Rate 5.34%\" \"Distribution_Amount $0.0295\" \"Distribution_Frequency_Monthly"
Note that the \ appears because of the nested quotes in the string and makes no difference to the result. Likewise, <U+2020> is only how the † character is printed under this encoding.
Explanation of regex:
(?<=[[:alpha:]]) This first part is a positive look-behind, introduced by ?<=. It looks behind the position we are trying to match to make sure that what we define in the look-behind is there. In this case we are looking for [[:alpha:]], which matches an alphabetic character.
We then match a whitespace character with \s; in an R string the backslash has to be doubled, hence \\s. This is the only part of the pattern that is actually consumed.
Finally we use (?=[[:alpha:]]+), a positive look-ahead introduced by ?=, which checks that the match is followed by another letter.
The logic is to find a blank space between two letters and match only the space, which gsub then replaces with _.

R - replace last instance of a regex match and everything afterwards

I'm trying to use a regex to replace the last instance of a phrase (and everything after that phrase, which could be any character):
stringi::stri_replace_last_regex("_AB:C-_ABCDEF_ABC:45_ABC:454:", "_ABC.*$", "CBA")
However, I can't seem to get the regex to function properly:
Input: "_AB:C-_ABCDEF_ABC:45_ABC:454:"
Actual output: "_AB:C-CBA"
Desired output: "_AB:C-_ABCDEF_ABC:45_CBA"
I have tried gsub() as well but that hasn't worked.
Any ideas where I'm going wrong?
One solution is:
Input <- "_AB:C-_ABCDEF_ABC:45_ABC:454:"
sub("(.*)_ABC.*", "\\1_CBA", Input)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Have a look at what stringi::stri_replace_last_regex does:
Replaces with the given replacement string last substring of the input that matches a regular expression
What does your _ABC.*$ pattern match inside _AB:C-_ABCDEF_ABC:45_ABC:454:? It matches the first _ABC (that is right after C-) and all the text after to the end of the line (.*$ grabs 0+ chars other than line break chars to the end of the line). Hence, you only have 1 match, and it is the last.
Solutions can be many:
1) Capturing all text before the last occurrence of the pattern and insert the captured value with a replacement backreference (this pattern does not have to be anchored at the end of the string with $):
sub("(.*)_ABC.*", "\\1_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
2) Using a tempered greedy token to make sure you only match any char that does not start your pattern up to the end of the string after matching it (this pattern must be anchored at the end of the string with $):
sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
Note that this pattern requires the perl=TRUE argument so that sub parses it with the PCRE engine (alternatively, stringr::str_replace is powered by the ICU regex library, which also supports lookaheads).
3) A negative lookahead may be used to make sure your pattern does not appear anywhere to the right of your pattern (this pattern does not have to be anchored at the end of the string with $):
sub("(?s)_ABC(?!.*_ABC).*", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
All three lines of code return _AB:C-_ABCDEF_ABC:45_CBA.
Note that (?s) in the PCRE patterns is necessary in case your strings may contain a newline (and . in a PCRE pattern does not match newline chars by default).
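The three solutions can be compared side by side with the pattern from the question. This is only a sketch on the single input string given there; the variable names are made up for illustration.

```r
x <- "_AB:C-_ABCDEF_ABC:45_ABC:454:"

# The question's pattern: matches from the FIRST "_ABC" to the end
naive <- sub("_ABC.*$", "CBA", x)

# 1) greedy capture of everything before the last "_ABC"
by_capture <- sub("(.*)_ABC.*", "\\1_CBA", x)
# 2) tempered greedy token anchored at the end of the string
by_tempered <- sub("(?s)_ABC(?:(?!_ABC).)*$", "_CBA", x, perl = TRUE)
# 3) negative lookahead: no further "_ABC" to the right
by_lookahead <- sub("(?s)_ABC(?!.*_ABC).*", "_CBA", x, perl = TRUE)

print(c(naive, by_capture, by_tempered, by_lookahead))
```

The first result reproduces the unwanted output from the question, while the other three agree on the desired one.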
Arguably the safest thing to do is to use a negative lookahead to find the last occurrence:
_ABC(?:(?!_ABC).)+$
gsub("_ABC(?:(?!_ABC).)+$", "_CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:", perl=TRUE)
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
Using gsub and back referencing
gsub("(.*)ABC.*$", "\\1CBA","_AB:C-_ABCDEF_ABC:45_ABC:454:")
[1] "_AB:C-_ABCDEF_ABC:45_CBA"
