Replace the last occurrence of a string (and only it) using regular expression - r

I have a string, let's say MyString = "aabbccawww". I would like to use a gsub expression to replace the last "a" in MyString with "A", and only that one, giving "aabbccAwww". I have found similar questions on the website, but they all asked to replace the last occurrence and everything coming after it.
I have tried gsub("a[^a]*$", "A", MyString), but it gives "aabbccA". I know that I can use stringi functions for that purpose but I need the solution to be implemented in a part of a code where using such functions would be complicated, so I would like to use a regular expression.
Any suggestion?

You can use the stringi library, which makes dealing with strings very easy, e.g.
library(stringi)
x <- "aabbccawww"
stri_replace_last_fixed(x, 'a', 'A')
#[1] "aabbccAwww"

We can use sub to match 'a' followed by zero or more characters that are not an 'a' ([^a]*), capture those characters as a group ((...)) up to the end of the string ($), and replace the match with "A" followed by the backreference to the captured group (\\1).
sub("a([^a]*)$", "A\\1", MyString)
#[1] "aabbccAwww"
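A variation on the same idea, offered only as a sketch: instead of capturing what comes after the last 'a', a greedy .* can capture everything before it. The string "xyz" is a made-up input, added to show what happens when there is nothing to replace.

```r
MyString <- "aabbccawww"

# (.*) is greedy, so it captures everything before the LAST "a".
# sub() replaces only the matched part, so the trailing "www" is untouched.
result <- sub("(.*)a", "\\1A", MyString)
print(result)
#[1] "aabbccAwww"

# Hypothetical input without any "a": the string is returned unchanged.
no_match <- sub("(.*)a", "\\1A", "xyz")
print(no_match)
#[1] "xyz"
```

Both forms rely on sub rather than gsub, since only a single replacement is wanted.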

While akrun's answer should solve the problem (I'm not sure, as I haven't worked with \1 etc. yet), you can also use a lookaround:
a(?!(.|\n)*a)
This is basically saying: find an a that is NOT followed by any number of characters and another a. The (?!x) construct is a negative lookahead, which means that the looked-ahead expression won't be included in the match. In R, lookarounds require perl = TRUE.
You need (.|\n) since . matches all characters except line breaks.
For reference about lookarounds or other regex features, I can recommend http://regexr.com/.
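A minimal sketch of how the pattern above might be used from R. It assumes perl = TRUE, because R's default regex engine does not support this kind of construct; the multi-line string is a made-up input to exercise the (.|\n) part.

```r
MyString <- "aabbccawww"

# perl = TRUE switches to PCRE, which supports the (?!...) construct.
# The backslash is doubled because it sits inside an R string literal.
result <- sub("a(?!(.|\\n)*a)", "A", MyString, perl = TRUE)
print(result)
#[1] "aabbccAwww"

# Hypothetical multi-line input: the last "a" is on the second line.
multi <- sub("a(?!(.|\\n)*a)", "A", "a\nab", perl = TRUE)
print(identical(multi, "a\nAb"))
#[1] TRUE
```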

Related

How to remove a certain portion of the column name in a dataframe?

I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without everything that is after ...
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character, we need to escape it with a backslash to match a literal period: \.. Because the backslash is itself an escape character in R strings, it has to be doubled in the string literal, giving \\.. In the pattern \\...*, only the first period is escaped: \\. matches a literal period, the second (unescaped) . matches any single character, and .* matches zero or more further characters. So the pattern removes everything from the first period that is followed by at least one more character. To require three literal periods, you could use \\.{3}.* instead.
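To see what the two patterns actually match, here is a small comparison. The extra name "a.bc" is hypothetical, added only because it contains a single period; the stricter pattern \\.{3}.* is an alternative, not part of the original answer.

```r
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
extra <- "a.bc"  # hypothetical column name with a single period

# Literal period, then any one character, then anything:
# this also truncates names that contain just one period.
loose <- gsub("\\...*", "", c(col, extra))
print(loose[6])
#[1] "a"

# Exactly three literal periods, then anything:
# names with a single period are left alone.
strict <- gsub("\\.{3}.*", "", c(col, extra))
print(strict[6])
#[1] "a.bc"

# On the original column names both patterns agree.
print(identical(loose[1:5], strict[1:5]))
#[1] TRUE
```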
You could sub and capture the first word in each column:
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.

Find occurrences with regex and then only remove first character in matched expression

Surprisingly I haven't found a satisfactory answer to this regex problem. I have the following vector:
row1
[1] "AA.8.BB.CCCC" "2017" "3.166.5" "3.080.2" "68" "162.6"
[7] "185.223.632.4" "500.332.1"
My end result should look like this:
row1
[1] "AA.8.BB.CCCC" "2017" "3,166.5" "3,080.2" "68" "162.6"
[7] "185,223,632.4" "500,332.1"
The last period in each of the numeric values is the decimal point and the other periods should be converted to commas. I want this done without affecting the value with letters ([1]). I tried the following:
gsub("[.]\\d{3}[.]", ",", row1)
This regex sort of works but doesn't quite do what I want. Additionally it removes the numbers, which is problematic. Is there a way to find the regex and then only remove the first character and not the entire matched values? If there is a better way of approaching this I welcome those responses as well.
You can use the following:
See code in use here
gsub("\\G\\d+\\K\\.(?=\\d+(?!$))", ",", row1, perl = TRUE)
See regex in use here
Note: The regex at the URL above is changed to (?:\G|^) for display purposes (\G matches the start of the string \A, but not the start of the line).
\G\d+\K\.(?=\d+(?!$))
How it works:
\G asserts position either at the end of the previous match or at the start of the string
\d+\K\. matches a digit one or more times, then resets the match (previously consumed characters are no longer included in the final match), then match a dot . literally
(?=\d+(?!$)) positive lookahead ensuring what follows is one or more digits, but not followed by the end of the line
One option is to use a combination of a lookbehind and a lookahead to match only a dot when what is on the left is a digit and on the right are 3 digits followed by a dot.
You could add perl = TRUE using gsub.
In the replacement use a comma.
(?<=\d)[.](?=\d{3}[.])
Regex demo | R demo
Double escaped as noted by @r2evans
(?<=\\d)[.](?=\\d{3}[.])
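Put together as a runnable sketch, using the vector from the question. The variable name expected is introduced here only for the check.

```r
row1 <- c("AA.8.BB.CCCC", "2017", "3.166.5", "3.080.2", "68", "162.6",
          "185.223.632.4", "500.332.1")

# Replace a dot only when a digit precedes it and exactly three digits
# plus another dot follow it; lookarounds need perl = TRUE.
result <- gsub("(?<=\\d)[.](?=\\d{3}[.])", ",", row1, perl = TRUE)

# Desired result as stated in the question.
expected <- c("AA.8.BB.CCCC", "2017", "3,166.5", "3,080.2", "68", "162.6",
              "185,223,632.4", "500,332.1")
print(identical(result, expected))
#[1] TRUE
```

Because both lookarounds are zero-width, only the dot itself is consumed, so the surrounding digits survive the replacement.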

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can choose to keep a pattern and remove everything outside of that pattern with a two-step process. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parenthesized pattern we'd like to keep, since there may be several parenthesized elements. For example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* are wild cards by the way. If you'd like to learn more about using regex in R, here's a reference that can get you started.
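For completeness, the attempt from the question can probably be repaired while keeping its "delete both ends" shape. This is a sketch, not one of the posted answers: it assumes perl = TRUE so that a lookahead can stop the first alternative just before SS instead of consuming it.

```r
Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021")

# First alternative: everything from the start up to (not including) "SS".
# Second alternative: "T0" and everything after it.
result <- gsub("^.*?(?=SS)|T0.*$", "", Target, perl = TRUE)
print(result)
#[1] "SS1G_340" "SS2G_340"
```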

get last part of a string

I would like to get the last substring of a variable (the last part after the underscore), in this case: "myvar".
x = "string__subvar1__subvar2__subvar3__myvar"
my attempts result in a match starting from the first substring, e.g.
library(stringr)
str_extract(x, "__.*?$")
How do I do this in R?
You can do
sub('.*\\__', '', x)
You can do:
library(stringr)
str_extract(x,"[a-zA-Z]+$")
EDIT:
one could use lookaround feature as well: str_extract(x,"(?=_*)[a-zA-Z]+$")
also from baseR
regmatches(x,gregexpr("[a-zA-Z]+$",x))[[1]]
From documentation ?regex:
The caret ^ and the dollar sign $ are metacharacters that respectively
match the empty string at the beginning and end of a line.
Could this work? Sorry, I hope I'm understanding what you're asking correctly.
substr(x,gregexpr("_",x)[[1]][length(gregexpr("_",x)[[1]])]+1,nchar(x))
[1] "myvar"
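Another base R option, sketched here as an alternative to the answers above: split on the literal separator and keep the last piece. It assumes the separator is always a double underscore.

```r
x <- "string__subvar1__subvar2__subvar3__myvar"

# fixed = TRUE treats "__" as a literal string rather than a regex.
parts <- strsplit(x, "__", fixed = TRUE)[[1]]

# tail(..., 1) keeps only the last element.
last_part <- tail(parts, 1)
print(last_part)
#[1] "myvar"
```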

grepl not searching correctly in R

I want to search ".com" in a vector, but grepl isn't working out for me. Anyone know why? I am doing the following
vector <- c("fdsfds.com","fdsfcom")
grepl(".com",vector)
This returns
[1] TRUE TRUE
I want it to strictly refer to "fdsfds.com"
As @user20650 said in the comments above, use grepl("\\.com",vector). The dot (.) is a special character in regular expressions that matches any character, so it's matching the second "f" in "fdsfcom". The "\\" before the . "escapes" the dot so it's treated literally. Alternatively, you could use grepl(".com",vector, fixed = TRUE), which searches literally, not using regular expressions.
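The options side by side, as a quick check. The endsWith call is an extra suggestion that is not in the answer above; it only fits if ".com" must appear at the very end of the string.

```r
vector <- c("fdsfds.com", "fdsfcom")

# Unescaped dot: matches any character, so both elements match.
unescaped <- grepl(".com", vector)
print(unescaped)
#[1] TRUE TRUE

# Escaped dot: only a literal ".com" matches.
escaped <- grepl("\\.com", vector)
print(escaped)
#[1]  TRUE FALSE

# fixed = TRUE: the pattern is taken literally, no regex involved.
literal <- grepl(".com", vector, fixed = TRUE)
print(literal)
#[1]  TRUE FALSE

# Suffix test without any regex (assumes ".com" must end the string).
suffix <- endsWith(vector, ".com")
print(suffix)
#[1]  TRUE FALSE
```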
