R regular expression match/omit several repeats - r

I'm using back references to get rid of accidental repeats in vectors of variable names. The names in the first case I encountered have repeat patterns like this
x <- c("gender_gender-1", "county_county-2", "country_country-1997",
"country_country-1993")
The repeats were always separated by underscore and there was only one repeat to eliminate. And they always start at the beginning of the text. After checking the Regular Expression Cookbook, 2ed, I arrived at an answer that works:
> gsub("^(.*?)_\\1", "\\1", x)
[1] "gender-1" "county-2" "country-1997" "country-1993"
I was worried that the future cases might have dash or space as separator, so I wanted to generalize the matching a bit. I got that worked out as well.
> x <- c("gender_gender-1", "county-county-2", "country country-1997",
+ "country,country-1993")
> gsub("^(.*?)[,_\ -]\\1", "\\1", x)
[1] "gender-1" "county-2" "country-1997" "country-1993"
So far, total victory.
Now, what is the correct fix if there are three repeats in some cases? In this one, I want "country-country-country" to become just one "country".
> x <- c("gender_gender-1", "county-county-county-2")
> gsub("^(.*?)[,_\ -]\\1", "\\1", x)
[1] "gender-1" "county-county-2"
I am willing to replace all of the separators by "_" if that makes it easier to get rid of the repeat words.

You may quantify the [,_ -]\1 part:
gsub("^(.*?)(?:[,_\\s-]\\1)+", "\\1", x, perl=TRUE)
See the R demo
Note I also replace the space with \s to match any whitespace (and this requires perl=TRUE). You may also match any whitespace with [:space:], then you do not need perl=TRUE, i.e. gsub("^(.*?)(?:[,_[:space:]-]\\1)+", "\\1", x).
Details:
^ - matches the start of a string
(.*?) - any 0+ chars as few as possible up to the first...
(?:
[,_\\s-] - ,, _, whitespace or -
\\1 - same value as captured in Group 1
)+ - 1 or more times.
If you only want to match the repeat part 1 or 2 times, replace + with {1,2} limiting quantifier:
gsub("^(.*?)(?:[,_\\s-]\\1){1,2}", "\\1", x, perl=TRUE)

Related

Can quantifiers be used in regex replacement in R?

My objective would be replacing a string by a symbol repeated as many characters as have the string, in a way as one can replace letters to capital letters with \\U\\1, if my pattern was "...(*)..." my replacement for what is captured by (*) would be something like x\\q1 or {\\q1}x so I would get so many x as characters captured by *.
Is this possible?
I am thinking mainly in sub,gsub but you can answer with other libraris like stringi,stringr, etc.
You can use perl = TRUE or perl = FALSE and any other options with convenience.
I assume the answer can be negative, since seems to be quite limited options (?gsub):
a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
Main quantifiers are (?base::regex):
?
The preceding item is optional and will be matched at most once.
*
The preceding item will be matched zero or more times.
+
The preceding item will be matched one or more times.
{n}
The preceding item is matched exactly n times.
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
Ok, but it seems to be an option (which is not in PCRE, not sure if in PERL or where...) (*) which captures the number of characters the star quantifier is able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html) so then it could be used \q1 (same reference) to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,} but I'm not sure if this is really the fact for what I'm interested in.
EDIT UPDATE:
Since asked by commenters I update my question with an specific example provide by this interesting question. I modify a bit the example. Let's say we have a <- "I hate extra spaces elephant" so we are interested in keeping the a unique space between words, the 5 first characters of each word (till here as the original question) but then a dot for each other character (not sure if this is what is expected in the original question but doesn't matter) so the resulting string would be "I hate extra space. eleph..." (one . for the last s in spaces and 3 dots for the 3 letters ant in the end of elephant). So I started by keeping the 5 first characters with
gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"
How should I replace the exact number of characters in \\S* by dots or any other symbol?
Quantifiers cannot be used in the replacement pattern, nor the information how many chars they match.
What you need is a \G base PCRE pattern to find consecutive matches after a specific place in the string:
a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)
See the R demo and the regex demo.
Details
(?:\G(?!^)|(?<!\S)\S{5}) - the end of the previous successful match or five non-whitespace chars not preceded with a non-whitespace char
\K - a match reset operator discarding text matched so far
\S - any non-whitespace char.
gsubfn is like gsub except the replacement string can be a function which inputs the match and outputs the replacement. The function can optionally be expressed a formula as we do here replacing each string of word characters with the output of the function replacing that string. No complex regular expressions are needed.
library(gsubfn)
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."
or almost the same except function is slightly different:
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."

using regular expressions (regex) to make replace multiple patterns at the same time in R

I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consequitively is that sometimes it happens that after removing one ending, the other ending appears while I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.
To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds are not consuming text, the text they match does not land in the match value, and thus, there is no need to restore it anyhow. The replacement is an empty string.
Strings ending with "ces" and "ses" follow the same pattern, i.e. "*es$"
If I understand it correctly than you don't need two patterns.
Example:
x = c("ces", "ses", "mes)
gsub( pattern = "*([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
Hope it helps.
M

Regular Expression in R - Spaces before and after the text

I have a stats file that has lines that are like this:
"system.l2.compressor.compression_size::1 0 # Number of blocks that compressed to fit in 1 bits"
0 is the value that I care about in this case. The spaces between the actual statistic and whatever is before and after it are not the same each time.
My code is something like that to try and get the stats.
if (grepl("system.l2.compressor.compression_size::1", line))
{
matches <- regmatches(line, gregexpr("[[:digit:]]+\\.*[[:digit:]]", line))
compression_size_1 = as.numeric(unlist(matches))[1]
}
The reason I have this regular expression
[[:digit:]]+\\.*[[:digit:]]
is because in other cases the statistic is a decimal number. I don't anticipate in the cases that are like the example I posted for the numbers to be decimals, but it would be nice to have a "fail safe" regex that can capture even such a case.
In this case I get "2." "1" "0" "1" as answers. How can I restrict it so that I can get only the true stat as the answer?
I tried using something like this
"[:space:][[:digit:]]+\\.*[[:digit:]][:space:]"
or other variations, but either I get back NA, or the same numbers but with spaces surrounding them.
Here are a couple base R possibilities depending on how your data is set up. In the future, it is helpful to provide a reproducible example. Definitely provide one if these don't work. If the pattern works, it will probably be faster to adapt it to a stringr or stringi function. Good luck!!
# The digits after the space after the anything not a space following "::"
gsub(".*::\\S+\\s+(\\d+).*", "\\1", strings)
[1] "58740" "58731" "70576"
# Getting the digit(s) following a space and preceding a space and pound sign
gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Combining the two (this is the most restrictive)
gsub(".*::\\S+\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Extracting the first digits surounded by spaces (least restrictive)
gsub(".*?\\s+(\\d+)\\s+.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Or, using stringr for the last pattern:
as.numeric(stringr::str_extract(strings, "\\s+\\d+\\s+"))
[1] 58740 58731 70576
EDIT: Explanation for the second one:
gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
.* - .=any character except \n; *=any number of times
\\s+ - \\s =whitespace; +=at least one instance (of the whitespace)
(\\d+) - ()=capture group, you can reference it later by the number of occurrences (i.e., the ”\\1” returns the first instance of this pattern); \\d=digit; +=at least one instance (of a digit)
\\s+# - \\s =whitespace; +=at least one instance (of the whitespace); # a literal pound sign
.* - .=any character except \n; *=any number of times
Data:
strings <- c("system.l2.compressor.compression_size::256 58740 # Number of blocks that compressed to fit in 256 bits",
"system.l2.compressor.encoding::Base*.8_1 58731 # Number of data entries that match encoding Base8_1",
"system.l2.overall_hits::.cpu.data 70576 # number of overall hits")

R Regex to identify and replace characters between multiple dots

I have the following codes
"ABC.A.SVN.10.10.390.10.UDGGL"
"XYZ.Z.SVN.11.12.111.99.ASDDL"
and I need to replace the characters that exist between the 2nd and the 3rd dot. In this case it is SVN but it may well be any combination of between A and ZZZ, so really the only way to make this work is by using the dots.
The required outcome would be:
"ABC.A..10.10.390.10.UDGGL"
"XYZ.Z..11.12.111.99.ASDDL"
I tried variants of grep("^.+(\\.\\).$", "ABC.A.SVN.10.10.390.10.UDGGL") but I get an error.
Some examples of what I have tried with no success :
Link 1
Link 2
EDIT
I tried #Onyambu 's first method and I ran into a variant which I had not accounted for: "ABC.A.AB11.1.12.112.1123.UDGGL". In the replacement part, I also have numeric values. The desired outcome is "ABC.A..1.12.112.1123.UDGGL" and I get it using sub("\\.\\w+.\\B.",".",x) per the second part of his answer!
See code in use here
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl=T)
^ Assert position at the start of the line
(?:[^.]*\.){2} Match the following exactly twice
[^.]*\. Match any character except . any number of times, followed by .
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^.]* Match any character except . any number of times
Results in [1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
x= "ABC.A.SVN.10.10.390.10.UDGGL" "XYZ.Z.SVN.11.12.111.99.ASDDL"
sub("([A-Z]+)(\\.\\d+)","\\2",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
([A-Z]+) Capture any word that has the characters A-Z
(\\.\\d+) The captured word above, must be followed with a dot ie\\..This dot is then followed by numbers ie \\d+. This completes the capture.
so far the captured part of the string "ABC.A.SVN.10.10.390.10.UDGGL" is SVN.10 since this is the part that matches the regular expression. But this part was captured as SVN and .10. we do a backreference ie replace the whole SVN.10 with the 2nd part .10
Another logic that will work:
sub("\\.\\w+.\\B.",".",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
Not exactly regex but here is one more approach
#DATA
S = c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sapply(X = S,
FUN = function(str){
ind = unlist(gregexpr("\\.", str))[2:3]
paste(c(substring(str, 1, ind[1]),
"SUBSTITUTION",
substring(str, ind[2], )), collapse = "")
},
USE.NAMES = FALSE)
#[1] "ABC.A.SUBSTITUTION.10.10.390.10.UDGGL" "XYZ.Z.SUBSTITUTION.11.12.111.99.ASDDL"

Match all from beginning to /aaa-bbb-ccc/, excluding /aaa-bbb-ccc/

Consider the following string:
tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl
I have a bunch of strings that have /aaa-bbb-ccc/ in them. I would like to remove any characters that occur before /aaa-bbb-ccc/. The final product of the above, for example, should be /aaa-bbb-ccc/def/ghi/jkl.
My attempt, after some searching:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
sub("^.*[^/aaa-bbb-ccc/]", "", x)
[1] ""
You need to use lazy dot matching and wrap the known value with a capturing group to restore with a backreference later:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
sub(".*?(/aaa-bbb-ccc/)", "\\1", x)
## [1] "/aaa-bbb-ccc/def/ghi/jkl"
See this R demo.
See regex demo, .*? matches any 0+ chars, as few as possible, and (/aaa-bbb-ccc/) is a capturing group with ID=1 that is reference to with \1 from the replacement pattern.
Note you may also extract that part using regmatches/regexpr:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
regmatches(x, regexpr("/aaa-bbb-ccc/.*", x))
See this R demo. .* just grabs any 0+ chars up to the end of the whole character vector.

Resources