Consider the following string:
tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl
I have a bunch of strings that have /aaa-bbb-ccc/ in them. I would like to remove any characters that occur before /aaa-bbb-ccc/. The final product of the above, for example, should be /aaa-bbb-ccc/def/ghi/jkl.
My attempt, after some searching:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
sub("^.*[^/aaa-bbb-ccc/]", "", x)
[1] ""
You need to use lazy dot matching and wrap the known value in a capturing group so that you can restore it with a backreference in the replacement:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
sub(".*?(/aaa-bbb-ccc/)", "\\1", x)
## [1] "/aaa-bbb-ccc/def/ghi/jkl"
Here, .*? matches any 0+ chars, as few as possible, and (/aaa-bbb-ccc/) is a capturing group with ID=1 that is referred to with \1 from the replacement pattern.
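To see why the lazy quantifier matters, here is a small sketch with a hypothetical input in which /aaa-bbb-ccc/ occurs twice (the input x2 is an assumption, not taken from the question). perl = TRUE is set only so that both calls follow PCRE semantics.

```r
# Hypothetical input: the marker occurs twice
x2 <- "tempo/aaa-bbb-ccc/def/aaa-bbb-ccc/jkl"

# Lazy .*? stops in front of the first occurrence of the marker
lazy <- sub(".*?(/aaa-bbb-ccc/)", "\\1", x2, perl = TRUE)

# Greedy .* runs up to the last occurrence of the marker
greedy <- sub(".*(/aaa-bbb-ccc/)", "\\1", x2, perl = TRUE)

print(lazy)    # "/aaa-bbb-ccc/def/aaa-bbb-ccc/jkl"
print(greedy)  # "/aaa-bbb-ccc/jkl"
```

With a single occurrence, as in the question, both variants give the same result.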
Note you may also extract that part using regmatches/regexpr:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
regmatches(x, regexpr("/aaa-bbb-ccc/.*", x))
Here, .* just grabs any 0+ chars up to the end of the string.
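One thing to keep in mind with regmatches/regexpr is that elements without a match are dropped from the result. A minimal sketch, assuming you want the output to stay aligned with the input (the second input element and the variable names are made up for illustration):

```r
# Hypothetical vector: the second element has no /aaa-bbb-ccc/ in it
x <- c("tempo/blah/aaa-bbb-ccc/def", "no/marker/here")

m <- regexpr("/aaa-bbb-ccc/.*", x)

# regmatches() alone returns only the matches, so the result is shorter than x
found <- regmatches(x, m)

# Pre-fill with NA and assign the matches back to the positions that matched
out <- rep(NA_character_, length(x))
out[m != -1] <- found

print(found)  # "/aaa-bbb-ccc/def"
print(out)    # "/aaa-bbb-ccc/def" NA
```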
Related
How can I back reference file_version_1a.csv in the following?
vec = c("dir/file_version_1a.csv")
In particular, I wonder why
gsub("(file.*csv$)", "", vec)
[1] "dir/"
as if I have a correct pattern, yet
gsub("(file.*csv$)", "\\1", vec)
[1] "dir/file_version_1a.csv"
You want to extract the substring starting with file and ending with csv at the end of string.
Since gsub replaces the match, and you want to use it as an extraction function, you need to match all the text in the string.
As the text not matched by your regex is at the start of the string, you need to prepend your pattern with .* (which matches any zero or more chars, as many as possible; with the TRE regex used by default in base R functions the dot also matches line break chars, while with the PCRE/ICU regex used in perl=TRUE base R functions and in stringr/stringi functions it matches any chars other than line break chars):
vec = c("dir/file_version_1a.csv")
gsub(".*(file.*csv)$", "\\1", vec)
However, stringr::str_extract (or regmatches/regexpr in base R) seems a more natural choice here:
stringr::str_extract(vec, "file.*csv$")
regmatches(vec, regexpr("file.*csv$",vec))
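The difference between the TRE and PCRE dot only shows up when the string contains a line break. A small sketch, using a made-up input with a newline in it (not from the question); (?s) is the PCRE inline flag that lets the dot match line breaks as well:

```r
# Hypothetical input with a line break before the file name
s <- "dir\nfile_version_1a.csv"

# Default (TRE): the dot also matches the newline, so the whole prefix is consumed
tre <- sub(".*(file.*csv)$", "\\1", s)

# PCRE: the dot stops at the newline, so the text before it is left in place
pcre <- sub(".*(file.*csv)$", "\\1", s, perl = TRUE)

# PCRE with (?s): the dot matches line breaks again
pcre_dotall <- sub("(?s).*(file.*csv)$", "\\1", s, perl = TRUE)

print(tre)          # "file_version_1a.csv"
print(pcre)         # "dir\nfile_version_1a.csv"
print(pcre_dotall)  # "file_version_1a.csv"
```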
Given a string like 'run- ning' I would like to replace 'n- n' by 'nn' in order to obtain 'running'.
Using the stringr package I tried this:
str_replace_all(s, "[:alpha:]\\-([ ])+[:alpha:]", "[:alpha:][:alpha:]")
but it does not seem to work that way. I guess variables need to be used, but I could not figure out how exactly.
I tried this:
str_replace_all(s, "[:alpha:]\\-([ ])+[:alpha:]", "\\0\\1")
but that does not give the desired result either.
Any ideas?
You may use
stringr::str_replace_all(s, "(?<=\\p{L})- +(?=\\p{L})", "")
stringr::str_replace_all(s, "(\\p{L})- +(\\p{L})", "\\1\\2")
Or, to match any horizontal whitespace chars
stringr::str_replace_all(s, "(?<=\\p{L})-\\h+(?=\\p{L})", "")
stringr::str_replace_all(s, "(\\p{L})-\\h+(\\p{L})", "\\1\\2")
Base R equivalent:
gsub("(?<=\\p{L})-\\h+(?=\\p{L})", "", s, perl=TRUE)
gsub("(\\p{L})-\\h+(\\p{L})", "\\1\\2", s, perl=TRUE)
gsub("([[:alpha:]])-\\s+([[:alpha:]])", "\\1\\2", s)
Details
(?<=\p{L}) - a positive lookbehind that matches a location immediately preceded with any Unicode letter
- + - a hyphen followed with 1 or more spaces (\h matches any horizontal whitespace)
(?=\p{L}) - a positive lookahead that matches a location immediately followed with any Unicode letter.
(\p{L}) - a capturing group that matches any letter.
The \1\2 in the replacement patterns in the examples using capturing groups are backreferences to the corresponding capturing group values.
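As a quick check that the lookaround and the capturing group variants behave the same, here is a base R sketch on a small made-up vector (only "run- ning" comes from the question; the other two elements are assumptions added to show what is left alone):

```r
# "well-known" has no whitespace after the hyphen, "a - b" has no letter before it
s <- c("run- ning", "well-known", "a - b")

# Lookarounds: only the hyphen and the whitespace are matched and removed
with_lookarounds <- gsub("(?<=\\p{L})-\\h+(?=\\p{L})", "", s, perl = TRUE)

# Capturing groups: the letters are matched too, then put back with \1\2
with_groups <- gsub("(\\p{L})-\\h+(\\p{L})", "\\1\\2", s, perl = TRUE)

print(with_lookarounds)  # "running" "well-known" "a - b"
print(with_groups)       # "running" "well-known" "a - b"
```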
I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consecutively is that sometimes, after removing one ending, the other ending appears, and I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could do:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.
To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds are not consuming text, the text they match does not land in the match value, and thus, there is no need to restore it anyhow. The replacement is an empty string.
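A short sketch of how the \b and $ variants differ, and of the fact that a single gsub pass does not re-apply the pattern to its own output. The sample vector is made up for illustration ("abceses" is not a real word, it just ends in -ses and would end in -ces after one removal):

```r
text <- c("glasses and faces", "abceses")

# \\b: strip "es" after s/c at the end of every word
word_end <- gsub("([sc])es\\b", "\\1", text)

# $: strip "es" after s/c only at the very end of each string
string_end <- gsub("([sc])es$", "\\1", text)

print(word_end)    # "glass and fac"   "abces"
print(string_end)  # "glasses and fac" "abces"
```

"abceses" becomes "abces" and stays that way, even though the result ends in -ces.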
Strings ending with "ces" and "ses" follow the same pattern, i.e. "[cs]es$".
If I understand it correctly, then you don't need two patterns.
Example:
x = c("ces", "ses", "mes")
gsub(pattern = "([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
I have a series of strings that I would like to use regular expressions to compress.
1 617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx
2 517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx
3 47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx
I would like the result of the regular expression to be the following compressed strings:
1 Nomination-DC2019-08-08
2 Debate2019-08-08
3 House2019-08-08
Basically, the logic I'm looking for is to find the last hyphen, then move two spaces to the right and delete everything from there rightward. I'm undertaking this in R.
Update: I tried the following workflow, which addressed my issue. h/t to #brittenb for identifying the very useful tools::file_path_sans_ext()
x<-tools::file_path_sans_ext(x)
x<-str_replace(x, " .*", "")
x<-str_replace(x,".*\\_", "")
However, if anyone has a one line regex solution to this that would be great.
Update 2: h/t #WiktorStribiżew for identifying two one-liner solutions:
stringr::str_replace(x, ".*_([^.\\s]+).*", "\\1")
sub(".*_([^.[:space:]]+).*", "\\1", x)
You may simplify the task if you use tools::file_path_sans_ext() to extract the file name without extensions first and then grab all non-whitespace chars from the last _:
x <- c("617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx", "517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx", "47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx")
library(stringr)
str_extract(tools::file_path_sans_ext(x), "(?<=_)[^_\\s]+(?=[^_]*$)")
The (?<=_)[^_\\s]+(?=[^_]*$) regex matches a location right after _, then matches 1+ chars other than _ and whitespace, and then asserts that only 0+ chars other than _ remain up to the end of the string.
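If you would rather not load stringr, the same lookaround pattern should also work with regmatches/regexpr, since perl=TRUE enables lookarounds in base R. A minimal sketch (the variable names base_names, m and res are just for illustration):

```r
x <- c("617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx",
       "517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx",
       "47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx")

# Drop the extension first, then extract the part after the last _
base_names <- tools::file_path_sans_ext(x)
m <- regexpr("(?<=_)[^_\\s]+(?=[^_]*$)", base_names, perl = TRUE)
res <- regmatches(base_names, m)

print(res)
# "Nomination-DC2019-08-08" "Debate2019-08-08" "House2019-08-08"
```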
You may achieve what you need without extra libraries:
sub(".*_([^.[:space:]]+).*", "\\1", x)
With stringr:
str_replace(x, ".*_([^.\\s]+).*", "\\1")
Details
.*_ - any 0+ chars, as many as possible, up to and including the last _ that can still be followed by the subsequent patterns
([^.[:space:]]+) - Capturing group 1 (its value is referred to with the \1 placeholder, or replacement backreference, from the replacement pattern): 1+ chars other than a dot and whitespace (note that \s does not denote whitespace inside [...] in a TRE regex, while it does in an ICU regex in stringr regex functions)
.* - any 0+ chars as many as possible.
Full code snippet:
x <- c("617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx", "517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx", "47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx")
sub(".*_([^.[:space:]]+).*", "\\1", x)
library(stringr)
stringr::str_replace(x, ".*_([^.\\s]+).*", "\\1")
Both yield
[1] "Nomination-DC2019-08-08" "Debate2019-08-08"
[3] "House2019-08-08"
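Note that sub() returns the input unchanged when the pattern does not match, which can hide file names that do not follow the expected scheme. A hedged sketch of one way to flag those, assuming NA is an acceptable marker (the second file name and the names pat and out are made up for illustration):

```r
# Hypothetical inputs: the second name has no underscore at all
x <- c("47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx",
       "no-underscore.xlsx")

pat <- ".*_([^.[:space:]]+).*"

# Without a guard, the non-matching name comes back as-is
unguarded <- sub(pat, "\\1", x)

# With a guard, non-matching names become NA instead
out <- ifelse(grepl(pat, x), sub(pat, "\\1", x), NA_character_)

print(unguarded)  # "House2019-08-08" "no-underscore.xlsx"
print(out)        # "House2019-08-08" NA
```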
I'm using back references to get rid of accidental repeats in vectors of variable names. The names in the first case I encountered have repeat patterns like this
x <- c("gender_gender-1", "county_county-2", "country_country-1997",
"country_country-1993")
The repeats were always separated by underscore and there was only one repeat to eliminate. And they always start at the beginning of the text. After checking the Regular Expression Cookbook, 2ed, I arrived at an answer that works:
> gsub("^(.*?)_\\1", "\\1", x)
[1] "gender-1" "county-2" "country-1997" "country-1993"
I was worried that future cases might have a dash or space as the separator, so I wanted to generalize the matching a bit. I got that worked out as well.
> x <- c("gender_gender-1", "county-county-2", "country country-1997",
+ "country,country-1993")
> gsub("^(.*?)[,_ -]\\1", "\\1", x)
[1] "gender-1" "county-2" "country-1997" "country-1993"
So far, total victory.
Now, what is the correct fix if there are three repeats in some cases? In this one, I want "county-county-county" to become just one "county".
> x <- c("gender_gender-1", "county-county-county-2")
> gsub("^(.*?)[,_ -]\\1", "\\1", x)
[1] "gender-1" "county-county-2"
I am willing to replace all of the separators by "_" if that makes it easier to get rid of the repeat words.
You may quantify the [,_ -]\1 part:
gsub("^(.*?)(?:[,_\\s-]\\1)+", "\\1", x, perl=TRUE)
Note I also replaced the space with \s to match any whitespace (and this requires perl=TRUE). You may also match any whitespace with the [:space:] POSIX character class inside the bracket expression; then you do not need perl=TRUE, i.e. gsub("^(.*?)(?:[,_[:space:]-]\\1)+", "\\1", x).
Details:
^ - matches the start of a string
(.*?) - any 0+ chars as few as possible up to the first...
(?:
[,_\\s-] - ,, _, whitespace or -
\\1 - same value as captured in Group 1
)+ - 1 or more times.
If you only want to match the repeat part 1 or 2 times, replace + with {1,2} limiting quantifier:
gsub("^(.*?)(?:[,_\\s-]\\1){1,2}", "\\1", x, perl=TRUE)
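To make the difference between + and {1,2} concrete, here is a small sketch on a made-up vector (the last element, with four copies of "a", is an assumption added to show where the two quantifiers diverge):

```r
x <- c("gender_gender-1", "county-county-county-2", "a_a_a_a-1")

# +: collapse any number of separator-plus-repeat sequences
any_repeats <- gsub("^(.*?)(?:[,_\\s-]\\1)+", "\\1", x, perl = TRUE)

# {1,2}: collapse at most two repeats after the first occurrence
up_to_two <- gsub("^(.*?)(?:[,_\\s-]\\1){1,2}", "\\1", x, perl = TRUE)

print(any_repeats)  # "gender-1" "county-2" "a-1"
print(up_to_two)    # "gender-1" "county-2" "a_a-1"
```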