Given a string like 'run- ning' I would like to replace 'n- n' by 'nn' in order to obtain 'running'.
Using the stringr package I tried this:
str_replace_all(s, "[:alpha:]\\-([ ])+[:alpha:]", "[:alpha:][:alpha:]")
but it seems not to work that way. I guess variables (backreferences) need to be used, but I couldn't figure out how exactly.
I tried this:
str_replace_all(s, "[:alpha:]\\-([ ])+[:alpha:]", "\\0\\1")
but that does not give the desired result either.
Any ideas?
You may use
stringr::str_replace_all(s, "(?<=\\p{L})- +(?=\\p{L})", "")
stringr::str_replace_all(s, "(\\p{L})- +(\\p{L})", "\\1\\2")
Or, to match any horizontal whitespace chars
stringr::str_replace_all(s, "(?<=\\p{L})-\\h+(?=\\p{L})", "")
stringr::str_replace_all(s, "(\\p{L})-\\h+(\\p{L})", "\\1\\2")
Base R equivalent:
gsub("(?<=\\p{L})-\\h+(?=\\p{L})", "", s, perl=TRUE)
gsub("(\\p{L})-\\h+(\\p{L})", "\\1\\2", s, perl=TRUE)
gsub("([[:alpha:]])-\\s+([[:alpha:]])", "\\1\\2", s)
See the regex demo
Details
(?<=\p{L}) - a positive lookbehind that matches a location immediately preceded by any Unicode letter
- + - a hyphen followed by 1 or more spaces (\h matches any horizontal whitespace)
(?=\p{L}) - a positive lookahead that matches a location immediately followed by any Unicode letter.
(\p{L}) - a capturing group that matches any letter.
In the variants that use capturing groups, the \1\2 in the replacement pattern are backreferences to the values captured by the corresponding groups.
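For instance, with a sample string extending the question's example (the value of s below is only an illustration):
s <- "Text with run- ning and jump-  ing words"
stringr::str_replace_all(s, "(?<=\\p{L})- +(?=\\p{L})", "")
## [1] "Text with running and jumping words"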
Related
I'm trying to use the stringr package in R to identify space(s) between words but not space(s) between words and symbols (or vice versa) or words and numbers (or vice versa), or symbols and numbers.
Based on what I could find it seems like [A-Za-z][:space:][a-zA-Z] should work. I'm obviously missing something but not sure what.
I've tried the stringr syntax with [A-Za-z][:space:][a-zA-Z], as well as regex(?) syntax for the spaces such as [A-Za-z]\s+[a-zA-Z]
str_replace_all(x, [A-Za-z][:space:][a-zA-Z], "_")
Sometimes an error I would get is "Error in rep(value, length.out = nrows) : attempt to replicate an object of type 'closure'"
You may use
str_replace_all(x, "(?<=\\p{L})\\s(?=\\p{L})", "_")
gsub("(?<=\\p{L})\\s(?=\\p{L})", "_", x, perl=TRUE)
Or, if 1 or more consecutive spaces should be replaced with a single _,
str_replace_all(x, "(?<=\\p{L})\\s+(?=\\p{L})", "_")
gsub("(?<=\\p{L})\\s+(?=\\p{L})", "_", x, perl=TRUE)
See the regex demo
Details
(?<=\p{L}) - a positive lookbehind that matches a location that is immediately preceded by any letter
\s - a whitespace (\s+ matches 1+ whitespaces)
(?=\p{L}) - a positive lookahead that matches a location that is immediately followed by any letter.
NOTE:
You should wrap the regex pattern with quotes to form a string literal
If you want to only support ASCII letters, you may replace \\p{L} with [A-Za-z].
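For instance, with a made-up test string that mixes words, a number and a symbol (the value of x below is only an illustration):
library(stringr)
x <- "alpha beta 42 gamma & delta"
str_replace_all(x, "(?<=\\p{L})\\s+(?=\\p{L})", "_")
## [1] "alpha_beta 42 gamma & delta"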
I have a series of strings that I would like to use regular expressions to compress.
1 617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx
2 517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx
3 47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx
I would like the result of the regular expression to be the following compressed strings:
1 Nomination-DC2019-08-08
2 Debate2019-08-08
3 House2019-08-08
Basically, the logic I'm looking for is to find the last hyphen, then move two spaces to the right and delete everything from there rightward. I'm undertaking this in R.
Update: I tried the following workflow, which addressed my issue. h/t to @brittenb for identifying the very useful tools::file_path_sans_ext()
x<-tools::file_path_sans_ext(x)
x<-str_replace(x, " .*", "")
x<-str_replace(x,".*\\_", "")
However, if anyone has a one-line regex solution to this, that would be great.
Update 2: h/t @WiktorStribiżew for identifying two one-liner solutions:
stringr::str_replace(x, ".*_([^.\\s]+).*", "\\1")
sub(".*_([^.[:space:]]+).*", "\\1", x)
You may simplify the task if you use tools::file_path_sans_ext() to extract the file name without extensions first and then grab all non-whitespace chars from the last _:
x <- c("617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx", "517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx", "47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx")
library(stringr)
str_extract(tools::file_path_sans_ext(x), "(?<=_)[^_\\s]+(?=[^_]*$)")
See the R demo. The (?<=_)[^_\\s]+(?=[^_]*$) regex matches a location right after a _, then matches 1+ chars other than _ and whitespace, and then asserts that only chars other than _ (0 or more) remain up to the end of the string.
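With the example vector x above, this returns:
## [1] "Nomination-DC2019-08-08" "Debate2019-08-08"        "House2019-08-08"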
You may achieve what you need without extra libraries:
sub(".*_([^.[:space:]]+).*", "\\1", x)
See the regex demo and the R demo.
With stringr:
str_replace(x, ".*_([^.\\s]+).*", "\\1")
Details
.*_ - any 0+ chars, as many as possible, up to and including the last _ after which the rest of the pattern can still match
([^.[:space:]]+) - Capturing group 1 (its value is referenced with the \1 replacement backreference): 1+ chars other than a dot and whitespace (note that \s does not denote whitespace inside [...] in a TRE regex, but it does in the ICU regex used by the stringr functions)
.* - any 0+ chars, as many as possible.
Full code snippet:
x <- c("617912568590104527563-Congress-Dem-Packages_Nomination-DC2019-08-08.xlsx", "517912568590504527553-Dem-Plans-Packages_Debate2019-08-08.xlsx", "47912568590104527523-Congress-Dem-Packages_House2019-08-08 (1).xlsx")
sub(".*_([^.[:space:]]+).*", "\\1", x)
library(stringr)
stringr::str_replace(x, ".*_([^.\\s]+).*", "\\1")
Both yield
[1] "Nomination-DC2019-08-08" "Debate2019-08-08"
[3] "House2019-08-08"
I have text I am cleaning up in R. I want to use stringi, but am happy to use other packages.
Some of the words are broken over two lines. So I get a sub-string "halfword-\nsecondhalfword".
I also have strings that are just "----\nword" and " -\n" (and some others) that I do not want to replace.
What I want to do is identify all sub-strings "[a-z]-\n", then keep the generic letter [a-z] but remove the -\n characters.
I do not want to remove all -\n , and I do not want to remove the letter [a-z].
Thanks!
You may make use of word boundaries to match -<LF> only in between word characters:
gsub("\\b-\n\\b", "", x)
gsub("(*UCP)\\b-\n\\b", "", x, perl=TRUE)
stringr::str_replace_all(x, "\\b-\n\\b", "")
The latter two support word boundaries between any Unicode word characters.
See the regex demo.
If you want to only remove -<LF> between letters you may use
gsub("([a-zA-Z])-\n([a-zA-Z])", "\\1\\2", x)
gsub("(\\p{L})-\n(\\p{L})", "\\1\\2", x, perl=TRUE)
stringr::str_replace_all(x, "(\\p{L})-\n(\\p{L})", "\\1\\2")
If you need to only support lowercase letters, remove A-Z in the first gsub and replace \p{L} with \p{Ll} in the latter two.
See this regex demo.
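For instance, assuming x holds a small sample built from the patterns described in the question:
x <- "run-\nning and walk-\ning are joined; ----\nword and ' -\n' stay"
gsub("([a-zA-Z])-\n([a-zA-Z])", "\\1\\2", x)
## [1] "running and walking are joined; ----\nword and ' -\n' stay"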
I am trying to extract usernames tagged in a text-chat, such as "#Jack #Marie Hi there!"
I am trying to do it on the combination of # and whitespace but I cannot get the regex to match non-greedy (or at least this is what I think is wrong):
library(stringr)
str_extract(string = '#This is what I want to extract', pattern = "(?<=#)(.*)(?=\\s+)")
[1] "This is what I want to"
What I would like to extract instead is only This.
You could make your regex non-greedy:
(?<=#)(.*?)(?=\s+)
Or if you want to capture only "This" after the # sign, you could try it like this using only a positive lookbehind:
(?<=#)\w+
Explanation
A positive lookbehind (?<=
That asserts that what is directly behind is a #
Close positive lookbehind )
Match one or more word characters \w+
The central part of your regex ((.*)) matches a sequence of any chars.
Instead, you should look for a sequence of chars other than whitespace (\S+) or word chars (\w+).
Note also the change from * to +, as you are probably not interested in an empty sequence of chars.
To also capture a name in the "last" position of the source string, the last part of your regex should match not only a sequence of whitespace chars but also the end of the string, so change (?=\\s+) to (?=\\s+|$).
One last remark: you don't actually need the parentheses around the "central" part.
So to sum up, the whole regex can look like this:
(?<=#)\w+(?=\s+|$)
(with the global option).
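In stringr, a global match corresponds to str_extract_all; for instance, on the chat line from the top of the question:
library(stringr)
str_extract_all("#Jack #Marie Hi there!", "(?<=#)\\w+(?=\\s+|$)")
## [[1]]
## [1] "Jack"  "Marie"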
Here is a non-regex approach, or rather a minimal-regex approach, since grep still runs the detection of # through the regex engine:
grep('#', strsplit(x, ' ')[[1]], value = TRUE)
#[1] "#This"
Or to avoid strsplit, we can use scan (taken from this answer), i.e.
grep('#', scan(textConnection(x), " "), value=TRUE)
#Read 7 items
#[1] "#This"
Consider the following string:
tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl
I have a bunch of strings that have /aaa-bbb-ccc/ in them. I would like to remove any characters that occur before /aaa-bbb-ccc/. The final product of the above, for example, should be /aaa-bbb-ccc/def/ghi/jkl.
My attempt, after some searching:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
sub("^.*[^/aaa-bbb-ccc/]", "", x)
[1] ""
You need to use lazy dot matching and wrap the known value in a capturing group so you can restore it with a backreference later:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
sub(".*?(/aaa-bbb-ccc/)", "\\1", x)
## [1] "/aaa-bbb-ccc/def/ghi/jkl"
See this R demo.
See the regex demo: .*? matches any 0+ chars, as few as possible, and (/aaa-bbb-ccc/) is a capturing group with ID=1 that is referenced with \1 from the replacement pattern.
Note you may also extract that part using regmatches/regexpr:
x <- "tempo/blah/blah/aaa-bbb-ccc/def/ghi/jkl"
regmatches(x, regexpr("/aaa-bbb-ccc/.*", x))
See this R demo. .* just grabs any 0+ chars up to the end of the string.
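It returns the same result:
## [1] "/aaa-bbb-ccc/def/ghi/jkl"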