Passing a function to R regex-based tools

This string manipulation problem has evaded my best efforts. I have a string, e.g.
eg_str="[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
for which I would like to replace all spaces in the wildcard for ([wildcard].md) with underscores. My first thought was to use either gsub or stringr's str_replace_all to pass the appropriate substrings to a simple function. Something like
convert_space_to_underscore<-function(string){
return(str_replace(string," ","_"))
}
eg_str<-gsub("\\((.+?)md\\)",paste0("(",convert_space_to_underscore("\\1"),"md)"),eg_str)
or
eg_str<-str_replace_all(eg_str,"\\((.+?)md\\)",paste0("(",convert_space_to_underscore("\\1"),".md)"))
When I run these however, it appears that the argument to convert_space_to_underscore is being passed, rather than the output, because the string returns unchanged (if you make an error in the paste0 component, say have paste0("(",convert_space_to_underscore("\\1"),".m)"), then the string returns as
eg_str="[probability space](posts/probability space.m) is ... [Sigma Field](posts/Sigma Field.m)"
so I'm quite sure that what is happening is that str_replace_all and gsub are simply not evaluating the function).
Is there a way to force evaluation? This would be ideal, as it would allow the regex component to remain somewhat readable. However, I would welcome any pure-regex solutions as well; my attempts have all led to greedy matching errors, no matter where I sprinkle ? and {0} special characters. (Word of caution: there will be some matching substrings with more than one space, e.g. [Dynklin's Pi Lambda](posts/dynklins pi lambda.md).)

You can use
library(stringr)
eg_str <- "[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
str_replace_all(eg_str, "\\([^()]+\\.md\\)", function(x) gsub(" ", "_", x, fixed=TRUE) )
## => [1] "[probability space](posts/probability_space.md) is ... [Sigma Field](posts/Sigma_Field.md)"
See online R demo.
NOTE: To replace each run of one or more whitespace characters with a single underscore, you will need a regex in gsub: gsub("\\s+", "_", x).
The first regex finds all strings that
\( - start with (
[^()]+ - have one or more chars other than ( and )
\.md - a .md string
\) - and end with )
Then, each match is passed to an anonymous function that replaces every regular space with a _ (with gsub(" ", "_", x, fixed=TRUE)).
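To see why the attempts in the question leave the string unchanged, it may help to evaluate the replacement argument on its own. This is a minimal sketch in base R; the helper mirrors the one in the question but uses sub() instead of str_replace(), which is an assumption made here so that stringr is not required.

```r
eg_str <- "[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"

# Stand-in for the question's helper, using base R's sub() instead of str_replace()
convert_space_to_underscore <- function(string) sub(" ", "_", string, fixed = TRUE)

# R evaluates this argument before gsub() ever looks at eg_str, so the helper
# only receives the literal string \1, which contains no space to replace
replacement <- paste0("(", convert_space_to_underscore("\\1"), "md)")
print(replacement)

# gsub() then pastes the captured text back in verbatim, so nothing changes
result <- gsub("\\((.+?)md\\)", replacement, eg_str, perl = TRUE)
print(identical(result, eg_str))
```

Passing a function as the replacement, as above, defers the call until a match is available.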
A base R solution (less readable, but using a plain regex):
eg_str <- "[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
gsub("(?:\\G(?!^)|\\()[^()\\s]*\\K\\s+(?=[^()]*\\.md\\))", "_", eg_str, perl=TRUE)
See this R demo online. See this regex demo. Details:
(?:\G(?!^)|\() - end of the preceding match or a ( char
[^()\s]* - any 0 or more chars other than (, ) and whitespace
\K - match reset operator that discards all text matched so far from the overall match memory buffer
\s+ - one or more whitespaces
(?=[^()]*\.md\)) - there should be zero or more chars other than ( and ) followed with .md) immediately to the right of the current location.
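If stringr is not available, the same "apply a function to each match" idea can be sketched in base R with gregexpr() and the replacement form of regmatches(). The helper name replace_matches_with is hypothetical and introduced only for illustration; the input uses the multi-space case mentioned in the question.

```r
# Hypothetical helper: applies `fun` to every match of `pattern` in `x`
replace_matches_with <- function(x, pattern, fun, ...) {
  m <- gregexpr(pattern, x, ...)                     # locate all matches
  regmatches(x, m) <- lapply(regmatches(x, m), fun)  # rewrite them in place
  x
}

eg_str <- "[Dynklin's Pi Lambda](posts/dynklins pi lambda.md) and [Sigma Field](posts/Sigma Field.md)"

fixed_str <- replace_matches_with(
  eg_str,
  "\\([^()]+\\.md\\)",
  function(s) gsub(" ", "_", s, fixed = TRUE)  # every space inside a match
)
print(fixed_str)
```

This keeps the readable pattern from the stringr solution and leaves the link text in square brackets untouched, since only the parenthesized part is matched.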

Related

How do I add a space between two characters using regex in R?

I want to add a space between two punctuation characters (+ and -).
I have this code:
s <- "-+"
str_replace(s, "([:punct:])([:punct:])", "\\1\\s\\2")
It does not work.
May I have some help?
There are several issues here:
In the ICU regex flavor, [:punct:] does not match math symbols (\p{S}); it only matches punctuation proper (\p{P}). To match both, combine the two classes: [\p{P}\p{S}]
The "\\1\\s\\2" replacement contains a \s regex escape sequence, which is not supported in replacement patterns; use a literal space instead
str_replace only replaces the first occurrence; use str_replace_all to handle all matches
Even if you use all the above suggestions, it still won't work for strings like -+?/. You need to make the second part of the regex a zero-width assertion, a positive lookahead, in order not to consume the second punctuation.
So, you can use
library(stringr)
s <- "-+?="
str_replace_all(s, "([\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", "\\1 ")
str_replace_all(s, "(?<=[\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", " ")
gsub("(?<=[[:punct:]])(?=[[:punct:]])", " ", s, perl=TRUE)
See the R demo online, all three lines yield [1] "- + ? =" output.
Note that in the PCRE regex flavor (used with gsub and perl=TRUE) the POSIX character class must be put inside a bracket expression, hence the use of double brackets in [[:punct:]].
Also, (?<=[[:punct:]]) is a positive lookbehind that checks for the presence of its pattern immediately on the left, and since it is non-consuming there is no need of any backreference in the replacement.
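The difference between consuming the second character and merely looking ahead at it can be sketched in base R as follows. The variable names are illustrative only.

```r
s <- "-+?="

# Both punctuation chars are consumed, so matches cannot overlap and the
# boundary between "+" and "?" is skipped
consuming <- gsub("([[:punct:]])([[:punct:]])", "\\1 \\2", s, perl = TRUE)

# Zero-width lookarounds consume nothing, so every boundary is visited
lookaround <- gsub("(?<=[[:punct:]])(?=[[:punct:]])", " ", s, perl = TRUE)

print(consuming)
print(lookaround)
```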

Remove all dots but first in a string using R

I have errors in some numbers, which show up as strings like "59.34343.23". I know the first dot is correct, but the second one (or any after the first) should be removed. How can I remove those?
I tried using gsub in R:
gsub("(?<=\\..*)\\.", "", "59.34343.23", perl=T)
or
gsub("(?<!^[^.]*)\\.", "", "59.34343.23", perl=T)
However it gets the following error "invalid regular expression". But I have been trying the same code in a regex tester and it works.
What is my mistake here?
You can use
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23")
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23", perl=TRUE)
See the R demo online and the regex demo.
Details:
^([^.]*\.) - Capturing group 1 (referred to as \1 from the replacement pattern): any zero or more chars from the start of string and then a . char (the first in the string)
| - or
\. - any other dot in the string.
Since the replacement, \1, refers to Group 1, and Group 1 only contains a value after the text before and including the first dot is matched, the replacement is either this part of text, or empty string (i.e. the second and all subsequent occurrences of dots are removed).
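A short sketch of how this behaves on a vector, including inputs with several, one, or no dots (the sample values other than the first are made up for illustration):

```r
x <- c("59.34343.23", "1.2.3.4", "7.5", "42")

# The first alternative restores the text up to the first dot via \1;
# the second alternative leaves Group 1 empty, so later dots vanish
cleaned <- gsub("^([^.]*\\.)|\\.", "\\1", x, perl = TRUE)
print(cleaned)

# The cleaned strings can then be converted to numbers
print(as.numeric(cleaned))
```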
We may use
gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", str1, perl = TRUE)
[1] "59.3434323"
data
str1 <- "59.34343.23"
By specifying perl = TRUE you can convert matches of the following regular expression to empty strings:
^[^.]*\.[^.]*\K.|\.
If you are unfamiliar with \K: it resets the starting point of the reported match, so everything matched before it is left out of the text that gets replaced.
There is always the option to write back the dot only if it's the first in the line.
The key is to consume the other dots but not write them back.
The effect is to delete every dot after the first.
Below uses a branch reset to accomplish the goal (Perl mode).
(?m)(?|(^[^.\n]*\.)|()\.+)
Replace $1
https://regex101.com/r/cHcu4j/1
(?m)
(?|
( ^ [^.\n]* \. ) # (1)
| ( ) # (1)
\.+
)
The pattern that you tried does not match, because there is an infinite quantifier in the lookbehind (?<=\\..*) that is not supported.
Another variation using \G to get continuous matches after the first dot:
(?:^[^.]*\.|\G(?!^))[^.]*\K\.
In parts, the pattern matches:
(?: Non capture group for the alternation |
^[^.]*\. Start of string, match any char except ., then match .
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
)[^.]* Match zero or more chars other than .
\K\. Clear the match buffer and match the dot (to be removed)
Regex demo | R demo
gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", "59.34343.23", perl=T)
Output
[1] "59.3434323"
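As a rough cross-check, three of the patterns from the answers above can be run side by side on the same input. This is a sketch; the vector names and the extra sample strings are assumptions added for illustration.

```r
x <- c("59.34343.23", "1.2.3.4", "7.5", "42")

patterns <- c(
  capture   = "^([^.]*\\.)|\\.",                    # alternation + backreference
  skip_fail = "^[^.]+\\.(*SKIP)(*FAIL)|\\.",        # skip the part up to the first dot
  g_anchor  = "(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\."   # continue from the previous match
)
replacements <- c(capture = "\\1", skip_fail = "", g_anchor = "")

results <- lapply(names(patterns), function(nm) {
  gsub(patterns[[nm]], replacements[[nm]], x, perl = TRUE)
})
names(results) <- names(patterns)
print(results)
```

Note that the (*SKIP)(*FAIL) variant uses [^.]+ and therefore requires at least one character before the first dot, so a string that starts with a dot would be treated differently by it.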

using regex predefined class with an exception in R

So I am trying to split my string based on all the punctuations and space wherever they occur in the string (hence the + sign) except for on "#" & "/" because I don't want it to split #n/a which it does. I did search a lot on this problem but can't get to the solution. Any suggestions?
t<-"[[:punct:][:space:]]+"
bh <- tolower(strsplit(as.character(a), t)[[1]])
I have also tried storing the following to t but it also gives error
t<-"[!"\$%&'()*+,\-.:;<=>?#\[\\\]^_`{|}~\\ ]+"
Error: unexpected input in "t<-"[!"\"
One alternate is to substitute #n/a but I want to know how to do it without having to do that.
You may use a PCRE regex with a lookahead that will restrict the bracket expression pattern:
t <- "(?:(?![#/])[[:punct:][:space:]])+"
bh <- tolower(strsplit(as.character(a), t, perl=TRUE)[[1]])
The (?:(?![#/])[[:punct:][:space:]])+ pattern matches 1 or more repetitions of any punctuation or whitespace char other than # and /.
See the regex demo.
If you want to spell out the symbols you want to match inside a bracket expression you may fix your other pattern like
t <- "[][!\"$%&'()*+,.:;<=>?#\\\\^_`{|}~ -]+"
Note that ] must come right after the opening [; a [ inside the expression does not need to be escaped; - can be placed unescaped at the end; a literal \ must be written with 4 backslashes; and $ does not have to be escaped.
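A minimal sketch of the lookahead-restricted split, assuming a made-up input string for a (the question does not show the actual data):

```r
# Hypothetical sample input; the original value of `a` is not given
a <- "Value is #n/a, see note; done"

# Punctuation or whitespace, except where the next char is # or /
t <- "(?:(?![#/])[[:punct:][:space:]])+"
bh <- tolower(strsplit(as.character(a), t, perl = TRUE)[[1]])
print(bh)
```

Runs such as ", " and "; " are treated as a single separator because of the trailing +, while #n/a survives as one token.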

how to remove sentences with conjunctions in R

I have text, an example of which is as follows
Input
c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
The expected output is
,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this
I tried:
x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])
but it eradicated the whole sentence. How do I remove the phrase with 'but' in it and keep the rest of the phrases in each sentence?
You may use
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)
See the R demo online
The PCRE pattern matches:
.* - any zero or more chars other than line break chars, as many as possible
\\bbut\\b - a whole word but (\b are word boundaries)
.* - any zero or more chars other than line break chars, as many as possible
[\r\n]* - 0 or more line break chars.
Note that the first gsub has a perl=TRUE argument that makes R use the PCRE regex engine to parse the pattern, and . does not match a line break char there. The second gsub uses a TRE (default) regex engine, and one needs to use (?n) inline modifier to make . fail to match line break chars there.
Note that you mixed up "\n" and "/n", which I did correct.
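The effect of the word boundaries and of ignore.case can be checked on a small made-up input (the text below is an assumption for illustration, not the asker's data):

```r
lines <- "First line\nA button is here\nBut I failed\nLast line"

# "button" is kept because \b requires "but" to be a whole word;
# "But I failed" is dropped together with its trailing line break
kept <- gsub(".*\\bbut\\b.*[\r\n]*", "", lines, ignore.case = TRUE, perl = TRUE)
cat(kept, "\n")
```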
My idea for a solution:
1) Simply catch all chars which are no linebreak ([^\n]) before and after the "but".
2) (Edit) To address the issue Wiktor found, we also have to check that a non-letter character ([^a-zA-Z]) sits directly before and directly after the "but".
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this"

Is there a simple way to get substring in R?

I get the substring of a word in the following way:
word="xyz9874"
pattern="[0-9]+"
x=gregexpr(pattern,word)
substr(word,start=x[[1]],stop=x[[1]]+attr(x[[1]],"match.length")-1)
[1] "9874"
Is there a simpler way to get the result in R?
Sure, use gsub and backreferencing:
gsub( ".*?([0-9]+).*", "\\1", word )
Explanation: in most regex implementations, \1 is the back reference to the first subpattern matched. The subpattern is enclosed in parentheses. In R, you need to escape the backslash irrespective of the type of quotation marks you are using.
The question mark, an idiom of the "extended" regular expressions, means that the preceding .* should not be greedy; in other words, it should take as little of the string as possible. Otherwise, the .* in the pattern .*([0-9]+) would match xyz987 and ([0-9]+) would match 4. Alternatively, we can write
gsub( ".*[^0-9]+([0-9]+).*", "\\1", word )
but then we have a problem with strings that start with a number.
By the way, note that instead of [0-9] you can write \d, or, actually, \\d:
gsub( ".*?(\\d+).*", "\\1", word )
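The greedy versus lazy behavior described above can be verified directly, and base R's regmatches() offers another short way to pull out the match without a backreference. This is a sketch; perl = TRUE is used here only to pin down the regex engine.

```r
word <- "xyz9874"

# Extract the first run of digits directly
digits <- regmatches(word, regexpr("[0-9]+", word))

# Greedy .* leaves only the last digit for the group
greedy <- sub(".*([0-9]+).*", "\\1", word, perl = TRUE)

# Lazy .*? stops as soon as the digits can start
lazy <- sub(".*?([0-9]+).*", "\\1", word, perl = TRUE)

print(c(digits = digits, greedy = greedy, lazy = lazy))
```

Unlike the sub() approach, regmatches() returns character(0) when there is no match instead of handing back the unchanged input.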
