I am trying to create a pattern for a regular expression in R. I want the pattern to be as shown here,
file1 <- "example.txt"
file2 <- "example.ffe.2f2.csv"
files <- c(file1,file2)
#pattern that matches everything up to, but not including last .
pattern <- ".*(?=\.)"
m <- regexpr(pattern, files)
However I am getting an error on the pattern line saying
Error: '\.' is an unrecognized escape in character string starting "".*(?=\."
I want the regex to match example in file1 and example.ffe.2f2 in file2. Any suggestions/things I'm doing incorrectly? It works correctly on regex101.com, so I know the pattern is correct.
(?=\.) is a positive lookahead. The TRE regex flavor (used by default when perl=TRUE is not specified) does not support lookaheads, so you have to use the PCRE regex engine to handle such patterns.
To escape the . properly with a literal \, the \ symbol must be doubled in an R string literal. However, you may avoid that by putting the . into a bracket expression / character class: [.].
You may use the following code:
file1 <- "example.txt"
file2 <- "example.ffe.2f2.csv"
files <- c(file1,file2)
regmatches(files, regexpr(".*(?=[.])", files, perl=TRUE))
## => [1] "example" "example.ffe.2f2"
See the online R demo.
Note that the same result can be obtained with
tools::file_path_sans_ext(files)
that gets the file names without extensions (demo).
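For completeness, here is a minimal sketch of the other option mentioned above: keeping \. in the lookahead and doubling the backslash in the R string literal. It reuses the file names from the question and assumes perl = TRUE so that PCRE handles the lookahead.

```r
files <- c("example.txt", "example.ffe.2f2.csv")

# "\\." in the R string literal is the regex \. (a literal dot);
# perl = TRUE switches to PCRE, which supports the (?=...) lookahead
m <- regexpr(".*(?=\\.)", files, perl = TRUE)
res <- regmatches(files, m)
print(res)
```

The greedy .* backtracks to the last dot, so only the final extension is dropped.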
I'm just getting to know R, having previously worked with Python. The challenge is to replace the last character of each word in a string with *.
Expected behavior: for the input example text in string, the result should be exampl* tex* i* strin*
My code:
library(tidyverse)
library(stringr)
string_example = readline("Enter our text:")
string_example = unlist(strsplit(string_example, ' '))
string_example
result = str_replace(string_example, pattern = "*\b", replacement = "*")
result
I get an error:
> result = str_replace(string_example, pattern = "*\b", replacement = "*")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=``)
Please help me solve this task.
Edit: I noticed an error, the pattern should be .\b. With that pattern the code runs, but nothing in the string is replaced.
If you mean words consisting of letters only, you can use
string_example <- "example text in string"
library(stringr)
str_replace_all(string_example, "\\p{L}\\b", "*")
## => [1] "exampl* tex* i* strin*"
See the R demo and the regex demo.
Details:
\p{L} - a Unicode category (property) class matching any Unicode letter
\b - a word boundary; in this case, it makes sure there is no other word character immediately on the right. It will fail the match if the letter matched with \p{L} is immediately followed by a letter, digit or _ (these are all word chars). If you want to limit this to a letter check, replace \b with (?!\p{L}).
Note the backslashes are doubled because in regular string literals backslashes are used to form string escape sequences, and thus need escaping themselves to introduce literal backslashes in string literals.
Some more things to consider
If you do not want to change one-letter words, add a non-word boundary at the start, "\\B\\p{L}\\b"
If you want to avoid matching letters that are followed with - + another letter (i.e. some compound words), you can add a lookahead check: "\\p{L}\\b(?!-)".
You may combine the lookarounds and (non-)word boundaries as you need.
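A quick sketch of how those two variants behave. It uses base R gsub with perl = TRUE rather than stringr so that it runs without extra packages (that swap is my assumption; the same patterns should work in str_replace_all), and the sample strings are made up for illustration.

```r
x1 <- "example text in a string"
# \B before the letter requires a preceding word char,
# so one-letter words such as "a" are left alone
r1 <- gsub("\\B\\p{L}\\b", "*", x1, perl = TRUE)

x2 <- "well-known fact"
# (?!-) rejects a word-final letter that is immediately followed by a hyphen
r2 <- gsub("\\p{L}\\b(?!-)", "*", x2, perl = TRUE)

print(r1)  # "exampl* tex* i* a strin*"
print(r2)  # "well-know* fac*"
```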
Some of my numbers contain errors and look like "59.34343.23". I know the first dot is correct, but the second one (or any after the first) should be removed. How can I remove those?
I tried using gsub in R:
gsub("(?<=\\..*)\\.", "", "59.34343.23", perl=T)
or
gsub("(?<!^[^.]*)\\.", "", "59.34343.23", perl=T)
However it gets the following error "invalid regular expression". But I have been trying the same code in a regex tester and it works.
What is my mistake here?
You can use
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23")
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23", perl=TRUE)
See the R demo online and the regex demo.
Details:
^([^.]*\.) - Capturing group 1 (referred to as \1 from the replacement pattern): zero or more chars other than . from the start of string and then a . char (the first in the string)
| - or
\. - any other dot in the string.
Since the replacement, \1, refers to Group 1, and Group 1 only contains a value after the text before and including the first dot is matched, the replacement is either this part of text, or empty string (i.e. the second and all subsequent occurrences of dots are removed).
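To see the two branches at work, here is a small sketch that runs the same pattern over a few inputs, including ones with a single dot or no dot at all. The extra test values are made up for illustration.

```r
x <- c("59.34343.23", "1.2.3.4", "42.0", "7")

# First alternative: text up to and including the first dot, put back via \1.
# Second alternative: any later dot, where group 1 is empty, so it is dropped.
res <- gsub("^([^.]*\\.)|\\.", "\\1", x, perl = TRUE)
print(res)  # "59.3434323" "1.234" "42.0" "7"
```

Strings with zero or one dot come back unchanged, which is usually what you want for already-clean numbers.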
We may use
gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", str1, perl = TRUE)
[1] "59.3434323"
data
str1 <- "59.34343.23"
By specifying perl = TRUE you can convert matches of the following regular expression to empty strings:
^[^.]*\.[^.]*\K.|\.
If you are unfamiliar with \K: it resets the start of the reported match, so the text matched before it is kept and only what follows \K is replaced.
Another option is to write a dot back only if it is the first one in the line.
The key is to consume the other dots without writing them back.
The effect is to delete every dot after the first.
The pattern below uses a branch reset group to accomplish this (Perl mode).
(?m)(?|(^[^.\n]*\.)|()\.+)
Replace $1
https://regex101.com/r/cHcu4j/1
(?m)
(?|
( ^ [^.\n]* \. ) # (1)
| ( ) # (1)
\.+
)
The pattern that you tried does not work because the lookbehind (?<=\\..*) contains an unbounded quantifier (.*), which PCRE does not support inside a lookbehind.
Another variation using \G to get continuous matches after the first dot:
(?:^[^.]*\.|\G(?!^))[^.]*\K\.
In parts, the pattern matches:
(?: Non capture group for the alternation |
^[^.]*\. Start of string, match any char except ., then match .
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
)[^.]* Optionally match any char except .
\K\. Clear the match buffer and match the dot (to be removed)
Regex demo | R demo
gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", "59.34343.23", perl=T)
Output
[1] "59.3434323"
This string manipulation problem has evaded my best efforts. I have a string, e.g.
eg_str="[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
for which I would like to replace all spaces in the wildcard for ([wildcard].md) with underscores. My first thought was to use either gsub or stringr's str_replace_all to pass the appropriate substrings to a simple function. Something like
convert_space_to_underscore<-function(string){
return(str_replace(string," ","_"))
}
normal_eg_str<-gsub("\\((.+?)md\\)",paste0("(",convert_space_to_underscore("\\1"),"md)"),normal_eg_str)
or
normal_eg_str<-str_replace_all(document,"\\((.+?)md\\)",paste0("(",convert_space_to_underscore("\\1"),".md)"))
When I run these however, it appears that the argument to convert_space_to_underscore is being passed, rather than the output, because the string returns unchanged (if you make an error in the paste0 component, say have paste0("(",convert_space_to_underscore("\\1"),".m)"), then the string returns as
eg_str="[probability space](posts/probability space.m) is ... [Sigma Field](posts/Sigma Field.m)"
so I'm quite sure that what is happening is that str_replace_all and gsub are simply not evaluating the function).
Is there a way to force evaluation? This would be ideal, as it would allow the regex component to remain somewhat readable. However, I would welcome any pure-regex solutions as well; my attempts have all led to greedy-matching errors, no matter where I sprinkle ? and {0} special characters. (Word of caution: there will be some matching substrings with more than one space, e.g. [Dynklin's Pi Lambda](posts/dynklins pi lambda.md))
You can use
library(stringr)
eg_str <- "[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
str_replace_all(eg_str, "\\([^()]+\\.md\\)", function(x) gsub(" ", "_", x, fixed=TRUE) )
## => [1] "[probability space](posts/probability_space.md) is ... [Sigma Field](posts/Sigma_Field.md)"
See online R demo.
NOTE: To replace one or more whitespace chunks with a single underscore, you will need a regex in gsub: gsub("\\s+", "_", x).
The first regex finds all strings that
\( - start with (
[^()]+ - have one or more chars other than ( and )
\.md - a .md string
\) - and end with )
Then, the match is passed to an anonymous function that replaces each regular space with a _ (with gsub(" ", "_", x, fixed=TRUE)).
A base R solution (less readable, but using a plain regex):
eg_str <- "[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
gsub("(?:\\G(?!^)|\\()[^()\\s]*\\K\\s+(?=[^()]*\\.md\\))", "_", eg_str, perl=TRUE)
See this R demo online. See this regex demo. Details:
(?:\G(?!^)|\() - end of the preceding match or a ( char
[^()\s]* - any 0 or more chars other than (, ) and whitespace
\K - match reset operator that discards all text matched so far from the overall match memory buffer
\s+ - one or more whitespaces
(?=[^()]*\.md\)) - there should be zero or more chars other than ( and ) followed with .md) immediately to the right of the current location.
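If you like the "match, then transform each match" idea from the stringr version but want to stay in base R, one possible sketch is gregexpr plus the regmatches<- replacement function. This is a different technique from both answers above, shown only as an alternative; the test string borrows the multi-space example from the question.

```r
eg <- "[Dynklin's Pi Lambda](posts/dynklins pi lambda.md) is ... [Sigma Field](posts/Sigma Field.md)"

# Locate every "(....md)" chunk that contains no nested parentheses
m <- gregexpr("\\([^()]+\\.md\\)", eg)

# Transform only the matched chunks: each space becomes an underscore
regmatches(eg, m) <- lapply(regmatches(eg, m), function(x) {
  gsub(" ", "_", x, fixed = TRUE)
})
print(eg)
```

The link text in square brackets is never matched, so its spaces are left alone.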
I would like to write a function rm_ext similar to tools::file_path_sans_ext but does not strip off file endings if they start with a digit. By replacing [:alnum:] by [:alpha:] in tools::file_path_sans_ext I almost got there, but if the base name of the file ends in a dot itself, it fails:
rm_ext <- function(x) sub("([^.]+)\\.[[:alpha:]]+$", "\\1", x) # adapted from tools::file_path_sans_ext()
rm_ext("test.string.with.dots.but.ending.alpha=0.25.rda") # works
rm_ext("test.string.with.dots.but.without.ending.alpha=0.25") # works
rm_ext("test.string.with.dots.but.ending.alpha=0.25.") # fails (should remove the final . too)
I tried to match [:alpha:] or EOL, but that didn't make the last case work.
Note: As a comparison, tools::file_path_sans_ext (of course) fails, see tools::file_path_sans_ext("test.string.with.dots.but.without.ending=0.25"). Also note that this is somewhat related but different.
You may use
\.(?:[^0-9.][^.]*)?$
See the regex demo.
Details
\. - a dot
(?:[^0-9.][^.]*)? - an optional sequence of a char other than a dot and a digit and then any 0+ chars other than a dot
$ - end of string.
In the code:
sub("\\.(?:[^0-9.][^.]*)?$", "", x)
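As a sketch, here is that call wrapped into the rm_ext function from the question and run on its three test strings. perl = TRUE is my addition, only to pin the regex engine; the call above omits it.

```r
# Strip a trailing ".ext" unless the extension starts with a digit;
# a bare trailing dot is stripped too (the optional group matches nothing)
rm_ext <- function(x) sub("\\.(?:[^0-9.][^.]*)?$", "", x, perl = TRUE)

r1 <- rm_ext("test.string.with.dots.but.ending.alpha=0.25.rda")
r2 <- rm_ext("test.string.with.dots.but.without.ending.alpha=0.25")
r3 <- rm_ext("test.string.with.dots.but.ending.alpha=0.25.")
print(c(r1, r2, r3))
```

The first call drops .rda, the second leaves the string unchanged because .25 starts with a digit, and the third removes only the final dot.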
I am trying to split my string on all punctuation and whitespace wherever they occur in the string (hence the + sign), except on "#" and "/", because I don't want it to split #n/a, which it currently does. I searched a lot on this problem but can't find a solution. Any suggestions?
t<-"[[:punct:][:space:]]+"
bh <- tolower(strsplit(as.character(a), t)[[1]])
I have also tried storing the following in t, but it also gives an error
t<-"[!"\$%&'()*+,\-.:;<=>?#\[\\\]^_`{|}~\\ ]+"
Error: unexpected input in "t<-"[!"\"
One alternate is to substitute #n/a but I want to know how to do it without having to do that.
You may use a PCRE regex with a lookahead that will restrict the bracket expression pattern:
t <- "(?:(?![#/])[[:punct:][:space:]])+"
bh <- tolower(strsplit(as.character(a), t, perl=TRUE)[[1]])
The (?:(?![#/])[[:punct:][:space:]])+ pattern matches 1 or more repetitions of any punctuation or whitespace char that is not a # or / char.
See the regex demo.
If you want to spell out the symbols you want to match inside a bracket expression you may fix your other pattern like
t <- "[][!\"$%&'()*+,.:;<=>?#\\\\^_`{|}~ -]+"
Note that ] must come right after the opening [; a [ inside the expression does not need to be escaped; - can be put unescaped at the end; a literal \ should be defined with 4 backslashes; and $ does not have to be escaped.
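A short sketch of the lookahead pattern in use. The question never shows what a contains, so the input string here is a made-up example.

```r
# Hypothetical input; the real value of `a` is not shown in the question
a <- "Hello, world! value: #n/a; done"

# (?![#/]) vetoes # and / before the bracket expression consumes a char,
# so "#n/a" survives as a single token
t <- "(?:(?![#/])[[:punct:][:space:]])+"
bh <- tolower(strsplit(as.character(a), t, perl = TRUE)[[1]])
print(bh)  # "hello" "world" "value" "#n/a" "done"
```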