Replace text between first bracket and . without removing the bracket and - r

I would like to replace the 70 between the brackets with a specific string, let's say '80'.
from filter[70.00-100.00] --> filter[80.00-100.00]
However when using the following code:
str_replace('filter [70.00-140.00]'," *\\[.*?\\. *",'80')
The output is:
filter8000-140.00]
Is there any way to replace the string between the [ and the . (in this case 70) without removing the [ and the . ?

To replace any one or more digits after [ use
library(stringr)
str_replace('filter [70.00-140.00]','(?<=\\[)\\d+', '80')
sub('\\[\\d+', '[80', 'filter [70.00-140.00]')
sub('(?<=\\[)\\d+', '80', 'filter [70.00-140.00]', perl=TRUE)
See the online R demo. The (?<=\[)\d+ matches a location immediately preceded with [ and then one or more digits. \[\d+ matches [ and one or more digits, so [ must be restored and thus is added into the replacement pattern.
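As a quick check of the two approaches above, here is a minimal base R sketch run over a small vector. Only the first sample string comes from the question; the others are made up for illustration. It assumes the number to replace directly follows the opening bracket.

```r
# Sample inputs; only the first comes from the question, the rest are illustrative.
x <- c("filter [70.00-140.00]", "filter [5.50-9.00]", "no brackets 70.00")

# Approach 1: consume "[" plus the digits, then put "[" back in the replacement.
consumed <- sub("\\[\\d+", "[80", x)

# Approach 2: a lookbehind asserts "[" without consuming it (needs perl = TRUE).
lookbehind <- sub("(?<=\\[)\\d+", "80", x, perl = TRUE)

print(consumed)
print(identical(consumed, lookbehind))
```

Both calls leave the string without brackets untouched, because the pattern requires a [ right before the digits.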
To replace exactly 70, you can use
library(stringr)
str_replace('filter [70.00-140.00]',"(\\[[^\\]\\[]*)70",'\\180')
# => [1] "filter [80.00-140.00]"
sub('(\\[[^][]*)70','\\180', 'filter [70.00-140.00]')
# => [1] "filter [80.00-140.00]"
See the regex demo. Details:
(\[[^\]\[]*) - Group 1: [, then zero or more chars other than [ and ]
70 - a 70 string.
In the replacement, \1 inserts Group 1 value.
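One thing worth checking with this pattern is which 70 gets replaced when the bracketed text contains more than one. The sketch below wraps the idea in a helper; replace_in_brackets and its first argument are hypothetical names introduced here, and from is assumed to consist of digits only so that it needs no regex escaping.

```r
# Hypothetical helper: replace the digits `from` with `to` inside [...].
# `from` is assumed to contain digits only, so it needs no regex escaping.
replace_in_brackets <- function(x, from, to, first = FALSE) {
  # Greedy [^][]* backtracks from the right, so the LAST `from` is replaced;
  # the lazy [^][]*? stops at the FIRST one instead.
  prefix <- if (first) "(\\[[^][]*?)" else "(\\[[^][]*)"
  # Backreferences in sub() are single-digit (\1 to \9), so "\\1" followed
  # by "80" is read as group 1 plus the literal text 80.
  sub(paste0(prefix, from), paste0("\\1", to), x, perl = TRUE)
}

replace_in_brackets("filter [70.00-140.00]", "70", "80")
replace_in_brackets("filter [70.00-170.00]", "70", "80")               # last 70 (in 170)
replace_in_brackets("filter [70.00-170.00]", "70", "80", first = TRUE) # first 70
```

With the question's input there is only one 70, so both variants give the same result; the difference only shows up in strings like the second example.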
Another solution could be replacing all occurrences of 70 inside square brackets that end with a word boundary and are not preceded by a digit or by a digit + . (that is, to only match 70 as the whole integer part of a number) with
str_replace_all(
'filter [70.00-140.00]',
'\\[[^\\]\\[]*]',
function(x) gsub('(?<!\\d|\\d\\.)70\\b', '80', x, perl=TRUE))
# => [1] "filter [80.00-140.00]"
Here, \[[^\]\[]*] matches strings between two square brackets having no other square brackets in between, and gsub('(?<!\\d|\\d\\.)70\\b', '80', x, perl=TRUE) is run on these matched substrings only. The (?<!\d|\d\.)70\b pattern matches any 70 that is not preceded by a digit or by a digit + . and is not followed by another word char (letter, digit or _). Note that the outer str_replace_all pattern is handled by ICU, while the inner gsub call uses PCRE because of perl=TRUE.
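If stringr is not available, the same "match the brackets first, then edit inside them" idea can be sketched in base R with gregexpr and the regmatches<- replacement function. The second sample string is made up to show that a 70 outside the brackets, or one that is part of a larger number, is left alone.

```r
x <- c("filter [70.00-140.00]", "70 outside [170.70-70.00]")

# Locate every [...] chunk that has no nested brackets.
m <- gregexpr("\\[[^][]*]", x)

# Rewrite only those chunks; text outside the brackets is left alone.
regmatches(x, m) <- lapply(regmatches(x, m), function(chunk) {
  gsub("(?<!\\d|\\d\\.)70\\b", "80", chunk, perl = TRUE)
})

print(x)
```

Note that this modifies x in place, unlike str_replace_all, which returns a new vector.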

You can use this. It will replace all digits between the [ and the . :
preg_replace('/(?<=\[)(\d*)(?=\.)/', '80', 'filter [70.00-140.00]');

Related

grep in R, literal and pattern match

I have seen in manuals how to use grep to match either a pattern or an exact string. However, I cannot figure out how to do both at the same time. I have a LaTeX file where I want to find the following pattern:
\caption[SOME WORDS]
and replace it with:
\caption[\textit{SOME WORDS}]
I have tried with:
texfile <- sub('\\caption[','\\caption[\\textit ', texfile, fixed=TRUE)
but I do not know how to tell grep that there should be some text after the square bracket, and then a closed square bracket.
You can use
texfile <- "\\caption[SOME WORDS]" ## -> \caption[\textit{SOME WORDS}]
texfile <-gsub('(\\\\caption\\[)([^][]*)]','\\1\\\\textit{\\2}]', texfile)
cat(texfile)
## -> \caption[\textit{SOME WORDS}]
See the R demo online.
Details:
(\\caption\[) - Group 1 (\1 in the replacement pattern): a \caption[ string
([^][]*) - Group 2 (\2 in the replacement pattern): any zero or more chars other than [ and ]
] - a ] char.
Another solution based on a PCRE regex:
gsub('\\Q\\caption[\\E\\K([^][]*)]','\\\\textit{\\1}]', texfile, perl=TRUE)
See this R demo online. Details:
\Q - start "quoting", i.e. treating the patterns to the right as literal text
\caption[ - a literal fixed string
\E - stop quoting the pattern
\K - omit text matched so far
([^][]*) - Group 1 (\1): any zero or more non-bracket chars
] - a ] char.
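To see how the first (default regex engine) solution behaves on more than one line, here is a small sketch. The tex vector is hypothetical sample input, not the asker's file.

```r
# Hypothetical sample lines from a .tex file (backslashes doubled in R strings).
tex <- c("\\caption[Short title]{A longer caption}",
         "\\caption{No optional argument here}",
         "\\section[Toc title]{Section title}")

# Wrap the optional argument of \caption in \textit{...}; other lines stay as they are.
out <- gsub("(\\\\caption\\[)([^][]*)]", "\\1\\\\textit{\\2}]", tex)

cat(out, sep = "\n")
```

Only the first line changes, since the pattern needs both the literal \caption[ prefix and a closing ].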

Extract exact matches from array

Assume I have text and I want to extract exact matches. How can I do this efficiently:
test_text <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk kjh",
"[othername1256b] kjhgfd hgj",
"[v1256] ghjk kjh", "[v1256] kjhgfd hgj",
" text here [name1991] and here",
"[name1990] this is an explanation",
"[name1991] this is another explanation",
"[mäölk1234]")
expected <- c("[v1256a]", "[othername1256b]", "[v1256]", "[v1256]", "[name1991]",
"[name1990]", "[name1991]", "[mäölk1234]")
# This works:
regmatches(test_text, regexpr("\\[.*[0-9]{4}.*\\]", test_text))
But I guess something like "\\[.*[0-9]{4}(?[a-z])]\\]" would be better but it throws an error
Error in regexpr("\[.[0-9]{4}(?[a-z])]\]", text) : invalid
regular expression '[.[0-9]{4}(?[a-z])]]', reason 'Invalid regexp'
Only ONE letter should follow the year, but there can be none, see the example. Sorry, I rarely use regexpr...
Updated question solution
It seems you want to extract all occurrences of 1+ letters followed with 4 digits and then an optional letter inside square brackets.
Use
test_text <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk kjh",
"[othername1256b] kjhgfd hgj",
"[v1256] ghjk kjh", "[v1256] kjhgfd hgj",
" text here [name1991] and here",
"[name1990] this is an explanation",
"[name1991] this is another explanation",
"[mäölk1234]")
regmatches(test_text, regexpr("\\[\\p{L}+[0-9]{4}\\p{L}?]", test_text, perl=TRUE))
# => c("[v1256a]", "[othername1256b]", "[v1256]", "[v1256]", "[name1991]",
# "[name1990]", "[name1991]", "[mäölk1234]")
See the R demo online. NOTE that you need to use a PCRE regex for this to work, perl=TRUE is crucial here.
Details
\[ - a [ char
\p{L}+ - 1+ any Unicode letters
[0-9]{4} - four ASCII digits
\p{L}? - an optional Unicode letter
] - a ] char.
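Keep in mind that regexpr returns only the first match per element and that regmatches drops elements without a match. If a string may contain several bracketed keys, a sketch with gregexpr (sample data made up, ASCII only) could look like this:

```r
x <- c("[]", "[1234]", "see [name1991] and [other2001b]", "[v1256a] text")
pattern <- "\\[\\p{L}+[0-9]{4}\\p{L}?]"

# gregexpr() finds every match; regmatches() then returns one vector per input.
hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

lengths(hits)   # number of matches found in each element
unlist(hits)    # all matches flattened into one character vector
```

The list form keeps track of which input each match came from; unlist gives the flat vector shown in the answer above.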
Original answer
Use
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[[:alpha:]]?]", test_text))
Or
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[a-zA-Z]?]", test_text))
See the regex demo.
Details
\[ - a [ char
[^][]* - 0 or more chars other than [ and ] (HINT: if you only expect letters here replace with [[:alpha:]]* or [a-zA-Z]*)
[0-9]{4} - four digits
[[:alpha:]]? - an optional letter (or [a-zA-Z]? will match any ASCII optional letter)
] - a ] char
R test:
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[[:alpha:]]?]", test_text))
## => [1] "[v1256a]" "[othername1256b]" "[v1256]" "[v1256]" "[name1991]" "[name1990]" "[name1991]"

R regex match things other than known characters

For a text field, I would like to expose those that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [:punct:], space, àéèçè, hyphen, etc.
The list of invalid characters is unknown, yet I want anything unusual to resurface. For example, I would want
This is an 2-piece à-la-carte dessert to pass when
'Ã this Øs an apple' pops up as an anomaly
The 'not contain' notion in R does not behave as I would like, for example
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(those that do not contain 'abc') matches all three, while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we include the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
See code in use here
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=TRUE)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the position within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1
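On the side question about "[^(abc)]": inside square brackets, parentheses are literal characters, so that pattern is a negated character class and not "does not contain the group abc". A minimal sketch of the difference, using the question's own vector (the variable names are introduced here for illustration):

```r
x <- c("abcdef", "defabc", "apple")

# "[^(abc)]" is a negated character class: any single char other than ( a b c ).
# Every string here contains at least one such character, so all three match.
all_match <- grepl("[^(abc)]", x)

# To keep the strings that do NOT contain "abc", negate the match itself.
lacking_abc <- !grepl("abc", x, fixed = TRUE)
kept <- grep("abc", x, fixed = TRUE, invert = TRUE, value = TRUE)

print(all_match)
print(kept)
```

The same reasoning explains the accepted-characters approach above: the class lists what is allowed, and the leading ^ flips it to "anything else".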

remove all characters between string and bracket in R

Say I have a dataframe df in which a column df$strings contains strings like
[cat 00.04;09]
[cat 00.04;10]
and so on. I want to remove all characters between "[cat" and "]" to yield
[cat]
[cat]
I've tried this using gsub but it's not working and I'm not sure what I'm doing wrong:
gsub('cat*?\\]', '', df)
Note that the cat*?\\] pattern matches ca, then any 0+ t chars but as few as possible, and then ].
You want to match any chars other than ] between [cat and ]:
gsub('\\[cat[^]]*\\]', '[cat]', df$strings)
Here,
\\[ - matches [
cat - matches cat
[^]]* - 0+ chars other than ] (note that ] inside the bracket expression should not be escaped when placed at the start - else, if you escape it, you will need to add perl=TRUE argument since PCRE regex engine can handle regex escapes inside bracket expressions (not the default TRE))
\\] - a ] (you do not even need to escape it, you may just use ]).
See the R demo:
x <- c("[cat 00.04;09]", "[cat 00.04;10]")
gsub('\\[cat[^]]*\\]', '[cat]', x)
## => [1] "[cat]" "[cat]"
If cat can be any word, use
gsub('\\[(\\w+)[^]]*\\]', '[\\1]', x)
where (\\w+) is a capturing group with ID=1 that matches 1 or more word chars, and \\1 in the replacement pattern is a replacement backreference that stands for the group value.
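Since the question passes the whole df to gsub, it may help to see the call applied to the column itself. gsub expects a character vector (anything else is coerced with as.character), so the column is passed and the result assigned back. The data frame below is a toy stand-in for the asker's df.

```r
# Toy data frame standing in for the asker's df.
df <- data.frame(strings = c("[cat 00.04;09]", "[dog 12.30;01]", "plain text"),
                 stringsAsFactors = FALSE)

# Pass the column, not the whole data frame, and assign the result back.
df$strings <- gsub("\\[(\\w+)[^]]*\\]", "[\\1]", df$strings)

print(df$strings)
```

Rows without a bracketed word are returned unchanged.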

Regex: Extracting numbers from parentheses with multiple matches

How do I match the year such that it is general for the following examples.
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'
I have tried the following, but did not have the biggest success.
gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
What I thought it did was to go until it finds a (, then it would make a group of numbers, then any character until it meets a ). And if there are several matches, I want to extract the first group.
Any suggestions to where I go wrong? I have been doing this in R.
You could use
library(stringr)
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')
years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
years
# [1] "1953" "1998"
The expression here is
\( # (
(\d+ # capture 1+ digits
(?: B\.C\.)? # optionally followed by " B.C."
)
Note that backslashes need to be escaped in R.
Your pattern contains .+ parts that match 1 or more chars as many as possible, and at best your pattern could grab last 4 digit chunks from the incoming strings.
You may use
^.*?\((\d{4})(?:/[^)]*)?\).*
Replace with \1 to only keep the 4 digit number. See the regex demo.
Details
^ - start of string
.*? - any 0+ chars as few as possible
\( - a (
(\d{4}) - Group 1: four digits
(?: - start of an optional non-capturing group
/ - a /
[^)]* - any 0+ chars other than )
)? - end of the group
\) - a ) (OPTIONAL, MAY BE OMITTED)
.* - the rest of the string.
See the R demo:
a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a)
# => [1] "1953" "1998" "1996"
Another base R solution is to match the 4 digits after (:
regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"
The \(\K\d{4} pattern matches ( and then drops it from the match thanks to the \K match reset operator; the (?=(?:/[^)]*)?\)) lookahead then requires an optional / + 0+ chars other than ), followed by a ). Note that regexpr extracts the first match only.
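One caveat with regmatches and regexpr is that inputs without a match are silently dropped, so the result can be shorter than the input. A possible sketch that keeps the lengths aligned by returning NA for such inputs is shown below; extract_year is a hypothetical helper name, and the second and third titles are made-up ASCII examples.

```r
# Hypothetical helper: first 4-digit year in parentheses, or NA if there is none.
extract_year <- function(x) {
  m <- regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)   # regmatches() skips the non-matching inputs
  out
}

titles <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}',
            "Some Title (1998/I) (TV)",
            "No year here (TV)")
extract_year(titles)
```

This makes the result safe to assign to a data frame column of the same length as the input.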

Resources