grep in R, literal and pattern match - r

I have seen in manuals how to use grep to match either a pattern or an exact string. However, I cannot figure out how to do both at the same time. I have a latex file where I want to find the following pattern:
\caption[SOME WORDS]
and replace it with:
\caption[\textit{SOME WORDS}]
I have tried with:
texfile <- sub('\\caption[','\\caption[\\textit ', texfile, fixed=TRUE)
but I do not know how to tell grep that there should be some text after the square bracket, and then a closed square bracket.

You can use
texfile <- "\\caption[SOME WORDS]" ## input: \caption[SOME WORDS]
texfile <- gsub('(\\\\caption\\[)([^][]*)]', '\\1\\\\textit{\\2}]', texfile)
cat(texfile)
## -> \caption[\textit{SOME WORDS}]
See the R demo online.
Details:
(\\caption\[) - Group 1 (\1 in the replacement pattern): a \caption[ string
([^][]*) - Group 2 (\2 in the replacement pattern): any zero or more chars other than [ and ]
] - a ] char.
Another solution based on a PCRE regex:
gsub('\\Q\\caption[\\E\\K([^][]*)]','\\\\textit{\\1}]', texfile, perl=TRUE)
See this R demo online. Details:
\Q - start "quoting", i.e. treating the patterns to the right as literal text
\caption[ - a literal fixed string
\E - stop quoting the pattern
\K - omit text matched so far
([^][]*) - Group 1 (\1): any zero or more non-bracket chars
] - a ] char.
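As a quick check that the two patterns above behave the same way, here is a minimal sketch run over several lines at once. The sample lines are made up for illustration; they stand in for whatever readLines() would return for a real LaTeX file.

```r
# Hypothetical sample lines standing in for the contents of a LaTeX file
texfile <- c("\\caption[First figure]{A long caption}",
             "No caption on this line",
             "\\caption[Second one]{Another caption}")

# Default TRE engine: capture the literal prefix (group 1) and the
# bracket contents (group 2), then rebuild the string around \textit{}
tre <- gsub('(\\\\caption\\[)([^][]*)]', '\\1\\\\textit{\\2}]', texfile)

# PCRE: quote the literal prefix with \Q...\E and drop it from the match with \K
pcre <- gsub('\\Q\\caption[\\E\\K([^][]*)]', '\\\\textit{\\1}]', texfile,
             perl = TRUE)

identical(tre, pcre)
cat(tre, sep = "\n")
## -> \caption[\textit{First figure}]{A long caption}
## -> No caption on this line
## -> \caption[\textit{Second one}]{Another caption}
```

Lines without a \caption[...] are returned unchanged, so the call can be applied to the whole file vector.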

Related

Extract exact matches from array

Assume I have text and I want to extract exact matches. How can I do this efficiently:
test_text <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk kjh",
"[othername1256b] kjhgfd hgj",
"[v1256] ghjk kjh", "[v1256] kjhgfd hgj",
" text here [name1991] and here",
"[name1990] this is an explanation",
"[name1991] this is another explanation",
"[mäölk1234]")
expected <- c("[v1256a]", "[othername1256b]", "[v1256]", "[v1256]", "[name1991]",
"[name1990]", "[name1991]", "[mäölk1234]")
# This works:
regmatches(test_text, regexpr("\\[.*[0-9]{4}.*\\]", test_text))
But I guess something like "\\[.*[0-9]{4}(?[a-z])]\\]" would be better but it throws an error
Error in regexpr("\[.[0-9]{4}(?[a-z])]\]", text) : invalid
regular expression '[.[0-9]{4}(?[a-z])]]', reason 'Invalid regexp'
Only ONE letter should follow the year, but there can be none, see the example. Sorry, I rarely use regexpr...
Updated question solution
It seems you want to extract all occurrences of 1+ letters followed by 4 digits and then an optional letter inside square brackets.
Use
test_text <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk kjh",
"[othername1256b] kjhgfd hgj",
"[v1256] ghjk kjh", "[v1256] kjhgfd hgj",
" text here [name1991] and here",
"[name1990] this is an explanation",
"[name1991] this is another explanation",
"[mäölk1234]")
regmatches(test_text, regexpr("\\[\\p{L}+[0-9]{4}\\p{L}?]", test_text, perl=TRUE))
# => c("[v1256a]", "[othername1256b]", "[v1256]", "[v1256]", "[name1991]",
# "[name1990]", "[name1991]", "[mäölk1234]")
See the R demo online. NOTE that \p{L} requires a PCRE regex, so perl=TRUE is crucial here.
Details
\[ - a [ char
\p{L}+ - 1+ any Unicode letters
[0-9]{4} - four ASCII digits
\p{L}? - an optional Unicode letter
] - a ] char.
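Note that regexpr only reports the first match in each element, and regmatches then drops elements with no match. If a string may hold several bracketed keys, gregexpr can be used instead. A minimal sketch with made-up input (ASCII only, to keep it independent of the session encoding):

```r
# Hypothetical input: the first string holds two keys, the second holds none
s <- c("see [name1991] and [v1256a] here", "[1234] only digits", "[abc2020]")
pat <- "\\[\\p{L}+[0-9]{4}\\p{L}?]"

# gregexpr finds every match; regmatches returns one character vector per element
all_hits <- regmatches(s, gregexpr(pat, s, perl = TRUE))
all_hits[[1]]
## => [1] "[name1991]" "[v1256a]"
length(all_hits[[2]])
## => [1] 0

# Flatten to a single vector if the per-element grouping is not needed
unlist(all_hits)
## => [1] "[name1991]" "[v1256a]"   "[abc2020]"
```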
Original answer
Use
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[[:alpha:]]?]", test_text))
Or
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[a-zA-Z]?]", test_text))
See the regex demo.
Details
\[ - a [ char
[^][]* - 0 or more chars other than [ and ] (HINT: if you only expect letters here replace with [[:alpha:]]* or [a-zA-Z]*)
[0-9]{4} - four digits
[[:alpha:]]? - an optional letter (or [a-zA-Z]? will match any ASCII optional letter)
] - a ] char
R test:
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[[:alpha:]]?]", test_text))
## => [1] "[v1256a]" "[othername1256b]" "[v1256]" "[v1256]" "[name1991]" "[name1990]" "[name1991]"

Regex to match a pattern but not two specific cases

I want to match every case of "-", but not these ones:
[\d]-[A-Z]
[A-Z]-[\d]
I tried this pattern: ((?<![A-Z])-(?![0-9]))|((?<![0-9])-(?![A-Z])) but some results are incorrect like: "RUA VF-32 N"
Can anyone help me?
A simple approach is to use grep with your current logic and inverting the result, and then run another grep to only keep those items that have a hyphen in them:
x <- c("QUADRA 120 - ASA BRANCA","FAZENDA LAGE -RODOVIA RIO VERDE","C-15","99-B","A-A")
grep("-", grep("[A-Z]-\\d|\\d-[A-Z]", x, invert=TRUE, value=TRUE), value=TRUE, fixed=TRUE)
# => [1] "QUADRA 120 - ASA BRANCA" "FAZENDA LAGE -RODOVIA RIO VERDE"
# [3] "A-A"
Here, [A-Z]-\\d|\\d-[A-Z] matches a hyphen either between an uppercase ASCII letter and a digit, or between a digit and an uppercase ASCII letter. Because of invert=TRUE, the inner grep keeps only the elements that do not contain such a match.
See the R demo.
To only match - in all contexts other than in between a letter and a digit, you may use a PCRE regex based on the SKIP-FAIL technique:
grep("(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-", x, perl=TRUE)
## => [1] 1 2 5
See this regex demo
Details
(?:\d-[A-Z]|[A-Z]-\d) - a non-capturing group that matches either a digit, - and then uppercase ASCII letter, or an uppercase ASCII letter, - and a digit
(*SKIP)(*F) - omit the current match and proceed looking for the next match at the end of the "failed" match
| - or
- - a hyphen.
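Because the SKIP-FAIL pattern matches only the hyphens themselves, it can also be passed to gsub to replace them. A minimal sketch; the input vector and the "_" replacement are chosen purely for illustration:

```r
# Hypothetical input, including the "RUA VF-32 N" case from the question
x <- c("QUADRA 120 - ASA BRANCA", "C-15", "99-B", "A-A", "RUA VF-32 N - CENTRO")

# Letter-digit and digit-letter hyphens are consumed and discarded by
# (*SKIP)(*F); every other hyphen is matched by the last alternative
out <- gsub("(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-", "_", x, perl = TRUE)
out
## => [1] "QUADRA 120 _ ASA BRANCA" "C-15" "99-B" "A_A" "RUA VF-32 N _ CENTRO"
```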

remove all characters between string and bracket in R

Say I have a dataframe df in which a column df$strings contains strings like
[cat 00.04;09]
[cat 00.04;10]
and so on. I want to remove all characters between "[cat" and "]" to yield
[cat]
[cat]
I've tried this using gsub but it's not working and I'm not sure what I'm doing wrong:
gsub('cat*?\\]', '', df)
Note that the cat*?\\] pattern matches ca, then any 0+ t chars (as few as possible), and then ].
You want to match any chars other than ] between [cat and ]:
gsub('\\[cat[^]]*\\]', '[cat]', df$strings)
Here,
\\[ - matches [
cat - matches cat
[^]]* - 0+ chars other than ] (note that a ] placed at the start of a bracket expression must not be escaped with the default TRE engine; if you do escape it, add the perl=TRUE argument, since PCRE, unlike TRE, handles regex escapes inside bracket expressions)
\\] - a ] (you do not even need to escape it, you may just use ]).
See the R demo:
x <- c("[cat 00.04;09]", "[cat 00.04;10]")
gsub('\\[cat[^]]*\\]', '[cat]', x)
## => [1] "[cat]" "[cat]"
If cat can be any word, use
gsub('\\[(\\w+)[^]]*\\]', '[\\1]', x)
where (\\w+) is a capturing group with ID=1 that matches 1 or more word chars, and \\1 in the replacement pattern is a replacement backreference that stands for the group value.
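The attempt in the question also passes the whole data frame to gsub, which coerces it to character and is rarely what is wanted. Here is a minimal sketch that applies the pattern to the column instead. The data frame is invented for illustration and only mirrors the df$strings layout described in the question.

```r
# Hypothetical data frame mirroring the layout from the question
df <- data.frame(strings = c("[cat 00.04;09]", "[dog 12.30;01]", "no brackets"),
                 stringsAsFactors = FALSE)

# Operate on the column, not the data frame, and assign the result back
df$strings <- gsub('\\[(\\w+)[^]]*\\]', '[\\1]', df$strings)
df$strings
## => [1] "[cat]"       "[dog]"       "no brackets"
```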

replace last number in string using regex

I want to replace the last number in a string using regex and gsub
S <- "abcd2efghi2.txt"
The last number and the position of the last number can vary.
So I've tried the regex
?<=[\d+])\b
gsub("?<=[\d+])\b", "", S)
but that doesn't seem to work
Appreciate any help.
You can achieve that with the default TRE engine using the following regex:
\d+(\D*)$
Replace with the \1 backreference.
Details
\d+ - 1 or more digits
(\D*) - Capturing group 1: any 0+ non-digit symbols
$ - end of string
\1 - a backreference to the Group 1 value (so as to restore the text matched and consumed with the (\D*) subpattern).
See the regex demo.
R code demo:
sub("\\d+(\\D*)$", "\\1", S)
## => [1] "abcd2efghi.txt"
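Alternatively, you could use this lookahead-based regex (in base R it needs perl=TRUE, since the default TRE engine does not support lookarounds):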
\d+(?=\D*$)
It matches a sequence of digits when everything that follows consists of non-digits (\D) until the end of the string ($).
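A minimal sketch of the lookahead version in R. The second input string is made up to show that only the last run of digits is affected:

```r
S <- "abcd2efghi2.txt"
# perl = TRUE is needed here because of the (?=...) lookahead
res1 <- sub("\\d+(?=\\D*$)", "", S, perl = TRUE)
res1
## => [1] "abcd2efghi.txt"

# Hypothetical file name with several multi-digit numbers
res2 <- sub("\\d+(?=\\D*$)", "", "run10_part3_v27.log", perl = TRUE)
res2
## => [1] "run10_part3_v.log"
```

Since the lookahead does not consume the trailing text, no backreference is needed in the replacement.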

extract string in R using regex

I have this vector:
jvm<-c("test - PROD_DB_APP_185b#SERVER01" ,"uat - PROD_DB_APP_SYS[1]#SERVER2")
I need to extract text until "[" or if there is no "[", then until the "#" character.
result should be
PROD_DB_APP_185b
PROD_DB_APP_SYS
I've tried something like this:
str_match(jvm, ".*\\-([^\\.]*)([.*)|(#.*)")
but it is not working. Any ideas?
A sub solution with base R:
jvm<-c("test - PROD_DB_APP_185b#SERVER01" ,"uat - PROD_DB_APP_SYS[1]#SERVER2")
sub("^.*?\\s+-\\s+([^#[]+).*", "\\1", jvm)
See the online R demo
Details:
^ - start of string
.*? - any 0+ chars as few as possible
\\s+-\\s+ - a hyphen enclosed with 1 or more whitespaces
([^#[]+) - capturing group 1 matching any 1 or more chars other than # and [
.* - any 0+ chars, up to the end of string.
Or a stringr solution with str_extract:
str_extract(jvm, "(?<=-\\s)[^#\\[]+")
See the regex demo
Details:
(?<=-\\s) - a positive lookbehind that matches an empty string that is preceded with a - and a whitespace immediately to the left of the current location
[^#\\[]+ - 1 or more chars other than # and [.
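If adding a stringr dependency is not desirable, the same lookbehind pattern should also work in base R via regmatches and regexpr with perl=TRUE. This is a sketch of an alternative, not part of the answer above:

```r
jvm <- c("test - PROD_DB_APP_185b#SERVER01", "uat - PROD_DB_APP_SYS[1]#SERVER2")

# perl = TRUE enables the (?<=...) lookbehind; regexpr returns the first match
res <- regmatches(jvm, regexpr("(?<=-\\s)[^#\\[]+", jvm, perl = TRUE))
res
## => [1] "PROD_DB_APP_185b" "PROD_DB_APP_SYS"
```

Unlike str_extract, which returns NA for elements without a match, regmatches silently drops them, so the result can be shorter than the input.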
