Searching and replacing characters with classes in R


I am trying to replace text in R. I want to find spaces between letters and numbers only and delete them, but when I search using [:alpha:] and [:alnum:], the replacement inserts the class names literally instead of the characters that were matched.
> string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"
> str_replace_all(string,
+ "[:alpha:] & [:alnum:]",
+ "[:alpha:]&[:alnum:]")
[1] "WORD = 500 * WORD + ((WOR[:alpha:]&[:alnum:]00) - (WOR[:alpha:]&[:alnum:]00))"
How can I use the function so that it returns the following?
[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"

str_replace_all(string, "([:alpha:]) & ([:alnum:])", "\\1&\\2")
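
The replacement argument is plain text apart from backreferences, so a class name written there is inserted literally; capture groups are what carry the matched characters over. As a minimal sketch of the same idea in base R (an assumption here is that gsub is an acceptable substitute for str_replace_all; POSIX classes must then sit inside a bracket expression):

```r
string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"

# Group 1 captures the letter before " & ", group 2 the letter or digit after it.
# The replacement puts both back with the spaces dropped.
out <- gsub("([[:alpha:]]) & ([[:alnum:]])", "\\1&\\2", string)
print(out)
# [1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
```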

Your requirement is easy enough to handle using gsub with lookarounds:
string <- "WORD = 500 * WORD + ((WORD & 400) - (WORD & 300))"
output <- gsub("(?<=\\w) & (?=\\w)", "&", string, perl=TRUE)
output
[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
Here is a brief explanation of the regex:
(?<=\\w) assert that what precedes is a word character
[ ]&[ ] then match a space, followed by `&`, followed by another space
(?=\\w) assert that what follows is also a word character
Then, we replace with just a single &, with no spaces on either side.
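Note that \\w accepts letters, digits and underscores on both sides. If the spaces should go only when a letter precedes and a digit follows, the lookarounds can be narrowed. A small sketch, where the sample string x is made up for illustration:

```r
# Hypothetical input: one " & " between two letters, one between a letter and a digit.
x <- "A & B and C & 4"

# Lookbehind requires a letter, lookahead requires a digit.
out <- gsub("(?<=[[:alpha:]]) & (?=[[:digit:]])", "&", x, perl = TRUE)
print(out)
# [1] "A & B and C&4"
```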

Here is one option that uses regex lookarounds to match one or more spaces (\\s+) either following or preceding a & and replaces them with an empty string ("")
gsub("(?<=&)\\s+|\\s+(?=&)", "", string, perl = TRUE)
#[1] "WORD = 500 * WORD + ((WORD&400) - (WORD&300))"
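
The two lookaround answers differ in scope, which the following sketch tries to make visible. The input x is invented for the comparison; the patterns are the ones from the answers above:

```r
# Hypothetical input: the first & sits between parentheses, the second between word characters.
x <- "(a) & (b) and WORD & 400"

# Only removes the spaces when word characters sit on both sides.
word_only <- gsub("(?<=\\w) & (?=\\w)", "&", x, perl = TRUE)

# Removes whitespace next to every &, whatever surrounds it.
any_amp <- gsub("(?<=&)\\s+|\\s+(?=&)", "", x, perl = TRUE)

print(word_only)
# [1] "(a) & (b) and WORD&400"
print(any_amp)
# [1] "(a)&(b) and WORD&400"
```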

Related

grep in R, literal and pattern match

I have seen in manuals how to use grep to match either a pattern or an exact string. However, I cannot figure out how to do both at the same time. I have a latex file where I want to find the following pattern:
\caption[SOME WORDS]
and replace it with:
\caption[\textit{SOME WORDS}]
I have tried with:
texfile <- sub('\\caption[','\\caption[\\textit ', texfile, fixed=TRUE)
but I do not know how to tell grep that there should be some text after the square bracket, and then a closed square bracket.

You can use
texfile <- "\\caption[SOME WORDS]" ## -> \caption[\textit{SOME WORDS}]
texfile <- gsub('(\\\\caption\\[)([^][]*)]','\\1\\\\textit{\\2}]', texfile)
cat(texfile)
## -> \caption[\textit{SOME WORDS}]
Details:
(\\caption\[) - Group 1 (\1 in the replacement pattern): a \caption[ string
([^][]*) - Group 2 (\2 in the replacement pattern): any zero or more chars other than [ and ]
] - a ] char.
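Because gsub is vectorised and replaces every match, the same pattern can be applied to a whole file read in as lines. A sketch under the assumption that the file content looks like the made-up tex_lines vector below:

```r
# Hypothetical lines of a LaTeX file (backslashes doubled because these are R strings).
tex_lines <- c("\\caption[First figure]{Long caption}",
               "no caption here",
               "\\caption[A] and \\caption[B]")

# Group 1 keeps "\caption[", group 2 is the bracket content that gets wrapped.
out <- gsub("(\\\\caption\\[)([^][]*)\\]", "\\1\\\\textit{\\2}]", tex_lines)
cat(out, sep = "\n")
# \caption[\textit{First figure}]{Long caption}
# no caption here
# \caption[\textit{A}] and \caption[\textit{B}]
```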
Another solution based on a PCRE regex:
gsub('\\Q\\caption[\\E\\K([^][]*)]','\\\\textit{\\1}]', texfile, perl=TRUE)
Details:
\Q - start "quoting", i.e. treating the patterns to the right as literal text
\caption[ - a literal fixed string
\E - stop quoting the pattern
\K - omit text matched so far
([^][]*) - Group 1 (\1): any zero or more non-bracket chars
] - a ] char.
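The effect of \K is easiest to see on a toy input, since everything matched before it is checked but left out of the text that gets replaced. A minimal sketch with an invented string:

```r
# Hypothetical input: "bar" should be replaced only when it directly follows "foo".
x <- "foobar foobaz"

# "foo" must match, but \K drops it from the reported match, so only "bar" is replaced.
out <- gsub("foo\\Kbar", "X", x, perl = TRUE)
print(out)
# [1] "fooX foobaz"
```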

Regex match after last / and first underscore

Assuming I have the following string:
string = "path/stack/over_flow/Pedro_account"
I am interested in matching the first 2 characters after the last / and before the first _. So in this case the desired output is:
Pe
What I have so far is a mix of substr and str_extract:
substr(str_extract(string, "[^/]*$"),1,2)
which of course gives an answer, but I believe there is a nice regex for it as well, and that is what I'm looking for.

You can use
library(stringr)
str_extract(string, "(?<=/)[^/]{2}(?=[^/]*$)")
## => [1] "Pe"
Details:
(?<=/) - a location immediately preceded with a / char
[^/]{2} - two chars other than /
(?=[^/]*$) - a location immediately followed by zero or more chars other than / up to the end of the string.
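The same pattern should also work without stringr. A sketch using base R regmatches and regexpr with perl = TRUE, where the second path in paths is made up to show a last component with no underscore:

```r
# First element is the string from the question; the second is a hypothetical extra case.
paths <- c("path/stack/over_flow/Pedro_account", "x/Yz")

# The lookahead rejects every / that is followed by another / later on,
# so only the two characters after the last / can match.
m <- regmatches(paths, regexpr("(?<=/)[^/]{2}(?=[^/]*$)", paths, perl = TRUE))
print(m)
# [1] "Pe" "Yz"
```

Elements without a match are dropped by regmatches, so the result can be shorter than the input.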

Using basename to get the last folder name, then substring:
substr(basename("path/stack/over_flow/Pedro_account"), 1, 2)
# [1] "Pe"

Remove everything till last / and extract first 2 characters.
Base R -
string = "path/stack/over_flow/Pedro_account"
substr(sub('.*/', '', string), 1, 2)
#[1] "Pe"
stringr
substr(stringr::str_remove(string, '.*/'), 1, 2)

You can use str_match with a capture group:
/ Match literally
([^/_]{2}) Capture 2 chars other than / or _ in group 1
[^/]* Match optional chars other than /
$ End of string
Example
library(stringr)
string = "path/stack/over_flow/Pedro_account"
str_match(string, "/([^/_]{2})[^/]*$")[,2]
Output
[1] "Pe"
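
For comparison, a base R sketch of the capture-group approach, assuming regexec is an acceptable stand-in for str_match (it returns the full match followed by the groups):

```r
string <- "path/stack/over_flow/Pedro_account"

# regexec reports the whole match and each capture group;
# regmatches turns that into a character vector per input element.
m <- regmatches(string, regexec("/([^/_]{2})[^/]*$", string))

# Element 1 is the full match, element 2 is capture group 1.
first_two <- m[[1]][2]
print(first_two)
# [1] "Pe"
```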

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underscores except those between words. In the end, the code should remove underscores at the beginning or the end of a word.
The result should be
'hello_world and hello_world'.
I want to use the pre-built POSIX classes. Right now I have learned how to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)

You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
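To see all three alternatives at work, the pattern can be tried on an input that also contains other punctuation. A sketch with a made-up test2 string:

```r
# Hypothetical input: a comma and an exclamation mark added to the question's example.
test2 <- "hello_world, and _hello_world_!"

# Alternative 1 removes "," and "!", alternative 2 the trailing "_",
# alternative 3 the leading "_". The "_" inside the words has no boundary next to it
# because "_" itself counts as a word character.
out <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test2, perl = TRUE)
print(out)
# [1] "hello_world and hello_world"
```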

One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"

We can remove every underscore that has a word boundary on either side. We use positive lookbehind and lookahead to find such underscores. To remove underscores at the start and end of the string, we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"

You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input
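The double negation [^\\W] is just another way of writing \\w, so the lookarounds can presumably be shortened. A quick sketch checking that both spellings give the same result on the question's input:

```r
test <- "hello_world and _hello_world_"

# Pattern from the answer above.
a <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl = TRUE)

# Same assertions written with \w directly.
b <- gsub("(?<!\\w)_|_(?!\\w)", "", test, perl = TRUE)

print(b)
# [1] "hello_world and hello_world"
print(identical(a, b))
# [1] TRUE
```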

Extract exact matches from array

Assume I have text and I want to extract exact matches. How can I do this efficiently:
test_text <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk kjh",
"[othername1256b] kjhgfd hgj",
"[v1256] ghjk kjh", "[v1256] kjhgfd hgj",
" text here [name1991] and here",
"[name1990] this is an explanation",
"[name1991] this is another explanation",
"[mäölk1234]")
expected <- c("[v1256a]", "[othername1256b]", "[v1256]", "[v1256]", "[name1991]",
"[name1990]", "[name1991]", "[mäölk1234]")
# This works:
regmatches(text, regexpr("\\[.*[0-9]{4}.*\\]", text))
But I guess something like "\\[.*[0-9]{4}(?[a-z])]\\]" would be better but it throws an error
Error in regexpr("\[.[0-9]{4}(?[a-z])]\]", text) : invalid
regular expression '[.[0-9]{4}(?[a-z])]]', reason 'Invalid regexp'
Only ONE letter should follow the year, but there can be none, see example. Sorry, I rarely use regexpr...

Updated question solution
It seems you want to extract all occurrences of 1+ letters followed with 4 digits and then an optional letter inside square brackets.
Use
test_text <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk kjh",
"[othername1256b] kjhgfd hgj",
"[v1256] ghjk kjh", "[v1256] kjhgfd hgj",
" text here [name1991] and here",
"[name1990] this is an explanation",
"[name1991] this is another explanation",
"[mäölk1234]")
regmatches(test_text, regexpr("\\[\\p{L}+[0-9]{4}\\p{L}?]", test_text, perl=TRUE))
# => c("[v1256a]", "[othername1256b]", "[v1256]", "[v1256]", "[name1991]",
# "[name1990]", "[name1991]", "[mäölk1234]")
Note that you need to use a PCRE regex for this to work; perl=TRUE is crucial here.
Details
\[ - a [ char
\p{L}+ - 1+ any Unicode letters
[0-9]{4} - four ASCII digits
\p{L}? - an optional Unicode letter
] - a ] char.
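To check which inputs the pattern accepts or rejects, grepl can be run alongside regmatches. A sketch on a reduced, ASCII-only vector x; the last element is a made-up case with two trailing letters:

```r
# Subset of the question's data plus one hypothetical entry with two letters after the year.
x <- c("[]", "[1234]", "[1234a]", "[v1256a] ghjk", "[name1991] text", "[v1256ab]")
pattern <- "\\[\\p{L}+[0-9]{4}\\p{L}?]"

# TRUE where a bracketed key with leading letters, four digits and at most one letter exists.
hits <- grepl(pattern, x, perl = TRUE)
print(hits)
# [1] FALSE FALSE FALSE  TRUE  TRUE FALSE

# regmatches keeps only the matching parts of the elements that matched.
keys <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(keys)
# [1] "[v1256a]"   "[name1991]"
```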
Original answer
Use
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[[:alpha:]]?]", test_text))
Or
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[a-zA-Z]?]", test_text))
Details
\[ - a [ char
[^][]* - 0 or more chars other than [ and ] (HINT: if you only expect letters here replace with [[:alpha:]]* or [a-zA-Z]*)
[0-9]{4} - four digits
[[:alpha:]]? - an optional letter (or [a-zA-Z]? will match any ASCII optional letter)
] - a ] char
R test:
regmatches(test_text, regexpr("\\[[^][]*[0-9]{4}[[:alpha:]]?]", test_text))
## => [1] "[v1256a]" "[othername1256b]" "[v1256]" "[v1256]" "[name1991]" "[name1990]" "[name1991]"

Finding a word with condition in a vector with regex on R (perl)

I would like to find the rows in a vector with the word 'RT' in it or 'R' but not if the word 'RT' is preceded by 'no'.
The word RT may be preceded by nothing, a space, a dot, etc.
With the regex, I tried :
grep("(?<=[no] )RT", aaa,ignore.case = FALSE, perl = T)
Which was giving me all the rows with "no RT".
and
grep("(?=[^no].*)RT",aaa , perl = T)
which was giving me all the rows containing 'RT' with and without 'no' at the beginning.
What is my mistake? I thought the ^ was giving everything but the character that follows it.
Example :
aaa = c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+","npo RT" )

(?<=[no] )RT matches any RT that is immediately preceded with "n " or "o ".
You should use a negative lookbehind,
"(?<!no )RT"
Or, if you need to check for a whole word no,
"(?<!\\bno )RT"
Here, (?<!no ) makes sure there is no "no " immediately to the left of the current location, and only then is RT consumed.
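The difference between the two lookbehinds only shows up when "no" is the tail of a longer word. A sketch with a small made-up vector x ("piano RT" is not in the question's data):

```r
# Hypothetical inputs: a real "no RT", a word that merely ends in "no", and a plain RT.
x <- c("no RT", "piano RT", "RT alone")

# Rejects any RT preceded by the three characters "no ".
plain <- grepl("(?<!no )RT", x, perl = TRUE)
print(plain)
# [1] FALSE FALSE  TRUE

# Rejects RT only when "no" is a whole word, because \b must match before the "n".
whole_word <- grepl("(?<!\\bno )RT", x, perl = TRUE)
print(whole_word)
# [1] FALSE  TRUE  TRUE
```

Passing value = TRUE to grep instead of using grepl would return the matching elements themselves.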
