Remove all punctuation except underline between characters in R with POSIX character class - r

I would like to use R to remove all underlines expect those between words. At the end the code removes underlines at the end or at the beginning of a word.
The result should be
'hello_world and hello_world'.
I want to use those pre-built classes. Right know I have learn to expect particular characters with following code but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)

You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
See the regex demo
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word

One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"

We can remove all the underlying which has a word boundary on either of the end. We use positive lookahead and lookbehind regex to find such underlyings. To remove underlying at the start and end we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"

You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input

Related

grep in R, literal and pattern match

I have seen in manuals how to use grep to match either a pattern or an exact string. However, I cannot figure out how to do both at the same time. I have a latex file where I want to find the following pattern:
\caption[SOME WORDS]
and replace it with:
\caption[\textit{SOME WORDS}]
I have tried with:
texfile <- sub('\\caption[','\\caption[\\textit ', texfile, fixed=TRUE)
but I do not know how to tell grep that there should be some text after the square bracket, and then a closed square bracket.
You can use
texfile <- "\\caption[SOME WORDS]" ## -> \caption[\textit{SOME WORDS}]
texfile <-gsub('(\\\\caption\\[)([^][]*)]','\\1\\\\textit{\\2}]', texfile)
cat(texfile)
## -> \caption[\textit{SOME WORDS}]
See the R demo online.
Details:
(\\caption\[) - Group 1 (\1 in the replacement pattern): a \caption[ string
([^][]*) - Group 2 (\2 in the replacement pattern): any zero or more chars other than [ and ]
] - a ] char.
Another solution based on a PCRE regex:
gsub('\\Q\\caption[\\E\\K([^][]*)]','\\\\textit{\\1}]', texfile, perl=TRUE)
See this R demo online. Details:
\Q - start "quoting", i.e. treating the patterns to the right as literal text
\caption[ - a literal fixed string
\E - stop quoting the pattern
\K - omit text matched so far
([^][]*) - Group 1 (\1): any zero or more non-bracket chars
] - a ] char.

Regex match after last / and first underscore

Assuming I have the following string:
string = "path/stack/over_flow/Pedro_account"
I am intrested in matching the first 2 characters after the last / and before the first _. So in this case the desired out put is:
Pe
What I have so far is a mix of substr and str_extract:
substr(str_extract(string, "[^/]*$"),1,2)
which of course will give an answer but I belive there is a nice regex for it as well, and that is what I'm looking for.
You can use
library(stringr)
str_extract(string, "(?<=/)[^/]{2}(?=[^/]*$)")
## => [1] "Pe"
See the R demo and the regex demo. Details:
(?<=/) - a location immediately preceded with a / char
[^/]{2} - two chars other than /
(?=[^/]*$) - a location immediately preceded with zero or more chars other than / till the end of string.
Using basename to get the last folder name, then substring:
substr(basename("path/stack/over_flow/Pedro_account"), 1, 2)
# [1] "Pe"
Remove everything till last / and extract first 2 characters.
Base R -
string = "path/stack/over_flow/Pedro_account"
substr(sub('.*/', '', string), 1, 2)
#[1] "Pe"
stringr
substr(stringr::str_remove(string, '.*/'), 1, 2)
You can use str_match with a capture group:
/ Match literally
([^/_]{2}) Capture 2 chars other than / or _ in group 1
[^/]* Match optional chars other than /
$ End of string
See a regex demo and a R demo.
Example
library(stringr)
string = "path/stack/over_flow/Pedro_account"
str_match(string, "/([^/_]{2})[^/]*$")[,2]
Output
[1] "Pe"

remove all characters between string and bracket in R

Say I have a dataframe df in which a column df$strings contains strings like
[cat 00.04;09]
[cat 00.04;10]
and so on. I want to remove all characters between "[cat" and "]" to yield
[cat]
[cat]
I've tried this using gsub but it's not working and I'm not sure what I'm doing wrong:
gsub('cat*?\\]', '', df)
Note that cat*?\\] patten matches ca, then any 0+ t chars but as few as possible and then ].
You want to match any chars other than ] between [cat and ]:
gsub('\\[cat[^]]*\\]', '[cat]', df$strings)
Here,
\\[ - matches [
cat - matches cat
[^]]* - 0+ chars other than ] (note that ] inside the bracket expression should not be escaped when placed at the start - else, if you escape it, you will need to add perl=TRUE argument since PCRE regex engine can handle regex escapes inside bracket expressions (not the default TRE))
\\] - a ] (you do not even need to escape it, you may just use ]).
See the R demo:
x <- c("[cat 00.04;09]", "[cat 00.04;10]")
gsub('\\[cat[^]]*\\]', '[cat]', x)
## => [1] "[cat]" "[cat]"
If cat can be any word, use
gsub('\\[(\\w+)[^]]*\\]', '[\\1]', x)
where (\\w+) is a capturing group with ID=1 that matches 1 or more word chars, and \\1 in the replacement pattern is a replacement backreference that stands for the group value.

How to completely remove head and tail white spaces or punctuation characters?

I have string_a, such that
string_a <- " ,A thing, something, . ."
Using regex, how can I just retain "A thing, something"?
I have tried the following and got such output:
sub("[[:punct:]]$|^[[:punct:]]","", trimws(string_a))
[1] "A thing, something, . ."
We can use gsub to match one or more punctuation characters including spaces ([[:punct:] ] +) from the start (^) or | those characters until the end ($) of the string and replace it with blank ("")
gsub("^[[:punct:] ]+|[[:punct:] ]+$", "", string_a)
#[1] "A thing, something"
Note: sub will replace only a single instance
Or as #Cath mentioned [[:punct:] ] can be replaced with \\W

grep formatted number using r

I have a string format that I would like to select from a character vector. The form is
123 123 1234
where the two spaces can also be a hyphen. i.e. 3 digits followed by space or hyphen, followed by 3 digits, followed by space or hyphen, followed by 4 digits
I am trying to do this by the following:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4}$)",mytext)
however this yields:
integer(0)
What am I doing wrong?
Your string has a whitespace at the end, so you can either consider that white space, like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4} $)",mytext)
Or remove the end of line assertion "$", like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4})",mytext)
Also, as pointed out by Wiktor Stribiżew, the character class [ -.] will match any character in the range between " " and ".". To match "-","." and " " you have to escape the "-" or put it at the end of the class. Like [ \-.] or [ .-]

Resources