How do I replace all the punctuation in a string with '\\W'? - r

string = 'Hello, how are you?'
What I want to achieve:
Hello\\W how are you\\W
What I've done: Substituting all characters that are not alphanumeric with '\\W'
gsub('(\\W)+[^\\S]+','\\\\W',string,perl=TRUE)
[1] "Hello\\Whow are you?"
I'm not sure why the question mark at the end of the sentence wasn't substituted with '\\W', or why the first space was substituted. Could anyone help me out with this? Thank you!

We can do
gsub("[,?]", "\\\\W", string)
#[1] "Hello\\W how are you\\W"
If there are other punctuation characters to replace, use the [[:punct:]] character class
gsub("[[:punct:]]", "\\\\W", string)
#[1] "Hello\\W how are you\\W"
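To see why the attempt in the question behaves the way it does, here is a small sketch that runs both patterns side by side. The variable names attempt and fixed are only illustrative. The original pattern needs one or more non-word characters followed by at least one whitespace character ([^\\S] is just another way of writing whitespace), so ", " matches as a unit while the final "?" has nothing after it and is left alone.

```r
string <- "Hello, how are you?"

# Pattern from the question: non-word character(s) followed by whitespace.
# ", " matches as a whole (so the space is consumed too); "?" at the end
# has no whitespace after it, so it never matches.
attempt <- gsub("(\\W)+[^\\S]+", "\\\\W", string, perl = TRUE)

# Replace each punctuation character on its own, leaving spaces untouched.
fixed <- gsub("[[:punct:]]", "\\\\W", string)

print(attempt)
print(fixed)
```

Note that print() shows the single backslash escaped as "\\W"; cat(fixed) would display it as \W.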

Related

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underscores except those between words. In the end, the code should also remove underscores at the end or at the beginning of a word.
The result should be
'hello_world and hello_world'.
I want to use those pre-built classes. Right now I have learned how to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
One non-regex way is to split on spaces and use trimws with the whitespace argument set to _ (this argument requires R 3.6.0 or later), i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
We can remove every underscore that has a word boundary on either side, using lookbehind and lookahead assertions to find such underscores. To remove the underscores at the start and end of the string, we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input
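As a rough comparison of the approaches above, the following sketch runs them on the same input. The names r1, r2, r3 and the second test string are made up for illustration. It also shows one difference worth knowing: the first pattern strips other punctuation as well, while the lookaround pattern only touches underscores.

```r
test <- "hello_world and _hello_world_"

# Approach 1: drop punctuation other than "_", plus "_" at word edges.
r1 <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl = TRUE)

# Approach 2: drop "_" that is not preceded, or not followed, by a word character.
r2 <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl = TRUE)

# Approach 3 (no regex of our own): split on spaces and trim "_" from each piece.
# trimws() is vectorized, so sapply() is not strictly needed.
r3 <- paste(trimws(strsplit(test, " ")[[1]], whitespace = "_"), collapse = " ")

print(c(r1, r2, r3))

# Hypothetical input with extra punctuation, to show where the patterns differ.
test2 <- "hello_world, and _hi_!"
p1 <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test2, perl = TRUE)
p2 <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test2, perl = TRUE)
print(c(p1, p2))
```

Which one fits depends on whether the other punctuation should survive.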

Insert characters when a string changes its case R

I would like to insert characters at the places where a string changes its case. I tried this to insert a '\n' after a fixed number of characters and then a ' ', as I can't figure out how to detect the case change
s <-c("FloridaIslandE7", "FloridaIslandE9", "Meta")
gsub('^(.{7})(.{6})(.*)$', '\\1\\\n\\2 \\3', s )
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
This works because the positions are fixed but I would like to know how to do it for the general case.
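Surely there's a less convoluted regex for this, but you could try the following. The inner call inserts \n only before a capital that starts a new word (a capital followed by a lowercase letter); the outer call puts a space before a capital followed by a digit.
gsub('([A-Z][0-9])', ' \\1', gsub('([a-z])([A-Z][a-z])', '\\1\n\\2', s))
Output:
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"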
Here is an option using stringr
library(stringr)
str_replace_all(s, "(?<=[a-z])(?=[A-Z])", "\n")
#[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
If you really want to insert \n, try this:
gsub("([a-z])([A-Z])", "\\1\\\n\\2", s)
[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
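For the general case, a minimal base R sketch along the same lines is shown below. It assumes that "case change" means a lowercase letter immediately followed by an uppercase one; the variable out is just for illustration. A plain "\n" in the replacement string is enough, and writeLines() shows that it is a real line break and not the two characters \ and n.

```r
s <- c("FloridaIslandE7", "FloridaIslandE9", "Meta")

# Capture the lowercase letter and the uppercase letter that follows it,
# then put a newline between the two captured groups.
out <- gsub("([a-z])([A-Z])", "\\1\n\\2", s)

print(out)          # shows the escaped form, e.g. "Florida\nIsland\nE7"
writeLines(out[1])  # shows the actual line breaks
```

"Meta" is returned unchanged because it contains no lowercase-to-uppercase transition.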

Regex find the string between last two quotes " "?

For example, this is my string -> abcd 1234abcda="author 1" content="author 2.">\n
I only want to extract the string author 2. using the function str_extract() in R. How can I use regex to do that? Thank you so much.
You can use :
string = 'abcd 1234abcda="author 1" content="author 2.">\n'
sub('.*"(.*)".*', '\\1', string)
#[1] "author 2."
With str_match
library(stringr)
str_match(string, '.*"(.*)"')[, 2]
Another option is to extract all the values with "author" followed by a number and select the last one using tail. Note that this returns "author 2" without the trailing period.
tail(str_extract_all(string, 'author \\d+')[[1]], 1)
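If the goal is to extract the match directly (as str_extract would) instead of rewriting the whole string, one possible sketch uses lookarounds: start right after a quote, take everything up to the next quote, and require that no further quote follows before the end of the string. The example below uses base R's regexpr and regmatches with perl = TRUE so that it has no package dependency; the same pattern should also be usable with stringr::str_extract, though that is an assumption not tested here. The names pattern, m and last_quoted are illustrative.

```r
string <- 'abcd 1234abcda="author 1" content="author 2.">\n'

# (?<=")        the match starts right after a double quote
# [^"]*         any run of characters that are not double quotes
# (?="[^"]*$)   followed by a quote, with no other quote before the end
pattern <- '(?<=")[^"]*(?="[^"]*$)'

m <- regexpr(pattern, string, perl = TRUE)
last_quoted <- regmatches(string, m)

print(last_quoted)
```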

replace last number in string using regex

I want to replace the last number in a string using regex and gsub
S <- "abcd2efghi2.txt"
The last number and the position of the last number can vary.
So I've tried the regex
?<=[\d+])\b
gsub("?<=[\d+])\b", "", S)
but that doesn't seem to work
Appreciate any help.
You can achieve that with the default TRE engine using the following regex:
\d+(\D*)$
Replace with the \1 backreference.
Details
\d+ - 1 or more digits
(\D*) - Capturing group 1: any 0+ non-digit symbols
$ - end of string
\1 - a backreference to the Group 1 value (so as to restore the text matched and consumed with the (\D*) subpattern).
R code demo:
sub("\\d+(\\D*)$", "\\1", S)
## => [1] "abcd2efghi.txt"
You could use this regex:
\d+(?=\D*$)
It matches a sequence of digits when everything that follows consists of non-digits (\D) until the end of the string ($). Because it uses a lookahead, in R it must be run with perl=TRUE.
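A short sketch of how the lookahead version might be used from R follows. Lookaheads are not supported by the default TRE engine, hence perl = TRUE. The replacement value "10" and the variable names are only for illustration.

```r
S <- "abcd2efghi2.txt"

# Digits that are followed only by non-digits up to the end of the string,
# i.e. the last number in the string.
pattern <- "\\d+(?=\\D*$)"

removed  <- sub(pattern, "", S, perl = TRUE)    # drop the last number
replaced <- sub(pattern, "10", S, perl = TRUE)  # or swap in another value

print(removed)
print(replaced)
```

Since the lookahead does not consume the trailing text, no backreference is needed in the replacement.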

Extracting until the last character in a string

Consider this data:
str <- c("OTB_MCD_100-119_0_0", "PS_SPF_16-31_0_0", "PP_DR/>16-77")
How to make it into a string like this?
str
[1] "OTB_MCD" "PS_SPF" "PP_DR"
I tried substr, but it doesn't work when the characters are of different length.
We can use sub to match zero or more _ followed by zero or more characters that are not letters ([^A-Za-z]*) up to the end ($) of the string, and replace the match with an empty string ("")
sub("_*[^A-Za-z]*$", "", str)
#[1] "OTB_MCD" "PS_SPF" "PP_DR"
