How to extract a substring by inverse pattern with R?

I am trying to extract a substring by pattern using R's gsub() function.
# Example: extracting "7 years" substring.
string <- "Psychologist - 7 years on the website, online"
gsub(pattern="[0-9]+\\s+\\w+", replacement="", string)
# [1] "Psychologist -  on the website, online"
As you can see, it's easy to exclude the needed substring using gsub(), but I need to invert the result and get only "7 years".
I thought about using "^", something like this:
gsub(pattern="[^[0-9]+\\s+\\w+]", replacement="", string)
Could anyone please help me with the correct regex pattern?

You may use
sub(pattern=".*?([0-9]+\\s+\\w+).*", replacement="\\1", string)
See this R demo.
Details
.*? - any 0+ chars, as few as possible
([0-9]+\\s+\\w+) - Capturing group 1:
[0-9]+ - one or more digits
\\s+ - 1 or more whitespaces
\\w+ - 1 or more word chars
.* - the rest of the string (any 0+ chars, as many as possible)
The \1 in the replacement refers to the contents of Group 1, so the whole matched string is replaced with just the captured substring.
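As a complement to the sub() approach above, here is a minimal sketch that extracts the match directly with base R's regexpr() and regmatches() instead of replacing everything around it. This is a different technique from the one in the answer, shown only for comparison; the no_digits string is a made-up input used to illustrate what happens when nothing matches.

```r
string <- "Psychologist - 7 years on the website, online"

# Approach from the answer: replace the whole string with Group 1
via_sub <- sub(".*?([0-9]+\\s+\\w+).*", "\\1", string)

# Alternative: locate the first match and extract it directly
via_regmatches <- regmatches(string, regexpr("[0-9]+\\s+\\w+", string))

# Hypothetical input without any digits, to compare no-match behaviour
no_digits <- "Psychologist, online"
unchanged <- sub(".*?([0-9]+\\s+\\w+).*", "\\1", no_digits)          # input comes back as-is
empty <- regmatches(no_digits, regexpr("[0-9]+\\s+\\w+", no_digits)) # zero-length result
```

Which behaviour is preferable depends on whether an unmatched input should pass through untouched or be dropped.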

You could use the opposite of \d, which is \D in R:
string <- "Psychologist - 7 years on the website, online"
sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1", string)
# [1] "7 years"
\D* matches as many non-digit characters as possible; the digits, the whitespace and the following word are then captured in Group 1, which replaces the complete match.
See a demo on regex101.com.
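The lazy .*? pattern and the \D* pattern are not interchangeable on every input. The following sketch uses a made-up string (an assumption, not from the question) in which an earlier number is not followed by whitespace and a word. perl = TRUE is set here only so that both calls run on the same regex engine.

```r
# Hypothetical input: "101" is followed by a comma, so it cannot start the captured group
s <- "Room 101, 7 years online"

lazy  <- sub(".*?([0-9]+\\s+\\w+).*", "\\1", s, perl = TRUE)
nodig <- sub("\\D*(\\d+\\s+\\w+).*", "\\1", s, perl = TRUE)

lazy   # "7 years"
nodig  # "Room 1017 years"
```

Because \D* cannot cross the digits in "101", the second pattern can only start matching at the comma, so "Room 101" is left in place in front of the captured text. For strings like the one in the question, where the wanted number is the first run of digits, both patterns give the same result.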

Related

A simple tidy up needed in regex

I have had a good morning of trying to tidy this up but cannot find a more elegant solution.
I have the following as a value:
TEST <- "Pia1.2016-10-08.1103+N2353.tif"
From this I need to extract the date and the time. I have the following, which works, but I am sure there is a better way to do it:
DATEDIR <- sub("[P][i][a][1]\\.","",TEST)
DATEDIR <- sub("\\...............","", DATEDIR)
DATEDIR # to check
I have not got round to extracting the time bit yet, as I thought I would clear this up first, although I would like the time variable to be called
TIMEDIR <-
Many thanks!
You may use
TEST <- 'Pia1.2016-10-08.1103+N2353_hc.tif'
date <- sub('.*?\\.(\\d{4}-\\d{2}-\\d{2})\\..*', '\\1', TEST)
date
# => [1] "2016-10-08"
time <- sub('.*?\\.\\d{4}-\\d{2}-\\d{2}\\.(\\d{2})(\\d{2}).*', '\\1:\\2', TEST)
time
# => [1] "11:03"
See the R demo online. See the Regex 1 and Regex 2.
The first pattern matches
.*? - any 0+ chars, as few as possible
\\. - a dot
(\\d{4}-\\d{2}-\\d{2}) - Capturing group 1 (referred to with \1 from the replacement pattern): 4 digits, -, 2 digits, - and 2 digits
\\. - a .
.* - any 0+ chars, as many as possible.
The second pattern matches the same prefix, then captures the next two digits into Group 1 and the two digits after that into Group 2; the \1:\2 replacement formats the time as an HH:mm string.
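If both pieces are wanted from a single pass over the string, one possible sketch uses base R's regexec() with regmatches(), which returns the full match followed by one element per capturing group. This is an alternative to the two sub() calls above, not the answer's method; DATEDIR and TIMEDIR are the variable names the question asked for.

```r
TEST <- "Pia1.2016-10-08.1103+N2353.tif"

# m[1] is the whole match, m[2] the date, m[3] the hours, m[4] the minutes
m <- regmatches(
  TEST,
  regexec("\\.(\\d{4}-\\d{2}-\\d{2})\\.(\\d{2})(\\d{2})", TEST)
)[[1]]

DATEDIR <- m[2]
TIMEDIR <- paste(m[3], m[4], sep = ":")
```

With the sample file name, DATEDIR holds the date part and TIMEDIR the time in HH:mm form.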

Number followed by hyphen

I have a data frame that contains strings; see the code below:
mydf = data.frame(x=c("ads 1-x as", "sda 1-xxaa sad", "sda a-x sad"))
I want to replace each word that matches the pattern "number, then hyphen, then a single letter" with the number only.
Expected output:
"ads 1 as", "sda 1-xxaa sad", "sda a-x sad"
We can use sub (or gsub, if there can be more than one instance per string) to match one or more digits captured as a group (\\d+, preceded by the word boundary \\b), followed by a hyphen (-) and a single letter ([A-Za-z]). The trailing word boundary (\\b) prevents a match when more letters follow. The match is replaced with the backreference (\\1) to the captured group.
gsub("\\b(\\d+)-[A-Za-z]\\b", "\\1", mydf$x)
#[1] "ads 1 as" "sda 1-xxaa sad" "sda a-x sad"
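To see what the trailing word boundary contributes, here is a small sketch that runs the pattern with and without it on the same strings, held in a plain character vector rather than a data frame column for brevity:

```r
x <- c("ads 1-x as", "sda 1-xxaa sad", "sda a-x sad")

# Pattern from the answer: the final \\b requires the single letter to end the word
with_boundary <- gsub("\\b(\\d+)-[A-Za-z]\\b", "\\1", x)

# Same pattern without the final \\b: "1-x" inside "1-xxaa" now matches too
without_boundary <- gsub("\\b(\\d+)-[A-Za-z]", "\\1", x)

with_boundary     # "ads 1 as" "sda 1-xxaa sad" "sda a-x sad"
without_boundary  # "ads 1 as" "sda 1xaa sad"   "sda a-x sad"
```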

Extract all text between third to last and last period

I have text that looks like:
txt <- "Name, Name. Title. Pub. Year; Details."
I want to extract only Pub.
I can extract year and details using:
gsub(".*\\.(.*)\\..*", "\\1", txt)
How can I extract everything between the third-to-last and second-to-last period (just Pub) in R?
You may use sub (since you only need a single search-and-replace operation) in the following way:
txt <- "Name, Name. Title. Pub. Year; Details."
sub(".*\\.([^.]*)(?:\\.[^.]*){2}$", "\\1", txt)
# => [1] " Pub"
See the R demo.
Details
.* - any 0+ chars, as many as possible
\\. - a .
([^.]*) - Group 1: any 0+ chars other than .
(?:\\.[^.]*){2} - 2 consecutive sequences of
\\. - a .
[^.]* - any 0+ chars other than .
$ - end of string.
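The captured value keeps the leading space that follows the period. A minimal sketch of two ways to get a clean "Pub", assuming the text always ends with a period as in the example: wrapping the sub() call in trimws(), or, as an alternative technique not used in the answer, splitting on literal periods with strsplit().

```r
txt <- "Name, Name. Title. Pub. Year; Details."

# Same pattern as above, with surrounding whitespace stripped from the result
pub <- trimws(sub(".*\\.([^.]*)(?:\\.[^.]*){2}$", "\\1", txt))

# Alternative: split on literal "." and take the second-to-last piece.
# strsplit() does not return an empty piece for the final period.
parts <- strsplit(txt, ".", fixed = TRUE)[[1]]
pub_split <- trimws(parts[length(parts) - 1])
```

Both variables hold the same value for this input; the strsplit() version would need adjusting if the text did not end with a period.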

Replacing repeated groups of characters using regex

In R, I have a string that contains repeated groups of characters:
testString <- "Hi hi missing u lollol hahahahalol sillybilly haaaaa!"
I'm trying to use a gsub regex to replace repeated groups of characters within each word to produce the following output:
"Hi hi missing u lol halol sillybilly haaaaa!"
I've tried the following line but it isn't producing the right output:
gsub("[[:blank:]](.+?){2,}[[blank]]\\1",
replacement="\\1", testString, perl=TRUE)
What have I done wrong?
You may match runs of a single repeated word char and skip them, and then collapse all other consecutively repeated groups of word chars, with a solution like
x <- "Hi hi missing u lollol hahahahalol sillybilly haaaaa!"
gsub("(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2+", "\\2", x, perl=TRUE)
See the regex demo and an online R demo
Details:
(\\w)\\1+(*SKIP)(*F) - match and capture a word char (with (\\w), this can be adjusted) and then 1+ occurrences of this same char (with \\1+); the matched text is then discarded and the engine goes on to search for another match after the end of it (with the PCRE (*SKIP)(*FAIL) verbs sequence)
| - or
(\\w+?)\\2+ - 1 or more word chars, as few as possible, are captured into Group 2 (with (\\w+?)) and then 1+ occurrences of the same value are matched (with \\2+).
The replacement is just the Group 2 value.
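For a quick check against the output the question asked for, the call can be wrapped in a small helper. The function name collapse_repeats is hypothetical and only used for illustration here.

```r
# Hypothetical helper around the gsub() call from the answer
collapse_repeats <- function(x) {
  gsub("(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2+", "\\2", x, perl = TRUE)
}

testString <- "Hi hi missing u lollol hahahahalol sillybilly haaaaa!"
result <- collapse_repeats(testString)
result
# [1] "Hi hi missing u lol halol sillybilly haaaaa!"
```

Runs of one repeated character ("ss", "ll", "aaaaa") are left alone because the first alternative consumes and skips them, while "lollol" and "hahahaha" are reduced to a single copy of the repeated group.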

text manipulation in R

I am trying to add parentheses around certain parts of book title strings, and I want to be able to do it with the paste0 function. I want to take this character vector:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)", "My Life 1993 07e pdfDrama (amazon.com)")
and wrap certain strings in parentheses:
a
[1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
[2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"
I have tried but can't figure out a way to replace them within the string:
paste0("(",str_extract(a, "\\d{4}"),")")
paste0("(",str_extract(a, "[0-9]+.e"),")")
Help?
I can suggest a regex for a fixed number of words of specific type:
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)","My Life 1993 07e pdfDrama (amazon.com)")
sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))", "(\\1)\\2(\\3)\\4(\\5)\\6", a)
See the R demo
And here is the regex demo. In short,
\\b(\\d{4}) - captures 4 digits as a whole word into Group 1
(\\s+) - Group 2: one or more whitespaces
(\\d+e) - Group 3: one or more digits and e
(\\s+) - Group 4: one or more whitespaces (same as Group 2)
([a-zA-Z]+) - Group 5: one or more letters
(\\s+\\([^()]*\\)) - Group 6: one or more whitespaces, (, zero or more chars other than ( and ), ).
The contents of the groups are inserted back into the result with the help of backreferences.
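A short sketch of the call on the two sample titles, plus a made-up title without the year/edition pattern (an assumption for illustration) to show that sub() leaves non-matching input unchanged:

```r
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)",
       "My Life 1993 07e pdfDrama (amazon.com)")

pattern <- "\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))"
wrapped <- sub(pattern, "(\\1)\\2(\\3)\\4(\\5)\\6", a)
wrapped
# [1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
# [2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"

# Hypothetical title with no 4-digit year: nothing matches, so it is returned as-is
plain <- sub(pattern, "(\\1)\\2(\\3)\\4(\\5)\\6", "Untitled Draft (amazon.com)")
```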
If there are more words, and you need to wrap words starting with a letter/digit/underscore after a 4-digit word in the string, use
gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)", "(\\1)", a, perl=TRUE)
See the R demo and a regex demo
Details:
(?:(?=\\b\\d{4}\\b)|\\G(?!\\A)) - either the location before a 4-digit whole word (see the positive lookahead (?=\\b\\d{4}\\b)) or the end of the previous successful match
\\s* - 0+ whitespaces
\\K - omitting the text matched so far
\\b(\\S+) - Group 1 capturing 1 or more non-whitespace symbols that are preceded with a word boundary.
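To illustrate how the \\G-based pattern copes with a varying number of words, here is a sketch in which the second title has one extra word after the year. The word "extra" is made up for this example and does not come from the question.

```r
a2 <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)",
        "My Life 1993 07e pdfDrama extra (amazon.com)")  # "extra" is a hypothetical addition

wrapped_all <- gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)",
                    "(\\1)", a2, perl = TRUE)
wrapped_all
# [1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
# [2] "My Life (1993) (07e) (pdfDrama) (extra) (amazon.com)"
```

The chain of matches starts at the 4-digit word and continues word by word; it stops at "(amazon.com)" because there is no word boundary between the preceding space and the opening parenthesis, so that part is not wrapped a second time.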
