Regex for substring matching with space and substitution - r

I want to combine words in one string that have spaces in between them, when they match words in another string that has no such spaces (in R).
E.g.
s1 = 'this is an example of an undivided string case here'
s2 = 'Please note th is is an un di vid ed case right he r e for you!'
s2 needs to be converted into
s2 = 'Please note this is an undivided case right here for you!'
based on the combined words in s1 that are the same as successive fragments in s2 (which have spaces in between).
I am new to R and tried gsub with different combinations of '\s', but was not able to get the desired result.

You may achieve what you need by
removing all whitespace from the string you want to search for (s1) (with gsub("\\s+", "", x)), then
inserting a whitespace pattern (\s*) between each pair of chars (use something like sapply(strsplit(unspace(s1), ""), paste, collapse="\\s*")), and then
replacing all the matches with s1 using gsub(pattern, s1, s2).
See the R demo:
s2 = 'Please note th is is an un di vid ed case right he r e for you!'
s1 = 'this is an undivided case right here'
unspace <- function(x) { gsub("\\s+", "", x) }
pattern <- sapply(strsplit(unspace(s1), ""), paste, collapse="\\s*")
gsub(pattern, s1, s2)
## => [1] "Please note this is an undivided case right here for you!"
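To see what the generated pattern looks like, here is a minimal sketch on a shorter phrase. The helper name spaced_pattern is hypothetical (it is not part of the answer above), and it assumes the phrase contains only plain letters and spaces, so no regex metacharacters need escaping.

```r
# Hypothetical helper: strip whitespace, then allow optional
# whitespace (\s*) between every pair of remaining characters
spaced_pattern <- function(phrase) {
  chars <- strsplit(gsub("\\s+", "", phrase), "")[[1]]
  paste(chars, collapse = "\\s*")
}

phrase <- "right here"
pattern <- spaced_pattern(phrase)
cat(pattern, "\n")  # prints the raw regex

# Any spaced-out spelling of the phrase is replaced by the phrase itself
fixed <- gsub(pattern, phrase, "case right he r e for you!")
fixed
```

With this input, fixed should come out as "case right here for you!". If s1 could contain characters such as . or (, they would have to be escaped before building the pattern.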

Related

Extract all substrings in string

I want to extract all substrings that begin with M and are terminated by a *
The string below as an example;
vec<-c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
Would ideally return;
MGMTPRLGLESLLE
MTPRLGLESLLE
I have tried the code below;
regmatches(vec, gregexpr('(?<=M).*?(?=\\*)', vec, perl=T))[[1]]
but this drops the first M and only returns the first string rather than all substrings within.
"GMTPRLGLESLLE"
You can use
(?=(M[^*]*)\*)
See the regex demo. Details:
(?= - start of a positive lookahead that matches a location that is immediately followed with:
(M[^*]*) - Group 1: M, zero or more chars other than a * char
\* - a * char
) - end of the lookahead.
See the R demo:
library(stringr)
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- stringr::str_match_all(vec, "(?=(M[^*]*)\\*)")
unlist(lapply(matches, function(z) z[,2]))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
If you prefer a base R solution (gregexec requires R 4.1.0 or later):
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- regmatches(vec, gregexec("(?=(M[^*]*)\\*)", vec, perl=TRUE))
unlist(lapply(matches, tail, -1))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
This could be done instead with a for loop over a char array converted from your string.
If you encounter an M, you start concatenating chars to a new string until you encounter a *. When you do encounter a *, you push the new string to an array of strings and start over from the first step until you reach the end of your loop.
It's not quite as interesting as using a regex to do it, but it's failsafe.
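A rough sketch of the loop idea in R follows. It is a variation rather than a literal transcription: it restarts from every M (not only after a *), because that is what yields the overlapping results the question asks for. The function name extract_m_to_star is made up for illustration.

```r
# Hypothetical helper: for every M, take the text up to
# (but not including) the next * that follows it
extract_m_to_star <- function(x) {
  chars  <- strsplit(x, "")[[1]]
  starts <- which(chars == "M")
  stars  <- which(chars == "*")
  out <- character(0)
  for (s in starts) {
    later <- stars[stars > s]      # positions of * after this M
    if (length(later) > 0) {
      out <- c(out, substr(x, s, later[1] - 1))
    }
  }
  out
}

res <- extract_m_to_star("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
res
```

For the example string this should give "MGMTPRLGLESLLE" and "MTPRLGLESLLE"; the final M has no * after it, so it contributes nothing.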
A single consuming pattern cannot return these overlapping substrings, because the regex engine resumes searching only after the end of the previous match.
stringr::str_extract_all("abaca", "a[^a]*a") only gives you aba but not the overlapping aca.
The first M was dropped because (?<=M) is a positive lookbehind, which by definition is not part of the match but sits just behind it.
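The lookbehind point can be checked directly. This small sketch compares the question's pattern with one where M is part of the match itself; the variable names are just for illustration.

```r
vec <- "SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ"

# Lookbehind: M is asserted but not consumed, so it is missing from the result
with_lookbehind <- regmatches(vec, gregexpr("(?<=M).*?(?=\\*)", vec, perl = TRUE))[[1]]

# M is part of the pattern itself, so it is kept in the result
with_m <- regmatches(vec, gregexpr("M.*?(?=\\*)", vec, perl = TRUE))[[1]]

with_lookbehind
with_m
```

Both calls return a single string here ("GMTPRLGLESLLE" and "MGMTPRLGLESLLE"), since the second M is already consumed by the first match.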

Capturing Group in R

I have the following pattern Set(?:Value)? in R as follows:
grepl('Set(?:Value)?', 'Set(Value)', perl=T)
This pattern is matched by
1- Set
2- Set Value
3- Set(Value)
But I want to match only the first two cases and not the third case. Can anybody help me?
Thank you
You can use
grepl('^Set(?:\\s+Value)?$', x)
grepl('\\bSet(?!\\(Value\\))(?:\\s+Value)?\\b', x, perl=TRUE)
See regex demo #1 and regex demo #2.
Details:
^Set(?:\\s+Value)?$ - start of string, Set, an optional sequence of one or more whitespaces (\s+) and a Value and then end of string
\bSet(?!\(Value\))(?:\s+Value)?\b:
\b - word boundary
Set - Set string
(?!\(Value\)) - no (Value) string allowed at this very location
(?:\s+Value)? - an optional sequence of one or more whitespaces (\s+) and a Value
\b - word boundary
See an R demo:
x <- c("Set", "Set Value", "Set(Value)")
grep('^Set(?:\\s+Value)?$', x, value=TRUE)
## => [1] "Set" "Set Value"
grep('\\bSet(?!\\(Value\\))(?:\\s+Value)?\\b', x, perl=TRUE, value=TRUE)
## => [1] "Set" "Set Value"
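It may help to see why the original pattern matched all three strings: grepl only needs the pattern to occur somewhere in the string, and Set on its own is enough. A small sketch (variable names are illustrative):

```r
x <- c("Set", "Set Value", "Set(Value)")

# Unanchored: "Set" occurs in every string, so all three match
unanchored <- grepl("Set(?:Value)?", x, perl = TRUE)

# What the unanchored pattern actually matched in each string
matched <- regmatches(x, regexpr("Set(?:Value)?", x, perl = TRUE))

# Anchored: the whole string must be "Set" or "Set" + whitespace + "Value"
anchored <- grepl("^Set(?:\\s+Value)?$", x, perl = TRUE)

unanchored
matched
anchored
```

The unanchored call matches just "Set" in each case, which is why the optional (?:Value)? group never got a chance to exclude anything.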

Extracting matches from strings with lookaround in R

I have textual data (storytellings) and my aim is to extract certain words that are defined by a co-occurrence pattern, namely that they occur immediately prior to overlap, which is indicated by square brackets. The data are like this:
who <- c("Sue:", NA, "Carl:", "Sue:", NA, NA, NA, "Carl:", "Sue:","Carl:", "Sue:","Carl:")
story <- c("That’s like your grand:ma. did that with::=erm ",
"with Ju:ne (.) once or [ twice.] ",
" [ Yeah. ] ",
"And June wanted to go out and yo- your granny said (0.8)",
"“make sure you're ba(hh)ck before midni(hh)ght.” ",
"[Mm.] ",
"[There] she was (.) a ma(h)rried woman with a(h)- ",
"She’s a right wally. ",
"mm [kids as well ] ",
" [They assume] an awful lot man¿ ",
"°°ye:ah,°° ",
"°°the elderly do.°° ")
CAt <- data.frame(who, story)
Now, defining the pattern:
pattern <- "\\w.*\\s\\[[^]].*]"
and using grep():
grep(pattern, CAt$story, value = T)
[1] "with Ju:ne (.) once or [ twice.] "
[2] "mm [kids as well ] "
I get the two strings that contain the target matches but what I'm really after are the target words only, in this case the words "or" and "mm". This, to me, seems to call for positive lookahead. So I redefined the pattern thus:
pattern <- "\\w.*(?=\\s\\[[^]].*])"
which says something along the lines: "match the word iff you see a space followed by square brackets with some content on the right of that word". Now to extract only the exact matches, I normally use this code, which works fine as long as no lookaround is involved, but here it throws an error:
unlist(regmatches(CAt$story, gregexpr(pattern, CAt$story)))
Error in gregexpr(pattern, CAt$story) :
invalid regular expression, reason 'Invalid regexp'
Why is this? And how can the exact matches be extracted?
In your code, you could add perl=TRUE to gregexpr.
In your pattern \w.* will match a single word char followed by matching any char 0+ times.
This part \[[^]].*] will match [, then 1 char which is not ] and then .* which will match any char 0+ times followed by ].
You could update your pattern to repeat the word char (\w+) and the negated character class ([^]]*) themselves instead of following each with .*
\w+(?=\s\[[^]]*])
Explanation
\w+ Match 1+ word chars
(?= Positive lookahead, assert what is directly to the right is
\s Match single whitespace char
\[[^]]*] Match from opening [ to closing ] using a negated character class
) Close positive lookahead
Regex demo
Using doubled backslashes:
\\w+(?=\\s\\[[^]]*])
As an alternative you could use a capturing group instead of using a lookahead
(\w+)\s\[[^]]*]
Regex demo
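Putting the two pieces together (perl=TRUE plus the tightened pattern), a minimal sketch on a shortened copy of the data could look like this. The four-element story vector is trimmed from the question for brevity.

```r
story <- c("with Ju:ne (.) once or [ twice.] ",
           "           [ Yeah. ] ",
           "mm [kids as well ] ",
           "   [They assume] an awful lot")

pattern <- "\\w+(?=\\s\\[[^]]*])"

# perl = TRUE is needed: the default regex engine rejects lookarounds,
# which is what caused the 'Invalid regexp' error
words <- unlist(regmatches(story, gregexpr(pattern, story, perl = TRUE)))
words
```

This should return "or" and "mm", the two words that sit immediately before a bracketed overlap.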

Match only parenthesis with text and numbers in R

I would like to replace the parentheses and the text between them in string variables. However, I only want to replace those parentheses that contain at least one number.
Example string:
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
I tried the following:
str_extract_all(text, " *\\(.*?\\d+.*?\\) *")
It does extract the text in parenthesis, but in the first one, it matches also the first parenthesis without any number.
The extraction should look like:
" (G3)"
" (3 Jahre)"
" (< 2 Jahre)"
If you want to replace these terms in parentheses, containing at least one number, then gsub is a good base R option (it is vectorized, so no sapply is needed):
text
gsub("\\([^()]*\\d[^()]*\\)", "REMOVED", text)
[1] "Sekretär (dipl.) (G3)" "Zolldeklarant (3 Jahre)" "Grenzwächter (< 2 Jahre)"
[1] "Sekretär (dipl.) REMOVED" "Zolldeklarant REMOVED" "Grenzwächter REMOVED"
I have replaced with the literal text REMOVED just as a placeholder to show the replacement.
Edit:
If you just want to extract these terms, we can also use gsub for this:
gsub(".*(\\([^()]*\\d[^()]*\\)).*", "\\1", text)
[1] "(G3)" "(3 Jahre)" "(< 2 Jahre)"
Here, we capture the term in parentheses, then replace the entire string with just the first (and only) capture group \\1.
You can use
\([^()]*\d+[^()]*\)
See a demo on regex101.com.
Backslashes need to be doubled in R string literals, so your expression would become
\\([^()]*\\d+[^()]*\\)
Broken down this is
\( # (
[^()]* # not ( nor ), 0+ times
\d+ # digits, 1+
[^()]* # same as above
\) # )
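As a sketch of how this pattern might be used in base R, both for extracting and for removing the matching groups (the trimws call and the variable names are my additions, not part of the answer):

```r
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
pattern <- "\\([^()]*\\d+[^()]*\\)"

# Extract every parenthesized group that contains at least one digit
extracted <- unlist(regmatches(text, gregexpr(pattern, text)))

# Remove those groups and trim the leftover trailing space
removed <- trimws(gsub(pattern, "", text))

extracted
removed
```

Because [^()]* cannot cross a parenthesis, (dipl.) is left alone and only (G3) is picked up from the first string.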
text <- c("Sekretär (dipl.) (G3)", "Zolldeklarant (3 Jahre)", "Grenzwächter (< 2 Jahre)")
gsub(".*\\((.*[0-9].*)\\).*","(\\1)",text)
Basically, you ask gsub to match the whole string but to capture as a group (\1) the text in parentheses that includes a number.
Plus, if you want to extract the last parentheses always, that could follow a different approach.

Replace a specific character only between parenthesis

Let's say I have a string:
test <- "(pop+corn)-bread+salt"
I want to replace only the plus sign that is between parentheses with '|', so I get:
"(pop|corn)-bread+salt"
I tried:
gsub("([+])","\\|",test)
But it replaces all the plus signs of the string (obviously)
If you want to replace all + symbols that are inside parentheses (if there may be 1 or more), you can use any of the following solutions:
gsub("\\+(?=[^()]*\\))", "|", x, perl=TRUE)
See the regex demo. Here, the + is only matched when it is followed with any 0+ chars other than ( and ) (with [^()]*) and then a ). It is only good if the input is well-formed and there are no nested parentheses, as it does not check if there was a starting (.
gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", x, perl=TRUE)
This is a safer solution since it starts matching + only if there was a starting (. See the regex demo. In this pattern, (?:\G(?!^)|\() matches the end of the previous match (\G(?!^)) or (|) a (, then [^()]*? matches any 0+ chars other than ( and ) chars, and then \K discards all the matched text and \+ matches a + that will be consumed and replaced. It still does not handle nested parentheses.
Also, see an online R demo for the above two solutions.
library(gsubfn)
s <- "(pop(+corn)+unicorn)-bread+salt+malt"
gsubfn("\\((?:[^()]++|(?R))*\\)", ~ gsub("+", "|", m, fixed=TRUE), s, perl=TRUE, backref=0)
## => [1] "(pop(|corn)|unicorn)-bread+salt+malt"
This solves the problem of matching nested parentheses, but requires the gsubfn package. See another regex demo. See this regex description here.
Note that in case you do not have to match nested parentheses, you may use "\\([^()]*\\)" regex with the gsubfn code above. \([^()]*\) regex matches (, then any zero or more chars other than ( and ) (replace with [^)]* to also allow ( chars) and then a ).
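For the non-nested case, the same idea can be sketched in base R without gsubfn, using the replacement form of regmatches. This is an alternative I am suggesting, not part of the answer above, and the input string with two plus signs inside the parentheses is made up for illustration.

```r
test <- "(pop+corn+nuts)-bread+salt"

# Locate every non-nested (...) group
m <- gregexpr("\\([^()]*\\)", test)

# Replace + with | inside each matched group only, then write the groups back
regmatches(test, m) <- lapply(regmatches(test, m),
                              function(g) gsub("+", "|", g, fixed = TRUE))
test
```

This should leave the plus signs outside the parentheses untouched, giving "(pop|corn|nuts)-bread+salt".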
We can try
sub("(\\([^+]+)\\+","\\1|", test)
#[1] "(pop|corn)-bread+salt"

Resources