Extracting matches from strings with lookaround in R

I have textual data (storytellings) and my aim is to extract certain words that are defined by a co-occurrence pattern, namely that they occur immediately prior to overlap, which is indicated by square brackets. The data are like this:
who <- c("Sue:", NA, "Carl:", "Sue:", NA, NA, NA, "Carl:", "Sue:","Carl:", "Sue:","Carl:")
story <- c("That’s like your grand:ma. did that with::=erm ",
"with Ju:ne (.) once or [ twice.] ",
" [ Yeah. ] ",
"And June wanted to go out and yo- your granny said (0.8)",
"“make sure you're ba(hh)ck before midni(hh)ght.” ",
"[Mm.] ",
"[There] she was (.) a ma(h)rried woman with a(h)- ",
"She’s a right wally. ",
"mm [kids as well ] ",
" [They assume] an awful lot man¿ ",
"°°ye:ah,°° ",
"°°the elderly do.°° ")
CAt <- data.frame(who, story)
Now, defining the pattern:
pattern <- "\\w.*\\s\\[[^]].*]"
and using grep():
grep(pattern, CAt$story, value = TRUE)
[1] "with Ju:ne (.) once or [ twice.] "
[2] "mm [kids as well ] "
I get the two strings that contain the target matches but what I'm really after are the target words only, in this case the words "or" and "mm". This, to me, seems to call for positive lookahead. So I redefined the pattern thus:
pattern <- "\\w.*(?=\\s\\[[^]].*])"
which says something along the lines: "match the word iff you see a space followed by square brackets with some content on the right of that word". Now to extract only the exact matches, I normally use this code, which works fine as long as no lookaround is involved, but here it throws an error:
unlist(regmatches(CAt$story, gregexpr(pattern, CAt$story)))
Error in gregexpr(pattern, CAt$story) :
invalid regular expression, reason 'Invalid regexp'
Why is this? And how can the exact matches be extracted?

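The error occurs because R's default regex engine (TRE) does not support lookarounds. In your code, add perl=TRUE to gregexpr so that the pattern is processed by the PCRE engine, which does support them.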
In your pattern, \w.* matches a single word char followed by any char 0+ times, so it does not stop at the end of the word.
This part \[[^]].*] will match [, then 1 char which is not ] and then .* which will match any char 0+ times followed by ].
Instead, you could repeat the word char (\w+) and the negated character class itself ([^]]*):
\w+(?=\s\[[^]]*])
Explanation
\w+ Match 1+ word chars
(?= Positive lookahead, assert what is directly to the right is
\s Match single whitespace char
\[[^]]*] Match from opening [ to closing ] using a negated character class
) Close positive lookahead
Using doubled backslashes:
\\w+(?=\\s\\[[^]]*])
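Putting the pieces together, a minimal sketch of the full extraction might look like this. It assumes a shortened, ASCII-only sample of the story vector from the question, and the name hits is only illustrative.

```r
# Shortened sample of the transcript lines from the question (assumption)
story <- c("with Ju:ne (.) once or [ twice.] ",
           " [ Yeah. ] ",
           "mm [kids as well ] ",
           " [They assume] an awful lot ")

# One or more word chars, asserted to be followed by whitespace and a [...] group
pattern <- "\\w+(?=\\s\\[[^]]*])"

# perl = TRUE switches to the PCRE engine, which supports lookarounds
hits <- unlist(regmatches(story, gregexpr(pattern, story, perl = TRUE)))
print(hits)
# [1] "or" "mm"
```

Because the lookahead does not consume the bracketed part, only the word itself ends up in the match.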
As an alternative you could use a capturing group instead of using a lookahead
(\w+)\s\[[^]]*]
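A possible way to use the capturing-group variant in base R is regexec(), which reports the capture groups alongside the full match. This is only a sketch on a shortened sample; the names m and words are illustrative, and the perl argument of regexec() assumes R 3.3.0 or later.

```r
# Shortened sample of the transcript lines from the question (assumption)
story <- c("with Ju:ne (.) once or [ twice.] ",
           " [ Yeah. ] ",
           "mm [kids as well ] ")

# Group 1 captures the word; the bracketed part is consumed but not captured
pattern <- "(\\w+)\\s\\[[^]]*]"

# For each string: element 1 is the full match, element 2 is group 1
m <- regmatches(story, regexec(pattern, story, perl = TRUE))

# Strings without a match yield character(0), so indexing gives NA
words <- vapply(m, function(x) x[2], character(1))
print(words)
# [1] "or" NA   "mm"
```

Note that regexec() only reports the first match per string, whereas gregexpr() with the lookahead finds all of them.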

Related

How to match phonemic transcriptions with a single vowel except if a condition applies

I have phonemic transcriptions of English words such as these:
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
I'd like to match mono-syllabic words, i.e., words that contain a single vowel. My set of phonemic vowels is this:
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
Using str_count and the vector vowel as pattern, I'm able to match a fairly good set of words:
library(stringr)
test[str_count(test, vowel) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "wɒznt" "ðeər" "dɪdnt"
However, wɒznt and dɪdnt can be seen as bi-syllabic (as the n sound can replace a vowel, so that nt counts as a second vowel). So the question is, how can I match mono-syllabic words except those that end in nt?
What I've tried so far is this set operation, which works well but looks clumsy:
setdiff(test[str_count(test, vowel) == 1], test[str_count(test, paste0("[^", vowel, "]nt$")) == 1])
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
I'd much rather have a single more concise regex. Any ideas?
You can use
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
library(stringr)
p <- paste0("^(?!.*(?<!",vowel,")nt$)(?:(?!",vowel,").)*(?:",vowel,")(?:(?!",vowel,").)*$")
test[str_detect(test, p)]
## => [1] "kɑːnt" "ʧeɪnʤd" "ðeər"
The pattern means:
^ - start of string
(?!.*(?<!",vowel,")nt$) - immediately to the right, there must not be any 0+ chars other than line break chars as many as possible followed with nt (not preceded with any of the specified vowel sound sequences) and end of string
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
(?:",vowel,") - any of the specified vowel sound sequences
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
$ - end of string.
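The core of this pattern, "exactly one occurrence of an alternation", can be tried out in isolation. The sketch below uses a made-up ASCII set of units and toy words (both are assumptions, not the phonemic data above) and base R's grepl() with perl = TRUE instead of stringr.

```r
# Hypothetical ASCII "vowel" units; longer alternatives are listed first
unit <- "oo|ee|a|e|i|o|u"

# Non-unit chars, exactly one unit, then non-unit chars up to the end
p <- paste0("^(?:(?!", unit, ").)*(?:", unit, ")(?:(?!", unit, ").)*$")

words <- c("moon", "banana", "tree", "sky", "radio")
one_unit <- words[grepl(p, words, perl = TRUE)]
print(one_unit)
# [1] "moon" "tree"
```

Each (?:(?!unit).)* part consumes a character only if no unit starts at that position, which is what prevents a second unit from slipping into the match.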
This is a somewhat concise solution (thanks to @G5W for the decisive hint):
vowel_cc <- paste0(unique(unlist(strsplit(gsub("\\|", "", vowel), ""))), collapse = "")
vowel_cc
[1] "iːaɪɔəʊɛeuɑɜɒʌæ"
test[str_count(test, paste0(vowel, "|[^", vowel_cc, "]+nt$")) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
This solution uses a string vowel_cc consisting of all unique characters in vowel. These serve as input for a negated character class. The pattern specifies nt as one more alternative next to the vowels, on the condition that it is preceded by one or more characters not in vowel_cc and occurs at the end of the string. Words such as wɒznt therefore yield two matches and are excluded by the == 1 test.

Remove hashtags from beginning and end of tweets in R

I am trying to remove hashtags from the end of strings in R.
For example:
x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
I want to remove the hashtags at the end of string which are #lateNightThoughts and #movie. Result:
- "I didn't know it could be #boring. guess I need some fun"
I tried :
stringi::stri_replace_last_regex(x,'#\\S+',"")
but it removes only the very last hashtag.
- "I didn't know it could be #boring. guess I need some fun #movie "
Any idea how to get the expected result?
Edit:
How about removing hashtag from beginning of text ?
eg:
x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
You may use
> x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
> sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
[1] "I didn't know it could be #boring. guess I need some fun"
Or, if you do not care about the context of the first # you want to start matching from, you may even use
sub("(?:\\s*#\\w+)+\\s*$", "", x)
Details
\s* - zero or more whitespaces
\B - right before the current location, there can be start of string or a non-word char (this is usually used to ensure you do not match # inside a "word", so if you do not need it, you may remove this non-word boundary)
# - a # char
\w+ - 1 or more word chars (letters, digits or _)
(?:\s*#\w+)* - zero or more occurrences of:
\s* - zero or more whitespaces
# - a # char
\w+ - 1+ word chars
\s* - zero or more whitespaces
$ - end of string.
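The edit in the question also asks about hashtags at the beginning of the text. One possible approach, sketched here as an assumption rather than as part of the answer above, is to mirror the trailing pattern and anchor it with ^ instead of $. The names no_trailing and no_leading are illustrative.

```r
x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

# Remove the run of hashtags (and surrounding whitespace) at the end
no_trailing <- sub("(?:\\s*#\\w+)+\\s*$", "", x, perl = TRUE)

# Remove the run of hashtags (and following whitespace) at the start
no_leading <- sub("^\\s*(?:#\\w+\\s*)+", "", no_trailing, perl = TRUE)

print(no_leading)
# [1] "I didn't know it could be #boring. guess I need some fun"
```

The hashtag in the middle of the text is kept because neither anchored pattern can reach it.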

matching start of a string but not end in R

How can I match all words starting with plan_ and not ending with template without using invert = TRUE? In the below example, I'd like to match only the second string. I tried with negative lookahead but it does not work, maybe because of greediness?
names <- c("plan_x_template", "plan_x")
grep("^plan.*(?!template)$",
names,
value = TRUE, perl = TRUE
)
#> [1] "plan_x_template" "plan_x"
I mean one can also solve the problem with two regex calls but I'd like to see how it works the other way :-)
is_plan <- grepl("^plan_", names)
is_template <- grepl("_template$", names)
names[is_plan & !is_template]
#> [1] "plan_x"
You may use
names <- c("plan_x_template", "plan_x")
grep("^plan(?!.*template)",
names,
value = TRUE, perl = TRUE
)
The ^plan(?!.*template) pattern matches:
^ - a start of string
plan - a plan substring
(?!.*template) - a negative lookahead that fails the match if, immediately to the right of the current location, there are 0+ chars other than line break chars, as many as possible, followed with a template substring (since perl = TRUE is used, the pattern is processed with the PCRE engine, where . does not match line break chars, as opposed to the default TRE regex engine). Note that this rejects any string that contains template anywhere after plan, not only strings that end with it.
NOTE: In case of multiline strings, you need to use a DOTALL modifier in the regex, "(?s)^plan(?!.*template)".
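To see why the attempt in the question matches everything, note that its lookahead is evaluated at the very end of the string, where nothing can follow, so it always succeeds. The sketch below contrasts it with two variants that only reject a trailing template; the extra test strings and the name nms are assumptions added for illustration.

```r
nms <- c("plan_x_template", "plan_x", "plan_template_v2", "other_x")

# Original attempt: (?!template) is checked right before $, so it never fails
original <- grepl("^plan.*(?!template)$", nms, perl = TRUE)
print(original)
# [1]  TRUE  TRUE  TRUE FALSE

# Lookahead placed after the prefix, rejecting only a trailing "template"
ahead <- grep("^plan_(?!.*template$)", nms, value = TRUE, perl = TRUE)

# Equivalent idea with a lookbehind evaluated at the end of the string
behind <- grep("^plan_.*(?<!template)$", nms, value = TRUE, perl = TRUE)

print(ahead)
# [1] "plan_x"           "plan_template_v2"
```

Whether plan_template_v2 should be kept depends on the requirement; with ^plan(?!.*template) it would be dropped as well.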

Replace a specific character only between parenthesis

Let's say I have a string:
test <- "(pop+corn)-bread+salt"
I want to replace the plus sign that is only between parentheses by '|', so I get:
"(pop|corn)-bread+salt"
I tried:
gsub("([+])","\\|",test)
But it replaces all the plus signs of the string (obviously)
If you want to replace all + symbols that are inside parentheses (if there may be 1 or more), you can use any of the following solutions:
gsub("\\+(?=[^()]*\\))", "|", test, perl=TRUE)
Here, the + is only matched when it is followed with any 0+ chars other than ( and ) (with [^()]*) and then a ). It is only reliable if the input is well-formed and there are no nested parentheses, as it does not check whether there was a starting (.
gsub("(?:\\G(?!^)|\\()[^()]*?\\K\\+", "|", test, perl=TRUE)
This is a safer solution since it starts matching + only if there was a starting (. In this pattern, (?:\G(?!^)|\() matches the end of the previous match (\G(?!^)) or (|) a (, then [^()]*? matches any 0+ chars other than ( and ) chars, and then \K discards all the matched text and \+ matches a + that will be consumed and replaced. It still does not handle nested parentheses.
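The difference between the two patterns shows up on input without an opening parenthesis. The following sketch runs both on the string from the question and on a hypothetical unbalanced string (the latter is an assumption added for illustration).

```r
test <- "(pop+corn)-bread+salt"
unbalanced <- "a+b)"  # hypothetical input with no opening parenthesis

lookahead_pat <- "\\+(?=[^()]*\\))"
anchored_pat  <- "(?:\\G(?!^)|\\()[^()]*?\\K\\+"

# Both patterns agree on well-formed input
r1 <- gsub(lookahead_pat, "|", test, perl = TRUE)
r2 <- gsub(anchored_pat, "|", test, perl = TRUE)
print(r1)
# [1] "(pop|corn)-bread+salt"

# Only the lookahead pattern is fooled by the stray closing parenthesis
u1 <- gsub(lookahead_pat, "|", unbalanced, perl = TRUE)
u2 <- gsub(anchored_pat, "|", unbalanced, perl = TRUE)
print(c(u1, u2))
# [1] "a|b)" "a+b)"
```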
library(gsubfn)
s <- "(pop(+corn)+unicorn)-bread+salt+malt"
gsubfn("\\((?:[^()]++|(?R))*\\)", ~ gsub("+", "|", m, fixed=TRUE), s, perl=TRUE, backref=0)
## => [1] "(pop(|corn)|unicorn)-bread+salt+malt"
This solves the problem of matching nested parentheses, but requires the gsubfn package.
Note that in case you do not have to match nested parentheses, you may use the "\\([^()]*\\)" regex with the gsubfn code above. The \([^()]*\) regex matches (, then any zero or more chars other than ( and ) (use [^)]* instead to also allow ( inside), and then a ).
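If installing gsubfn is not an option, the same "match the group, then replace inside it" idea can be sketched in base R with gregexpr() and the replacement form of regmatches(). This is an alternative to the gsubfn call above rather than part of the original answer; the names m and out are illustrative.

```r
s <- "(pop(+corn)+unicorn)-bread+salt+malt"

# Recursive PCRE pattern that matches balanced, possibly nested, parentheses
m <- gregexpr("\\((?:[^()]++|(?R))*\\)", s, perl = TRUE)

# Replace + with | inside each matched group and write the groups back
out <- s
regmatches(out, m) <- lapply(regmatches(s, m),
                             function(v) gsub("+", "|", v, fixed = TRUE))
print(out)
# [1] "(pop(|corn)|unicorn)-bread+salt+malt"
```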
We can try
sub("(\\([^+]+)\\+","\\1|", test)
#[1] "(pop|corn)-bread+salt"

Regex - Substitute character in a matching substring

Let's say I have the following string:
input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
I need to replace the white-spaces with underscores, but only in the substrings that match a pattern. (In this case the pattern would be a semi-colon before and after.)
The expected output should be:
output = "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
Any ideas how to achieve that, preferably using regular expressions, and without splitting the string?
I tried:
gsub("(.*);(.*);(.*)", "\\2", input) # Pattern matching and
gsub(" ", "_", input) # Naive gsub
Couldn't put them both together though.
Regarding the original question:
Substitute character in a matching substring
You may do it easily with gsubfn:
> library(gsubfn)
> input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
> gsubfn(";([^;]+);", function(g1) paste0(";",gsub(" ", "-", g1, fixed=TRUE),";"), input)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
The ;([^;]+); matches any string starting with ; and up to the next ; capturing the text in-between and then replacing the whitespaces with hyphens only inside the captured part.
Another approach is to use a PCRE regex with a \G based regex with gsub:
p = "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"
> gsub(p, "-", input, perl=TRUE)
[1] "askl jmsp wiqp;THIS-IS-A-MATCH; dlkasl das, fm"
Pattern details:
(?:\\G(?!\\A)|;) - a custom boundary: either the end of the previous successful match (\\G(?!\\A)) or (|) a semicolon
(?=[^;]*;) - a lookahead check: there must be a ; after 0+ chars other than ;
[^;\\s]* - 0+ chars other than ; and whitespaces
\\K - omitting the text matched so far
\\s - 1 single whitespace character (if multiple whitespaces are to be replaced with 1 hyphen, add + after it).
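The snippets above insert hyphens, while the expected output in the question uses underscores. Assuming the same \G-based pattern, only the replacement string needs to change, as this short sketch shows.

```r
input <- "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"

# Same pattern as above: whitespace between a ; and the next ;
p <- "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"

# Replace each matched whitespace with an underscore
output <- gsub(p, "_", input, perl = TRUE)
print(output)
# [1] "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
```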
