How to extract words containing combinations of certain characters in R - r

In this sample text:
turns <- tolower(c("Does him good to stir him up now and again .",
"When , when I see him he w's on the settees .",
"Yes it 's been eery for a long time .",
"blissful timing , indeed it was "))
I'd like to extract all words that contain the letters y and e no matter what position or combination, namely yesand eery, using str_extract from stringr:
This regex, in which I determine that y occur immediately before e, matches not surprisingly only yes but not eery:
unlist(str_extract_all(turns, "\\b([a-z]+)?ye([a-z]+)?\\b"))
[1] "yes"
Putting yand e into a character class doesn't get me the desired result either in that all words either with y or with e are matched:
unlist(str_extract_all(turns, "\\b([a-z]+)?[ye]([a-z]+)?\\b"))
[1] "does" "when" "when" "see" "he" "the" "settees" "yes" "been" "eery" "time" "indeed"
So what is the right solution?

You may use both base R and stringr approaches:
stringr::str_extract_all(turns, "\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE))
Or, without turning the strings to lower case, you may use a case insensitive matching with (?i):
stringr::str_extract_all(turns, "(?i)\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE, ignore.case=TRUE))
See the regex demo and the R demo. Also, if you want to make it a tiny bit more efficient, you may use principle of contrast in the lookahead patterns: match any letters but y in the first and all letters but the e in the second using character class substraction:
stringr::str_extract_all(turns, "(?i)\\b(?=[\\p{L}--[y]]*y)(?=[\\p{L}--[e]]*e)\\p{L}+\\b")
Details
(?i) - case insensitive modifier
\b - word boundary
(?=\p{L}*y) - after 0 or more Unicode letters, there must be y ([\p{L}--[y]]* matches any 0 or more letters but y up to the first y)
(?=\p{L}*e) - after 0 or more Unicode letters, there must be e ([\p{L}--[e]]* matches any 0 or more letters but e up to the first e)
\p{L}+ - 1 or more Unicode letters
\b - word boundary

In case there is no urgent need to use stringr::str_extract you can get words containing the letters y and e in base with strsplit and grepl like:
tt <- unlist(strsplit(turns, " "))
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "eery"
In case you have letter chunks between words:
turns <- c("yes no ay ae 012y345e year.")
tt <- regmatches(turns, gregexpr("\\b[[:alpha:]]+\\b", turns))[[1]]
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "year"

Related

Why does R appear to be a lazy match [duplicate]

This question already has answers here:
R regex to get partly match
(2 answers)
Closed 6 days ago.
I want to use stri_replace_all_regex to replace string see as follows:
It's known that R default to greedy matching, but why it appears lazy matching here?
library(stringi)
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
stri_replace_all_regex(a, "\\b" %s+% b %s+% "\\S+", b, vectorize_all=FALSE)
The result is [1] "ab" "xyc" "mn", which is not what I want. I
expected "abc" "xyc" "mnb".
You are calling stri_replace_all_regex with four arguments:
a is length 3. That's the str argument.
"\\b" %s+% b %s+% "\\S+" is length 5. (It would be a lot easier to read if you had used paste0("\\b", b, "\\S+"), but that's beside the point.) That's the pattern argument.
b is length 5. That's the replacement argument.
The last argument is vectorize_all=FALSE.
What it tries to do is documented as follows:
However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and - independently - over pattern and replacement. In other
words, this is equivalent to something like for (i in 1:npatterns) str <- stri_replace_all(str, pattern[i], replacement[i]). Note that you
must set length(pattern) >= length(replacement).
That's pretty sloppy documentation (I want to know what it does, not "something like" what it does!), but I think the process is as follows:
Your first pattern is "\\bab\\S+". That says "word boundary followed by ab followed by one or more non-whitespace chars". That matches all of a[1], so a[1] is replaced by b[1], which is "ab". It then tries the four other patterns, but none of them match, so you get "ab" as output.
The handling of a[3] is more complicated. The first match replaces it with "mnb", based on pattern[4]. Then a second replacement happens, because "mnb" matches pattern[5], and it gets changed again to "mn".
When you say R defaults to greedy matching, that's when doing a single regular expression match. You're doing five separate greedy matches, not one big greedy match.
EDITED to add:
I don't know the stringi functions well, but in the base regex functions you can do this with just one regex:
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
# Build a big pattern:
# "|" means "or", "(" ... ") capture the match
pattern <- paste0("\\b(", b, ")\\S+", collapse = "|")
pattern
#> [1] "\\b(ab)\\S+|\\b(abc)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
# \\1 etc contain whatever matched the parenthesized
# patterns. Only one will match, the rest will be empty
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "ab" "xyc" "mnb"
# I would have guessed the greedy rule would have found "abc"
# Try again:
pattern <- paste0("\\b(", b[c(2, 1, 3:5)], ")\\S+", collapse = "|")
pattern
#> [1] "\\b(abc)\\S+|\\b(ab)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "abc" "xyc" "mnb"
Created on 2023-02-13 with reprex v2.0.2
It appears the "|" takes the first match, not the greedy match. I don't think the R docs specify it one way or the other.

Keep only the first letter of each word after a comma

I have strings like Sacher, Franz Xaver or Nishikawa, Kiyoko.
Using R, I want to change them to Sacher, F. X. or Nishikawa, K..
In other words, the first letter of each word after the comma should be retained with a dot (and a whitespace if another word follows).
Here is a related response, but it cannot be applied to my case 1:1 as it does not have a comma in its strings; it seems that the simple addition of (<?=, ) does not work.
E.g. in the following attempts, gsub() replaces everything, while my str_replace_all()-attempt leads to an error:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
# first attempt
# (resembles the response from the other thread)
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1', TEST, perl = TRUE)
# second attempt
# error: "Incorrect unicode property"
stringr::str_replace_all(TEST, '(?<=, )\\b(\\pL)\\pL{2,}|.','\\U\\1')
I would be grateful for your help!
You can use
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
See the regex demo. Details:
(*UCP) - the PCRE verb that will make \b Unicode aware
^[^,]+(*SKIP)(*F) - start of string and then any zero or more chars other than a comma, and then the match is failed and skipped, the next match starts at the location where the failure occurred
| - or
\b - word boundary
(\p{L}) - Group 1: any Unicode letter
\p{L}* - zero or more Unicode letters
See the R demo:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
## => [1] "Sacher, F. X." "Nishikawa, K." "Al-Assam, M."
A crude approach splitting the string :
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
sapply(strsplit(TEST, '\\s+'), function(x)
paste0(x[1], paste0(substr(x[-1], 1, 1), collapse = '.'), '.'))
#[1] "Sacher,F.X." "Nishikawa,K." "Al-Assam,M."
An approach using multiple backreference:
gsub("(\\b\\w+,\\s)(\\b\\w).*(\\b\\w)*", "\\1\\2.\\3", TEST)
[1] "Sacher, F." "Nishikawa, K." "Al-Assam, M."
Here, we use three capturing groups to refer back to in gsub's replacment argument via backreference:
(\\b\\w+,\\s): this, first, group captures the last name plus the comma followed by whitespace
(\\b\\w): this, second, group captures the initial of the first name
(\\b\\w): this, third, group captures the initial of the middle name

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" while preserving abbreviations?

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" without splitting abbreviations?
I know how to split the string at every uppercase letter, but doing that would split initialisms/abbreviations, like CA or USSR or even U.S.A. and I need to preserve those.
So I'm thinking some type of logical like if a word in a string isn't an initialism then split the word with a space where a lowercase character is followed by an uppercase character.
My snippet of code below splits words with spaces by capital letters, but it breaks initialisms like CA becomes C A undesirably.
s <- "WeLiveInCA"
trimws(gsub('([[:upper:]])', ' \\1', s))
# "We Live In C A"
or another example...
s <- c("IDon'tEatKittensFYI", "YouKnowYourABCs")
trimws(gsub('([[:upper:]])', ' \\1', s))
# "I Don't Eat Kittens F Y I" "You Know Your A B Cs"
The results I'd want would be:
"We Live In CA"
#
"I Don't Eat Kittens FYI" "You Know Your ABCs"
But this needs to be widely applicable (not just for my example)
Try with base R gregexpr/regmatches.
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
#[[1]]
#[1] "We" "Live" "In" "CA"
#
#[[2]]
#[1] "IDon't" "Eat" "Kittens" "FYI"
#
#[[3]]
#[1] "You" "Know" "Your" "ABCs"
Explanation.
[[:upper:]]+ matches one or more upper case letters;
[^[:upper:]]* matches zero or more occurrences of anything but upper case letters.
In sequence these two regular expressions match words starting with upper case letter(s) followed by something else.

Extract strings in round brackets using regex in R

In my transcripts, silent pauses are indicated in round brackets, e.g., (0.9) but also (.) for pauses < 0.3 seconds. I want to extract these pauses. However, transcribers' comments are indicated similarly, namely in double round brackets, e.g. ((coughs)). For this example
yy <- c("well [yes right] (.)", "let's go ((giggles))", "oh [ we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
this extracts all the pauses but also the transcriber comment:
pattern <- "(\\(.*?\\))"
grep(pattern, yy, value=T)
matches <- gregexpr(pattern, yy)
paus <- regmatches(yy, matches)
paus <- unlist(paus)
paus
[1] "(.)" "((giggles)" "(0.5)" "(3.2)"
To get rid of the comment, I tried this:
pattern <- "\\([^\\(].*?\\)[^\\)].*?"
That found "(0.5)" but failed to find the string-final pauses "(.)" and "(3.2)".
Any pointer?
Another option with gsub:
gsub("[^(]*(\\(([.0-9]+)\\)|\\b|\\B)[^)]*", "\\2", yy)
#[1] "." "" "0.5" "" "3.2"
Explanation of the pattern:
. [^(]*: anything except an open bracket, 0 or more times
. (\\(([.0-9]+)\\)|\\b|\\B) : what we want to capture : an open bracket followed by a dot or digits, one or more times, followed by a closing bracket (we only want to capture the dot or digits part, hence \\2 in the replacement part) or the empty string that can be at the edge of a word (\\b) or not (\\B). N.B: Here we are not keeping the brackets around the pauses times but we could.
. [^)]*: anything except a closing bracket, 0 or more times
We can use str_extract to extract the pattern which says an optional number followed by a decimal and then followed by another optional number value. We are using optional ("?") here to get the empty value "(.)".
library(stringr)
vec <- str_extract(yy, "(\\((\\d+)?(\\.(\\d)?\\)))")
vec
#[1] "(.)" NA "(0.5)" NA "(3.2)"
and then use is.na to remove NA elements
vec[!is.na(vec)]
#[1] "(.)" "(0.5)" "(3.2)"
Or using the same regular expression with base R regmatches saves a step to remove NA values.
regmatches(yy, regexpr("(\\((\\d+)?(\\.(\\d)?\\)))", yy))
#[1] "(.)" "(0.5)" "(3.2)"

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
See the regex demo
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my my solution to return the leftmost one use a lazy quantifier with the first .: sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the pattern also shows - both before and after the 4 digits, then use regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means,
any number of non-space characters,
then _
followed by four digits (d{4})
above four digits is your year captured using ()
another _
any other gibberish non space string
Since, you mentioned you're new, please test this and all other answers at https://regex101.com/, pretty good to learn regex, it explains in depth what your regex is actually doing.
If you just care about (year) then below regex is enough:
_(\d{4})_

Resources