Extracting a pattern considering different patterns [duplicate] - r

Let's say I have these toy vectors:
vec <- c("FOO blabla", "fail bla", "blabla FEEbla", "textFOO", "textttt")
to_match <- c("FOO", "FEE")
I would like to obtain a vector of the same length as vec that holds only the pattern from to_match found in each element, or NA if none is present. Therefore, my desired result would be
c("FOO", NA, "FEE", "FOO", NA)
My first thought was to replace everything that does not match any of the patterns in to_match with an empty string (""). I tried the following code, which does the exact opposite, i.e. it replaces everything that does match any of the patterns in to_match with an empty string.
sub(paste(to_match, collapse = "|"), "", vec)
# [1] " blabla" "fail bla" "blabla bla" "text" "textttt"
I then tried to invert this behaviour by putting a caret (^) before a grouping structure, but with little success.
# fail
sub(paste0("^(", paste(to_match, collapse = "|"), ")"), "", vec)
# [1] " blabla" "fail bla" "blabla FEEbla" "textFOO" "textttt"
How can I reach the desired output?

Your approach was on the right track, but you should extract the pattern that you want instead of removing what you don't want.
library(stringr)
str_extract(vec, str_c(to_match, collapse = "|"))
#[1] "FOO" NA "FEE" "FOO" NA
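For readers who prefer to avoid the stringr dependency, here is a minimal base R sketch of the same extraction. It assumes only the vec and to_match vectors from the question; the variable names pattern, m and out are my own. regmatches() returns just the matched pieces, so they are written back into a vector pre-filled with NA.

```r
vec <- c("FOO blabla", "fail bla", "blabla FEEbla", "textFOO", "textttt")
to_match <- c("FOO", "FEE")

# Build the alternation "FOO|FEE"
pattern <- paste(to_match, collapse = "|")

# regexpr() gives the position of the first match per element (-1 if none)
m <- regexpr(pattern, vec)

# Start with all NA, then fill in the elements that did match
out <- rep(NA_character_, length(vec))
out[m != -1] <- regmatches(vec, m)
out
# [1] "FOO" NA    "FEE" "FOO" NA
```

Note that this only works as written when the entries of to_match contain no regex metacharacters.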

Related

Why does R appear to be a lazy match [duplicate]

I want to use stri_replace_all_regex to replace strings as follows.
It's known that R defaults to greedy matching, so why does it appear to match lazily here?
library(stringi)
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
stri_replace_all_regex(a, "\\b" %s+% b %s+% "\\S+", b, vectorize_all=FALSE)
The result is [1] "ab" "xyc" "mn", which is not what I want. I expected "abc" "xyc" "mnb".
You are calling stri_replace_all_regex with four arguments:
a is length 3. That's the str argument.
"\\b" %s+% b %s+% "\\S+" is length 5. (It would be a lot easier to read if you had used paste0("\\b", b, "\\S+"), but that's beside the point.) That's the pattern argument.
b is length 5. That's the replacement argument.
The last argument is vectorize_all=FALSE.
What it tries to do is documented as follows:
However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and - independently - over pattern and replacement. In other
words, this is equivalent to something like for (i in 1:npatterns) str <- stri_replace_all(str, pattern[i], replacement[i]). Note that you
must set length(pattern) >= length(replacement).
That's pretty sloppy documentation (I want to know what it does, not "something like" what it does!), but I think the process is as follows:
Your first pattern is "\\bab\\S+". That says "word boundary followed by ab followed by one or more non-whitespace chars". That matches all of a[1], so a[1] is replaced by b[1], which is "ab". It then tries the four other patterns, but none of them match, so you get "ab" as output.
The handling of a[3] is more complicated. The first match replaces it with "mnb", based on pattern[4]. Then a second replacement happens, because "mnb" matches pattern[5], and it gets changed again to "mn".
When you say R defaults to greedy matching, that's when doing a single regular expression match. You're doing five separate greedy matches, not one big greedy match.
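To make the step-by-step process concrete, here is a rough sketch of the loop the documentation describes, written with base R's gsub() instead of stringi. This is an assumption about what "something like" means, not the actual stringi implementation; the names patterns and res are mine.

```r
a <- c("abc2", "xycd2", "mnb345")
b <- c("ab", "abc", "xyc", "mnb", "mn")
patterns <- paste0("\\b", b, "\\S+")

# Apply each pattern in turn to the result of the previous step
res <- a
for (i in seq_along(patterns)) {
  res <- gsub(patterns[i], b[i], res, perl = TRUE)
}
res
# [1] "ab"  "xyc" "mn"
```

The loop reproduces the output from the question: "abc2" is consumed by the first pattern before the "abc" pattern gets a chance, and "mnb345" is replaced twice, first to "mnb" and then to "mn".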
EDITED to add:
I don't know the stringi functions well, but in the base regex functions you can do this with just one regex:
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
# Build a big pattern:
# "|" means "or", "(" ... ")" captures the match
pattern <- paste0("\\b(", b, ")\\S+", collapse = "|")
pattern
#> [1] "\\b(ab)\\S+|\\b(abc)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
# \\1 etc contain whatever matched the parenthesized
# patterns. Only one will match, the rest will be empty
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "ab" "xyc" "mnb"
# I would have guessed the greedy rule would have found "abc"
# Try again:
pattern <- paste0("\\b(", b[c(2, 1, 3:5)], ")\\S+", collapse = "|")
pattern
#> [1] "\\b(abc)\\S+|\\b(ab)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "abc" "xyc" "mnb"
Created on 2023-02-13 with reprex v2.0.2
It appears the "|" takes the first match, not the greedy match. I don't think the R docs specify it one way or the other.
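If the order of the alternatives decides which one wins, one workaround is to sort the candidates longest first before building the pattern. The sketch below does that and uses perl = TRUE, because Perl-compatible regular expressions try alternatives from left to right; b_sorted and res are names I made up for illustration.

```r
a <- c("abc2", "xycd2", "mnb345")
b <- c("ab", "abc", "xyc", "mnb", "mn")

# Longest candidates first, so "abc" is tried before "ab"
b_sorted <- b[order(nchar(b), decreasing = TRUE)]

# A single capture group holding all alternatives
pattern <- paste0("\\b(", paste(b_sorted, collapse = "|"), ")\\S+")

res <- sub(pattern, "\\1", a, perl = TRUE)
res
# [1] "abc" "xyc" "mnb"
```

With a single group, the replacement only needs "\\1" rather than one backreference per alternative.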

How to keep only specific punctuation mark in a column [duplicate]

In the column text, how is it possible to remove all punctuation marks but keep only the ?
data.frame(id = c(1), text = c("keep<>-??it--!##"))
expected output
data.frame(id = c(1), text = c("keep??it"))
A more general solution would be to use nested gsub commands that convert ? to a particular unusual string (like "foobar"), get rid of all punctuation, then convert "foobar" back to ?:
gsub("foobar", "?", gsub("[[:punct:]]", "", gsub("\\?", "foobar", df$text)))
#> [1] "keep??it"
Using gsub you could do:
gsub("(\\?+)|[[:punct:]]","\\1",df$text)
[1] "keep??it"
gsub('[[:punct:] ]+',' ',data) removes all punctuation which is not what you want.
But this is:
library(stringr)
sapply(df, function(x) str_replace_all(x, "<|>|-|!|#",""))
id text
[1,] "1" "a"
[2,] "2" "keep??it"
Better IMO than other answers because no need for nesting, and lets you define whichever characters to sub.
Here's another solution using negative lookahead:
gsub("(?!\\?)[[:punct:]]", "", df$text, perl = TRUE)
[1] "keep??it"
The negative lookahead asserts that the next character is not a ? and then matches any punctuation.
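The lookahead idea generalises to keeping more than one punctuation mark. Below is a small sketch; keep_punct is a hypothetical helper of my own, not something from the answers, and it assumes the characters to keep can be placed in a character class as they are.

```r
# Hypothetical helper: remove all punctuation except the characters in `keep`.
# Characters that are special inside a character class (such as ], ^ or \)
# would need escaping before being passed in.
keep_punct <- function(x, keep = "?") {
  # (?![...]) rejects positions where the next character is one we keep
  pattern <- paste0("(?![", keep, "])[[:punct:]]")
  gsub(pattern, "", x, perl = TRUE)
}

keep_punct("keep<>-??it--!##")
# [1] "keep??it"
keep_punct("a.b,c?d!", keep = "?.")
# [1] "a.bc?d"
```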
Data:
df <- data.frame(id = c(1), text = c("keep<>-??it--!##"))

How to find if a string contain certain characters without considering sequence?

I'm trying to match a name against the elements of another vector in R, but I don't know how to make grep() ignore the sequence, so that the words of the name still match when other words sit between them.
name <- "Cry River"
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
grep(name, string, value = TRUE)
I expect the output to be "Cry Me A River", but I don't know how to do it.
Use .* in the pattern
grep("Cry.*River", string, value = TRUE)
#[1] "Cry Me A River"
Or, if you receive the names as they are and can't change them, you can split on whitespace and insert .* between the words:
grep(paste(strsplit(name, "\\s+")[[1]], collapse = ".*"), string, value = TRUE)
where the regex is constructed in the below fashion
strsplit(name, "\\s+")[[1]]
#[1] "Cry" "River"
paste(strsplit(name, "\\s+")[[1]], collapse = ".*")
#[1] "Cry.*River"
Here is a base R option, using grepl:
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
result <- sapply(parts, function(x) { grepl(x, string) })
string[rowSums(result) == length(parts)]
[1] "Cry Me A River"
The strategy here is to first split the string containing the various search terms and generate an individual regex pattern for each term. In this case, we generate:
\bCry\b and \bRiver\b
Then, we iterate over each term, and using grepl we check that the term appears in each of the strings. Finally, we retain only those matches which contained all terms.
We can run grepl on each piece of the split string, Reduce the resulting list of logical vectors to a single logical vector, and use it to extract the matching elements of 'string'
string[Reduce(`&`, lapply(strsplit(name, " ")[[1]], grepl, string))]
#[1] "Cry Me A River"
Also, instead of strsplit, we can insert the .* with sub (sub only replaces the first space, so use gsub if the name has more than two words)
grep(sub(" ", ".*", name), string, value = TRUE)
#[1] "Cry Me A River"
Here's an approach using stringr. Is order important? Is case important? Is it important to match whole words? If you would just like to match 'Cry' and 'River' in any order and don't care about case, you can combine two str_detect() calls:
library(stringr)
name <- "Cry River"
string <- c("Yesterday Once More",
"Are You happy",
"Cry Me A River",
"Take me to the River or I'll Cry",
"The Cryogenic River Rag",
"Crying on the Riverside")
string[str_detect(string, pattern = regex('\\bcry\\b', ignore_case = TRUE)) &
str_detect(string, regex('\\bRiver\\b', ignore_case = TRUE))]
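For comparison, here is a base R sketch of the same whole-word, any-order, case-insensitive test. It reuses the six example strings above; words, patterns and hits are names I chose, and the search terms are hard-coded rather than derived from name.

```r
string <- c("Yesterday Once More",
            "Are You happy",
            "Cry Me A River",
            "Take me to the River or I'll Cry",
            "The Cryogenic River Rag",
            "Crying on the Riverside")

words <- c("cry", "river")
patterns <- paste0("\\b", words, "\\b")

# One logical vector per word, combined with AND
hits <- Reduce(`&`, lapply(patterns, grepl, x = string,
                           ignore.case = TRUE, perl = TRUE))
string[hits]
# [1] "Cry Me A River"                   "Take me to the River or I'll Cry"
```

"Cryogenic" and "Riverside" are rejected because the word boundaries require the whole word to match.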

Looping through and replacing text in a data frame

I have a dataframe that consists of a variable with multiple words, such as:
variable
"hello my name is this"
"greetings friend"
And another dataframe that consists of two columns, one of which is words, the other of which is replacements for those words, such as:
word
"hello"
"greetings"
replacement:
replacement
"hi"
"hi"
I'm trying to find an easy way to replace the words in "variable" with the replacement words, looping over both all the observations, and all the words in each observation. The desired result is:
variable
"hi my name is this"
"hi friend"
I've looked into some methods that use cSplit, but it's not feasible for my application (there are too many words in any given observation of "variable", so this creates too many columns). I'm not sure how I would use strsplit for this, but am guessing that is the correct option?
EDIT: From my understanding of this question, my question may be a repeat of a previously unanswered question: Replace strings in text based on dictionary
stringr's str_replace_all would be handy in this case:
df = data.frame(variable = c('hello my name is this','greetings friend'))
replacement <- data.frame(word = c('hello','greetings'), replacement = c('hi','hi'), stringsAsFactors = FALSE)
stringr::str_replace_all(df$variable,replacement$word,replacement$replacement)
Output:
> stringr::str_replace_all(df$variable,replacement$word,replacement$replacement)
[1] "hi my name is this" "hi friend"
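One caveat, as far as I understand the stringr documentation: when pattern is an unnamed vector, str_replace_all pairs it element by element with the strings, so the call above works here because "hello" happens to be in the first string and "greetings" in the second. A minimal base R sketch that applies every dictionary entry to every string, regardless of row order, could look like this (dict and out are hypothetical names, and the rows of the input are deliberately swapped):

```r
df <- data.frame(variable = c("greetings friend", "hello my name is this"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word = c("hello", "greetings"),
                   replacement = c("hi", "hi"),
                   stringsAsFactors = FALSE)

# Apply each dictionary entry to all strings in turn
out <- df$variable
for (i in seq_len(nrow(dict))) {
  # fixed = TRUE treats the word literally instead of as a regex
  out <- gsub(dict$word[i], dict$replacement[i], out, fixed = TRUE)
}
out
# [1] "hi friend"          "hi my name is this"
```

Because fixed = TRUE matches substrings, this version would also rewrite a word like "helloguys".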
This is similar to @amrrs's solution, but I am using a named vector instead of supplying two separate vectors. This also addresses the issue mentioned by OP in the comments:
library(dplyr)
library(stringr)
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .) %>%
str_replace_all(df1$variable, .)
# [1] "hi my name is this" "hi friend" "hi, hellomy is not a word"
# [4] "hi! my friend"
This is the named vector with regex as the names and string to replace with as elements:
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .)
# \\bhello\\b \\bgreetings\\b
# "hi" "hi"
Data:
df1 = data.frame(variable = c('hello my name is this',
'greetings friend',
'hello, hellomy is not a word',
'greetings! my friend'))
df2 = data.frame(word = c('hello','greetings'),
replacement = c('hi','hi'),
stringsAsFactors = F)
Note:
In order to address the issue of root words also being converted, I wrapped the regex with word boundaries (\\b). This makes sure that I am not converting words that live inside another, like "helloguys".

Remove others in a string except a needed word including certain patterns in R

I have a vector of strings, and from each string I would like to remove everything except the word that contains a certain pattern (here, mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain:
mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the general idea: use gsub() with a pattern for a word beginning with (or including) mir.
But I have no idea how to construct such a pattern.
Or is there another approach?
Any help will be appreciated!
One way in base R would be splitting every string into words and then extracting only those with mir in it
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can skip the unlist step by using sapply instead of lapply, as suggested by @Rich Scriven in the comments
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
We can use sub to match zero or more characters (.*) followed by a word boundary (\\b), followed by the string mir and one or more characters that are not whitespace (\\S+). We capture that part as a group by placing it inside (...), match any remaining characters, and use the backreference to the captured group (\\1) in the replacement
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
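An alternative that extracts the word directly, rather than deleting everything around it, is regmatches() with regexpr(). This is only a sketch under the assumption that each string contains at most one word of interest; m and res are names I picked.

```r
s <- c("a mir-96 line (kk27)", "mir-133a cell",
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")

# Locate the first word starting with "mir" in each string
m <- regexpr("\\bmir\\S+", s, perl = TRUE)

# Pull out the matched substrings
res <- regmatches(s, m)
res
# [1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p"
```

The two approaches differ for strings without a match: sub() returns such a string unchanged, whereas regmatches() drops it from the result.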
Update
If a string contains multiple 'mir.*' substrings, we can extract the one that has a numeric part
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
data
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")
