How to pass multiple necessary patterns to str_subset? - r

I am trying to find elements in a character vector that match two words in no particular order, not just any single one of them, using the stringr::str_subset function. In other words, I'm looking for the intersection, not the union of the two words.
I tried using the "or" (|) operator, but this matches elements containing either one of the two words and returns too many results. I also tried passing a character vector with the two words as the pattern argument. This just produces the warning "longer object length is not a multiple of shorter object length" and only returns the values that match the second of the two words.
character_vector <- c("abc ghi jkl mno def", "pqr abc def", "abc jkl pqr")
pattern <- c("def", "pqr")
str_subset(character_vector, pattern)
I'm looking for the pattern that will return only the second element of the character vector, i.e. "pqr abc def".

One option is str_detect. Loop over 'pattern', check whether every element of 'pattern' matches each element of 'character_vector' (combining the results with &), and use the resulting logical vector to extract the matching elements from 'character_vector'.
library(tidyverse)
map(pattern, str_detect, string = character_vector) %>%
reduce(`&`) %>%
magrittr::extract(character_vector, .)
#[1] "pqr abc def"
Or using str_subset
map(pattern, str_subset, string = character_vector) %>%
reduce(intersect)
#[1] "pqr abc def"
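The same idea can be sketched in base R without purrr. This is only an illustrative equivalent, not part of the answer above; fixed = TRUE is an assumption that the words should be matched literally rather than as regular expressions.

```r
character_vector <- c("abc ghi jkl mno def", "pqr abc def", "abc jkl pqr")
pattern <- c("def", "pqr")

# One logical vector per pattern: does each element contain that word?
hits <- lapply(pattern, grepl, x = character_vector, fixed = TRUE)

# Combine element-wise with & so only elements matching every pattern stay TRUE
keep <- Reduce(`&`, hits)

result <- character_vector[keep]
print(result)
```

Because Reduce folds over the whole list, the same code should work unchanged for three or more words.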

You can do this in base R without a loop by using a regular expression with lookaheads. The code looks like this:
character_vector[grepl(paste0("(?=.*",pattern,")",collapse = ""), character_vector, perl = TRUE)]
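Here paste0 builds one lookahead, (?=.*word), per pattern and concatenates them, so grepl returns TRUE only for the elements that contain every word. That logical vector is then used to subset character_vector.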
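To make the lookahead approach easier to follow, here is a small sketch that prints the regex paste0 produces before using it. The variable names regex and matches are introduced only for illustration.

```r
character_vector <- c("abc ghi jkl mno def", "pqr abc def", "abc jkl pqr")
pattern <- c("def", "pqr")

# Each word becomes a lookahead; collapse = "" glues them into one pattern
regex <- paste0("(?=.*", pattern, ")", collapse = "")
print(regex)  # "(?=.*def)(?=.*pqr)"

# Lookaheads need the PCRE engine, hence perl = TRUE
matches <- grepl(regex, character_vector, perl = TRUE)
print(character_vector[matches])
```

Since every lookahead is checked from the same starting position, the order of the words in the string does not matter.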

Since you are looking for the intersection, you can use the function intersect() and spell out the two patterns you are looking for:
pattern_1 <- 'pqr'
pattern_2 <- 'def'
intersect(
str_subset(character_vector, pattern_1),
str_subset(character_vector, pattern_2)
)

Will this work?
character_vector %>% purrr::reduce(pattern, str_subset, .init = . )
[1] "pqr abc def"

Related

Substitute strings by their first match in a dictionary

I have a vector long_strings defined as
long_strings <- c("*/1/1/1/1", "*/1/2/1/1", "*/2/1",
"*/2/2/1", "*/3/1/1/1")
and I have a dictionary, short_strings, containing the initial patterns (of differing lengths) of those strings, for example
short_strings <- c("*/1/1", "*/3", "*/2", "*/1/2")
How can I "simplify" the contents of long_strings to match their corresponding value in short_strings?
The results should look like
"*/1/1", "*/1/2", "*/2", "*/2", "*/3"
I can find the occurrences of a single element of short_strings using grep("\\*/2", long_strings), but I want to avoid looping over short_strings.
An option with sapply
as.character(with(stack(sapply(setNames(paste0("\\", short_strings), short_strings),
grep, x = long_strings)), ind[order(values)]))
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
Or using str_extract
library(stringr)
str_extract(long_strings, str_c(str_c("\\", short_strings), collapse="|"))
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
We can programmatically create a capture group and use it in sub to extract it
sub(paste0(".*(",paste0("\\", short_strings, collapse = "|"), ").*"), "\\1",long_strings)
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
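The regex answers above depend on escaping the leading * and, if one dictionary entry were a prefix of another, on the order of the alternatives. As a hedged alternative, here is a minimal sketch that treats the dictionary entries as literal prefixes and keeps the longest one. The helper longest_prefix is hypothetical (not from any package), and vapply is still an implicit loop.

```r
long_strings <- c("*/1/1/1/1", "*/1/2/1/1", "*/2/1", "*/2/2/1", "*/3/1/1/1")
short_strings <- c("*/1/1", "*/3", "*/2", "*/1/2")

# Hypothetical helper: longest dictionary entry that is a literal prefix of x
longest_prefix <- function(x, dict) {
  hits <- dict[startsWith(x, dict)]
  if (length(hits) == 0) return(NA_character_)  # no dictionary entry applies
  hits[which.max(nchar(hits))]
}

result <- vapply(long_strings, longest_prefix, character(1),
                 dict = short_strings, USE.NAMES = FALSE)
print(result)
```

Strings with no matching dictionary entry come back as NA instead of being passed through unchanged.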

Apply a regex only to the first word of a phrase (defined with spaces)

I have this regex to separate letters from numbers (and symbols) of a word: (?<=[a-zA-Z])(?=([[0-9]|[:punct:]])). My test string is: "CALLE15 CRA22".
I want to apply this regex only to the first word of that sentence (words are separated by spaces). Namely, I want to apply it only to "CALLE15".
One solution is to split the string (sentence) into words and then apply the regex to the first word, but I want to do it all in one regex. Another solution is to use stringr::str_replace() (or sub()), which replaces only the first match, but I need stringr::str_replace_all() (or gsub()) for other reasons.
What I need is to insert a space between the two parts, which I do with the replacement function. The outcome I want is "CALLE 15 CRA22", with the possibility of "CALLE15 CRA 22". I tried many positions for the space, and also ^ at the beginning, but nothing worked.
https://rubular.com/r/7dxsHdOA3avTdX
Thanks for your help!!!!
I am unsure about your problem statement (see my comment above), but the following reproduces your expected output and uses str_replace_all
ss <- "CALLE15 CRA22"
library(stringr)
str_replace_all(ss, "^([A-Za-z]+)(\\d+)(\\s.+)$", "\\1 \\2\\3")
#[1] "CALLE 15 CRA22"
Update
To reproduce the output of the sample string from the comment above
ss <- "CLL.6 N 5-74NORTE"
pat <- c(
"(?<=[A-Za-z])(?![A-Za-z])",
"(?<![A-Za-z])(?=[A-Za-z])",
"(?<=[0-9])(?![0-9])",
"(?<![0-9])(?=[0-9])")
library(stringr)
str_split(ss, sprintf("(%s)", paste(pat, collapse = "|"))) %>%
unlist() %>%
.[nchar(trimws(.)) > 0] %>%
paste(collapse = " ")
#[1] "CLL . 6 N 5 - 74 NORTE"
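For the original "first word only" requirement, one possible base R sketch anchors the match with ^ so that even gsub can only match at the start of the string. This is an illustration under the assumption that the first word begins with letters; it is not the answerer's code.

```r
ss <- "CALLE15 CRA22"

# ^ pins the match to the start of the string, so gsub finds at most one match.
# The lookahead requires a digit or punctuation character right after the letters.
first_word_split <- gsub("^([A-Za-z]+)(?=[0-9[:punct:]])", "\\1 ", ss, perl = TRUE)
print(first_word_split)

# A first word made of letters only is left untouched
unchanged <- gsub("^([A-Za-z]+)(?=[0-9[:punct:]])", "\\1 ", "CALLE CRA22", perl = TRUE)
print(unchanged)
```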

Extract substring in R using grepl

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.
grepl("Work{*}.csv", data$filename)
Basically, I want to extract whatever is between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
Please note:
gsub is needed here rather than sub, because the pattern removes two separate matches: first ^.*Work, then \\.csv$.
Constructs such as [\\s\\S] or [\\d\\D] do not work with sub/gsub under the default regex engine.
https://regex101.com/r/wFgkgG/1
They do work with akrun's approach, which uses perl = TRUE:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
Note that . also matches \n when using R's default regex engine (perl = FALSE).
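The three gsub calls above can be rerun with their results stored, which makes the difference between the two engines visible. The variable names are introduced here only for illustration.

```r
str1 <- "12\n.2\n12"

# A negated class matches everything except a literal dot, including newlines
negated <- gsub("[^.]", "m", str1, perl = TRUE)

# With PCRE (perl = TRUE), . does not match a newline by default
dot_pcre <- gsub(".", "m", str1, perl = TRUE)

# With the default engine (perl = FALSE), . matches the newlines as well
dot_default <- gsub(".", "m", str1, perl = FALSE)

print(c(negated, dot_pcre, dot_default))
```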
Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

R grepl - Matching Pattern to String

I am using grepl() in R to match patterns to a string.
I need to match multiple strings to a common string and return TRUE if they all match.
For example:
a <- 'DEARBORN TRUCK INCDBA'
b <- 'DEARBORN TRUCK INC DBA'
I want to see if all words in variable b are also in variable a.
I can't just use grepl(b, a) because the patterns (spaces) aren't the same.
It seems like it should be something like this:
grepl('DEARBORN&TRUCK&INC&DBA', a)
or
grepl('DEARBORN+TRUCK+INC+DBA', a)
but neither works. I need to compare each individual word in b to a. In this case, since all the words exist in a, it should return TRUE.
Thanks!
Use strsplit to split b into words and then use sapply to perform a grepl on each such word. The result will be a logical vector, and if it is all TRUE then return TRUE:
all(sapply(strsplit(b, " ")[[1]], grepl, a))
giving:
[1] TRUE
Note: If you are only looking to determine if a and b are the same aside from spaces then remove the spaces from both and compare what is left:
gsub(" ", "", a) == gsub(" ", "", b)
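If the words may contain regex metacharacters (for example a "." in "INC."), a literal match is safer. Here is a minimal sketch of the strsplit approach wrapped in a function; contains_all_words is a hypothetical helper name, and fixed = TRUE is an assumption that literal matching is what is wanted.

```r
# Hypothetical helper: TRUE if every space-separated word of `words` occurs in `target`
contains_all_words <- function(words, target) {
  parts <- strsplit(words, " ", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]  # drop empty pieces caused by repeated spaces
  all(vapply(parts, grepl, logical(1), x = target, fixed = TRUE))
}

a <- 'DEARBORN TRUCK INCDBA'
b <- 'DEARBORN TRUCK INC DBA'

print(contains_all_words(b, a))
print(contains_all_words('DEARBORN VAN', a))
```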

Remove part of a string

How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and an underscore (_). The ? makes the match "lazy" so that it only matches as far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references.
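The effect of the lazy quantifier only shows up when the string contains more than one underscore. Here is a short sketch with a made-up input; perl = TRUE is passed only so that the lazy-quantifier semantics are those of PCRE.

```r
x <- "AT_GAS_1121"  # hypothetical input with two underscores

# Lazy: stops at the first underscore
lazy <- sub("^.*?_", "_", x, perl = TRUE)

# Greedy: runs to the last underscore
greedy <- sub("^.*_", "_", x, perl = TRUE)

print(c(lazy, greedy))
```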
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit returns a list containing the pieces of the string, split on the split parameter. That's probably not what you want, so wrap the call in unlist, then index the resulting vector so that only the second of the two elements is returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal matching character.
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
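One thing to keep in mind with the strsplit approach is what happens when an element has no underscore at all. The sketch below uses a made-up second element to show this, and contrasts it with a sub-based variant that leaves such strings unchanged.

```r
s <- c("TGAS_1121", "NOUNDERSCORE")  # second element is a hypothetical edge case

# strsplit yields a single piece for the second element, so x[2] is NA
second_piece <- sapply(strsplit(s, split = "_", fixed = TRUE), function(x) x[2])
print(second_piece)

# sub only rewrites strings that actually contain an underscore
via_sub <- sub("^[^_]*_", "", s)
print(via_sub)
```

Which behaviour is preferable depends on whether a missing underscore should be flagged (NA) or passed through.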
Probably the most intuitive solution is the stringr function str_remove, which is even easier than str_replace because it needs only the pattern and no replacement argument.
The only tricky part in your example is that you want to keep the underscore, but it's possible: use a positive lookahead, (?=pattern), so that the match stops just before the specified pattern without consuming it.
See example:
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
strings %>% stringr::str_remove(".+?(?=_)")
[1] "_1121" "_1432" "_1121"
Here is the strsplit solution for a data frame, using the dplyr package:
library(dplyr)
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
        col1 col2
1  TGAS_1121    T
2  MGAS_1432    M
3 ATGAS_1121    A
df<-mutate(df,col1=as.character(col1))
df2<-mutate(df,col1=sapply(strsplit(df$col1, split='_', fixed=TRUE),function(x) (x[2])))
df2
  col1 col2
1 1121    T
2 1432    M
3 1121    A
