Detect substring within a string while not considering part of the substring - r

I'm trying to check whether string B is contained by string A and this is what I tried:
library(stringr)
string_a <- "something else free/1a2b a bird yes"
string_b <- "free/xxxx a bird"
str_detect(string_a, string_b)
I would expect a match (TRUE), because the part of string_b after the "/" and before the next white space should be ignored, which is why I wrote "/xxxx".
In other words, "xxxx" should stand for any sequence of letters or digits in that position. Is there a notation for ignoring part of a string when matching like this?

Yes, in regex you can use .* to match zero or more characters.
library(stringr)
string_a <- "something else free/1a2b a bird yes"
string_b <- "free/xxxx a bird"
string_c <- "free/.*a bird"
str_detect(string_a, string_c)
#[1] TRUE
If you cannot change string_b at source, you may use str_replace_all or gsub to replace xxxx with '.*'.
str_detect(string_a, str_replace_all(string_b, 'x+', '.*'))
#[1] TRUE
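If the placeholder should only cover the characters up to the next white space, as the question describes, .* may be too permissive because it also matches across spaces. A minimal base R sketch, assuming the placeholder is always the literal text xxxx and that the rest of string_b contains no regex metacharacters (the names parts and pattern are only for illustration):

```r
string_a <- "something else free/1a2b a bird yes"
string_b <- "free/xxxx a bird"

# Split the template on the literal placeholder, then join the pieces
# with \S+ (one or more non-whitespace characters).
parts <- strsplit(string_b, "xxxx", fixed = TRUE)[[1]]
pattern <- paste(parts, collapse = "\\S+")

# The placeholder region "1a2b" contains no white space, so this matches
grepl(pattern, string_a, perl = TRUE)
#[1] TRUE

# A space inside the placeholder region no longer matches
grepl(pattern, "something else free/1a 2b a bird yes", perl = TRUE)
#[1] FALSE
```

With .* in place of \S+, the second call would return TRUE, which may or may not be what you want.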

Related

How to extract words containing combinations of certain characters in R

In this sample text:
turns <- tolower(c("Does him good to stir him up now and again .",
"When , when I see him he w's on the settees .",
"Yes it 's been eery for a long time .",
"blissful timing , indeed it was "))
I'd like to extract all words that contain the letters y and e, in any position or combination, namely yes and eery, using str_extract from stringr.
This regex, in which y must occur immediately before e, unsurprisingly matches only yes but not eery:
unlist(str_extract_all(turns, "\\b([a-z]+)?ye([a-z]+)?\\b"))
[1] "yes"
Putting y and e into a character class doesn't give the desired result either, because every word containing either y or e is matched:
unlist(str_extract_all(turns, "\\b([a-z]+)?[ye]([a-z]+)?\\b"))
[1] "does" "when" "when" "see" "he" "the" "settees" "yes" "been" "eery" "time" "indeed"
So what is the right solution?
You may use both base R and stringr approaches:
stringr::str_extract_all(turns, "\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE))
Or, without turning the strings to lower case, you may use a case insensitive matching with (?i):
stringr::str_extract_all(turns, "(?i)\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE, ignore.case=TRUE))
See the regex demo and the R demo. Also, if you want to make it a tiny bit more efficient, you may use the principle of contrast in the lookahead patterns: match any letter but y in the first and any letter but e in the second, using character class subtraction:
stringr::str_extract_all(turns, "(?i)\\b(?=[\\p{L}--[y]]*y)(?=[\\p{L}--[e]]*e)\\p{L}+\\b")
Details
(?i) - case insensitive modifier
\b - word boundary
(?=\p{L}*y) - after 0 or more Unicode letters, there must be y ([\p{L}--[y]]* matches any 0 or more letters but y up to the first y)
(?=\p{L}*e) - after 0 or more Unicode letters, there must be e ([\p{L}--[e]]* matches any 0 or more letters but e up to the first e)
\p{L}+ - 1 or more Unicode letters
\b - word boundary
If there is no pressing need to use stringr::str_extract, you can get the words containing both y and e in base R with strsplit and grepl:
tt <- unlist(strsplit(turns, " "))
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "eery"
In case you have letter chunks between words:
turns <- c("yes no ay ae 012y345e year.")
tt <- regmatches(turns, gregexpr("\\b[[:alpha:]]+\\b", turns))[[1]]
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "year"
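The same base R idea can be generalised to any set of required letters. This is only a sketch: words_with_all is a hypothetical helper name, and it extracts plain runs of letters without the \b checks used above, so a chunk like 012y345e would contribute the separate "words" y and e.

```r
# Hypothetical helper: keep the words that contain every letter in `needed`
words_with_all <- function(x, needed) {
  # Pull out all runs of alphabetic characters from every element of x
  words <- unlist(regmatches(x, gregexpr("[[:alpha:]]+", x)))
  # One logical vector per required letter, combined with element-wise AND
  hits <- lapply(needed, function(l) grepl(l, words, fixed = TRUE))
  keep <- Reduce(`&`, hits, rep(TRUE, length(words)))
  words[keep]
}

res <- words_with_all(c("yes it 's been eery", "blissful timing"), c("y", "e"))
res
# a character vector holding "yes" and "eery"
```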

Removing repeated, nonpunctuation character from a string

I have a string in R containing a repeated non-punctuation character (a pound sign). I want to collapse each run of pound signs "#" into a single one, so that one "#" still separates the words in the string. The number of pound signs between words is random and not always the same.
For example:
String="##Hello####World#Happy#######New###Ye#r!"
transform into
String_New="#Hello#World#Happy#New#Ye#r!"
Does the gsub command handle non-punctuation signs?
We need to specify +, i.e. one or more of the character, in the pattern and use a single # as the replacement:
gsub("#+", "#", String)
#[1] "#Hello#World#Happy#New#Ye#r!"
Here is a quick way that works for this particular string, although it depends on the specific run lengths in the input:
a <- "##Hello####World#Happy#######New###Year"
b <- gsub('#######', '#', a)
b <- gsub('###', '#', b)
b <- gsub('##', '#', b)
And yes, you can handle non-punctuation signs as well if you desire.
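To see why the quantifier version is the more general one, here is a small sketch with a made-up input (the string s is an assumption, not from the question) whose run lengths differ from the example above:

```r
s <- "a#####b####c"

# Chained fixed-length replacements, as in the answer above
step <- gsub("#######", "#", s)
step <- gsub("###", "#", step)
step <- gsub("##", "#", step)
step
#[1] "a##b#c"

# A single pattern with + collapses runs of any length
collapsed <- gsub("#+", "#", s)
collapsed
#[1] "a#b#c"
```

The chained version leaves a double # behind here, because a run of five shrinks to three and then to two rather than to one.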

Regular Expression String Detect

Say I have a string such as "J1P3V9". I also have strings such as "0H44J4". I want to detect only strings that follow the first pattern of Letter, Number, Letter, Number, Letter, Number.
What is a regex expression to match only these instances?
This regex does your job,
\b([A-Z]\d){3}\b
\b makes sure it doesn't match partially in a bigger string.
In case you want to include lowercase alphabets too, the regex becomes,
\b([a-zA-Z]\d){3}\b
Try the following regex.
s <- c("J1P3V9", "0H44J4")
pattern <- paste(rep("[[:alpha:]][[:digit:]]", 3), collapse = "")
grep(pattern, s, value = TRUE)
#[1] "J1P3V9"
You might use
\b(?:[a-zA-Z]\d){3}\b
See a demo on regex101.com.
Or, more verbose but not supported in R:
(?(DEFINE)
(?<letter>[a-zA-Z])
(?<number>\d)
)
\b(?:(?&letter)(?&number)){3}\b
Jokes aside, don't rely on \w, which is a shortcut for [a-zA-Z0-9_] and will most likely match more than you want.
You can use this regex:
(?:[A-Z]\\d){3}
Usage:
mystring <- c("J1P3V9", "0H44J4")
grepl("(?:[A-Z]\\d){3}", mystring)
# [1] TRUE FALSE
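Note that an unanchored pattern reports TRUE whenever the letter-digit sequence occurs anywhere inside a longer string. If the whole string must have exactly that shape, anchoring with ^ and $ is one option. A short sketch; the third test string is made up for illustration, and perl = TRUE is used so that the (?:...) group is handled by PCRE:

```r
x <- c("J1P3V9", "0H44J4", "AJ1P3V9")

# Unanchored: "AJ1P3V9" contains "J1P3V9", so it is accepted
unanchored <- grepl("(?:[A-Z][0-9]){3}", x, perl = TRUE)
unanchored
# TRUE FALSE TRUE

# Anchored: the entire string must be letter-digit three times
anchored <- grepl("^(?:[A-Z][0-9]){3}$", x, perl = TRUE)
anchored
# TRUE FALSE FALSE
```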

Match elements from a character range n times

Assume I have a string like this:
id = "ce91ffbe-8218-e211-86da-000c29e211a0"
What regex can I write in R that will verify that this string is 36 characters long and only contains letters, numbers, and dashes?
There is nothing in the documentation on how to use a character range (e.g. [0-9A-z-]) with a quantifier (e.g. {36}). The following code is always returning TRUE regardless of the quantifier. I'm sure I'm missing something simple here...
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
grepl("[0-9A-z-]{36}", id)
#> [1] TRUE
grepl("[0-9A-z-]{34}", id)
#> [1] TRUE
This behavior only starts when I add the check for the numbers 0-9 in the character range.
Could you please try the following:
grepl("^[0-9a-zA-Z-]{36}$",id)
OR
grepl("^[[:alnum:]-]{36}$",id)
Running it gives the following output:
grepl("^[0-9a-zA-Z-]{36}$",id)
[1] TRUE
Explanation (the call is split across lines for annotation only; it is not meant to be run in this form):
grepl(" ##grepl returns TRUE or FALSE depending on whether the regex matches.
^ ##^ anchors the match at the start of the string.
[[:alnum:]-] ##Character class [[:alnum:]] plus a dash (-): matches any letter, digit, or dash.
{36} ##Match exactly 36 occurrences of the preceding character class.
$", ##$ anchors the match at the end of the string, so together with ^ the whole value must match.
id) ##The variable being tested.
You want to use:
^[0-9a-z-]{36}$
^ Assert position start of line.
[0-9a-z-] Character set for numbers, letters a to z and dashes -.
{36} Match preceding pattern 36 times.
$ Assert position end of line.
Try it here.
If the string can have other characters before or after the target characters, try
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
grepl("^[^[:alnum:]-]*[[:alnum:]-]{36}[^[:alnum:]-]*$", id)
#[1] TRUE
grepl("^[^[:alnum:]-]*[[:alnum:]-]{34}[^[:alnum:]-]*$", id)
#[1] FALSE
And this will still work.
id2 <- paste0(":+)!#", id)
grepl("^[^[:alnum:]-]*[[:alnum:]-]{36}[^[:alnum:]-]*$", id2)
#[1] TRUE
grepl("^[^[:alnum:]-]*[[:alnum:]-]{34}[^[:alnum:]-]*$", id2)
#[1] FALSE
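The behaviour in the question comes from two separate things: without ^ and $ the pattern only has to match somewhere inside the string, and the range A-z also covers the punctuation characters that sit between Z and a in ASCII. A small sketch illustrating both, using perl = TRUE so that ranges are interpreted by code point (the test string "a_b^c" is made up for illustration):

```r
id <- "ce91ffbe-8218-e211-86da-000c29e211a0"
nchar(id)
#[1] 36

# Unanchored: any 34 consecutive allowed characters are enough
grepl("[0-9A-Za-z-]{34}", id, perl = TRUE)
#[1] TRUE

# Anchored: the whole string must be exactly that long
grepl("^[0-9A-Za-z-]{34}$", id, perl = TRUE)
#[1] FALSE
grepl("^[0-9A-Za-z-]{36}$", id, perl = TRUE)
#[1] TRUE

# A-z also accepts characters such as _ and ^
grepl("^[A-z]+$", "a_b^c", perl = TRUE)
#[1] TRUE
grepl("^[A-Za-z]+$", "a_b^c", perl = TRUE)
#[1] FALSE
```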

Retrieving a specific part of a string in R

I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
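One thing to keep in mind: [^=]*= removes everything up to the first equal sign. If a string could contain more than one = and you want what follows the last one, a greedy .*= does that instead. A sketch with a made-up URL (the value of x is an assumption, not from the question):

```r
x <- "/players/playerpage.htm?a=1&ilkid=THOMPWIL01"

# Everything after the FIRST "="
after_first <- sub("[^=]*=", "", x)
after_first
#[1] "1&ilkid=THOMPWIL01"

# Everything after the LAST "=" (greedy .* runs as far as it can)
after_last <- sub(".*=", "", x)
after_last
#[1] "THOMPWIL01"
```

For the vector in the question, which has exactly one = per element, both patterns give the same result.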
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector:
sapply(str1, function(x) {
  strsplit(x, "=")[[1]][2]
}, USE.NAMES = FALSE)
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, which provides a more powerful regex engine than base R's default one (in particular, it supports look-behind).
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any alphanumeric character or underscore, so you could use "(?<==)\\w+$" (the \ must be escaped in an R string, so you end up with \\w).
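If you would rather avoid an extra package, the same look-behind pattern should also work in base R, provided perl = TRUE is set so that the PCRE engine is used. A minimal sketch with regexpr and regmatches (the name ids is only for illustration):

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# regexpr() locates the first match in each element;
# regmatches() extracts the matched text
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
ids
# a character vector with the three IDs after the "="
```

Note that regmatches drops elements that do not match at all, so the result can be shorter than the input if some strings have no equal sign.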

Resources