Alternation with quantifiers in gregexpr and str_extract_all functions - r

require(stringr)
gregexpr(pattern = "(h|a)*", "xxhx")
[[1]]
[1] 1 2 3 4
attr(,"match.length")
[1] 0 0 1 0
attr(,"useBytes")
[1] TRUE
str_extract_all(pattern = "(h|a)*", "xxhx")
[[1]]
[1] "" "" "h" "" ""
Why does gregexpr report 3 empty matches while str_extract_all reports 4?

This comes down to how the TRE (gregexpr) and ICU (str_extract_all) regex engines deal with empty (also called "zero-length") matches. TRE advances the regex index after a zero-length match, while ICU allows testing the same position twice.
The positions tried by each engine become obvious if you use the replacement functions:
> gsub("(h|a)*", "-\\1", "xxhx")
[1] "-x-x-hx-"
> str_replace_all("xxhx", "(h|a)*", "-\\1")
[1] "-x-x-h-x-"
After matching h, the TRE engine moved its index past the following x, while the ICU engine stopped right after h, before x, and matched the empty position there.
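If the empty matches are not wanted in the first place, one way to sidestep the engine difference is to require at least one character with + instead of *. A minimal base R sketch (the variable names are only illustrative, not from the question):

```r
s <- "xxhx"

# "+" requires at least one character, so zero-length matches cannot occur
m <- gregexpr("(h|a)+", s)

# Start position and length of the single non-empty match
pos <- as.integer(m[[1]])
len <- attr(m[[1]], "match.length")

# The matched text itself
hits <- regmatches(s, m)[[1]]
print(hits)
```

With no zero-length matches in play, there is nothing left for the two engines to disagree about.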

Related

Regular expression

> y <- "<dd> Hello world hello i love you hello birthday thank you I am fine and you"
> regexpr("[H|h]ello", y)
[1] 6
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
> regexpr("[H|h]ello(.*)", x)
[1] 1
attr(,"match.length")
[1] 65
attr(,"useBytes")
[1] TRUE
In the case above, what does (.*) mean? Is it regular expression or something else?
Yes, it's a regular expression. In a regular expression, . represents "any character" and * means "occurring zero or more times". So the regex "[H|h]ello(.*)" matches a sequence of characters that starts with either "H" or "h", is followed by the exact characters "ello", and may or may not be followed by an arbitrary sequence of characters of any length. The parentheses, ( and ), around the .* mean that this arbitrary sequence is "captured" by the regex so that it can be used elsewhere. Note that inside square brackets | is a literal character, so [H|h] also matches "|"; [Hh] is the more precise way to write it.
In lay language, the purpose of the regex is to find out what you are saying after you've said hello.
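To see what the capture group actually holds, here is a small base R sketch using regexec and regmatches. It reuses the string y from the question; the other variable names are made up for illustration:

```r
y <- "<dd> Hello world hello i love you hello birthday thank you I am fine and you"

# regexec() returns the full match plus each parenthesised group
parts <- regmatches(y, regexec("[Hh]ello(.*)", y))[[1]]

whole_match <- parts[1]  # from the first "Hello" to the end of the string
captured    <- parts[2]  # only what (.*) matched, i.e. the text after "Hello"

# Inside [...] the | is an ordinary character, so [H|h] also matches "|"
pipe_matches <- grepl("[H|h]ello", "|ello")

print(captured)
print(pipe_matches)
```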

Find position of last match in R with regexpr

I hope this question is appropriate.
If I have a string like:
> x <- "ABCDDDDDABC"
If I use regexpr it returns me the position (and length) of the first match, while if I use gregexpr it returns position and length of all matches.
> regexpr("ABC",x)
[1] 1
attr(,"match.length")
[1] 3
attr(,"useBytes")
[1] TRUE
> gregexpr("ABC",x)
[[1]]
[1] 1 9
attr(,"match.length")
[1] 3 3
attr(,"useBytes")
[1] TRUE
What I want to do is find only the position of the last ABC; I don't care about any before it.
My desired output would be:
[1] 9
attr(,"match.length")
[1] 3
attr(,"useBytes")
[1] TRUE
That is, only the position of the last ABC is returned. Is there a simple way to do this, maybe with a wildcard I am unaware of? I have looked in the forum and online but I cannot find a universal solution.
IMPORTANT NOTE: I do not want a regex that takes into account the context around the ABC in this particular string. I want something that gives me the position of the last ABC in any string I may feed it.
We can use the convenient stri_locate_last_regex function from stringi:
library(stringi)
as.vector(stri_locate_last_regex(x, "ABC")[,1])
#[1] 9
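If adding a package is not an option, a base R sketch along the same lines is to take all matches from gregexpr and keep the last one. This is an alternative to the stringi call above, not part of that answer, and the variable names are illustrative:

```r
x <- "ABCDDDDDABC"

# All matches, as returned by gregexpr() for the single input string
all_matches <- gregexpr("ABC", x)[[1]]

# Keep only the last one; if nothing matched, gregexpr() returns -1 here
n <- length(all_matches)
last_pos <- all_matches[n]
last_len <- attr(all_matches, "match.length")[n]

print(last_pos)
print(last_len)
```

Because it only indexes into the gregexpr result, it makes no assumption about what surrounds the pattern in the string.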

R regex extract numbers from string depending of context

s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq')
How can I use a regex to get the numbers that are beside at least one underscore ("_")? In effect I would like to get outputs like this:
> output # The result
[1] 1 2
> output_l # Alternatively
[1] TRUE TRUE FALSE FALSE
We can use regex lookarounds:
grep("(?<=_)\\d+", s, perl = TRUE)
#[1] 1 2
grepl("(?<=_)\\d+", s, perl = TRUE)
#[1] TRUE TRUE FALSE FALSE
If you need to get just indices, use grep with a simple TRE regex (no lookarounds are necessary):
> grep("_\\d+", s)
[1] 1 2
To get the numbers themselves, use a PCRE regex with a positive lookbehind with regmatches / gregexpr:
> unlist(regmatches(s, gregexpr("(?<=_)[0-9]+", s, perl=TRUE)))
[1] "1" "2"
Details:
(?<=_) - a positive lookbehind that requires _ to appear immediately to the left of the current position
[0-9]+ - 1+ digits
EDIT: If the digits to the left of _ should also be considered, use 1) "(^|_)\\d|\\d(_|$)" with the grep solution and 2) "(?<![^_])\\d+|\\d+(?![^_])" with the number extraction solution.
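A quick sketch of the two patterns from the EDIT in action. The vector s2 is the question's s with one made-up element, '3_xyz', appended so that there is a digit to the left of an underscore:

```r
# '3_xyz' is a made-up extra element with a digit to the left of "_"
s2 <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq', '3_xyz')

# Indices of elements with a digit next to "_" on either side
idx <- grep("(^|_)\\d|\\d(_|$)", s2)

# The numbers themselves (PCRE is needed for the lookarounds)
nums <- unlist(regmatches(
  s2,
  gregexpr("(?<![^_])\\d+|\\d+(?![^_])", s2, perl = TRUE)
))

print(idx)
print(nums)
```

Note that both patterns also accept a digit at the very start or end of the string, since ^ and $ are allowed in place of the underscore.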
Using this regex:
[_]([0-9]){1}
and selecting group 1, you'll get your digit. If you want more than one digit, use
[_]([0-9]+)
Neither pattern matches the last two strings.
You can test it with this tool: https://regex101.com/
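In R itself, "selecting group 1" can be done with regexec and regmatches. A minimal sketch, assuming the s vector from the question (the helper variable names are illustrative):

```r
s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq')

# For each string: the whole match first, then the capture groups
m <- regmatches(s, regexec("[_]([0-9]+)", s))

# Element 1 of each result is the whole match, element 2 is group 1;
# strings without a match give character(0), so indexing yields NA
group1 <- vapply(m, function(parts) parts[2], character(1))

print(group1)
```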
with stringr:
s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq', 'a_1_b')
library(stringr)
which(!is.na(str_match(s, '_\\d|\\d_')))
# [1] 1 2 5

plot function type

For digits I have done so:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use the [:punct:] class to detect punctuation. It matches
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either with gregexpr
x = c("we are friends!, Good Friends!!")
R> gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear
counts = stringi::stri_count_fixed(x, punct)
Name the vector
setNames(counts, punct)
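For comparison, a base R sketch that gets per-mark counts without stringi. This is a different technique from the one above (it tabulates the characters instead of counting fixed patterns), so it only reports marks that actually occur in x rather than zeros for the rest:

```r
x <- "we are friends!, Good Friends!!"

# Split into single characters and keep only the punctuation
chars <- strsplit(x, "")[[1]]
marks <- chars[grepl("[[:punct:]]", chars)]

# Tabulate: one count per distinct punctuation mark found in x
counts <- table(marks)
print(counts)
```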
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression
might help too.

Get indices of all character elements matches in string in R

I want to get the indices of all occurrences of certain characters in a word. Assume the characters I am looking for are: l, e, a, z.
I tried the following regex in the grep function, and tens of modifications of it, but I keep not getting what I want.
grep("/([leazoscnz]{1})/", "ylaf", value = F)
gives me
numeric(0)
where I would like:
[1] 2 3
To make grep work with the individual characters of a string, you first need to split the string into a vector of single characters. You can use strsplit for this:
strsplit("ylaf", split="")[[1]]
[1] "y" "l" "a" "f"
Next you need to simplify your regular expression (drop the slashes, which R treats as literal characters) and try grep again:
grep("[leazoscnz]", strsplit("ylaf", split="")[[1]])
[1] 2 3
But it is easier to use gregexpr:
gregexpr("[leazoscnz]", "ylaf")
[[1]]
[1] 2 3
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
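To end up with exactly the plain index vector asked for, the attributes can be dropped from the gregexpr result. A small sketch (variable names are illustrative); it also shows why the original attempt found nothing:

```r
word <- "ylaf"

# gregexpr() returns a list with one element per input string;
# as.integer() strips the attributes and leaves just the positions
idx <- as.integer(gregexpr("[leazoscnz]", word)[[1]])

# The slashes in the original attempt are literal characters in R,
# so that pattern finds nothing in "ylaf"
with_slashes <- grepl("/([leazoscnz]{1})/", word)

print(idx)
print(with_slashes)
```

Keep in mind that gregexpr returns -1 when there is no match at all, so that case may need a separate check.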

Resources