Regular expression - r

> y <- "<dd> Hello world hello i love you hello birthday thank you I am fine and you"
> regexpr("[H|h]ello", y)
[1] 6
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
> regexpr("[H|h]ello(.*)", x)
[1] 1
attr(,"match.length")
[1] 65
attr(,"useBytes")
[1] TRUE
In the case above, what does (.*) mean? Is it regular expression or something else?

Yes, it's a regular expression. In a regular expression, . represents "any character" and * means "occurring zero or more times". So the regex "[H|h]ello(.*)" matches a sequence of characters that starts with either "H" or "h", is followed by the exact sequence of characters "ello", and may or may not be followed by an arbitrary sequence of characters of any length. (Inside square brackets, | is a literal character rather than an alternation operator, so [H|h] also matches "|"; [Hh] is the stricter form.) The parentheses, ( and ), around the .* mean that the arbitrary sequence of characters is "captured" by the regex so that it can be used elsewhere.
In lay language, the purpose of the regex is to find out what you are saying after you've said hello.
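To make the "captured" part concrete, here is a minimal base R sketch. It uses a shortened version of the string from the question and the stricter [Hh] class; the variable names are only illustrative.

```r
# Shortened sample string (an assumption for illustration, not the asker's data)
y <- "<dd> Hello world hello i love you"

# regexec() returns the whole match plus each capture group
parts <- regmatches(y, regexec("[Hh]ello(.*)", y))[[1]]
whole_match <- parts[1]  # everything from the first "Hello" to the end
captured    <- parts[2]  # only what (.*) captured, i.e. the text after "Hello"

# Inside a bracket expression, | is a literal character
pipe_matches <- grepl("[H|h]ello", "|ello")

print(whole_match)
print(captured)
print(pipe_matches)
```

The first element returned by regexec() is always the full match; capture groups follow in order.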

Related

Extract all digits values after first underscore

I want to extract the number after the 1st underscore (_), but I don't know why only a single digit is selected.
My sample data is:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("(.*_){1}(\\d)_.+", "\\2", myvec))
[1] 0 9 NA
Warning message:
NAs introduced by coercion
I'd like:
[1] 0 9 25
Please, any help with it?
Some explanation. We are interested in digits coming after _. [0-9] matches a digit, and the + says that we want to match one or more digits in a row. (?<=_) 'looks behind' the digits and makes sure we are only matching digits preceded by a _. str_extract returns only the first such match in each string.
library(stringr)
str_extract(myvec, "(?<=_)[0-9]+")
[1] "0" "9" "25"
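If stringr is not available, the same lookbehind should also work in base R, provided perl = TRUE is set so that the PCRE engine is used. A minimal sketch with the sample data from the question:

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# regexpr() finds the first match per element; perl = TRUE enables lookbehind
first_num <- regmatches(myvec, regexpr("(?<=_)[0-9]+", myvec, perl = TRUE))

result <- as.numeric(first_num)
print(result)
```

One caveat: regmatches() silently drops elements that have no match, so the result can be shorter than the input if some strings contain no underscore followed by digits.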
Another possible solution, based on stringr::str_extract:
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(str_extract(myvec, "(?<=_)\\d+"))
#> [1] 0 9 25
You can use sub (because you will need a single search and replace operation) with a pattern like ^[^_]*_(\d+).*:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
sub("^[^_]*_(\\d+).*", "\\1", myvec)
# => [1] "0" "9" "25"
Regex details:
^ - start of string
[^_]* - a negated character class that matches any zero or more chars other than _
_ - a _ char
(\d+) - Group 1 (\1 refers to the value captured into this group from the replacement pattern): one or more digits
.* - the rest of the string (. in TRE regex matches line break chars by default).
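One thing to keep in mind with the sub approach: when the pattern does not match, sub returns the input unchanged, which then turns into NA (with a warning) under as.numeric. The sketch below guards against that; the second element of v is a made-up non-matching string, and the variable names are illustrative only.

```r
pattern <- "^[^_]*_(\\d+).*"

# "nounderscore" is a hypothetical input that the pattern cannot match
v <- c("increa_0_1-1", "nounderscore", "increa_25-50-76")

hit <- grepl(pattern, v)                 # which elements match at all
out <- rep(NA_character_, length(v))     # default to NA for non-matches
out[hit] <- sub(pattern, "\\1", v[hit])  # keep only Group 1 where matched

nums <- as.numeric(out)
print(nums)
```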
If you want to extract the first number after the first underscore, you can use a capture group with str_match and the pattern _([0-9]+)
Note that the character class [0-9] (or \\d) has to be repeated one or more times with +, otherwise only a single digit is matched.
For example
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
str_match(myvec, "_([0-9]+)")[,2]
Output
[1] "0" "9" "25"
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("[^_]*_(\\d+).*", "\\1", myvec))
[1] 0 9 25

R regex - extract words beginning with # symbol

I'm trying to extract twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "#" instead of "A"
str_extract_all(c("h#i", "hi #hello #me"), "(?<=\\b)\\#[^\\s]+")
[[1]]
[1] "#i"
[[2]]
character(0)
Why does this example give the opposite of the result I was expecting, and how can I fix it?
It looks like you probably mean
str_extract_all(c("h#i", "hi #hello #me", "#twitter"), "(?<=^|\\s)#[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "#hello" "#me"
# [[3]]
# [1] "#twitter"
The \b in a regular expression is a boundary and it occurs "Between two characters in the string, where one is a word character and the other is not a word character." Since a space and "#" are both non-word characters, there is no boundary before the "#".
With this revision you match either the start of the string or values that come after spaces.
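A quick way to see where the boundary falls is to test \b# and its negation \B# (a non-word boundary) against a few short strings. This is a small base R sketch with made-up inputs, using perl = TRUE:

```r
# Three illustrative inputs: "#" after a letter, after a space, at string start
x <- c("h#i", " #i", "#i")

# \b needs a word character on exactly one side of the position
after_boundary <- grepl("\\b#", x, perl = TRUE)

# \B matches wherever \b does not
after_non_boundary <- grepl("\\B#", x, perl = TRUE)

print(after_boundary)
print(after_non_boundary)
```

Only "h#i" has a word character next to the "#", so it is the only input where \b# matches.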
A couple of things about your regex:
(?<=\b) is the same as \b because a word boundary is already a zero width assertion
\# is the same as #, as # is not a special regex metacharacter and you do not have to escape it
[^\s]+ is the same as \S+, almost all shorthand character classes have their negated counterparts in regex.
So, your regex, \b#\S+, matches #i in h#i because there is a word boundary between h (a letter, a word char) and # (a non-word char, not a letter, digit or underscore).
\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B#\S+, and it will match # that are either preceded with a non-word char or at the start of the string.
x <- c("h#i", "hi #hello #me")
regmatches(x, gregexpr("\\B#\\S+", x))
## => [[1]]
## character(0)
##
## [[2]]
## [1] "#hello" "#me"
If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries using lookarounds with stringr methods or base R regex functions with perl=TRUE argument:
regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl=TRUE))
where:
(?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
(?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
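As an illustration of combining a left-hand and a right-hand whitespace boundary, the following sketch keeps only tags that are delimited by whitespace (or the string edges) on both sides. The input string and the use of \w+ for the tag body are assumptions made for this example.

```r
# Hypothetical input: "#me!" is followed by "!", so it should be rejected
x <- "hi #hello #me! #ok"

# (?<!\S) : no non-whitespace char to the left
# (?!\S)  : no non-whitespace char to the right
tags <- regmatches(x, gregexpr("(?<!\\S)#\\w+(?!\\S)", x, perl = TRUE))[[1]]

print(tags)
```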
The answer above should suffice. This will remove the # symbol in case you are trying to get the users' names only.
str_extract_all(c("#tweeter tweet", "h#is", "tweet #tweeter2"), "(?<=\\B\\#)[^\\s]+")
[[1]]
[1] "tweeter"
[[2]]
character(0)
[[3]]
[1] "tweeter2"
While I am no expert with regex, it seems like the issue may be that the # symbol does not correspond to a word character, and thus matching the empty string at the beginning of a word (\\b) does not work because there is no empty string when # is preceding the word.
Here are two great regex resources in case you hadn't seen them:
stat545
Stringr's Regex page, also available as a vignette:
vignette("regular-expressions", package = "stringr")

Alternation with quantifiers in gregexpr and str_extract_all function

require(stringr)
gregexpr(pattern = "(h|a)*", "xxhx")
[[1]]
[1] 1 2 3 4
attr(,"match.length")
[1] 0 0 1 0
attr(,"useBytes")
[1] TRUE
str_extract_all(pattern = "(h|a)*", "xxhx")
[[1]]
[1] "" "" "h" "" ""
Why does gregexpr indicate 3 empty matches while str_extract_all indicates 4?
This is the difference between how TRE (gregexpr) and ICU (str_extract_all) regex engines deal with empty (also called "zero length") regex matches. TRE regex advances the regex index after a zero length match, while ICU allows testing the same position twice.
It becomes obvious what positions are tried by both engines if you use replacing functions:
> gsub("(h|a)*", "-\\1", "xxhx")
[1] "-x-x-hx-"
> str_replace_all("xxhx", "(h|a)*", "-\\1")
[1] "-x-x-h-x-"
After matching h, the TRE engine moved its index past the following x, whereas the ICU engine stopped right after h and matched the empty location before x as well.
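If the empty matches are not wanted in the first place, one option is to replace * with +, so that the pattern can no longer match zero characters and the difference between the engines stops mattering for this input. A base R sketch:

```r
# "+" requires at least one "h" or "a", so zero-length matches are impossible
m <- gregexpr("(h|a)+", "xxhx")

starts    <- as.integer(m[[1]])            # start position(s) of the matches
match_len <- attr(m[[1]], "match.length")  # length of each match
found     <- regmatches("xxhx", m)[[1]]    # the matched text itself

print(starts)
print(match_len)
print(found)
```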

plot function type

For digits I have done so:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use the [:punct:] to detect punctuation. This detects
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either in gregexpr
x = c("we are friends!, Good Friends!!")
R> gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear
counts = stringi::stri_count_fixed(x, punct)
Decorate the vector
setNames(counts, punct)
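A base R alternative, sketched here without stringi, is to extract every punctuation character with gregexpr and tabulate the result. Unlike the vector approach above, it only reports marks that actually occur in the string.

```r
x <- "we are friends!, Good Friends!!"

# Pull out each punctuation character as its own element
marks <- regmatches(x, gregexpr("[[:punct:]]", x))[[1]]

# Count occurrences per distinct mark
tab <- table(marks)
print(tab)
```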
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression
might help too.

Get indices of all character elements matches in string in R

I want to get indices of all occurrences of character elements in some word. Assume these character elements I look for are: l, e, a, z.
I tried the following regex in the grep function and tens of its modifications, but I keep getting something other than what I want.
grep("/([leazoscnz]{1})/", "ylaf", value = F)
gives me
numeric(0)
where I would like:
[1] 2 3
To make grep work with the individual characters of a string, you first need to split the string into a vector of single characters. You can use strsplit for this:
strsplit("ylaf", split="")[[1]]
[1] "y" "l" "a" "f"
Next, simplify the regular expression (R patterns are not wrapped in / delimiters, and {1} is redundant) and run grep on the split characters:
grep("[leazoscnz]", strsplit("ylaf", split="")[[1]])
[1] 2 3
But it is easier to use gregexpr:
gregexpr("[leazoscnz]", "ylaf")
[[1]]
[1] 2 3
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
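To get a plain integer vector out of the gregexpr result, the attributes can be dropped with as.integer. Note that gregexpr reports -1 when nothing matches, so a small wrapper helps. The function name match_positions is hypothetical, and the character set is reduced to the four letters named in the question.

```r
# Hypothetical helper: positions of all matches, or integer(0) if none
match_positions <- function(pattern, text) {
  pos <- as.integer(gregexpr(pattern, text)[[1]])  # strips the attributes
  if (identical(pos, -1L)) integer(0) else pos     # -1 means "no match"
}

hits <- match_positions("[leaz]", "ylaf")
none <- match_positions("[leaz]", "bcd")

# Regex-free alternative: compare the split characters against a set
hits_in <- which(strsplit("ylaf", split = "")[[1]] %in% c("l", "e", "a", "z"))

print(hits)
print(none)
print(hits_in)
```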
