R regex: extract numbers from a string depending on context

s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq')
How can I use a regex to get the numbers that are beside at least one underscore ("_")? In effect, I would like to get outputs like this:
> output # The result
[1] 1 2
> output_l # Alternatively
[1] TRUE TRUE FALSE FALSE

We can use regex lookarounds:
grep("(?<=_)\\d+", s, perl = TRUE)
#[1] 1 2
grepl("(?<=_)\\d+", s, perl = TRUE)
#[1] TRUE TRUE FALSE FALSE

If you need to get just indices, use grep with a simple TRE regex (no lookarounds are necessary):
> grep("_\\d+", s)
[1] 1 2
To get the numbers themselves, use a PCRE regex with a positive lookbehind with regmatches / gregexpr:
> unlist(regmatches(s, gregexpr("(?<=_)[0-9]+", s, perl=TRUE)))
[1] "1" "2"
Details:
(?<=_) - a positive lookbehind that requires _ to appear immediately to the left of the current position
[0-9]+ - 1+ digits
EDIT: If the digits to the left of _ should also be considered, use 1) "(^|_)\\d|\\d(_|$)" with grep solution and 2) "(?<![^_])\\d+|\\d+(?![^_])" with the number extraction solution.
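The three results discussed above (indices, a logical vector, and the numbers themselves) can be put side by side in base R. This is only a minimal sketch; the variable names idx, hit and nums are illustrative and not part of the original answer.

```r
s <- c("abc_1_efg", "efg_2", "hi2jk_lmn", "opq")

# Indices of elements containing "_" followed by digits (TRE, no lookaround needed)
idx <- grep("_[0-9]+", s)

# The same test as a logical vector
hit <- grepl("_[0-9]+", s)

# The digits themselves: the PCRE lookbehind keeps "_" out of the match
nums <- as.integer(unlist(regmatches(s, gregexpr("(?<=_)[0-9]+", s, perl = TRUE))))

print(idx)   # 1 2
print(hit)   # TRUE TRUE FALSE FALSE
print(nums)  # 1 2
```

Elements without a match contribute an empty character vector to regmatches(), so unlist() silently drops them.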

Using this regex:
[_]([0-9]){1}
Select group 1 to get a single digit. If you want one or more digits, use
[_]([0-9]+)
Neither pattern matches the last two strings.
You can test it with this tool: https://regex101.com/
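In R, "selecting group 1" could be done with regexec() and regmatches(), which return the full match followed by each capture group. The following is a sketch under that assumption; the name group1 is illustrative.

```r
s <- c("abc_1_efg", "efg_2", "hi2jk_lmn", "opq")

# regexec() finds the first match per element and records the capture groups;
# each list element is c(full match, group 1), or character(0) when nothing matches
m <- regmatches(s, regexec("_([0-9]+)", s))

# Keep group 1 where it exists, NA otherwise
group1 <- vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
                 character(1))

print(group1)  # "1" "2" NA NA
```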

with stringr:
s <- c('abc_1_efg', 'efg_2', 'hi2jk_lmn', 'opq', 'a_1_b')
library(stringr)
which(!is.na(str_match(s, '_\\d|\\d_')))
# [1] 1 2 5

Related

Inverting a regex in R

I have this string:
[1] "19980213" "19980214" "19980215" "19980216" "19980217" "iffi" "geometry"
[8] "date_consid"
and I want to match all the elements that are not dates and not "date_consid". I tried
res = grep("(?!\\d{8})|(?!date_consid)", vec, value=T)
But I just cant make it work...

You can use
vec <- c("19980213", "19980214", "19980215", "19980216","19980217", "iffi","geometry", "date_consid")
grep("^(\\d{8}|date_consid)$", vec, value=TRUE, invert=TRUE)
## => [1] "iffi" "geometry"
The ^(\d{8}|date_consid)$ regex matches a string that only consists of any eight digits or that is equal to date_consid.
The value=TRUE makes grep return values rather than indices and invert=TRUE inverts the regex match result (returns those that do not match).

The pattern that you tried gives all the matches because the lookaheads are unanchored.
Joining separate lookaheads with | still matches all strings.
You can change the logic to assert, from the start of the string, that what is directly to the right is neither 8 digits nor date_consid, in a single check.
Using a negative lookahead, you have to add perl=T, add an anchor ^ to assert the start of the string, and add an anchor $ after each alternative inside the lookahead to assert the end of the string.
^(?!\\d{8}$|date_consid$)
^ Start of string
(?! Negative lookahead
\\d{8}$ Match 8 digits until end of string
| Or
date_consid$ Match date_consid until end of string
) Close lookahead
For example
vec <- c("19980213", "19980214", "19980215", "19980216","19980217", "iffi","geometry", "date_consid")
grep("^(?!\\d{8}$|date_consid$)", vec, value=T, perl=T)
Output
[1] "iffi" "geometry"
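To see how the approaches from the two answers relate, the sketch below runs the invert-based pattern, the anchored negative lookahead, and the unanchored attempt from the question (with perl = TRUE added so that the lookaheads are accepted). The variable names are illustrative.

```r
vec <- c("19980213", "19980214", "19980215", "19980216", "19980217",
         "iffi", "geometry", "date_consid")

# Approach 1: match what should be excluded, then invert the result
by_invert <- grep("^(\\d{8}|date_consid)$", vec, value = TRUE, invert = TRUE)

# Approach 2: negative lookahead anchored at the start (requires perl = TRUE)
by_lookahead <- grep("^(?!\\d{8}$|date_consid$)", vec, value = TRUE, perl = TRUE)

# Unanchored attempt: at least one of the two lookaheads succeeds
# at the first position of every string, so every element matches
unanchored <- grep("(?!\\d{8})|(?!date_consid)", vec, value = TRUE, perl = TRUE)

print(by_invert)                          # "iffi" "geometry"
print(identical(by_invert, by_lookahead)) # TRUE
print(length(unanchored) == length(vec))  # TRUE
```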

Extract string using `rm_between` function

I want to extract strings using rm_between function from the library(qdapRegex)
I need to extract the string between the second "|" and the word "_HUMAN".
I cant figure out how to select the second "|" and not the first.
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
prots <- rm_between(example, '|', 'HUMAN', extract=TRUE)
Thank you!!

Another alternative using regmatches, regexpr and using perl=TRUE to make use of \K
^(?:[^|]*\|){2}\K[^|_]+(?=_HUMAN)
For example
regmatches(example, regexpr("^(?:[^|]*\\|){2}\\K[^|_]+(?=_HUMAN)", example, perl=TRUE))
Output
[1] "EIFCL" "EIF3C"

In your rm_between(example, '|', 'HUMAN', extract=TRUE) command, the | is used to match the leftmost | and HUMAN is used to match the leftmost HUMAN right after.
Note the default value for the fixed argument is TRUE, so | and HUMAN are treated as literal chars.
You need to make the pattern a regex pattern, by setting fixed=FALSE. However, the ^(?:[^|]*\|){2} as the left argument regex will not work because the qdapRegex package creates an ICU regex with lookarounds (since you use extract=TRUE that sets include.markers to FALSE), which is (?<=^(?:[^|]*\|){2}).*?(?=HUMAN), and ICU lookbehinds must have a bounded length.
As a workaround, you could use a constrained-width lookbehind, by replacing * with a limiting quantifier with a reasonably large max parameter. Say, if you do not expect more than 1000 chars between each pipe, you may use {0,1000}:
rm_between(example, '^(?:[^|]{0,1000}\\|){2}', '_HUMAN', extract=TRUE, fixed=FALSE)
# => [[1]]
# [1] "EIFCL"
#
# [[2]]
# [1] "EIF3C"
However, you really should think of using simpler approaches, like those described in other answers. Here is another variation with sub:
sub("^(?:[^|]*\\|){2}(.*?)_HUMAN.*", "\\1", example)
# => [1] "EIFCL" "EIF3C"
Details
^ - start of string
(?:[^|]*\\|){2} - two occurrences of any 0 or more non-pipe chars followed with a pipe char (so, matching up to and including the second |)
(.*?) - Group 1: any 0 or more chars, as few as possible
_HUMAN.* - _HUMAN and the rest of the string.
\1 keeps only Group 1 value in the result.
A stringr variation:
stringr::str_match(example, "^(?:[^|]*\\|){2}(.*?)_HUMAN")[,2]
# => [1] "EIFCL" "EIF3C"
With str_match, the captures can be accessed easily, we do it with [,2] to get Group 1 value.

This is not exactly what you asked for, but you can achieve the result with base R:
sub("^.*\\|([^\\|]+)_HUMAN.*$", "\\1", example)
This solution relies on a regular expression with a capture group.
"^.*\\|([^\\|]+)_HUMAN.*$" matches the entire character string.
\\1 in the replacement refers to whatever was captured inside the first pair of parentheses, so the whole string is replaced by that part.

Using regular gsub:
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
gsub(".*?\\|.*?\\|(.*?)_HUMAN", "\\1", example)
#> [1] "EIFCL" "EIF3C"
The whole match is replaced by the part captured by (.*?), because the replacement consists of the back-reference \\1.
If you absolutely prefer qdapRegex you can try:
rm_between(example, '.{0,100}\\|.{0,100}\\|', '_HUMAN', fixed = FALSE, extract = TRUE)
The reason why we have to use .{0,100} instead of .*? is that the underlying stringi needs a maximum length for the look-behind pattern (i.e. the left argument in rm_between).

Just saying that you could easily just use sapply()/strsplit():
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
unlist(sapply(strsplit(example, "|", fixed = T),
function(item) strsplit(item[3], "_HUMAN", fixed = T)))
# [1] "EIFCL" "EIF3C"
It just splits on | in the first list and on _HUMAN on every third element within that list.
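The split-based idea can also be wrapped in a small function. The helper below, field_without_suffix, is hypothetical (it is not part of base R or qdapRegex) and is only a sketch: it takes the n-th "|"-separated field and drops a trailing suffix. It assumes the suffix contains no regex metacharacters, since it is passed to sub() as a pattern.

```r
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")

# Hypothetical helper: n-th "|"-separated field with a trailing suffix removed
field_without_suffix <- function(x, n = 3, suffix = "_HUMAN") {
  # Split on a literal pipe; fixed = TRUE avoids regex alternation
  fields <- strsplit(x, "|", fixed = TRUE)
  # f[n] is NA when an element has fewer than n fields
  nth <- vapply(fields, function(f) f[n], character(1))
  # Remove the suffix only when it ends the field
  sub(paste0(suffix, "$"), "", nth)
}

result <- field_without_suffix(example)
print(result)  # "EIFCL" "EIF3C"
```

Using vapply() with character(1) keeps the result a plain character vector, so no unlist() is needed.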

Finding the position of decimal point in an integer or a string

If i want to determine the position of a decimal point in an integer, for example, in 524.79, the position of the decimal point is 4, and that is what I want as output in R; which function or command should i use? I have tried using gregexpr as well as regexpr but each time the output comes out to be 1.
This is what I did :
x <- 524.79
gregexpr(pattern = ".", "x")
The output looks like this:
[[1]]
[1] 1
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE

The . is a metacharacter which means any character. To match the literal character, either escape it (\\.), place it inside square brackets ([.]), or use fixed = TRUE:
as.integer(gregexpr(pattern = ".", x, fixed = TRUE))
#[1] 4
Or a compact option is str_locate
library(stringr)
unname(str_locate(x, "[.]")[,1])
#[1] 4
The second issue in the OP's solution is quoting the object x. gregexpr then searches the one-character string "x", where the unescaped . matches the first (and only) character, so the result is 1.
data
x <- 524.79
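As a variation, regexpr() (rather than gregexpr()) returns one position per element instead of a list, which makes the vectorised case straightforward. The sketch below assumes the numbers are converted with as.character(); the vector vals is made up for illustration.

```r
x <- 524.79

# regexpr() gives the position of the first match, or -1 when there is none;
# as.integer() drops the match.length attributes
pos <- as.integer(regexpr(".", as.character(x), fixed = TRUE))
print(pos)  # 4

# Several values at once; a whole number such as 42 has no "." in its text form
vals <- c(524.79, 3.14159, 42)
positions <- as.integer(regexpr(".", as.character(vals), fixed = TRUE))
print(positions)  # 4 2 -1
```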

We could actually use a regex here:
x <- "524.79"
nchar(sub("(?<=\\.)\\d+", "", x, perl=TRUE))
# [1] 4

r: regex for containing pattern with negation

Suppose I have the following two strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?

You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
The metric(?!.*_(?:dk|none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - that is not followed with any 0+ chars other than line break chars followed with _ and then either dk or none.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
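The difference between the main pattern and the variation in the NOTE only shows up when _dk or _none occurs in the middle of a string. The sketch below adds a made-up value, "business_metric_dk_total", to illustrate this; it is not part of the original data.

```r
x <- c("business_metric_one", "business_metric_one_dk",
       "business_metric_one_none", "business_metric_dk_total")

# Lookahead: rejects "_dk" / "_none" anywhere after "metric"
anywhere <- grep("metric(?!.*_(?:dk|none))", x, value = TRUE, perl = TRUE)

# Lookbehind at the end: rejects only strings that END with "_dk" / "_none"
at_end <- grep("metric.*$(?<!_dk|_none)", x, value = TRUE, perl = TRUE)

print(anywhere)  # "business_metric_one"
print(at_end)    # "business_metric_one" "business_metric_dk_total"
```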

You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
or, more specifically:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")

How to match strings matching [a-z_]* but with non repetitive symbol "_"

I would like to match strings :
That are composed by [a-z_] ;
That doesn't start or end with "_" ;
That doesn't include repetitive "_" symbol.
So for example the expected matching results would be :
"x"; "x_x" > TRUE
"_x"; "x_"; "_x_"; "x__x" > FALSE
My problems to achieve this is that I can exclude strings ending or starting with "_" but my regexp also excludes length 1 strings.
grepl("^[a-z][a-z_]*[a-z]$", my.string)
My second issue is that I don't know how to negate a match for double characters grepl("(_)\\1", my.string) and how I can integrate it with the 1st part of my regexp.
If possible I would like to do this with perl = FALSE.

You need to use the following TRE regex:
grepl("^[a-z]+(?:_[a-z]+)*$", my.string)
Details:
^ - start of string
[a-z]+ - one or more lowercase ASCII letters
(?:_[a-z]+)* - zero or more sequences (*) of
_ - an underscore
[a-z]+ - one or more lowercase ASCII letters
$ - end of string.
For example:
my.string <- c("x" ,"x_x", "x_x_x_x_x","_x", "x_", "_x_", "x__x")
grepl("^[a-z]+(?:_[a-z]+)*$", my.string)
## => [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE

This seems to identify the items correctly:
dat <- c("x" ,"x_x","_x", "x_", "_x_", "x__x")
grep("^_|__|_$", dat, invert=TRUE)
[1] 1 2
So try:
!grepl("^_|__|_$", dat)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
Just uses negation and a pattern with three conditions separated by the regex logical OR operator "|".

Another regex that uses grouping with ( ) and the * quantifier for repetition.
myString <- c("x_", "x", "_x", "x_x_x", "x_x", "x__x")
grepl("^([a-z]_)*[a-z]$", myString)
[1] FALSE TRUE FALSE TRUE TRUE FALSE
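So ^([a-z]_)* matches 0 or more pairs of "[a-z]_" at the beginning of the string and [a-z]$ ensures that the final character is a lower case alphabetical character. Note that this pattern only accepts single-letter segments; replace each [a-z] with [a-z]+ to allow longer ones.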
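The positive and negative formulations from the answers can be compared on inputs with longer segments. In the sketch below, "abc_de" and "x1" are added test values, and the negative version is combined with an extra alphabet check, since "^_|__|_$" on its own does not verify that only [a-z_] is used. Both calls work with perl = FALSE, as requested in the question.

```r
my.string <- c("x", "x_x", "abc_de", "_x", "x_", "_x_", "x__x", "x1")

# Positive description: runs of letters separated by single underscores
positive <- grepl("^[a-z]+(_[a-z]+)*$", my.string)

# Negative description: no leading, trailing, or doubled underscore,
# plus a separate check that only [a-z_] characters occur
negative <- !grepl("^_|__|_$", my.string) & grepl("^[a-z_]+$", my.string)

print(positive)  # TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
print(identical(positive, negative))  # TRUE
```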
