Regex: extract a number after a string that contains a number - r

Suppose I have a string:
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"
How can I grab the number of discharged cases in England?
I have tried
sub("(?i).*England has [\\d] cases(.*?(\\d+).*", "\\1", str),
It's returning the original string. Many Thanks!

We can use regmatches/gregexpr to match one or more digits (\\d+) that are followed by a space and 'discharged' (checked with the lookahead (?= discharged)), which extracts the number of discharges
as.integer(regmatches(str, gregexpr("\\d+(?= discharged)", str, perl = TRUE))[[1]])
#[1] 1 2
If it is specific only to 'England', start with 'England' followed by characters that are not a ( ([^(]+) and then a literal (, then capture the digits (\\d+) as a group. In the replacement, specify the backreference (\\1) of the captured group
sub("England[^(]+\\((\\d+).*", "\\1", str)
#[1] "1"
Or if we go by the OP's option, the ( after 'cases' should be escaped, because an unescaped ( is a metacharacter that opens a capture group. Also, \\d+ can be placed outside the square brackets
sub("(?i)England has \\d+ cases\\((\\d+).*", "\\1", str)
#[1] "1"
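The England-specific pattern above can be generalised to any country. The following is only a sketch: discharged_for is a hypothetical helper name, and it assumes the country name contains no regex metacharacters and that the text always has the "<country> ... (N discharged" shape shown in the question.

```r
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"

# Hypothetical helper: returns the discharged count for one country, or NA if not found
discharged_for <- function(country, text) {
  # country name, then anything up to the first "(", then the captured digits
  pattern <- paste0(country, "[^(]+\\((\\d+) discharged")
  m <- regmatches(text, regexec(pattern, text))[[1]]
  as.integer(m[2])  # m[1] is the full match, m[2] the capture group
}

england  <- discharged_for("England", str)
scotland <- discharged_for("Scotland", str)
wales    <- discharged_for("Wales", str)  # not in the string, so NA
```

Using regexec instead of sub means a missing country gives NA rather than the original string.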

We can use str_match to capture number before "discharged".
stringr::str_match(str, "England.*?(\\d+) discharged")[, 2]
#[1] "1"

The regex is \d+(?= discharged); take the first match, which here is the England count because England comes first in the string.
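A minimal base R sketch of that suggestion, assuming the same str as in the question. regexpr (unlike gregexpr) returns only the first match; this gives the England figure only because England appears first in the string.

```r
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"

# regexpr() locates the first match only; perl = TRUE is needed for the lookahead
first <- regmatches(str, regexpr("\\d+(?= discharged)", str, perl = TRUE))
first_n <- as.integer(first)
first_n
#[1] 1
```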

Related

Extract all digits values after first underscore

I want to extract the numbers after the 1st underscore (_), but I don't know why just only 1 number digit is selected.
My sample data is:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("(.*_){1}(\\d)_.+", "\\2", myvec))
[1] 0 9 NA
Warning message:
NAs introduced by coercion
I'd like:
[1] 0 9 25
Please, any help with it?
Some explanation. We are interested in digits coming after _. [0-9] matches a digit, and the + says that we want to match one or more digits in a row. (?<=_) 'looks behind' the digits and makes sure we are only matching digits preceded by a _. str_extract returns only the first such match in each string.
library(stringr)
str_extract(myvec, "(?<=_)[0-9]+")
[1] "0" "9" "25"
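The same lookbehind also works without stringr. This is a sketch in base R, assuming every element contains at least one match (regmatches silently drops elements with no match, which would shorten the result).

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# regexpr() finds the first match per element; perl = TRUE enables (?<=_)
m <- regexpr("(?<=_)[0-9]+", myvec, perl = TRUE)
first_num <- as.numeric(regmatches(myvec, m))
first_num
#[1]  0  9 25
```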
Another possible solution, based on stringr::str_extract:
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(str_extract(myvec, "(?<=_)\\d+"))
#> [1] 0 9 25
You can use sub (because you will need a single search and replace operation) with a pattern like ^[^_]*_(\d+).*:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
sub("^[^_]*_(\\d+).*", "\\1", myvec)
# => [1] "0" "9" "25"
Regex details:
^ - start of string
[^_]* - a negated character class that matches any zero or more chars other than _
_ - a _ char
(\d+) - Group 1 (\1 refers to the value captured into this group from the replacement pattern): one or more digits
.* - the rest of the string (. in TRE regex matches line break chars by default).
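One thing to keep in mind with the sub approach: when the pattern does not match, sub returns the input unchanged rather than NA. The sketch below guards against that with grepl; the second element of v is a made-up non-matching case, not part of the original data.

```r
pattern <- "^[^_]*_(\\d+).*"
v <- c("increa_9_25-112", "no-underscore-here")  # second element: hypothetical input

unguarded <- sub(pattern, "\\1", v)  # the non-matching element comes back as-is

# Only keep the replacement where the pattern actually matched
res <- ifelse(grepl(pattern, v), sub(pattern, "\\1", v), NA_character_)
res
#[1] "9" NA
```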
If you want to extract the first number after the first underscore, you can use a capture group with str_match and the pattern _([0-9]+)
Note that the character class [0-9] (or \\d) has to be repeated one or more times with +; a single (\\d), as in the original attempt, matches only one digit.
For example
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
str_match(myvec, "_([0-9]+)")[,2]
Output
[1] "0" "9" "25"
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("[^_]*_(\\d+).*", "\\1", myvec))
[1] 0 9 25

Extract the first number (with decimals) after a given symbol from a string with multiple numbers in R

I'm trying to get the numbers (including decimals) from a string.
My data is similar to this:
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")
Where numbers are mixed with letters and symbols.
I need to extract the first number after the < symbol while keeping the index for the missing values.
My output would look like:
desired.output <- c(7.35, 32, 83, 50, 1.15, 98, 3.4, NA, 3.4)
I have tried:
resp <- as.numeric(unlist(regmatches(V,
gregexpr("[[:digit:]]+\\.*[[:digit:]]*",V))))
and
resp <- sub(".*<(^[^-])", "\\1", V)
and other patterns in the sub function, but nothing seems to work.
What do you suggest as best approach?
You can use
sub(".*<(\\d+(?:\\.\\d+)?).*", "\\1", V, perl=TRUE)
# => [1] "7.35" "32" "83" "50" "1.15" "98" "3.4" "NA" "3.4"
Replace \\d+(?:\\.\\d+)? with \\d*\\.?\\d+ if you also need to get numbers like .05. Add -? before the first \\d+ if you also need to get negative numbers.
Details:
.* - any zero or more chars other than line break chars, as many as possible
< - a < char
(\d+(?:\.\d+)?) - Group 1 (referred to with \1 from the replacement pattern): one or more digits followed with an optional sequence of a dot and one or more digits
.* - any zero or more chars other than line break chars, as many as possible
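As an illustration of the two variations mentioned above (numbers like .05 and negative numbers), here is a sketch that combines them. V2 is made-up test input, not data from the question.

```r
V2 <- c("0.85-<1.15", "0.01-<.05", "-5-<-2.5")  # hypothetical inputs

# -? allows a leading minus sign; \\d*\\.?\\d+ allows numbers without a leading digit
res <- sub(".*<(-?\\d*\\.?\\d+).*", "\\1", V2, perl = TRUE)
res
#[1] "1.15" ".05"  "-2.5"
```

Wrapping the result in as.numeric() turns these strings into numbers.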
Using str_extract from stringr package in tidyverse:
library(tidyverse)
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")
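# \\d+(?:\\.\\d+)? also covers single digits and values such as 10.5;
# elements without a match (e.g. "NA") are returned as NA
str_extract(V, "(?<=<)\\d+(?:\\.\\d+)?") %>%
as.numeric()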
[1] 7.35 32.00 83.00 50.00 1.15 98.00 3.40 NA 3.40
You could also specify the first numbers followed by -< and capture the second part with an optional decimal.
\d+(?:\.\d+)?-<(\d+(?:\.\d+)?).*
The pattern matches:
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
-< Match literally
( Capture group 1
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
) Close group 1
.* Match the rest of the line
Then you can match the rest of the string that you don't want in the result, and replace with group 1.
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")
sub("\\d+(?:\\.\\d+)?-<(\\d+(?:\\.\\d+)?).*", "\\1", V)
Output
[1] "7.35" "32" "83" "50" "1.15" "98" "3.4" "NA" "3.4"
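The question asks for a numeric vector with the missing value kept in place. A small sketch of that last step, assuming the same V: the "NA" element does not match the pattern, so sub leaves it as the string "NA", which as.numeric then converts to a real NA. perl = TRUE is added here only to be explicit about the (?:...) syntax.

```r
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")

# Replace each string with its captured second number, then convert to numeric
resp <- as.numeric(sub("\\d+(?:\\.\\d+)?-<(\\d+(?:\\.\\d+)?).*", "\\1", V, perl = TRUE))
resp
```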
To match all variations of -, < or > between the two numbers, you can use a character class listing all the allowed characters and repeat it one or more times:
sub("\\d+(?:\\.\\d+)?[<>-]+(\\d+(?:\\.\\d+)?).*", "\\1", V)

Extract first X digits of N digit numbers

How to select first 2 digits of a number? I just need the name of the function
Example: 12455 turns into 12, 13655 into 13
Basically it's the equivalent of substring for integers.
If at the end you need again a numeric vector/element, you can use
as.numeric(substr(x, 1, 2))
This solution uses gsub: the anchor ^ signifies the start of the string, \\d{2} matches any two digits at this position, the surrounding (...) mark it as a capturing group, and the backreference \\1 in the replacement argument 'recalls' the capturing group:
x <- c(12455,13655)
gsub("(^\\d{2}).*", "\\1", x)
[1] "12" "13"
Alternatively, use str_extract:
library(stringr)
str_extract(as.character(x), "^\\d{2}")
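For completeness, a purely arithmetic sketch that never converts to character. It assumes x holds positive whole numbers with at least two digits; n_digits and first_two are illustrative names.

```r
x <- c(12455, 13655)

n_digits  <- floor(log10(x)) + 1       # number of digits in each value
first_two <- x %/% 10^(n_digits - 2)   # integer-divide away the trailing digits
first_two
#[1] 12 13
```

The result is already numeric, so no as.numeric() step is needed afterwards.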

R Question: Extracting Numeric Characters from End of String

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
We can use sub to capture one or more digits (\\d+) at the end ($) of the string, preceded by any characters (.*) and a non-digit ([^0-9]). In the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last_regex from stringi
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"
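Since the desired result is a set of numbers, here is a base R sketch that extracts the trailing digits directly and converts them. It assumes every element ends in digits, as stated in the question; regmatches drops elements with no match.

```r
x <- c("AB ABC 19012301927 / XX - 4625",
       "BC - AB / 827 / 9765",
       "XXXX-9276")

# [0-9]+$ matches the run of digits that ends the string
m <- regexpr("[0-9]+$", x)
tail_num <- as.numeric(regmatches(x, m))
tail_num
#[1] 4625 9765 9276
```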

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my solution to return the leftmost one, use a lazy quantifier on the first .* (that is, .*?): sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the pattern has a _ both before and after the 4 digits, then use regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means:
one non-space character (\S) followed by any number of characters (.*),
then _
followed by four digits (\d{4})
the four digits are your year, captured using ()
another _
one non-space character followed by any other characters
Since you mentioned you're new, please test this and all other answers at https://regex101.com/. It is a good place to learn regex, and it explains in depth what your regex is actually doing.
If you only care about the year, then the regex below is enough:
_(\d{4})_
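The regex above is written without R string escaping. One way to use it from R, sketched here with base regexec so that the capture group can be read directly (backslashes are doubled inside the R string):

```r
some_string <- "1_2_start_2007_3_end"

# regexec() returns the full match plus each capture group
m <- regmatches(some_string, regexec("_(\\d{4})_", some_string))
year <- as.integer(m[[1]][2])  # element 1 is "_2007_", element 2 is the group
year
#[1] 2007
```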
