Extract all digits values after first underscore - r

I want to extract the numbers after the 1st underscore (_), but I don't know why only a single digit is selected.
My sample data is:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("(.*_){1}(\\d)_.+", "\\2", myvec))
[1] 0 9 NA
Warning message:
NAs introduced by coercion
I'd like:
[1] 0 9 25
Please, any help with it?

Some explanation. We are interested in digits coming after _. [0-9] matches a digit, and the + says that we want one or more digits in a row. (?<=_) 'looks behind' the digits and makes sure we only capture digits preceded by a _. str_extract returns only the first such match in each string.
library(stringr)
str_extract(myvec, "(?<=_)[0-9]+")
[1] "0" "9" "25"
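For readers who prefer to avoid a package dependency, the same lookbehind can probably be run in base R. This is only a sketch: first_num and the NA pre-fill are illustrative choices that do not come from the answer above, and perl = TRUE is needed because R's default regex engine does not support lookbehind.

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# regexpr() locates the first match in each element; -1 means "no match"
m <- regexpr("(?<=_)[0-9]+", myvec, perl = TRUE)

# Pre-fill with NA so elements without a match keep their position
first_num <- rep(NA_real_, length(myvec))
first_num[m > 0] <- as.numeric(regmatches(myvec, m))
first_num
# 0 9 25
```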

Another possible solution, based on stringr::str_extract:
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(str_extract(myvec, "(?<=_)\\d+"))
#> [1] 0 9 25

You can use sub (because you will need a single search and replace operation) with a pattern like ^[^_]*_(\d+).*:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
sub("^[^_]*_(\\d+).*", "\\1", myvec)
# => [1] "0" "9" "25"
Regex details:
^ - start of string
[^_]* - a negated character class that matches any zero or more chars other than _
_ - a _ char
(\d+) - Group 1 (\1 refers to the value captured into this group from the replacement pattern): one or more digits
.* - the rest of the string (. in TRE regex matches line break chars by default).
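To see why the attempt in the question returned NA for the third element, it may help to run both patterns side by side. The variable names original and fixed are only illustrative.

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# Attempt from the question: (\\d) is exactly one digit and must be followed
# by "_", so "increa_25-50-76" does not match and is returned unchanged
original <- sub("(.*_){1}(\\d)_.+", "\\2", myvec)
original
# "0" "9" "increa_25-50-76"

# Pattern from this answer: \\d+ takes the whole run of digits and .* accepts
# whatever follows, so every element matches
fixed <- sub("^[^_]*_(\\d+).*", "\\1", myvec)
as.numeric(fixed)
# 0 9 25
```

The unchanged third string is what as.numeric() then turns into NA, with the coercion warning shown in the question.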

If you want to extract the first number after the first underscore, you can use a capture group with str_match and the pattern _([0-9]+)
Note that the character class [0-9] (or \\d) has to be repeated one or more times with +, otherwise only a single digit is captured.
For example
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
str_match(myvec, "_([0-9]+)")[,2]
Output
[1] "0" "9" "25"
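A rough base R counterpart of the str_match approach, assuming you only need the first capture group, could use regexec together with regmatches. The name first_group is hypothetical.

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# regexec() returns the full match plus each capture group per element
m <- regmatches(myvec, regexec("_([0-9]+)", myvec))

# Element 1 is the full match ("_0"), element 2 is capture group 1 ("0");
# indexing an empty result with [2] gives NA for strings without a match
first_group <- vapply(m, function(x) x[2], character(1))
first_group
# "0" "9" "25"
```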

myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("[^_]*_(\\d+).*", "\\1", myvec))
[1] 0 9 25

Related

Use gsub to extract the first integer number

I'd like to use gsub to remove characters from a filename.
In the example below the desired output is 23
digs = "filepath/23-00.xlsx"
I can remove everything before 23 as follows:
gsub("^\\D+", "",digs)
[1] "23-00.xlsx"
or everything after:
gsub("\\-\\d+\\.xlsx$","", digs)
[1] "filepath/23"
How do I do both at the same time?
We could use | (OR): match any characters (.*) up to and including the /, or (|) match the - followed by any characters (.*), and replace both with an empty string ("")
gsub(".*/|-.*", "", digs)
[1] "23"
Or just do parse_number
readr::parse_number(digs)
[1] 23
You can just use a sub like
sub("^\\D+(\\d+).*", "\\1", digs)
# => [1] "23"
Details:
^ - start of string
\D+ - one or more non-digit chars
(\d+) - Group 1 (\1 refers to this group value): one or more digits
.* - any zero or more chars as many as possible.
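One caveat worth checking before reusing this pattern: \D+ requires at least one non-digit character before the number. The sketch below uses a second, hypothetical file name that starts with the digits to show the difference between \D+ and \D*.

```r
# The second element is a made-up file name without a leading path
digs <- c("filepath/23-00.xlsx", "23-00.xlsx")

# ^\\D+ needs one or more leading non-digits, so the second string
# does not match and is returned unchanged
with_plus <- sub("^\\D+(\\d+).*", "\\1", digs)
with_plus
# "23" "23-00.xlsx"

# ^\\D* also allows zero leading non-digits
with_star <- sub("^\\D*(\\d+).*", "\\1", digs)
with_star
# "23" "23"
```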

Extract the first number (with decimals) after a given symbol from a string with multiple numbers in R

I'm trying to get the numbers (including decimals) from a string.
My data is similar to this:
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")
Where numbers are mixed with letters and symbols.
I need to extract the first number after the < symbol while keeping the index for the missing values.
My output would look like:
desired.output <- c(7.35, 32, 83, 50, 1.15, 98, 3.4, NA, 3.4)
I have tried:
resp <- as.numeric(unlist(regmatches(V,
gregexpr("[[:digit:]]+\\.*[[:digit:]]*",V))))
and
resp <- sub(".*<(^[^-])", "\\1", V)
and another patterns in the sub function but nothing seems to work.
What do you suggest as best approach?
You can use
sub(".*<(\\d+(?:\\.\\d+)?).*", "\\1", V, perl=TRUE)
# => [1] "7.35" "32" "83" "50" "1.15" "98" "3.4" "NA" "3.4"
Replace \\d+(?:\\.\\d+)? with \\d*\\.?\\d+ if you also need to match numbers like .05. Add -? before the first \\d+ if you also need to match negative numbers.
Details:
.* - any zero or more chars other than line break chars, as many as possible
< - a < char
(\d+(?:\.\d+)?) - Group 1 (referred to with \1 from the replacement pattern): one or more digits followed with an optional sequence of a dot and one or more digits
.* - any zero or more chars other than line break chars, as many as possible
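The two variations mentioned above (a leading-dot decimal such as .05 and an optional minus sign) can be combined. The input vector x below is invented purely for illustration and is not part of the original data.

```r
# Hypothetical inputs: a leading-dot decimal, a negative number, an integer
x <- c("a-<.05", "b-<-2.5", "c-<7")

# -? allows an optional minus sign; \\d*\\.?\\d+ allows ".05", "2.5" and "7"
res <- sub(".*<(-?\\d*\\.?\\d+).*", "\\1", x, perl = TRUE)
res
# ".05" "-2.5" "7"

as.numeric(res)
# 0.05 -2.5 7
```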
Using str_extract from stringr package in tidyverse:
library(tidyverse)
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")
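# Look behind for "<", then match the number; the literal "NA" alternative
# is turned into a missing value by as.numeric().
# Note: \\d\\.?\\d+ needs at least two digits, so a value like "<5" would not match.
str_extract(V, "((?<=\\<)\\d\\.?\\d+|NA)") %>%
as.numeric()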
[1] 7.35 32.00 83.00 50.00 1.15 98.00 3.40 NA 3.40
You could also specify the first numbers followed by -< and capture the second part with an optional decimal.
\d+(?:\.\d+)?-<(\d+(?:\.\d+)?).*
The pattern matches:
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
-< Match literally
( Capture group 1
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
) Close group 1
.* Match the rest of the line
Then you can match the rest of the string that you don't want in the result, and replace with group 1.
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")
sub("\\d+(?:\\.\\d+)?-<(\\d+(?:\\.\\d+)?).*", "\\1", V)
Output
[1] "7.35" "32" "83" "50" "1.15" "98" "3.4" "NA" "3.4"
Matching all variations of - < or >, you can use a character class listing all the allowed characters and repeat them 1 or more times:
sub("\\d+(?:\\.\\d+)?[<>-]+(\\d+(?:\\.\\d+)?).*", "\\1", V)
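As a quick check of the character-class version, here is a sketch with a few separator variations. Only the first input comes from the question; the other two are made up for illustration.

```r
# "->" and "<" separators are hypothetical variations of the original "-<"
x <- c("7.20-<7.35", "25->32", "60<83e")

# [<>-]+ accepts any run of "<", ">" and "-" between the two numbers
second_num <- sub("\\d+(?:\\.\\d+)?[<>-]+(\\d+(?:\\.\\d+)?).*", "\\1", x)
second_num
# "7.35" "32" "83"
```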

Regex: extract a number after a string that contains a number

Suppose I have a string:
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"
How can I grab the number of discharged cases in England?
I have tried
sub("(?i).*England has [\\d] cases(.*?(\\d+).*", "\\1", str),
It's returning the original string. Many Thanks!
We can use regmatches/gregexpr to match one or more digits (\\d+) followed by a space, 'discharged' to extract the number of discharges
as.integer(regmatches(str, gregexpr("\\d+(?= discharged)", str, perl = TRUE))[[1]])
#[1] 1 2
If it is specific only to 'England', start with 'England' followed by characters that are not a ( ([^(]+) and then a (, capture the digits (\\d+) as a group, and in the replacement specify the backreference (\\1) of the captured group
sub("England[^(]+\\((\\d+).*", "\\1", str)
#[1] "1"
Or if we go by the OP's option, the ( after cases should be escaped, because an unescaped ( is a metacharacter that opens a capture group. Also, \\d+ should be placed outside the square brackets, since a bracket expression matches only a single character
sub("(?i)England has \\d+ cases\\((\\d+).*", "\\1", str)
#[1] "1"
We can use str_match to capture number before "discharged".
stringr::str_match(str, "England.*?(\\d+) discharged")[, 2]
#[1] "1"
Alternatively, use the regex \d+(?= discharged) and take the first match.
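If the counts are needed for every country rather than only England, one possible sketch is to split the string on ";" first and then apply a capture group to each part. The names txt, parts, country and discharged are hypothetical, and the code assumes every part follows the "<Country> has N cases(M discharged, ...)" layout from the question.

```r
txt <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"

# One element per country, with surrounding whitespace removed
parts <- trimws(strsplit(txt, ";", fixed = TRUE)[[1]])

# Country name: everything before " has"
country <- sub(" has.*", "", parts)

# Discharged count: the digits right after "(" and before " discharged"
discharged <- as.integer(sub(".*\\((\\d+) discharged.*", "\\1", parts))

result <- setNames(discharged, country)
result[["England"]]
# 1
```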

Capture entire substring using regex if there is a match on a number

I have been unable to find the answer to this specific question. I am using R to clean some survey data.
I have some messy survey data with question names as columns, that sometimes include a number and sometimes don't. When they include a number, it will often contain some subcharacters as well indicating the question. Example, I have this vector:
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
I want to extract the substrings that contain numbers, and return no results if there is no such match. Desired result (using R)
"1"
"1.a."
NA
"2"
"2.a."
"2.b."
NA
I know I can capture the first number, using
stri_extract_first_regex(questions, "[0-9]+")
But I am at a loss how to modify it to capture the whole string until the first whitespace if it finds a match using this pattern.
For your example data you might use:
[0-9]+(?:\.[a-z]\.)?
That will match:
[0-9]+ Match 1+ digits
(?: Non capturing group
\.[a-z]\. Match a dot, lowercase character and a dot
)? Close non capturing group and make it optional
For example:
library(stringi)
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
print(stri_extract_first_regex(questions, "[0-9]+(?:\\.[a-z]\\.)?"))
# [1] "1" "1.a." NA "2" "2.a." "2.b." NA
This might work:
hasnumber <- grepl("[0-9]+",questions)
firstspaces <- sapply(gregexpr(" ", questions), function(x) x[[1]])
res <- ifelse(hasnumber, substr(questions,1,firstspaces-1), NA)
> res
[1] "1" "1.a." NA "2" "2.a." "2.b." NA
The most difficult part, I guess, is finding the position of the first space in each question, which could be done with a loop or, as here, with sapply.
You may use
questions <- sub("^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.*", "\\1", questions)
questions[questions==""] <- NA
questions
# => [1] "1" "1.a." NA "2" "2.a." "2.b." NA
The ^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.* pattern matches
^ - start of string
(\\d+(?:\\.[a-z0-9]+)*\\.?) - Capturing group 1:
  \\d+ - 1+ digits
  (?:\\.[a-z0-9]+)* - 0 or more repetitions of
    \\. - a dot
    [a-z0-9]+ - 1 or more lowercase ASCII letters or digits
  \\.? - an optional dot
.* - any 0+ chars to the end of the string
| - or
.* - the whole string.
The match is replaced with the contents of Group 1. If the second alternative matches, the result is an empty string, and questions[questions==""] <- NA replaces these elements with NAs.
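Since the question asks for everything up to the first whitespace, another option might be to match a leading digit followed by any run of non-whitespace characters. This is a sketch that assumes the question number always sits at the very start of the string; res is an illustrative name.

```r
questions <- c(
  "1 question 1 what do you think?",
  "1.a. question 1a further details on what you think",
  "Please explain",
  "2 question 2 what is your motivation",
  "2.a. further details",
  "2.b. even further details",
  "Please explain")

# ^[0-9] anchors a digit at the start; \\S* extends the match up to
# (but not including) the first whitespace character
m <- regexpr("^[0-9]\\S*", questions, perl = TRUE)

# Keep NA for strings that do not start with a number
res <- rep(NA_character_, length(questions))
res[m > 0] <- regmatches(questions, m)
res
# "1" "1.a." NA "2" "2.a." "2.b." NA
```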

regular expression: ".*\\s([0-9]+)\\snomination.*$"

Could someone explain why "Won 1 Oscar." can be picked out according to the regular expression given as below
awards <- c("Won 1 Oscar.",
"Won 1 Oscar. Another 9 wins & 24 nominations.",
"1 win and 2 nominations.",
"2 wins & 3 nominations.",
"Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
"4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
As far as I understand, the pattern is "abcd (any number 0-9) nominationabcd". Once the pattern is matched, the number replaces the whole string. The returned "Won 1 Oscar" seems to come from the second element. What confuses me is that there is no nomination.* following "Won 1 ", so why does there seem to be no replacement?
The sub function (like gsub) takes the regex (or a plain string if you use fixed=TRUE) and tries to find a match in each element of the input character vector. If a match is found, it is replaced with the replacement string/pattern. If no match is found, the current string is returned unchanged, which is why "Won 1 Oscar." comes back as is.
Since you want to get the only nominations value from each element of the character vector, you need to extract them, rather than replace the matches.
You may rely on the stringr str_extract:
> library(stringr)
> str_extract(awards, "[0-9]+(?=\\s*nomination)")
[1] NA "24" "2" "3" "2" "1"
The [0-9]+(?=\\s*nomination) pattern finds 1 or more digits, but only those that are followed by 0+ whitespace chars and the nomination char sequence. The whitespace and the "nomination" word are excluded from the match because they sit inside a positive lookahead ((?=...)), which is non-consuming, i.e. it does not put the matched text into the match value.
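If you would rather stay with sub in base R, a possible sketch is to record which elements match and blank out the rest before converting. The names res, matched and nominations are illustrative.

```r
awards <- c("Won 1 Oscar.",
            "Won 1 Oscar. Another 9 wins & 24 nominations.",
            "1 win and 2 nominations.",
            "2 wins & 3 nominations.",
            "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
            "4 wins & 1 nomination.")

# sub() leaves non-matching strings untouched, so "Won 1 Oscar." survives
res <- sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
res[1]
# "Won 1 Oscar."

# Mark elements that contain no "<digits> nomination" as missing
matched <- grepl("\\s[0-9]+\\snomination", awards)
res[!matched] <- NA

nominations <- as.integer(res)
nominations
# NA 24 2 3 2 1
```

Setting the non-matching elements to NA before calling as.integer() avoids the coercion warning that the unchanged text would otherwise trigger.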

Resources