Regular expression ".*\\s([0-9]+)\\snomination.*$" in R

Could someone explain why "Won 1 Oscar." is returned by the regular expression given below?
awards <- c("Won 1 Oscar.",
"Won 1 Oscar. Another 9 wins & 24 nominations.",
"1 win and 2 nominations.",
"2 wins & 3 nominations.",
"Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
"4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
All I can work out is that the pattern means "any text, whitespace, a number made of digits 0-9, whitespace, nomination, any text". Once the pattern is matched, the number replaces the whole string. The matched "Won 1 Oscar" comes from the second element. What confuses me is that there is no nomination.* following "Won 1 ", and yet the string shows up in the result with no replacement made.

The sub function (like gsub) takes the regex (or a plain string if you use fixed=TRUE) and tries to find a match in each element of the input character vector. If a match is found, it is replaced with the replacement string/pattern. If no match is found, the current element is returned unchanged. That is why "Won 1 Oscar." appears in the output: it contains no "nomination", so nothing is replaced.
Since you want to get only the nominations value from each element of the character vector, you need to extract the values rather than replace the matches.
You may rely on the stringr str_extract:
> library(stringr)
> str_extract(awards, "[0-9]+(?=\\s*nomination)")
[1] NA "24" "2" "3" "2" "1"
The [0-9]+(?=\\s*nomination) pattern finds 1 or more digits, but only those that are followed by 0+ whitespace characters and the "nomination" character sequence. The whitespace and the word "nomination" are excluded from the match because they sit inside a positive lookahead ((?=...)), a non-consuming construct: the text it checks is not added to the match value.
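If you would rather stay with base R and the original pattern, the "returned unchanged" behaviour described above can be worked around by masking the non-matching elements yourself. This is only a minimal sketch; the variable names pattern, replaced and nominations are made up for illustration, and the sample vector is shortened.

```r
# Shortened sample data from the question
awards <- c("Won 1 Oscar.",
            "Won 1 Oscar. Another 9 wins & 24 nominations.",
            "4 wins & 1 nomination.")

pattern <- ".*\\s([0-9]+)\\snomination.*$"

# sub() leaves elements without a match untouched
replaced <- sub(pattern, "\\1", awards)

# Keep the replacement only where the pattern really matched, NA otherwise
nominations <- ifelse(grepl(pattern, awards), replaced, NA_character_)
print(nominations)
```

With this approach the first element becomes NA instead of leaking through as the original string.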

Related

Extract all digits values after first underscore

I want to extract the number after the 1st underscore (_), but I don't know why only a single digit is selected.
My sample data is:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("(.*_){1}(\\d)_.+", "\\2", myvec))
[1] 0 9 NA
Warning message:
NAs introduced by coercion
I'd like:
[1] 0 9 25
Please, any help with it?
Some explanation. We are interested in the digits that come right after a _. [0-9] matches a digit, and the + says that we want one or more digits in a row. (?<=_) "looks behind" the first digit and makes sure we only match digits preceded by a _. str_extract returns only the first such match in each string.
library(stringr)
str_extract(myvec, "(?<=_)[0-9]+")
[1] "0" "9" "25"
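The same lookbehind also works without stringr. Here is a rough base R sketch, assuming perl = TRUE is acceptable (lookbehind needs the PCRE engine); m, first_num and result are illustrative names only.

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# regexpr() finds only the first match per element; -1 means "no match"
m <- regexpr("(?<=_)[0-9]+", myvec, perl = TRUE)

# regmatches() returns matched elements only, so fill a vector of NAs
first_num <- rep(NA_character_, length(myvec))
first_num[m != -1] <- regmatches(myvec, m)

result <- as.numeric(first_num)
print(result)
```

Pre-filling with NA keeps the output the same length as the input even when some element has no underscore followed by digits.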
Another possible solution, based on stringr::str_extract:
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(str_extract(myvec, "(?<=_)\\d+"))
#> [1] 0 9 25
You can use sub (because you will need a single search and replace operation) with a pattern like ^[^_]*_(\d+).*:
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
sub("^[^_]*_(\\d+).*", "\\1", myvec)
# => [1] "0" "9" "25"
Regex details:
^ - start of string
[^_]* - a negated character class that matches any zero or more chars other than _
_ - a _ char
(\d+) - Group 1 (\1 refers to the value captured into this group from the replacement pattern): one or more digits
.* - the rest of the string (. in TRE regex matches line break chars by default).
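To see why the attempt in the question fails on the third element, it may help to run both patterns side by side. This is just a sketch; single_digit and multi_digit are names invented here.

```r
myvec <- c("increa_0_1-1", "increa_9_25-112", "increa_25-50-76")

# Pattern from the question: (\\d) takes exactly one digit and must be
# followed by another underscore, which the third element does not have
single_digit <- sub("(.*_){1}(\\d)_.+", "\\2", myvec)

# Pattern from this answer: \\d+ takes the whole run of digits and
# nothing is required after it
multi_digit <- sub("^[^_]*_(\\d+).*", "\\1", myvec)

print(single_digit)
print(multi_digit)
```

The third element comes back unchanged from the first call because the pattern never matches, and as.numeric() then turns that leftover string into NA with a warning.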
If you want to extract the first number after the first underscore, you can use a capture group with str_match and the pattern _([0-9]+)
Note that the character class (or \\d) has to be repeated one or more times with +, otherwise only a single digit is captured.
For example:
library(stringr)
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
str_match(myvec, "_([0-9]+)")[,2]
Output
[1] "0" "9" "25"
myvec<-c("increa_0_1-1","increa_9_25-112","increa_25-50-76" )
as.numeric(gsub("[^_]*_(\\d+).*", "\\1", myvec))
[1] 0 9 25

Regex: extract a number after a string that contains a number

Suppose I have a string:
str <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"
How can I grab the number of discharged cases in England?
I have tried
sub("(?i).*England has [\\d] cases(.*?(\\d+).*", "\\1", str),
It's returning the original string. Many Thanks!
We can use regmatches/gregexpr to match one or more digits (\\d+) that are followed by a space and 'discharged' (checked with a lookahead) to extract the number of discharges
as.integer(regmatches(str, gregexpr("\\d+(?= discharged)", str, perl = TRUE))[[1]])
#[1] 1 2
If it is specific only to 'England', start with 'England' followed by characters that are not a ( ([^(]+) and then a literal (, capture the digits (\\d+) as a group, and in the replacement specify the backreference (\\1) to the captured group
sub("England[^(]+\\((\\d+).*", "\\1", str)
#[1] "1"
Or, if we go by the OP's attempt, the ( after "cases" should be escaped, as an unescaped ( is a metacharacter that opens a capture group. Also, \\d+ can be placed outside the square brackets
sub("(?i)England has \\d+ cases\\((\\d+).*", "\\1", str)
#[1] "1"
We can use str_match to capture number before "discharged".
stringr::str_match(str, "England.*?(\\d+) discharged")[, 2]
#[1] "1"
Another option: use the regex \d+(?= discharged) and take the first match.
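That last suggestion comes without code, so here is one way it could look in base R. A sketch only: the string is stored in txt rather than str to avoid shadowing the base function, and m and discharged_england are illustrative names.

```r
txt <- "England has 90 cases(1 discharged, 5 died); Scotland has 5 cases(2 discharged, 1 died)"

# regexpr() (unlike gregexpr()) stops at the first match in the string;
# the lookahead needs perl = TRUE
m <- regexpr("\\d+(?= discharged)", txt, perl = TRUE)

discharged_england <- as.integer(regmatches(txt, m))
print(discharged_england)
```

This relies on England being mentioned first in the string; if the order can vary, anchoring the pattern on "England" as in the answers above is the safer choice.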

Selecting only strings with 2 words in vector via regex

I have a simple character string like:
test<-c("two words", "three more words", "something else", "this has a lot of words", "more of this", "pick me")
I would need a function that returns the indices of test where there are only 2 words in the element (in this example this would be index 1, 3 and 6, but 2, 4 and 5 are completely uninteresting). More context: I am searching for "real" names of persons among a large vector that is mixed also with company names (which have often 3 or more words). I have no clue how to perhaps get regex (or any other technique) to do this...
We can use grep to match a word (\\w+) followed by a space followed by another word (\\w+), anchored from the start (^) to the end ($) of the string
grep("^\\w+ \\w+$", test)
#[1] 1 3 6
Or with str_count
library(stringr)
which(str_count(test, "\\w+") == 2)
#[1] 1 3 6
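Another option involving stringr: word() returns NA when a string has fewer than 3 words, so this picks the two-word elements here (it would also pick one-word elements).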
which(is.na(word(test, 1, 3, fixed(" "))))
[1] 1 3 6
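If the real data may contain leading, trailing or repeated spaces, splitting on whitespace and counting the pieces is another possibility. This is a base R sketch under the assumption that a "word" is anything separated by whitespace; two_word_idx is an illustrative name.

```r
test <- c("two words", "three more words", "something else",
          "this has a lot of words", "more of this", "pick me")

# trimws() drops surrounding whitespace, strsplit() splits on runs of
# whitespace, lengths() counts the resulting pieces per element
two_word_idx <- which(lengths(strsplit(trimws(test), "\\s+")) == 2)
print(two_word_idx)
```

Unlike \\w+, this treats names such as "O'Neil" or hyphenated names as a single word, which may or may not be what you want for person names.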

R Regular expression

I'm learning about the sub & gsub function,
and after reading the definition, I still don't understand what these mean:
".*" , "\s"
Specifically, the question asks what the following code chunk returns, and I have no clue how it works
awards <- c("Won 1 Oscar.",
"Won 1 Oscar. Another 9 wins & 24 nominations.",
"1 win and 2 nominations.",
"2 wins & 3 nominations.",
"Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
"4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
".*" = . means any character and * is 0 or more of the previous.
"\s" = means any white space
So
sub(".* # match any character 0 or more times
\\s # followed by a whitespace character
([0-9]+) # then at least 1 digit; the () capture it as group 1
\\s # followed by another whitespace character
nomination # followed by the word "nomination"
.*$", # then 0 or more characters up to the end of the string
"\\1", awards) # \\1 means replace the match with the contents of group 1
Given your sample of strings, the first string does not have the word "nomination" in it, so the original string is returned unchanged. The other strings all match, so the number immediately preceding the word "nomination" is returned.
Hope this helps.
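The annotated breakdown above is not runnable as written, because the comments sit inside the pattern string. One way to keep a commented pattern runnable is PCRE's extended mode, (?x), which ignores whitespace and # comments inside the regex. A sketch, assuming perl = TRUE is fine for your use; pattern and result are illustrative names.

```r
awards <- c("Won 1 Oscar.",
            "Won 1 Oscar. Another 9 wins & 24 nominations.",
            "1 win and 2 nominations.",
            "2 wins & 3 nominations.",
            "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
            "4 wins & 1 nomination.")

# (?x) switches on extended mode for the rest of the pattern
pattern <- "(?x)
  .*           # any character, 0 or more times
  \\s          # a whitespace character
  ([0-9]+)     # 1 or more digits, captured as group 1
  \\s          # another whitespace character
  nomination   # the literal word
  .*$          # anything up to the end of the string
"

result <- sub(pattern, "\\1", awards, perl = TRUE)
print(result)
```

The first element is returned unchanged and the rest collapse to the captured number, which matches the explanation above.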

Capture entire substring using regex if there is a match on a number

I have been unable to find the answer to this specific question, I am using R to clean some survey data.
I have some messy survey data with question names as columns, that sometimes include a number and sometimes don't. When they include a number, it will often contain some subcharacters as well indicating the question. Example, I have this vector:
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
I want to extract the substrings that contain numbers, and return no results if there is no such match. Desired result (using R)
"1"
"1.a."
NA
"2"
"2.a."
"2.b."
NA
I know I can capture the first number, using
stri_extract_first_regex(questions, "[0-9]+")
But I am at a loss how to modify it to capture the whole string until the first whitespace if it finds a match using this pattern.
For your example data you might use:
[0-9]+(?:\.[a-z]\.)?
That will match:
[0-9]+ Match 1+ digits
(?: Non capturing group
\.[a-z]\. Match a dot, lowercase character and a dot
)? Close non capturing group and make it optional
For example:
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
print(stri_extract_first_regex(questions, "[0-9]+(?:\\.[a-z]\\.)?"))
# [1] "1" "1.a." NA "2" "2.a." "2.b." NA
This might work:
hasnumber <- grepl("[0-9]+",questions)
firstspaces <- sapply(gregexpr(" ", questions), function(x) x[[1]])
res <- ifelse(hasnumber, substr(questions,1,firstspaces-1), NA)
> res
[1] "1" "1.a." NA "2" "2.a." "2.b." NA
The most difficult part, I guess, is finding the position of the first space in each question, which could be done with a loop or, as here, with sapply.
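A slightly shorter variant of the same idea avoids locating the spaces at all: check whether the string starts with a digit and, if so, cut everything from the first whitespace onward. This is a sketch that assumes the numbering always sits at the very start of the string; prefix is an illustrative name.

```r
questions <- c(
  "1 question 1 what do you think?",
  "1.a. question 1a further details on what you think",
  "Please explain",
  "2 question 2 what is your motivation",
  "2.a. further details",
  "2.b. even further details",
  "Please explain")

# Drop everything from the first whitespace on, but only for
# elements that begin with a digit; everything else becomes NA
prefix <- ifelse(grepl("^[0-9]", questions),
                 sub("\\s.*$", "", questions),
                 NA_character_)
print(prefix)
```

Anchoring on ^[0-9] also avoids treating a question as numbered just because a digit appears somewhere later in its text.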
You may use
questions <- sub("^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.*", "\\1", questions)
questions[questions==""] <- NA
questions
# => [1] "1" "1.a." NA "2" "2.a." "2.b." NA
The ^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.* pattern matches
^ - start of string
(\\d+(?:\\.[a-z0-9]+)*\\.?) - Capturing group 1:
  \\d+ - 1+ digits
  (?:\\.[a-z0-9]+)* - 0 or more repetitions of
    \\. - a dot
    [a-z0-9]+ - 1 or more lowercase ASCII letters or digits
  \\.? - an optional dot
.* - any 0+ chars to the end of the string
| - or
.* - the whole string.
The match is replaced with the contents of Group 1. If the second alternative matches, the result is an empty string, and questions[questions==""] <- NA then replaces these elements with NA.

Resources