Extract both occurrences of pattern Regex - r

I have an input vector as follows:
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)", "fdsfs thistoo (1,1,1,1)")
And I would like to use a regex to extract the following:
> output
[1] "iwantthis iwantthisaswell" "thistoo"
I have managed to extract every word that is before an opening bracket.
I tried this to get only the first word:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1", input)
[1] "iwantthis" "thistoo"
But I cannot get it to work for multiple occurrences:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1 \\2", input)
[1] "iwantthis iwantthisaswell" "fdsfs thistoo (1,1,1,1)"
The closest I have managed is the following:
library(stringr)
> str_extract_all(input, "(\\S*)\\s\\(")
[[1]]
[1] "iwantthis (" "iwantthisaswell ("
[[2]]
[1] "thistoo ("
I am sure I am missing something in my regex (not that good at it) but what?

You may use
> sapply(str_extract_all(input, "\\S+(?=\\s*\\()"), paste, collapse=" ")
[1] "iwantthis iwantthisaswell" "thistoo"
See the regex demo. The \\S+(?=\\s*\\() will extract all 1+ non-whitespace chunks from a text before a ( char preceded with 0+ whitespaces. sapply with paste will join the found matches with a space (with collapse=" ").
Pattern details
\S+ - 1 or more non-whitespace chars
(?=\s*\() - a positive lookahead ((?=...)) that requires the presence of 0+ whitespace chars (\s*) and then a ( char (\() immediately to the right of the current position.

Here is an option using base R
unlist(regmatches(input, gregexpr("\\w+(?= \\()", input, perl = TRUE)))
#[1] "iwantthis" "iwantthisaswell" "thistoo"

This works in R:
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', input, perl=TRUE)
Result:
[1] "iwantthis iwantthisaswell" "thistoo"
UPDATED to work for the general case. E.g. now finds "i_wantthisaswell2" by searching on non-spaces between the other matches.
Using other suggested general case inputs:
general_cases <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
"fdsfs thistoo (1,1,1,1) ",
"GaGa iwant_this (1,1,1,1)",
"lal2!##$%^&*()_+a i_wantthisaswell2 (2,3,4,5)")
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', general_cases, perl=TRUE)
results:
[1] "iwantthis iwantthisaswell" "thistoo "
[3] "iwant_this" "i_wantthisaswell2"

Related

gsub replace word not followed by : with word: R

I thought this would be simpler, but I have strings not followed by ':', and strings with : inside the string. I want to append : to strings that don't end in :, and ignore strings that have : inside.
words
[1] "Bajos" "Ascensor" "habs.:3"
gsub('\\b(?!:)', '\\1:', words, perl = TRUE)
[1] ":Bajos:" ":Ascensor:" ":habs:.::3:"
grep('\\W', words)
[1] 3
grep('\\w', words)
[1] 1 2 3 # ?
Desired output:
'Bajos:' 'Ascensor:' 'habs.:3'
sub("^([^:]*)$", "\\1:", words)
# [1] "Bajos:" "Ascensor:" "habs.:3"
or
nocolon <- !grepl(":", words)
words[nocolon] <- paste0(words[nocolon], ":")
words
# [1] "Bajos:" "Ascensor:" "habs.:3"
Use
"(\\p{L}+)\\b(?![\\p{P}\\p{S}])"
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(\p{L}+) one or more letters (group #1)
--------------------------------------------------------------------------------
\b word boundary
--------------------------------------------------------------------------------
(?![\p{P}\p{S}]) no punctuation allowed on the right
--------------------------------------------------------------------------------
R code snippet:
gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", "\\1:", text, perl=TRUE)

R Sub function: pull everything after second number

Trying to figure out how to pull everything after the second number using the sub function in R. I understand the basics with the lazy and greedy matching, but how do I take it one step further and pull everything after the second number?
str <- 'john02imga-04'
#lazy: pulls everything after first number
sub(".*?[0-9]", "", str)
#output: "2imga-04
#greedy: pulls everything after last number
sub(".*[0-9]", "", str)
#output: ""
#desired output: "imga-04"
You can use
sub("\\D*[0-9]+", "", str)
## Or,
## sub("\\D*\\d+", "", str)
## => [1] "imga-04"
See the regex demo. Also, see the R demo online.
sub will find and replace the first occurrence of
\D* (=[^0-9]) - any zero or more non-digit chars
[0-9]+ (=\d+) - one or more digits.
Alternative ways
Match one or more letters, -, one or more digits at the end of the string:
> regmatches(str, regexpr("[[:alpha:]]+-\\d+$", str))
[1] "imga-04"
> library(stringr)
> str_extract(str, "\\p{L}+-\\d+$")
[1] "imga-04"
You can use a capture group for the second part and use that in the replacement
^\D+\d+(\D+\d+)
^ Start of string
\D+\d+ Match 1+ non digits, then 1+ digits
(\D+\d+) Capture group 1, match 1+ non digits and match 1+ digits
Regex demo | R demo
str <- 'john02imga-04'
sub("^\\D+\\d+(\\D+\\d+)", "\\1", str)
Output
[1] "imga-04"
If you want to remove all after the second number:
^\D+\d+(\D+\d+).*
Regex demo
As an alternative getting a match only using perl=T for using PCRE and \K to clear the match buffer:
str <- 'john02imga-04'
regmatches(str, regexpr("^\\D+\\d+\\K\\D+\\d+", str, perl = T))
Output
[1] "imga-04"
See an R demo

regular expression to find exact matching containing a space and a punctuation

I am going through a dataset containing text values (names) that are formatted like this example :
M.Joan (13-2)
A.Alfred (20-13)
F.O'Neil (12-231)
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)
Some strings have two names in it like
M.Joan (13-2) A.Alfred (20-13)
I only want to extract the name from the string.
Some names are easy to extract because they don't have spaces or anything.
However some are hard because they have a space like the last one above.
name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)
When I use this code to extract the names, it works well even for names with spaces or punctuations. However, the extracted names have a space at the end. I was wondering if I can find the exact pattern of " (", a space and a parenthesis.
Output:
[[1]]
[1] "Z.Taylor "
[[2]]
[1] "Z.Taylor "
[[3]]
[1] "Z.Taylor "
[[4]]
[1] "Z.Taylor "
[[5]]
[1] "Y.Berra "
[[6]]
[1] "Y.Berra "
You may use
x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))
See the regex demo
Or the str_extract_all version:
str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")
See the regex demo.
It matches
\p{Lu} - an uppercase letter
.*? - any char other than line break chars, as few as possible, up to the first occurrence of (but not including into the match, as (?=...) is a non-consuming construct)....
(?=\\s*\\() - positive lookahead that, immediately to the right of the current location, requires the presence of:
\\s* - 0+ whitespace chars
\\( - a literal (.

R Regex: removing only the immediate following character after >

I have the following string in R:
string1 = "A((..A>B)A"
I would like to remove all punctation, and the letter immediately after >, i.e. >B
Here is the output I desire:
output = "AAA"
I tried using gsub() as follows:
output = gsub("[[:punct:]]","", string1)
But this gives AABA, which keeps the immediately following character.
This would work using your work plus a leading lookbehind first to look for what comes after the > character.
gsub('(?<=>).|[[:punct:]]', '', "A((..A>B)A", perl=TRUE)
## [1] "AAA"
A slightly less complex regex without the use of perl seems to work for this example as well:
gsub("[[:punct:]]|>(.)", "", "A((..A>B)A")
[1] "AAA"
You say
remove all punctation, and the letter immediately after >
Punctuation is matched with [[:punct:]] and a letter can be matched with [[:alpha:]], thus, you may use a TRE regex with gsub:
string1 = "A((..A>B)A"
gsub(">[[:alpha:]]|[[:punct:]]", "", string1)
# => [1] "AAA"
See the online R demo
Note that > is also a char matched with [[:punct:]], thus, you do not need any lookarounds here, just remove it with a letter after it.
Pattern details:
>[[:alpha:]] - a > and any letter
| - or
[[:punct:]] - a punctuation or symbol.

In R is there a way to extract data based on the beginning and end of a pattern but not the middle data?

In R is there a way to extract data based on the beginning and end of a pattern but not the middle data?
ie. if the following was in a single cell
(1) Number = '1111111111, 0000000000' Text =....
(2) Number = '0000000000' Text =....
it would result in:
(1) 1111111111, 0000000000
(2) 0000000000
I tried:
x1<-str_match(x,"(?<=Number'\\s\\=\\s\\')(\\d|\\s|\\,)\\d\\'")
but that doesn't work.
We can try with str_extract_all
library(stringr)
sapply(str_extract_all(x, "[0-9]+"), toString)
#[1] "1111111111, 0000000000" "0000000000"
You may use a PCRE regex to extract the numbers after Number=' from your input text:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*)\K\d+
See the regex demo.
Pattern details:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*) - either of the two alternatives:
Number\s*=\s*' - Number and a = enclosed with 0+ whitespaces
| - or
\G(?!\A)\s*,\s* - end of the previous successful match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*)
\K - omit the text matched so far
\d+ - 1+ digits (returned as a match)
See the R demo:
> x <- c("(1) Number = '1111111111, 0000000000' Text =....", "(2) Number = '0000000000' Text =....")
> regmatches(x, gregexpr("(?:Number\\s*=\\s*'|\\G(?!\\A)\\s*,\\s*)\\K\\d+", x, perl=TRUE))
[[1]]
[1] "1111111111" "0000000000"
[[2]]
[1] "0000000000"

Resources