Extract strings in round brackets using regex in R


In my transcripts, silent pauses are indicated in round brackets, e.g., (0.9) but also (.) for pauses < 0.3 seconds. I want to extract these pauses. However, transcribers' comments are indicated similarly, namely in double round brackets, e.g. ((coughs)). For this example
yy <- c("well [yes right] (.)", "let's go ((giggles))", "oh [ we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
this extracts all the pauses but also the transcriber comment:
pattern <- "(\\(.*?\\))"
grep(pattern, yy, value=T)
matches <- gregexpr(pattern, yy)
paus <- regmatches(yy, matches)
paus <- unlist(paus)
paus
[1] "(.)" "((giggles)" "(0.5)" "(3.2)"
To get rid of the comment, I tried this:
pattern <- "\\([^\\(].*?\\)[^\\)].*?"
That found "(0.5)" but failed to find the string-final pauses "(.)" and "(3.2)".
Any pointer?

Another option with gsub:
gsub("[^(]*(\\(([.0-9]+)\\)|\\b|\\B)[^)]*", "\\2", yy)
#[1] "." "" "0.5" "" "3.2"
Explanation of the pattern:
. [^(]*: anything except an open bracket, 0 or more times
. (\\(([.0-9]+)\\)|\\b|\\B) : what we want to capture : an open bracket followed by a dot or digits, one or more times, followed by a closing bracket (we only want to capture the dot or digits part, hence \\2 in the replacement part) or the empty string that can be at the edge of a word (\\b) or not (\\B). N.B: Here we are not keeping the brackets around the pauses times but we could.
. [^)]*: anything except a closing bracket, 0 or more times

We can use str_extract with a pattern that matches an opening bracket, optional digits, a decimal point, an optional digit, and a closing bracket. The digits are made optional ("?") so that the bare "(.)" is matched as well.
library(stringr)
vec <- str_extract(yy, "(\\((\\d+)?(\\.(\\d)?\\)))")
vec
#[1] "(.)" NA "(0.5)" NA "(3.2)"
and then use is.na to remove NA elements
vec[!is.na(vec)]
#[1] "(.)" "(0.5)" "(3.2)"
Or using the same regular expression with base R regmatches saves a step to remove NA values.
regmatches(yy, regexpr("(\\((\\d+)?(\\.(\\d)?\\)))", yy))
#[1] "(.)" "(0.5)" "(3.2)"
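A further base R sketch, not taken from the answers above: if a pause is assumed to be either (.) or a bracketed number, and a transcriber comment is assumed to be always double-bracketed, PCRE lookarounds can reject any bracket that touches another bracket. The names pause_pattern, pauses and durations are illustrative only.

```r
# Same data as in the question; \u00b0 is the degree sign from the transcript
yy <- c("well [yes right] (.)", "let's go ((giggles))",
        "oh [ we::ll] i do n't (0.5) know",
        "erm [\u00b0well right\u00b0 ]", "(3.2)")

# "(" not preceded by "(", then "." or a number, then ")" not followed by ")"
pause_pattern <- "(?<!\\()\\((?:\\.|\\d+(?:\\.\\d+)?)\\)(?!\\))"

# gregexpr() finds every pause in each string; perl = TRUE enables lookarounds
pauses <- unlist(regmatches(yy, gregexpr(pause_pattern, yy, perl = TRUE)))
pauses
# [1] "(.)"   "(0.5)" "(3.2)"

# Strip the brackets to keep only the duration marker
durations <- gsub("[()]", "", pauses)
durations
# [1] "."   "0.5" "3.2"
```

Because gregexpr() is used rather than regexpr(), a line containing several pauses would yield all of them.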

Related

R Use Regular Expression to capture number when sometimes the capture is at the end of the string or not

I need to capture the numbers out of a string that come after a certain parameter name.
I have it working for most, but there is one parameter that is sometimes at the end of the string, but not always. When using the regular expression, it seems to matter.
I've tried different things, but nothing seems to work in both cases.
# Regular expression to capture the digit after the phrase "AppliedWhenID="
p <- ".*&AppliedWhenID=(.\\d*)"
# Tried this, but when at end, it just grabs a blank
#p <- ".*&AppliedWhenID=(.\\d*)&.*|.*&AppliedWhenID=(.\\d*)$"
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
# What should be returned is "2"
gsub(p, "\\1", testAtEnd) # works
gsub(p, "\\1", testNotAtEnd) # doesn't work, it captures 2 + &AgDateTypeID=1

Note that sub and gsub replace the found text(s), thus, in order to extract a part of the input string with a capturing group + a backreference, you need to actually match (and consume) the whole string.
Hence, you need to match the string to the end by adding .* at the end of the pattern:
p <- ".*&AppliedWhenID=(\\d+).*"
sub(p, "\\1", testNotAtEnd)
# => [1] "2"
sub(p, "\\1", testAtEnd)
# => [1] "2"
Note that gsub replaces every occurrence; since only a single match is needed here, it makes sense to use sub instead.
Regex details
.* - any zero or more chars as many as possible
&AppliedWhenID= - a &AppliedWhenID= string
(\d+) - Group 1 (\1): one or more digits
.* - any zero or more chars as many as possible.
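One caveat with this technique is that sub returns the input unchanged when the pattern does not match at all. A minimal sketch of a guard follows; extract_param is a hypothetical helper, the shortened query strings are made up for illustration, and the parameter name is assumed to contain no regex metacharacters.

```r
# Hypothetical helper: value of a numeric query parameter, or NA if it is absent
extract_param <- function(x, name) {
  # Match the whole string so that sub() leaves only the captured digits
  pattern <- paste0(".*&", name, "=(\\d+).*")
  # sub() returns its input unchanged on a non-match, so test with grepl() first
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

urls <- c("ReportType=233&AppliedWhenID=2",
          "ReportType=233&AppliedWhenID=2&AgDateTypeID=1",
          "ReportType=233&AgDateTypeID=1")
result <- extract_param(urls, "AppliedWhenID")
result
# [1] "2" "2" NA
```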

You could try a positive lookbehind, (?<=...), with str_extract() from the stringr library.
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
p <- "(?<=AppliedWhenID=)\\d+"
# What should be returned is "2"
library(stringr)
str_extract(testAtEnd, p)
str_extract(testNotAtEnd, p)
Or in base R
p <- ".*((?<=AppliedWhenID=)\\d+).*"
gsub(p, "\\1", testAtEnd, perl=TRUE)
gsub(p, "\\1", testNotAtEnd, perl=TRUE)
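As a sketch of a further base R variant (an addition, not part of the answer above), the lookbehind can be combined with regexpr and regmatches, which extract the match directly so the rest of the string does not have to be consumed:

```r
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
x <- c(testAtEnd, testNotAtEnd)

# Digits that directly follow "AppliedWhenID="; lookbehind needs perl = TRUE
ids <- regmatches(x, regexpr("(?<=AppliedWhenID=)\\d+", x, perl = TRUE))
ids
# [1] "2" "2"
```

Note that regmatches silently drops elements without a match, so the result can be shorter than the input.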

How to extract words containing combinations of certain characters in R

In this sample text:
turns <- tolower(c("Does him good to stir him up now and again .",
"When , when I see him he w's on the settees .",
"Yes it 's been eery for a long time .",
"blissful timing , indeed it was "))
I'd like to extract all words that contain the letters y and e, no matter in what position or combination, namely yes and eery, using str_extract from stringr.
This regex, which requires y to occur immediately before e, not surprisingly matches only yes but not eery:
unlist(str_extract_all(turns, "\\b([a-z]+)?ye([a-z]+)?\\b"))
[1] "yes"
Putting y and e into a character class doesn't give me the desired result either, because every word containing either y or e is matched:
unlist(str_extract_all(turns, "\\b([a-z]+)?[ye]([a-z]+)?\\b"))
[1] "does" "when" "when" "see" "he" "the" "settees" "yes" "been" "eery" "time" "indeed"
So what is the right solution?

You may use both base R and stringr approaches:
stringr::str_extract_all(turns, "\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE))
Or, without turning the strings to lower case, you may use a case insensitive matching with (?i):
stringr::str_extract_all(turns, "(?i)\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE, ignore.case=TRUE))
Also, if you want to make it a tiny bit more efficient, you may use the principle of contrast in the lookahead patterns: match any letters but y in the first and any letters but e in the second, using character class subtraction (ICU syntax, so stringr only):
stringr::str_extract_all(turns, "(?i)\\b(?=[\\p{L}--[y]]*y)(?=[\\p{L}--[e]]*e)\\p{L}+\\b")
Details
(?i) - case insensitive modifier
\b - word boundary
(?=\p{L}*y) - after 0 or more Unicode letters, there must be y ([\p{L}--[y]]* matches any 0 or more letters but y up to the first y)
(?=\p{L}*e) - after 0 or more Unicode letters, there must be e ([\p{L}--[e]]* matches any 0 or more letters but e up to the first e)
\p{L}+ - 1 or more Unicode letters
\b - word boundary
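The one-lookahead-per-letter idea described above generalizes to any set of required letters. The following is a rough base R sketch; words_with_all is a hypothetical helper, it swaps \p{L} for the POSIX class [[:alpha:]], and it assumes each entry of required is a single plain letter.

```r
# Hypothetical helper: words that contain every letter listed in `required`
words_with_all <- function(text, required) {
  # One lookahead per required letter, all anchored at the start of the word
  lookaheads <- paste0("(?=[[:alpha:]]*", required, ")", collapse = "")
  pattern <- paste0("\\b", lookaheads, "[[:alpha:]]+\\b")
  unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE, ignore.case = TRUE)))
}

turns <- c("Does him good to stir him up now and again .",
           "Yes it 's been eery for a long time .")
words_with_all(turns, c("y", "e"))
# [1] "Yes"  "eery"
words_with_all(turns, c("y", "e", "r"))
# [1] "eery"
```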

In case there is no urgent need to use stringr::str_extract you can get words containing the letters y and e in base with strsplit and grepl like:
tt <- unlist(strsplit(turns, " "))
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "eery"
In case the text also contains tokens that mix letters with digits or punctuation:
turns <- c("yes no ay ae 012y345e year.")
tt <- regmatches(turns, gregexpr("\\b[[:alpha:]]+\\b", turns))[[1]]
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "year"

Extracting string between punctuation, when present

I'm trying to extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
The current code is able to do between : and ; OR after : but cannot handle ; alone or : alone.
Also, gsub(|(OF 100); SEE NOTE) isn't working, and I'm not sure why the initial : isn't being excluded and needs the gsub at all.
test<-c("Score (ABC): 2 (of 100); see note","Amount of ABC; 30%","Presence of ABC: negative","ABC not tested")
#works for :/;
toupper((regmatches(toupper(test), gregexpr(":\\s* \\K.*?(?=;)", toupper(test), perl=TRUE))))
#works for :
test<-toupper((regmatches(toupper(test), gregexpr(":\\s* (.*)", toupper(test), perl=TRUE))))
#removes extra characters:
test<-gsub(": |(OF 100); SEE NOTE|%|; ","",test)
#Negative to numeric:
test[grepl("NEGATIVE|<1",test)]<-0
test
Expected result: 2 30 0

Here are some solutions.
The first two are base. The first only uses very simple regular expressions. The second is shorter and the regular expression is only a bit more complicated. In both cases we return NA if there is no match but you can replace NAs with 0 (using ifelse(is.na(x), 0, x) where x is the answer with NAs) afterwards if that is important to you.
The third is almost the same as the second but uses strapply in gsubfn. It returns 0 instead of NA.
1) read.table Replace all colons with semicolons and read it in as semicolon-separated fields. Pick off the second such field and remove the first non-digit and everything after it. Then convert what is left to numeric.
DF <- read.table(text = gsub(":", ";", test),
as.is = TRUE, fill = TRUE, sep = ";", strip.white = TRUE)
as.numeric(sub("\\D.*", "", DF$V2))
##[1] 2 30 NA NA
2) strcapture Match from the start characters which are not colon or semicolon and then match a colon or semicolon and then match a space and finally capture digits. Return the captured digits converted to numeric.
strcapture("^[^:;]+[;:] (\\d+)", test, list(num = numeric(0)))$num
##[1] 2 30 NA NA
3) strapply Using the same pattern as in (2) convert the match to numeric and return 0 if the match is empty.
library(gsubfn)
strapply(test, "^[^:;]+[;:] (\\d+)", as.numeric, simplify = TRUE, empty = 0)
## [1] 2 30 0 0

Another approach:
out <- gsub('(^.+?[;:][^0-9]+)(\\d+)(.*$)|^.+', '\\2', test)
out[out == ''] <- 0
as.numeric(out)
## [1] 2 30 0 0

Per the OP's description (italics is mine):
extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
I think some of the other suggestions may miss that italicized criteria. So here is the OP's test set with one extra condition at the end to test that:
test<-c( "Score (ABC): 2 (of 100); see note",
"Amount of ABC; 30%",
"Presence of ABC: negative",
"...and before a ; if the second punctuation is present, then remove everything after a ; if present [so 666 should not be returned]")
One-liner to return results as requested:
sub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='\\1',
x=test, perl=TRUE)
Results matching OP's request:
[1] "2" "30" "" ""
If the OP really wants an integer with zeros where no match is found, set the sub() replacement = '0\\1' and wrap with as.integer() as follows:
as.integer( sub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='0\\1',
x=test, perl=TRUE) )
Result:
[1] 2 30 0 0
Fully working online R (R 3.3.2) example:
https://ideone.com/TTuKzG
Regexp explanation
OP wants to find just one match in a string so the sub() function works just fine.
Technique for using sub() is to make a pattern that matches all strings, but use a capture group in the middle to capture zero or more digits if conditions around it are met.
The pattern .+?[:;]\\D*?[^;](\\d*).*?;*.* is read as follows
.+? Match any character (except for line terminators) + between one and unlimited times ? as few times as possible, expanding as needed (lazy)
[:;] Match a single character in the list between the square brackets, in this case : or ;
\\D Match any character that's NOT a digit (equal to [^0-9])
*? Quantifier * Matches between zero and unlimited times ? as few times as possible, expanding as needed (lazy)
[^;] The ^ hat as first character between square brackets means: Match a single character NOT present in the list between the square brackets, in this case match any character NOT ;
(\\d*) Everything between curved brackets is a capturing group - this is the 1st capturing group: \\d* matches a digit (equal to [0-9]) between zero and unlimited times, as many times as possible (greedy)
;* Match the ; character * between zero and unlimited times [so ; does not have to be present but is matched if it is there: This is the key to excluding anything after the second delimiter as the OP requested]
.* Match any character * between zero and unlimited times, as many times as possible (greedy) [so picks up everything to the end of the line]
The replacement = \\1 refers to the 1st capture group in our pattern. We replace everything that was matched by the pattern with what we found in the capture group. \\d* can match no digits, so will return an empty string if there is no number found where we are expecting it.
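The capture-group idea can also be applied without replacing the whole string. Below is a minimal base R sketch using regexec, which returns the capture groups directly. It assumes the number, when present, follows the first : or ; after optional whitespace; the name scores is illustrative.

```r
test <- c("Score (ABC): 2 (of 100); see note", "Amount of ABC; 30%",
          "Presence of ABC: negative", "ABC not tested")

# regexec() returns the full match plus each capture group per string
m <- regmatches(test, regexec("^[^:;]*[:;]\\s*(\\d+)", test))

# Two elements (full match + group 1) mean a number was found; otherwise use 0
scores <- vapply(m, function(g) if (length(g) == 2) as.numeric(g[2]) else 0,
                 numeric(1))
scores
# [1]  2 30  0  0
```

Anything after the first number, including a second ; and the text behind it, is never captured, so no clean-up step is needed afterwards.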

Extract substring in R using grepl

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete

I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA

With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"

Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
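Please note:
gsub is needed here rather than sub, because the pattern has to match twice per string: first ^.*Work, then \\.csv$.
Classes such as [\\s\\S] or [\\d\\D] do not work with sub/gsub under the default regex engine; they need perl=TRUE:
https://regex101.com/r/wFgkgG/1
They do work with akrun's approach (v1 is the vector of filenames):
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = T))
The following shows how . treats line breaks under each engine:
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
With the default engine (perl=F), . also matches \n; with perl=T it does not.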

Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
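Since regmatches drops elements that do not match, the result above has two elements for three inputs. If the output needs to stay aligned with the input (for example as a new data frame column), one possible sketch is to fill a vector of NAs at the matching positions; the names m and out are illustrative.

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# regexpr() returns -1 for elements without a match
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start from all-NA and fill in only the positions that matched
out <- rep(NA_character_, length(v1))
out[m != -1] <- regmatches(v1, m)
out
# [1] "start"    "complete" NA
```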

Using gsub or sub function to only get part of a string?

Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
I have a column which has 75 rows of variables such as the col above. I am not quite sure how to use gsub or sub in order to get up until the integers after the first colon.
Expected output:
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this but it doesn't seem to work:
gsub("*..:","", df$col)

Following may help you here too.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
Output will be as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
Where Input for data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation: the call is split across lines below for explanation purposes only.
sub(" ##sub replaces only the first match in each element, which is enough here.
([^:]*) ##The parentheses store the match in capture group 1: everything up to the first colon.
: ##A literal colon.
([^:]*) ##Capture group 2: everything up to the next colon (or the end of the string).
.*" ##Matches everything that is left, from the second colon onwards.
,"\\1:\\2" ##The replacement: group 1, a colon, then group 2.
,df$dat) ##The input vector: the dat column of df.
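The same "everything up to the second colon" logic can be written as an extraction instead of a substitution. This is a sketch using the question's sample values; the name trimmed is illustrative.

```r
col <- c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
         "WBU-ARFU*11:03:05", "WBU-ARFU*03:456")

# From the start: non-colons, one colon, then non-colons up to the next colon
trimmed <- regmatches(col, regexpr("^[^:]*:[^:]*", col))
trimmed
# [1] "WBU-ARGU*06:03"  "WBU-ARDU*08:01"  "WBU-ARFU*11:03"  "WBU-ARFU*03:456"
```

Unlike sub, regmatches omits values that contain no colon at all, so the result is only guaranteed to line up with the input when every value has at least one colon.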

You may use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
Details
(\\d:\\d+) - Capturing group 1 (its value will be accessible via \1 in the replacement pattern): a digit, a colon and 1+ digits.
: - a colon
\\d+ - 1+ digits
$ - end of string.
R Demo:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)
Here,
^ - start of string
(.*?:\\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
However, this pattern should be used with the PCRE regex engine, so pass perl=TRUE:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"

sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternatively match what you want (instead of subbing out what you don't want) with stringi:
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")
