Extract characters of single word following : - r

I would like to extract the name of the drug, where a label such as "Drug:", "Other:", etc. precedes the drug name.
Take the first word after every ":", including characters like "-".
If there are two instances of ":", the two words should be joined by "and" into one string. The output should be a one-column data frame with the column name Drug.
Here is my reproducible example:
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
The output should look something like this:
output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-maltoside", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))
This is what I've tried, which didn't work.
Attempt 1:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)')
Attempt 2:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')

I am not so familiar with R, but a pattern that would give you the matches from the example data could be:
(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*
Then you could concatenate the matches with " and " in between.
The pattern matches:
(?<=:\s) Positive lookbehind, assert : and a whitespace char to the left
\w+(?:-\w+)* Match 1+ word chars, followed by optionally repeating - and 1+ word chars
(?: Non capture group
and \w+(?:-\w+)* Match and followed by 1+ word chars followed by optionally repeating - and 1+ word chars
)* Close non capture group and optionally repeat
To get all the matches, you can use str_extract_all
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
For example
library(stringr)
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
lapply(
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
, paste, collapse=" and ")
Output
[[1]]
[1] "TLD-1433"
[[2]]
[1] "CG0070 and n-dodecyl-B-D-maltoside"
[[3]]
[1] "Atezolizumab"
[[4]]
[1] "N-803 and BCG and N-803"
[[5]]
[1] "Everolimus and Intravesical"
[[6]]
[1] "Association and Association"
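The question asks for a one-column data frame rather than a list. A minimal sketch of that last step, assuming base R only (regmatches and gregexpr with perl = TRUE stand in for str_extract_all here) and a shortened copy of the example data:

```r
# Shortened copy of the example data (assumption: three rows are enough to illustrate)
x <- c("Product: TLD-1433 infusion Therapy",
       "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
       "Drug: N-803 and BCG|Drug: N-803")

# Same pattern as in the answer above; the lookbehind needs the PCRE engine
pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# One character vector of matches per input string
hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Join the matches of each row with " and " and wrap them in a data frame
output.df <- data.frame(
  Drugs = vapply(hits, paste, character(1), collapse = " and "),
  stringsAsFactors = FALSE
)
output.df$Drugs
# [1] "TLD-1433"  "CG0070 and n-dodecyl-B-D-maltoside"  "N-803 and BCG and N-803"
```

vapply is used instead of lapply so that the result is a plain character vector that can go straight into data.frame.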

Use
:\s*\b([\w-]+\b(?:\s+and\s+\b[\w-]+)*)\b
EXPLANATION
: - a literal ':'
\s* - whitespace (\n, \r, \t, \f, and " "), 0 or more times, as many as possible
\b - the boundary between a word char (\w) and something that is not a word char
( - group and capture to \1:
[\w-]+ - any character of: word characters (a-z, A-Z, 0-9, _), '-', 1 or more times, as many as possible
\b - the boundary between a word char (\w) and something that is not a word char
(?: - group, but do not capture (0 or more times, as many as possible):
\s+ - whitespace (\n, \r, \t, \f, and " "), 1 or more times, as many as possible
and - 'and'
\s+ - whitespace (\n, \r, \t, \f, and " "), 1 or more times, as many as possible
\b - the boundary between a word char (\w) and something that is not a word char
[\w-]+ - any character of: word characters (a-z, A-Z, 0-9, _), '-', 1 or more times, as many as possible
)* - end of grouping
) - end of \1
\b - the boundary between a word char (\w) and something that is not a word char
R code:
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
library(stringr)
matches <- str_match_all(my.df$col1, ":\\s*\\b([\\w-]+\\b(?:\\s+and\\s+\\b[\\w-]+)*)\\b")
Drugs <- sapply(matches, function(z) paste(z[,-1], collapse=" and "))
output.df <- data.frame(Drugs)
output.df
Results:
Drugs
1 TLD-1433
2 CG0070 and n-dodecyl-B-D-maltoside
3 Atezolizumab
4 N-803 and BCG and N-803
5 Everolimus and Intravesical
6 Association and Association
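Both answers keep "and BCG" in row 4, while the question's text asks for only the first word after each ":". If that stricter reading is what is wanted, a minimal sketch (base R, an assumption about the intent rather than either answerer's method) could drop the optional "and ..." part of the pattern:

```r
# Two rows from the example data where the two readings differ
x <- c("Drug: N-803 and BCG|Drug: N-803",
       "Drug: Everolimus and Intravesical Gemcitabine")

# Only the run of word characters and hyphens right after ": "
first_word <- "(?<=:\\s)[\\w-]+"

hits <- regmatches(x, gregexpr(first_word, x, perl = TRUE))
Drugs <- vapply(hits, paste, character(1), collapse = " and ")
Drugs
# [1] "N-803 and N-803" "Everolimus"
```

Note that this gives "Everolimus" for the second row, not the "Everolimus and Intravesical" shown in the question's expected output, so the two rows of the expected output cannot both be reproduced by a single rule.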

Related

Change name of certain character and location in filenames

I want to change one of the underscores (_) to another character, for example -, because these filenames cause problems when they are read in. I want a to become like b, that is, I want to change the second-to-last underscore. How can I specify this efficiently?
gsub("_", "-") replaces every underscore, so the position must be specified as well.
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
b <- c("2018-01-09_B2_HILIC_POS_123--14b_090.mzML", "2018-01-09_B2_HILIC_POS_243--12a_026.mzML", "2020-01-09_B2_HILIC_POS_415-893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV4004365711_040.mzML")
Here is one base R option using sub :
sub('(.*)(_)(.*_.*)$', '\\1-\\3', a)
#[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
#[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
#[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
#[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
Here we divide the string into 3 groups:
The 1st group is everything up to the second-to-last underscore, which is captured using (.*) and used as a backreference (\\1).
The 2nd group is the second-to-last underscore, which is replaced with -.
The 3rd group is everything after the second-to-last underscore, which is captured using (.*_.*) and used as a backreference (\\3).
Use
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
Explanation
_ - a literal '_'
(?= - look ahead to see if there is:
[^_]* - any character except '_' (0 or more times, as many as possible)
_ - a literal '_'
[^_]* - any character except '_' (0 or more times, as many as possible)
$ - before an optional \n, and the end of the string
) - end of look-ahead
See R proof:
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
Results:
[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
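As a cross-check on the two regex answers, the same replacement can be sketched without a pattern by locating the underscores directly. The helper name replace_from_end and its n parameter are hypothetical, and the sketch assumes a single-character replacement.

```r
# Hypothetical helper: replace the n-th underscore counted from the end
replace_from_end <- function(x, n = 2, replacement = "-") {
  pos <- gregexpr("_", x, fixed = TRUE)  # positions of every underscore
  for (i in seq_along(x)) {
    p <- pos[[i]]
    # Skip missing values and strings with fewer than n underscores
    if (!is.na(p[1]) && p[1] != -1 && length(p) >= n) {
      k <- p[length(p) - n + 1]
      substr(x[i], k, k) <- replacement
    }
  }
  x
}

a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML",
       "2020-01-09_B2_HILIC_POS_415_893a_059.mzML")
replace_from_end(a)
# [1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
# [2] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
```

Strings with fewer than n underscores are returned unchanged, which matches what sub does when the pattern finds nothing to replace.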

How to remove these elements from my string in R [closed]

One of my columns contains the following strings:
259333154-Carat Programmatic»FCO»O»EV3»D&B - FCO Prospects ABM DDD 2020 D&B ABM ENDING 1/31/2020»NA+728 x 90»IN»UNV»TP»PM»dCPM»NAT»BTP»RON»NA»DB»N/A»ENG»M»P159WXZ
238114259-Carat Programmatic»CPO XT5»O»EV2»END DATE 2/28/19 Google Custom Intent - XT5 CPO Google Custom Intent Audience/In Market Audience»NA+160 x 600»IN»DSK»TP»PM»CPM»NAT»BTS»RON»NA»DB»N/A»ENG»M»PX5LV6
251368220-Carat Programmatic»XT6»O»EV1»END DATE 9/30/19 2019 Cadillac SMRT Always On - CRM - SMRT Segment XT6 Desktop Display»NA+160 x 600»IN»DSK»TP»PM»CPM»NAT»PRS»RON»NA»DB»N/A»ENG»M»P12LQG3
235105211-Ebay»Silverado 1500»M»CON»ended 3/20 - ROS - eBay Run of Motors Desktop»NA+300 x 250»IN»DSK»TP»PM»CPM»NAT»SCR»ROS»NA»DB»N/A»ENG»M»PW79JH
234990143-Carat Programmatic»XT4»O»EV2»Endemic - Loyalist|Edmunds&Oracle|N/A|Vehicle|N/A 2»NA+300 x 250»IN»MOB»TP»PM»CPM»NAT»BTO»RON»IMW»DB»N/A»ENG»M»PW7NSN
My task is to remove the following from the strings:
Dates (e.g. 1/31/2020, 3/20)
Strings like Ending, ENDING, END, end, End, END DATE but NOT the "end" from strings that have endemic in them like the last one
Double spaces e.g. "  "
I am having a really hard time with this. I am using several lines of code as follows to perform some of the operations, but don't know how to go about doing this more succinctly and completely:
dat = gsub("2017|2018|2019|2020", "" ,dat)
dat = gsub("»»", "»", dat)
dat = gsub("END ", "", dat)
dat = gsub("end ", "", dat)
dat = gsub("Ended ", "", dat)
dat = gsub("ENDED ", "", dat)
dat = gsub("DATE|date|Date", "", dat)
Thanks so much for your help!
Check the following:
\b(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\d\d\b should handle the "dates (e.g. 1/31/2020, 3/20)" case
(?i)\bEnd(?: DATE|(?:ing)?)\b should handle the "strings like "Ending", "ENDING", "END", "end", "End", "END DATE" but NOT the "end" from strings that have "endemic" in them like the last one" case
([\s»]){2,} should handle the "double spaces e.g. "  "" case.
Combining all:
gsub("\\b(?:End(?:\\s+DATE|(?:ing)?)|(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d)\\b|([\\s»]){2,}", "\\1", x, perl=TRUE, ignore.case=TRUE)
Explanation
\b - the boundary between a word char (\w) and something that is not a word char
(?: - group, but do not capture:
End - 'End'
(?: - group, but do not capture:
(space)DATE - ' DATE'
| - OR
(?: - group, but do not capture (optional, as many as possible):
ing - 'ing'
)? - end of grouping
) - end of grouping
| - OR
(?: - group, but do not capture:
0? - '0' (optional, as many as possible)
[1-9] - any character of: '1' to '9'
| - OR
1 - '1'
[012] - any character of: '0', '1', '2'
) - end of grouping
(?: - group, but do not capture (optional, as many as possible):
[-/.] - any character of: '-', '/', '.'
(?: - group, but do not capture:
0? - '0' (optional, as many as possible)
[1-9] - any character of: '1' to '9'
| - OR
[12] - any character of: '1', '2'
[0-9] - any character of: '0' to '9'
| - OR
3 - '3'
[01] - any character of: '0', '1'
) - end of grouping
)? - end of grouping
[-/.] - any character of: '-', '/', '.'
(?: - group, but do not capture (optional, as many as possible):
19 - '19'
| - OR
20 - '20'
)? - end of grouping
\d - digits (0-9)
\d - digits (0-9)
) - end of grouping
\b - the boundary between a word char (\w) and something that is not a word char
| - OR
( - group and capture to \1 (at least 2 times, as many as possible):
[\s»] - any character of: whitespace (\n, \r, \t, \f, and " "), '»'
){2,} - end of \1 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
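Because gsub looks for matches in the original string in a single pass, removing a word and a date that were separated by single spaces can leave a double space behind. A minimal sketch of a sequential alternative, reusing the two patterns from this answer and collapsing whitespace last; the sample strings are shortened ASCII-only stand-ins for the real data, and the variable names are assumptions:

```r
# Shortened, ASCII-only stand-ins for the strings in the question
x <- c("FCO Prospects ABM ENDING 1/31/2020 Display",
       "END DATE 2/28/19 Google Custom Intent",
       "Endemic - Loyalist 2")

# Patterns taken from the answer above
end_re  <- "\\bEnd(?: DATE|(?:ing)?)\\b"
date_re <- "\\b(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d\\b"

step1 <- gsub(end_re, "", x, perl = TRUE, ignore.case = TRUE)  # END, Ending, END DATE ...
step2 <- gsub(date_re, "", step1, perl = TRUE)                 # 1/31/2020, 2/28/19 ...
cleaned <- trimws(gsub("\\s{2,}", " ", step2, perl = TRUE))    # collapse leftover spaces
cleaned
# [1] "FCO Prospects ABM Display" "Google Custom Intent" "Endemic - Loyalist 2"
```

The word boundary after End keeps "Endemic" intact, as in the combined pattern.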

Regex: Extracting numbers from parentheses with multiple matches

How do I match the year such that it is general for the following examples.
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'
I have tried the following, but did not have the biggest success.
gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
What I thought it did was to go until it finds a (, then it would make a group of numbers, then any character until it meets a ). And if there are several matches, I want to extract the first group.
Any suggestions to where I go wrong? I have been doing this in R.
You could use
library(stringr)
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')
years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
years
# [1] "1953" "1998"
The expression here is
\( # (
(\d+ # capture 1+ digits
(?: B\.C\.)? # an optional " B.C."
)
Note that backslashes need to be escaped in R.
Your pattern contains greedy .+ parts that match one or more characters, as many as possible, so at best it can capture the last parenthesized number in each string rather than the first.
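The effect of greediness can be seen in a small sketch that compares a greedy and a lazy prefix on the first example string. The two patterns are simplified illustrations, not the question's pattern or the one suggested next:

```r
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'

# Greedy .* runs to the end and backtracks to the LAST "(" followed by digits
greedy <- sub("^.*\\((\\d+).*$", "\\1", a, perl = TRUE)

# Lazy .*? stops at the FIRST "(" followed by digits
lazy <- sub("^.*?\\((\\d+).*$", "\\1", a, perl = TRUE)

greedy
# [1] "399"
lazy
# [1] "1953"
```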
You may use
^.*?\((\d{4})(?:/[^)]*)?\).*
Replace with \1 to only keep the 4 digit number. See the regex demo.
Details
^ - start of string
.*? - any 0+ chars as few as possible
\( - a (
(\d{4}) - Group 1: four digits
(?: - start of an optional non-capturing group
/ - a /
[^)]* - any 0+ chars other than )
)? - end of the group
\) - a ) (OPTIONAL, MAY BE OMITTED)
.* - the rest of the string.
See the R demo:
a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a)
# => [1] "1953" "1998" "1996"
Another base R solution is to match the 4 digits after (:
regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"
The \(\K\d{4} pattern matches ( and then drops it from the match thanks to the \K match reset operator. The (?=(?:/[^)]*)?\)) lookahead then requires an optional / followed by 0+ chars other than ), and then a ). Note that regexpr extracts the first match only.

RegularExpressionValidator for word count that handles punctuation

I thought I had a great regex for limiting the number of words entered into a TextBox, however I discovered, it fails when there is punctuation in the text.
How can I modify this regex (or use a different one) that correctly counts words that may be made up of several sentences or contain other symbols?
^(?:\b\w+\b[\s\r\n]*){1,10}$
This limits the words to 10.
I think this is the ONLY WAY it can be done.
I would like to see a better way if it exists.
(requires an atomic group)
For Unicode:
^\s*(?>[^\pL\pN]*[\pL\pN](?:[\pL\pN_-]|\pP(?=[\pL\pN\pP_-])|[?.!])*\s*){1,10}$
Explained
^
\s*
(?>
[^\pL\pN]* [\pL\pN] # Not letters/numbers, followed by letter/number
(?:
[\pL\pN_-] # Letter/number, '_' or '-'
|
\pP # Or, punctuation if followed by punctuation/letter/number, '_' or '-'
(?= [\pL\pN\pP_-] )
|
[?.!] # Or, (Add) Special word ending punctuation
)*
\s*
){1,10}
$
For Ascii:
^\s*(?>[\W_]*[^\W_](?:\w|[[:punct:]_-](?=[\w[:punct:]-])|[?.!])*\s*){1,10}$
Expanded
^
\s*
(?>
[\W_]* [^\W_]
(?:
\w
|
[[:punct:]_-]
(?= [\w[:punct:]-] )
|
[?.!]
)*
\s*
){1,10}
$
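A quick way to try the ASCII pattern is sketched below with R's PCRE engine (grepl with perl = TRUE), which supports atomic groups and POSIX classes. This is only a stand-in: how the pattern behaves inside the validator's own regex engine is not verified here, and the sample texts are made up for illustration.

```r
# ASCII version of the pattern from the answer, with backslashes doubled for R
word_limit <- "^\\s*(?>[\\W_]*[^\\W_](?:\\w|[[:punct:]_-](?=[\\w[:punct:]-])|[?.!])*\\s*){1,10}$"

five   <- "Hello, world! This is fine."            # 5 words with punctuation
ten    <- paste(rep("word", 10), collapse = " ")   # exactly 10 words
eleven <- paste(rep("word", 11), collapse = " ")   # one word too many

grepl(word_limit, c(five, ten, eleven), perl = TRUE)
# [1]  TRUE  TRUE FALSE
```

The atomic group (?>...) is what stops the engine from splitting one word across two iterations to make an over-long text fit.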

r remove words which begins and ends with dashes

How can I remove words that begin or end with a dash from a string like this:
x <- "word -o -dod -3 -33 dp-pd -d- --- 140 -- s- S- SS- s3- 3e- 33- 3- s SS avf-ada"
And obtain:
word dp-pd 140 s SS avf-ada
Standalone dashes may be removed as well.
I've found a solution thanks to regex101: (\s-\S+)|(\S+-\s)
I suggest using
x <- "word -o -dod -3 -33 dp-pd -d- --- 140 -- s- S- SS- s3- 3e- 33- 3- s SS avf-ada -"
trimws(gsub("(?:\\S+-\\B|\\B-\\S+|\\B-\\B)\\s*", "", x, perl=TRUE))
See the regex demo and an R demo.
Details:
(?:\S+-\B|\B-\S+|\B-\B) - any of the three alternatives:
\S+-\B - 1+ chars other than whitespace, - and a non-word boundary, that is, the - must be either at the end of string or before a non-word char
| - or
\B-\S+ - a non-word boundary, that is, the - should only be matched if preceded with a non-word char or start of string, then a hyphen and 1+ chars other than whitespace
| - or
\B-\B - any - enclosed with non-word boundaries (at the end/start of string or between non-word chars)
\s* - 0+ whitespaces.
perl=TRUE needs to be used because the non-word boundary \B does not work correctly with the default TRE regex engine.
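For comparison, the same result can be sketched without a regex for the dashes themselves, by splitting on whitespace and dropping every token that starts or ends with a dash. This assumes words are separated by whitespace only and uses base R's startsWith and endsWith:

```r
x <- "word -o -dod -3 -33 dp-pd -d- --- 140 -- s- S- SS- s3- 3e- 33- 3- s SS avf-ada"

# Split into whitespace-separated tokens
tokens <- strsplit(x, "\\s+")[[1]]

# Keep tokens that neither begin nor end with "-" (this also drops "--" and "---")
keep <- !(startsWith(tokens, "-") | endsWith(tokens, "-"))

result <- paste(tokens[keep], collapse = " ")
result
# [1] "word dp-pd 140 s SS avf-ada"
```

Words with an inner dash such as dp-pd and avf-ada survive because only the first and last characters are checked.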

Resources