How to remove these elements from my string in R

One of my columns contains the following strings:
259333154-Carat Programmatic»FCO»O»EV3»D&B - FCO Prospects ABM DDD 2020 D&B ABM ENDING 1/31/2020»NA+728 x 90»IN»UNV»TP»PM»dCPM»NAT»BTP»RON»NA»DB»N/A»ENG»M»P159WXZ
238114259-Carat Programmatic»CPO XT5»O»EV2»END DATE 2/28/19 Google Custom Intent - XT5 CPO Google Custom Intent Audience/In Market Audience»NA+160 x 600»IN»DSK»TP»PM»CPM»NAT»BTS»RON»NA»DB»N/A»ENG»M»PX5LV6
251368220-Carat Programmatic»XT6»O»EV1»END DATE 9/30/19 2019 Cadillac SMRT Always On - CRM - SMRT Segment XT6 Desktop Display»NA+160 x 600»IN»DSK»TP»PM»CPM»NAT»PRS»RON»NA»DB»N/A»ENG»M»P12LQG3
235105211-Ebay»Silverado 1500»M»CON»ended 3/20 - ROS - eBay Run of Motors Desktop»NA+300 x 250»IN»DSK»TP»PM»CPM»NAT»SCR»ROS»NA»DB»N/A»ENG»M»PW79JH
234990143-Carat Programmatic»XT4»O»EV2»Endemic - Loyalist|Edmunds&Oracle|N/A|Vehicle|N/A 2»NA+300 x 250»IN»MOB»TP»PM»CPM»NAT»BTO»RON»IMW»DB»N/A»ENG»M»PW7NSN
My task is to remove the following from the strings:
Dates (e.g. 1/31/2020, 3/20)
Strings like Ending, ENDING, END, end, End, END DATE but NOT the "end" from strings that have endemic in them like the last one
Double spaces e.g. "  "
I am having a really hard time with this. I am using several lines of code as follows to perform some of the operations, but don't know how to go about doing this more succinctly and completely:
dat = gsub("2017|2018|2019|2020", "" ,dat)
dat = gsub("»»", "»", dat)
dat = gsub("END ", "", dat)
dat = gsub("end ", "", dat)
dat = gsub("Ended ", "", dat)
dat = gsub("ENDED ", "", dat)
dat = gsub("DATE|date|Date", "", dat)
Thanks so much for your help!

Check the following:
\b(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\d\d\b should handle the "dates (e.g. 1/31/2020, 3/20)" case
(?i)\bEnd(?: DATE|(?:ing)?)\b should handle the "strings like "Ending", "ENDING", "END", "end", "End", "END DATE" but NOT the "end" from strings that have "endemic" in them like the last one" case
([\s»]){2,} should handle the "double spaces e.g. " "" case.
Combining all:
gsub("\\b(?:End(?:\\s+DATE|(?:ing)?)|(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d)\\b|([\\s»]){2,}", "\\1", x, perl=TRUE, ignore.case=TRUE)
See proof.
\b the boundary between a word char (\w) and
something that is not a word char
(?: group, but do not capture:
End 'End'
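(?: group, but do not capture:
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
DATE 'DATE'
| OR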
(?: group, but do not capture (optional
(matching the most amount possible)):
ing 'ing'
)? end of grouping
) end of grouping
| OR
(?: group, but do not capture:
0? '0' (optional (matching the most
amount possible))
[1-9] any character of: '1' to '9'
| OR
1 '1'
[012] any character of: '0', '1', '2'
) end of grouping
(?: group, but do not capture (optional
(matching the most amount possible)):
[-/.] any character of: '-', '/', '.'
(?: group, but do not capture:
0? '0' (optional (matching the most
amount possible))
[1-9] any character of: '1' to '9'
| OR
[12] any character of: '1', '2'
[0-9] any character of: '0' to '9'
| OR
3 '3'
[01] any character of: '0', '1'
) end of grouping
)? end of grouping
[-/.] any character of: '-', '/', '.'
(?: group, but do not capture (optional
(matching the most amount possible)):
19 '19'
| OR
20 '20'
)? end of grouping
\d digits (0-9)
\d digits (0-9)
) end of grouping
\b the boundary between a word char (\w) and
something that is not a word char
| OR
( group and capture to \1 (at least 2 times
(matching the most amount possible)):
[\s»] any character of: whitespace (\n, \r,
\t, \f, and " "), '»'
){2,} end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
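If the single combined pattern is hard to maintain, the same three steps can be sketched as separate gsub calls. This is only a minimal sketch under stated assumptions, not the answer's exact code: the helper name clean_label and the sample strings are made up for illustration, the input is ASCII-only, and only whitespace runs are collapsed (the » handling of the combined pattern is left out).

```r
# Hypothetical helper (not from the original answer): apply the three
# clean-up steps one after another instead of in one combined regex.
clean_label <- function(x) {
  # Dates such as 1/31/2020, 2/28/19 or 3/20 (same date pattern as above)
  date_re <- "\\b(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d\\b"
  # "End", "Ending" or "End DATE" as whole words, in any letter case
  end_re <- "\\bEnd(?:\\s+DATE|ing)?\\b"

  x <- gsub(date_re, "", x, perl = TRUE)
  x <- gsub(end_re, "", x, perl = TRUE, ignore.case = TRUE)
  x <- gsub("\\s{2,}", " ", x, perl = TRUE)  # collapse runs of whitespace
  trimws(x)                                  # drop leading/trailing blanks
}

# Made-up sample strings in the spirit of the question's data
samples <- c(
  "END DATE 2/28/19 Google Custom Intent",
  "Endemic - Loyalist ENDING 1/31/2020",
  "ROS 3/20 - eBay"
)
print(clean_label(samples))
```

The trailing \b is what keeps "Endemic" intact. Note that "ended" is not matched by this pattern either, so it would need its own alternative if it should be removed too.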


I'm having an issue with the stringr::str_replace_all function. I'm trying to replace all instances of iv with insuredvehicle, but the function only seems to catch the first term.
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = ' iv ', replacement = ' insuredvehicle ', string = text)]
The outcome looks like the following, which missed the 2nd iv term:
1: the driver of the 1st vehicle hit the insuredvehicle iv at a stop
I believe the issue is that the 2 instances share a space, which is part of the search pattern. I did that because I want to replace the iv term, and not iv within driver.
I DON'T want to simply consolidate the repeated terms to 1. I'd like the result to look like:
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
I'd appreciate any help getting this to work!
Maybe include a word boundary in your regex, then remove the white spaces from the replacement? It is ideal when you want to match just a full word, but not parts of words, while staying away from these blank space issues.
\\b seems to do the trick
temp_data[, new_text := stringr::str_replace_all(pattern = '\\biv\\b', replacement = 'insuredvehicle', string = text)]
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
You can use lookarounds:
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<= )iv(?= )', replacement = 'insuredvehicle', string = text)]
"the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"
Use gsub:
gsub("\\biv\\b", "insuredvehicle", temp_data$text)
[1] "the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"
Use space boundaries:
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<!\\S)iv(?!\\S)', replacement = 'insuredvehicle', string = text)]
See regex proof.
(?<! look behind to see if there is not:
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
) end of look-behind
iv 'iv'
(?! look ahead to see if there is not:
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
) end of look-ahead
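To see why the original pattern skips the second term and how the two boundary styles differ, here is a small base R sketch. It uses gsub instead of stringr so that it runs without extra packages; the "iv-drip" string is a made-up example.

```r
text <- "the driver of the 1st vehicle hit the iv iv at a stop"

# Literal spaces in the pattern: the first match consumes the space that the
# second "iv" would need in front of it, so the second one is skipped.
with_spaces <- gsub(" iv ", " insuredvehicle ", text, fixed = TRUE)

# Word boundaries and whitespace lookarounds are zero-width: they check the
# neighbouring characters without consuming them.
with_boundary   <- gsub("\\biv\\b", "insuredvehicle", text, perl = TRUE)
with_lookaround <- gsub("(?<!\\S)iv(?!\\S)", "insuredvehicle", text, perl = TRUE)

# The two styles differ when "iv" touches punctuation (made-up example).
tricky <- "iv-drip near the iv"
boundary_tricky   <- gsub("\\biv\\b", "insuredvehicle", tricky, perl = TRUE)
lookaround_tricky <- gsub("(?<!\\S)iv(?!\\S)", "insuredvehicle", tricky, perl = TRUE)

print(c(with_spaces, with_boundary, with_lookaround,
        boundary_tricky, lookaround_tricky))
```

With \\b, "iv-drip" is rewritten because the hyphen counts as a non-word character; with the whitespace lookarounds it is left alone. Which one fits depends on whether punctuation next to iv should still count as a separate word.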

I would like to extract the name of the drug, where "Drug:", "Other:",etc precedes name of drug.
Take the first word after every ":", including characters like "-".
If there are 2 instances of ":", then "and" should join the 2 words as one string. The output should be in a one-column dataframe with column name Drug.
Here is my reproducible example:
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT")))
The output should look something like this:
output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-matose", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))
This is what I've tried, which didn't work.
Attempt 1:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)')
Attempt 2:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')
I am not so familiar with R, but a pattern that would give you the matches from the example data could be:
(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*
Then you could concatenate the matches with and in between.
The pattern matches:
(?<=:\s) Positive lookbehind, assert : and a whitespace char to the left
\w+(?:-\w+)* Match 1+ word chars, followed by optionally repeating - and 1+ word chars
(?: Non capture group
and \w+(?:-\w+)* Match and followed by 1+ word chars followed by optionally repeating - and 1+ word chars
)* Close non capture group and optionally repeat
Regex demo
To get all the matches, you can use str_extract_all
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
For example
library(stringr)
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT")))
lapply(
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
, paste, collapse=" and ")
[1] "TLD-1433"
[1] "CG0070 and n-dodecyl-B-D-maltoside"
[1] "Atezolizumab"
[1] "N-803 and BCG and N-803"
[1] "Everolimus and Intravesical"
[1] "Association and Association"
See regex proof.
: ':'
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
\b the boundary between a word char (\w) and
something that is not a word char
( group and capture to \1:
[\w-]+ any character of: word characters (a-z,
A-Z, 0-9, _), '-' (1 or more times
(matching the most amount possible))
\b the boundary between a word char (\w)
and something that is not a word char
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
and 'and'
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
\b the boundary between a word char (\w)
and something that is not a word char
[\w-]+ any character of: word characters (a-
z, A-Z, 0-9, _), '-' (1 or more times
(matching the most amount possible))
)* end of grouping
) end of \1
\b the boundary between a word char (\w) and
something that is not a word char
R code:
library(stringr)
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT")))
matches <- str_match_all(my.df$col1, ":\\s*\\b([\\w-]+\\b(?:\\s+and\\s+\\b[\\w-]+)*)\\b")
Drugs <- sapply(matches, function(z) paste(z[,-1], collapse=" and "))
output.df <- data.frame(Drugs)
1 TLD-1433
2 CG0070 and n-dodecyl-B-D-maltoside
3 Atezolizumab
4 N-803 and BCG and N-803
5 Everolimus and Intravesical
6 Association and Association
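Both answers above keep "and BCG" as part of a match, which gives "N-803 and BCG and N-803" where the question asked for "N-803 and N-803". If the rule "first word after every ':'" is taken literally, a base R variant could look like the sketch below. It is an alternative reading, not the answerers' method; it uses a subset of the question's rows, and under this reading the Everolimus row yields just "Everolimus" rather than "Everolimus and Intravesical".

```r
# A subset of the rows from the question
col1 <- c(
  "Product: TLD-1433 infusion Therapy",
  "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
  "Drug: Atezolizumab",
  "Drug: N-803 and BCG|Drug: N-803",
  "Drug: Everolimus and Intravesical Gemcitabine"
)

# One run of word characters or hyphens directly after ": "
first_word_re <- "(?<=:\\s)[\\w-]+"

# gregexpr finds every match per string; regmatches extracts them as a list
hits <- regmatches(col1, gregexpr(first_word_re, col1, perl = TRUE))

# Join the matches of each row with " and "
Drug <- vapply(hits, paste, character(1), collapse = " and ")
output.df <- data.frame(Drug)
print(output.df)
```

Using vapply rather than sapply guarantees a character vector with one entry per row, even if a row has no match (it then becomes an empty string).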

I want to change one of the _ to another character, for example to -, the reason is there are problems reading in these filenames. I want a to become like b. So I want to change the second last underscore(_), how to specify this in an efficient way?
A plain gsub("_", "-", a) replaces every underscore, so the replacement must also be restricted to a specific position.
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
b <- c("2018-01-09_B2_HILIC_POS_123--14b_090.mzML", "2018-01-09_B2_HILIC_POS_243--12a_026.mzML", "2020-01-09_B2_HILIC_POS_415-893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV4004365711_040.mzML")
Here is one base R option using sub :
sub('(.*)(_)(.*_.*)$', '\\1-\\3', a)
#[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
#[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
#[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
#[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
Here we divide data into 3 groups -
The 1st group is everything until second last underscore which is captured using (.*) and used as a backreference (\\1).
The 2nd group is the second last underscore, which is replaced with -.
The 3rd one is everything after second last underscore which is captured using (.*_.*) and used as a backreference (\\3).
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
See regex proof.
_ '_'
(?= look ahead to see if there is:
[^_]* any character except: '_' (0 or more
times (matching the most amount
possible))
_ '_'
[^_]* any character except: '_' (0 or more
times (matching the most amount
possible))
$ before an optional \n, and the end of
the string
) end of look-ahead
See R proof:
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
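The lookahead idea generalizes to "the n-th underscore counted from the end" by repeating the [^_]*_ part n - 1 times. The helper below is a hypothetical sketch (the function name and its arguments are not from the answers), shown on two of the file names from the question.

```r
# Hypothetical helper: replace the n-th underscore counted from the end.
# n = 1 targets the last underscore, n = 2 the second last, and so on.
replace_nth_last_underscore <- function(x, n = 2L, replacement = "-") {
  n <- as.integer(n)
  stopifnot(n >= 1L)
  # After the matched "_" there must be exactly n - 1 further underscores
  pattern <- sprintf("_(?=(?:[^_]*_){%d}[^_]*$)", n - 1L)
  sub(pattern, replacement, x, perl = TRUE)
}

a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML",
       "2020-01-09_B2_HILIC_POS_415_893a_059.mzML")

second_last <- replace_nth_last_underscore(a, 2)
last_one    <- replace_nth_last_underscore(a, 1)
print(second_last)
print(last_one)
```

Strings with fewer than n underscores are returned unchanged, because the lookahead cannot be satisfied.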

How do I match the year such that it is general for the following examples.
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'
I have tried the following, but did not have the biggest success.
gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
What I thought it did was to go until it finds a (, then it would make a group of numbers, then any character until it meets a ). And if there are several matches, I want to extract the first group.
Any suggestions to where I go wrong? I have been doing this in R.
You could use
library(stringr)
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')
years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
# [1] "1953" "1998"
The expression here is
\( # a literal (
(\d+ # start of the capture group: 1+ digits
(?: B\.C\.)? # optionally followed by " B.C."
) # end of the capture group
Note that backslashes need to be escaped in R.
Your pattern contains greedy .+ parts that match 1 or more chars, as many as possible, so at best it can grab the last 4-digit chunk in the incoming strings rather than the first.
You may use
^.*?\((\d{4})(?:/[^)]*)?\).*
Replace with \1 to only keep the 4 digit number. See the regex demo.
^ - start of string
.*? - any 0+ chars as few as possible
\( - a (
(\d{4}) - Group 1: four digits
(?: - start of an optional non-capturing group
/ - a /
[^)]* - any 0+ chars other than )
)? - end of the group
\) - a )
.* - the rest of the string.
See the R demo:
a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a)
# => [1] "1953" "1998" "1996"
Another base R solution is to match the 4 digits after (:
regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"
The \(\K\d{4} pattern matches ( and then drops it from the match with the \K match reset operator; the (?=(?:/[^)]*)?\)) lookahead then requires an optional / plus 0+ chars other than ), followed by a ). Note that regexpr extracts the first match only.
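One caveat with regmatches plus regexpr is that strings without a match are silently dropped, so the result can be shorter than the input. A minimal sketch that keeps one entry per input and fills non-matches with NA is shown below; the helper name extract_year and the two made-up titles ("Some Film", "Untitled Project") are assumptions for illustration.

```r
# Hypothetical helper: first 4-digit year in parentheses, NA when there is none
extract_year <- function(x) {
  m <- regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", x, perl = TRUE)
  found <- as.vector(m) != -1        # regexpr returns -1 for "no match"
  out <- rep(NA_character_, length(x))
  out[found] <- regmatches(x, m)     # regmatches only returns the matches
  out
}

titles <- c(
  '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}',
  "Johannes Passion, BWV. 245 (1725 Version) (1996) (V)",
  "Some Film (1998/I) (TV)",
  "Untitled Project (TV)"
)
years <- extract_year(titles)
print(years)
```

This keeps the result aligned with the input, which matters when the years are written back into a data frame column.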

I thought I had a great regex for limiting the number of words entered into a TextBox, but I discovered that it fails when there is punctuation in the text.
How can I modify this regex (or use a different one) that correctly counts words that may be made up of several sentences or contain other symbols?
This limits the words to 10.
I think this is the ONLY WAY it can be done.
I would like to see a better way if it exists.
(requires an atomic group)
For Unicode:
[^\pL\pN]* [\pL\pN] # Not letters/numbers, followed by letter/number
[\pL\pN_-] # Letter/number or '-'
\pP # Or, punctuation if followed by punctuation/letter/number or '-'
(?= [\pL\pN\pP_-] )
[?.!] # Or, (Add) Special word ending punctuation
For Ascii:
[\W_]* [^\W_]
(?= [\w[:punct:]-] )
