Stringr str_replace_all misses repeated terms - r

I'm having an issue with the stringr::str_replace_all function. I'm trying to replace all instances of iv with insuredvehicle, but the function only seems to catch the first term.
library(data.table)
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = ' iv ', replacement = ' insuredvehicle ', string = text)]
The outcome looks like the following, which missed the 2nd iv term:
1: the driver of the 1st vehicle hit the insuredvehicle iv at a stop
I believe the issue is that the 2 instances share a space, which is part of the search pattern. I did that because I want to replace the iv term, and not iv within driver.
I DON'T want to simply consolidate the repeated terms to 1. I'd like the result to look like:
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
I'd appreciate any help getting this to work!

Maybe include a word boundary in your regex, then remove the white spaces from the replacement? A word boundary is ideal when you want to match a full word, but not parts of words, while staying away from these blank space issues.
\\b seems to do the trick
temp_data[, new_text := stringr::str_replace_all(pattern = '\\biv\\b', replacement = 'insuredvehicle', string = text)]
new_text
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
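To see why the original pattern misses the second term, here is a minimal sketch using base R gsub instead of stringr so that it runs without extra packages (the variable names are illustrative only). Regex matches cannot overlap: once ' iv ' has matched, its trailing space is consumed and cannot serve as the leading space of the next ' iv '. \\b is zero-width, so neighbouring words share no characters.

```r
text <- "the driver of the 1st vehicle hit the iv iv at a stop"

# Space-delimited pattern: the first match consumes the space that the
# second " iv " would need as its leading space, so only one is replaced.
with_spaces <- gsub(" iv ", " insuredvehicle ", text, fixed = TRUE)

# Word-boundary pattern: \\b consumes no characters, so both terms match.
with_boundaries <- gsub("\\biv\\b", "insuredvehicle", text, perl = TRUE)

print(with_spaces)
print(with_boundaries)
```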

You can use lookarounds:
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<= )iv(?= )', replacement = 'insuredvehicle', string = text)]
Output:
"the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"

Use gsub:
gsub("\\biv\\b", "insuredvehicle", temp_data$text)
[1] "the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"

Use space boundaries:
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<!\\S)iv(?!\\S)', replacement = 'insuredvehicle', string = text)]
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
iv 'iv'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-ahead
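The three patterns suggested in this thread behave differently at the edges of a string and next to punctuation. The sketch below compares them with base R gsub and perl = TRUE; the sample vector x and the replacement "IV" are made up for illustration and are not from the question.

```r
# Hypothetical inputs: term at string start/end, next to a hyphen, next to a dot
x <- c("iv hit the iv", "the iv-line", "iv.")

# Requires a literal space on both sides: misses the start and end of the string
space_lookaround <- gsub("(?<= )iv(?= )", "IV", x, perl = TRUE)

# Requires "no non-whitespace" on both sides: start and end of string are fine,
# but punctuation next to the term blocks the match
whitespace_boundary <- gsub("(?<!\\S)iv(?!\\S)", "IV", x, perl = TRUE)

# Word boundary: also matches when the neighbour is punctuation
word_boundary <- gsub("\\biv\\b", "IV", x, perl = TRUE)

print(space_lookaround)
print(whitespace_boundary)
print(word_boundary)
```

Which one is appropriate depends on whether something like 'iv-line' should count as containing the term.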

Related

Find closing parenthesis with regex in r

I have several strings with unmatched opening or closing parentheses. I managed to remove an opening parenthesis (if there is no closing one), but I cannot manage to remove a closing parenthesis if there is no opening one. I want to leave strings with matching parentheses alone.
string1 = "This (is solved"
string2 = "This is (fine)"
string3 = "This is the problem)"
This is what I used to handle the first problem case (an opening parenthesis but no closing one):
str_remove(data, "[(](?!.*[)])")
But I cannot seem to turn it around. The following grabs all closing parentheses, but not the one without an opening one.
"(?!.*[(])[)]"
Any ideas are appreciated!
If you do not need to handle nested paired (balanced) parentheses, you can use
gsub("(\\([^()]*\\))|[()]", "\\1", string)
See the regex demo. Details:
(\([^()]*\)) - Group 1 (\1 refers to this group value): (, then zero or more chars other than ( and ), and then a ) char
| - or
[()] - a ( or ) char.
See the R demo:
x <- c("This (is solved", "This is (fine)", "This is the problem)")
gsub("(\\([^()]*\\))|[()]", "\\1", x)
# => [1] "This is solved" "This is (fine)" "This is the problem"
If the parentheses can be nested, you can use
gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", string, perl=TRUE)
See this regex demo. Details:
(\((?:[^()]++|(?1))*\)) - Group 1:
\( - a ( char
(?:[^()]++|(?1))* - zero or more sequences of either one or more chars other than ( and ), or the whole Group 1 pattern that is recursed
\) - a ) char
|[()] - or a ( / ) char.
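A small sketch of how the two patterns differ on nested input. The test strings are made up for illustration; both calls use perl = TRUE so that they run on the same engine.

```r
# Hypothetical inputs: nested pair plus a stray ")", a stray "(", a balanced pair
x <- c("a (b (c) d) e)", "((a)", "This is (fine)")

# Non-recursive pattern: only innermost pairs are recognised as balanced,
# so the outer pair in the first string is stripped as well
flat <- gsub("(\\([^()]*\\))|[()]", "\\1", x, perl = TRUE)

# Recursive pattern: (?1) re-enters Group 1, so nested pairs are kept whole
nested <- gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", x, perl = TRUE)

print(flat)
print(nested)
```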

Change name of certain character and location in filenames

I want to change one of the _ characters to another character, for example -, because there are problems reading in these filenames. I want a to become like b. So I want to change the second-to-last underscore (_). How can I specify this in an efficient way?
A plain gsub("_", "-") replaces every underscore, so the replacement must also be restricted to a certain location.
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
b <- c("2018-01-09_B2_HILIC_POS_123--14b_090.mzML", "2018-01-09_B2_HILIC_POS_243--12a_026.mzML", "2020-01-09_B2_HILIC_POS_415-893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV4004365711_040.mzML")
Here is one base R option using sub :
sub('(.*)(_)(.*_.*)$', '\\1-\\3', a)
#[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
#[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
#[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
#[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
Here we divide data into 3 groups -
The 1st group is everything until second last underscore which is captured using (.*) and used as a backreference (\\1).
The 2nd group is second last underscore which is replaced with -.
The 3rd one is everything after second last underscore which is captured using (.*_.*) and used as a backreference (\\3).
Use
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
See regex proof.
Explanation
--------------------------------------------------------------------------------
_ '_'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[^_]* any character except: '_' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
_ '_'
--------------------------------------------------------------------------------
[^_]* any character except: '_' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of look-ahead
See R proof:
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
Results:
[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
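As a quick check, the capture-group approach and the lookahead approach can be compared side by side. This is only a sketch: the capture-group pattern is simplified to two groups, perl = TRUE is used for both calls, and the third file name is a made-up example with a single underscore, which neither pattern should change.

```r
files <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML",
           "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML",
           "one_underscore.mzML")  # hypothetical: fewer than two underscores

# Greedy (.*) backtracks to the last "_" that still has another "_" after it
by_groups <- sub("(.*)_(.*_.*)$", "\\1-\\2", files, perl = TRUE)

# Match a "_" that is followed by exactly one more "_" before the end
by_lookahead <- sub("_(?=[^_]*_[^_]*$)", "-", files, perl = TRUE)

print(by_lookahead)
print(identical(by_groups, by_lookahead))
```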

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underscores except those between words. In the end, the code should remove underscores at the beginning or at the end of a word.
The result should be
'hello_world and hello_world'.
I want to use the pre-built POSIX classes. Right now I have learned to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
See the regex demo
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
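A short sketch of the three alternatives at work. The second input, test2, is a made-up string with extra punctuation and doubled underscores; it is not from the question.

```r
test  <- "hello_world and _hello_world_"
test2 <- "hello_world, and __hello_world__!"  # hypothetical extra case

pattern <- "[^_[:^punct:]]|_+\\b|\\b_+"

# Alternative 1 removes "," and "!", alternatives 2 and 3 remove the
# underscores at word edges; underscores between letters are untouched
cleaned  <- gsub(pattern, "", test,  perl = TRUE)
cleaned2 <- gsub(pattern, "", test2, perl = TRUE)

print(cleaned)
print(cleaned2)
```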
One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
We can remove every underscore that has a word boundary on either side. We use lookahead and lookbehind assertions to find such underscores. To remove underscores at the start and end of the string, we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input

Finding a word with condition in a vector with regex on R (perl)

I would like to find the rows in a vector with the word 'RT' in it or 'R' but not if the word 'RT' is preceded by 'no'.
The word RT may be preceded by nothing, a space, a dot, etc.
With the regex, I tried :
grep("(?<=[no] )RT", aaa,ignore.case = FALSE, perl = T)
Which was giving me all the rows with "no RT".
and
grep("(?=[^no].*)RT",aaa , perl = T)
which was giving me all the rows containing 'RT' with and without 'no' at the beginning.
What is my mistake? I thought the ^ was giving everything but the character that follows it.
Example :
aaa = c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+","npo RT" )
(?<=[no] )RT matches any RT that is immediately preceded with "n " or "o ".
You should use a negative lookbehind,
"(?<!no )RT"
See the regex demo.
Or, if you need to check for a whole word no,
"(?<!\\bno )RT"
See this regex demo.
Here, (?<!no ) makes sure there is no "no " immediately to the left of the current location, and only then RT is consumed.
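A minimal sketch of both patterns on the example vector. The last element, "piano RT", is an extra made-up case added here to show where the two patterns differ; it is not part of the original data.

```r
aaa <- c("RT alone", "no RT", "CT/RT", "adj.RTx", "RT/CT", "lang, RT+",
         "npo RT", "piano RT")  # "piano RT" is a hypothetical extra case

# Rejects any RT directly preceded by the three characters "no "
plain <- grep("(?<!no )RT", aaa, perl = TRUE, value = TRUE)

# Rejects RT only when "no" is a whole word, so "piano RT" is kept
whole_word <- grep("(?<!\\bno )RT", aaa, perl = TRUE, value = TRUE)

print(plain)
print(whole_word)
```

Note that [no] and [^no] are character classes: they match (or exclude) a single character n or o, not the word no, which is why the original attempts behaved unexpectedly.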

Splitting character string in R - Extracting the timestamp

Thank you in advance for any feedback.
I am attempting to clean some data in R where a time stamp and a text string are included together in the same cell. I am not getting the expected result. I know the regex needs validation work, but just testing out this particular function
Expected:
"04/05/2018 17:14:35" " -(Additional comments) update"
Actual:
"04/05/2018 17:14:35 -(Additional comments) update"
What I tried:
string <- "04/05/2018 17:14:35 -(Additional comments) update"
pattern <- "[:digit:][:digit:][:punct:]
[:digit:][:digit:][:punct:]
[:digit:][:digit:][:digit:][:digit:]
[[:space:]]
[:digit:][:digit:]
[:punct:]
[:digit:][:digit:]
[:punct:]
[:digit:][:digit:]"
strsplit(string, pattern)
I also tried this variation, same result
pattern <- "[:digit:][:digit:]\\/
[:digit:][:digit:]\\/
[:digit:][:digit:][:digit:][:digit:]
[[:space:]]
[:digit:][:digit:]
\\:
[:digit:][:digit:]
\\:
[:digit:][:digit:]"
You can try :
string <- "04/05/2018 17:14:35 -(Additional comments) update"
gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}).*","\\1", string)
#[1] "04/05/2018 17:14:35"
#RHS part
gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2})(.*)","\\2", string)
#" -(Additional comments) update"
Regex explanation:
\\d{2} - 2 digits
\\d{4} - 4 digits
/ - separator
: - separator
() - Group for selection
.* - Followed by anything
Seems OP is very keen on using strsplit. One option could be as:
strsplit(gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2})(.*)",
paste("\\1","####","\\2",sep=""), string), split = "####")
# [[1]]
# [1] "04/05/2018 17:14:35" " -(Additional comments) update"
Try this:
sub('-.*','',string)
[1] "04/05/2018 17:14:35 "
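For completeness, here is a sketch that keeps the POSIX classes from the question. It assumes the original attempt failed for two reasons: [:digit:] is only valid inside a bracket expression, i.e. [[:digit:]], and the line breaks inside the pattern string are treated as literal newlines. It uses regexpr and regmatches instead of strsplit; the variable names are illustrative.

```r
string <- "04/05/2018 17:14:35 -(Additional comments) update"

# POSIX classes must sit inside brackets; the pattern is built on one line
pattern <- paste0("[[:digit:]]{2}/[[:digit:]]{2}/[[:digit:]]{4} ",
                  "[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}")

m <- regexpr(pattern, string)        # position and length of the timestamp
timestamp <- regmatches(string, m)   # the matched timestamp itself

# Everything after the match is the remaining text
start <- as.integer(m) + as.integer(attr(m, "match.length"))
rest <- substring(string, start)

parts <- c(timestamp, rest)
print(parts)
```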
