Change name of certain character and location in filenames - r

I want to change one of the _ to another character, for example to -, the reason is there are problems reading in these filenames. I want a to become like b. So I want to change the second last underscore(_), how to specify this in an efficient way?
gsub("_", "-"), it must also be specified to a certain location.
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
b <- c("2018-01-09_B2_HILIC_POS_123--14b_090.mzML", "2018-01-09_B2_HILIC_POS_243--12a_026.mzML", "2020-01-09_B2_HILIC_POS_415-893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV4004365711_040.mzML")

Here is one base R option using sub :
sub('(.*)(_)(.*_.*)$', '\\1-\\3', a)
#[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
#[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
#[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
#[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
Here we divide data into 3 groups -
The 1st group is everything until second last underscore which is captured using (.*) and used as a backreference (\\1).
The 2nd group is second last underscore which us replaced with -.
The 3rd one is everything after second last underscore which is captured using (.*_.*) and used as a backreference (\\3).

Use
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
See regex proof.
Explanation
--------------------------------------------------------------------------------
_ '_'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[^_]* any character except: '_' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
_ '_'
--------------------------------------------------------------------------------
[^_]* any character except: '_' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of look-ahead
See R proof:
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
Results:
[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"

Related

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underlines expect those between words. At the end the code removes underlines at the end or at the beginning of a word.
The result should be
'hello_world and hello_world'.
I want to use those pre-built classes. Right know I have learn to expect particular characters with following code but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
See the regex demo
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
We can remove all the underlying which has a word boundary on either of the end. We use positive lookahead and lookbehind regex to find such underlyings. To remove underlying at the start and end we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input

Regex to match a pattern but not two specific cases

I want to match every cases of "-", but not these ones:
[\d]-[A-Z]
[A-Z]-[\d]
I tried this pattern: ((?<![A-Z])-(?![0-9]))|((?<![0-9])-(?![A-Z])) but some results are incorrect like: "RUA VF-32 N"
Can anyone help me?
A simple approach is to use grep with your current logic and inverting the result, and then run another grep to only keep those items that have a hyphen in them:
x <- c("QUADRA 120 - ASA BRANCA","FAZENDA LAGE -RODOVIA RIO VERDE","C-15","99-B","A-A")
grep("-", grep("[A-Z]-\\d|\\d-[A-Z]", x, invert=TRUE, value=TRUE), value=TRUE, fixed=TRUE)
# => [1] "QUADRA 120 - ASA BRANCA" "FAZENDA LAGE -RODOVIA RIO VERDE"
# [3] "A-A"
Here, [A-Z]-\\d|\\d-[A-Z] matches a hyphen either in between an uppercase ASCII etter or a digit or betweena digit and an ASCII uppercase letter. If there is a match, the result is inverted due to invert=TRUE.
See the R demo.
To only match - in all contexts other than in between a letter and a digit, you may use the PCRE regex based on SKIP-FAIL technique like
> grep("(?:\\d-[A-Z]|[A-Z]-\\d)(*SKIP)(*F)|-", x, perl=TRUE)
[1] 1 2
See this regex demo
Details
(?:\d-[A-Z]|[A-Z]-\d) - a non-capturing group that matches either a digit, - and then uppercase ASCII letter, or an uppercase ASCII letter, - and a digit
(*SKIP)(*F) - omit the current match and proceed looking for the next match at the end of the "failed" match
| - or
- - a hyphen.

Regex: Extracting numbers from parentheses with multiple matches

How do I match the year such that it is general for the following examples.
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'
I have tried the following, but did not have the biggest success.
gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
What I thought it did was to go until it finds a (, then it would make a group of numbers, then any character until it meets a ). And if there are several matches, I want to extract the first group.
Any suggestions to where I go wrong? I have been doing this in R.
You could use
library(stringr)
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')
years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
years
# [1] "1953" "1998"
The expression here is
\( # (
(\d+ # capture 1+ digits
(?: B\.C\.)? # B.C. eventually
)
Note that backslashes need to be escaped in R.
Your pattern contains .+ parts that match 1 or more chars as many as possible, and at best your pattern could grab last 4 digit chunks from the incoming strings.
You may use
^.*?\((\d{4})(?:/[^)]*)?\).*
Replace with \1 to only keep the 4 digit number. See the regex demo.
Details
^ - start of string
.*? - any 0+ chars as few as possible
\( - a (
(\d{4}) - Group 1: four digits
(?: - start of an optional non-capturing group
/ - a /
[^)]* - any 0+ chars other than )
)? - end of the group
\) - a ) (OPTIONAL, MAY BE OMITTED)
.* - the rest of the string.
See the R demo:
a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a)
# => [1] "1953" "1998" "1996"
Another base R solution is to match the 4 digits after (:
regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"
The \(\K\d{4} pattern matches ( and then drops it due to \K match reset operator and then a (?=(?:/[^)]*)?\\)) lookahead ensures there is an optional / + 0+ chars other than ) and then a ). Note that regexpr extracts the first match only.

replace last number in string using regex

I want to replace the last number in a string using regex and gsub
S <- "abcd2efghi2.txt"
The last number and the position of the last number can vary.
So I've tried the regex
?<=[\d+])\b
gsub("?<=[\d+])\b", "", S)
but that doesn't seem to work
Appreciate any help.
You can achieve that with a default TRE engine using the following regex:
\d+(\D*)$
Replace with the \1 backreference.
Details
\d+ - 1 or more digits
(\D*) - Capturing group 1: any 0+ non-digit symbols
$ - end of string
\1 - a backreference to the Group 1 value (so as to restore the text matched and consumed with the (\D*) subpattern).
See the regex demo.
R code demo:
sub("\\d+(\\D*)$", "\\1", S)
## => [1] "abcd2efghi.txt"
You could use this regex:
\d+(?=\D*$)
It matches a sequence of digits when everything that follows consists of non-digits (\D) until the end of the string ($).

How to delete first and last items before the matching pattern or delimiter in R

I have this vector called myvec. I want to delete everything before first delimiter _ and everything after the last delimiter _ (including the delimeter). How do I do this in R to get the result.
myvec <- c("contamination_LPH-001-10_3.txt", "contamination_LPH-001-10_AK1_0.txt",
"contamination_LPH-001-10_AK2_1.txt", "contamination_LPH-001-10_PD_2.txt",
"contamination_LPH-001-10_SCC_4.txt")
Result:
LPH-001-10, LPH-001-10_AK1,LPH-001-10_AK2,LPH-001-10_PD,LPH-001-10_SCC
We can use gsub for this
gsub("^[^_]*_|_[^_]*$", "", myvec)
#[1] "LPH-001-10" "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD" "LPH-001-10_SCC"
From the start (^) of the string, we are matching zero or more characters that are not a _ ([^_]*) followed by a _ or (|) match a _ followed by zero or more charachters that are not a _ ([^_]*) till the end ($) of the string and replace it with "".
Or we can also use capture groups ((...)) and replace with the backreference for the capture groups.
sub("^[^_]*_(.*)_[^_]*$", "\\1", myvec)
#[1] "LPH-001-10" "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD" "LPH-001-10_SCC"

Resources