How to delete first and last items before the matching pattern or delimiter in R

I have this vector called myvec. I want to delete everything before the first delimiter _ and everything after the last delimiter _ (including the delimiters). How do I do this in R to get the result below?
myvec <- c("contamination_LPH-001-10_3.txt", "contamination_LPH-001-10_AK1_0.txt",
"contamination_LPH-001-10_AK2_1.txt", "contamination_LPH-001-10_PD_2.txt",
"contamination_LPH-001-10_SCC_4.txt")
Result:
LPH-001-10, LPH-001-10_AK1, LPH-001-10_AK2, LPH-001-10_PD, LPH-001-10_SCC

We can use gsub for this
gsub("^[^_]*_|_[^_]*$", "", myvec)
#[1] "LPH-001-10" "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD" "LPH-001-10_SCC"
From the start (^) of the string, we match zero or more characters that are not a _ ([^_]*) followed by a _, or (|) we match a _ followed by zero or more characters that are not a _ ([^_]*) up to the end ($) of the string, and replace either match with "".
Alternatively, we can wrap the part to keep in a capture group ((...)) and replace the whole match with the backreference to that group (\\1).
sub("^[^_]*_(.*)_[^_]*$", "\\1", myvec)
#[1] "LPH-001-10" "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD" "LPH-001-10_SCC"
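As a rough sketch of what each alternative in the gsub pattern removes, the two halves can be applied one at a time with sub. The intermediate names no_prefix and no_suffix are only illustrative and do not appear in the answer above.

```r
myvec <- c("contamination_LPH-001-10_3.txt",
           "contamination_LPH-001-10_AK1_0.txt")

# Step 1: drop everything up to and including the first "_"
no_prefix <- sub("^[^_]*_", "", myvec)
# "LPH-001-10_3.txt" "LPH-001-10_AK1_0.txt"

# Step 2: drop the last "_" and everything after it
no_suffix <- sub("_[^_]*$", "", no_prefix)
# "LPH-001-10" "LPH-001-10_AK1"
```

Applying both steps gives the same result as the single gsub call on these inputs.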

Related

Change name of certain character and location in filenames

I want to change one of the _ characters to another character, for example -, because there are problems reading in these filenames. I want a to become like b, so the second-to-last underscore (_) needs to change. How can I specify this efficiently?
A plain gsub("_", "-", a) is not enough, because the replacement must also be restricted to a certain location.
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
b <- c("2018-01-09_B2_HILIC_POS_123--14b_090.mzML", "2018-01-09_B2_HILIC_POS_243--12a_026.mzML", "2020-01-09_B2_HILIC_POS_415-893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV4004365711_040.mzML")
Here is one base R option using sub:
sub('(.*)(_)(.*_.*)$', '\\1-\\3', a)
#[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
#[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
#[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
#[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
Here we divide the string into 3 groups:
The 1st group is everything up to the second-to-last underscore, which is captured using (.*) and used as a backreference (\\1).
The 2nd group is the second-to-last underscore, which is replaced with -.
The 3rd group is everything after the second-to-last underscore, which is captured using (.*_.*) and used as a backreference (\\3).
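To see what the groups capture, one option is to return each backreference on its own. This is only an illustrative sketch on a single filename from the question; the names pattern, group1 and group3 are assumptions made for the example.

```r
x <- "2020-01-09_B2_HILIC_POS_415_893a_059.mzML"
pattern <- "(.*)(_)(.*_.*)$"

# Text before the second-to-last "_"
group1 <- sub(pattern, "\\1", x)
# "2020-01-09_B2_HILIC_POS_415"

# Text after the second-to-last "_" (still contains the last "_")
group3 <- sub(pattern, "\\3", x)
# "893a_059.mzML"

# Gluing the two groups with "-" reproduces the replacement '\\1-\\3'
paste0(group1, "-", group3)
```

Because the first (.*) is greedy, it takes as much as it can while still leaving one underscore for group 2 and one inside group 3.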
Use
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
See regex proof.
Explanation
_ - a literal '_'
(?= - look ahead to see if there is:
[^_]* - any character except '_' (0 or more times, matching as many as possible)
_ - a literal '_'
[^_]* - any character except '_' (0 or more times, matching as many as possible)
$ - the end of the string (or the position before an optional trailing \n)
) - end of look-ahead
See R proof:
a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML", "2018-01-09_B2_HILIC_POS_243_-12a_026.mzML", "2020-01-09_B2_HILIC_POS_415_893a_059.mzML", "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
sub("_(?=[^_]*_[^_]*$)", "-", a, perl=TRUE)
Results:
[1] "2018-01-09_B2_HILIC_POS_123--14b_090.mzML"
[2] "2018-01-09_B2_HILIC_POS_243--12a_026.mzML"
[3] "2020-01-09_B2_HILIC_POS_415-893a_059.mzML"
[4] "2020-01-18_B3_HILIC_POS-LV7001248356_040.mzML"
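For comparison, the same replacement can be sketched without a regex by splitting on the underscore and rejoining the pieces. The helper name replace_second_last is hypothetical, and the sketch assumes the filenames do not end in an underscore (strsplit drops a trailing empty piece).

```r
# Hypothetical helper: replace the second-to-last "_" with `replacement`
replace_second_last <- function(x, replacement = "-") {
  vapply(strsplit(x, "_", fixed = TRUE), function(parts) {
    n <- length(parts)
    # Fewer than two underscores: return the string unchanged
    if (n < 3) return(paste(parts, collapse = "_"))
    head_part <- paste(parts[seq_len(n - 2)], collapse = "_")
    tail_part <- paste(parts[(n - 1):n], collapse = "_")
    paste0(head_part, replacement, tail_part)
  }, character(1))
}

a <- c("2018-01-09_B2_HILIC_POS_123_-14b_090.mzML",
       "2020-01-18_B3_HILIC_POS_LV7001248356_040.mzML")
replace_second_last(a)
```

On these two filenames the result agrees with the look-ahead version; the regex is shorter, while the split version makes the position logic explicit.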

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underscores except those between words. In the end, the code should remove underscores at the end or at the beginning of a word.
The result should be
'hello_world and hello_world'.
I want to use the pre-built POSIX character classes. Right now I have learned how to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
See the regex demo
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
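A small sketch of the three alternatives working together, using a made-up input (test2, which is not part of the question) that also contains other punctuation:

```r
# Hypothetical input with a comma and an exclamation mark added
test2 <- "hello_world, and _hello_world_!"

cleaned <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test2, perl = TRUE)
cleaned
# "hello_world and hello_world"
```

The comma and the exclamation mark are removed by the first alternative, the trailing underscore by the second, and the leading underscore by the third. The underscores inside hello_world stay because _ counts as a word character, so there is no word boundary next to them.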
One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
We can remove every underscore that has a word boundary on either side. We use a positive lookbehind and a positive lookahead to find such underscores. To remove underscores at the start and end of the string, we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input
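Since [^\W] matches the same characters as \w, the pattern can also be written with \w directly. The following sketch, with illustrative variable names, checks that both spellings agree on the example string:

```r
test <- "hello_world and _hello_world_"

# Pattern as given in the answer
original <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl = TRUE)

# Equivalent spelling using \w in place of [^\W]
simplified <- gsub("(?<!\\w)_|_(?!\\w)", "", test, perl = TRUE)

identical(original, simplified)
# TRUE
```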

replace last number in string using regex

I want to replace the last number in a string using regex and gsub
S <- "abcd2efghi2.txt"
The last number and the position of the last number can vary.
So I've tried the regex
?<=[\d+])\b
gsub("?<=[\d+])\b", "", S)
but that doesn't seem to work
Appreciate any help.
You can achieve that with a default TRE engine using the following regex:
\d+(\D*)$
Replace with the \1 backreference.
Details
\d+ - 1 or more digits
(\D*) - Capturing group 1: any 0+ non-digit symbols
$ - end of string
\1 - a backreference to the Group 1 value (so as to restore the text matched and consumed with the (\D*) subpattern).
See the regex demo.
R code demo:
sub("\\d+(\\D*)$", "\\1", S)
## => [1] "abcd2efghi.txt"
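A brief sketch of how the same call behaves on a few other inputs; the extra strings are made up for illustration, to cover a multi-digit last number and a string without digits:

```r
S <- c("abcd2efghi2.txt", "a1b22c", "no_digits")

# Remove the last run of digits; strings without digits are returned unchanged
res <- sub("\\d+(\\D*)$", "\\1", S)
res
# "abcd2efghi.txt" "a1bc" "no_digits"
```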
You could use this regex:
\d+(?=\D*$)
It matches a sequence of digits when everything that follows consists of non-digits (\D) until the end of the string ($).
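In R, look-aheads are not supported by the default TRE engine, so this pattern needs perl = TRUE. A minimal sketch follows; the replacement value "9" is only an example:

```r
S <- "abcd2efghi2.txt"

# Delete the last number
removed <- sub("\\d+(?=\\D*$)", "", S, perl = TRUE)
# "abcd2efghi.txt"

# Replace the last number with another value
replaced <- sub("\\d+(?=\\D*$)", "9", S, perl = TRUE)
# "abcd2efghi9.txt"
```

Because the look-ahead does not consume the trailing text, no backreference is needed in the replacement.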

Extracting until the last character in a string

Consider this data:
str <- c("OTB_MCD_100-119_0_0", "PS_SPF_16-31_0_0", "PP_DR/>16-77")
How to make it into a string like this?
str
[1] "OTB_MCD" "PS_SPF" "PP_DR"
I tried substr, but it doesn't work when the characters are of different length.
We can use sub to match zero or more _ followed by zero or more characters that are not letters ([^A-Za-z]*) up to the end ($) of the string, and replace the match with an empty string ("").
sub("_*[^A-Za-z]*$", "", str)
#[1] "OTB_MCD" "PS_SPF" "PP_DR"
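An alternative sketch extracts the leading part instead of deleting the trailing part, assuming the wanted prefix always consists of letter groups joined by underscores. The variable names x, m and prefix are illustrative.

```r
x <- c("OTB_MCD_100-119_0_0", "PS_SPF_16-31_0_0", "PP_DR/>16-77")

# Match letters, optionally followed by more "_letters" groups, from the start
m <- regexpr("^[A-Za-z]+(_[A-Za-z]+)*", x)
prefix <- regmatches(x, m)
prefix
# "OTB_MCD" "PS_SPF" "PP_DR"
```

Note that regmatches drops elements that do not match at all, so the result can be shorter than the input if some strings do not start with a letter.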

Add leading zero within a character string

One column of my data.frame looks like the following:
c("BP_1_CSPP", "BP_2_GEGS", "BP_3_AEAG", "BP_4_KPAP", "BP_5_TAKP",
"BP_6_GGDR", "BP_7_MQQP", "BP_8_EEEE", "BP_9_RSDP", "BP_10_APAS",
"BP_11_KRGG", "BP_12_RSQQ", "BP_13_QQLS", "BP_14_EPEV", "BP_15_AAPS",
"BP_16_SDVT", "BP_17_GQQQ", "BP_18_AETP", "BP_19_PPSA", "BP_20_DATP",
"EpQ_1_AYAT", "EpQ_2_HEKL", "EpQ_3_SCSV", "EpQ_4_MAYV", "EpQ_5_LKDP",
"EpQ_6_ERCE", "EpQ_7_DNPA", "EpQ_8_YGIS", "EpQ_9_GMSS", "EpQ_10_AAKK",
"EpQ_11_NIRI", "EpQ_12_ERRR", "EpQ_13_MDRE", "EpQ_14_SRQM", "EpQ_15_DWSI",
"EpQ_16_VLVQ", "EpQ_17_GRTI", "EpQ_18_EKVR", "EpQ_19_PDVA", "EpQ_20_ADVT",
"LbT_1_RPGG", "LbT_2_TQGD", "LbT_3_EVKS", "LbT_4_VIEM", "LbT_5_GSAD",
"LbT_6_VRPI", "LbT_7_CELG", "LbT_8_APQQ", "LbT_9_SAEE", "LbT_10_GEAE",
"LbT_11_EELR", "LbT_12_EWAN", "LbT_13_IKEE", "LbT_14_VSDF", "LbT_15_WEDV",
"LbT_16_SGGA", "LbT_17_KATN", "LbT_18_EREG", "LbT_19_AWAS", "LbT_20_VDRD",
"abc_1_CVTQ", "abc_2_KEAP", "abc_3_TAYI", "abc_4_MITN", "abc_5_MPTV",
"abc_6_TRTG", "abc_7_KSTI", "abc_8_KEAI", "abc_9_HVYS", "abc_10_LGMG",
"abc_11_VAYQ", "abc_12_AGTG", "abc_13_TDSW", "abc_14_HKKS", "abc_15_YGLA",
"abc_16_WEEW", "abc_17_HSTI", "abc_18_EKCI", "abc_19_PAGI", "abc_20_TGTI",
"TcII")
Considering all the numbers < 10 that are located within the strings (e.g. "BP_1_CSPP", "BP_2_GEGS"), I want to add a leading zero to them, such that I would have:
"BP_01_CSPP", "BP_02_GEGS", "BP_03_AEAG", "BP_04_KPAP", "BP_05_TAKP",
"BP_06_GGDR"
and so on.
This question almost did the job, but it does not work for my data because:
The "0" will not be inserted at the same position every time: some strings have 3 characters before the 0 to be inserted (e.g. BP_1_CSPP) while others have 4 (e.g. EpQ_3_SCSV).
I will still have some characters after the inserted zero, i.e. the zero will be inserted in the middle of the string.
Assuming the vector above is stored in v1, we can use sub to match a _ followed by a single digit captured as a group (([0-9])) and another _, and replace the match with _, a 0, the backreference to the capture group (\\1) and _.
v1 <- sub("_([0-9])_", "_0\\1_", v1)
v1
#[1] "BP_01_CSPP" "BP_02_GEGS" "BP_03_AEAG" "BP_04_KPAP" "BP_05_TAKP" "BP_06_GGDR" "BP_07_MQQP" "BP_08_EEEE" "BP_09_RSDP" "BP_10_APAS" "BP_11_KRGG"
#[12] "BP_12_RSQQ" "BP_13_QQLS" "BP_14_EPEV" "BP_15_AAPS" "BP_16_SDVT" "BP_17_GQQQ" "BP_18_AETP" "BP_19_PPSA" "BP_20_DATP" "EpQ_01_AYAT" "EpQ_02_HEKL"
#[23] "EpQ_03_SCSV" "EpQ_04_MAYV" "EpQ_05_LKDP" "EpQ_06_ERCE" "EpQ_07_DNPA" "EpQ_08_YGIS" "EpQ_09_GMSS" "EpQ_10_AAKK" "EpQ_11_NIRI" "EpQ_12_ERRR" "EpQ_13_MDRE"
#[34] "EpQ_14_SRQM" "EpQ_15_DWSI" "EpQ_16_VLVQ" "EpQ_17_GRTI" "EpQ_18_EKVR" "EpQ_19_PDVA" "EpQ_20_ADVT" "LbT_01_RPGG" "LbT_02_TQGD" "LbT_03_EVKS" "LbT_04_VIEM"
#[45] "LbT_05_GSAD" "LbT_06_VRPI" "LbT_07_CELG" "LbT_08_APQQ" "LbT_09_SAEE" "LbT_10_GEAE" "LbT_11_EELR" "LbT_12_EWAN" "LbT_13_IKEE" "LbT_14_VSDF" "LbT_15_WEDV"
#[56] "LbT_16_SGGA" "LbT_17_KATN" "LbT_18_EREG" "LbT_19_AWAS" "LbT_20_VDRD" "abc_01_CVTQ" "abc_02_KEAP" "abc_03_TAYI" "abc_04_MITN" "abc_05_MPTV" "abc_06_TRTG"
#[67] "abc_07_KSTI" "abc_08_KEAI" "abc_09_HVYS" "abc_10_LGMG" "abc_11_VAYQ" "abc_12_AGTG" "abc_13_TDSW" "abc_14_HKKS" "abc_15_YGLA" "abc_16_WEEW" "abc_17_HSTI"
#[78] "abc_18_EKCI" "abc_19_PAGI" "abc_20_TGTI" "TcII"
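The behaviour is easier to check on a small sample. This sketch uses a four-element subset of the data; two-digit numbers and strings without underscores are left as they are:

```r
v1 <- c("BP_1_CSPP", "BP_10_APAS", "EpQ_3_SCSV", "TcII")

# Only a single digit enclosed by underscores gets a leading zero
padded <- sub("_([0-9])_", "_0\\1_", v1)
padded
# "BP_01_CSPP" "BP_10_APAS" "EpQ_03_SCSV" "TcII"
```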
If we are using strsplit, another option is to split by _, reformat the number with sprintf, and then paste the pieces back together.
sapply(strsplit(v1, "_"), function(x) {
  # Strings without "_" (e.g. "TcII") have a single piece and are left unchanged
  if (length(x) > 1) x[2] <- sprintf("%02d", as.numeric(x[2]))
  paste(x, collapse = "_")
})
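If a different padding width is needed, the split-and-paste idea can be sketched with formatC. The helper name pad_second_field and its width argument are hypothetical, and the sketch assumes the second field is always an integer whenever a string has more than one field.

```r
# Hypothetical helper: zero-pad the second "_"-separated field to `width` digits
pad_second_field <- function(x, width = 2) {
  vapply(strsplit(x, "_", fixed = TRUE), function(parts) {
    if (length(parts) > 1) {
      # Assumes the second field is numeric
      parts[2] <- formatC(as.integer(parts[2]), width = width, flag = "0")
    }
    paste(parts, collapse = "_")
  }, character(1))
}

pad_second_field(c("BP_1_CSPP", "EpQ_10_AAKK", "TcII"), width = 3)
# "BP_001_CSPP" "EpQ_010_AAKK" "TcII"
```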
