Extracting until the last character in a string - r

Consider this data:
str <- c("OTB_MCD_100-119_0_0", "PS_SPF_16-31_0_0", "PP_DR/>16-77")
How can I turn it into a vector like this?
str
[1] "OTB_MCD" "PS_SPF" "PP_DR"
I tried substr, but it doesn't work when the strings have different lengths.

We can use sub to match zero or more _ followed by zero or more characters that are not letters ([^A-Za-z]*) up to the end ($) of the string, and replace the match with an empty string ("").
sub("_*[^A-Za-z]*$", "", str)
#[1] "OTB_MCD" "PS_SPF" "PP_DR"
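To see what the pattern actually strips, here is a minimal sketch that pulls out the matched tail with regexpr and regmatches before deleting it. The names x, pattern, removed and kept are illustrative and not from the original post.

```r
x <- c("OTB_MCD_100-119_0_0", "PS_SPF_16-31_0_0", "PP_DR/>16-77")
pattern <- "_*[^A-Za-z]*$"

# The tail of each string that the pattern matches (what sub() deletes)
removed <- regmatches(x, regexpr(pattern, x))
removed
#[1] "_100-119_0_0" "_16-31_0_0"   "/>16-77"

# Deleting that tail leaves everything up to the last letter
kept <- sub(pattern, "", x)
kept
#[1] "OTB_MCD" "PS_SPF"  "PP_DR"
```

Note that the pattern trims from the last letter onward, so it relies on the part you want to keep ending in a letter.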

Related

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underscores except those between words. In the end, the code should remove underscores at the beginning or end of a word.
The result should be
'hello_world and hello_world'.
I want to use the pre-built character classes. Right now I have learned to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
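The three alternatives above can be tried out with a small wrapper. This is only a sketch: the function name clean_underscores and the second input string are made up for illustration.

```r
# Hypothetical helper wrapping the pattern from the answer
clean_underscores <- function(x) {
  gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", x, perl = TRUE)
}

clean_underscores("hello_world and _hello_world_")
#[1] "hello_world and hello_world"

# Other punctuation is dropped too; the inner underscore survives
clean_underscores("_foo_bar_, (baz)!")
#[1] "foo_bar baz"
```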
One non-regex way is to split the string on spaces and call trimws with the whitespace argument set to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
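The one-liner above handles a single string. If you have a whole character vector, one possible generalization is sketched below; the helper name trim_underscores is hypothetical, and it assumes words are separated by single spaces and that your R version supports the whitespace argument of trimws.

```r
# Hypothetical helper: trim underscores from each space-separated word
trim_underscores <- function(x) {
  vapply(
    strsplit(x, " ", fixed = TRUE),
    function(words) paste(trimws(words, whitespace = "_"), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
}

trim_underscores(c("hello_world and _hello_world_", "__a_b__ c_"))
#[1] "hello_world and hello_world" "a_b c"
```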
We can remove every underscore that has a word boundary on either side. We use lookbehind and lookahead assertions to find such underscores. To remove the underscores at the start and end of the string we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input
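Since [^\W] is a double negation ("not a non-word character"), it should be equivalent to \w. A quick sketch to check that the shorter spelling gives the same result on the example (the variable names original and simplified are illustrative):

```r
test <- "hello_world and _hello_world_"

# Pattern as given in the answer
original <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl = TRUE)

# Same idea with \w instead of [^\W]
simplified <- gsub("(?<!\\w)_|_(?!\\w)", "", test, perl = TRUE)

identical(original, simplified)
#[1] TRUE
simplified
#[1] "hello_world and hello_world"
```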

How do I replace all the punctuation in a string with '\\W'?

string = 'Hello, how are you?'
What I want to achieve:
Hello\\W how are you\\W
What I've done: Substituting all characters that are not alphanumeric with '\\W'
gsub('(\\W)+[^\\S]+','\\\\W',string,perl=TRUE)
[1] "Hello\\Whow are you?"
I'm not sure why the question mark at the end of the sentence wasn't substituted with '\\W', and why the first space was substituted. Could anyone help me out with this? Thank you!
We can do
gsub("[,?]", "\\\\W", string)
#[1] "Hello\\W how are you\\W"
If there are other punctuation characters, use [[:punct:]]
gsub("[[:punct:]]", "\\\\W", string)
#[1] "Hello\\W how are you\\W"
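As for why the original attempt behaved that way: as far as I can tell, (\\W)+[^\\S]+ requires at least one non-word character followed by at least one whitespace character. The ", " fits that, so the comma and the space are replaced together, while the final "?" has nothing after it and is left alone. A small sketch comparing the two (the names attempt and fixed are illustrative):

```r
string <- "Hello, how are you?"

# Original attempt: non-word character(s) followed by whitespace
attempt <- gsub("(\\W)+[^\\S]+", "\\\\W", string, perl = TRUE)
attempt
#[1] "Hello\\Whow are you?"

# Replace each punctuation character on its own
fixed <- gsub("[[:punct:]]", "\\\\W", string)
fixed
#[1] "Hello\\W how are you\\W"

# cat() shows the actual characters, with a single backslash
cat(fixed, "\n")
```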

R: Remove Zeros at end of alphanumeric string

I want to remove all zeros at the end of an alphanumeric string.
dd<-data.frame(a = c("11234000", "000aa456000", "a2340", "00aa45000900"))
Should result in:
dd<-data.frame(a = c("11234", "000aa456", "a234", "00aa450009"))
dd<-data.frame(a = c("11234000", "000aa456000", "a2340", "00aa45000900"))
dd$a = gsub('0+$', '', dd$a)
You can try this. 0* matches zero or more zeros and $ anchors the match at the end of the string.
sub("0*$", "", dd$a)
# [1] "11234" "000aa456" "a234" "00aa450009"
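Both spellings (0+$ and 0*$) appear above. A short sketch on the same data suggests they give the same result here, since replacing an empty match with "" changes nothing. The vector a stands in for dd$a, and the all-zero input is an added illustration:

```r
a <- c("11234000", "000aa456000", "a2340", "00aa45000900")

with_plus <- sub("0+$", "", a)  # one or more trailing zeros
with_star <- sub("0*$", "", a)  # zero or more trailing zeros

identical(with_plus, with_star)
#[1] TRUE
with_plus
#[1] "11234"      "000aa456"   "a234"       "00aa450009"

# A string made only of zeros is reduced to the empty string
sub("0+$", "", "0000")
#[1] ""
```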

How to delete first and last items before the matching pattern or delimiter in R

I have this vector called myvec. I want to delete everything before the first delimiter _ and everything after the last delimiter _ (including the delimiters). How do I do this in R to get the result below?
myvec <- c("contamination_LPH-001-10_3.txt", "contamination_LPH-001-10_AK1_0.txt",
"contamination_LPH-001-10_AK2_1.txt", "contamination_LPH-001-10_PD_2.txt",
"contamination_LPH-001-10_SCC_4.txt")
Result:
LPH-001-10, LPH-001-10_AK1, LPH-001-10_AK2, LPH-001-10_PD, LPH-001-10_SCC
We can use gsub for this
gsub("^[^_]*_|_[^_]*$", "", myvec)
#[1] "LPH-001-10" "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD" "LPH-001-10_SCC"
The pattern has two alternatives separated by |. The first matches, from the start (^) of the string, zero or more characters that are not a _ ([^_]*) followed by a _. The second matches a _ followed by zero or more characters that are not a _ ([^_]*) up to the end ($) of the string. Both matches are replaced with "".
Alternatively, we can use a capture group ((...)) and replace the whole match with its backreference (\\1).
sub("^[^_]*_(.*)_[^_]*$", "\\1", myvec)
#[1] "LPH-001-10" "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD" "LPH-001-10_SCC"
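For comparison, a non-regex sketch of the same idea: split on the delimiter, drop the first and last pieces, and glue the rest back together. The helper name strip_ends and its sep parameter are hypothetical.

```r
myvec <- c("contamination_LPH-001-10_3.txt", "contamination_LPH-001-10_AK1_0.txt",
           "contamination_LPH-001-10_AK2_1.txt", "contamination_LPH-001-10_PD_2.txt",
           "contamination_LPH-001-10_SCC_4.txt")

# Hypothetical helper: keep only the pieces between the first and last sep
strip_ends <- function(x, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE), function(parts) {
    # Drop the first and last pieces, then rejoin what remains
    paste(parts[-c(1, length(parts))], collapse = sep)
  }, character(1), USE.NAMES = FALSE)
}

strip_ends(myvec)
#[1] "LPH-001-10"     "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD"  "LPH-001-10_SCC"
```

This agrees with the gsub version on the example data, but it behaves differently when a string contains fewer than two delimiters (it returns an empty string), so the regex versions are probably the safer default.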

grep formatted number using r

I have a string format that I would like to select from a character vector. The form is
123 123 1234
where each of the two spaces can also be a hyphen, i.e. 3 digits followed by a space or hyphen, followed by 3 digits, followed by a space or hyphen, followed by 4 digits.
I am trying to do this by the following:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4}$)",mytext)
however this yields:
integer(0)
What am I doing wrong?
Your string has a trailing whitespace character, so you can either account for that whitespace in the pattern, like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4} $)",mytext)
Or remove the end-of-string assertion "$", like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4})",mytext)
Also, as pointed out by Wiktor Stribiżew, the character class [ -.] matches any character in the range from " " to ".". To match only "-", "." and " ", you have to escape the "-" or put it at the end of the class, e.g. [ \-.] or [ .-]. R's ?regex help documents the backslash escape inside a class for perl = TRUE only, so [ .-] is the safer choice.
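Putting both points together, a minimal sketch with the hyphen moved to the end of the class. The sample vector mytext is invented for illustration, since the original post does not show it, and trimws is used here as one way to deal with the trailing whitespace.

```r
# Hypothetical sample data; the original post does not show mytext
mytext <- c("123 123 1234", "123-123-1234", "123 123 1234 ",
            "123+123+1234", "12 123 1234")

# Hyphen placed last in the class, so it is a literal "-"
strict <- "^[0-9]{3}[ .-][0-9]{3}[ .-][0-9]{4}$"

grepl(strict, mytext)
#[1]  TRUE  TRUE FALSE FALSE FALSE

# Trimming first lets the entry with a trailing space match as well
grep(strict, trimws(mytext))
#[1] 1 2 3
```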
