I have a specific issue with character substitution in strings:
If I have the following strings
"..A.B....c...A..D.."
"A..S.E.Q.......AW.."
".B.C..a...R......Ds"
Which regex substitution should I use to replace the dots and obtain the following strings:
"A_B_c_A_D"
"A_S_E_Q_AW"
"B_C_a_R_Ds"
I am using R.
Thanks in advance!
Using stringr from the ever fantastic tidyverse.
library(stringr)
str1 <- "..A.B....c...A..D.."
str1 %>%
  # replace each run of dots that sits between two word characters with "_"
  # ('\\.' is a literal dot, '+' means one or more, '(?<=\\w)' and '(?=\\w)' are lookbehind and lookahead)
  str_replace_all('(?<=\\w)\\.+(?=\\w)', '_') %>%
  # delete the remaining dots (i.e. those at the start and end)
  str_remove_all('\\.')
As always plenty of ways to skin the cat with regex
Here is a solution using gsub in two steps:
string = c("..A.B....c...A..D..","A..S.E.Q.......AW..",".B.C..a...R......Ds")
First, remove the leading and trailing dots:
string2 = gsub("^\\.+|\\.+$", "", string)
Finally, replace each run of one or more dots with _:
string2 = gsub("\\.+", "_", string2)
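Using x shown in the Note at the end, use trimws to trim the dots off both ends. In a regular expression a dot matches any character, so we escape it with backslashes to remove that meaning. Then replace every dot with an underscore using chartr; because chartr maps character for character, each run of dots becomes a run of underscores of the same length. No packages are used.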
x |> trimws("both", "\\.") |> chartr(old = ".", new = "_")
## [1] "A_B____c___A__D" "A__S_E_Q_______AW" "B_C__a___R______Ds"
Note
x <- c("..A.B....c...A..D..",
"A..S.E.Q.......AW..",
".B.C..a...R......Ds")
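The chartr output above keeps one underscore per dot, so it does not yet match the single-underscore strings requested in the question. A minimal base R sketch that trims the ends and then collapses each run of dots, assuming R 3.6.0 or later (where trimws accepts the whitespace argument); the names trimmed and collapsed are only illustrative:

```r
x <- c("..A.B....c...A..D..",
       "A..S.E.Q.......AW..",
       ".B.C..a...R......Ds")

# Trim dots from both ends of every element.
trimmed <- trimws(x, which = "both", whitespace = "\\.")

# Replace each remaining run of one or more dots with a single underscore.
collapsed <- gsub("\\.+", "_", trimmed)

print(collapsed)
# expected: c("A_B_c_A_D", "A_S_E_Q_AW", "B_C_a_R_Ds")
```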
Related
I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first underscore, so that it becomes
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this too: the ^ anchor restricts the match to the start of the string.
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This will remove the first occurrence of _ wherever it appears; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
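To make the difference between the unanchored and the anchored pattern concrete, here is a small sketch that runs both on a string with a leading underscore and on one without. The variable names are made up for illustration.

```r
with_lead    <- "_6302_I-PAL_SPSY_000237_001"
without_lead <- "6302_I-PAL_SPSY_000237_001"
both <- c(with_lead, without_lead)

# Unanchored: removes the first underscore wherever it occurs.
unanchored <- sub("_", "", both)

# Anchored with ^: removes an underscore only at the start of the string.
anchored <- sub("^_", "", both)

print(unanchored)  # the second element loses its first internal underscore
print(anchored)    # both elements become "6302_I-PAL_SPSY_000237_001"
```

If the underscore should only ever be stripped from the front, the anchored form is the safer default.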
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
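The text mentions gsub as an alternative but only shows the stringr call. A possible base R equivalent is sketched below; using perl = TRUE is my choice here so that \\b is interpreted by PCRE, and the name cleaned is illustrative.

```r
string1 <- "_6302_I-PAL_SPSY_000237_001"
string2 <- paste(string1, string1)

# "\\b_" matches an underscore that starts a word, i.e. one at the start of
# the string or right after a non-word character such as a space.
cleaned <- gsub("\\b_", "", string2, perl = TRUE)

print(cleaned)
# expected: "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
```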
I need to replace values matching "CBII_*_*_" with "MAP_" in the vector tt below.
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")
I tried
gsub("CBII_*_*", "MAP_") which won't give the expected result. What would be the solution for this so I get:
"MAP_62770", "MAP_6272", "MAP_1222"
You can use:
gsub("^CBII_.*_.*_", "MAP_",tt)
or
stringr::str_replace(tt, "^CBII_.*_.*_", "MAP_")
Output
[1] "MAP_62770" "MAP_6272" "MAP_1222"
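For context on why the attempt in the question falls short: in a regular expression, _* means "zero or more underscores" rather than a shell-style wildcard, so CBII_*_* only consumes CBII_. The sketch below contrasts that with an explicit pattern. The [^_]* pieces (any run of non-underscore characters) are my own variant of the .* used above, not something from the original answers.

```r
tt <- c("CBII_27_1018_62770", "CBII_2733_101448_6272", "MAP_1222")

# The pattern from the question, with the input vector supplied:
# "_*" matches zero or more underscores, so only "CBII_" is replaced.
glob_style <- gsub("CBII_*_*", "MAP_", tt)
print(glob_style)
# expected: c("MAP_27_1018_62770", "MAP_2733_101448_6272", "MAP_1222")

# Explicit version: "CBII_", then two underscore-terminated fields.
explicit <- sub("^CBII_[^_]*_[^_]*_", "MAP_", tt)
print(explicit)
# expected: c("MAP_62770", "MAP_6272", "MAP_1222")
```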
An option with trimws from base R along with paste. We specify the whitespace as characters (.*) till the _. Thus, it removes the substring till the last _ and then with paste concatenate a new string ("MAP_")
paste0("MAP_", trimws(tt, whitespace = ".*_"))
#[1] "MAP_62770" "MAP_6272" "MAP_1222"
sub(".*(?<=_)(\\d+)$", "MAP_\\1", tt, perl = T)
[1] "MAP_62770" "MAP_6272" "MAP_1222"
Here we use positive lookbehind to assert that there is an underscore _ on the left of the capturing group (\\d+) at the very end of the string ($); we recall that capturing group with \\1 in the replacement argument to sub and put MAP_ in front of it.
I have a string like word_string. What I want is Word_String. If I use the function str_to_title from stringr, what I get is Word_string. It does not capitalize the second word.
Does anyone know any elegant way to achieve that with stringr? Thanks!
Here is a base R option using gsub:
input <- "word_string"
output <- gsub("(?<=^|_)([a-z])", "\\U\\1", input, perl=TRUE)
output
[1] "Word_String"
The regex pattern used matches and captures any lowercase letter [a-z] which is preceded by either the start of the string (i.e. it's the first letter) or an underscore. Then, we replace with the uppercase version of that single letter. Note that the \U modifier to change to uppercase is a Perl extension, so we must use gsub in Perl mode (perl=TRUE).
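As a variation on the same idea, here is a sketch that avoids the lookbehind by capturing the preceding start-of-string or underscore and putting it back in the replacement. This is an alternative formulation, not the pattern from the answer above, and the second test string is made up for illustration.

```r
input <- c("word_string", "one_two_three")

# Group 1 is the start of the string (empty) or an underscore; group 2 is the
# lowercase letter. \\1 is emitted unchanged, then \\U upper-cases what follows.
output <- gsub("(^|_)([a-z])", "\\1\\U\\2", input, perl = TRUE)

print(output)
# expected: c("Word_String", "One_Two_Three")
```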
Can also use to_any_case from snakecase
library(snakecase)
to_any_case(str1, "title", sep_out = "_")
#[1] "Word_String"
data
str1 <- "word_string"
This is admittedly overcomplicated, but here is another base R possibility:
test <- "word_string"
paste0(unlist(lapply(strsplit(test, "_"),function(x)
paste0(toupper(substring(x,1,1)),
substring(x,2,nchar(x))))),collapse="_")
[1] "Word_String"
You could first use gsub to replace "_" with " ", apply the str_to_title function, and then use gsub again to change it back to your format:
library(stringr)
x <- str_to_title(gsub("_", " ", "word_string"))
gsub(" ", "_", x)
I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing
We can try using gsub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single character
\\1+ then match that same character again, one or more times
Then, we replace with just the single captured character.
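One thing to keep in mind with the general pattern is that it collapses every repeated character, including legitimately doubled letters. A short sketch comparing the two patterns on a made-up input:

```r
x <- "committee---meeting"

# Targeted: only runs of two or more hyphens are collapsed.
targeted <- gsub("-{2,}", "-", x)

# General: any character followed by copies of itself is collapsed.
general <- gsub("(.)\\1+", "\\1", x)

print(targeted)  # expected: "committee-meeting"
print(general)   # expected: "comite-meting"
```

So the general version is mainly useful when the input is known not to contain meaningful repeated characters.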
You can do it with gsub and a regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"
t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, you're just searching for a pattern that is a hyphen, so pattern = "-", but we add the plus, which matches one or more consecutive hyphens (greedily, so it takes the whole run), giving pattern = "-+".
I would like to insert a colon every five characters starting from the end of the string, preferably using regex and gsub in R.
text <- "My Very Enthusiastic Mother Just Served Us Noodles!"
I have been able to insert a colon every five characters from the beginning of the text using:
gsub('(.{5})', "\\1:", text, perl = T)
I have written an inelegant function for achieving this as follows:
library(dplyr)
str_reverse<-function(x){
strsplit(x,split='')[[1]] %>% rev() %>% paste(collapse = "")
}
text2<-str_reverse(text)
text3<-gsub('(.{5})', "\\1:", text2, perl = T)
str_reverse(text3)
to get the desired result
[1] "M:y Ver:y Ent:husia:stic :Mothe:r Jus:t Ser:ved U:s Noo:dles!"
Is there a way this can be achieved directly using regular expressions?
You may use
gsub('(?=(?:.{5})+$)', ":", text, perl = TRUE)
## => [1] "M:y Ver:y Ent:husia:stic :Mothe:r Jus:t Ser:ved U:s Noo:dles!"
The (?=(?:.{5})+$) pattern matches any position inside the string that is followed by any 5 chars (other than line break chars), repeated 1 or more times, up to the end of the string.
If the input string can contain line breaks you need to add (?s) at the start of the pattern (since . in PCRE regex does not match line breaks by default):
'(?s)(?=(?:.{5})+$)'
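To see the lookahead at work on short inputs, here is a small sketch. As far as I can tell, when the string length is an exact multiple of five the pattern also matches at the very start, which produces a leading colon. The guarded variant with (?<=.) is my own hypothetical addition for that case, not part of the answer above.

```r
# A position followed by one or more complete groups of five characters
# up to the end of the string.
from_end <- "(?=(?:.{5})+$)"

# Hypothetical variant: (?<=.) also requires a preceding character,
# which rules out a match at the very start of the string.
from_end_guarded <- "(?<=.)(?=(?:.{5})+$)"

twelve      <- gsub(from_end, ":", "abcdefghijkl", perl = TRUE)
ten         <- gsub(from_end, ":", "abcdefghij", perl = TRUE)
ten_guarded <- gsub(from_end_guarded, ":", "abcdefghij", perl = TRUE)

print(twelve)       # expected: "ab:cdefg:hijkl"
print(ten)          # expected: ":abcde:fghij"
print(ten_guarded)  # expected: "abcde:fghij"
```

Whether a leading colon is wanted depends on the use case; the example text in the question has 51 characters, so the issue does not arise there.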