Remove everything from the n-th last separator to the end of a string - R

I have the following string:
data_string = c("Aa_Bbbbb_0_ID1",
"Aa_Bbbbb_0_ID2",
"Aa_Bbbbb_0_ID3",
"Ccccc_D_EEE_0_ID1")
I just want to trim each string to get these results:
"Aa_Bbbbb"
"Aa_Bbbbb"
"Aa_Bbbbb"
"Ccccc_D_EEE"
So basically, I'm looking for a function that takes data_string, a separator, and the split position:
remove_tail(data_string, sep = '_', del = 2)
removing only the tail from the 2nd-last separator to the end of the string (not splitting the whole string).

Try below:
# split on "_" then paste back removing last 2
sapply(strsplit(data_string, "_", fixed = TRUE),
       function(i) paste(head(i, -2), collapse = "_"))
We can make our own function:
# custom function
remove_tail <- function(x, sep = "_", del = 2){
  sapply(strsplit(x, split = sep, fixed = TRUE),
         function(i) paste(head(i, -del), collapse = sep))
}
remove_tail(data_string, sep = '_', del = 2)
# [1] "Aa_Bbbbb" "Aa_Bbbbb" "Aa_Bbbbb" "Ccccc_D_EEE"
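Because remove_tail passes fixed = TRUE to strsplit, the separator is treated literally, so a regex metacharacter such as . should also work as a separator. A small sketch, repeating the function so it runs on its own; the dotted and short inputs are made up for illustration:

```r
# Same function as above, repeated so this snippet is self-contained
remove_tail <- function(x, sep = "_", del = 2) {
  sapply(strsplit(x, split = sep, fixed = TRUE),
         function(i) paste(head(i, -del), collapse = sep))
}

# Hypothetical input using "." as the separator
dotted <- c("a.b.c.d", "x.y.z")
res_dot <- remove_tail(dotted, sep = ".", del = 1)
print(res_dot)
# [1] "a.b.c" "x.y"

# Edge case: fewer pieces than del leaves nothing to paste back
res_short <- remove_tail("a_b", sep = "_", del = 2)
print(res_short)
# [1] ""
```

Note the edge case: when a string has no more than del pieces, the result is an empty string rather than the original string.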

Using gsub
gsub("_0_.*","",data_string)

We can also use sub to match the _ followed by one or more digits (\\d+) and the rest of the characters, and replace it with blank ("")
sub("_\\d+.*", "", data_string)
#[1] "Aa_Bbbbb" "Aa_Bbbbb" "Aa_Bbbbb" "Ccccc_D_EEE"
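Both regex answers rely on the tail starting with _0_ or _ plus digits. A possible sketch that counts separators from the end instead, which is what the question asks for. The name remove_tail_regex is hypothetical, and it assumes sep is a single character with no special meaning inside a regex character class:

```r
data_string <- c("Aa_Bbbbb_0_ID1", "Aa_Bbbbb_0_ID2",
                 "Aa_Bbbbb_0_ID3", "Ccccc_D_EEE_0_ID1")

# Hypothetical helper: drop the last `del` "sep + non-sep run" groups.
# Assumes `sep` is one plain character (e.g. "_" or "-").
remove_tail_regex <- function(x, sep = "_", del = 2) {
  pattern <- sprintf("(%s[^%s]*){%d}$", sep, sep, del)
  sub(pattern, "", x)
}

res <- remove_tail_regex(data_string, sep = "_", del = 2)
print(res)
# [1] "Aa_Bbbbb"    "Aa_Bbbbb"    "Aa_Bbbbb"    "Ccccc_D_EEE"

# Strings with fewer than `del` separators do not match and stay unchanged
res_short <- remove_tail_regex("a_b", sep = "_", del = 2)
print(res_short)
# [1] "a_b"
```

Unlike the strsplit-based function, this version leaves strings with too few separators untouched instead of returning an empty string.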

Related

regex to initialize given name

I have a list of names such as
lastname1, Abc-Def
lastname2, Abc
I am trying to find a regex to initialize the given names (that come after the comma ,) so it gives me:
lastname1, A.-D.
lastname2, A.
The closest I got: https://regex101.com/r/nKtPCq/2/
(.*), ([A-zÀ-ú])\w*-?([A-zÀ-ú])+
In R, instead of regex you could also do this if you want:
str1 = "lastname1, Abc-Def"
str2 = "lastname2, Abc"
initialize = function(nameString) {
  namesList = strsplit(nameString, ", ")
  splitLast = strsplit(namesList[[1]][2], "-")
  initials = paste(substr(splitLast[[1]], 1, 1), ".", sep="", collapse="-")
  paste(namesList[[1]][1], ", ", initials, sep="")
}
print(initialize(str1)) # "lastname1, A.-D."
print(initialize(str2)) # "lastname2, A."
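If a regex is still preferred, one possible sketch in base R is a single gsub that keeps the delimiter (", " or "-") and the first letter of each given-name part, then replaces the rest of that part with a dot. It works on a whole vector at once. It assumes the surname itself contains no hyphen and no ", " sequence, otherwise those parts would be abbreviated too:

```r
name_strings <- c("lastname1, Abc-Def", "lastname2, Abc")

# Group 1: the delimiter before a given-name part (", " or "-")
# Group 2: the first letter of that part; the remaining letters are dropped
initials <- gsub("(, |-)([[:alpha:]])[[:alpha:]]*", "\\1\\2.", name_strings)
print(initials)
# [1] "lastname1, A.-D." "lastname2, A."
```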

Get rid of extra sep in the paste function in R

I am trying to get rid of the extra sep in the paste function in R.
It looks easy but I cannot find a non-hacky way to fix it. Assume l1-l3 are lists
l1 = list(a=1)
l2 = list(b=2)
l3 = list(c=3)
l4 = list(l1,l2=l2,l3=l3)
note that the first element of l4 is not named. Now I want to add a constant to the names like below:
names(l4 ) = paste('Name',names(l4),sep = '.')
Here is the output:
names(l4)
[1] "Name." "Name.l2" "Name.l3"
How can I get rid of the . in the first output (Name.)
We can use trimws (from R 3.6.0 onward, its whitespace argument accepts a custom character pattern)
trimws(paste('Name',names(l4),sep = '.'), whitespace = "\\.")
#[1] "Name" "Name.l2" "Name.l3"
Or use sub to match the . at the end ($) of the string and replace it with blank (""); since . is a metacharacter that matches any character, we escape it with \\ to get the literal meaning
sub("\\.$", "", paste('Name',names(l4),sep = '.'))
If a name can itself end in a literal . (say the second name were l2.), stripping trailing dots would damage it; in that case we can instead branch on whether the name is empty
ifelse(nzchar(names(l4)), paste("Name", names(l4), sep="."), "Name")
#[1] "Name" "Name.l2." "Name.l3"
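To make the difference between the two approaches concrete, here is a small sketch. The vector nms is a made-up stand-in for names(l4) in which the second name ends in a dot:

```r
# Hypothetical names: first one empty, second one already ends in "."
nms <- c("", "l2.", "l3")

# Stripping a trailing dot also removes the dot that belongs to "l2."
via_sub <- sub("\\.$", "", paste("Name", nms, sep = "."))
print(via_sub)
# [1] "Name"    "Name.l2" "Name.l3"

# Branching on empty names leaves "l2." intact
via_ifelse <- ifelse(nzchar(nms), paste("Name", nms, sep = "."), "Name")
print(via_ifelse)
# [1] "Name"     "Name.l2." "Name.l3"
```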

Replace multiple strings comprising of a different number of characters with one gsubfn()

The question "Replace multiple strings in one gsub() or chartr() statement in R?" explains how to replace multiple single-character strings in one statement with gsubfn(). E.g.:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess it is because the string 'doremi' contains multiple characters and I am using the metacharacter . in gsubfn, which matches one character at a time. I have no idea what to replace it with - I must confess I find the use of metacharacters sometimes a bit difficult to understand. Thus, is there a way for me to replace '-' and 'doremi' at once?
You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"
Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is the pattern "doremi|-", which searches for either "doremi" or "-"; "|" is the or operator.
A more generic version of @RLave's solution:
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"
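If installing gsubfn is not an option, a rough base R sketch is to loop over the replacements and apply gsub with fixed = TRUE for each one. The function name replace_all is hypothetical. Unlike gsubfn, the replacements are applied one after another, so a later replacement can see text produced by an earlier one:

```r
# Hypothetical helper: apply each literal replacement in turn
replace_all <- function(x, replacements) {
  for (from in names(replacements)) {
    # fixed = TRUE treats `from` as a literal string, not a regex
    x <- gsub(from, replacements[[from]], x, fixed = TRUE)
  }
  x
}

x <- "doremi g-k"
res <- replace_all(x, c("-" = "_", "doremi" = ""))
print(res)
# [1] " g_k"
```

As with the gsubfn answers, the leading space remains because only 'doremi' itself is removed.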

Remove line return in a string from the second "\n" and the rest (do not remove the first one)

I have sequences that I want to reformat in R.
But R presents me the sequences with line returns (as can be seen with the "\n"):
seqs <- ">PRTRE213-13 Volkameria aculeatum matK \n------------------------------------------------------------------CCAAC\nCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAA\nAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCT\nTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTT\nTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC---------\n-AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAA\nAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCAC\nCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAAT\nTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTAC\nCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTAT\nTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGT\nTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATC\nCTTTCTC-------------"
But I want to remove all the \n except the first one. I.e.:
[1] ">PRTRE213-13 Volkameria aculeatum matK \n------------------------------------------------------------------CCAACCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAAAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCTTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTTTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC----------AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAAAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCACCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAATTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTACCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTATTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGTTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATCCTTTCTC-------------"
If I do this, it removes all of the returns.
gsub(pattern = "\n",replacement = "", x = seqs)
This is not working:
sub("^(.*? \n .*?) \n .*", "\\1", seqs)
This gives me an error:
gsub(pattern = "${'\n'[*]:0:2}",replacement = "", x = seqs)
Error in gsub(pattern = "${'\n'[*]:0:2}", replacement = "", x = seqs) :
invalid regular expression '${'
'[*]:0:2}', reason 'Invalid contents of {}'
My sequences are variable:
">Whatever here before \n the sequence start \n the rest \n..."
The end result would be
">Whatever here before \n the sequence start the rest..."
Interestingly, the code below partially works for the test sentence, but not the sequence above:
seqss = ">Whatever here before \n the sequence start \n the rest \n..."
sub("^(.*? \n .*?) \n .*", "\\1", seqss)
[1] ">Whatever here before \n the sequence start"
Try it like this:
seqs <- ">PRTRE213-13 Volkameria aculeatum matK \n------------------------------------------------------------------CCAAC\nCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAA\nAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCT\nTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTT\nTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC---------\n-AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAA\nAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCAC\nCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAAT\nTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTAC\nCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTAT\nTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGT\nTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATC\nCTTTCTC-------------"
gsub(pattern = "(^.*?\\n)|\\n",replacement = "\\1", x = seqs, perl = TRUE)
The idea of the regex
(^.*?\\n)|\\n
is to capture everything up to the first newline in a group to retain and put it back in the replacement.
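The effect is easier to see on a short string. In this sketch the input is made up; the first alternative matches only once (at the start of the string) and is put back via \\1, while every later newline matches the second alternative, where group 1 is empty, so it is replaced by nothing:

```r
# Hypothetical short input with a header line and three sequence lines
short_seq <- ">header \nAAA\nCCC\nGGG"

res <- gsub("(^.*?\\n)|\\n", "\\1", short_seq, perl = TRUE)
print(res)
# [1] ">header \nAAACCCGGG"
```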
A stringr approach that simply splits the string and combines the two parts back together.
seqs <- ">PRTRE213-13 Volkameria aculeatum matK \n------------------------------------------------------------------CCAAC\nCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAA\nAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCT\nTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTT\nTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC---------\n-AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAA\nAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCAC\nCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAAT\nTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTAC\nCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTAT\nTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGT\nTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATC\nCTTTCTC-------------"
library(stringr)
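keep_first_newline <- function(string) {
  # position of the first newline
  first_newline <- str_locate(string, "\\n")[1]
  # everything up to and including the first newline
  head <- str_sub(string, end = first_newline)
  # the rest of the string with all remaining newlines removed
  tail <- string %>%
    str_sub(start = first_newline + 1) %>%
    str_remove_all("\\n")
  str_c(head, tail)
}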
seqs %>%
keep_first_newline %>%
writeLines
#> >PRTRE213-13 Volkameria aculeatum matK
#> ------------------------------------------------------------------CCAACCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAAAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCTTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTTTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC----------AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAAAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCACCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAATTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTACCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTATTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGTTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATCCTTTCTC-------------
Created on 2018-06-29 by the reprex package (v0.2.0).
Using gsubfn we can do the following. It is not any better than the regex in this case, but it is more easily extended if you want to keep the first n occurrences with n > 1.
library(gsubfn)
p <- proto(fun = function(this, x) if(count > 1) '' else x)
out <- gsubfn('\n', p, seqs)
The result is the same as the accepted answer's:
out == gsub(pattern = "(^.*?\\n)|\\n",replacement = "\\1", x = seqs, perl = TRUE)
#[1] TRUE
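For keeping the first n newlines without extra packages, one possible base R sketch is to split the string into characters and drop every newline whose running count exceeds n. The function name keep_first_n_newlines is hypothetical, and this version handles a single string only:

```r
# Hypothetical helper: keep only the first `n` newlines of one string
keep_first_n_newlines <- function(x, n = 1) {
  chars <- strsplit(x, "")[[1]]        # split into single characters
  is_nl <- chars == "\n"
  drop  <- is_nl & cumsum(is_nl) > n   # newlines after the n-th one
  paste(chars[!drop], collapse = "")
}

# Made-up input for illustration
x <- ">id \nAAA\nCCC\nGGG"
res1 <- keep_first_n_newlines(x, n = 1)
res2 <- keep_first_n_newlines(x, n = 2)
print(res1)
# [1] ">id \nAAACCCGGG"
print(res2)
# [1] ">id \nAAA\nCCCGGG"
```

To process several sequences, the function could be wrapped in vapply(seqs, keep_first_n_newlines, character(1)).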

Extract a pattern before // and after || symbol

I am not very familiar with regex in R.
In a column, I am trying to extract the words before // and after ||. This is what I have in my column:
qtaro_269//qtaro_269||qtaro_353//qtaro_353||qtaro_375//qtaro_375||qtaro_11//qtaro_11
This is what I want:
qtaro_269; qtaro_353; qtaro_375; qtaro_11
I found this: Extract character before and after "/" and this: Extract string before "|". However I don't know how to adjust it to my input. Any hint is much appreciated.
EDIT:
a qtaro_269//qtaro_269||qtaro_353//qtaro_353||qtaro_375//qtaro_375||qtaro_11//qtaro_11
b
c qtaro_269//qtaro_269||qtaro_353//qtaro_353||qtaro_375//qtaro_375||qtaro_11//qtaro_11
What about the following?
# Split by "||" (x holds the input string from the question)
x2 <- unlist(strsplit(x, "\\|\\|"))
[1] "qtaro_269//qtaro_269" "qtaro_353//qtaro_353" "qtaro_375//qtaro_375" "qtaro_11//qtaro_11"
# Remove everything before and including "//"
gsub(".+//", "", x2)
[1] "qtaro_269" "qtaro_353" "qtaro_375" "qtaro_11"
And if you want it as one string with ; for separation:
paste(gsub(".+//", "", x2), collapse = "; ")
[1] "qtaro_269; qtaro_353; qtaro_375; qtaro_11"
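For a whole column, including empty cells like row b in the EDIT, a vectorised sketch could look like this. The function name extract_ids and the shortened sample vector are made up for illustration, and the column is assumed to be character rather than factor:

```r
# Hypothetical column values; the second entry is empty like row "b"
col <- c("qtaro_269//qtaro_269||qtaro_353//qtaro_353",
         "",
         "qtaro_11//qtaro_11")

# Hypothetical helper: split on the literal "||", keep the text before "//"
extract_ids <- function(x) {
  vapply(strsplit(x, "||", fixed = TRUE), function(parts) {
    paste(sub("//.*", "", parts), collapse = "; ")
  }, character(1))
}

res <- extract_ids(col)
print(res)
# [1] "qtaro_269; qtaro_353" ""                     "qtaro_11"
```

Empty cells come back as empty strings, so the result has the same length as the input and can be assigned back to the column.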
This is how I solved it. For sure not the most intelligent and elegant way, so suggestions to improve it are welcome.
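# work on a copy of the second column so df itself is not overwritten
v <- unlist(lapply(strsplit(df[[2]], split = "\\|\\|"), FUN = paste, collapse = "; "))
v <- unlist(lapply(strsplit(v, split = "//"), FUN = paste, collapse = "; "))
# drop the duplicated IDs and write the result back to the column
df[[2]] <- sapply(strsplit(v, "; ", fixed = TRUE), function(x) paste(unique(x), collapse = "; "))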
