Extract substring until "?" with sub() [duplicate] - r

So, I want to extract the substring of a string like this
mystr <- "aa/bb/cc?rest"
I found the sub() function but executing sub("?.*", "", mystr) returns "" instead of "aa/bb/cc".
Why?
The reason is obviously that ? is a special character, but using backticks or "\?" doesn't solve the problem.

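You need a double backslash (\\) to escape it. The R parser consumes one backslash when it reads the string, and the regex engine needs the other: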
> mystr <- "aa/bb/cc?rest"
> sub("\?.*", "", mystr)
Error: '\?' is an unrecognized escape in character string starting ""\?"
> sub("\\?.*", "", mystr)
[1] "aa/bb/cc"
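If the doubled backslash is hard to read, there are other ways to match a literal ? in base R. The following is only a sketch of two alternatives, not the accepted answer: a character class, in which ? has no special meaning, and strsplit() with fixed = TRUE, which treats the pattern as plain text. The names via_class and via_split are made up for illustration.

```r
mystr <- "aa/bb/cc?rest"

# Inside a character class, "?" is literal, so no backslashes are needed.
via_class <- sub("[?].*", "", mystr)

# With fixed = TRUE the pattern is a plain string: split on "?" and keep
# the first piece.
via_split <- strsplit(mystr, "?", fixed = TRUE)[[1]][1]

print(via_class)
print(via_split)
```

Both calls should give "aa/bb/cc" for this input.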


R: Regex for Phone Numbers [duplicate]

I am working with the R programming language.
I have a column of data that looks something like this:
string = c("a1 123-456-7899 hh", "b 124-123-9999 b3")
I would like to remove the "phone numbers" so that the final result looks like this:
[1] "a1 hh" "b b3"
I tried to apply the answer from "Regular expression to match standard 10 digit phone number" to my question:
gsub("^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$", "", string, fixed = TRUE)
But I get the following error: Error: '\+' is an unrecognized escape in character string starting ""^(\+"
Can someone please show me how to fix this?
Thanks!
Try:
library(stringr)
s <- c("a1 123-456-7899 hh", "b 124-123-9999 b3")
result <- str_replace(s, "\\d+[-]\\d+[-]\\d+\\s", "")
print(result)
OUTPUT:
[1] "a1 hh" "b b3"
This looks for:
\\d+ : one or more digits, followed by
[-] : a hyphen, followed by
\\d+ : one or more digits, followed by
[-] : a hyphen, followed by
\\d+ : one or more digits, followed by
\\s : a space
and replaces the match with "" (nothing).
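The attempt in the question fails for more than one reason: the backslashes are not doubled, fixed = TRUE turns off regex matching altogether, and the ^ and $ anchors require the whole string to be a phone number. A base R sketch without stringr, assuming the numbers always have the 3-3-4 digit layout shown in the sample data, might look like this:

```r
s <- c("a1 123-456-7899 hh", "b 124-123-9999 b3")

# Three digits, a hyphen, three digits, a hyphen, four digits, then the
# single space that follows the number. No anchors, so the match may sit
# anywhere in the string.
result <- gsub("\\d{3}-\\d{3}-\\d{4}\\s", "", s)
print(result)
```

For the two sample strings this should print "a1 hh" and "b b3".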

replacing a term in vector using str_replace_all function R [duplicate]

I am trying to replace the hyphen (-) in a vector with a period (.).
> samples_149_vector
$Patient.ID
[1] MB-0020 MB-0100 MB-0115 MB-0158 MB-0164 MB-0174 MB-0179 MB-0206
[9] MB-0214 MB-0238 MB-0259 MB-0269 MB-0278 MB-0333 MB-0347 MB-0352
[17] MB-0372 MB-0396 MB-0399 MB-0400 MB-0401 MB-0420 MB-0424 MB-0446
[25] MB-0464 MB-0476 MB-0481 MB-0489 MB-0494 MB-0495 MB-0500 MB-0502
The following code
library(stringr)
str_replace_all(samples_149_vector, "-", ".")
generates the following output and warning:
> str_replace_all(samples_149_vector, "-", ".")
[1] "1:149"
[2] "function (length = 0) \n.Internal(vector(\"character\", length))"
Warning message:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
argument is not an atomic vector; coercing
Any ideas? I have tried many things and combinations, but the "coercing" atomic vector warning keeps recurring.
Can you try escaping the period? In a regular expression pattern, "." matches any character, so a literal period is written as "\\.":
str_replace_all(samples_149_vector, "-", "\\.")
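The warning itself ("argument is not an atomic vector; coercing") and the $Patient.ID line in the printed output suggest that samples_149_vector is a list rather than a character vector. If so, the replacement has to be applied to the element inside it. The sketch below uses base R and a small hypothetical stand-in, samples_demo, because the real data is not available. It assumes Patient.ID is a factor or character vector.

```r
# Hypothetical stand-in for the asker's data: a list holding a factor of IDs.
samples_demo <- list(Patient.ID = factor(c("MB-0020", "MB-0100", "MB-0115")))

# Extract the element and convert it to character before replacing.
ids <- as.character(samples_demo$Patient.ID)

# fixed = TRUE treats "-" and "." as plain text, so nothing needs escaping.
ids_dotted <- gsub("-", ".", ids, fixed = TRUE)
print(ids_dotted)
```

For the three made-up IDs this should print "MB.0020", "MB.0100" and "MB.0115".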

Match all elements with punctuation mark except asterisk in r [duplicate]

I have a vector vec, some of whose elements contain a punctuation mark. I want to return all elements that contain a punctuation mark, except those containing an asterisk.
vec <- c("a,","abc","ef","abc-","abc|","abc*01")
> vec[grepl("[^*][[:punct:]]", vec)]
[1] "a," "abc-" "abc|" "abc*01"
Why does it return "abc*01" when the pattern includes the negation [^*]?
Maybe you can try grep like below
grep("\\*",grep("[[:punct:]]",vec,value = TRUE), value = TRUE,invert = TRUE) # nested `grep`s for double filtering
or
grep("[^\\*[:^punct:]]",vec,perl = TRUE, value = TRUE) # but this will fail for case `abc*01|` (thanks for feedback from @Tim Biegeleisen)
which gives
[1] "a," "abc-" "abc|"
You could use grepl here:
vec <- c("a,","abc-","abc|","abc*01")
vec[grepl("^(?!.*\\*).*[[:punct:]].*$", vec, perl=TRUE)]
[1] "a," "abc-" "abc|"
The regex pattern used, ^(?!.*\\*).*[[:punct:]].*$, will only match strings that contain no asterisk characters and at least one punctuation character:
^ from the start of the string
(?!.*\*) assert that no * occurs anywhere in the string
.* match any content
[[:punct:]] match any single punctuation character (but not *)
.* match any content
$ end of the string
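As for why the original pattern matched "abc*01": [^*][[:punct:]] only asks for some non-asterisk character followed by a punctuation character, and "c*" satisfies that. A possibly easier-to-read alternative to the lookahead is to test the two conditions separately and combine them. This is a sketch, and the names has_punct, has_star and kept are made up for illustration:

```r
vec <- c("a,", "abc", "ef", "abc-", "abc|", "abc*01")

# TRUE for elements containing at least one punctuation character.
has_punct <- grepl("[[:punct:]]", vec)

# TRUE for elements containing a literal asterisk; fixed = TRUE avoids escaping.
has_star <- grepl("*", vec, fixed = TRUE)

# Keep elements with punctuation but without any asterisk.
kept <- vec[has_punct & !has_star]
print(kept)
```

Unlike the single character-class version, this also excludes strings such as "abc*01|", because the asterisk test looks at the whole string.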

How do I remove suffix from a list of Ensembl IDs in R [duplicate]

I have a large list which contains expressed genes from many cell lines. Ensembl genes often come with version suffixes, but I need to remove them. I've found several references that describe this here or here, but they do not work for me, likely because of my data structure (I think it's a nested array within a list?). Can someone help me with the particulars of the code and with my understanding of my own data structures?
Here's some example data
>listOfGenes_version <- list("cellLine1" = c("ENSG001.1", "ENSG002.1", "ENSG003.1"), "cellLine2" = c("ENSG003.1", "ENSG004.1"))
>listOfGenes_version
$cellLine1
[1] "ENSG001.1" "ENSG002.1" "ENSG003.1"
$cellLine2
[1] "ENSG003.1" "ENSG004.1"
And what I would like to see is
>listOfGenes_trimmed
$cellLine1
[1] "ENSG001" "ENSG002" "ENSG003"
$cellLine2
[1] "ENSG003" "ENSG004"
Here are some things I tried that did not work:
>listOfGenes_trimmed <- str_replace(listOfGenes_version, pattern = ".[0-9]+$", replacement = "")
Warning message:
In stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
argument is not an atomic vector; coercing
>listOfGenes_trimmed <- lapply(listOfGenes_version, gsub('\\..*', '', listOfGenes_version))
Error in match.fun(FUN) :
'gsub("\\..*", "", listOfGenes_version)' is not a function, character or symbol
Thanks so much!
An option would be to specify the pattern as . (a metacharacter, so escape it) followed by one or more digits (\\d+) at the end ($) of the string, and replace it with a blank (""):
lapply(listOfGenes_version, sub, pattern = "\\.\\d+$", replacement = "")
#$cellLine1
#[1] "ENSG001" "ENSG002" "ENSG003"
#$cellLine2
#[1] "ENSG003" "ENSG004"
The . is a metacharacter that matches any character, so we need to escape it to match a literal period, because patterns are interpreted as regular expressions by default.
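The lapply() attempt in the question failed because its second argument was the result of calling gsub(), not a function. An equivalent way to write the working call, sketched here with an anonymous function that some readers may find more explicit, is:

```r
listOfGenes_version <- list(
  cellLine1 = c("ENSG001.1", "ENSG002.1", "ENSG003.1"),
  cellLine2 = c("ENSG003.1", "ENSG004.1")
)

# lapply() calls the function once per list element; each element is an
# ordinary character vector, which sub() handles without coercion.
listOfGenes_trimmed <- lapply(
  listOfGenes_version,
  function(ids) sub("\\.\\d+$", "", ids)
)
print(listOfGenes_trimmed)
```

The list names (cellLine1, cellLine2) are preserved, since lapply() keeps the names of its input.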

Replace square bracket with bracket using gsub [duplicate]

I want to change "[" to "(" in a data.frame (class is string), but I get the following error:
Error in gsub("[", "(", df) :
invalid regular expression '[', reason 'Missing ']''
Doing the reverse works perfectly:
df <- gsub("]",")", df)
All "]" got replaced in the data.frame df.
So in essence this is the problem:
df <- gsub("[","(", df)
Error in gsub("[", "(", df) :
invalid regular expression '[', reason 'Missing ']''
Can anyone help fix the code?
Or is there an alternative function to gsub which can accomplish the same?
The [ is a metacharacter, so we need either fixed = TRUE or an escaped \\[:
gsub("[", "(", df, fixed = TRUE)
We can also use the hexadecimal representation of the ASCII character [ by prefixing it with \\x:
gsub('\\x5B', '(', '[')
# [1] "("
Just a preference, but I find this more readable in cases where the metacharacters [ and ] are mixed with their literal/escaped versions. For example, I find this:
gsub('[\\x5B\\x5D]+', '(', ']][[[', perl = TRUE)
more readable than these:
gsub('[\\]\\[]+', '(', ']][[[', perl = TRUE)
[1] "("
gsub('[][]+', '(', ']][[[', perl = TRUE)
[1] "("
gsub('[\\[\\]]+', '(', ']][[[', perl = TRUE)
[1] "("
especially when you have a long and complicated pattern.
Here is the ASCII table I used from http://www.asciitable.com/
The obvious disadvantage is that you have to look up the hex code in the table.
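The hex code can also be computed in R rather than looked up, and for a one-for-one character swap chartr() avoids regex syntax entirely. Both are sketched below as additions to the answers above. The input vector x and the variable names are made up for illustration.

```r
# Compute the hex codes of "[" and "]" instead of reading them from a table.
hex_open <- sprintf("%X", utf8ToInt("["))
hex_close <- sprintf("%X", utf8ToInt("]"))
print(c(hex_open, hex_close))

# chartr() translates characters one-for-one and takes plain strings,
# so neither bracket needs escaping.
x <- c("[a]", "b[1][2]")
swapped <- chartr("[]", "()", x)
print(swapped)
```

Here "[" maps to "(" and "]" maps to ")" in a single call, so the two gsub() steps from the question collapse into one.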
