R: Regex for Phone Numbers [duplicate]

This question already has an answer here:
How to use regex character class extensions in R?
(1 answer)
Closed 3 months ago.
I am working with the R programming language.
I have a column of data that looks something like this:
string = c("a1 123-456-7899 hh", "b 124-123-9999 b3")
I would like to remove the "phone numbers" so that the final result looks like this:
[1] "a1 hh" "b b3"
I tried to apply the answer from "Regular expression to match standard 10 digit phone number" to my question:
gsub("^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$", "", string, fixed = TRUE)
But I get the following error: Error: '\+' is an unrecognized escape in character string starting ""^(\+"
Can someone please show me how to fix this?
Thanks!

Try:
library(stringr)
s <- c("a1 123-456-7899 hh", "b 124-123-9999 b3")
result <- str_replace(s, "\\d+[-]\\d+[-]\\d+\\s", "")
print(result)
OUTPUT:
[1] "a1 hh" "b b3"
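This pattern looks for:
\\d+ : one or more digits, followed by
[-] : a hyphen, followed by
\\d+ : one or more digits, followed by
[-] : a hyphen, followed by
\\d+ : one or more digits, followed by
\\s : a whitespace character
and replaces the match with "" (an empty string). Note that str_replace() replaces only the first match in each string; use str_replace_all() if a string can contain several numbers.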
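The error in the question has two separate causes: in an R string literal every regex backslash must be doubled, and fixed = TRUE makes gsub() treat the pattern as literal text rather than as a regex. Below is a minimal sketch of the adapted pattern in base R, assuming the numbers always follow the 3-3-4 digit layout shown in the question. The names pattern and result are illustrative, and perl = TRUE is used so that \\s is understood inside the bracket expression.

```r
string <- c("a1 123-456-7899 hh", "b 124-123-9999 b3")

# Backslashes are doubled for the R parser. The ^ and $ anchors from the
# original pattern are dropped because the number sits inside a longer string.
# The trailing \\s? also consumes one whitespace character after the number.
pattern <- "\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{4}\\s?"

result <- gsub(pattern, "", string, perl = TRUE)
print(result)
```

Unlike str_replace(), gsub() replaces every match in each string, which matters if one entry can hold more than one number.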

Related

Extract substring until "?" with sub() [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed last year.
So, I want to extract the substring of a string like this
mystr <- "aa/bb/cc?rest"
I found the sub() function but executing sub("?.*", "", mystr) returns "" instead of "aa/bb/cc".
Why?
The reason is obviously that ? is a special character, but using backticks or "\?" doesn't solve the problem.
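You need a double backslash to escape it. The R parser consumes one backslash when it reads the string literal, so "\\?" reaches the regex engine as \?: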
> mystr <- "aa/bb/cc?rest"
> sub("\?.*", "", mystr)
Error: '\?' is an unrecognized escape in character string starting ""\?"
> sub("\\?.*", "", mystr)
[1] "aa/bb/cc"
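Two alternatives avoid the double backslash entirely. This is a small sketch using only base R; the names via_class and via_split are illustrative.

```r
mystr <- "aa/bb/cc?rest"

# Inside a bracket expression, ? is an ordinary character.
via_class <- sub("[?].*", "", mystr)

# fixed = TRUE treats the split string as literal text, not as a regex.
via_split <- strsplit(mystr, "?", fixed = TRUE)[[1]][1]

print(via_class)
print(via_split)
```

Both calls give the part of the string before the first question mark.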

Match all elements with punctuation mark except asterisk in r [duplicate]

This question already has answers here:
in R, use gsub to remove all punctuation except period
(4 answers)
Closed 2 years ago.
I have a vector vec which has elements with a punctuation mark in it. I want to return all elements with punctuation mark except the one with asterisk.
vec <- c("a,","abc","ef","abc-","abc|","abc*01")
> vec[grepl("[^*][[:punct:]]", vec)]
[1] "a," "abc-" "abc|" "abc*01"
Why does it return "abc*01" if there is a negation [^*] for it?
Maybe you can try grep like below
grep("\\*",grep("[[:punct:]]",vec,value = TRUE), value = TRUE,invert = TRUE) # nested `grep`s for double filtering
or
grep("[^\\*[:^punct:]]",vec,perl = TRUE, value = TRUE) # but this fails for a case like `abc*01|` (thanks to @Tim Biegeleisen for the feedback)
which gives
[1] "a," "abc-" "abc|"
You could use grepl here:
vec <- c("a,","abc-","abc|","abc*01")
vec[grepl("^(?!.*\\*).*[[:punct:]].*$", vec, perl=TRUE)]
[1] "a," "abc-" "abc|"
The regex pattern ^(?!.*\\*).*[[:punct:]].*$ matches only strings that contain no asterisk and at least one punctuation character:
^ from the start of the string
(?!.*\*) assert that no * occurs anywhere in the string
.* match any content
[[:punct:]] match any single punctuation character (it cannot be *, because the lookahead has already ruled that out)
.* match any content
$ end of the string
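The original pattern [^*][[:punct:]] only requires some non-asterisk character followed by a punctuation character, so the c* in "abc*01" satisfies it. If a lookahead feels heavy, the two conditions can also be tested separately and combined. Below is a minimal base R sketch; the names has_punct and has_star are illustrative, and "abc*01|" is added to cover the failing case mentioned in the earlier answer.

```r
vec <- c("a,", "abc", "ef", "abc-", "abc|", "abc*01", "abc*01|")

has_punct <- grepl("[[:punct:]]", vec)      # at least one punctuation mark
has_star  <- grepl("*", vec, fixed = TRUE)  # literal asterisk, no escaping needed

result <- vec[has_punct & !has_star]
print(result)
```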

How do I remove suffix from a list of Ensembl IDs in R [duplicate]

This question already has answers here:
Remove part of string after "."
(6 answers)
Closed 3 years ago.
I have a large list which contains expressed genes from many cell lines. Ensembl genes often come with version suffixes, but I need to remove them. I've found several references that describe this here or here, but they will not work for me, likely because of my data structure (I think it's a nested array within a list?). Can someone help me with the particulars of the code and with my understanding of my own data structures?
Here's some example data
>listOfGenes_version <- list("cellLine1" = c("ENSG001.1", "ENSG002.1", "ENSG003.1"), "cellLine2" = c("ENSG003.1", "ENSG004.1"))
>listOfGenes_version
$cellLine1
[1] "ENSG001.1" "ENSG002.1" "ENSG003.1"
$cellLine2
[1] "ENSG003.1" "ENSG004.1"
And what I would like to see is
>listOfGenes_trimmed
$cellLine1
[1] "ENSG001" "ENSG002" "ENSG003"
$cellLine2
[1] "ENSG003" "ENSG004"
Here are some things I tried that did not work:
>listOfGenes_trimmed <- str_replace(listOfGenes_version, pattern = ".[0-9]+$", replacement = "")
Warning message:
In stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
argument is not an atomic vector; coercing
>listOfGenes_trimmed <- lapply(listOfGenes_version, gsub('\\..*', '', listOfGenes_version))
Error in match.fun(FUN) :
'gsub("\\..*", "", listOfGenes_version)' is not a function, character or symbol
Thanks so much!
One option is to specify the pattern as a literal . (a metacharacter, so it must be escaped) followed by one or more digits (\\d+) at the end ($) of the string, and to replace the match with an empty string (""):
lapply(listOfGenes_version, sub, pattern = "\\.\\d+$", replacement = "")
#$cellLine1
#[1] "ENSG001" "ENSG002" "ENSG003"
#$cellLine2
#[1] "ENSG003" "ENSG004"
The . is a metacharacter that matches any character, so it must be escaped to match a literal dot, because sub() treats the pattern as a regex by default.
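The lapply() attempt in the question failed because its second argument was a call to gsub() rather than a function. An equivalent way to write the fix is sketched here with an anonymous function; the argument name ids is illustrative.

```r
listOfGenes_version <- list(
  cellLine1 = c("ENSG001.1", "ENSG002.1", "ENSG003.1"),
  cellLine2 = c("ENSG003.1", "ENSG004.1")
)

# lapply() calls the function once per list element; sub() is vectorised
# over the character vector it receives.
listOfGenes_trimmed <- lapply(
  listOfGenes_version,
  function(ids) sub("\\.\\d+$", "", ids)
)
print(listOfGenes_trimmed)
```

The list names are preserved, so each cell line keeps its own trimmed vector.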

">" is not matched by "[[:punct:]]" when using `stringr::str_replace_all`? [duplicate]

This question already has answers here:
R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?
(2 answers)
Closed 4 years ago.
I find this really odd:
pattern <- "[[:punct:][:digit:][:space:]]+"
string <- "a . , > 1 b"
gsub(pattern, " ", string)
# [1] "a b"
library(stringr)
str_replace_all(string, pattern, " ")
# [1] "a > b"
str_replace_all(string, "[[:punct:][:digit:][:space:]>]+", " ")
# [1] "a b"
Is this expected?
Still working on this, but ?"stringi-search-charclass" says:
Beware of using POSIX character classes, e.g. ‘[:punct:]’. ICU
User Guide (see below) states that in general they are not
well-defined, so may end up with something different than you
expect.
In particular, in POSIX-like regex engines, ‘[:punct:]’ stands for
the character class corresponding to the ‘ispunct()’
classification function (check out ‘man 3 ispunct’ on UNIX-like
systems). According to ISO/IEC 9899:1990 (ISO C90), the
‘ispunct()’ function tests for any printing character except for
space or a character for which ‘isalnum()’ is true. However, in a
POSIX setting, the details of what characters belong into which
class depend on the current locale. So the ‘[:punct:]’ class does
not lead to portable code (again, in POSIX-like regex engines).
So a POSIX flavor of ‘[:punct:]’ is more like ‘[\p{P}\p{S}]’ in
‘ICU’. You have been warned.
Copying from the issue posted above,
string <- "a . , > 1 b"
mypunct <- "[[\\p{P}][\\p{S}]]"
stringr::str_remove_all(string, mypunct)
I can appreciate stuff being locale-specific, but it still surprises me that [:punct:] doesn't even work in a C locale ...
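Following the help page's hint, the gsub() result from the question can be reproduced in stringr by spelling out the Unicode categories instead of relying on [:punct:]. This is a sketch that assumes stringr is installed; the name icu_pattern is illustrative.

```r
library(stringr)

string <- "a . , > 1 b"

# \p{P} (punctuation) plus \p{S} (symbols, which include ">") stands in
# for the POSIX-style [:punct:] class.
icu_pattern <- "[\\p{P}\\p{S}[:digit:][:space:]]+"

result <- str_replace_all(string, icu_pattern, " ")
print(result)
```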

Problems in a regular expression to extract names using stringr

I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
I would like to create a regular expression to extract only the name of the "Town", even if the town has a long name as the one written in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns <- unlist(str_extract_all(text, reg.town))
but I extract everything around the ",".
Thanks in advance,
It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1
I have solved the problem thanks to @Dmitry Egorov's demo posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!
You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (matched possessively, due to the ++ quantifier, so there is no backtracking into this pattern, which makes matching more efficient)
(?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
Note you may get the same results with base R using the same regex with a PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"
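Note that the lookahead-based pattern also returns "Senator", because "Senator, 1.4balbal" has a digit right after the comma and space. If the town entries are always whole strings of the form "Name, number" (an assumption about the data, not something stated in the question), a stricter base R sketch is to filter first and then strip the suffix. The names is_town and towns_strict are illustrative.

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24",
          "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# Keep only strings that are exactly "<Capitalised name>, <digits>".
is_town <- grepl("^[[:upper:]][^,]*, \\d+$", text)

# Drop the trailing ", <digits>" from the strings that passed the filter.
towns_strict <- sub(", \\d+$", "", text[is_town])
print(towns_strict)
```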
