Extract all phone numbers in all formats from string R - r

I'm trying to extract phone numbers in all formats (international and otherwise) in R.
Example data:
phonenum_txt <- "sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj"
I'd like:
extract_vector
[1] "+49 123 999"
[2] "0001 123.456"
[3] "+31 (0) 8123"
[4] "(999)9999999"
[5] "(999)999-9999"
[6] "9999999999"
[7] "9999999999999"
I tried using:
extract_vector <- str_extract_all(phonenum_txt,"^(?:\\+\\d{1,3}|0\\d{1,3}|00\\d{1,2})?(?:\\s?\\(\\d+\\))?(?:[-\\/\\s.]|\\d)+$")
which I got from HERE, but my regex skills aren't good enough to convert it to make it work in R.
Thanks!

While your data does not seem to be realistic, this expression might help you to design a desired expression to match your string.
(?=.+[0-9]{2,})([0-9+\.\-\(\)\s]+)
I have added an extra boundary, which is usually good to add when inputs are complex.
You might add or remove boundaries, if you wish. For instance, this expression might work as well:
([0-9+\.\-\(\)\s]+)
Or you can add additional left and right boundaries to it, for instance if all phone numbers are wrapped with lower/uppercase letters:
[a-z]([0-9+\.\-\(\)\s]+)[a-z]
Your desired output is in the first capturing group, which you can reference as $1 (written \\1 in an R replacement string).
Regular expression design works best when real data is available.

You can use this regex to match and extract all the phone numbers you have in your string.
(?: *[-+().]? *\d){6,14}
The idea behind this regex is to allow at most one character from the set [-+().] (these characters can appear within your phone numbers) before each digit. If your phone numbers can contain other characters, such as { } or [ ], add them to this character set. The optional character may be surrounded by optional spaces, hence the " *" before and after the character set, and \d then matches the digit itself. The whole group is quantified with {6,14}, so it must repeat at least 6 and at most 14 times; in other words, a match contains 6 to 14 digits. You can adjust these numbers to your needs: 6 is the minimum in your sample data (although I think real numbers have at least 7 or 8 digits, as in Singapore, but that's up to you).
Regex Demo
R Code Demo
library(stringr)
str_match_all("sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj", "(?: *[-+().]? *\\d){6,14}")
Prints all your required numbers,
[[1]]
[,1]
[1,] "+49 123 999"
[2,] "0001 123.456"
[3,] "+31 (0) 8123"
[4,] "(999)9999999"
[5,] "(999)999-9999"
[6,] "9999999999"
[7,] "9999999999999"
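If a plain character vector is preferred over the matrix that str_match_all returns, the same pattern can also be applied with base R. This is a minimal sketch and not part of the original answer: the names pattern and extract_vector are illustrative, and perl = TRUE is assumed so that the pattern is handled by PCRE.

```r
# Sample text from the question
phonenum_txt <- "sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj"

# Same pattern as above: 6 to 14 digits, each optionally preceded by
# spaces and at most one of - + ( ) .
pattern <- "(?: *[-+().]? *\\d){6,14}"

# gregexpr() locates every match; regmatches() extracts the matched text.
# [[1]] unwraps the result for the single input string.
extract_vector <- regmatches(phonenum_txt,
                             gregexpr(pattern, phonenum_txt, perl = TRUE))[[1]]
print(extract_vector)
```

The result is a character vector with the same seven numbers shown in the matrix above.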

Related

Remove specific string or blank member from character vector

I am scraping https://www.transparency.org/news/pressreleases/year/2010 to retrieve header and details from each page. But along with header and details a telephone number and a blank string is coming in the retrieved list for every page.
[1] "See our simple, animated definitions of types of corruption and the ways to challenge it."
[2] "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
[3] " "
[4] "+49 30 3438 20 666"
I have tried the following code, but it didn't work.
html %>% str_remove('+49 30 3438 20 666') %>% str_remove(' ')
How can these elements be removed?
Is it because you failed to escape the + sign?
From this cheatsheet,
Metacharacters (. * + etc.) can be used as
literal characters by escaping them. Characters
can be escaped using \ or by enclosing them
in \Q...\E.
s = "+49 30 3438 20 666"
str_remove(s, "\\+49 30 3438 20 666")
# ""
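As a sketch of the alternatives to backslash escaping, the pattern can also be treated as a literal string. The example below uses base R only; the variable names are illustrative. With stringr, wrapping the pattern in fixed() serves the same purpose.

```r
s <- "+49 30 3438 20 666"

# fixed = TRUE disables regex interpretation, so "+" is an ordinary character
literal_removed <- sub("+49 30 3438 20 666", "", s, fixed = TRUE)

# \Q...\E quotes everything in between; perl = TRUE is assumed here so that
# the PCRE engine handles the quoting
quoted_removed <- sub("\\Q+49 30 3438 20 666\\E", "", s, perl = TRUE)

print(c(literal_removed, quoted_removed))
```

Both calls leave an empty string, the same result as the escaped version.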
In case you want to drop all lines that start with a + and end with a number:
dd <- c(
"See our simple, animated definitions of types of corruption and the ways to challenge it."
, "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
," "
, "+49 30 3438 20 666")
dd_clean <- dd[!grepl("^\\+.*\\d$", dd)]
You can also use \\s (a single whitespace character) and \\d{2} (exactly two digits) to match the exact format, to be on the safe side, if all numbers have the same format. Note that you can also use the pattern in str_remove, with the end result being an empty string. grepl instead returns a logical vector that you can use to subset your vector.
If you also want to delete all empty lines:
dd[!grepl("^\\s*$", dd)]
Note that you can do both at the same time by using "|":
dd[!grepl("^\\+.*\\d$|^\\s*$", dd)]
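A stricter filter along the lines suggested above might look like the following sketch. The input vector lines_in and the exact phone format (a plus sign, a two-digit country code, then space-separated digit groups) are assumptions made for illustration; adjust the pattern if the scraped numbers vary.

```r
# Hypothetical scraped lines: text, blanks, a phone number, and a line that
# merely starts with "+"
lines_in <- c("Some header text.",
              " ",
              "",
              "+49 30 3438 20 666",
              "+1 plus sign but not a phone number")

# "+", two digits, then one or more groups of whitespace followed by digits
phone_like <- grepl("^\\+\\d{2}(\\s\\d+)+$", lines_in, perl = TRUE)

# Empty or whitespace-only lines
blank <- grepl("^\\s*$", lines_in, perl = TRUE)

# Keep everything that is neither a phone number nor blank
cleaned <- lines_in[!(phone_like | blank)]
print(cleaned)
```

Unlike a pattern that only checks the leading plus sign, this keeps ordinary text lines that happen to start with "+".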
You can get familiar with regex here: https://regex101.com/

R Regex for matching comma separated sections in a column/vector

The original title for this question was "R Regex for word boundary excluding space". It reflected the way I was approaching the problem. However, this is a better solution to my particular problem. It should work as long as a particular delimiter is used to separate items within a 'cell'.
This must be very simple, but I've hit a brick wall on it.
I have a dataframe column where each cell(row) is a comma separated list of items. I want to find the rows that have a specific item.
df<-data.frame( nms= c("XXXCAP,XXX CAPITAL LIMITED" , "XXX,XXX POLYMERS LIMITED, 3455" , "YYY,XXX REP LIMITED,999,XXX" ),
b = c('A', 'X', "T"))
nms b
1 XXXCAP,XXX CAPITAL LIMITED A
2 XXX,XXX POLYMERS LIMITED, 3455 X
3 YYY,XXX REP LIMITED,999,XXX T
I want to search for rows that have item XXX. Rows 2 and 3 should match. Row 1 has the string XXX as part of a larger string and obviously should not match.
However, because XXX in row 1 is separated by spaces in each side, I am having trouble filtering it out with \\b or [[:<:]]
grep("\\bXXX\\b",df$nms, value = F) #matches 1,2,3
The easiest way to do this is of course strsplit(), but I'd like to avoid it. Any suggestions on performance are welcome.
When \b does not "work", the problem usually lies in the definition of the "whole word".
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
It seems you want to match a word only between commas or at the start/end of the string.
You may use a PCRE regex (note the perl=TRUE argument) like
(?<![^,])XXX(?![^,])
See the regex demo (the expression is "converted" to use positive lookarounds due to the fact it is a demo with a single multiline string).
Details
(?<![^,]) (equal to (?<=^|,)) - either start of the string or a comma
XXX - an XXX word
(?![^,]) (equal to (?=$|,)) - either end of the string or a comma
R demo:
> grep("(?<![^,])XXX(?![^,])",df$nms, value = FALSE, perl=TRUE)
## => [1] 2 3
The equivalent TRE regex will look like
> grep("(?:^|,)XXX(?:$|,)",df$nms, value = FALSE)
Note that here, non-capturing groups are used to match either start of string or , (see (?:^|,)) and either end of string or , (see (?:$|,)). Unlike the lookarounds, these groups consume the commas, which does not matter for grep.
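To make the difference concrete, the sketch below compares the plain word-boundary pattern with the comma-aware lookarounds, and cross-checks the result against a strsplit()-based test. It uses a plain character vector nms instead of the data frame column, and the variable names are illustrative.

```r
nms <- c("XXXCAP,XXX CAPITAL LIMITED",
         "XXX,XXX POLYMERS LIMITED, 3455",
         "YYY,XXX REP LIMITED,999,XXX")

# \b treats the space after "XXX" as a boundary, so row 1 matches as well
with_word_boundary <- grep("\\bXXX\\b", nms, perl = TRUE)

# Lookarounds: XXX must not be preceded or followed by a non-comma character
with_lookarounds <- grep("(?<![^,])XXX(?![^,])", nms, perl = TRUE)

# Reference result: split on commas and look for an exact item
by_split <- which(vapply(strsplit(nms, ",", fixed = TRUE),
                         function(items) "XXX" %in% items,
                         logical(1)))

print(with_word_boundary)
print(with_lookarounds)
print(by_split)
```

The lookaround pattern and the strsplit() check agree on rows 2 and 3, while the word-boundary pattern also picks up row 1.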
This is perhaps a somewhat simplistic solution, but it works for the examples which you've provided:
library(stringr)
df$nms %>%
str_replace_all('\\s', '') %>% # Removes all spaces, tabs, newlines, etc
str_detect('(^|,)XXX(,|$)') # Detects string XXX surrounded by comma or beginning/end
[1] FALSE TRUE TRUE
Also, have a look at this cheatsheet made by RStudio on Regular Expressions - it is very nicely made and very useful (I keep going back to it when I'm in doubt).

Retain string till character limit with last complete word, and store remaining words in 2nd variable

Take these example strings. I want to split them so that each is limited to X characters or fewer and ends with a complete word, with the remaining part stored in another column. The words are always separated by a space. I came across this partial solution in TSQL (it doesn't create a variable for the extra words), but I need to do it in R. A previous question gave me the first half of the solution below, which does not put the remaining words in a new variable. I need help creating that new variable.
{gsub(patt="(^.{2,100})([ ].+)", repl="\\1",y)}
For example:
XOVEW VJIEW NI **stays** XOVEW VJIEW NI (assuming X is 14)
XOVEW VJIEW NIGOI **becomes** XOVEW VJIEW (NIGOI goes to a new vector)
XOVEW VJIEWNIGOI **becomes** XOVEW (assuming X is 14)
So new variable will contain c("NIGOI","VJIEWNIGOI") coming from 2nd and 3rd row above.
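vect <- c("XOVEW VJIEW NI", "XOVEW VJIEW NIGOI", "XOVEW VJIEWNIGOI")  # example strings from the question
# For strings longer than 14 characters, insert a dash before the last word
v1 <- ifelse(nchar(vect) > 14, gsub("(.*)\\s+(\\w+)", "\\1 - \\2", vect), vect)
values <- data.frame(do.call('rbind', lapply(strsplit(v1, split = "-"), `length<-`, 2)))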
Output:
[,1] [,2]
[1,] "XOVEW VJIEW NI" NA
[2,] "XOVEW VJIEW " " NIGOI"
[3,] "XOVEW " " VJIEWNIGOI"
I first check whether each string is longer than 14 characters (see ?nchar if you want to understand it).
Wherever it is longer than 14, I build a string separated by a dash, just to segregate the two parts: the first part is everything except the last word, and the second part is the last word of the string.
To match these I used a regex: a dot represents any character and a star means zero or more repetitions (together, any run of characters), \\s+ matches one or more whitespace characters, and \\w+ matches one or more word characters, i.e. a single word. Because .* is greedy, the match separates the last word from the rest of the string; inside ifelse this is applied only where the string is longer than 14 characters. The two parts are captured as \\1 and \\2 and written back with a dash between them, where \\1 is everything before the last word and \\2 is the last word.
Finally, do.call is used with rbind (to bind all the rows) and lapply with `length<-` (to pad every element to two columns).
I hope this answers your question.
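The approach above moves only the last word, which is enough for the three example strings. When more than one word has to move to stay within the limit, a greedy bounded match may be closer to what the question asks. The following is a sketch under that assumption, not the answerer's method; limit, head and rest are illustrative names. A single word longer than the limit is left untouched, because the pattern cannot match it.

```r
vect <- c("XOVEW VJIEW NI", "XOVEW VJIEW NIGOI", "XOVEW VJIEWNIGOI")
limit <- 14

# Take as many characters as possible (up to the limit) such that the kept
# part ends either at the end of the string or just before whitespace
pattern <- sprintf("^(.{1,%d})(\\s.*)?$", limit)
head <- sub(pattern, "\\1", vect, perl = TRUE)

# Whatever follows the kept part is the remainder; empty remainders become NA
rest <- trimws(substring(vect, nchar(head) + 1))
rest[rest == ""] <- NA

result <- data.frame(head = head, rest = rest, stringsAsFactors = FALSE)
print(result)
```

For the example strings this keeps "XOVEW VJIEW NI", "XOVEW VJIEW" and "XOVEW", and stores "NIGOI" and "VJIEWNIGOI" as the remainders.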

Finding Abbreviations in Data with R

In my data (which is text), there are abbreviations.
Is there any functions or code that search for abbreviations in text? For example, detecting 3-4-5 capital letter abbreviations and letting me count how often they happen.
Much appreciated!
detecting 3-4-5 capital letter abbreviations
You may use
\b[A-Z]{3,5}\b
See the regex demo
Details:
\b - a word boundary
[A-Z]{3,5} - 3, 4 or 5 capital letters (use [[:upper:]] to match letters other than ASCII, too)
\b - a word boundary.
R demo online (leveraging the regex occurrence count code from @TheComeOnMan)
abbrev_regex <- "\\b[A-Z]{3,5}\\b"
x <- "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH."
sum(gregexpr(abbrev_regex,x)[[1]] > 0)
## => [1] 3
regmatches(x, gregexpr(abbrev_regex, x))[[1]]
## => [1] "XYZ" "WXYZ" "VWXYZ"
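To count how often each abbreviation occurs, the extracted matches can be passed to table(). This is a minimal sketch with a made-up sample sentence; perl = TRUE is an assumption chosen here so that \b is handled by PCRE.

```r
abbrev_regex <- "\\b[A-Z]{3,5}\\b"

# Hypothetical text in which some abbreviations repeat
x <- "NASA and ESA met; NASA later briefed the WHO and ESA."

# Extract every abbreviation, then tabulate the occurrences
found <- regmatches(x, gregexpr(abbrev_regex, x, perl = TRUE))[[1]]
counts <- table(found)
print(counts)
```

Here counts[["NASA"]] and counts[["ESA"]] are both 2, and counts[["WHO"]] is 1.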
You can use the regular expression [A-Z] to match any occurrence of a capital letter. If you want this pattern to be repeated 3 times, add a quantifier: [A-Z]{3}. A range quantifier such as [A-Z]{3,5} covers 3 to 5 repetitions in one pattern, so no loop is needed.

R Remove specific character with range of possible positions within string

I would like to remove the character 'V' (always the last one in the strings) from the following vector containing a large number of strings. They look similar to the following example:
str <- c("VDM 000 V2.1.1",
"ABVC 001 V10.15.0",
"ASDV 123 V1.20.0")
I know that it is always the last 'V', I would like to remove.
I also know that this character is either the sixth, seventh or eighth last character within these strings.
I was not really able to come up with a nice solution. I know that I have to use sub or gsub but I can only remove all V's rather than only the last one.
Has anyone got an idea?
Thank you!
This regex pattern is written to match a "V" that is then followed by 5 to 7 other non-"V" characters. The "[...]" construct is a "character class", and within such constructs a leading "^" causes negation. The "{...}" construct takes two numbers specifying minimum and maximum repetition counts, and the "$" matches the zero-length end of string, which I think is what you meant by "sixth, seventh or eighth last character":
sub("(V)([^V]{5,7})$", "\\2", str)
[1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
Since you only wanted a single substitution I used sub instead of gsub.
You can use:
gsub("V(\\d+\\.\\d+\\.\\d+)$", "\\1", str)
##[1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
The regex V(\\d+\\.\\d+\\.\\d+)$ matches the "version" consisting of the character "V" followed by three sets of digits (i.e., \\d+) separated by two literal dots (i.e., \\.; an unescaped . would match any character) at the end of the string (i.e., $). The parentheses around \\d+\\.\\d+\\.\\d+ form a group within the match that can be referenced by \\1. Therefore, gsub will replace the whole match with the group, thereby removing that "V".
Since you know it's the last V you want to remove from the string, try this regex V(?=[^V]*$):
gsub("V(?=[^V]*$)", "", str, perl = TRUE)
# [1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
The regex matches a V that is followed by [^V]*$, i.e. only non-V characters up to the end of the string, which guarantees that the matched V is the last V in the string.
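As a quick check, the three approaches can be run side by side on the example data. This is a sketch: the vector is named versions here (an illustrative name, chosen to avoid masking the base function str), and perl = TRUE is used for all three calls for uniformity.

```r
versions <- c("VDM 000 V2.1.1",
              "ABVC 001 V10.15.0",
              "ASDV 123 V1.20.0")

# 1. Positional: a V followed by 5 to 7 non-V characters at the end
by_position <- sub("V([^V]{5,7})$", "\\1", versions, perl = TRUE)

# 2. Structural: a V followed by a three-part version number at the end
by_version <- sub("V(\\d+\\.\\d+\\.\\d+)$", "\\1", versions, perl = TRUE)

# 3. Lookahead: a V with no further V between it and the end
by_lookahead <- sub("V(?=[^V]*$)", "", versions, perl = TRUE)

expected <- c("VDM 000 2.1.1", "ABVC 001 10.15.0", "ASDV 123 1.20.0")
stopifnot(identical(by_position, expected),
          identical(by_version, expected),
          identical(by_lookahead, expected))
print(by_lookahead)
```

All three agree on this data. They differ on other inputs: only the lookahead version removes the last V regardless of what follows it.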

Resources