Deriving a phone number from a string in R

I have some strings like the ones below.
I replaced every letter and special character with X:
xxxxxx18002514919xxxxxxxxxxxxxxxxxxxxxxxxxx24XXXXXX7
xxxxxx9000012345xxxxxxxxxxxxx34567xxxxxxxxxxxxx1800XXXXXX7
How can I extract only the 10-digit or 11-digit phone number from each of the strings above in R?
My desired output is:
For first string: 18002514919
For second string: 9000012345

You can use the stringr package to solve this. Its str_extract_all function extracts every match of a pattern, which gives the phone numbers you want.
The regex:
\\d --> matches a single digit,
{n,m} --> curly braces set how many times the preceding element may repeat: n is the minimum and m is the maximum number of repetitions. Since you want a phone number that is 10 or 11 digits long, n is 10 and m is 11. The quantifier is greedy, so it takes 11 digits when they are available.
X <- c("xxxxxx18002514919xxxxxxxxxxxxxxxxxxxxxxxxxx24XXXXXX7","xxxxxx9000012345xxxxxxxxxxxxx34567xxxxxxxxxxxxx1800XXXXXX7")
library(stringr)
str_extract_all(X,"\\d{10,11}")
Output:
> str_extract_all(X,"\\d{10,11}")
[[1]]
[1] "18002514919"
[[2]]
[1] "9000012345"
If you are sure that each element contains only one phone number, use str_extract, which returns just the first match per element as a character vector.
> str_extract(X,"\\d{10,11}")
[1] "18002514919" "9000012345"
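The {10,11} quantifier alone does not stop the pattern from matching the first digits of a longer digit run. Below is a minimal sketch of one way to guard against that in base R, assuming a phone number is never directly adjacent to other digits. The third test string and the names x, pat and phones are made up for illustration.

```r
# Sketch: only accept runs of exactly 10 or 11 digits.
# The lookarounds require that no digit sits directly before or after the match.
x <- c("xxxxxx18002514919xxxxxxxxxxxxxxxxxxxxxxxxxx24XXXXXX7",
       "xxxxxx9000012345xxxxxxxxxxxxx34567xxxxxxxxxxxxx1800XXXXXX7",
       "xxx123456789012345xxx")  # hypothetical 15-digit run, should not match

pat <- "(?<!\\d)\\d{10,11}(?!\\d)"

# perl = TRUE is needed for lookbehind/lookahead support
phones <- regmatches(x, gregexpr(pat, x, perl = TRUE))
print(phones)
```

The result is a list with one character vector per input string; an element with no valid phone number yields a zero-length vector.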

Related

Stringr function or gsub() to find an x digit string and extract first x digits?

Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)
If you are willing to use the stringr package, which the str_replace you are already using comes from, just use str_extract:
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
The regex ^\\d{1,3} matches between 1 and 3 digits at the very start of the string. str_extract, as the name implies, extracts and returns these matches. Note that this shortens every longer digit string, not only the 10-digit ones.
You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
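The question also asks for the first two digits of a 5-digit string. Here is a sketch of chaining a second sub call that uses the same anchoring idea, assuming the two rules are simply applied one after the other (step1 and step2 are illustrative names):

```r
new <- c("111", "1234567891", "12", "12345")

# Rule 1: a string of exactly 10 digits is reduced to its first three digits
step1 <- sub("^(\\d{3})\\d{7}$", "\\1", new)

# Rule 2: a string of exactly 5 digits is reduced to its first two digits
step2 <- sub("^(\\d{2})\\d{3}$", "\\1", step1)

print(step1)
print(step2)
```

Because both patterns are anchored with ^ and $, strings of any other length pass through each step unchanged.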
You can use (with my_vec holding the vector of strings):
as.numeric(substring(my_vec,1,3))
#[1] 111 123 12
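substring truncates every element, so "12345" would also be cut to "123". If only the 10-digit strings should be shortened, one possible sketch is to test the pattern first and truncate conditionally. The names is_ten and result are hypothetical, and the result stays a character vector rather than being converted to numbers.

```r
new <- c("111", "1234567891", "12", "12345")

# TRUE only for elements made up of exactly 10 digits
is_ten <- grepl("^\\d{10}$", new)

# Keep the first three characters of 10-digit strings, leave the rest untouched
result <- ifelse(is_ten, substr(new, 1, 3), new)
print(result)
```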

zero padding regex dependent on length of digits

I have a field which contains two characters, some digits and potentially a single letter. For example:
QU1Y
ZL002
FX16
TD8
BF007P
VV1395
HM18743
JK0001
I would like to consistently return all letters in their original position, but handle the digits as follows.
For 1 to 3 digits:
return the digits as they are if there are three, otherwise the digits left-padded with zeros
For 4 or more digits:
if the first digit is not a zero, return the first 4 digits; if the first digit is a zero, truncate to three digits
Expected output for the data above:
QU001Y
ZL002
FX016
TD008
BF007P
VV1395
HM1874
JK001
The implementation will be in R but I'm interested in a straight regex solution, I'll work out the R side of things. It may not be possible in straight regex which is why I can't get my head round it.
This identifies the correct ones, but I'm hoping to correct those which are not right.
"[A-Z]{2}[1-9]{0,1}[0-9]{1,3}[F,Y,P]{0,1}"
For the curious, they are flight numbers but entered by a human. Hence the variety...
You may use
> library(gsubfn)
> l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
> gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)
[1] "QU001Y" "ZL002" "FX016" "TD008" "BF007P" "VV1395" "HM1874" "JK001"
The pattern matches
^ - start of string
[A-Z]{2} - two uppercase letters
\\K - the text matched so far is removed from the match
0* - 0 or more zeros
(\\d{1,4}) - Capturing group 1: one to four digits
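\\d* - zero or more further digits, which are matched so that they are dropped from the result.
Group 1 is passed to the callback function, where sprintf("%03d",as.numeric(x)) left-pads the value with zeros to a width of at least three digits; four-digit values are left unchanged.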
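For readers who prefer to avoid the gsubfn dependency, the same idea can be sketched in base R by splitting each value into its parts and rebuilding it. This is an illustration, not the answer's method: pad_flight is a hypothetical helper, and it assumes every value has the shape "two uppercase letters, digits, optional uppercase letter". Values that do not fit that shape come back as NA.

```r
# Hypothetical helper: split into prefix / digits / suffix, then rebuild
pad_flight <- function(x) {
  # Group 1: two letters; leading zeros are skipped; group 2: remaining digits;
  # group 3: optional trailing letter
  parts <- regmatches(x, regexec("^([A-Z]{2})0*([0-9]+)([A-Z]?)$", x, perl = TRUE))
  vapply(parts, function(p) {
    if (length(p) == 0) return(NA_character_)   # value did not match the shape
    digits <- substr(p[3], 1, 4)                # keep at most four digits
    paste0(p[2], sprintf("%03d", as.integer(digits)), p[4])
  }, character(1), USE.NAMES = FALSE)
}

l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
padded <- pad_flight(l)
print(padded)
```

As in the gsubfn version, leading zeros are stripped before the digits are truncated and re-padded, so "JK0001" becomes "JK001".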

Replace number with a random number of same amount of digits

I have a string containing some numbers and want to replace every single digit with a single random digit.
E.g. "111" should be replaced with 3 random digits between 0 and 9 that are concatenated, like "364".
My idea was to match a number, get the number of digits, calculate as many random numbers and concatenate them to finally replace my matched number:
test <- "this is 1 example 123. I like the no.37"
gsub("([0-9])", paste0(sample(0:9, nchar("\\1")), collapse = ""), test)
My goal would be to have a string where every single digit is replaced by a random digit. E.g.
"this is 3 an example 628. I like the no.09"
I tried some approaches but can't find a good solution.
Use the gsubfn package; it makes things simpler:
library(gsubfn)
test <- "this is 1 example 123. I like the no.37"
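gsubfn("[0-9]+", ~ paste0(sample(0:9, nchar(x), replace = TRUE), collapse = ""), test)
[1] "this is 8 example 205. I like the no.37"
Here, gsubfn matches every run of one or more digits in the string (the [0-9]+ pattern). Each match is passed to the callback as x, where nchar(x) gives the number of digits in the matched substring and sample draws that many random digits. replace = TRUE lets a digit repeat and keeps the call valid for runs longer than 10 digits. The output is random, so it differs between runs.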
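The same effect can be sketched without any package by working character by character. This is an alternative illustration rather than the answer's approach, and randomize_digits is a hypothetical helper name.

```r
# Hypothetical helper: replace every digit of one string with a random digit
randomize_digits <- function(s) {
  chars <- strsplit(s, "")[[1]]                 # split into single characters
  is_digit <- chars %in% as.character(0:9)      # locate the digit positions
  # Draw one random digit per position; replace = TRUE allows repeats
  chars[is_digit] <- sample(0:9, sum(is_digit), replace = TRUE)
  paste(chars, collapse = "")
}

test <- "this is 1 example 123. I like the no.37"
out <- randomize_digits(test)
print(out)
```

Only the digit positions change, so the length of the string and all non-digit characters are preserved.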

Extracting a character that contains a certain type of element in R

For example, let's say I have the following string:
vec <- " #_Jim98 Did you turn off the stove #9am?"
I would like to count the words starting with # that contain only numbers, letters, # and underscores. In the case above, the count would be 1, since #9am? contains the ? symbol and so is not counted.
Also, such a word cannot be longer than 10 characters.
1) Search for # followed by any number of the allowed characters "\\w" followed by a whitespace character "\\s" or | end of string $. If zero word characters are allowable then change the + to *. The expression is vectorized, i.e. x can be a character vector. No packages are used.
x <- " #_Jim98 Did you turn off the stove #9am?" # test input
pat <- "#\\w+(\\s|$)"
lengths(regmatches(x, gregexpr(pat, x)))
## [1] 1
Note that the reason for regmatches is that gregexpr produces a -1 rather than a zero length vector for no matches whereas regmatches will produce a zero length vector. Thus it works for the edge case of no matches.
2) A slightly more compact solution would be this where pat is from above:
library(gsubfn)
lengths(strapplyc(x, pat))
## [1] 1
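Neither pattern above enforces the 10-character limit from the question. Below is a sketch of one way to add it, assuming the limit counts the # itself, so at most 9 word characters may follow it. The names pat_len and long_tag are made up for illustration.

```r
x <- " #_Jim98 Did you turn off the stove #9am?"

# '#' plus 1 to 9 word characters, with whitespace or a string boundary on both sides
pat_len <- "(?<!\\S)#\\w{1,9}(?!\\S)"

# perl = TRUE is needed for the lookarounds
n_tags <- lengths(regmatches(x, gregexpr(pat_len, x, perl = TRUE)))
print(n_tags)

# A tag with 11 word characters after '#' is too long and is not counted
long_tag <- "#abcdefghijk"
n_long <- lengths(regmatches(long_tag, gregexpr(pat_len, long_tag, perl = TRUE)))
print(n_long)
```

If the limit is meant to exclude the #, change the quantifier to {1,10}.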
We can do this with a regular expression. I'm interpreting that you are counting words separated by space characters or occurring at the beginning or end of the string. This assumes the # is at the start of the word, and I match a # followed by some number of word characters \\w (letters, digits and underscores, so the extra |_ in the pattern is redundant but harmless). You can remove the first (^|\\s) if you don't care about having # at the beginning of the word and would like to count 3 words in, for example, " #_Jim98 Did the Latin#s or tom#domain turn off the stove #9am?"
stringr::str_count(" #_Jim98 Did you turn off the stove #9am?", "(^|\\s)#(\\w|_)*?($|\\s)")
#> [1] 1
Created on 2018-04-12 by the reprex package (v0.2.0).

How can I return the unique number of digits in a character string in R?

I have a vector of strings with 24 digits each. Each digit represents an hour, and if the digit is "0" then the rate from period 0 applies and if the digit is 1 then the rate from period 1 applies.
As an example consider the two strings below. I would like to return the number of periods in each string. For example:
str1 <- "000000000000001122221100"
str2 <- "000000000000000000000000"
#str1: 3
#str2: 1
Any recommendations? I've been thinking about how to use str_count from stringr here. Also, I've searched other posts but most of them focus on counting letters in character strings, whereas this is a slight modification because the string contains digits and not letters.
Thanks!
Here is another option using charToRaw.
length(unique(charToRaw(str1)))
[1] 3
length(unique(charToRaw(str2)))
[1] 1
This is an ugly solution, but here goes:
length(unique(unlist(strsplit(str1,split = ""))))
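Since the question mentions a vector of such strings, the strsplit approach can be applied to all elements at once. This is a minimal sketch; the names rates and n_periods are illustrative.

```r
rates <- c("000000000000001122221100", "000000000000000000000000")

# strsplit returns one character vector per string; count the distinct digits in each
n_periods <- vapply(strsplit(rates, ""), function(ch) length(unique(ch)), integer(1))
print(n_periods)
```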
