Count number of digits including leading zeros in R - r

What is a way to count the number of digits of a numeric object in R including leading zeroes?
For example, I know nchar(x) will return the number of digits if x is numeric but what about instances in which x includes leading zeros?
Note: "Count the number of integer digits" does not address the issue of leading zeros.
Example:
x <- 7
nchar(x)
[1] 1 # that's fine
x <- 07
nchar(x)
[1] 1 # that is NOT fine: I want the value 2 to appear
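A minimal sketch of why this happens, assuming the values can be kept as character strings: R parses the literal 07 as the number 7, so the leading zero is already gone before nchar() sees it. The width of 2 passed to sprintf() is an assumption made only for illustration.

```r
# A numeric literal does not store leading zeros: 07 is simply 7
x_num <- 07
n_num <- nchar(x_num)        # counts the characters of "7"

# A character string keeps every digit exactly as typed
x_chr <- "07"
n_chr <- nchar(x_chr)

# If the intended width is known (assumed to be 2 here), pad the number back
padded <- sprintf("%02d", 7L)
n_padded <- nchar(padded)

print(c(numeric = n_num, character = n_chr, padded = n_padded))
```

In other words, leading zeros can only be counted if the value is read in and stored as text from the start.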

Related

Numeric data type with "0" digits after dot

Today I had a look at the pop dataset of the wpp2019 package and noticed that the population numbers are shown as numeric values with a "." after the last three digits (e.g. 10500 is 10.500).
library(wpp2019)
data("pop")
pop$`2020`
To remove the dots, I would usually simply turn the column into a character column and then use for example stringr::str_replace(), but as soon as I apply any function (except printing) to the population number columns, the dots disappear.
How can it be that this dataset shows e.g. 10.500 when printing the data.frame even though R usually removes the 0 digits after the dot for numeric values? And what would be the best way to remove the dots in the above example without losing the 0 digits?
Expected output
# instead of
pop$`2020`[153]
#[1] 164.1
# this value should return 164100 because printing the data frame
# shows 164.100
Population estimates in wpp2019 are given in thousands. So multiply by 1000 to get back to the estimated number of individuals:
> pop$`2020`[153]*1000
[1] 164100
Whether R prints the decimal part depends on the digits argument of print and on what else is in the vector being printed. For example:
> print(1234567.890)
[1] 1234568 # at most 7 significant digits are printed by default
> print(c(1234567.890,0.011))
[1] 1234567.890 0.011 # but when printed alongside 0.011, all the digits are shown.
This explains why your data frame always shows all the digits but you don't see all the digits when you extract individual numbers.
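The effect can be reproduced without the package. The sketch below uses two made-up values (not taken from wpp2019) and format(), which follows the same common-decimals rule as printing a vector or a data frame column.

```r
# Hypothetical values in thousands, chosen only for illustration
pop_thousands <- c(164.1, 10.525)

# Formatted together, every element gets the decimals needed by 10.525
together <- trimws(format(pop_thousands))

# Formatted alone, 164.1 needs only one decimal
alone <- format(pop_thousands[1])

# The stored value never had trailing zeros; scale it to get individuals
individuals <- pop_thousands[1] * 1000

print(together)
print(alone)
print(individuals)
```

The trailing zeros exist only in the formatted text, so there is nothing to "keep" in the number itself.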

zero padding regex dependent on length of digits

I have a field which contains two characters, some digits and potentially a single letter. For example
QU1Y
ZL002
FX16
TD8
BF007P
VV1395
HM18743
JK0001
I would like to consistently return all letters in their original position, but the digits as follows.
For 1 to 3 digits:
return all the digits, left-padded with zeros to three digits where needed
For 4 or more digits:
if the number does not begin with a zero, return the first 4 digits; if it does begin with a zero, truncate it to three digits
example from the data above
QU001Y
ZL002
FX016
TD008
BF007P
VV1395
HM1874
JK001
The implementation will be in R but I'm interested in a straight regex solution, I'll work out the R side of things. It may not be possible in straight regex which is why I can't get my head round it.
This identifies the correct ones, but I'm hoping to correct those which are not right.
"[A-Z]{2}[1-9]{0,1}[0-9]{1,3}[F,Y,P]{0,1}"
For the curious, they are flight numbers but entered by a human. Hence the variety...
You may use
> library(gsubfn)
> l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
> gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)
[1] "QU001Y" "ZL002" "FX016" "TD008" "BF007P" "VV1395" "HM1874" "JK001"
The pattern matches
^ - start of string
[A-Z]{2} - two uppercase letters
\\K - match reset operator: the text matched so far is dropped from the match, so the two letters are left untouched
0* - 0 or more zeros
(\\d{1,4}) - Capturing group 1: one to four digits
\\d* - 0 or more further digits, which are matched so that the replacement discards them.
Group 1 is passed to the callback function, where sprintf("%03d",as.numeric(x)) left-pads the value with zeros to at least three digits.
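For comparison, here is a sketch of the same idea in base R, without gsubfn. The helper name pad_flight is hypothetical, and unlike the pattern above this version assumes the whole string is two letters, digits, and at most one trailing letter; anything else is returned unchanged.

```r
# Hypothetical helper: split into prefix, digits, optional suffix, then re-pad
pad_flight <- function(x) {
  pattern <- "^([A-Z]{2})0*([0-9]{1,4})[0-9]*([A-Z]?)$"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(seq_along(x), function(i) {
    p <- m[[i]]
    if (length(p) == 0) return(x[i])  # leave non-matching input unchanged
    # p[2] = letters, p[3] = up to four digits after leading zeros, p[4] = suffix
    paste0(p[2], sprintf("%03d", as.integer(p[3])), p[4])
  }, character(1))
}

l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
print(pad_flight(l))
```

Anchoring the pattern with ^ and $ is a design choice: it trades the flexibility of the \\K version for an explicit fallback on malformed input.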

How to generate word sequence

1. I want to generate combinations of characters from a given word, with each letter repeated consecutively at least once and at most twice. The resulting words are of unequal lengths. For example, from
"cat"
to
"cat", "catt", "caat", "caatt", "ccat", "ccatt", "ccaat", "ccaatt"
The required function takes a word of length n and generates 2^n words of unequal length. This is similar to binary numbers, where n binary digits give 2^n combinations. For example, a 3-digit binary number gives
000 001 010 011 100 101 110 111
combinations, where 0=t and 1=tt.
2. The same function should also limit the output to at most 2 consecutive repetitions of a character, even if the given word already contains repeated letters. For example
"catt"
to
"catt" "ccatt" "caatt" "ccaatt"
I tried something like this
pos=expand.grid(l1=c(1,11),l2=c(2,22),l3=c(3,33))
result=chartr('123','cat',paste0(pos[,1],pos[,2],pos[,3]))
#[1] "cat" "ccat" "caat" "ccaat" "catt" "ccatt" "caatt" "ccaatt"
It gives the correct sequence, but I am stuck on generalizing it to a given word of any length.
Thank you.
x="cat"
l=seq(nchar(x))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
chartr(n,x,do.call(paste0,expand.grid(m)))
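One caveat with the placeholder approach: it relies on each position being a single digit, so it only works for words of up to 9 letters (from 10 letters on, the string of placeholders is longer than the word and chartr() fails). A sketch that skips the placeholders and builds the choices from the letters directly; the helper name stretch_word is hypothetical.

```r
# Hypothetical helper: every letter appears either once or doubled
stretch_word <- function(word) {
  letters_in <- strsplit(word, "")[[1]]
  # For each letter, the two choices: single or doubled
  choices <- lapply(letters_in, function(ch) c(ch, paste0(ch, ch)))
  # All combinations of choices, first letter varying fastest
  grid <- expand.grid(choices, stringsAsFactors = FALSE)
  do.call(paste0, grid)
}

print(stretch_word("cat"))
print(length(stretch_word("abcdefghij")))  # 10 letters, 2^10 words
```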
1. An addition to the answer given by Onyambu, to solve the second part of the question, i.e. restricting the output to at most 2 consecutive repetitions of a character, however many consecutive repetitions the input word contains.
x="catt"
l=seq(nchar(x))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
o <- chartr(n,x,do.call(paste0,expand.grid(m)))
The line below collapses every run of more than 2 identical consecutive characters to exactly 2, and unique() then drops the resulting duplicates
unique(gsub('([[:alpha:]])\\1{2,}', '\\1\\1', o))
#[1] "catt" "ccatt" "caatt" "ccaatt"
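To see what the backreference patterns do in isolation, here is a small sketch on made-up input words (the vector below is illustrative only).

```r
words <- c("cattt", "ccaatttt", "cat")

# \\1{2,} = the captured letter repeated 2+ more times, i.e. a run of 3 or more;
# the run is replaced by exactly two copies
capped <- gsub("([[:alpha:]])\\1{2,}", "\\1\\1", words)

# \\1+ = any run of 2 or more; the run is replaced by a single copy
squeezed <- gsub("([[:alpha:]])\\1+", "\\1", words)

print(capped)
print(squeezed)
```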
2. If you want all the combinations from "cat" to "ccaatt", however many consecutive repetitions of characters the input word contains, the code is
x1="catt"
The line below reduces every run of repeated characters to a single character.
x2= gsub('([[:alpha:]])\\1+', '\\1', x1)
l=seq(nchar(x2))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
o <- chartr(n,x2,do.call(paste0,expand.grid(m)))
unique(gsub('([[:alpha:]])\\1{2,}', '\\1\\1', o))
#[1] "cat" "ccat" "caat" "ccaat" "catt" "ccatt" "caatt" "ccaatt"

Deriving Phone number from a string in R

I have some strings as below.
I converted all letters and special characters into X:
xxxxxx18002514919xxxxxxxxxxxxxxxxxxxxxxxxxx24XXXXXX7
xxxxxx9000012345xxxxxxxxxxxxx34567xxxxxxxxxxxxx1800XXXXXX7
How can I extract only the 11-digit or 10-digit phone number from the above strings in R?
My desired output is:
For first string: 18002514919
For second string: 9000012345
You can use stringr to solve your problem. Its function str_extract_all extracts the phone numbers as desired.
The regex:
\\d --> matches a single digit,
{n,m} --> curly braces set how many times the preceding element must match: n is the minimum and m the maximum number of repetitions. Since you want to match a phone number that is 10 or 11 digits long, n is 10 and m is 11.
X <- c("xxxxxx18002514919xxxxxxxxxxxxxxxxxxxxxxxxxx24XXXXXX7","xxxxxx9000012345xxxxxxxxxxxxx34567xxxxxxxxxxxxx1800XXXXXX7")
library(stringr)
str_extract_all(X,"\\d{10,11}")
Answer:
> str_extract_all(X,"\\d{10,11}")
[[1]]
[1] "18002514919"
[[2]]
[1] "9000012345"
If you are sure that each string contains only one phone number, use str_extract, which returns a plain character vector instead of a list.
> str_extract(X,"\\d{10,11}")
[1] "18002514919" "9000012345"
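One thing to be aware of: \\d{10,11} also matches the first 11 digits of a longer run of digits. If such runs should be rejected, a sketch with lookarounds in base R could look like this; the second input string is made up to show the difference.

```r
# First string has an 11-digit run; second (hypothetical) has a 12-digit run
X <- c("xxxxxx18002514919xxxx24XXXXXX7", "xxx123456789012xxx")

# Require that the 10-11 digits are not preceded or followed by another digit
pattern <- "(?<![0-9])[0-9]{10,11}(?![0-9])"
strict <- regmatches(X, gregexpr(pattern, X, perl = TRUE))

# Without the lookarounds, the 12-digit run yields a truncated "number"
loose <- regmatches(X, gregexpr("[0-9]{10,11}", X, perl = TRUE))

print(strict)
print(loose)
```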

rounding of digits

I'm having troubles with
set.seed(1)
sum(abs(rnorm(100)))
set.seed(1)
cumsum(abs(rnorm(100)))
Why does the value of the sum differ from the last value of the cumulative sum, with the cumulative sum preserving all the decimal digits and the sum rounding one digit off?
Also note that this is purely about how the values are printed, i.e. presented. It does not change the values themselves, e.g.
set.seed(1)
d1 <- sum(abs(rnorm(100)))
set.seed(1)
d2 <- cumsum(abs(rnorm(100)))
(d1 == d2)[100]
## [1] TRUE
This is a consequence of the way R prints atomic vectors.
With the default digits option set to 7 as you likely have, any value between -1 and 1 will print with seven decimal places. Because of the way R prints atomic vectors, all other values in the vector will also have seven decimal places. Furthermore, a value of .6264538 with digits option set to 7 must print with eight digits (0.6264538) because it must have a leading zero. There are two of these values in your rnorm() vector.
If you print cumsum(abs(rnorm(100)))[100] on its own, with the same seed, you can see the difference: it is displayed in the same way as sum(abs(rnorm(100))).
set.seed(1)
sum(abs(rnorm(100)))
# [1] 71.67207
set.seed(1)
cumsum(abs(rnorm(100)))[100]
# [1] 71.67207
Notice that both of these values have seven significant digits. Probably the most basic example of this I can think of is as follows:
0.123456789
#[1] 0.1234568
1.123456789
#[1] 1.123457
11.123456789
# [1] 11.12346
## and so on ...
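A short sketch tying the two answers together: the stored values are the same, and only the number of digits shown changes. It uses format(), which follows the same significant-digits rule as print(); the variable names are illustrative.

```r
set.seed(1)
x <- abs(rnorm(100))
total <- sum(x)
running <- cumsum(x)

# The stored values agree; only the display differs
same_value <- isTRUE(all.equal(total, running[100]))

# Default display: up to 7 significant digits per value
shown <- c(format(0.123456789), format(1.123456789), format(11.123456789))

# Asking for more digits reveals more of the same stored number
more <- format(0.123456789, digits = 9)

print(same_value)
print(shown)
print(more)
```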
