How to remove trailing zeros from a binary bit sequence in R?

I need to remove trailing zeros from binary bit sequences.
The length of the bit sequence is fixed, say 52 bits, e.g.,
0101111.....01100000 (52-bit),
10111010..1010110011 (52-bit),
10111010..1010110100 (52-bit).
When a decimal number is converted to normalized double precision, the significand field is 52 bits wide, so zeros are padded on the right-hand side whenever the significand is shorter than 52 bits. I am reversing the process: I am trying to convert a normalized double-precision value in memory back to a decimal number, so I have to remove the trailing zeros that were used to pad the significand to 52 bits.
It is not guaranteed that the sequence at hand ends in 0s (see the 2nd example above). If it does, all trailing zeros must be truncated:
f(0101111.....01100000) # 0101111.....011; leading 0 must be kept
f(10111010..1010110011) # 10111010..1010110011; no truncation
f(10111010..1010110100) # 10111010..10101101
Unfortunately, the number of trailing 0s to truncate varies (5 in the 1st example; 2 in the 3rd example).
It is fine if the input and output are both character strings:
f("0101111.....01100000") # "0101111.....011"; leading 0 must be kept
f("10111010..1010110011") # "10111010..1010110011"; no truncation
f("10111010..1010110100") # "10111010..10101101"
Any help is greatly appreciated.

This can be done with a simple regular expression.
f <- function(x) sub('0+$', '', x)
Explanation:
0 - matches the character 0.
0+ - matches the character 0 repeated one or more times.
$ - matches the end of the string.
0+$ - matches one or more 0 characters with nothing else after them until the end of the string.
sub() replaces the sub-string matched by the pattern with the empty string, ''.
Now test the function.
f("010111101100000")
#[1] "0101111011"
f("0100000001010101100010000000000000000000000000000000000000000000")
#[1] "010000000101010110001"
f("010000000101010110001000000")
#[1] "010000000101010110001"
f("00010000000101010110001000000")
#[1] "00010000000101010110001"
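Since sub() is vectorized, the same function can be applied to a whole vector of bit strings at once. The following is a minimal sketch on illustrative inputs (the vector `bits` is made up for this example, not taken from the question); note that a string consisting only of zeros is reduced to the empty string.

```r
# Same function as in the answer above
f <- function(x) sub("0+$", "", x)

# Illustrative inputs (hypothetical, shortened for readability)
bits <- c("0101100000", "1011010011", "1011010100", "0000")

trimmed <- f(bits)
print(trimmed)
# elements: "01011", "1011010011", "10110101", ""
```

If an all-zero input should be kept as a single "0" instead, that case would need separate handling.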

Related

How do I filter a dataset based on a "Version" column containing _________.000 decimal?

I have a dataset where I am trying to filter based on 3 different columns.
I have the 2 columns that have character values figured out by doing:
filter(TRANSACTION_TYPE != "ABC", CUSTOMER_CODE == "123"). However, I also have a "VERSION" column with multiple versions for each customer, which duplicates my $ amount.
I want to keep only the VERSION values that have ".000" as the decimal part, since .000 represents the final and most accurate version. For example, VERSION can be 20220901.000, 20220901.002, 20220901.003, etc. However, the digits before the decimal point change by day, so I can't filter on VERSION being equal to 20220901.
I hope I was clear enough, thank you!
Sample data:
quux <- data.frame(VERS_chr = c("20220901.000","20220901.002","20220901.000","20220901.002"),
VERS_num = c(20220901.000,20220901.002,20220901.000,20220901.002))
If is.character(quux$VERSION) is true in your data, then
dplyr::filter(quux, grepl("\\.000$", VERS_chr))
# VERS_chr VERS_num
# 1 20220901.000 20220901
# 2 20220901.000 20220901
Explanation:
"\\.000$" matches the literal period . (it needs to be escaped since it's a regex reserved symbol) followed by three literal zeroes 000, at the end of string ($). See https://stackoverflow.com/a/22944075/3358272 for more info on regex.
If it is false (and it is not a factor), then
dplyr::filter(quux, abs(VERS_num %% 1) < 1e-3)
# VERS_chr VERS_num
# 1 20220901.000 20220901
# 2 20220901.000 20220901
Explanation:
abs(.) < 1e-3 is defensive against high-precision tests of equality, where floating-point limitations (in computers in general) don't always see a number very-close to zero as exactly zero. See Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f.
. %% 1 is the modulus operator, reducing a number down to its fractional component.
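The two checks explained above can be tried on a few illustrative values. This is only a sketch: the vectors `versions` and `x` are made up for the example. The first part shows why the period must be escaped in the regex, the second why the numeric test uses a tolerance instead of an exact comparison.

```r
# Character case: an unescaped "." matches any character
versions <- c("20220901.000", "20220901.002", "202209011000")
escaped   <- grepl("\\.000$", versions)  # TRUE FALSE FALSE
unescaped <- grepl(".000$", versions)    # TRUE FALSE TRUE (false positive)

# Numeric case: the fractional part is only approximately 0.002
x <- c(20220901.000, 20220901.002)
frac <- x %% 1                 # fractional parts
exact <- frac[2] == 0.002      # FALSE because of floating-point representation
keep  <- abs(frac) < 1e-3      # TRUE FALSE
```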

What is the meaning of FORMAT Z(19)9 in Teradata?

I see a SQL statement that includes "format Z(19)9" in a Teradata environment, and I checked the Teradata documentation.
The following is for "Z":
Zero‑suppressed decimal digit.
Translates to blank if the digit is zero and preceding digits are also zero.
A FORMAT phrase containing Z characters (including Z(I) and Z(F)), a combination of commas,
dots, G, or D, and no other formatting characters means “blank when zero.”
For example, ZZZZZ, ZZ.Z, GZ(I)DZ(F), GZZZZZZDZZ and Z,ZZZ.ZZ print only blanks if the number
is zero.
A Z cannot follow a 9.
Repeated Z characters must appear to the left of any combination of the radix and any 9
formatting characters.
The characters to the right of the radix cannot be a combination of 9 and Z characters; they
must be all 9s or all Zs. If they are all Zs, then the characters to the left of the radix
must also be all Zs.
If a group of repeated Z characters appears in a FORMAT phrase with a group of repeated sign
characters, the group of Z characters must immediately follow the group of sign characters.
For example, --ZZZ.
And the following is for 9:
Decimal digit (no zero suppress).
But I am still confused: what is the meaning of FORMAT Z(19)9? For example, if the data is 0000.234, will the result be .234000000?

Encode numbers with letters with fixed length?

I have two unique numbers: the first in the range 100000 - 999999 (fixed length of 6 characters, [0-9]), the second in the range 1000000 - 9999999 (fixed length of 7 characters, [0-9]). How can I encode/decode these numbers (they need to remain separate after decoding) using only uppercase letters [A-Z] and digits [0-9], with a fixed total length of 8 characters?
Example:
input -> num_1: 242404, num_2 : 1002000
encode -> AX3B O3XZ
decode -> 2424041002000
Is there any algorithm for this type of problem?
This is just a simple mapping from one set of values to another set of values. The procedure is always the same:
List all possible input and output values.
Find the index of the input.
Return the value of the output list at that index.
Note that it's often not necessary to make an actual list (i.e. loading all values into some data structure). You can typically compute the value for any index on-demand. This case is no different.
Imagine a list of all possible input pairs:
0 100'000, 1'000'000
1 100'000, 1'000'001
2 100'000, 1'000'002
...
K 100'000, 9'999'999
K+1 100'001, 1'000'000
K+2 100'001, 1'000'001
...
N-1 999'999, 9'999'998
N 999'999, 9'999'999
For any given pair (a, b), you can compute its index i in this list like so:
// Make a and b zero-based
a -= 100'000
b -= 1'000'000
// b has 9'000'000 possible values (1'000'000 through 9'999'999),
// so each step in a skips 9'000'000 entries of the list
i = a*9'000'000 + b
Convert i to base 36 (A-Z and 0-9 gives you 36 symbols), pad on the left with zeros as necessary [1], and insert a space after the fourth digit.
encoded = addSpace(zeroPad(base36(i)))
To get back to the input pair:
Convert the base 36 string to base 10 (this is the index into the list, remember), then derive a and b from the index.
i = base10(removeSpace(encoded))
a = i/9'000'000 + 100'000 // integer division (i.e. ignore remainder)
b = i%9'000'000 + 1'000'000
Here is an implementation in Go: https://play.golang.org/p/KQu9Hcoz5UH
[1] If you don't like the idea of zero padding you can also offset i at this point. Mind the size of the target set, though: eight base 36 digits give 36^8 = 2'821'109'907'456 values, while the ranges as stated produce 900'000 * 9'000'000 = 8'100'000'000'000 pairs, so the full ranges need nine base 36 digits. Eight digits are enough only if the second number is limited to 1'000'000 distinct values; in that case the multiplier is 1'000'000 and about 32% of all base 36 numbers with eight digits or less are used.
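The index mapping can be sketched in R as follows. This is an illustration under stated assumptions, not the linked Go implementation: all helper names (`to_base36`, `from_base36`, `encode_pair`, `decode_pair`) are hypothetical, the multiplier is taken to be the number of possible values of b (9,000,000), and the code is nine characters wide because 900,000 * 9,000,000 pairs exceed 36^8. The space after the fourth digit is omitted for brevity.

```r
# Hypothetical helpers for illustration only
digits36 <- c(0:9, LETTERS)   # "0".."9", "A".."Z"
n_b <- 9000000                # number of possible b values: 1000000..9999999
width <- 9                    # 36^8 < 900000 * 9000000 <= 36^9

to_base36 <- function(i, width) {
  out <- character(width)
  for (k in width:1) {        # fill digits from least to most significant
    out[k] <- digits36[i %% 36 + 1]
    i <- i %/% 36
  }
  paste(out, collapse = "")   # zero-padded on the left
}

from_base36 <- function(s) {
  vals <- match(strsplit(s, "")[[1]], digits36) - 1
  sum(vals * 36^((length(vals) - 1):0))
}

encode_pair <- function(a, b) {
  to_base36((a - 100000) * n_b + (b - 1000000), width)
}

decode_pair <- function(code) {
  i <- from_base36(code)
  c(a = i %/% n_b + 100000, b = i %% n_b + 1000000)
}

code <- encode_pair(242404, 1002000)
decoded <- decode_pair(code)  # a = 242404, b = 1002000
```

All indices stay below 2^53, so ordinary R doubles represent them exactly.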

Zero padding regex dependent on length of digits

I have a field which contains two characters, some digits and potentially a single letter. For example:
QU1Y
ZL002
FX16
TD8
BF007P
VV1395
HM18743
JK0001
I would like to consistently return all letters in their original positions, but the digits as follows.
For 1 to 3 digits:
return all digits if there are three, otherwise the digits left-padded with zeros to three digits
For 4 or more digits:
if the number does not begin with a zero, return the first 4 digits; if it begins with a zero, truncate to three digits
Expected output for the data above:
QU001Y
ZL002
FX016
TD008
BF007P
VV1395
HM1874
JK001
The implementation will be in R, but I'm interested in a straight regex solution; I'll work out the R side of things. It may not be possible in straight regex, which is why I can't get my head round it.
This identifies the correct ones, but I'm hoping to correct those which are not right.
"[A-Z]{2}[1-9]{0,1}[0-9]{1,3}[F,Y,P]{0,1}"
For the curious, they are flight numbers but entered by a human. Hence the variety...
You may use
> library(gsubfn)
> l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
> gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)
[1] "QU001Y" "ZL002" "FX016" "TD008" "BF007P" "VV1395" "HM1874" "JK001"
The pattern matches
^ - start of string
[A-Z]{2} - two uppercase letters
\\K - the text matched so far is removed from the match
0* - 0 or more zeros
(\\d{1,4}) - Capturing group 1: one to four digits
\\d* - zero or more digits.
Group 1 is passed to the callback function, where sprintf("%03d",as.numeric(x)) left-pads the value with zeros to at least three digits.
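For readers without gsubfn, a comparable result can be sketched in base R with regexec() and regmatches(). This is an alternative approach, not the method of the answer above: instead of \\K it captures the two-letter prefix and any trailing text as extra groups and reassembles the string. It assumes every input matches the pattern.

```r
l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")

# Groups: (1) two letters, (2) up to four digits after leading zeros, (3) the rest
m <- regmatches(l, regexec("^([A-Z]{2})0*([0-9]{1,4})[0-9]*(.*)$", l))

# m[[i]] holds the full match followed by the three groups
fixed <- vapply(
  m,
  function(p) paste0(p[2], sprintf("%03d", as.integer(p[3])), p[4]),
  character(1)
)
print(fixed)
# elements: "QU001Y", "ZL002", "FX016", "TD008", "BF007P", "VV1395", "HM1874", "JK001"
```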

Regular expression for up to 3 decimal places

I need a regular expression that satisfies these rules:
The maximum number of decimal places is 3, but a number with no decimal point (e.g. 12) should be accepted too
the value must be at least 0
the value must be less than or equal to 99999999999.999
the radix point is DOT (e.g. 2.5, not 2,5)
Sample of valid numbers:
0
2
0.4
78784764.23
45.232
Sample of invalid numbers:
-2
123456789522144
84.2564
I found an example here (http://forums.asp.net/t/1642501.aspx) and have managed to modify it a little bit to make 0 as the minimum value, 99999999999.999 as the maximum value and to accept only DOT as radix point. Here's my modified regex:
^\-?(([0-9]\d?|0\d{1,2})((\.)\d{0,2})?|99999999999.999((\.)0{1,2})?)$
However, I still have a problem with the 3 decimal places, and the regex is rather unreliable. Can anyone help me with this, since I'm basically illiterate when it comes to regex?
Thanks.
EDITED:
I'm using ASP Regular Expression Validator
This is not that difficult:
^[0-9]{1,11}(?:\.[0-9]{1,3})?$
Explanation:
^ # Start of string
[0-9]{1,11} # Match 1-11 digits (i. e. 0-99999999999)
(?: # Try to match...
\. # a decimal point
[0-9]{1,3} # followed by one to three digits (i. e. 0-999)
)? # ...optionally
$ # End of string
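The pattern can be checked against the sample values from the question. The sketch below uses R's grepl() with perl = TRUE rather than the ASP validator, so it only illustrates the regex itself; the variable names are made up for the example.

```r
pattern <- "^[0-9]{1,11}(?:\\.[0-9]{1,3})?$"

valid   <- c("0", "2", "0.4", "78784764.23", "45.232", "99999999999.999")
invalid <- c("-2", "123456789522144", "84.2564")

ok_valid   <- grepl(pattern, valid, perl = TRUE)    # all TRUE
ok_invalid <- grepl(pattern, invalid, perl = TRUE)  # all FALSE
```

Note that the pattern also rejects a trailing dot such as "2." and a missing integer part such as ".5", because both sides of the decimal point require at least one digit.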
