Long Numbers As A Character String - r

As part of my dataset, one of the columns is a series of 24-digit numbers.
Example:
bigonumber <- 429382748394831049284934
When I import it using either data.table::fread or read.csv, it shows up as numeric in exponential format (e.g. 4.293827e+23).
options(digits=...) won't work since the number is longer than 22 digits.
When I do
as.character(bigonumber)
what I get is "4.29382748394831e+23"
Is there a way to get bigonumber converted to a character string and show all of the digits as characters? I don't need to do any math on it, but I do need to search against it and do dplyr joins on it.
I need to do this after import, since the column number varies from month to month.
(Yes, in the perfect world, my upstream data provider would use a hash instead of a long number and a static number of columns that stay the same every month, but I don't get to dictate that to them.)

You can specify colClasses in your fread or read.csv call. Suppose ~/Desktop/bignums.txt contains:
bignums
429382748394831049284934
429382748394831049284935
429382748394831049284936
429382748394831049284937
429382748394831049284938
429382748394831049284939
bignums <- read.csv("~/Desktop/bignums.txt", sep="", colClasses = 'character')
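If you use data.table::fread, colClasses also accepts a class-to-columns mapping, which helps when only the ID column should stay character; a minimal sketch, assuming the column is named bignums as above:
library(data.table)
# keep only the ID column as character; other columns are guessed as usual
dt <- fread("~/Desktop/bignums.txt", colClasses = list(character = "bignums"))
dt$bignums[1]
# [1] "429382748394831049284934"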

You can suppress the scientific notation with
options(scipen=999)
If you define the number then
bigonumber <- 429382748394831049284934
you can convert it into a string:
big.o.string <- as.character(bigonumber)
Unfortunately, this does not work because R converts the number to a double, thereby losing precision:
#[1] "429382748394831019507712"
The last digits are not preserved, as pointed out by #SabDeM. Even setting
options(digits=22)
doesn't help, and in any case 22 is the largest number that is allowed; and in your case there are 24 digits. So it seems that you will have to read the data directly as character or factor. Great answers have been posted showing how this can be achieved.
As a side note, there is a package called gmp that allows using arbitrarily large integer numbers. However, there is a catch: they have to be read as characters (again, in order to prevent R's internal conversion into double).
library(gmp)
bigonumber <- as.bigz("429382748394831049284934")
> bigonumber
Big Integer ('bigz') :
[1] 429382748394831049284934
> class(bigonumber)
[1] "bigz"
The advantage is that you can indeed treat these entries as numbers and perform calculations while preserving all the digits.
> bigonumber * 2
#Big Integer ('bigz') :
#[1] 858765496789662098569868
This package and my answer here may not solve your problem, because reading the numbers directly as character is the easier way to achieve your goal, but I thought I might post it anyway as information for users who need to work with integers of more than 22 digits.
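Since the goal in the question is searching and dplyr joins, note that bigz values convert back to exact strings, so they can still serve as character join keys:
library(gmp)
ids <- as.bigz("429382748394831049284934")
as.character(ids)
# [1] "429382748394831049284934"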

Use digest::digest on bigonumber to generate an md5 hash of the number yourself?
bigonumber <- 429382748394831049284934
hash_big <- digest::digest(bigonumber)
hash_big
# "e47e7d8a9e1b7d74af6a492bf4f27193"

I saw this suggested before I posted my answer, but I don't see it here anymore.
Set options(scipen) to a large value so that scientific notation is not used:
options(scipen = 999)
bigonumber <- 429382748394831049284934
bigonumber
# [1] 429382748394831019507712
as.character(bigonumber)
# [1] "429382748394831019507712"
(Note that the trailing digits still differ from the original: the value was rounded the moment it was parsed as a double, so scipen changes only the display, not the stored value.)

Use "scan" to read the file - the "what" parameter lets you define the input type of each column.

If you want numbers as numbers, you can't print all the digits. The digits option allows a maximum of 22 digits (its range is 1 to 22) and is used by the print.default method. You can set it with:
options( digits = 22 )
Even with this option, the printed numbers will change. This is because a double carries only about 15-17 significant decimal digits, so a 24-digit number is rounded as soon as it is parsed; anything beyond that precision was never stored.
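You can verify that the value changed at parse time, not at print time: the original 24-digit literal and the rounded number shown earlier are the very same double.
429382748394831049284934 == 429382748394831019507712
# [1] TRUE (both literals parse to the same 64-bit double)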

Related

readxl, returning values are slightly off from the values on Excel

I'm trying to read an Excel file into R.
I used read_excel function of the readxl package with parameter col_types = "text" since the columns of the Excel sheet contain mixed data types.
df <- read_excel("Test.xlsx",sheet="Sheet1",col_types = "text")
But a very slight difference in the numeric values is introduced. It's always the same few values, so I think it's some hidden attribute in Excel.
I tried formatting those values as numbers in Excel, and also tried adding 0s after the number, but it didn't work.
I changed the numeric value of a cell from 2.3 to 2.4, and it was read correctly by R.
This is a consequence of floating-point imprecision, but it's a little tricky. When you enter the number 1.2 (for example) into R or Excel, it's not represented exactly as 1.2:
print(1.2,digits=22)
## [1] 1.199999999999999955591
Excel and R usually try to shield you from these details, which are inevitable if you're using fixed precision floating-point values (which most computer systems do), by limiting the printing precision to a level that will ignore those floating-point imprecisions. When you explicitly convert to character, however, R figures you don't want to lose information, so it gives you all the digits. Numbers that can be represented exactly in a binary representation, such as 2.375, don't gain all those extra digits.
However, there's a simple solution in this case:
readxl::read_excel("Test.xlsx", na="ND")
This tells R that the string "ND" should be treated as a special "not available" value, so all of your numeric values get handled properly. When you examine your data, the tiny imprecisions will still be there, but R will print the numbers the same way that Excel does.
I feel like there's probably a better way to approach this (mixed-type columns are really hard to deal with), but if you need to 'fix' the format of the numbers you can try something like this:
x <- c(format(1.2,digits=22),"abc")
## [1] "1.199999999999999955591" "abc"
fix_nums <- function(x) {
  # attempt numeric conversion; entries that aren't numbers become NA
  nn <- suppressWarnings(as.numeric(x))
  # re-format only the entries that parsed as numbers
  x[!is.na(nn)] <- format(nn[!is.na(nn)])
  return(x)
}
fix_nums(x)
## [1] "1.2" "abc"
Then if you're using tidyverse you can use my_data %>% mutate_all(fix_nums)
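mutate_all() still works but has been superseded; with dplyr 1.0.0 or later the same idea is written with across():
library(dplyr)
my_data %>% mutate(across(everything(), fix_nums))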

R read excel file numeric precision problem

I have a number in an excel file that is equal to -29998,1500000003
When I try to open it in R I get
> library(openxlsx)
> posotest <- as.character(read.xlsx("sofile.xlsx"))
> posotest
[1] "-29998.1500000004"
Any help? Desired result: -29998,1500000003
EDIT: with options(digits=13) I get -29998.150000000373 which could explain why the rounding is done, however even with options(digits=13) I get
> as.character(posotest)
[1] "-29998.1500000004"
Do you have any function that would allow me to get the full number in characters?
EDIT 2: format does this, but it adds artificial noise at the end.
x <- -29998.150000000373
format(x,digits=22)
[1] "-29998.15000000037252903"
How can I know how many digits to use in format since nchar will give me a wrong value?
The file is here
You can get a string with up to 22 digits of precision via format():
x <- -29998.150000000373
format(x,digits=22)
[1] "-29998.15000000037252903"
Of course, this will show you all sorts of ugliness related to trying to represent a decimal number in a binary representation with finite precision ...
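If the trailing noise bothers you, sprintf lets you ask for a fixed number of decimal places rather than significant digits, so the unrepresentable tail is rounded away; a sketch using 10 decimals:
x <- -29998.150000000373
sprintf("%.10f", x)
# [1] "-29998.1500000004"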

Function write() inconsistent with number notation

Consider the following script:
list_of_numbers <- as.numeric()
for (i in 1001999498:1002000501) {
  list_of_numbers <- c(list_of_numbers, i)
}
write(list_of_numbers, file = "./list_of_numbers", ncolumns = 1)
The file that is produced looks like this:
[user#pc ~]$ cat list_of_numbers
1001999498
1001999499
1.002e+09
...
1.002e+09
1.002e+09
1.002e+09
1002000501
I found a couple more ranges where R does not print the numbers in a consistent format.
Now I have the following questions:
Is this a bug or is there an actual reason for this behavior?
Why just in certain ranges, why not every number above x?
I know how I can solve this like this:
options(scipen = 1000)
But are there more elegant ways than setting global options? Without converting it to a dataframe and changing the format.
It's not a bug; R chooses the shortest representation.
More precisely, in ?options one can read:
fixed notation will be preferred unless it is more than scipen digits wider.
So when scipen is 0 (the default), the shortest notation is preferred.
Note that you can get the scientific notation of a number x with format(x, scientific = TRUE).
In your case:
1001999499 is 10 characters long whereas its scientific notation 1.001999e+09 is longer (12 characters), so the decimal notation is kept.
1001999500: scientific notation is 1.002e+09, which is shorter.
the numbers in between: scientific notation stays equal to 1.002e+09, hence shorter.
1002000501: 1.002001e+09 is longer.
You may ask: how come that 1001999500 is formatted as 1.002e+09 and not as 1.0019995e+09? It's simply because there is also an option that controls the number of significant digits. It is named digits and its default value is 7. Since 1.0019995 has 8 significant digits, it is rounded up to 1.002.
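You can check both widths directly with format:
format(1001999499, scientific = TRUE)
# [1] "1.001999e+09" (12 characters, wider than the 10-character fixed form)
format(1001999500, scientific = TRUE)
# [1] "1.002e+09" (9 characters, shorter)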
The simplest way to ensure that decimal notation is kept without changing global options is probably to use format:
write(format(list_of_numbers, scientific = FALSE, trim = TRUE),
      file = "./list_of_numbers")
Side note: you didn't need a loop to generate your list_of_numbers (which by the way is not a list but a vector). Simply use:
list_of_numbers <- as.numeric(1001999498:1002000501)

Exporting Long Integer to CSV

I am trying to export a data.frame which has 2 columns containing 16- and 24-digit integers.
However, after exporting to csv, I am getting scientific notation like 4.352370e+15 in place of the original non-scientific integers.
Code
write.csv(fulljoin,file="output2.csv")
The following option worked for me. Thanks to everyone for the help.
options(digits=18)
Everything else was kept constant. It helped while reading as well as while writing.
I would not count on getting that many digits without loss of precision in the default integer data type. Notice that the integers will actually get coerced to numeric after a certain length. Compare class(12345678L) (integer) to class(123456789012L) (numeric with warning). After a little more length you will start to lose precision, regardless of how many digits you are displaying:
options(digits=22) # the max
x <- 1234567890123456789012; x
# [1] 1234567890123456774144 -- whoops!
For larger integers you may want to use a different class such as Big Integer in gmp.
library(gmp)
x <- as.bigz("1234567890123456789012345678901234567890")
x <- x + 1 # do some math
write.csv(as.character(x), "bignumber.csv", row.names=FALSE, quote=FALSE)
# csv looks like:
# x
# 1234567890123456789012345678901234567891
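To round-trip, read the csv back with the column forced to character (a sketch, assuming the file written above):
x_back <- read.csv("bignumber.csv", colClasses = "character")$x
identical(x_back, as.character(x))
# [1] TRUE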

adding or retaining leading zeros without converting to character format

Is it possible to add or retain one or more leading zeros to a number without the result being converted to character? Every solution I have found for adding leading zeros returns a character string, including: paste, formatC, format, and sprintf.
For example, can x be 0123 or 00123, etc., instead of 123 and still be numeric?
x <- 0123
EDIT
It is not essential. I was just playing around with the following code and the last two lines gave the wrong answer. I just thought maybe if I could have leading zeros with numeric format obtaining the correct answer would be easier.
a7 = c(1,1,1,0); b7 = c(0,1,1,1)  # 4
a77 = '1110'; b77 = '0111'        # 4
a777 = 1110; b777 = 0111          # 4
length(b7[(b7 %in% intersect(a7, b7))])
# from "R - count matches between characters of one string and another, no replacement":
keyword <- unlist(strsplit(a77, ''))
text <- unlist(strsplit(b77, ''))
sum(!is.na(pmatch(keyword, text)))
ab7 <- read.fwf(file = textConnection(as.character(rbind(a777, b777))),
                widths = c(1,1,1,1), colClasses = rep("character", 2))
length(ab7[2, ][(ab7[2, ] %in% intersect(ab7[1, ], ab7[2, ]))])
You are not thinking correctly about what a "number" is. Programming languages store an internal representation that retains full precision up to the machine limit. You are apparently concerned with what gets printed to your screen or console. By definition, those number characters are string elements, which is to say a couple of bytes are processed by the ASCII decoder (or equivalent) to determine what to draw on the screen. What x "is," to draw happily on Presidential Testimony, depends on your definition of what "is" is.
You could always create your own class of objects with one slot for the value of the number (though if it is stored as numeric, what we see as 123 is actually stored as a binary value, something like 01111011, probably with more leading 0's) and another slot or attribute for either the number of leading 0's or the number of significant digits. Then you can write methods for what to do with the number (and what effect that will have on the leading 0's, significant digits, etc.).
The print method could then make sure to print it with the leading zeros while keeping the internal value as a number.
But this seems a bit of overkill in most cases (though I know that some fields make a big deal about indicating the number of significant digits, so leading 0's could be important). It may be simpler to use the conversion-to-character methods that you already know about, and just do the printing in a way that does not look obviously like a number; see the cat and print functions for the options.
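For illustration, a minimal S3 sketch of that idea (all names are made up): the value stays numeric, and only the print method adds the padding.
as_padded <- function(x, width) structure(x, width = width, class = "padded")
print.padded <- function(x, ...) {
  # pad with leading zeros only at print time; the underlying value stays numeric
  print(formatC(as.numeric(x), width = attr(x, "width"), flag = "0"), quote = FALSE)
  invisible(x)
}
p <- as_padded(123, width = 5)
p
# [1] 00123
as.numeric(p) + 1
# [1] 124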
