Converting numeric to character in R changes the number - r

I am currently working with ngrams, which are stored in a data.table in a numeric format, where each word in a vocabulary is given a unique 5-digit number, so a single 4-gram looks like this:
10000100001017060484
The reason for storing ngrams in this manner is that numeric objects take up much less space in R. Hence, I am working with some large numbers, which I occasionally need to convert to character and back to do some string manipulation. Today, I noticed that my RStudio does not seem to store large numbers correctly. For example:
as.numeric(125124313242345145234513234432)
[1] 125124313242345143744028208602
As you can see, the top number is very different from the bottom one. The only global option I set was:
options(scipen=999)
Can someone explain why this is happening and how I can fix it?
Regards,
Kamran.

If you run .Machine$integer.max, it returns 2147483647, which means that by default R can't handle an integer greater than 2147483647. If you run .Machine$double.xmax, you get 1.797693e+308, the largest double-precision floating-point number R can represent. A double is stored in two parts, an exponent (the 308) and a significand (the 1.797...). Because the significand holds only 53 bits (roughly 15-16 decimal digits), integers beyond 2^53 can no longer all be represented exactly, which is why your 30-digit number changes.
?.Machine
http://sites.stat.psu.edu/~drh20/R/html/base/html/zMachine.html
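A quick sketch of where that limit bites, on a standard double-precision build of R:
.Machine$integer.max       # largest value R's 32-bit integer type can hold
[1] 2147483647
2^53 == 2^53 + 1           # TRUE: 2^53 + 1 rounds back to 2^53 as a double
[1] TRUE
2^53 - 1 == 2^53 - 2       # below 2^53 every whole number is still exact
[1] FALSE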
In your case, if you append the suffix L to the number (the way of telling R that you want to store the value as an integer), you get something like this:
as.numeric(125124313242345145234513234432L)
[1] 1.251243e+29
Warning message:
non-integer value 125124313242345145234513234432L qualified with L; using numeric value
Hence, because of these limitations on storing integers and doubles, R gives you this outcome.
To overcome this, you can convert the number into a bigz (an arbitrary-precision integer) using the gmp library, passing the digits as a string:
library(gmp)
as.bigz("125124313242345145234513234432")
Output:
Big Integer ('bigz') :
[1] 125124313242345145234513234432
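A quick sketch of what that buys you, assuming the gmp package is installed: arithmetic on bigz objects keeps every digit.
library(gmp)
x <- as.bigz("125124313242345145234513234432")
x + 1
Big Integer ('bigz') :
[1] 125124313242345145234513234433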
This is my understanding of how R stores numbers; it may not be perfect, but it is how I see things.
You may choose to see the gmp documentation: https://cran.r-project.org/web/packages/gmp/gmp.pdf

Sorry for making this an answer, but it's too long for a comment. What happens if you run the code below? On my machine, with scipen = 999, your conversion works fine. Have you really stored your ngram numbers as numeric? In the code below you can see that a potential error can arise from converting between character and numeric, depending on the settings.
mynumber <- 125124313242345145234513234432  # already stored as a double
options(scipen = 999)  # prefer fixed notation over scientific
mynumber == as.numeric(mynumber)
#[1] TRUE
mynumber == as.numeric(as.character(mynumber))  # fixed notation round-trips
#[1] TRUE
options(scipen = 0)  # back to the default, scientific notation for big numbers
mynumber == as.numeric(mynumber)
#[1] TRUE
mynumber == as.numeric(as.character(mynumber))  # scientific notation drops digits
#[1] FALSE

Related

Using factor vs. character and integer vs. double columns in dplyr/ggplot2

I am setting up a pipeline to import, format, normalize, and plot a bunch of datasets. The pipeline will rely heavily on tidyverse solutions (dplyr and ggplot2).
During the input/format step I would like to decide if/when to use factors vs. characters in various columns that contain letters. Likewise, I need to decide if I should designate numerical columns as integers (when it's reasonable) or use double.
My gut feeling is that by default I should just use character and double. Neither speed nor space is an issue, since the resulting datasets are relatively small (~20 x 10,000 max), so I figure this will give me the most flexibility. Are there disadvantages to going down this road?
Performance shouldn't be a concern in most use cases; the criterion is the meaning of the variables.
Factor vs character
Use character if your data is just strings that do not hold specific meaning; use factor if it's a categorical variable with a limited set of values. The main advantages of using factors (illustrated in the sketch after this list) are:
you get a warning (and an NA) if you try to assign a value that is not in the levels (so that can save you from typos)
you can give an order to the levels and get an ordered factor
some functions (especially when modelling) require an explicit factor for categorical variables
you make it clear to the reader that these are not random character strings.
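A minimal sketch of those points (the variable and level names are invented for illustration):
sizes <- factor(c("small", "large"),
                levels = c("small", "medium", "large"), ordered = TRUE)
sizes[1] <- "tiny"   # not one of the levels: NA is generated, with a warning
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = "tiny") : invalid factor level, NA generated
sizes < "large"      # ordered levels support comparisons
[1]    NA FALSE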
Integer vs double
If you know your column will only ever contain integer values, integer can be a better choice. Indeed, computations on doubles can give some numeric error, and in some situations you can end up with 26.0000000001 != 26. In addition, some packages may be aware of the type of input (although I can't think of any example).
For big numbers (more than 2^31 - 1), integers won't be able to store them, whereas doubles will still behave correctly:
as.integer(2147483647)
#> [1] 2147483647
as.integer(2147483648)
#> [1] NA
#> Warning message:
#> NAs introduced by coercion to integer range
But when the numbers get even bigger, doubles will also start losing significant digits:
1234578901234567890 == 1234578901234567891
#> [1] TRUE
Overall, I don't think it makes a big difference in practice. Using an integer type can be a way to signal to the reader and to the program that if there is a decimal number in that column, something went wrong.

Converting strings in octal base into decimal integer in R

I have a number in octal base in string form (see below for an example):
02686a6552f426f08ac0f20ce7dca23e
and I need to transform it into a decimal integer using R. I have tried googling for the relevant function, but I could only find functions that either convert decimal to octal, such as:
as.octmode()
or provide conversion between hexadecimal and decimal bases, from the fBasics package:
.hex.to.dec
I had hoped for a function like change.base(string, base_from, base_to), but I have only been able to find strtoi, with the following arguments:
strtoi("02686a6552f426f08ac0f20ce7dca23e",base=8)
which gives me an NA value. The documentation states "Convert strings to integers according to the given base", but it doesn't say whether the base argument specifies the base from which we transform or the one into which we transform (I assume the latter, since the example posted above doesn't produce results).
It seems that the PHP function octdec() gives a result:
echo octdec(02686a6552f426f08ac0f20ce7dca23e)
5176
But I do not really know PHP. According to our developers, (decoct(payment_id) + 3) * 7 is the only operation applied to the integer in this case. This is pushed into Google Analytics, which produces the result in the example form. I wasn't able to find anything on GA doing this by default.
It would be easy to do the conversion mathematically if I had just the number in octal, but since the format looks like what I assume is some kind of a hash representation of the original number, I am clueless.
I need to run this over hundreds of similar records to compare two data sources, so I can't really use the PHP sandbox to do it manually.
Thanks for any help or pointers.
The hexadecimal number can be converted into a decimal integer with the Rmpfr package:
library(Rmpfr)
x <- mpfr("02686a6552f426f08ac0f20ce7dca23e", base=16)
x
# 1 'mpfr' number of precision 128 bits
#[1] 3200612827992787429417270296251769406
To convert this number into an octal one, the same library can be used:
formatMpfr(x,base=8)
#[1] "23206514524572046741053007440634767121076.000"

Are as.character() and paste() limited by the size of the numeric values they are given?

I'm running into some problems with the R functions as.character() and paste(): they do not give back what they're being fed...
as.character(1415584236544311111)
## [1] "1415584236544311040"
paste(1415584236544311111)
## [1] "1415584236544311040"
What could be the problem, or what would be a workaround to paste my number as a string?
Update
I found that using the bit64 library allowed me to retain the extra digits I needed with the function as.integer64().
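For reference, a minimal sketch of that workaround; the key is to hand the digits to as.integer64() as a string, so they are never parsed (and rounded) as a double first:
library(bit64)
x <- as.integer64("1415584236544311111")  # fits in 64 bits (max ~9.2e18)
as.character(x)
[1] "1415584236544311111"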
Remember that numbers are stored in a fixed number of bytes, based on the hardware you are running on. Can you show that your very big integer is treated properly by normal arithmetic operations? If not, you're probably trying to store a number too large for the number of bytes your R installation uses for integers. The number you see is just what could fit.
You could try storing the number as a double, which is technically less precise but can store larger numbers in scientific notation.
EDIT
Consider the answers in long/bigint/decimal equivalent datatype in R, which list solutions including arbitrary-precision packages.

How do I read large numbers precisely in R and perform arithmetic on them?

I am reading a CSV file with some really big numbers like 1327707999760, but R automatically converts them into 1.32771e+12. I've tried to assign it the double class, but that didn't work because the value is already rounded.
I've checked other posts like Preserving large numbers. People said: "It's not in a "1.67E+12 format", it just won't print entirely using the defaults. R is reading it in just fine and the whole number is there." But when I tried to do some arithmetic on them, it's just not right.
For example:
test[1,8]
[1] 1.32681e+12
test[2,8]
[1] 1.32681e+12
test[2,8]-test[1,8]
[1] 0
But I know they are different numbers!
That's not large. It is merely a representation problem. Try this:
options(digits=22)
options('digits') defaults to 7, which is why you are seeing what you do. All thirteen digits are being read and stored, but they are not printed by default.
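For example, with a number like the one in the question (a sketch, run in a fresh session):
x <- 1327707999760
x                      # printed with the default 7 significant digits
[1] 1.327708e+12
options(digits = 22)
x                      # all thirteen digits were stored all along
[1] 1327707999760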
Excel allows custom formats: Format/Cells/Custom and enter #0

Preventing R From Rounding

How do I prevent R from rounding?
For example,
> a<-893893084082902
> a
[1] 8.93893e+14
I am losing a lot of information there. I have tried signif() and it doesn't seem to do what I want.
Thanks in advance!
(This came up as a result of a student of mine trying to determine how long it would take to count to a quadrillion at a number per second)
It's not rounding; it's just the default format for printing large (or small) numbers.
a <- 893893084082902
sprintf("%f", a)
[1] "893893084082902.000000"
See the "digits" section of ?options for a global solution.
This would show you more digits for all numbers:
options(digits=15)
Or, if you want it just for a:
print(a, digits=15)
To get around R's integer limits, you could use the gmp package for R: http://cran.r-project.org/web/packages/gmp/index.html
I discovered this package when playing with the Project Euler challenges and needing to do factorizations. But it also provides functions for big integers.
EDIT:
It looks like this question was not really one about big integers as much as it was about rounding. But for the next space traveler who comes this way, here's an example of big integer math with gmp:
Try multiplying 1e500 * 1e500 in base R:
> 1e500 * 1e500
[1] Inf
So to do the same with gmp, you first need to create a big-integer object, which gmp calls bigz. If you try to pass as.bigz() an int or double holding a really big number, it will not work, because the whole reason we're using gmp is that R can't hold a number this big; so we pass it a string. The following code therefore starts with string manipulation to create the big string:
library(gmp)
o <- paste(rep("0", 500), collapse="")  # a string of 500 zeros
a <- as.bigz(paste("1", o, sep=""))     # the big integer 10^500, built from a string
mul.bigz(a, a)                          # exact multiplication: 10^1000
You can count the zeros if you're so inclined.
