Working with big decimal numbers in R

I'm trying to print to the console, or even just inspect, the numbers inside my dataframe object, which contains big decimal numbers with 8 decimal places, such as "1054792997932.50564756" (the class of the number is numeric).
I tried using print(), cat() and View() to inspect a single number, but the only result I get back is the integer part, "1054792997932"; the decimal places cannot be seen unless I use sprintf("%.8f", number), but then the output I get back is the wrong number:
> sprintf("%.8f", 1054792997932.50564756)
[1] "1054792997932.50561523"
So from the looks of it, sprintf is not a good method for checking or formatting big decimal numbers.
I'm having problems validating and rounding such numbers. Any advice or help you can provide on how to deal with numbers like these in R would be appreciated, as I am stuck.
The system setup is:
R version: 3.4.0
I use pretty standard packages:
base R's stats and utils

You can change the number of digits displayed in the console with the option "digits".
To view your current setting, type
getOption("digits")
The default setting is 7. With
options("digits" = 22)
you can change the setting; 22 is the maximum value this option accepts.
Note, however, that R stores these values as double-precision floating-point numbers, which carry only about 15 to 17 significant decimal digits. A literal such as 1054792997932.50564756 has 21 significant digits, so it cannot be stored exactly no matter how many digits are displayed. sprintf("%.8f", ...) is not wrong: it is faithfully printing the nearest representable double.
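For example, a minimal sketch (the exact trailing digits shown depend on your platform's double-precision representation):
> x <- 1054792997932.50564756
> getOption("digits")
[1] 7
> x
[1] 1.054793e+12
> options(digits = 22)
> x
[1] 1054792997932.505615234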


How to generate all possible unicode characters?

If we type letters in R, we get all lowercase letters of the English alphabet. However, there are many more possible characters, like ä, é and so on, and there are symbols like $ or (, too. I found this table of unicode characters, which is exactly what I need. Of course I do not want to copy and paste hundreds of possible unicode characters into one vector.
What I've tried so far: the table gives the decimals for (some of) the unicode characters. For example, see the following small table:
Glyph Decimal Unicode Usage in R
! 33 U+0021 "\U0021"
So if we type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width = 4, flag = "0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(), there is still the problem that the table does not list decimals for all unicode characters (its decimals end at 591).
Any idea how to generate a vector with all the unicode characters listed in the table linked?
(The question started with a real-world problem, but now I am mostly just eager to know how to do this.)
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of unicode scripts and the block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters we can unlist it. This won't be easy to work with, though, so in practice it may be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So that seems to be all of them, and the page needs updating. The characters are stored as integers, so to print them (assuming the glyph is supported) we can use intToUtf8(). For example, printing the Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

19-digit number doesn't get parsed as-is in R

I have the following number in a JSON array, held in a character string:
[1544018118438041139,162.0,38.0,0.023529414,1.0,2131230815,1]
library(jsonlite)
fromJSON(
'[1544018118438041139,162.0,38.0,0.023529414,1.0,2131230815,1]'
)
returns:
[1] 1544018118438041088.000000000000000000000 162.000000000000000000000 38.000000000000000000000
[4] 0.023529413999999998497 1.000000000000000000000 2131230815.000000000000000000000
[7] 1.000000000000000000000
I want it to return the exact time (the big number: 1544018118438041139).
Please advise how to make this work.
This appears to be a limit in precision inherent in R itself. Consider:
x <- 1544018118438041139
as.character(x)
[1] "1544018118438041088"
So, I'm not sure that the jsonlite package is responsible for this rounding error. Based on the comment from @Konrad, this is a limit in the precision of the IEEE double-precision data type.
There is a workaround if all you want to do is preserve the digits in the original number. You could instead treat the data as a string:
'["1544018118438041139",162.0,38.0,0.023529414,1.0,2131230815,1]'
Then, presumably jsonlite would read in the "number" as a string, without any precision issues coming into play. But then you would still have the problem should you decide to cast to numeric in R and do math with it.
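A minimal sketch of that workaround; the bit64 package (an assumption here, not something the original answer uses) provides a 64-bit integer type if you do need arithmetic afterwards. Note that simplification coerces the whole mixed array to character:
> library(jsonlite)
> x <- fromJSON('["1544018118438041139",162.0,38.0,0.023529414,1.0,2131230815,1]')
> x[1]
[1] "1544018118438041139"
> library(bit64)   # assumed installed; supplies the integer64 class
> as.integer64(x[1])
integer64
[1] 1544018118438041139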

R is adding extra numbers while reading file

I have been trying to read a file which has a date field and a numeric field. I have the data in an Excel sheet, and it looks something like this:
Date X
1/25/2008 0.0023456
12/23/2008 0.001987
When I read this in R using the readxl::read_xlsx function, the data in R looks like this:
Date X
1/25/2008 0.0023456000000000
12/23/2008 0.0019870000000000
I have tried limiting the digits using functions like round, format (with nsmall = 7), etc., but nothing seems to work. What am I doing wrong? I also tried saving the data as a csv and a txt and reading it with read.csv and read.delim, but I face the same issue again. Any help would be really appreciated!
As noted in the comments to the OP and the other answer, this problem is due to the way floating point math is handled on the processor being used to run R, and its interaction with the digits option.
To illustrate, we'll create an Excel spreadsheet with the data from the OP and write a short R script to demonstrate what happens as we adjust the digits option.
> # first, display the number of significant digits set in R
> getOption("digits")
[1] 7
>
> # Next, read data file from Excel
> library(xlsx)
>
> theData <- read.xlsx("./data/smallNumbers.xlsx",1,header=TRUE)
>
> head(theData)
Date X
1 2008-01-25 0.0023456
2 2008-12-23 0.0019870
>
> # change digits to larger number to replicate SO question
> options(digits=17)
> getOption("digits")
[1] 17
> head(theData)
Date X
1 2008-01-25 0.0023456000000000002
2 2008-12-23 0.0019870000000000001
>
However, the behavior of printing significant digits varies by processor / operating system, as setting options(digits=16) results in the following on a machine running an Intel i7-6500U processor with Microsoft Windows 10:
> # what happens when we set digits = 16?
> options(digits=16)
> getOption("digits")
[1] 16
> head(theData)
Date X
1 2008-01-25 0.0023456
2 2008-12-23 0.0019870
>
You can format the values for display with the formattable package:
library(formattable)
x <- formattable(x, digits = 7, format = "f")
Or you may want to run the following to get the default formatting back from R:
options(defaultPackages = "")
and then restart your R session.
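For reference, a self-contained version of that formattable call might look like this (a sketch, assuming the package is installed; x here is just an example vector):
> library(formattable)
> x <- c(0.0023456, 0.001987)
> formattable(x, digits = 7, format = "f")
[1] 0.0023456 0.0019870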
Perhaps the problem isn't your source file, since you say this happens with .csv and .txt as well.
Try checking the current value of your display digits option by running options()$digits.
If the result is e.g. 14, then that is likely the problem.
In that case, try running the R command options(digits=8), which will set the display digits to 8 for the session.
Then simply reprint your dataframe to see that the change takes effect in how the decimals are displayed by default (see the sketch below).
Consult ?options for more info about the digits display setting and other session options.
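Putting those steps together, a minimal sketch (this assumes a session in which digits had previously been set to 14):
> df <- data.frame(X = c(0.0023456, 0.001987))
> options()$digits          # check the current display setting
[1] 14
> options(digits = 8)       # use 8 significant digits for this session
> df                        # reprint: decimals now display as expected
          X
1 0.0023456
2 0.0019870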
Edit to improve original answer and to clarify for future readers:
Changing options(digits=x) either up or down does not change the value that is stored or read into internal memory for floating-point variables. The digits session option merely changes how floating-point values print, i.e. how they display on the screen, for common print functions, per the ?options documentation:
digits: controls the number of significant digits to print when printing numeric values.
What the OP showed as the problem (R displaying more decimals after the last digit of a decimal number than the OP expected to see) was not caused by the source file having been read from Excel; given the OP had the same problem with CSV and TXT, the import process didn't cause it.
If you are seeing more decimals than you want by default in your printed/displayed output (e.g. for dataframes and numeric variables) try checking options()$digits and understand that option is simply the default for the number of digits used by R's common display and printing methods. HOWEVER, it does not affect floating point storage on any of your data or variables.
Regarding floating-point numbers, though, another answer here shows how setting options(digits=n) higher than the default can surface some precision/display idiosyncrasies related to floating-point precision. That is a separate problem from what the OP showed in his example, but it's well worth understanding.
For a much more detailed and topic specific discussion of floating point precision than would be appropriate to rehash here, it's well worth reading this definitive SO question+answer: Why are these numbers not equal?
That other question+answer+discussion covers issues specifically around floating point precision and contains a long, well presented list of references that you will find helpful if you need more information on the subject.

Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?

In Unicode, letters with accents can be represented in two ways: as the accented letter itself, and as the combination of the bare letter plus the combining accent. For example, é (U+00E9) and e´ (U+0065 U+0301) are usually displayed in the same way.
R renders the following (version 3.0.2, Mac OS 10.7.5):
> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"
However, of course:
> "\u00e9" == "\u0065\u0301"
[1] FALSE
Is there a function in R which converts such two-code-point letters into their one-code-point form? In particular, here it would collapse "\u0065\u0301" into "\u00e9".
That would be extremely handy for processing large quantities of strings. Plus, the one-character forms can easily be converted to other encodings via iconv -- at least for the usual Latin1 characters -- and are better handled by plot.
Thanks a lot in advance.
Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.
It has Unicode normalization functions, as I was looking for (here form C):
> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE
It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:
> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal ;
# otherwise it returns 1 or -1, i.e. greater or lesser, in the alphabetic order.
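Both functions are vectorised, so normalising a whole vector of strings at once is straightforward (a small sketch):
> strings <- c('\u00e9', '\u0065\u0301', 'caf\u0065\u0301')
> stri_trans_nfc(strings) == c('\u00e9', '\u00e9', 'caf\u00e9')
[1] TRUE TRUE TRUE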
Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!

Working with Unicode in R

I read text from a MySQL table into an R dataframe (using RODBC's sqlFetch). I have two questions:
How do I figure out if R has read it in as UTF-8? It's character type, but what's the function to show the encoding?
How do I compute the number of characters in a Unicode string in R? The length function does not work with Unicode and always returns 1, I think.
You should be able to read the encoding (assuming it is specified) with:
Encoding(x)
The number of characters can be determined with:
nchar(x)
(length(x) returns 1 for a single string because it counts the elements of the character vector, not the characters inside the string.)
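For example, a minimal sketch (the encoding mark depends on your locale and on how the string was created):
> x <- "caf\u00e9"
> Encoding(x)
[1] "UTF-8"
> nchar(x)
[1] 4
> nchar(x, type = "bytes")   # é is two bytes in UTF-8
[1] 5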
