R is adding extra numbers while reading file

I have been trying to read a file which has a date field and a numeric field. I have the data in an Excel sheet and it looks something like this:
Date X
1/25/2008 0.0023456
12/23/2008 0.001987
When I read this into R using the readxl::read_xlsx function, the data in R looks like this:
Date X
1/25/2008 0.0023456000000000
12/23/2008 0.0019870000000000
I have tried limiting the digits using functions like round, format (nsmall = 7), etc., but nothing seems to work. What am I doing wrong? I also tried saving the data as a CSV and a TXT and reading it with read.csv and read.delim, but I face the same issue again. Any help would be really appreciated!

As noted in the comments to the OP and the other answer, this problem is due to the way floating point math is handled on the processor being used to run R, and its interaction with the digits option.
To illustrate, we'll create an Excel spreadsheet with the data from the OP and then run a short R script that shows what happens as we adjust the options(digits=) setting.
> # first, display the number of significant digits set in R
> getOption("digits")
[1] 7
>
> # Next, read data file from Excel
> library(xlsx)
>
> theData <- read.xlsx("./data/smallNumbers.xlsx",1,header=TRUE)
>
> head(theData)
Date X
1 2008-01-25 0.0023456
2 2008-12-23 0.0019870
>
> # change digits to larger number to replicate SO question
> options(digits=17)
> getOption("digits")
[1] 17
> head(theData)
Date X
1 2008-01-25 0.0023456000000000002
2 2008-12-23 0.0019870000000000001
>
However, the behavior of printing significant digits varies by processor / operating system, as setting options(digits=16) results in the following on a machine running an Intel i7-6500U processor with Microsoft Windows 10:
> # what happens when we set digits = 16?
> options(digits=16)
> getOption("digits")
[1] 16
> head(theData)
Date X
1 2008-01-25 0.0023456
2 2008-12-23 0.0019870
>

library(formattable)
x <- formattable(x, digits = 7, format = "f")
or you may want to add this to get the default formatting from R:
options(defaultPackages = "")
Then restart your R session.

Perhaps the problem isn't your source file, since you say this happens with .csv and .txt as well.
Try checking the current value of your display digits option by running options()$digits.
If the result is e.g. 14, then that is likely the problem.
In that case, try running the R command options(digits=8), which sets the display digits to 8 for the session.
Then simply reprint your data frame to see that the change has taken effect in how the decimals are displayed by default.
Consult ?options for more info about digits display setting and other session options.
Edit to improve original answer and to clarify for future readers:
Changing options(digits=x) either up or down does not change the value that is stored or read into internal memory for floating point variables. The digits session option merely changes how floating point values print, i.e. display on the screen, for common print functions, per the ?options documentation:
digits: controls the number of significant digits to print when printing numeric values.
What the OP showed as the problem (R displaying more decimals after the last digit of a decimal number than the OP expected to see) was not caused by the source file having been read from Excel; given the OP had the same problem with CSV and TXT, the import process was not the cause.
If you are seeing more decimals than you want by default in your printed/displayed output (e.g. for data frames and numeric variables), check options()$digits and understand that this option is simply the default number of significant digits used by R's common display and printing methods. HOWEVER, it does not affect the floating point storage of any of your data or variables.
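A minimal sketch of that distinction (an illustrative session; the exact trailing digits shown at high digits settings can vary by platform):
x <- 0.0023456
options(digits = 17)
x                         # may now show trailing "noise", e.g. 0.0023456000000000002
options(digits = 7)       # back to the default
x                         # 0.0023456
identical(x, 0.0023456)   # TRUE -- the stored double never changed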
Regarding floating point numbers, though, another answer here shows how setting options(digits=n) higher than the default can help demonstrate some precision/display idiosyncrasies that are related to floating point precision. That is a separate problem from what the OP showed in his example, but it's well worth understanding.
For a much more detailed and topic specific discussion of floating point precision than would be appropriate to rehash here, it's well worth reading this definitive SO question+answer: Why are these numbers not equal?
That other question+answer+discussion covers issues specifically around floating point precision and contains a long, well presented list of references that you will find helpful if you need more information on the subject.

Related

R working with big decimal numbers

I'm trying to print to the console, or even inspect, the numbers inside my data frame object, which contains big decimal numbers with 8 decimal places such as "1054792997932.50564756" (the class of the number is numeric).
I tried using print(), cat() and View() to inspect a single number, but the only result I get back is an integer, "1054792997932"; the decimal places cannot be seen unless I use sprintf("%.8f", number), but then the output I get back is the wrong number:
> sprintf("%.8f", 1054792997932.50564756)
[1] "1054792997932.50561523"
So from the looks of it, sprintf is not a good method to use to check or format big decimal numbers.
I'm having problems validating and working with rounding of such numbers. Any advice/help you can provide on how to deal with numbers in R would be appreciated, as I am stuck.
The system setup is:
R version: 3.4.0
I use pretty standard packages:
R stats and R Utils
You can change the number of digits displayed in the console with the option "digits".
To view your current setting, type
getOption("digits")
The default setting is 7. With
options("digits" = 22)
you can change the setting. 22 is the maximum number of digits R can display.
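A minimal illustration with the number from the question (a sketch; anything beyond roughly the 15th-16th significant digit is not meaningful, because a double cannot store it):
getOption("digits")            # [1] 7  (the default)
options(digits = 22)
x <- 1054792997932.50564756
x                              # now printed with up to 22 significant digits,
                               # but only ~15-16 of them reflect the original value
options(digits = 7)            # restore the default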

Scientific notation issue in R

I have an ID variable with 20 digits. Once I read the data into R, it changes to scientific notation, and if I then write the same ID to a CSV file, the value of the ID changes.
For example, running the code below should print the value of x as "12345678912345678912", but it prints "12345678912345679872":
Code:
options(scipen=999)
x <- 12345678912345678912
print(x)
Output:
[1] 12345678912345679872
My questions are :
1) Why is this happening?
2) How can I fix this problem?
I know it has to do with how R stores data types, but I still think there should be some way to deal with this problem. I hope I am clear with this question.
I don't know whether this question has been asked before, so point me to a link if it's a duplicate and I will remove this post.
I have gone through this, so I can relate it to my issue, but I am unable to fix it.
Any help would be highly appreciated. Thanks
R does not by default handle integers numerically larger than 2147483647L.
If you append an L to your number (to tell R it's an integer), you get:
x <- 12345678912345678912L
#Warning message:
#non-integer value 12345678912345678912L qualified with L; using numeric value
This also explains the change in the last digits, as R stores the number as a double.
I think the gmp package should be able to handle large numbers in general. You should therefore either accept the loss of precision, store the values as character strings, or use a data type from the gmp package.
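A hedged sketch of the gmp route (assumes the gmp package is installed; the key point is to construct the value from a character string, since a numeric literal has already lost precision):
library(gmp)
x <- as.bigz("12345678912345678912")   # build from a string, not a numeric literal
x                                      # prints the full 20-digit value: 12345678912345678912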
To circumvent the problem with number storage/representation, you can import your ID variable directly as character with the colClasses option. For example, if using read.csv and importing a data.frame with the ID column and another numeric column:
mydata <- read.csv("file.csv", colClasses = c("character", "numeric"), ...)
Using readr you can do
mydata <- readr::read_csv("file.csv", col_types = list(ID=col_character()))
where "ID" is the name of your ID column

Long number field does not retain string of digits due to Excel. Excel Office Professional 2013 converts digits to a rounded number

I'm reading a CSV file in R that includes a conversion ID column. The issue I'm running into is that my conversion ID is being rounded and displayed as an exponential number. Below is a snapshot of the CSV file (opened in Excel) that I'm reading into R. As you can see, the conversion ID is in exponential format, but the value is 383305820480.
When I read the data into R using the following lines, I get the output below, which looks like it's rounding the string of conversion IDs.
x<-read.csv("./Test2.csv")
options("scipen"=100, "digits"=15)
x
When I export the file as CSV, using the code
write.csv(x,"./Test3.csv")
I get the following output. As you can see, I no longer have a unique identifier as it rounds the number.
I also tried reading the file as a factor, using the code below, but I get the same output with numbers rounded. I need the Conversion.ID to be a unique identifier.
x<-read.csv("./Test2.csv", colClasses="character")
The only way I can get the Conversion ID column to stay as a unique identifier is to open the CSV file and write a ' in front of each conversion ID. That is not scalable because I have hundreds of files.
I can't replicate your experience.
(Update: OP reports that the problem is actually with Excel converting/rounding the data on import [!!!])
I created a file on disk with full precision (I don't know the least-significant digits of your data, you didn't show them except for the first element, but I put a non-zero value in the units place for illustration):
writeLines(c(
"Conversion ID",
" 383305820480",
" 39634500000002",
" 213905000000002",
"1016890000000002",
"1220910000000002"),
con="Test2.csv")
Read the file and print it with full precision (use check.names=FALSE for perfect "round trip" capability -- not something you want to do on a regular basis):
x <- read.csv("Test2.csv",check.names=FALSE)
options(scipen=100)
print(x,digits=20)
## Conversion ID
## 1 383305820480
## 2 39634500000002
## 3 213905000000002
## 4 1016890000000002
## 5 1220910000000002
Looks OK.
Now write output (use row.names=FALSE to avoid adding row names/allow a clean round-trip):
write.csv(x,"Test3.csv",row.names=FALSE,quote=FALSE)
The least-mediated way to examine a file on disk from within R is file.show():
file.show("Test3.csv")
## Conversion ID
## 383305820480
## 39634500000002
## 213905000000002
## 1016890000000002
## 1220910000000002
x3 <- read.csv("Test3.csv",check.names=FALSE)
all.equal(x,x3) ## TRUE
Use system tools to check that the files are the same (except for white space differences -- the original file was right-justified):
system("diff -w Test2.csv Test3.csv") ## no difference
If you have even longer ID strings you will need to read them as character to avoid loss of precision:
read.csv("Test2.csv",colClasses="character")
## Conversion.ID
## 1 383305820480
## 2 39634500000002
## 3 213905000000002
## 4 1016890000000002
## 5 1220910000000002
You could probably round-trip through Excel more safely (if you still think that's a good idea) by importing as character and exporting with quotation marks to protect the values.
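A sketch of that last suggestion (file names are illustrative): read everything as character and let write.csv quote the fields on output, so the full values survive the R side of the round trip:
x <- read.csv("Test2.csv", colClasses = "character", check.names = FALSE)
write.csv(x, "Test4.csv", row.names = FALSE, quote = TRUE)   # fields written as "383305820480", ...
Whether Excel honours the quoting when it re-imports the file is a separate question.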
I just figured out the issue. It looks like my version of Excel is converting the data, causing it to lose the digits. If I avoid opening the file in Excel after downloading it, it retains all the digits. I'm not sure if this is a known issue with newer versions. I'm using Excel Office Professional Plus 2013.

Force R not to use exponential notation (e.g. e+10)?

Can I force R to use regular numbers instead of using the e+10-like notation? I have:
1.810032e+09
# and
4
within the same vector and want to see:
1810032000
# and
4
I am creating output for an old fashioned program and I have to write a text file using cat.
That works fine so far but I simply can't use the e+10 notation there.
This is a bit of a grey area. You need to recall that R will always invoke a print method, and these print methods listen to some options, including 'scipen' -- a penalty for scientific display. From help(options):
‘scipen’: integer. A penalty to be applied when deciding to print
numeric values in fixed or exponential notation. Positive
values bias towards fixed and negative towards scientific
notation: fixed notation will be preferred unless it is more
than ‘scipen’ digits wider.
Example:
R> ran2 <- c(1.810032e+09, 4)
R> options("scipen"=-100, "digits"=4)
R> ran2
[1] 1.81e+09 4.00e+00
R> options("scipen"=100, "digits"=4)
R> ran2
[1] 1810032000 4
That said, I still find it fudgeworthy. The most direct way is to use sprintf() with explicit width e.g. sprintf("%.5f", ran2).
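For completeness, the sprintf() route applied to the same example vector (output shown as a comment):
ran2 <- c(1.810032e+09, 4)
sprintf("%.5f", ran2)
## [1] "1810032000.00000" "4.00000"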
It can be achieved by disabling scientific notation in R.
options(scipen = 999)
My favorite answer:
format(1810032000, scientific = FALSE)
# [1] "1810032000"
This gives what you want without having to muck about in R settings.
Note that it returns a character string rather than a number object.
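Since the original question is about writing a text file with cat, here is a small sketch combining the two (the file name is illustrative; note that format() pads the elements of a vector to a common width):
x <- c(1.810032e+09, 4)
out <- format(x, scientific = FALSE)
out
## [1] "1810032000" "         4"
cat(out, sep = "\n", file = "out.txt")   # one value per line, no e+09 notation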
Put options(scipen = 999) in your .Rprofile file so it gets auto-executed by default. (Do not rely on doing it manually.)
(This is saying something different from the other answers: how?)
Putting the setting in .Rprofile keeps things sane when you switch between multiple projects and multiple languages on a daily or monthly basis. Remembering to type in your per-project settings is error-prone and not scalable. You can have a global ~/.Rprofile or a per-project .Rprofile, or both, with the latter overriding the former.
Keeping all your config in a per-project or global .Rprofile means it is executed automatically. This is useful for e.g. default package loads, data.table configuration, environment settings, etc. That config can run to a page of settings, and there's zero chance you'll remember them and their syntax and type them in each time.
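A minimal example of what such a file might contain (contents are illustrative):
# ~/.Rprofile (or a per-project .Rprofile) -- executed automatically at startup
options(scipen = 999)    # prefer fixed notation over scientific
# other per-session defaults can live here too, e.g.
# options(digits = 7)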

Setting column width in a data set

I would like to set column widths (for all 3 columns) in this data set, as: anim=1-10; sireid=11-20; damid=21-30. Some columns have missing values.
anim=c("1A038","1C467","2F179","38138","030081")
sireid=c("NA","NA","1W960","1W960","64404")
damid=c("NA","NA","1P119","1P119","63666")
mydf=data.frame(anim,sireid,damid)
From reading your question as well as your comments to previous answers, it seems to me that you are trying to create a fixed width file with your data. If this is the case, you can use the function write.fwf in package gdata:
Load the package and create a temporary output file:
library(gdata)
ff <- tempfile()
Write your data in fixed width format to the temporary file:
write.fwf(mydf, file=ff, width=c(10,10,10), colnames=FALSE)
Read the file with scan and print the results (to demonstrate fixed width output):
zz <- scan(ff, what="character", sep="\n")
cat(zz, sep="\n")
1A038 NA NA
1C467 NA NA
2F179 1W960 1P119
38138 1W960 1P119
030081 64404 63666
Delete the temporary file:
unlink(ff)
You can also write fixed width output for numbers and strings using the sprintf() function, which derives from C's counterpart.
For instance, to pad integers with 0s:
sprintf("%012d",99)
To pad with spaces:
sprintf("%12d",123)
And to pad strings:
sprintf("%20s","hello world")
The options for formatting are found via ?sprintf and there are many guides to formatting C output for fixed width.
It sounds like you're coming from a SAS background, where character variables should have explicit lengths specified to avoid unexpected truncations. In R, you don't need to worry about this. A character string has exactly as many characters as it needs, and automatically expands and contracts as its contents change.
One thing you should be aware of, though, is silent conversion of character variables to factors in a data frame. However, unless you change the contents at a later point in time, you should be able to live with the default.
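If that factor conversion is a concern (it was the default behavior prior to R 4.0.0), it can be switched off explicitly when building the data frame:
mydf <- data.frame(anim, sireid, damid, stringsAsFactors = FALSE)
str(mydf)   # all three columns remain character vectors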
