R read_delim() changes values when reading data

I am trying to read in a tab-separated file using read_delim(). For some reason the function seems to change some field entries with integer values:
# Here some example data
# This should have 3 columns and 1 row
file_string = c("Tage\tID\tVISITS\n19.02.01\t2163994407707046646\t40")
# reading that data using read_delim()
data = read_delim(file_string, delim = "\t")
view(data)
data$ID
2163994407707046656 # This should be 2163994407707046646
I totally do not understand what is happening here. If I change the column type to character, the entry stays the same. Does anyone have an explanation for this?
Happy about any help!

Your number has so many digits that it does not fit into the R object. According to the IEEE 754 specification, a double has 53 bits of precision, which corresponds to roughly 15-16 significant decimal digits. You reach that limit with as.double("2163994407707046646").
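You can see this collapse without any file involved, and the usual fix is simply to keep the ID as text. The col_types line below is a sketch assuming the readr package:

```r
# Two different 19-digit integers land on the same double, because doubles
# near 2e18 are spaced 256 apart (53 bits of mantissa):
as.double("2163994407707046646") == as.double("2163994407707046656")  # TRUE

# Sketch of the workaround: read the ID column as character so no
# numeric conversion ever happens (assuming the readr package):
# data <- readr::read_delim(file_string, delim = "\t",
#                           col_types = readr::cols(ID = readr::col_character()))
```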

Related

Splitting a column in a dataframe in R into two based on content

I have a column in an R dataframe that holds a product weight, e.g. 20 kg, but it uses mixed measurement systems, e.g. 1 lbs and 2 kg. I want to separate the value from the unit, put them in separate columns, and then convert them in a new column to a standard weight. Any thoughts on how I might achieve that? Thanks in advance.
Assume you have the column given as
x <- c("20 kg","50 lbs","1.5 kg","0.02 lbs")
and you know that there is always a space between the number and the measurement. Then you can split this up at the space-character, e.g. via
splitted <- strsplit(x," ")
This results in a list of vectors of length two, where the first is the number and the second is the measurement.
Now grab the numbers and convert them via
numbers <- as.numeric(sapply(splitted,"[[",1))
and grab the units via
units <- sapply(splitted,"[[",2)
Now you can put everything together in a data.frame.
Note: when using as.numeric, the decimal separator has to be a dot. If you have commas instead, replace them first, for example via gsub(",", ".", x, fixed = TRUE).
With tidyr, separate() does the split in one step:
separate(DataFrame, VariableName, into = c("Value", "Metric"), sep = " ")
My case was simple enough that I could get away with just a single space as separator, but I learned you can also use a regular expression here for more complex separator considerations.
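The steps above can be combined into a small end-to-end sketch; the 0.45359237 kg-per-lb factor and the column names are my own additions for illustration:

```r
x <- c("20 kg", "50 lbs", "1.5 kg", "0.02 lbs")

# Split "<number> <unit>" at the space, as above
splitted <- strsplit(x, " ")
numbers  <- as.numeric(sapply(splitted, "[[", 1))
units    <- sapply(splitted, "[[", 2)

# Put everything in a data.frame and standardise to kilograms
# (1 lb = 0.45359237 kg exactly, by definition)
weights <- data.frame(value = numbers, unit = units)
weights$kg <- ifelse(weights$unit == "lbs",
                     weights$value * 0.45359237,
                     weights$value)
```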

Why R doesn't show digit after decimal point in numeric data type?

I have a data frame read from a file into R. The data contains some numeric columns, some with digits after the decimal point and some without. The columns without digits after the decimal point have the following issue:
If I do some manipulation on such a column that results in values with digits after the decimal point, say 22.5, R does not show the value in this format; it only shows 22. But checking it in an if condition confirms the value is actually 22.5. This does not happen when the original data contains some decimal points.
Could anyone let me know how to resolve this issue?
This is a FAQ. Presentation may differ from content, as display is generally optimised for readable output. A really simple example follows:
> df <- data.frame(a=c(10000.12, 10000.13), b=c(42L, 43L))
> df
a b
1 10000.1 42
2 10000.1 43
> all.equal(df[1,"a"], 10000.12)
[1] TRUE
>
So the last digit did not "disappear" as the test confirms---it is simply beyond the (six in the default) digits displayed.
Similarly, you can always explicitly display with more decimals than the (compact, default) displays do:
> cat(sprintf("%14.8f", df[1,"a"]), "\n")
10000.12000000
>
Edit: You can also increase the default display size by one or more digits.
options(digits=7) is the minimal change; with seven significant digits the full value now fits:
> options(digits=7)
> df
a b
1 10000.12 42
2 10000.13 43
>
Needless to say, if you had decimals such as .123, only the first two would be shown, and so on.
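If you only want more digits for a single display rather than globally, print() and format() take per-call arguments too. A small sketch using the same df as above:

```r
df <- data.frame(a = c(10000.12, 10000.13), b = c(42L, 43L))

print(df, digits = 10)    # per-call override; the global option stays untouched
format(df$a, nsmall = 2)  # force at least two decimals in the formatted output
```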

How to import really large numbers into R? [duplicate]

I am importing a csv that has a single column which contains very long integers (for example: 2121020101132507598)
a<-read.csv('temp.csv',as.is=T)
When I import these integers as strings they come through correctly, but when imported as integers the last few digits are changed. I have no idea what is going on...
1 "4031320121153001444" 4031320121153001472
2 "4113020071082679601" 4113020071082679808
3 "4073020091116779570" 4073020091116779520
4 "2081720101128577687" 2081720101128577792
5 "4041720081087539887" 4041720081087539712
6 "4011120071074301496" 4011120071074301440
7 "4021520051054304372" 4021520051054304256
8 "4082520061068996911" 4082520061068997120
9 "4082620101129165548" 4082620101129165312
As others have noted, you can't represent integers that large. But R isn't reading those values into integers, it's reading them into double precision numerics.
Double precision can only represent numbers to ~16 places accurately, which is why you see your numbers rounded after 16 places. See the gmp, Rmpfr, and int64 packages for potential solutions. Though I don't see a function to read from a file in any of them, maybe you could cook something up by looking at their sources.
UPDATE:
Here's how you can get your file into an int64 object:
# This assumes your numbers are the only column in the file
# Read them in however, just ensure they're read in as character
a <- scan("temp.csv", what="")
ia <- as.int64(a)
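The rounding in the question's output can be checked directly, with no file involved; both string literals in its first row convert to the same double:

```r
# Doubles near 4e18 are spaced 512 apart, so these two distinct 19-digit
# integers are indistinguishable once converted to double:
as.numeric("4031320121153001444") == as.numeric("4031320121153001472")  # TRUE
```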
R's maximum integer value is about 2E9. As @Joshua mentions in another answer, one potential solution is the int64 package.
Import the values as character instead. Then convert to type int64.
require(int64)
a <- read.csv('temp.csv', colClasses = 'character', header=FALSE)[[1]]
a <- as.int64(a)
print(a)
[1] 4031320121153001444 4113020071082679601 4073020091116779570
[4] 2081720101128577687 4041720081087539887 4011120071074301496
[7] 4021520051054304372 4082520061068996911 4082620101129165548
You simply cannot represent integers that big. See
.Machine
which on my box has
$integer.max
[1] 2147483647
The maximum value of a 32-bit signed integer is 2,147,483,647. Your numbers are much larger.
Try importing them as floating point values instead.
There are a few caveats to be aware of when dealing with floating-point arithmetic in R or any other language:
http://blog.revolutionanalytics.com/2009/11/floatingpoint-errors-explained.html
http://blog.revolutionanalytics.com/2009/03/when-is-a-zero-not-a-zero.html
http://floating-point-gui.de/basic/

formatting, Sprintf in R

I am very new to this site and I am having an issue using sprintf in R. Briefly what I am trying to do is the following:
I need to create a text file with a header (which is space-delimited and has to maintain that particular spacing), below which I need to copy some numbers made of X rows (depending on the data; I will read a big table made of thousands of lines, and for each ID there will be a variable number of rows, but this is not a major problem as I can loop through them). My problem is that I cannot align the numbers below the header.
setwd("C:\\Example\\formatting")
My data are in a CSV format so I read:
s100 = read.csv("example.csv", header=T)
Then I take the columns I am interested in and transform them in this way:
SID1 = as.vector(as.matrix(s100$Row1))
SID2 = as.vector(as.matrix(s100$Row2))
SID3 = as.vector(as.matrix(s100$Row3))
SIDN = as.vector(as.matrix(s100$RowN))
Then I have the following (do not worry about the letters; that part up to a certain point is really easy). I got stuck at the end, when I need to write out the SID values:
sink("Example.xxx", append = T)
cat("*Some description goes here\n")
This goes on and on until I need to put the numbers down. So, when I arrive at this piece:
cat("# SAB SSD TAR.....", sep = "\n")
I now need to have aligned the numbers under SAB, SSD, TAR... and so on.
So now I do the following (I only tried using one column and one header first):
cat("SAB ", sep = "\n")
cat(sprintf("%s\n", SID1, sep="\n" ))
But, what I get in the end is the following:
SAB
0.30
0.40
0.50
Instead of
SAB SSD TAR
0.30 0.40 10
0.40 0.80 40
0.50 0.90 00
.... .... ...
So my two questions are:
How to solve the above problem?
Since at the beginning of that header I have a "#" spaced before the "SAB" how do I align all my numbers accordingly?
I hope I have been clear and not messy; it seems like a simple problem, but my knowledge of R and programming only goes up to a certain point.
Thank you in advance for any help!
I think the problem is your call to sprintf. SID1 provides a vector of values but you only have one position in your string that accepts input. How about replacing the last line with
cat(paste(SID1, collapse="\n"))
EDIT/UPDATE:
Here's an example that might work (xx represents your SID data combined into a matrix or data frame):
library(MASS)
xx <- matrix(rnorm(100),10)
write.matrix(round(xx, 2))
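To also get the "#"-prefixed header with fixed-width columns from the question, here is a sprintf-based sketch; the widths, column names, and sample values are illustrative assumptions:

```r
xx <- data.frame(SAB = c(0.30, 0.40, 0.50),
                 SSD = c(0.40, 0.80, 0.90),
                 TAR = c(10, 40, 0))

# "#" plus right-aligned 6-character fields for the header...
cat(sprintf("# %6s %6s %6s\n", "SAB", "SSD", "TAR"))
# ...and matching widths for the data rows; sprintf() is vectorised
# over the columns, producing one formatted string per row
cat(sprintf("  %6.2f %6.2f %6.0f\n", xx$SAB, xx$SSD, xx$TAR), sep = "")
```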

