Accurately converting from character->POSIXct->character with sub-millisecond datetimes - r

I have a character datetime column in a file. I load the file (into a data.table) and do things that require the column to be converted to POSIXct. I then need to write the POSIXct value back to file, but the datetime will not be the same (because it is printed incorrectly).
This print/formatting issue is well known and has been discussed several times; the most authoritative answers I found are given in response to this question. The answers there provide two functions (myformat.POSIXct and form) that are supposed to solve this issue, but they do not seem to work on this example:
x <- "04-Jan-2013 17:22:08.139"
options("digits.secs"=6)
form(as.POSIXct(x,format="%d-%b-%Y %H:%M:%OS"),format="%d-%b-%Y %H:%M:%OS3")
[1] "04-Jan-2013 17:22:08.138"
form(as.POSIXct(x,format="%d-%b-%Y %H:%M:%OS"),format="%d-%b-%Y %H:%M:%OS4")
[1] "04-Jan-2013 17:22:08.1390"
myformat.POSIXct(as.POSIXct(x,format="%d-%b-%Y %H:%M:%OS"),digits=3)
[1] "2013-01-04 17:22:08.138"
myformat.POSIXct(as.POSIXct(x,format="%d-%b-%Y %H:%M:%OS"),digits=4)
[1] "2013-01-04 17:22:08.1390"
My sessionInfo:
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] fasttime_1.0-0 data.table_1.8.9 bit64_0.9-2 bit_1.1-9
[5] sas7bdat_0.3 chron_2.3-43 vimcom_0.9-6
loaded via a namespace (and not attached):
[1] tools_2.15.2

So I guess you do need a little fudge factor added to my suggestion here: https://stackoverflow.com/a/7730759/210673. This seems to work but might include other bugs; test carefully and think about what it's doing before using it for anything important.
myformat.POSIXct <- function(x, digits=0) {
  x2 <- round(unclass(x), digits)
  attributes(x2) <- attributes(x)
  x <- as.POSIXlt(x2)
  x$sec <- round(x$sec, digits) + 10^(-digits-1)
  format.POSIXlt(x, paste("%Y-%m-%d %H:%M:%OS", digits, sep=""))
}
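A quick sanity check with the example from the top of the question (not re-run here; the commented output is what I would expect from the fudge-factor version above):
x <- as.POSIXct("04-Jan-2013 17:22:08.139", format="%d-%b-%Y %H:%M:%OS")
myformat.POSIXct(x, digits=3)
# [1] "2013-01-04 17:22:08.139"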

As the answers to the questions you linked to already say, how a value is printed/formatted is not the same as what the actual value is. This is just a printed representation issue.
R> as.POSIXct('2011-10-11 07:49:36.3')-as.POSIXlt('2011-10-11 07:49:36.3')
Time difference of 0 secs
R> as.POSIXct('2011-10-11 07:49:36.2')-as.POSIXlt('2011-10-11 07:49:36.3')
Time difference of -0.0999999 secs
Your understanding that POSIXct is less precise than POSIXlt is incorrect. You're also incorrect in saying that you can't include a POSIXlt object as a column in a data.frame.
R> x <- data.frame(date=Sys.time())
R> x$date <- as.POSIXlt(x$date)
R> str(x)
'data.frame': 1 obs. of 1 variable:
$ date: POSIXlt, format: "2013-03-13 07:38:48"

Two things:
1) @statquant is right (and the otherwise well-known experts @Joshua Ulrich and @Dirk Eddelbuettel are wrong), as is @Aaron in his comment, but that will not be important for the main question here:
POSIXlt by design is definitely more accurate in storing times than POSIXct: as its seconds are always in [0, 60), it has a granularity of about 6e-15, i.e., about 6 femtoseconds, which is dozens of millions of times finer-grained than POSIXct.
However, this is not very relevant here (and for current R): almost all operations, notably numeric ones, use the Ops group method (yes, not known to beginners, but well documented); just look at Ops.POSIXt, which indeed trashes the extra precision by first coercing to POSIXct. In addition, format()ing and print()ing use at most 6 decimals after the ".", and hence also do not distinguish between the internally higher precision of POSIXlt and the "only" 100-nanosecond granularity of POSIXct.
(For the above reason, both Dirk and Joshua were led to their wrong assertion: for all simple practical uses, the precision of *lt and *ct is made the same.)
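To illustrate the coercion point (a quick check, not from the original answer; the exact fractional digits printed may differ):
x  <- "2013-01-04 17:22:08.139"
lt <- as.POSIXlt(x)
ct <- as.POSIXct(x)
lt == ct                  # TRUE: Ops.POSIXt coerces both sides to POSIXct before comparing
class(lt + 0.5)           # "POSIXct" "POSIXt": +.POSIXt likewise returns a POSIXct
format(lt, "%H:%M:%OS6")  # at most 6 decimals are ever shown (and %OS truncates rather than rounds)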
2) I do tend to agree that we (R Core) should improve the format()ing and hence print()ing of such fractional-seconds POSIXt objects (even after the bug fix mentioned by @Aaron above).
But then I may be wrong, and "we" have got it right, by some definition of "right" ;-)

When you write
My understanding is that POSIXct representation is less precise than
the POSIXlt representation
you are plain wrong.
It is the same representation for both -- down to milliseconds on Windows, and down to (almost) microseconds on the other OSs. Did you read help(DateTimeClasses) ?
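For reference, one way to see the two internal representations that help(DateTimeClasses) describes (a small sketch, not part of the original answer):
x <- Sys.time()
unclass(x)                    # POSIXct: a single double, seconds since 1970-01-01 UTC
str(unclass(as.POSIXlt(x)))   # POSIXlt: a list of components; $sec holds fractional seconds as a double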
As for your last question, yes the development version of my RcppBDT package uses Boost Date.Time and can go all the way to nanoseconds if your OS supports it and you turned the proper representation on. But it does replace POSIXct, and does not yet support vectors of time objects.
Edit: Regarding your follow-up question:
R> one <- Sys.time(); two <- Sys.time(); two - one
Time difference of 7.43866e-05 secs
R>
R> as.POSIXlt(two) - as.POSIXlt(one)
Time difference of 7.43866e-05 secs
R>
R> one # options("digits.secs"=6) on my box
[1] "2013-03-13 07:30:57.757937 CDT"
R>
Edit 2: I think you are simply experiencing that floating point representation on computers is inexact:
R> print(as.numeric(as.POSIXct("04-Jan-2013 17:22:08.138",
+ format="%d-%b-%Y %H:%M:%OS")), digits=18)
[1] 1357341728.13800001
R> print(as.numeric(as.POSIXct("04-Jan-2013 17:22:08.139",
+ format="%d-%b-%Y %H:%M:%OS")), digits=18)
[1] 1357341728.13899994
R>
The difference is not precisely 1/1000 as you assumed.

Related

R mdy_hms unpredictable results?

I'm working with the mdy_hms function on some data I have and am running into an interesting problem. I'm getting data uploaded from many sources, but they all should be in csv format and conform to the same guidelines, so they SHOULD all be identical formats.
I have 2 variables.
> good_time
[1] "12/28/2019 16:22"
> test_time
[1] "3/4/2020 16:46"
> str(good_time)
chr "12/28/2019 16:22"
> str(test_time)
chr "3/4/2020 16:46"
So they appear to be the same to me in terms of formatting, but good_time can parse just fine via mdy_hms whereas test_time cannot. Can anyone explain to me why?
> mdy_hms(good_time)
[1] "2020-12-28 19:16:22 UTC"
> mdy_hms(test_time)
[1] NA
Warning message:
All formats failed to parse. No formats found.
Oddly enough if I use mdy_hm(test_time) it works fine.
> mdy_hm(test_time)
[1] "2020-03-04 16:46:00 UTC"
lubridate is expecting leading zeroes in single-digit months (and days).
From ?lubridate::mdy_hms:
truncated: integer, indicating how many formats can be missing. See details.
...
The most common type of irregularity in date-time data is the truncation due to rounding or unavailability of the time stamp. If the 'truncated' parameter is non-zero, the 'ymd_hms()' functions also check for truncated formats. For example, 'ymd_hms()' with 'truncated = 3' will also parse incomplete dates like '2012-06-01 12:23', '2012-06-01 12' and '2012-06-01'. NOTE: The 'ymd()' family of functions is based on 'base::strptime()' which currently fails to parse '%y-%m' formats.
Just add truncated=1:
lubridate::mdy_hms("3/4/2020 16:46", truncated=1)
# [1] "2020-03-04 16:46:00 UTC"
(This was also discussed in tidyverse/lubridate#669.)
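As an alternative (my addition, not part of the original answer), parse_date_time with an explicit orders string sidesteps the missing-seconds problem entirely:
lubridate::parse_date_time("3/4/2020 16:46", orders = "mdy HM")
# [1] "2020-03-04 16:46:00 UTC"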

R: How to convert long number to string to save precision

I have a problem converting a long number to a string in R. How can I easily convert a number to a string while preserving precision? I have a simple example below.
a = -8664354335142704128
toString(a)
[1] "-8664354335142704128"
b = -8664354335142703762
toString(b)
[1] "-8664354335142704128"
a == b
[1] TRUE
I expected toString(a) and toString(b) to give different strings, but they return the same value. I suppose toString() converts the number to a float or something like that before converting it to a string.
Thank you for your help.
Edit:
> -8664354335142704128 == -8664354335142703762
[1] TRUE
> along = bit64::as.integer64(-8664354335142704128)
> blong = bit64::as.integer64(-8664354335142703762)
> along == blong
[1] TRUE
> blong
integer64
[1] -8664354335142704128
I also tried:
> as.character(blong)
[1] "-8664354335142704128"
> sprintf("%f", -8664354335142703762)
[1] "-8664354335142704128.000000"
> sprintf("%f", blong)
[1] "-0.000000"
Edit 2:
My original question was whether I can convert a long number to a string without loss. Then I realized that in R it is impossible to get the real value of a long number literal passed into a function, because R has already read the value with the loss.
For example, I have the function:
> my_function <- function(long_number){
+ string_number <- toString(long_number)
+ print(string_number)
+ }
If someone uses it and passes a long number, I am not able to tell exactly which number was passed.
> my_function(-8664354335142703762)
[1] "-8664354335142704128"
For example, if I read the numbers from a file, it is easy. But that is not my case; I just need to use whatever value a user passed.
I am not an R expert, so I was just curious why this works in other languages but not in R. For example, in Python:
>>> def my_function(long_number):
... string_number = str(long_number)
... print(string_number)
...
>>> my_function(-8664354335142703762)
-8664354335142703762
Now I know the problem is how R reads and stores numbers; every language can do this differently. I have to change the way I pass numbers to the R function, and that solves my problem.
So the correct answer to my question is:
""I suppose toString() converts the number to float", nope, you did it yourself (even if unintentionally)." - Nope, R did it itself, that is the way how R reads numbers.
So I marked r2evans answer as the best answer because this user helped me to find the right solution. Thank you!
Bottom line up front, you must (in this case) read in your large numbers as string before converting to 64-bit integers:
bit64::as.integer64("-8664354335142704128") == bit64::as.integer64("-8664354335142703762")
# [1] FALSE
Some points about what you've tried:
"I suppose toString() converts the number to float", nope, you did it yourself (even if unintentionally). In R, when creating a number, 5 is a float and 5L is an integer. Even if you had tried to create it as an integer, it would have complained and lost precision anyway:
class(5)
# [1] "numeric"
class(5L)
# [1] "integer"
class(-8664354335142703762)
# [1] "numeric"
class(-8664354335142703762L)
# Warning: non-integer value 8664354335142703762L qualified with L; using numeric value
# [1] "numeric"
More appropriately, when you type it in as a number and then try to convert it, R processes the inside of the parentheses first. That is, with
bit64::as.integer64(-8664354335142704128)
R first has to parse and "understand" everything inside the parentheses before it can be passed to the function. (This is typically a compiler/language-parsing thing, not just an R thing.) In this case, it sees that it appears to be a (large) negative float, so it creates a class numeric (float). Only then does it send this numeric to the function, but by this point the precision has already been lost. Ergo the otherwise-illogical
bit64::as.integer64(-8664354335142704128) == bit64::as.integer64(-8664354335142703762)
# [1] TRUE
In this case, it just happens that the 64-bit version of that number is equal to what you intended.
bit64::as.integer64(-8664254335142704128) # ends in 4128
# integer64
# [1] -8664254335142704128 # ends in 4128, yay! (coincidence?)
If you change the last digit by one (...4127 instead of ...4128), it still results in the same effective integer64:
bit64::as.integer64(-8664354335142704127) # ends in 4127
# integer64
# [1] -8664354335142704128 # ends in 4128 ?
This continues for quite a while, until it finally shifts to the next rounding point
bit64::as.integer64(-8664254335142703617)
# integer64
# [1] -8664254335142704128
bit64::as.integer64(-8664254335142703616)
# integer64
# [1] -8664254335142703104
It is unlikely to be coincidence that the difference is 1024, or 2^10. I haven't fished yet, but I'm guessing there's something meaningful about this with respect to floating point precision in 32-bit land.
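One way to check where the 1024 comes from (my sketch; the 52 below is the mantissa width of a double, which suggests the effect comes from 64-bit double precision rather than anything 32-bit): the number lies between 2^62 and 2^63, so adjacent doubles in that range are 2^(62-52) = 1024 apart.
2^(floor(log2(8664354335142703762)) - 52)  # spacing of doubles near this magnitude
# [1] 1024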
Fortunately, bit64::as.integer64 has several S3 methods, useful for converting different formats/classes to integer64:
library(bit64)
methods(as.integer64)
# [1] as.integer64.character as.integer64.double as.integer64.factor
# [4] as.integer64.integer as.integer64.integer64 as.integer64.logical
# [7] as.integer64.NULL
So, bit64::as.integer64.character can be useful, since precision is not lost when you type it or read it in as a string:
bit64::as.integer64("-8664354335142704128")
# integer64
# [1] -8664354335142704128
bit64::as.integer64("-8664354335142704128") == bit64::as.integer64("-8664354335142703762")
# [1] FALSE
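Applied to the OP's my_function, a minimal sketch (the argument name is mine) is to accept the value as a string from the start:
my_function <- function(long_number_string) {
  big <- bit64::as.integer64(long_number_string)  # exact, because the input is character
  print(as.character(big))
}
my_function("-8664354335142703762")
# [1] "-8664354335142703762"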
FYI, your number is already near the 64-bit boundary:
-.Machine$integer.max
# [1] -2147483647
-(2^31-1)
# [1] -2147483647
log(8664354335142704128, 2)
# [1] 62.9098
-2^63 # the approximate +/- range of 64-bit integers
# [1] -9.223372e+18
-8664354335142704128
# [1] -8.664354e+18
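If I remember the bit64 API correctly, the exact integer64 range (rather than the double approximation above) is also available directly:
bit64::lim.integer64()
# integer64
# [1] -9223372036854775807  9223372036854775807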

Minimum evaluatable scientific value?

Does R store the smallest possible representable scientific value?
To clarify: on my current machine:
>1e-300
[1] 1e-300
While
>1e-400
[1] 0
Through trial and error I know it is somewhere around the e-324 mark on my machine (where it also starts losing precision).
>1e-324
[1] 0
>5e-324
[1] 4.940656e-324
I've searched through the .Machine list and none of the values it stores contain either the value, or the exponent I'm looking for.
Edit:
Linked threads on the side indicate that this should be .Machine$double.eps, which is 2.220446e-16. Clearly this is no longer the case?
The smallest normalised double is .Machine$double.xmin, as described on this page. The Wikipedia entry is very interesting and gives the subnormal limit of 2^-1074, which is approximately 4.9406564584124654 x 10^-324 (from Wikipedia, as Ben Bolker mentioned in the comments). Your output in R matches this value.
.Machine$double.eps is not what you think. It is the smallest number you can add to 1 such that you obtain a number which is not recognised as 1.
I suggest you read about how doubles are stored in memory and the basics of double operations. Once you understand how a double is stored, the lower limit is obvious.
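For concreteness, the relevant quantities can be inspected directly; the commented values are what I would expect on a machine with IEEE-754 doubles:
.Machine$double.xmin   # smallest normalised double, about 2.225074e-308
2^-1074                # smallest subnormal double, about 4.940656e-324
2^-1075                # underflows to exactly 0
.Machine$double.eps    # about 2.220446e-16: smallest x such that 1 + x != 1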
The accepted answer remains correct for base R, but using the package Rmpfr enables arbitrary precision. Example:
First, note issue in base R:
> p <- c("5e-600","2e-324","3e-324","4e-324", "5e-324","6e-324","7.1e-324","8e-324")
> as.numeric(p)
[1] 0.000000e+00 0.000000e+00 4.940656e-324 4.940656e-324 4.940656e-324 4.940656e-324
[7] 4.940656e-324 9.881313e-324
Observe that as we near the limit, precision becomes an issue and several of the values collapse to 4.940656e-324.
Now use the mpfr() function from the Rmpfr package to cast the strings as floats:
> library(Rmpfr)
> .N <- function(.) mpfr(., precBits = 20)
> .N(p)
8 'mpfr' numbers of precision 20 bits
[1] 5.0000007e-600 2.00000e-324 2.9999979e-324 4.00000e-324 4.9999966e-324 5.9999959e-324
[7] 7.09999e-324 8.00000e-324

Multicore and memory usage in R under Ubuntu

I am running R on an Ubuntu workstation with 8 virtual cores and 8 GB of RAM. I was hoping to routinely use the multicore package to make use of the 8 cores in parallel; however, I find that the whole R process gets duplicated 8 times.
As R actually seems to use much more memory than is reported by gc (by a factor of 5, even after gc()), this means that even a relatively mild memory usage (one 200 MB object) becomes intractably memory-heavy once duplicated 8 times.
I looked into bigmemory to have the child processes share the same memory space; but it would require some major rewriting of my code as it doesn't deal with dataframes.
Is there a way to make R as lean as possible before forking, i.e. have the OS reclaim as much memory as possible?
EDIT:
I think I understand what is going on now. The problem is not where I thought it was -- objects that exist in the parent process and are not manipulated do not get duplicated eight times. Instead, my problem, I believe, comes from the nature of the manipulation each child process performs. Each has to manipulate a big factor with hundreds of thousands of levels, and I think this is the memory-heavy bit. As a result, the overall memory load is indeed proportional to the number of cores, but not as dramatically as I thought.
Another lesson I learned is that with 4 physical cores plus hyperthreading, hyperthreading is typically not a good idea for R: the gain is minimal and the memory cost can be non-trivial. So I'll be working on 4 cores from now on.
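For reference, a quick way to compare physical and logical core counts (a sketch using the parallel package, which post-dates the multicore setup below; logical = FALSE is not supported on every platform and may return NA):
parallel::detectCores(logical = TRUE)   # logical cores, including hyperthreads
parallel::detectCores(logical = FALSE)  # physical cores only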
For those who would like to experiment, this is the type of code I was running:
# Create data
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) {
  sampdata[, letter] <- rnorm(1000000)
}
sampdata$groupid = ceiling(sampdata$id/2)
# Enable multicore
library(multicore)
options(cores=4) # number of cores to distribute the job to
# Actual job
system.time(do.call("cbind",
  mclapply(subset(sampdata, select = c(a:z)), function(x) tapply(x, sampdata$groupid, sum))
))
Have you tried data.table?
> system.time(ans1 <- do.call("cbind",
lapply(subset(sampdata,select=c(a:z)),function(x)tapply(x,sampdata$groupid,sum))
))
user system elapsed
906.157 13.965 928.645
> require(data.table)
> DT = as.data.table(sampdata)
> setkey(DT,groupid)
> system.time(ans2 <- DT[,lapply(.SD,sum),by=groupid])
user system elapsed
186.920 1.056 191.582 # 4.8 times faster
> # massage minor diffs in results...
> ans2$groupid=NULL
> ans2=as.matrix(ans2)
> colnames(ans2)=letters
> rownames(ans1)=NULL
> identical(ans1,ans2)
[1] TRUE
Your example is very interesting. It is reasonably large (200MB), there are many groups (1/2 million), and each group is very small (2 rows). The 191s can probably be improved by quite a lot, but at least it's a start. [March 2011]
And now, this idiom (i.e. lapply(.SD,...)) has been improved a lot. With v1.8.2, on a faster computer than the test above, and with the latest version of R etc., here is the updated comparison:
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) sampdata[, letter] <- rnorm(1000000)
sampdata$groupid = ceiling(sampdata$id/2)
dim(sampdata)
# [1] 1000000 28
system.time(ans1 <- do.call("cbind",
lapply(subset(sampdata,select=c(a:z)),function(x)
tapply(x,sampdata$groupid,sum))
))
# user system elapsed
# 224.57 3.62 228.54
DT = as.data.table(sampdata)
setkey(DT,groupid)
system.time(ans2 <- DT[,lapply(.SD,sum),by=groupid])
# user system elapsed
# 11.23 0.01 11.24 # 20 times faster
# massage minor diffs in results...
ans2[,groupid:=NULL]
ans2[,id:=NULL]
ans2=as.matrix(ans2)
rownames(ans1)=NULL
identical(ans1,ans2)
# [1] TRUE
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.8.2 RODBC_1.3-6
Things I've tried on Ubuntu 64-bit R, ranked in order of success:
Work with fewer cores, as you are doing.
Split the mclapply jobs into pieces, and save the partial results to a database using DBI with append=TRUE.
Use the rm function along with gc() often (a sketch combining these last two ideas follows at the end of this answer).
I have tried all of these, and mclapply still begins to create larger and larger processes as it runs, leading me to suspect each process is holding onto some sort of residual memory it really doesn't need.
P.S. I was using data.table, and it seems each child process copies the data.table.
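For completeness, here is a sketch combining the second and third ideas above: chunk the work, free large intermediates inside each child, and append the partial results to a database. chunk_list, expensive_step() and aggregate_result() are hypothetical placeholders, and RSQLite is assumed to be installed.
library(parallel)   # mclapply now lives here (successor to multicore)
library(DBI)
process_chunk <- function(chunk) {
  big <- expensive_step(chunk)      # hypothetical memory-heavy computation
  out <- aggregate_result(big)      # hypothetical reduction to a small data.frame
  rm(big); gc()                     # release the large intermediate before returning
  out
}
results <- mclapply(chunk_list, process_chunk, mc.cores = 4)
con <- dbConnect(RSQLite::SQLite(), "partial_results.sqlite")
for (res in results) dbWriteTable(con, "results", res, append = TRUE)
dbDisconnect(con)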

unexpected agrep() results related to max.distance in R

EDIT: This bug, which was found in 32-bit versions of R, was fixed in R version 2.9.2.
This was tweeted to me by @leoniedu today, and I don't have an answer for him, so I thought I would post it here.
I have read the documentation for agrep() (fuzzy string matching) and it appears that I don't fully understand the max.distance parameter. Here's an example:
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
agrep(pattern,x,max.distance=18)
agrep(pattern,x,max.distance=19)
That behaves exactly as I would expect: the strings differ by 18 characters, so I would expect that to be the threshold of a match. Here's what's confusing me:
agrep(pattern,x,max.distance=30)
agrep(pattern,x,max.distance=31)
agrep(pattern,x,max.distance=32)
agrep(pattern,x,max.distance=33)
Why are 30 and 33 matches, but not 31 and 32? To save you some counting,
> nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
> nchar("Bundeskanzleramt")
[1] 16
I posted this on the R list a while back and reported it as a bug on R-bugs-list. I had no useful responses, so I tweeted to see whether the bug was reproducible or I was just missing something. JD Long was able to reproduce it and kindly posted the question here.
Note that, at least in R, agrep is a bit of a misnomer since it does not match regular expressions, whereas grep stands for "Globally search for the Regular Expression and Print". It shouldn't have a problem with patterns longer than the target vector. (I think!)
On my Linux server, all is well, but not so on my Mac and Windows machines.
Mac:
sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin8.11.1
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
agrep(pattern,x,max.distance=30)
[1] 1
agrep(pattern,x,max.distance=31)
integer(0)
agrep(pattern,x,max.distance=32)
integer(0)
agrep(pattern,x,max.distance=33)
[1] 1
Linux:
R version 2.9.1 (2009-06-26)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
agrep(pattern,x,max.distance=30)
[1] 1
agrep(pattern,x,max.distance=31)
[1] 1
agrep(pattern,x,max.distance=32)
[1] 1
agrep(pattern,x,max.distance=33)
[1] 1
I am not sure your example makes sense. For the basic grep(), pattern is usually a simple string or a regular expression, and x is a vector whose elements get matched against pattern. Having pattern be a longer string than x strikes me as odd.
Consider this example, where the pattern is a short string matched against longer elements of x:
R> grep("vo", c("foo","bar","baz")) # vo is not in the vector
integer(0)
R> agrep("vo", c("foo","bar","baz"), value=TRUE) # but is close enough to foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.25) # still foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.75) # now all match
[1] "foo" "bar" "baz"
R>
