R: data.table. How to save dates properly with fwrite?

I have a dataset that I can load into R either from a Stata file or from an SPSS file. In both cases it is loaded properly with the haven package, and the dates are recognized properly.
But when I save it to disk with data.table's fwrite function,
fwrite(ppp, "ppp.csv", sep=",", col.names = TRUE)
I have a problem: the dates disappear and are converted to numbers. For example, the date 1967-08-06 is saved in the csv file as -879.
I've also tried playing with fwrite options, such as quote=FALSE, with no success.
I've uploaded a small sample of the files: the SPSS file, the Stata file, and the saved csv.
And this is the code, to make things easier for you:
library(haven)
library(data.table)
ppp <- read_sav("pspss.sav") # choose one of these two.
ppp <- read_dta("pstata.dta") # choose one of these two.
fwrite(ppp, "ppp.csv", sep=",", col.names = TRUE)
The real table has more than one thousand variables and one million individuals, which is why I would like a fast way to do this.
http://www73.zippyshare.com/v/OwzwbyQq/file.html
This is for @ArtificialBreeze:
> head(my)
# A tibble: 6 x 9
ID_2006_2011 TIS FECHA_NAC_2006 año2006 Edad_31_12_2006 SEXO_2006
<dbl> <chr> <date> <date> <dbl> <chr>
1 1.60701e+11 BBNR670806504015 1967-08-06 2006-12-31 39 M
2 1.60701e+11 BCBD580954916014 1958-09-14 2006-12-31 48 F
3 1.60701e+11 BCBL451245916015 1945-12-05 2006-12-31 61 F
4 1.60701e+11 BCGR610904916012 1961-09-04 2006-12-31 45 M
5 1.60701e+11 BCMR580148916015 1958-01-08 2006-12-31 48 F
6 1.60701e+11 BCMX530356917018 1953-03-16 2006-12-31 53 F
# ... with 3 more variables: PAIS_NAC_2006 <dbl>, FECHA_ALTA_TIS_2006 <date>,
# FECHA_ALTA_TIS_2006n <date>

Since this question was asked 6 months ago, fwrite has improved and been released to CRAN. I believe it now works as you wanted: fast, direct and convenient date formatting. It has a dateTimeAs argument, described as follows in fwrite's manual page for v1.10.0 (the CRAN version at the time of writing; as time progresses, please check the latest version of the manual page).
====
dateTimeAs : How Date/IDate, ITime and POSIXct items are written.
"ISO" (default) - 2016-09-12, 18:12:16 and 2016-09-12T18:12:16.999999Z. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as digits.secs. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with "$ TZ='UTC' R" at the shell (NB: it must be one or more spaces between TZ='UTC' and R, anything else will be silently ignored; this TZ setting applies just to that R process) or Sys.setenv(TZ='UTC') at the R prompt and then continue as if UTC were local time.
"squash" - 20160912, 181216 and 20160912181216999. This option allows fast and simple extraction of yyyy, mm, dd and (most commonly to group by) yyyymm parts using integer div and mod operations. In R for example, one line helper functions could use %/%10000, %/%100%%100, %%100 and %/%100 respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if 000) which may be read comfortably as integer64 (automatically by fread()).
"epoch" - 17056, 65536 and 1473703936.999999. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see ?Date). 0, 3 or 6 digits of fractional seconds are printed if and when present.
"write.csv" - this currently affects POSIXct only. It is written as write.csv does by using the as.character method which heeds digits.secs and converts from R's internal UTC representation back to local time (or the "tzone" attribute) as of that historical date. Accordingly this can be slow. All other column types (including Date, IDate and ITime which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with write.csv.
The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for Date and IDate is [0000-03-01, 9999-12-31]. Every one of these 3,652,365 dates have been tested and compared to base R including all 2,790 leap days in this range. This option applies to vectors of date/time in list column cells, too. A fully flexible format string (such as "%m/%d/%Y") is not supported. This is to encourage use of ISO standards and because that flexibility is not known how to make fast at C level. We may be able to support one or two more specific options if required.
====
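Applied to the question's example, a minimal sketch (assuming data.table >= 1.10.0 and the ppp object read by haven as above):
library(data.table)
# dateTimeAs = "ISO" is the default; spelled out here for clarity.
# Date columns are written as yyyy-mm-dd, so 1967-08-06 stays 1967-08-06
# instead of the day count -879:
fwrite(ppp, "ppp.csv", sep = ",", col.names = TRUE, dateTimeAs = "ISO")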

I had the same problem, and I just converted the date columns to character with as.character before writing, then converted them back with as.Date after reading. I don't know how this affects read and write times, but it was a good enough solution for me.
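A minimal sketch of that round trip, assuming a single date column named FECHA_NAC_2006 as in the question:
ppp$FECHA_NAC_2006 <- as.character(ppp$FECHA_NAC_2006) # Date -> character before writing
fwrite(ppp, "ppp.csv")
ppp <- fread("ppp.csv")
ppp$FECHA_NAC_2006 <- as.Date(ppp$FECHA_NAC_2006)      # character -> Date after reading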

These numbers make sense :) fwrite has stored the dates in their underlying numeric form: the number of days since the origin "1970-01-01".
When you read your data back, you can simply convert the numbers into dates with this code:
my$FECHA_NAC_2006<-as.Date(as.numeric(my$FECHA_NAC_2006),origin="1970-01-01")
For example
as.Date(-879,origin="1970-01-01")
[1] "1967-08-06"

Since it seems there is no simple solution, I store the column classes and change them back after reading.
I take the original dataset ppp and record which columns are dates,
areDates <- (sapply(ppp, class) == "Date")
then I save the table to a file with fwrite and can read it back next time.
ppp <- fread("ppp.csv", encoding="UTF-8")
Now I change the classes of the newly read dataset back to the original ones (as.Date needs the origin because fread has read the dates back as plain day counts).
ppp[, names(ppp)[areDates] := lapply(.SD, as.Date, origin = "1970-01-01"),
   .SDcols = areDates]
Maybe someone can write it better with a for loop and the command set (a sketch follows below). It can also be done by reference with setattr:
ppp[, lapply(.SD, setattr, "class", "Date"), .SDcols = areDates]
It can also be written with positions instead of a vector of TRUE and FALSE.
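A hedged sketch of that for-loop-with-set() variant, assuming the areDates vector from above and that fread loaded the date columns as integer day counts:
for (j in which(areDates)) {
  # set() modifies the column in place, without copying the whole table
  set(ppp, j = j, value = as.Date(ppp[[j]], origin = "1970-01-01"))
}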

You need to add the argument dateTimeAs = "ISO".
By adding the dateTimeAs argument and specifying the appropriate option, you get the dates written to your csv file in the desired format AND with their respective time zone.
This is particularly important when dealing with POSIXct variables, which are time zone dependent. Without this argument, the timestamps written to the csv file may be shifted by the difference in hours between time zones. For POSIXct date/time variables you will need dateTimeAs = "write.csv"; unfortunately this option can be slow (https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/fwrite). Good luck!
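A short sketch of the two variants (the data frame df and its columns are hypothetical):
fwrite(df, "out.csv", dateTimeAs = "ISO")       # UTC, ISO 8601; fast
fwrite(df, "out.csv", dateTimeAs = "write.csv") # POSIXct in local time, like write.csv; slower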

Related

Converting 7 or 8 digit numbers to dates in R

I am importing a very large fixed-width dataset into R and wish to use vroom for much better speed. However, the dates in this dataset are in numeric format with either 7 or 8 digits, depending on whether the day of the month has 1 or 2 digits (examples below).
#8 digit date (1985-03-21):
# 21031985
#7 digit date (1985-03-01):
# 1031985
I cannot see any way to specify this type of format using col_date(format = ) as one normally would. It is easy to make a function that converts these 7/8-digit numbers into dates, but doing so means materialising the imported data, which removes the speed advantage that vroom provides.
I am looking for a way to have vroom interpret these numbers on its own, or a workaround that does not sacrifice vroom's speed.
Thanks very much for any help here.
Those formats are horrible in general, but regardless I expect nothing in readr is going to work right for you because of the 1-or-2-digit day-of-month. I suggest reading that column in as col_character and then post-processing with
vec <- c("21031985", "1031985")
as.Date(paste0(strrep("0", pmax(8 - nchar(vec), 0)), vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"
Quick walk-through:
8 - nchar(vec) tells us how many 0s need to be padded to the left of each string. In this case it should be 0 and 1, respectively. This might be a problem if you have length-6 strings; only you know if that's an issue.
strrep("0", ..) repeats the 0 string as many times as we need, including strrep("0", 0) producing "" (no zeroes).
pmax(.., 0) is the defensive programmer: if there's a length-9 string in there, we cannot do strrep("0", -1), so we keep the count from going negative.
paste0(.., vec) to do the actual padding.
From there, all strings should be normalized and able to be converted using "%d%m%Y".
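Putting it together, a hedged sketch of the full vroom workflow (the file name and the date column name are hypothetical):
library(vroom)
df <- vroom("data.csv", col_types = cols(date = col_character()))
# pad to 8 characters, then parse as day-month-year
df$date <- as.Date(paste0(strrep("0", pmax(8 - nchar(df$date), 0)), df$date),
                   format = "%d%m%Y")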
vroom can take a pipe as input. That means you can use a tool like awk to fix the format (e.g. make it always 8 digits, which is easy with sprintf). That way you still benefit from vroom streaming the file. You could even use R, but if you are after performance you need something lightweight that can process the file as a stream.
I used a test file test.csv:
id,date,text
1,1022020,some
2,12042020,more
3,2012020,text
I could read it via the following (of course the awk call needs to be adjusted for your data, but essentially: $2 means the 2nd column is the one to adjust, and -F ',' specifies the separator):
vroom(pipe("awk -F ',' 'BEGIN{OFS=\",\"}; NR==1{print}; NR!=1 {$2=sprintf(\"%08d\",$2);print;}' test.csv"),
col_types=cols(date=col_date(format='%d%m%Y'))
)
giving
# A tibble: 3 × 3
id date text
<int> <date> <chr>
1 1 2020-02-01 some
2 2 2020-04-12 more
3 3 2020-01-02 text
If you have integer data, you can left-pad the lost 0s back on:
as.Date(sprintf("%08d", vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"

Timestamp field in a dbf file (dBase 7 format) is not making sense

I've looked at both [1] and [2] and I'm completely confused (and since the dbf file is a version 4 file, [1] should apply well). For one thing, why does [1] state that the timestamp's date portion is the # of days since 1/1/4713 BC? That's just very puzzling. Secondly, assuming that it is the # of days since 4713 BC, I'm having some trouble with the value I am getting.
First off, my dbf file has a timestamp field holding an 8-byte value. The actual date is 2000/8/16 17:21:41. In the dbf file, the 8-byte sequence is 0x42ccb20e0340df00.
From [1], it says the first 4 bytes are for the date, and the second 4 bytes for the time. If the original byte sequence is actually little-endian (0x42ccb20e), then it should be read as 0x0eb2cc42, which comes to the value 246598722. So the date is 0x0eb2cc42 (246598722) and the time is 0x00df4003 (14630915).
I must be missing something here or calculating something wrong. 246598722 days is equivalent to about 675612 years (assuming 1 yr = 365 days; adding leap years would confuse me and shouldn't change things that much).
From [2], I shouldn't use 01/01/4713 BC as the basis but 12/31/1899 (well, 1/1/1900). But then the date value I have isn't even in the range of what [2] shows.
Now if I take the actual value (2000/8/16) and use [1] and [2], I get the following:
method [1]: 2450501 days : (2000 - -4713) * 365 + (8 * 30) + 16
method [2]: 36756 days : [100 * 365 + 8 * 30 + 16] (over counting the # of days)
The dbf file isn't corrupted (otherwise, when I look at the timestamp in dBase, it would crap out and display something crazy).
I've thought of using big-endian, but that makes even less sense as the values are even larger. I've even considered the possibility that it's actually the # of seconds elapsed since either date, but that makes the values make even less sense, i.e. if 246598722 is the # of seconds elapsed (counting back from 2000/8/16), the base year would be 1812 (calculation: 246598722 / (3600 * 365) = 187.7, so 2000 - 187.7 = 1812.3).
Can someone point out where I'm doing this wrong?
Thanks!
[1] - https://www.dbase.com/Knowledgebase/INT/db7_file_fmt.htm
[2] - Convert dBase Timestamp
For any dBASE questions, I would recommend going to the dBASE newsgroups; they have a very helpful and knowledgeable community.
I've finally found the answer thanks to [3].
Basically, the 8-byte timestamp sequence is used as a whole, with the following notes:
It's stored in big-endian.
The last byte is not used.
It's a Julian Day Number.
So in my case it's 0x42ccb20e0340df00, and truncating the last byte I get 0x42ccb20e0340df.
Then the following Python code gets the correct info:
import datetime
# 7-byte value corresponding to 1970-01-01 00:00:00, taken from [3]
base = 0x42cc418ba99a00
# the file's 7-byte timestamp (last byte truncated)
frm_date = int('42ccb20e0340df', 16)
# the difference is in units of 1/500 s, so dividing by 500
# gives seconds since the Unix epoch
final_ts = (frm_date - base) / 500
final_date = datetime.datetime.utcfromtimestamp(final_ts)
which outputs 2000-8-16 17:21:41 and some milliseconds, which I just ignore.
So I'm guessing the theory is that the above code moves the 'base' date from 1/1/1 to 1970/1/1, which helps since utcfromtimestamp() doesn't work with any value prior to 1970/1/1.
My confusion stems from the fact that it doesn't use 4713 BC as the base year; instead it uses 1/1/1, though I'm still trying to figure out how to get the value 0x42cc418ba99a00 for 1970/1/1.
[3] - https://stackoverflow.com/a/60424157/10860403

How does data.table::setkey sort characters in R? [duplicate]

Yesterday I had to spend some time tracking down a bug in my code, and I found that the data.table package sorts strings a bit differently from base R. Is this normal behavior, and what is the most efficient way (one that keeps the benefits of data.table) to reproduce the results of base R's order function? Here is a toy reproducible example:
library(data.table)
options(stringsAsFactors = FALSE)
d <- data.frame(cn=c("USA","Ubuntu","Uzbekistan"))
d[order(d$cn),,drop=F]
# cn
#2 Ubuntu
#1 USA
#3 Uzbekistan
dt <- data.table(d)
setkey(dt, cn)
dt
# cn
#1: USA
#2: Ubuntu
#3: Uzbekistan
options(stringsAsFactors = default.stringsAsFactors())
OS Windows 7
Update March 2014
There's been some debate about this one. As of v1.9.2 we've settled, for now, on setkey sorting in the C locale; e.g., all capital letters come before all lowercase letters, regardless of the user's locale. This was a change made in v1.8.8 which we had intended to reverse, but we have stuck with it for now.
Consider save()-ing a keyed table in your locale and a colleague load()-ing it in a different locale. A join to that table may no longer work correctly if the key used locale sort order. We have to think a bit more carefully if setkey is to allow locale ordering again, probably by saving the locale name along with the "sorted" attribute, so data.table can at least detect whether the current locale differs from the one in effect when setkey was run.
It's also for speed: sorting according to a locale is much slower than sorting in the C locale. That said, we could make locale sorting as efficient as possible, and allowing it optionally would be ideal.
Hence this is now a feature request, and further comments are very welcome.
FR#4842 setkey to sort using session's locale not C locale
Nice catch! The call to setkey in turn calls setkeyv, which calls fastorder to "order" the columns/entries, which in turn calls chorder.
chorder calls a C function, Ccountingcharacter.c. Now, here I suppose the problem comes down to the locale.
Let's see what "locale" I'm in on my Mac.
Sys.getlocale()
# [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Now let's see how order sorts it:
x <- c("USA", "Ubuntu", "Uzbekistan")
order(x)
# [1] 2 1 3
Now, let's change the "locale" to "C".
Sys.setlocale("LC_ALL", "C")
# [1] "C/C/C/C/C/en_US.UTF-8"
order(x)
# [1] 1 2 3
From ?order:
The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.
From ?Comparison:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z....
So basically, order under the "C" locale gives the same order as data.table's setkey. My guess is that the C function called by chorder runs in the C locale, which compares ASCII values, in which "S" comes before "b".
It's probably important to bring this to @MatthewDowle's attention (if he's not already aware of it). So I'd suggest that you file this as a bug here (just to be sure).
Well, I am not sure what the most efficient way is, but you can do the following to reproduce the data.frame result.
dt[order(dt$cn)]
cn
1: Ubuntu
2: USA
3: Uzbekistan

Generate time-series of any frequency in R

I used to generate time series in R using the xts package as
library(xts)
seq <- timeBasedSeq('2015-06-01/2015-06-05 23')
z <- xts(1:length(seq),seq)
After a bit of tweaking, I find it easy to generate data at a default rate of 1 hour, 1 minute or 1 second. The help page ?timeBasedSeq does not clearly explain how to generate data at other frequencies. Say I want to generate data at a 10-minute rate. Where should I put M (minutes) and 10 in the above command to generate 10-minute data? The option M is mentioned in the help pages.
This isn't currently possible. The "by" value is essentially hard-coded to 1 unit. It should be possible to patch the code so you can specify the "by" component as something like "10M" for 10 minutes, since seq.POSIXct accepts a by = "10 mins" value.
If you want, you can create an issue for this feature request and we can discuss details of what this patch should include.
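In the meantime, a workaround sketch that bypasses timeBasedSeq and builds the index with seq.POSIXct directly (which does accept a 10-minute step):
library(xts)
idx <- seq(as.POSIXct("2015-06-01 00:00"), as.POSIXct("2015-06-05 23:00"),
           by = "10 mins")
z <- xts(seq_along(idx), idx)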

Descriptive statistics of time variables

I want to compute simple descriptive statistics (mean, etc.) of the times when people go to bed. I ran into two problems. The original data comes from an Excel file in which just the times people went to bed were typed in, in 24 hr format. My problem is that R so far doesn't recognize when people went to bed at 1.00 am the next day, meaning that a person who went to bed at 10 pm should be 3 hrs apart from one who went at 1.00 am (and not 21 hrs).
In my data frame the variable in_bed is in POSIXct format, so I thought to apply an if statement saying that if the time is before 12:00, then I want to add 24 hrs.
My function is:
Patr$in_bed <- if (Patr$in_bed < "1899-12-30 12:00") {
  Patr$in_bed + 24*60*60
}
My data frame looks like this
in_bed
1 1899-12-30 22:13:00
2 1899-12-30 23:44:00
3 1899-12-30 00:08:00
If I run my code, my variable gets deleted and the following warning message gets printed:
Warning message:
In if (Patr$in_bed < "1899-12-30 12:00") { :
the condition has length > 1 and only the first element will be used
What am I doing wrong, or does anyone have a better idea? And can I run commands such as mean on variables in POSIXct format, and if not, how do I do it?
When you compare Patr$in_bed (a vector) with "1899-12-30 12:00" (a single value), you get a logical vector. But an if statement requires a single logical, so it generates a warning and considers only the first element of the vector.
You can try :
Patr$in_bed <- Patr$in_bed + 24*60*60 * (Patr$in_bed < as.POSIXct("1899-12-30 12:00"))
Explanation: the comparison in the parentheses returns a logical vector, which is converted to integer (0 for FALSE and 1 for TRUE). The dates for which the condition is true thus get +24*60*60, and the other dates get +0.
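A tiny illustration of that logical-to-integer trick:
c(TRUE, FALSE) * 24*60*60
# [1] 86400     0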
But since the POSIXct format includes the date, I don't see the purpose of adding 24 hrs. For instance,
as.POSIXct("1899-12-31 01:00:00") - as.POSIXct("1899-12-30 22:00:00")
returns a time difference of 3 hours, not 21.
To answer your last question, yes, you can compute the mean of a POSIXct vector, simply with:
mean(Patr$in_bed)
Hope it helps,
Jérémy
