I am reading a file that contains timestamps and a timezone specification. I would like to be able to detect if a given timezone on this file is recognized by R or not, and supply my own default in case it isn't.
However, it seems like as.POSIXct silently falls back to UTC if given an invalid timezone, with no error or warning I could catch and handle:
> as.POSIXct("1970-01-01", tz="blah")
[1] "1970-01-01 UTC"
What would be a 'proper' way in R to check if a given timezone is recognized or not?
help("time zones") explains a lot of the issues with time zones in detail and is well worth the read.
Results will vary based on your OS, but example("time zones") shows how you can read a zone.tab file if your OS has one.
tzfile <- "/usr/share/zoneinfo/zone.tab"
tzones <- read.delim(tzfile, row.names = NULL, header = FALSE,
col.names = c("country", "coords", "name", "comments"),
as.is = TRUE, fill = TRUE, comment.char = "#")
str(tzones$name)
#chr [1:415] "Europe/Andorra" "Asia/Dubai" "Asia/Kabul" "America/Antigua" "America/Anguilla" ...
NROW(tzones)
#[1] 415
head(tzones)
# country coords name comments
#1 AD +4230+00131 Europe/Andorra
#2 AE +2518+05518 Asia/Dubai
#3 AF +3431+06912 Asia/Kabul
#4 AG +1703-06148 America/Antigua
#5 AI +1812-06304 America/Anguilla
#6 AL +4120+01950 Europe/Tirane
You could use a timezone library which has knowledge of time zones. This is from the SVN version of RcppBDT:
R> tz <- new(bdtTz, "America/Chicago")
R> cat("tz object initialized as: ", format(tz), "\n")
tz object initialized as: America/Chicago
R> tzBAD <- new(bdtTz, "blah")
Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) :
Unknown region supplied, no tz object created
R>
In general, time zone support is dependent on the operating system. So for a portable solution you need to supply a list of valid time zones from somewhere...
And for what it is worth, I am using the csv file from the Boost sources. A copy of that time zones file is eg here at github.
Just stumbled on this question since I was looking to figure out the same thing. Turned out using the following. Leaving this for anyone who might stumble on this question...
is.valid.timezone <- function(timezone) {
return(timezone %in% (OlsonNames()))
}
You can also use the Rmetrics package timeDate package to check for timezone.
require(timeDate)
timeDate("1970-01-01", zone = "Africa/Dakar")
## [1] [1970-01-01]
timeDate("1970-01-01", zone = "blah")
## Error in .formatFinCenterNum(unclass(ct), zone, type = "any2gmt") :
## 'blah' is not a valid FinCenter.
Related
Thanks for your help.
One of variable in my dataset looks like this:
> df$TM
> [1] "000054" "000020" "000056" "000051" "000025" "000116" "000219" "000207" "000233" "000206" "000142" "000126" "000237" "000215" "000236" "000246" "000219"
[18] "000227" "000803" "000920"...
The real meaning of each character is hours, minutes and seconds.
When I adjust hms function in Lubridates as follows
> df$TM <- hms(df$TM)
Warning message is coming: "In .parse_hms(..., order = "HMS", quiet = quiet) :
Some strings failed to parse, or all strings are NAs"
After that, all the values in the column changes to NA.
I also tried
> df$TM <- as.POSIXct(df$TM, format = "%H:%M:%S")
and
> df$TM <- chronicle(times = df$TM)
and
> df$TM <- strptime(df$TM, format = "%H:%M:%S")
but... these three trial also have same results.
(Actually all data has changed to NA, so warning message is same as error message to me)
I really appreciate your help.
You can make use of this answer to include a semicolon after every second element. After that you can transform the resulting character string as date (with day, month and year) or leave it as is.
For completeness, the solution for your problem then is
as.POSIXct(sub(":+$", "", gsub('(.{2})', '\\1:', df$TM)), format = "%H:%M:%S")
I'm working with minute data of NASDAQ, it has the index "2015-07-13 12:05:00 EST". I adjusted the system time with Sys.setenv(TZ = 'EST').
I want to program a simple buy/hold/sell strategy, therefore I create a vector of flat positions as a foundation.
pos_flat <- xts(rep(0, nrow(NASDAQ)), index(NASDAQ))
Then I want to apply a constraint, that in a certain time window, positions are bound to be flat, which in my case means equal to 1.
pos_flat["T13:41/T14:00"] <- 1
And this returns the error:
"Error in as.POSIXlt.POSIXct(.POSIXct(.index(x)), tz = indexTZ(x)) :invalid 'tz' value".
I also get this error doing other calculations, I just used this example because it is easy and shows the problem.
As extra information:
> Sys.timezone
function (location = TRUE)
{
tz <- Sys.getenv("TZ", names = FALSE)
if (nzchar(tz))
return(tz)
if (location)
return(.Internal(tzone_name()))
z <- as.POSIXlt(Sys.time())
zz <- attr(z, "tzone")
if (length(zz) == 3L)
zz[2L + z$isdst]
else zz[1L]
}
<bytecode: 0x03648ff4>
<environment: namespace:base>
I don't understand the problem with the tz value... Any ideas?
The source of your "invalid 'tz' value" error is because, for whatever reason, R doesn't accept tz = df$var. If you set tz = 'America/New_York' or some other character value, then it will work.
Better answer (instead of using force_tz below) for converting UTC times to various timezones based on location. It is also simpler and better than looping through or using a nested ifelse. I subset and change tz based on a timezone column (which my data already has, if not you can create it). Just make sure you account for all timezones in your data
(unique(df$timezone))
df$datetime2[df$timezone == 'America/New_York'] <- format(df$datetime, tz="America/New_York")[df$timezone == 'America/New_York']
df$datetime2[df$timezone == 'America/Chicago'] <- format(df$datetime, tz="America/Chicago")[df$timezone == 'America/Chicago']
df$datetime2[df$timezone == 'America/Denver'] <- format(df$datetime, tz="America/Denver")[df$timezone == 'America/Denver']
df$datetime2[df$timezone == 'America/Los_Angeles'] <- format(df$datetime, tz="America/Los_Angeles")[df$timezone == 'America/Los_Angeles']
Previous solution: Converting to Local Time in R - Vector of Timezones
require(lubridate)
require(dplyr)
df = data.frame(timestring = c("2015-12-12 13:34:56", "2015-12-14 16:23:32"), localzone = c("America/Los_Angeles", "America/New_York"), stringsAsFactors = F)
df$moment = as.POSIXct(df$timestring, format="%Y-%m-%d %H:%M:%S", tz="UTC")
df = df %>% rowwise() %>% mutate(localtime = force_tz(moment, localzone))
df
You are getting errors because "EST" is not a valid timezone specification. It's an abbreviation that's often used when printing and displaying timezones.
The index is printed as "2015-07-13 12:05:00 EST" because "EST" probably represents Eastern Standard Time in the United States. If you want to set the TZ environment variable to that timezone, you should use Sys.setenv() with Country/City notation:
Sys.setenv(TZ = "America/New_York")
You can also set the timezone in the xts constructor:
pos_flat <- xts(rep(0, nrow(NASDAQ)), index(NASDAQ), tzone = "America/New_York")
Your error occurs because of a misinterpretation of the time object. You need to have UNIX timestamps in order to use something like
pos_flat["T13:41/T14:00"] <- 1
Try a conversion of your indices by doing something like this:
index(NASDAQ) <- as.POSIXct(strptime(index(NASDAQ), "%Y-%m-%d %H:%M:%S"))
As you want to use EST, you have to change your environment variables (if you are not living in EST timezone). So all in all, this should work:
Sys.setenv(TZ = 'EST')
#load stuff
#...
index(NASDAQ) <- as.POSIXct(strptime(index(NASDAQ), "%Y-%m-%d %H:%M:%S"))
pos_flat <- xts(rep(0, nrow(NASDAQ)), index(NASDAQ))
pos_flat["T13:41/T14:00"] <- 1
For further information, have a look at the POSIXct and POSIXlt structures in R.
Best regards
I'm trying to understand my difficulties in the past with inputting zoo objects. The following two uses of read.zoo give different results despite the default argument for tz supposedly being "" and that is the only difference between the two read.zoo calls:
Lines <- "2013-11-25 12:41:21 2
2013-11-25 12:41:22.25 2
2013-11-25 12:41:22.75 75
2013-11-25 12:41:24.22 3
2013-11-25 12:41:25.22 1
2013-11-25 12:41:26.22 1"
library(zoo)
z <- read.zoo(text = Lines, index = 1:2)
> dput(z)
structure(c(2L, 2L, 75L, 3L, 1L, 1L), index = structure(c(16034,
16034, 16034, 16034, 16034, 16034), class = "Date"), class = "zoo")
z <- read.zoo(text = Lines, index = 1:2, tz="")
> dput(z)
structure(c(2L, 2L, 75L, 3L, 1L, 1L), index = structure(c(1385412081,
1385412082.25, 1385412082.75, 1385412084.22, 1385412085.22, 1385412086.22
), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo")
>
The answer (of course) is in the sources for read.zoo(), wherein there is:
....
ix <- if (missing(format) || is.null(format)) {
if (missing(tz) || is.null(tz))
processFUN(ix)
else processFUN(ix, tz = tz)
}
else {
if (missing(tz) || is.null(tz))
processFUN(ix, format = format)
else processFUN(ix, format = format, tz = tz)
}
....
Even though the default for tz is "", in your first case tz is considered missing (by missing()) and hence processFUN(ix) is used. When you set tz = "", it is no longer missing and hence you get processFUN(ix, tz = tz).
Without looking at the details of read.zoo() this could possibly be handled better by having tz = NULL or tz (no default) in the arguments and then in the code, if tz needs to be set to "" for some reason, do:
if (missing(tz) || is.null(tz)) {
tz <- ""
}
or perhaps this is not even needed if all the is required is to avoid the confusion about the two different calls?
Effectively, the default index class is "Date" unless tz is used in which case the default is "POSIXct". Thus the first example in the question gives "Date" class since that is the default and the second "POSIXct" since tz was specified.
If you want to specify the class without making use of these defaults then to be explicit use the FUN argument:
read.zoo(...whatever..., FUN = as.Date)
read.zoo(...whatever..., FUN = as.POSIXct) # might need FUN=paste,FUN2=as.POSIXct
read.zoo(...whatever..., FUN = as.yearmon)
# etc.
The FUN argument can also take a custom function as shown in the examples in the package.
Note that it always assumes standard formats (e.g. "%Y-%m-%d" in the case of "Date" class) if no format is specified and never tries to automatically determine the format.
The way it works is explained in detail in ?read.zoo and there are many examples in ?read.zoo (there are 78 lines of code in the examples section) as well as in an entire vignette (one of six vignettes) dedicated just to read.zoo" : Reading Data in zoo.
Added Have expanded the above. Also, in the development version of zoo available here the heuristic has been improved and with that improvement the first example in the question does recognize the date/times and chooses POSIXct. Also some clarification of the simple heuristic has been added to the read.zoo help file so that the many examples provided do not have to be relied upon as much.
Here are some examples. Note that the heuristic referred to is a heuristic to determine the class of the time index only. It can only identify "numeric", "Date" and "POSIXct" classes. The heuristic cannot identify other classes (although you can specify them yourself using FUN=). Also the heuristic does not identify formats. If the format is not provided using format= or implicitly through FUN= then standard format is assumed, e.g. "%Y-%m-%d" in the case of "Date".
Lines <- "2013-11-25 12:41:21 2
2013-12-25 12:41:22.25 3
2013-12-26 12:41:22.75 8"
# explicit. Uses POSIXct.
z <- read.zoo(text = Lines, index = 1:2, FUN = paste, FUN2 = as.POSIXct)
# tz implies POSIXct
z <- read.zoo(text = Lines, index = 1:2, tz = "")
# heuristic: Date now; devel ver uses POSIXct
z <- read.zoo(text = Lines, index = 1:2)
Lines <- "2013-11-251 2
2013-12-25 3
2013-12-26 8"
z <- read.zoo(text = Lines, FUN = as.Date) # explicit. Uses Date.
z <- read.zoo(text = Lines, format = "%Y-%m-%d") # format & no tz implies Date
z <- read.zoo(text = Lines) # heuristic: Date
Note:
(1) In general, its safer to be explicit by using FUN or by using tz and/or format as opposed to relying on the heuristic. If you are explicit by using FUN or semi-explicit by using tz and/or format then there is no change between the current and the development versions of read.zoo.
(2) Its safer to rely on the documentation rather than the internals as the internals can change without warning and in fact have changed in the development version. If you really want to look at the code despite this then the key statement that selects the class of the index if FUN is not explicitly defined is the if (is.null(FUN)) ... statement in the read.zoo source.
(3) I recommend using read.zoo as being easier, direct and compact rather than workarounds such as read.table followed by zoo. I have been using read.zoo for years as have many others and it seems pretty solid to me but if anyone finds specific problems with read.zoo or with the documentation (always possible since there is quite a bit of it) they can always be reported. Even though the package has been around for years improvements are still being made.
I suspect your use of read.zoo tripped you up. Here is what I did:
library(zoo)
tt <- read.table(text=Lines)
z <- zoo(as.integer(tt[,3]), order.by=as.POSIXct(paste(tt[,1], tt[,2])))
Now z is a proper zoo object:
R> z
2013-11-25 12:41:21.00 2013-11-25 12:41:22.25 2013-11-25 12:41:22.75
2 2 75
2013-11-25 12:41:24.22 2013-11-25 12:41:25.22 2013-11-25 12:41:26.22
3 1 1
R> class(z)
[1] "zoo"
R> class(index(z))
[1] "POSIXct" "POSIXt"
R>
And by making sure I used a POSIXct object for the index, I am in fact getting a POSIXct object back.
I'm trying to get file creation dates from R and I understand that this information might not be possible to retrieve at all on some operating systems that just don't store it anywhere. However, I'm unsure how to retrieve it generically when it is (at least, theoretically) retrievable.
On Windows, this is straight forward because ctime from file.info provides this information, for reference, this is the relevant excerpt from ?file.info
What is meant by the three file times depends on the OS and file system. On Windows native file systems ctime is the file creation time (something which is not recorded on most Unix-alike file systems).
However, although most unix systems don't record this information (as pointed out in the help), some unix-based systems such as OS X do in fact store this. On OS X, for example, the system command metadata ls mdls will print file metadata and list kMDItemContentCreationDate (the actual creation date of the file) as one of the file attributes.
My question is, what advice do people have for getting at file creation dates (if they are available at all) from file metadata? (e.g. specifically in the case of OS X where there's a system command but no direct R call)
UPDATE:
Thanks to info from the comments + details on SO and SE here and here, I've come up with a way to solve this in R on OS X type unix platforms that track creation date and have the BSD style stat command. However, I still couldn't figure out how to do this in R on other linux systems that track creation date but don't have this version of stat. In this answer on unix SE, it is suggested that this info could be retrieved with debugfs + stat even when stat itself does not report it (provided the file system records birthdate), but that solution I couldn't get to work (only linux I could test on didn't have debugfs). Anyways, here's how far I got:
get_birthdate <- function(filepath) {
switch(Sys.info()[['sysname']],
Windows = {
# Windows
file.info(filepath)$ctime
},
Darwin = {
# OS X
cmd <- paste('stat -f "%DB"', filepath) # use BSD stat command
ctime_sec <- as.integer(system(cmd, intern=T)) # retrieve birth date in seconds from start of epoch (%DB)
as.POSIXct(ctime_sec, origin = "1970-01-01", tz = "") # convert to POSIXct
},
Linux = {
# Linux
stop("not sure how to do this")
})
}
Following other's pointers, this should work quite reasonably.
Unfortunately it needs root privileges (dued to debugfs) and it's not very efficient yet (especially a bit quick'n dirty on regular expressions, but it's 01:00 o clock in the morning here :) ).
BTW, we set up the pager to be cat (making debugfs to print on standard output), find in which device the file is stored in order to use debugfs properly and finally get the stats and elaborate it a bit.
In general, in UNIX, once you have a bash-command to read its output in R you have to use pipe in read mode(that is default) and readLines.
Test done in a Debian Gnu Linux.
np350v5c:/home/l# R
> my.file <- "/etc/network/interfaces"
>
> setup_pager <- function() {system("export PAGER=cat")}
>
> where_is <- function(file) {
con <- pipe(sprintf("df %s", file))
res <- strsplit(readLines(con)[2], " ")[[1]][1]
close(con)
res
}
>
> where_is(my.file) # could be /dev/sda1 as well, depending on /etc/fstab
[1] "/dev/disk/by-uuid/9ce40c2b-60d8-40b1-890f-1e5da4199c88"
>
> my.command <- sprintf("debugfs -R 'stat %s' %s",
my.file,
where_is(my.file))
>
> ## root privileges especially here ..
> setup_pager()
> con <- pipe(my.command)
> debugfs <- readLines(con)
debugfs 1.42.9 (4-Feb-2014)
> close(con)
>
> my.date <- gsub("^crtime:.+-- ", "", grep("^crtime", debugfs, value = TRUE))
> my.date
[1] "Tue Feb 19 00:07:21 2013"
> strptime(tolower(substr(my.date, 5, nchar(my.date))),
format = "%b %d %H:%M:%S %Y")
[1] "2013-02-19 00:07:21 CET"
HTH, Luca
I know I am a little late to the game here, but here is a pretty easy solution for unix/Mac OS:
file.name <- "~/dir/file.extension"
df$file_created_dt <- system(paste0("stat -f %SB ", file.name), intern = T)
And then you can format it however you like:
df$file_created_dt <- as.POSIXct(df$file_created_dt, format = "%b %d %H:%M:%S %Y", origin = "1970-01-01 00:00:00", tz = "your/timezone")
I have data.table named data like this:
> head(data)
start end unit
1: 2008-11-17 2007-01-23 ADM 2-05
2: 2008-12-29 2007-01-06 BOB 4-07
3: 2008-12-31 2007-01-01 DAT15-02
4: 2008-12-31 2010-01-01 DAT15-06
5: 2008-12-31 2010-01-02 TUW 4-09
6: 2008-12-31 2010-01-02 BEG 5-01
With data types as follows:
sapply(dane, class)
start end unit
"Date" "Date" "character"
I'm trying to debug this line:
data[,
list(date = format(seq(from = start, to = end, by = "1 day"), "%Y-%m-%d")),
by = list(start, end, unit)
]
Then I get error message:
Error in del/by : non-numeric argument to binary operator
I figured out, that the error is caused by conversion to numeric which takes place when I pass something as argument to the list in 'by'.
So this modified code works:
dane[,
list(date = format(seq(
from = as.Date(start, origin = "1970-01-01"),
to = as.Date(end, origin = "1970-01-01"), by = "1 day"),
"%Y-%m-%d")),
by = list(start, end, unit)
]
This looks like a bug in data.table package. I wonder if anybody knows about this one.
Thanks in advance.
This is now fixed in commit #1256 of v1.9.3. From NEWS:
o Using by columns with attributes (ex: factor, Date) in j did not retain the attributes, also in case of :=. This was partially a regression from an earlier fix (bug #2531) due to recent changes for R3.1.0. Now fixed and clearer tests added. Thanks to Christophe Dervieux for reporting and to AdamB for reporting here on SO. Closes #5437.
Thanks again for reporting.