D <- "06.12.1948" # which is dd.mm.yyyy
as.Date(D, "%d.%m.%y") # convert to date
[1] "2019-12-06" # ????
What is it that I am missing?
Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252"
The format is case-sensitive: "%y" matches a two-digit year (and guesses the century), while "%Y" matches the full four-digit year:
as.Date(D, "%d.%m.%Y")
[1] "1948-12-06"
The help topic ?strptime has details:
‘%y’ Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the default
century inferred from a 2-digit year will change’.
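A quick illustration of that rule; it also explains the output above, since "%y" consumed only the "19" of "1948", giving year 2019:
as.Date("06.12.48", "%d.%m.%y")  # 48 is in 00-68, so century 20 is assumed
#[1] "2048-12-06"
as.Date("06.12.69", "%d.%m.%y")  # 69 is in 69-99, so century 19 is assumed
#[1] "1969-12-06"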
To avoid having to remember date formats, we can use packaged solutions.
1) With lubridate
lubridate::dmy(D)
#[1] "1948-12-06"
2) Using anytime
anytime::anydate(D)
#[1] "1948-06-12"
Note that anytime interpreted the string month-first here, so for dd.mm.yyyy input the lubridate::dmy() result is the correct one.
This might be helpful for someone. I found this function in the tutorial "Handling date-times in R" by Cole Beck. It identifies the format of your data.
# FUNCTION guessDateFormat
#   x: vector of character dates/datetimes
#   returnDates: return actual dates rather than the format
# Convert a character datetime to a POSIXlt datetime, or at least guess the
# format such that you could convert to datetime.
guessDateFormat <- function(x, returnDates = FALSE, tzone = "") {
x1 <- x
# replace blanks with NA and remove
x1[x1 == ""] <- NA
x1 <- x1[!is.na(x1)]
if (length(x1) == 0)
return(NA)
# if it's already a time variable, set it to character
if ("POSIXt" %in% class(x1[1])) {
x1 <- as.character(x1)
}
dateTimes <- do.call(rbind, strsplit(x1, " "))
# replace literal "NA" strings with real missing values
dateTimes[dateTimes == "NA"] <- NA
# assume the time part can be found with a colon
timePart <- which(apply(dateTimes, MARGIN = 2, FUN = function(i) {
any(grepl(":", i))
}))
# everything not in the timePart should be in the datePart
datePart <- setdiff(seq(ncol(dateTimes)), timePart)
# should have 0 or 1 timeParts and exactly one dateParts
if (length(timePart) > 1 || length(datePart) != 1)
stop("cannot parse your time variable")
timeFormat <- NA
if (length(timePart)) {
# find maximum number of colons in the timePart column
ncolons <- max(nchar(gsub("[^:]", "", na.omit(dateTimes[, timePart]))))
if (ncolons == 1) {
timeFormat <- "%H:%M"
} else if (ncolons == 2) {
timeFormat <- "%H:%M:%S"
} else stop("timePart should have 1 or 2 colons")
}
# remove all non-numeric values
dates <- gsub("[^0-9]", "", na.omit(dateTimes[, datePart]))
# sep is any non-numeric value found, hopefully / or -
sep <- unique(na.omit(substr(gsub("[0-9]", "", dateTimes[, datePart]), 1, 1)))
if (length(sep) > 1)
stop("too many separators in datePart")
# maximum number of characters found in the date part
dlen <- max(nchar(dates))
dateFormat <- NA
# when six, expect the century to be omitted
if (dlen == 6) {
if (sum(is.na(as.Date(dates, format = "%y%m%d"))) == 0) {
dateFormat <- paste("%y", "%m", "%d", sep = sep)
} else if (sum(is.na(as.Date(dates, format = "%m%d%y"))) == 0) {
dateFormat <- paste("%m", "%d", "%y", sep = sep)
} else stop("datePart format [six characters] is inconsistent")
} else if (dlen == 8) {
if (sum(is.na(as.Date(dates, format = "%Y%m%d"))) == 0) {
dateFormat <- paste("%Y", "%m", "%d", sep = sep)
} else if (sum(is.na(as.Date(dates, format = "%m%d%Y"))) == 0) {
dateFormat <- paste("%m", "%d", "%Y", sep = sep)
} else stop("datePart format [eight characters] is inconsistent")
} else {
stop(sprintf("datePart has unusual length: %s", dlen))
}
if (is.na(timeFormat)) {
format <- dateFormat
} else if (timePart == 1) {
format <- paste(timeFormat, dateFormat)
} else if (timePart == 2) {
format <- paste(dateFormat, timeFormat)
} else stop("cannot parse your time variable")
if (returnDates)
return(as.POSIXlt(x, format = format, tz = tzone))
format
}
# generate some dates
mydates <- format(as.POSIXct(sample(31536000, 20), origin = "2011-01-01", tz = "UTC"), "%m/%d/%Y %H:%M")
mydates
## [1] "02/07/2011 06:51" "11/21/2011 17:03" "09/17/2011 22:42" "02/16/2011 13:45"
## [5] "12/14/2011 19:11" "09/08/2011 09:22" "12/06/2011 14:06" "02/02/2011 11:00"
## [9] "03/27/2011 06:12" "01/05/2011 15:09" "04/15/2011 04:17" "10/20/2011 14:20"
## [13] "11/13/2011 21:46" "02/26/2011 03:24" "12/29/2011 11:02" "03/17/2011 02:24"
## [17] "02/27/2011 13:51" "06/27/2011 08:36" "03/14/2011 10:54" "01/28/2011 14:14"
guessDateFormat(mydates)
[1] "%m/%d/%Y %H:%M"
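The guessed format can then be fed back into the base parsers, or you can let the function parse directly via its returnDates argument (a short sketch reusing the mydates vector above):
# parse using the guessed format ...
as.POSIXct(mydates, format = guessDateFormat(mydates), tz = "UTC")
# ... or let guessDateFormat return POSIXlt datetimes itself
guessDateFormat(mydates, returnDates = TRUE, tzone = "UTC")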
Lubridate is the best option for this, in my opinion. The following will work fine:
data %>% mutate(date_variable = as.Date(dmy(date_variable)))
Interestingly, though, I found dmy() to behave oddly when as.Date() and dmy() were called in separate steps.
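A minimal illustration of the one-step pipeline on a hypothetical data frame (dmy() already returns a Date, so the as.Date() wrapper is a harmless no-op):
library(dplyr)
library(lubridate)
data <- data.frame(date_variable = c("06.12.1948", "01.01.2016"))
data %>% mutate(date_variable = as.Date(dmy(date_variable)))
#   date_variable
# 1    1948-12-06
# 2    2016-01-01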
Related
I have a variable that needs to be converted to military time. This variable is very messy because it lacks consistency in the format.
Here is an example of what might be found in the variable.
x <- c("0.9305555555555547", "15:20 Found", "10:00:00 AM Found", "0.125", "Found 1525")
So far I have had some success getting everything into a more consistent format with regex:
x <- str_extract(x, "[0-9]+.+[0-9]|[0-9][0-9][:][0-9][0-9]")
x <- str_remove(x, "[:]")
x <- str_remove(x, "[:][0-9][0-9]$")
As you can see I get: "0.9305555555555547", "1520", "1000", "0.125", "1525"
The problem is the decimals need to be multiplied by 2400 to get back to military time but I also don't want to multiply integers (since those are already in military time).
x is essentially a variable in a dataframe.
I was thinking about using if/else logic but I don't know how to implement that.
To clarify, I want:
Input: "0.9305555555555547", "15:20 Found", "10:00:00 AM Found", "0.125", "Found 15:25"
Output: "2233", "1520", "1000", "0300", "1525"
After the pre-processing you did with your regexes, you can implement the if/else logic using str_detect():
x <- ifelse(str_detect(x, "\\."),
as.integer(as.numeric(x) * 2400),
as.integer(x)) %>%
sprintf("%04d", .)
This will return your desired output as character
Then you could do something like this to parse it to POSIXct
x <- as.POSIXct(x,
format = "%H%M",
origin = "1970-01-01",
tz = "UTC")
We should extend the regex to also capture AM/PM indicators so that information isn't lost. Then, working on subsets, we handle decimal time, 12-hour (AM/PM) time, and 24-hour time, and return the result.
milt <- function(x) {
u <- trimws(gsub('\\D*(\\d*\\W?[AP]?M?)\\D*', '\\1', x))
u[grep('\\.', u)] <- sprintf('%04d', round(as.double(u[grep('\\.', u)])*2400))
u[grep('[AP]M', u)] <- strftime(strptime(u[grep('[AP]M', u)], '%I:%M:%S %p'), '%H%M')
u[grep(':', u)] <- gsub(':', '', u[grep(':', u)] )
return(u)
}
milt(x)
# [1] "2233" "1520" "1000" "2200" "0300" "1525" "0000" "1020"
Data:
x <- c("0.9305555555555547", "15:20 Found", "10:00:00 AM Found",
"10:00:00 PM Found", "0.125", "Found 1525", "0000", "10:20")
I followed your logic literally and got exactly the same result:
1) Numeric conversion
2) Multiply by 2400 and convert back to character
3) A for loop to find the dot in each string and drop everything after it
4) A for loop to left-pad numbers shorter than 4 characters with 0
library(stringr)
library(stringi)
x <- c("0.9305555555555547", "15:20 Found", "10:00:00 AM Found", "0.125", "Found 1525")
x <- str_extract(x, "[0-9]+.+[0-9]|[0-9][0-9][:][0-9][0-9]")
x <- str_remove(x, "[:]")
x <- str_remove(x, "[:][0-9][0-9]$")
x <- as.numeric(x)
x <- as.character(ifelse(x<1,x*2400,x))
for(i in 1:length(x)){
ii <- stri_locate_first_regex(x[i],"\\.")[1]
if(!is.na(ii)){
x[i] <- str_sub(x[i],1,ii-1)
}
}
for(i in 1:length(x)){
while (nchar(x[i])<4) {
x[i] <- paste0("0",x[i])
}
}
x
[1] "2233" "1520" "1000" "0300" "1525"
Edit: I realized that this:
the decimals need to be multiplied by 2400 to get back to military time
isn't quite right. Military time isn't a decimal representation (the minutes part effectively runs in base 60), so multiplying the day fraction by 2400 won't give a correct result. I've changed my code accordingly.
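For reference, here is a minimal sketch of converting a day fraction to an HHMM string without that base mix-up (the helper name frac_to_military is made up for illustration); it agrees with the 22:20 and 03:00 results below:
frac_to_military <- function(frac) {
  hrs <- floor(frac * 24)                # whole hours in the day fraction
  mins <- round((frac * 24 - hrs) * 60)  # remaining minutes
  sprintf("%02d%02d", as.integer(hrs), as.integer(mins))
}
frac_to_military(c(0.9305555555555547, 0.125))
# [1] "2220" "0300"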
Rather than using the same regex on everything right out the gate, I would first determine the format of each element of x, then process the element accordingly.
I like the hms library for working with times and use it below, but you could also use the base POSIXct or POSIXlt class.
library(tidyverse)
library(hms)
x <- c("0.9305555555555547", "15:20 Found", "10:00:00 AM Found", "0.125", "Found 1525")
# ----
# define functions for parsing each possible format in `x`
parse_decimal <- function(x) {
hrs <- as.numeric(str_extract(x, "^0\\.\\d+")) * 24
min <- (hrs %% 1) * 60
hrs <- floor(hrs)
hms(hours = hrs, minutes = min)
}
parse_timestring <- function(x) {
out <- str_remove_all(x, "\\D")
hr_digits <- if_else(str_length(out) == 3, 1, 2)
hrs <- as.numeric(str_sub(out, end = hr_digits))
hrs <- if_else(str_detect(str_to_upper(x), "P\\.?M"), hrs + 12, hrs)
min <- as.numeric(str_sub(out, start = hr_digits + 1))
hms(hours = hrs, minutes = min)
}
# ----
# test each element of x, and pass it to the appropriate parsing function
time <- case_when(
str_detect(x, "^[01]$|^0\\.\\d") ~ parse_decimal(x),
str_detect(x, "\\d{1,2}:?\\d{2}") ~ parse_timestring(x),
TRUE ~ NA_real_
)
time
# 22:20:00.000000
# 15:20:00.000000
# 10:00:00.000000
# 03:00:00.000000
# 15:25:00.000000
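If the final output needs to be the 4-digit military string rather than an hms time, one option (a small sketch relying on the fact that hms stores seconds since midnight) is:
secs <- round(as.numeric(time))
sprintf("%02d%02d", as.integer(secs %/% 3600), as.integer((secs %% 3600) %/% 60))
# [1] "2220" "1520" "1000" "0300" "1525"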
I have some data from an event producer. In a "created_at" column I have mixed types of datetime values:
some NA, some ISO 8601-like, and some POSIX epoch timestamps with and without milliseconds.
I built a function that should take care of everything, meaning it leaves the NA and ISO 8601 values as they are and converts the POSIX timestamps to ISO 8601.
library(anytime)
convert_time <- function(x) {
nb_char = nchar(x)
if (is.na(x)) return(x)
else if (nb_char == 10 | nb_char == 13) {
num_x = as.numeric(x)
if (nb_char == 13) {
num_x = round(num_x / 1000, 0)
}
return(anytime(num_x))
}
return(x)
}
If I pass one problematic value:
convert_time("1613488656")
"2021-02-16 15:17:36 UTC"
Works well!
Now
df_offer2$created_at = df_offer2$created_at %>% sapply(convert_time)
I still have the problematic values.
Any tips here?
I would suggest the following small changes...
convert_time <- function(x) {
nb_char = nchar(x)
if (is.na(x)) return(x)
else if (nb_char == 10 | nb_char == 13) {
num_x = as.numeric(x)
if (nb_char == 13) {
num_x = round(num_x / 1000, 0)
}
return(num_x) #remove anytime from here
}
return(x)
}
df_offer2$created_at = df_offer2$created_at %>%
sapply(convert_time) %>% anytime() #put it back in at this point
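Alternatively, a vectorised sketch that avoids calling the function element by element; the name convert_time_vec, the digit-only check, and the ISO 8601 output format are illustrative choices, not part of the original code:
library(anytime)

convert_time_vec <- function(x) {
  # epoch-like values: exactly 10 digits (seconds) or 13 digits (milliseconds)
  is_epoch <- !is.na(x) & grepl("^[0-9]{10}([0-9]{3})?$", x)
  num_x <- as.numeric(x[is_epoch])
  num_x <- ifelse(nchar(x[is_epoch]) == 13, round(num_x / 1000), num_x)
  out <- x
  out[is_epoch] <- format(anytime(num_x), "%Y-%m-%dT%H:%M:%S")
  out
}

df_offer2$created_at <- convert_time_vec(df_offer2$created_at)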
Two things that have worked for me:
col1<-seq(from=1,to=10)
col2<-rep("1613488656",10)
df <- data.frame(cbind(col1,col2))
colnames(df)<-c("index","created_at")
df <- df %>%
mutate(converted = convert_time(df$created_at))
alternatively
col1<-seq(from=1,to=10)
col2<-rep("1613488656",10)
df <- data.frame(cbind(col1,col2))
colnames(df)<-c("index","created_at")
df$created_at <- convert_time(df$created_at)
Both spit out warnings but appear to make the correction properly.
I am attempting to set up an empty data frame in R which will be populated by, amongst other things, two date-timestamps in the form of e.g. 21/08/2014 20:51.
This is my code:
eventised <- data.frame(student_id=integer(),
session_id=integer(),
start_ts=as.POSIXct(format = "%d/%m/%Y %H:%M"),
stop_ts=as.POSIXct(format = "%d/%m/%Y %H:%M"),
week=integer(),
macro_process=character(),
micro_process=character(),
stringsAsFactors=FALSE)
raw_events <- read.csv(file="SRL_Concat_ST1_Test_2.csv", header = TRUE, sep=",")
last_sess_ID <- 0
for (row in 1:nrow(raw_events)) {
if(raw_events[row, "SESSION_ID"] != last_sess_ID || row == nrow(raw_events)) {
print(row)
if(row !=1) {
eventised[nrow(eventised)+1,] <- c(r_student_id, r_session_id, r_start_ts, r_stop_ts, r_week, "MAC", "MIC")
# eventised[nrow(eventised)+1,] <- c(r_student_id, r_session_id, r_week, "MAC", "MIC")
}
r_student_id <- raw_events[row, "STUDENT_ID"]
r_session_id <- raw_events[row, "SESSION_ID"]
r_start_ts <- raw_events[row, "TIMESTAMP"]
r_stop_ts <- raw_events[row, "TIMESTAMP"]
r_week <- raw_events[row, "WEEK"]
last_sess_ID <- raw_events[row, "SESSION_ID"]
} else {
r_stop_ts <- raw_events[row, "TIMESTAMP"]
}
}
I get this error:
Error in inherits(x, "POSIXct") :
argument "x" is missing, with no default
Then later I attempt to do this:
eventised[nrow(eventised)+1,] <- c(r_student_id, r_session_id, r_start_ts, r_stop_ts, r_week, "MAC", "MIC")
I get:
Error in charToDate(x) :
character string is not in a standard unambiguous format
I am probably doing something stupid but I would really appreciate some help.
Thanks in advance,
F
DATA
STUDENT_ID SESSION_ID TIMESTAMP LACTION_TYPE WEEK STUDY_MODE
4 7 11/08/2014 23:08 CONTENT_ACCESS 3 revisiting
This should take care of the datetime format:
df <- data.frame(start_ts=as.POSIXct(character()))
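Applied to the original declaration, a sketch of the empty frame could look like this; no format argument is needed when the column is created empty, and the timestamps are parsed later, when rows are filled:
eventised <- data.frame(student_id = integer(),
                        session_id = integer(),
                        start_ts = as.POSIXct(character()),
                        stop_ts = as.POSIXct(character()),
                        week = integer(),
                        macro_process = character(),
                        micro_process = character(),
                        stringsAsFactors = FALSE)
# e.g. when filling a row, parse the dd/mm/yyyy hh:mm timestamps from the data:
# r_start_ts <- as.POSIXct(raw_events[row, "TIMESTAMP"], format = "%d/%m/%Y %H:%M")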
Basically, I'm trying to keep a vector named dates of special Dates that come up a lot in my analysis, say New Year's 2016 and July 4 2015. I want to be able to extract from this by name instead of index for robustness, e.g., dates["nyd"] to get New Year's and dates["ind"] to get July 4.
I thought this would be simple:
dates <- as.Date(c(ind = "2015-07-04", nyd = "2016-01-01"))
But as.Date has stripped the names:
dates
# [1] "2015-07-04" "2016-01-01"
It's not like Date vectors can't be named (which would be strange, given they're basically specifically-interpreted integers):
setNames(dates, c("ind", "nyd"))
# ind nyd
# "2015-07-04" "2016-01-01"
And unfortunately there's no way to declare a Date vector directly (as far as I know?), especially without knowing the underlying integer values of the dates.
Exploring this, it seems this is standard practice for the as* class of functions:
as.integer(c(a = "123", b = "436"))
# [1] 123 436
as(c(a = 1, b = 2), "character")
# [1] "1" "2"
Is there a reason why this is the case? The loss of names isn't mentioned in ?as or any of the other help pages I've seen.
More generally, is there a way (using something other than as*) to ensure the names of an object are not lost in a conversion?
Of course one approach is to write custom functions like as.Date.named or create a custom class as.named with associated methods, but it would be surprising to me if there wasn't something like this already in place, as it seems like this should be a pretty common operation.
In case it matters, I'm on R 3.2.2.
Indeed there is a discrepancy among the different as.Date methods, and here is why (or rather "how"):
First, your example:
> as.Date(c(ind = "2015-07-04", nyd = "2016-01-01"))
[1] "2015-07-04" "2016-01-01"
Here we use method as.Date.character:
> as.Date.character
function (x, format = "", ...)
{
charToDate <- function(x) {
xx <- x[1L]
if (is.na(xx)) {
j <- 1L
while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
if (is.na(xx))
f <- "%Y-%m-%d"
}
if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
tz = "GMT")))
return(strptime(x, f))
stop("character string is not in a standard unambiguous format")
}
res <- if (missing(format))
charToDate(x)
else strptime(x, format, tz = "GMT")
as.Date(res)
}
<bytecode: 0x19d3dff8>
<environment: namespace:base>
Whether or not the format is given, your vector is passed to strptime(), which converts it to class POSIXlt; it is then passed to as.Date() again, this time dispatching to the as.Date.POSIXlt method, which is:
> as.Date.POSIXlt
function (x, ...)
.Internal(POSIXlt2Date(x))
<bytecode: 0x19d2df50>
<environment: namespace:base>
meaning that ultimately the function used to convert to class Date is the C function called by POSIXlt2Date (a quick look at the file names.c shows that the function is do_POSIXlt2D from file datetime.c). For reference, here it is:
SEXP attribute_hidden do_POSIXlt2D(SEXP call, SEXP op, SEXP args, SEXP env)
{
SEXP x, ans, klass;
R_xlen_t n = 0, nlen[9];
stm tm;
checkArity(op, args);
PROTECT(x = duplicate(CAR(args)));
if(!isVectorList(x) || LENGTH(x) < 9)
error(_("invalid '%s' argument"), "x");
for(int i = 3; i < 6; i++)
if((nlen[i] = XLENGTH(VECTOR_ELT(x, i))) > n) n = nlen[i];
if((nlen[8] = XLENGTH(VECTOR_ELT(x, 8))) > n) n = nlen[8];
if(n > 0) {
for(int i = 3; i < 6; i++)
if(nlen[i] == 0)
error(_("zero-length component in non-empty \"POSIXlt\" structure"));
if(nlen[8] == 0)
error(_("zero-length component in non-empty \"POSIXlt\" structure"));
}
/* coerce relevant fields to integer */
for(int i = 3; i < 6; i++)
SET_VECTOR_ELT(x, i, coerceVector(VECTOR_ELT(x, i), INTSXP));
PROTECT(ans = allocVector(REALSXP, n));
for(R_xlen_t i = 0; i < n; i++) {
tm.tm_sec = tm.tm_min = tm.tm_hour = 0;
tm.tm_mday = INTEGER(VECTOR_ELT(x, 3))[i%nlen[3]];
tm.tm_mon = INTEGER(VECTOR_ELT(x, 4))[i%nlen[4]];
tm.tm_year = INTEGER(VECTOR_ELT(x, 5))[i%nlen[5]];
/* mktime ignores tm.tm_wday and tm.tm_yday */
tm.tm_isdst = 0;
if(tm.tm_mday == NA_INTEGER || tm.tm_mon == NA_INTEGER ||
tm.tm_year == NA_INTEGER || validate_tm(&tm) < 0)
REAL(ans)[i] = NA_REAL;
else {
/* -1 must be error as seconds were zeroed */
double tmp = mktime00(&tm);
REAL(ans)[i] = (tmp == -1) ? NA_REAL : tmp/86400;
}
}
PROTECT(klass = mkString("Date"));
classgets(ans, klass);
UNPROTECT(3);
return ans;
}
Unfortunately my understanding of C is too limited to know why the attributes are lost here. My guess would be that it happens either during the coerceVector operation or when each element of the POSIXlt list is individually coerced to integers (if that's what happens at lines 1268-70); note also that ans is allocated fresh and only its class attribute is set on it, so any names on the input are never copied over.
But let's have a look at the other as.Date method, starting with the main offender, as.Date.POSIXct:
> as.Date.POSIXct
function (x, tz = "UTC", ...)
{
if (tz == "UTC") {
z <- floor(unclass(x)/86400)
attr(z, "tzone") <- NULL
structure(z, class = "Date")
}
else as.Date(as.POSIXlt(x, tz = tz))
}
<bytecode: 0x19c268bc>
<environment: namespace:base>
With this one, if no timezone is given, or if the timezone is "UTC", the function simply manipulates the underlying numeric values to produce a Date object, so the attributes are not lost; but if any other timezone is given, the object is first converted to POSIXlt and therefore passed on to the same POSIXlt2Date internal, which eventually loses the attributes! And indeed:
> as.Date(c(a = as.POSIXct("2016-01-01")), tz="UTC")
a
"2015-12-31"
> as.Date(c(a = as.POSIXct("2016-01-01")), tz="CET")
[1] "2016-01-01"
And finally, as #Roland mentioned, as.Date.numeric does keep the attributes:
> as.Date.numeric
function (x, origin, ...)
{
if (missing(origin))
stop("'origin' must be supplied")
as.Date(origin, ...) + x
}
<bytecode: 0x568943d4>
<environment: namespace:base>
origin is converted to Date via as.Date.character and then the numeric vector is added, which keeps the attributes because of this:
> c(a=1) + 2
a
3
So naturally:
> c(a=16814) + as.Date("1970-01-01")
a
"2016-01-14"
Until this discrepancy is taken care of, the only ways to keep your attributes, I think, are to first convert to POSIXct (but beware of timezone issues) or to numeric, or to copy the names from your original vector:
> before <- c(ind = "2015-07-04", nyd = "2016-01-01")
> after <- as.Date(before)
> names(after) <- names(before)
> after
ind nyd
"2015-07-04" "2016-01-01"
This isn't a full answer to the question, but as a way around the problem, no one has mentioned the mode() function:
vec <- c(a = "1", b = "2")
mode(vec) <- "integer"
vec
# returns:
# a b
# 1 2
I'm not sure how you'd apply this to dates though:
vec <- c(a = "2010-01-01")
mode(vec) <- "POSIXlt"
gives something, but it doesn't seem quite right.
You could also use
sapply(vec, as.whatever)
which will preserve names. However, I think this will be slower as you lose the advantage of a vectorised function.
Thirdly, there is:
structure(as.whatever(vec), names = names(vec))
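Or equivalently, using the setNames() call already shown in the question, which keeps it to a single expression:
setNames(as.whatever(vec), names(vec))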