How to clean a time column in R

I have a time column in R as:
22:34:47
06:23:15
7:35:15
5:45
How can I convert all the time values in the column to hh:mm:ss format? I used
as_date(a$time, tz=NULL), but I could not get the format I wanted.

Here is an option with parse_date_time which can take multiple formats
library(lubridate)
format(parse_date_time(time, c("HMS", "HM"), tz = "GMT"), "%H:%M:%S")
#[1] "22:34:47" "06:23:15" "07:35:15" "05:45:00"
data
time <- c("22:34:47", "06:23:15", "7:35:15", "5:45")

Nothing a bit of formatting can't take care of:
x <- c("22:34:47","06:23:15","7:35:15","5:45")
format(
pmax(
as.POSIXct(x, format="%T", tz="UTC"),
as.POSIXct(x, format="%R", tz="UTC"), na.rm=TRUE
),
"%T"
)
#[1] "22:34:47" "06:23:15" "07:35:15" "05:45:00"
The pmax means any additional seconds will be taken in preference to just hh:mm.
For less typing, and to make it easier to turn into a reusable function, you could take a functional approach:
do.call(pmax, c(lapply(c("%T","%R"), as.POSIXct, x=x, tz="UTC"), na.rm=TRUE))
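As a minimal sketch of that reusable function, the same pmax idea can be wrapped as below. The name clean_times and its formats argument are hypothetical and not part of the answer above. It assumes every value parses with at least one of the supplied formats; anything else comes back as NA.

```r
# Hypothetical helper: normalise "h:mm" and "h:mm:ss" strings to "hh:mm:ss".
clean_times <- function(x, formats = c("%H:%M:%S", "%H:%M")) {
  # Parse the input once per format; values that do not match give NA
  parsed <- lapply(formats, function(f) as.POSIXct(x, format = f, tz = "UTC"))
  # Keep the latest parse per element, so seconds win over plain hh:mm
  best <- do.call(pmax, c(parsed, na.rm = TRUE))
  format(best, "%H:%M:%S")
}

cleaned <- clean_times(c("22:34:47", "06:23:15", "7:35:15", "5:45"))
cleaned
#[1] "22:34:47" "06:23:15" "07:35:15" "05:45:00"
```

Extra formats can be added to the formats vector, as long as the more specific ones still parse to a later time than the less specific ones.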

A tidyverse approach using dplyr and the hms parsing functions:
library(dplyr)
library(hms)
a <- tibble(time = c("22:34:47", "06:23:15", "7:35:15", "5:45"))
a %>%
mutate(
time = case_when(
is.na(parse_hms(time)) ~ parse_hm(time),
TRUE ~ parse_hms(time)
)
)
# # A tibble: 4 x 1
# time
# <time>
# 1 22:34
# 2 06:23
# 3 07:35
# 4 05:45
Note that case_when could be replaced with dplyr::if_else (base ifelse would drop the hms class). The conditional is needed because parse_hms returns NA for values without seconds.
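To see why base ifelse() is a risky substitute for time-like columns: it keeps only the attributes of the test, so the class of the result is lost. The sketch below illustrates this with a base Date vector as a stand-in for an hms column (an assumption made so that it runs without extra packages).

```r
d <- as.Date(c("2020-01-01", NA))

# Base ifelse() returns a bare numeric vector: the Date class is gone
filled <- ifelse(is.na(d), as.Date("1970-01-01"), d)
class(filled)

# Restoring the class by hand is possible, but easy to forget
restored <- as.Date(filled, origin = "1970-01-01")
class(restored)
```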
You may also want the output to be a POSIXct value; the previous solution can be adapted to do so.
a %>%
mutate(
time = case_when(
is.na(parse_hms(time)) ~ as.POSIXct(parse_hm(time)),
TRUE ~ as.POSIXct(parse_hms(time))
)
)
# # A tibble: 4 x 1
# time
# <dttm>
# 1 1970-01-01 22:34:47
# 2 1970-01-01 06:23:15
# 3 1970-01-01 07:35:15
# 4 1970-01-01 05:45:00
Note this will set the date to origin, which is 1970-01-01 by default.
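If the 1970-01-01 placeholder date is unwanted, one option is to paste the actual calendar date onto the cleaned times before converting. This is only a sketch in base R: the date 2021-03-01 and the variable names are made up for illustration, and UTC is assumed.

```r
times <- c("22:34:47", "06:23:15", "07:35:15", "05:45:00")
day <- as.Date("2021-03-01")  # hypothetical date the times belong to

# Combine date and time into one string, then parse it as a date-time
stamps <- as.POSIXct(paste(day, times),
                     format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
stamp_text <- format(stamps, "%Y-%m-%d %H:%M:%S")
stamp_text
```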

Related

Rounding dates with round_date() in R

I am trying to convert dates in yyyymmdd format to just the year (yyyy) in R.
The question "how to convert numeric only year in Date in R?" has a very interesting answer: it gets R to treat an 8-digit entry (yyyymmdd) as a 4-digit year (yyyy) using the lubridate package, which is very useful for me.
In my old code I used round_date() for this:
library(dplyr)
library(lubridate)
date2<-c('01/01/2000','08/08/2000','16/03/2001','25/12/2000','29/02/2000')
name<-c('A','B','C','D','E')
df<-data.frame(date2,name)
df2 <- df %>%
mutate(date2 = dmy(date2)) %>%
mutate(year_date = round_date(date2,'year'))
df2
str(df2)
date2       name   year_date
<date>      <chr>  <date>
2000-01-01  A      2000-01-01
2000-08-08  B      2001-01-01
2001-03-16  C      2001-01-01
2000-12-25  D      2001-01-01
2000-02-29  E      2000-01-01
But I started to have problems with my statistical analysis when I discovered, for example, that the date 2000-08-08 was rounded up to 2001-01-01 instead of 2000-01-01 as I expected.
This is a very big problem for me, since information that belongs to the year 2005 has been moved to the year 2006, and I have more than 1400 rows in my database.
I noticed that dates in the second half of the year (after June) are rounded up to the next year, which is very bad.
How do I round the date 2000-08-08 to just 2000 instead of 2001?
Doesn't this (simpler, also only base R) operation do what you want?
> date2 <- c('01/01/2000','08/08/2000','16/03/2001','25/12/2000','29/02/2000')
> dd <- as.Date(date2, "%d/%m/%Y")
> yd <- format(dd, "%Y-01-01")
> dt <- as.Date(yd)
> D <- data.frame(date2=date2, date=dd, y=yd, d=dt)
> D
       date2       date          y          d
1 01/01/2000 2000-01-01 2000-01-01 2000-01-01
2 08/08/2000 2000-08-08 2000-01-01 2000-01-01
3 16/03/2001 2001-03-16 2001-01-01 2001-01-01
4 25/12/2000 2000-12-25 2000-01-01 2000-01-01
5 29/02/2000 2000-02-29 2000-01-01 2000-01-01
>
In essence we just extract the year component from the (parsed as date) Date object and append -01-01.
Edit: There are also trunc() operations for Date and Datetime objects. Oddly, truncation for years only works for Datetime (see the help page for trunc.Date for more) so this works too:
> as.Date(trunc(as.POSIXlt(dd), "years"))
[1] "2000-01-01" "2000-01-01" "2001-01-01" "2000-01-01" "2000-01-01"
>
Edit 2: We can use that last step in a cleaner, simpler solution: a data.frame with three columns holding the input data (as characters), the data parsed as a proper Date type, and the desired truncated year data, all using base R without further dependencies. If you prefer, you could rewrite it with the pipe and lubridate for the same result via a slightly slower route (which only matters for "large" data).
> date2 <- c('01/01/2000','08/08/2000','16/03/2001','25/12/2000','29/02/2000')
> pd <- as.Date(date2, "%d/%m/%Y")
> td <- as.Date(trunc(as.POSIXlt(pd), "years"))
> D <- data.frame(input = date2, parsed = pd, output = td)
> D
       input     parsed     output
1 01/01/2000 2000-01-01 2000-01-01
2 08/08/2000 2000-08-08 2000-01-01
3 16/03/2001 2001-03-16 2001-01-01
4 25/12/2000 2000-12-25 2000-01-01
5 29/02/2000 2000-02-29 2000-01-01
>
For real "production" use you may not need the data.frame or the intermediate results, which leads to a one-liner:
> as.Date(trunc(as.POSIXlt( as.Date(date2, "%d/%m/%Y") ), "years"))
[1] "2000-01-01" "2000-01-01" "2001-01-01" "2000-01-01" "2000-01-01"
>
which is likely the most compact and efficient conversion you can get.
If you want just the year (and not the date corresponding to the first day of the year) you can use lubridate::year().
df %>% mutate(across(date2,dmy),
year_date=year(date2))
If you do want the first day of the year then floor_date() will do the trick.
df %>% mutate(across(date2,dmy),
year_date=floor_date(date2,"year"))
or, if you only need the truncated date, you could go directly to mutate(year_date=floor_date(dmy(date2),"year"))
In base R, year() would be format(date2, "%Y"), as shown in @DirkEddelbuettel's answer.
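A small base R sketch of that last point, using the dates from the question. Note that format() returns character strings, so it is wrapped in as.integer() here on the assumption that a numeric year is wanted for grouping or arithmetic; the variable names are illustrative.

```r
date2 <- c('01/01/2000', '08/08/2000', '16/03/2001', '25/12/2000', '29/02/2000')

# Parse the strings first; format() needs a Date, not a character vector
parsed <- as.Date(date2, "%d/%m/%Y")

year_chr <- format(parsed, "%Y")   # character years
year_int <- as.integer(year_chr)   # integer years
year_int
#[1] 2000 2000 2001 2000 2000
```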
If you consult the round_date help page, you will also see floor_date:
library("lubridate")
library("dplyr")
date2 <- c('01/01/2000','08/08/2000','16/03/2001','25/12/2000','29/02/2000')
name <- c('A','B','C','D','E')
df <- data.frame(date2,name)
df2 <- df %>%
mutate(date2 = dmy(date2)) %>%
mutate(year_date = floor_date(date2,'year'))
df2

How to calculate time difference in milliseconds using R when formats are different?

I have a problem in R that is killing me! Can you help me?
I found a question in StackOverflow that gave me a very good explanation.
Here is the link: How to parse milliseconds?
I was able to implement the following code that works very well.
z2 <- strptime("10/2/20 11:16:17.682", "%d/%m/%y %H:%M:%OS")
z1 <- strptime("10/2/20 11:16:16.683", "%d/%m/%y %H:%M:%OS")
When I calculate z2-z1, I get
Time difference of 0.9989998 secs
Similarly, when I use
z3 <- strptime("130 11:16:16.683", "%j %H:%M:%OS")
z4 <- strptime("130 11:16:18.682", "%j %H:%M:%OS")
When I calculate z4-z3, I get
Time difference of 1.999 secs
What is my problem?
The first column has the format 130 18:25:50.408, with millions of rows!!!
The second column has the format 2020 130 18:25:51.357 that is like the first column but has the year 2020.
The first column is also from 2020, but as the year is not there R uses the current year.
First question:
How can I subtract the two columns? I know how to subtract columns in general.
What I do not know is how to subtract these two times.
For example, second time is 2020 130 18:25:51.357
and first time is 130 18:25:50.408
I guess that I can do it programmatically converting it to a string, and eliminating the 2020. However, I am hoping that a quicker solution is available using base R or the lubridate package.
Second question,
"%j %H:%M:%OS" is the format for 130 11:16:16.683
What is the format for 2020 130 18:25:51.357?
As explained before this is working very well:
z3 <- strptime("130 11:16:16.683", "%j %H:%M:%OS")
But, this is NOT working.
z7 <- strptime("2020 130 11:16:16.683", "%y %j %H:%M:%OS")
UPDATE 1
I solved the second question!
However, I have not figured out yet the first question.
For the second question, the mistake in the format was that instead of %y, I need to write %Y with upper case.
Here is one example:
later <- strptime("2020 130 11:16:17.683", "%Y %j %H:%M:%OS")
earlier <- strptime("2020 130 11:16:16.684", "%Y %j %H:%M:%OS")
difftime(later,earlier,units="secs")
The R result is:
Time difference of 0.9990001 secs
UPDATE 2
At this point, what is still pending is the following:
I need to subtract two times that were recorded on the same day in 2020.
The second time has the year; the first time does not.
later <- strptime("2020 130 11:16:17.683", "%Y %j %H:%M:%OS")
earlier <- strptime("130 11:16:16.684", "%j %H:%M:%OS")
difftime(later,earlier,units="secs")
R produces the following result:
Time difference of -31622399 secs
Why? Because we are in 2021 and the year is missing, R assigns the current year, 2021, to the vector earlier.
My columns have millions of rows.
At this point, my guess is that I would need to add 2020 with a concatenation or something like that. Is there any other method?
Thank you for your help!
Your object z2 is a POSIXlt ("POSIX list") object. This means it is a list of the individual components of your time.
print.default(z2)
# $sec
# [1] 17.682
#
# $min
# [1] 16
#
# $hour
# [1] 11
#
# $mday
# [1] 10
#
# $mon
# [1] 1
#
# $year
# [1] 120
#
# $wday
# [1] 1
#
# $yday
# [1] 40
#
# $isdst
# [1] 0
#
# $zone
# [1] "GMT"
#
# $gmtoff
# [1] NA
#
# attr(,"class")
# [1] "POSIXlt" "POSIXt"
When you do a subtraction, z2 - z1, R dispatches the operation to a function called -.POSIXt, which itself calls difftime. This function converts z2 to a POSIXct ("POSIX count") object, that is, a count of seconds since the beginning of the epoch, by default "1970-01-01".
options("digits" = 16)
print.default(as.POSIXct(z2))
# [1] 1581333377.682
# attr(,"class")
# [1] "POSIXct" "POSIXt"
# attr(,"tzone")
# [1] ""
difftime(z2, z1)
# Time difference of 0.9989998340606689 secs
R, like most software, works with double precision numerics. This means that arithmetic is imprecise, although approximately true. Most software will try to hide this imprecision by reducing the number of digits shown. That said, different numbers will give you different imprecision, so you might prefer referring directly to the list element of z2.
print.default(z2$sec - z1$sec)
# [1] 0.9989999999999988
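If the goal is a difference in whole milliseconds, a common way to sidestep the floating-point noise is to round after converting. A minimal sketch, assuming both timestamps are in the same time zone (UTC is set explicitly here so the result does not depend on the local zone):

```r
z1 <- strptime("10/2/20 11:16:16.683", "%d/%m/%y %H:%M:%OS", tz = "UTC")
z2 <- strptime("10/2/20 11:16:17.682", "%d/%m/%y %H:%M:%OS", tz = "UTC")

# Difference in seconds as a plain number, then rounded to whole milliseconds
diff_secs <- as.numeric(difftime(z2, z1, units = "secs"))
diff_ms <- round(diff_secs * 1000)
diff_ms
# [1] 999
```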
You could therefore apply the time difference using your favourite data.frame tools.
options("digits" = 6)
# character columns
df1 <- data.frame(
col1 = c("10/2/20 11:16:17.682", "10/2/20 11:16:16.683"),
col2 = c("130 11:16:16.683", "130 11:16:18.682"),
stringsAsFactors = FALSE)
library(dplyr)
# convert columns to POSIXlt
df2 <- mutate(df1,
col1 = strptime(col1, "%d/%m/%y %H:%M:%OS"),
col2 = strptime(stringr::str_c("2020 ", col2), "%Y %j %H:%M:%OS"),
diff_days = unclass(difftime(col2, col1, units = "days")))
df2
# col1 col2 diff_days
# 1 2020-02-10 11:16:17 2020-05-09 11:16:16 88.9583
# 2 2020-02-10 11:16:16 2020-05-09 11:16:18 88.9584
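For the original two-column problem, where only the second column carries the year, the same string-prefix idea can borrow the year from the second column row by row. This is a sketch in base R with made-up sample values in the formats described in the question; it assumes both columns always refer to the same year and uses UTC throughout.

```r
first  <- c("130 18:25:50.408", "130 11:16:16.684")            # no year
second <- c("2020 130 18:25:51.357", "2020 130 11:16:17.683")  # with year

# Take the year from the second column and prepend it to the first
year <- substr(second, 1, 4)
t1 <- as.POSIXct(strptime(paste(year, first), "%Y %j %H:%M:%OS", tz = "UTC"))
t2 <- as.POSIXct(strptime(second, "%Y %j %H:%M:%OS", tz = "UTC"))

# Vectorised difference in whole milliseconds
diff_ms <- round(as.numeric(difftime(t2, t1, units = "secs")) * 1000)
diff_ms
# [1] 949 999
```

paste(), substr() and strptime() are all vectorised, so the same lines apply unchanged to whole columns.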

Vectorised time zone conversion with lubridate

I have a data frame with a column of date-time strings:
library(tidyverse)
library(lubridate)
testdf = tibble(
mytz = c('Australia/Sydney', 'Australia/Adelaide', 'Australia/Perth'),
mydt = c('2018-01-17T09:15:00', '2018-01-17T09:16:00', '2018-01-17T09:18:00'))
testdf
# A tibble: 3 x 2
# mytz mydt
# <chr> <chr>
# 1 Australia/Sydney 2018-01-17T09:15:00
# 2 Australia/Adelaide 2018-01-17T09:16:00
# 3 Australia/Perth 2018-01-17T09:18:00
I want to convert these date-time strings to POSIX date-time objects with their respective timezones:
testdf %>% mutate(mydt_new = ymd_hms(mydt, tz = mytz))
Error in mutate_impl(.data, dots) :
Evaluation error: tz argument must be a single character string.
In addition: Warning message:
In if (tz != "UTC") { :
the condition has length > 1 and only the first element will be used
I get the same result if I use ymd_hms without a timezone and pipe it into force_tz. Is it fair to conclude that lubridate doesn't support any sort of vectorisation when it comes to timezone operations?
Another option is map2. It may be better to store the outputs with different time zones in a list, since combining them into one vector would coerce them to a single time zone.
library(tidyverse)
out <- testdf %>%
mutate(mydt_new = map2(mydt, mytz, ~ymd_hms(.x, tz = .y)))
If required, it can be unnested:
out %>%
unnest(mydt_new)
The values in the list are
out %>%
pull(mydt_new)
#[[1]]
#[1] "2018-01-17 09:15:00 AEDT"
#[[2]]
#[1] "2018-01-17 09:16:00 ACDT"
#[[3]]
#[1] "2018-01-17 09:18:00 AWST"
The message "tz argument must be a single character string" indicates that more than one time zone was passed to ymd_hms(). To make sure only one time zone is passed per call, I used rowwise(). Note that I am not in an Australian time zone, so I am not sure whether my output is identical to yours.
testdf <- tibble(mytz = c('Australia/Sydney', 'Australia/Adelaide', 'Australia/Perth'),
mydt = c('2018-01-17 09:15:00', '2018-01-17 09:16:00', '2018-01-17 09:18:00'))
testdf %>%
rowwise %>%
mutate(mydt_new = ymd_hms(mydt, tz = mytz))
  mytz               mydt                mydt_new
  <chr>              <chr>               <dttm>
1 Australia/Sydney   2018-01-17 09:15:00 2018-01-17 06:15:00
2 Australia/Adelaide 2018-01-17 09:16:00 2018-01-17 06:46:00
3 Australia/Perth    2018-01-17 09:18:00 2018-01-17 09:18:00
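Part of the difficulty is that a POSIXct vector stores a single tzone attribute, so one column cannot display a different zone per row. One workaround, sketched here in base R under the assumption that the Olson zone names are available on your system, is to parse each string in its own zone and keep the resulting instants in UTC. The helper name parse_in_zone is hypothetical.

```r
mytz <- c("Australia/Sydney", "Australia/Adelaide", "Australia/Perth")
mydt <- c("2018-01-17T09:15:00", "2018-01-17T09:16:00", "2018-01-17T09:18:00")

# Parse one string in one zone and return seconds since the epoch
parse_in_zone <- function(dt, tz) {
  as.numeric(as.POSIXct(dt, format = "%Y-%m-%dT%H:%M:%S", tz = tz))
}

# One call per row, each with its own single time zone
secs <- mapply(parse_in_zone, mydt, mytz, USE.NAMES = FALSE)
instants_utc <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
utc_text <- format(instants_utc, "%Y-%m-%d %H:%M:%S")
utc_text
# [1] "2018-01-16 22:15:00" "2018-01-16 22:46:00" "2018-01-17 01:18:00"
```

To show a single instant in its own zone again, format(instants_utc[1], tz = mytz[1]) converts it back to local time as a string.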

Epoch time and local time, different time zone

I want to convert epoch time to the local time. As you can see here, I have different time zones and I want to get a local time for each row. How can I do the conversion considering each time zone?
df <- data.frame(Epoch_Time = c(1460230930,1460231830, 1459929664),
Time_Zone = c("UTC−12:00", "UTC+10:00", "UTC-9:00"))
You need to store your epoch time as POSIX and then you can manipulate more easily.
library(dplyr)
library(lubridate)
df <- tibble(
time_epoch = as.POSIXct(
c(1460230930,1460231830, 1459929664), tz = "UTC", origin = "1970-01-01"
),
time_zone = c("UTC-12:00", "UTC+10:00", "UTC-09:00")
)
df <- mutate(df,
time_zone = as.numeric(substr(time_zone, 4, 6)),
time_local = as.character(time_epoch + hours(time_zone))
)
df
# # A tibble: 3 x 3
#   time_epoch          time_zone time_local
#   <dttm>                  <dbl> <chr>
# 1 2016-04-09 19:42:10       -12 2016-04-09 07:42:10
# 2 2016-04-09 19:57:10        10 2016-04-10 05:57:10
# 3 2016-04-06 08:01:04        -9 2016-04-05 23:01:04
Notes:
- I haven't put in the effort to properly generalise the conversion from your UTC strings, only enough for this example. Ideally, you want Olson names instead of offsets; you can get these here.
- time_local is stored as character because you cannot store a date/time column with multiple time zones; the time zone is stored as a single value for the whole column, see attributes(df$time_epoch).
attributes(df$time_epoch)
# $class
# [1] "POSIXct" "POSIXt"
#
# $tzone
# [1] "UTC"
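As a rough sketch of how the offset parsing could be generalised, the hypothetical helper offset_seconds below reads the sign, hours and minutes from strings of the form "UTC+hh:mm" and returns the offset in seconds, so half-hour offsets also work. It assumes plain ASCII "+" and "-" signs; anything that does not match the pattern yields NA.

```r
# Hypothetical helper: "UTC-09:00" -> -32400, "UTC+05:30" -> 19800
offset_seconds <- function(z) {
  m <- regmatches(z, regexec("^UTC([+-])([0-9]{1,2}):([0-9]{2})$", z))
  vapply(m, function(p) {
    if (length(p) == 0) return(NA_real_)          # no match
    sign <- if (p[2] == "-") -1 else 1
    sign * (as.numeric(p[3]) * 3600 + as.numeric(p[4]) * 60)
  }, numeric(1))
}

epoch <- c(1460230930, 1460231830, 1459929664)
zone  <- c("UTC-12:00", "UTC+10:00", "UTC-09:00")

# Shift the epoch by the offset, then print the wall-clock time as text
local_text <- format(
  as.POSIXct(epoch + offset_seconds(zone), origin = "1970-01-01", tz = "UTC"),
  "%Y-%m-%d %H:%M:%S"
)
local_text
# [1] "2016-04-09 07:42:10" "2016-04-10 05:57:10" "2016-04-05 23:01:04"
```

The result is kept as character for the same reason as above: the shifted values are wall-clock labels, not instants in a shared time zone.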

Merge on date and hour range

I have a dataframe of datetimes, like so:
library(lubridate)
date_seq <- seq.POSIXt(ymd_hm('2016-04-01 0:00'), ymd_hm('2016-04-30 23:30'), by = '30 mins')
datetimes <- data.frame(datetime = date_seq)
I've also got a dataframe containing opening times that specify a range of days over which the opening times apply and an hour range over which the store is open for the days in the date range, like so:
opening_times <- data.frame(from_date = c('2016-03-01', '2016-04-15'),
till_date = c('2016-04-15', '2016-05-20'),
from_time = c('11:00', '10:30'),
till_time = c('22:00', '23:00'))
What I would like is to mark in datetimes those rows which are inside the opening hours. That is, I want a column that is TRUE whenever the datetime in the row is within both from_date and till_date and within from_time and till_time.
If the dataset isn't too big, I'd recommend creating a new dataset from opening_times with one row per date:
opening_times$from_date = as.Date(opening_times$from_date, '%Y-%m-%d')
opening_times$till_date = as.Date(opening_times$till_date, '%Y-%m-%d')
opening_times2 = do.call(
rbind,
lapply(
seq(nrow(opening_times)),
function (rownumber) {
data.frame(
date = seq.Date(
from = opening_times[rownumber,'from_date'],
to = opening_times[rownumber,'till_date'],
by = 1
),
from_time = opening_times[rownumber,'from_time'],
till_time = opening_times[rownumber,'till_time']
)
}
)
)
and then merging it with datetimes by date and checking whether the time falls between the two values.
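The merge-and-check step could look roughly like the sketch below. It is self-contained and uses a small made-up per-day table (opening_days) in place of opening_times2, two days of half-hourly timestamps, and UTC throughout; all of these are assumptions for illustration. Zero-padded "HH:MM" strings compare correctly as text, which keeps the check simple.

```r
# Half-hourly timestamps for two days (UTC assumed for simplicity)
datetimes <- data.frame(
  datetime = seq(as.POSIXct("2016-04-01 00:00", tz = "UTC"),
                 as.POSIXct("2016-04-02 23:30", tz = "UTC"),
                 by = "30 min")
)

# Hypothetical per-day opening hours, one row per date
opening_days <- data.frame(
  date = as.Date(c("2016-04-01", "2016-04-02")),
  from_time = c("11:00", "11:00"),
  till_time = c("22:00", "22:00"),
  stringsAsFactors = FALSE
)

# Split each timestamp into a date key and a zero-padded time string
datetimes$date <- as.Date(datetimes$datetime, tz = "UTC")
datetimes$time <- format(datetimes$datetime, "%H:%M")

# Left join on date, then flag rows whose time lies inside the opening hours
merged <- merge(datetimes, opening_days, by = "date", all.x = TRUE)
merged$open <- !is.na(merged$from_time) &
  merged$time >= merged$from_time &
  merged$time <= merged$till_time
merged <- merged[order(merged$datetime), ]
head(merged[merged$open, c("datetime", "open")], 3)
```

Note that in the question's data 2016-04-15 falls in both date ranges, so a merge like this would return two rows for each timestamp on that day; you would need to decide which range wins or collapse the duplicates with any().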
lubridate has a %within% function for checking whether a time is within a lubridate::interval which can make this easy once you create a vector of intervals:
# make a sequence of days in each set from opening_times
open_intervals <- apply(opening_times, 1, function(x){
dates <- seq.Date(ymd(x[1]), ymd(x[2]), by = 'day')
})
# turn each date into a lubridate::interval object with the appropriate times
open_intervals <- mapply(function(dates, from, to){
interval(ymd_hm(paste(dates, from)), ymd_hm(paste(dates, to)))
}, open_intervals, opening_times$from_time, opening_times$till_time)
# combine list items into one vector of intervals
open_intervals <- do.call(c, open_intervals)
# use lubridate::%within% to check if each datetime is in any open interval
datetimes$open <- sapply(datetimes$datetime, function(x){
any(x %within% open_intervals)
})
datetimes[20:26,]
# datetime open
# 20 2016-04-01 09:30:00 FALSE
# 21 2016-04-01 10:00:00 FALSE
# 22 2016-04-01 10:30:00 FALSE
# 23 2016-04-01 11:00:00 TRUE
# 24 2016-04-01 11:30:00 TRUE
# 25 2016-04-01 12:00:00 TRUE
# 26 2016-04-01 12:30:00 TRUE
Edit
If you have exactly two sets of hours, you can condense the whole thing into a (somewhat huge) ifelse:
datetimes$open <- ifelse(as.Date(datetimes$datetime) %within%
interval(opening_times$from_date[1],
opening_times$till_date[1]),
hm(format(datetimes$datetime, '%H:%M')) >= hm(opening_times$from_time)[1] &
hm(format(datetimes$datetime, '%H:%M')) <= hm(opening_times$till_time)[1],
hm(format(datetimes$datetime, '%H:%M')) >= hm(opening_times$from_time)[2] &
hm(format(datetimes$datetime, '%H:%M')) <= hm(opening_times$till_time)[2])
or
datetimes$open <- ifelse(as.Date(datetimes$datetime) %within%
interval(opening_times$from_date[1],
opening_times$till_date[1]),
datetimes$datetime %within%
interval(ymd_hm(paste(as.Date(datetimes$datetime), opening_times$from_time[1])),
ymd_hm(paste(as.Date(datetimes$datetime), opening_times$till_time[1]))),
datetimes$datetime %within%
interval(ymd_hm(paste(as.Date(datetimes$datetime), opening_times$from_time[2])),
ymd_hm(paste(as.Date(datetimes$datetime), opening_times$till_time[2]))))
