R: Lubridate Failing to Convert Character to Numeric

I'm new to R. I searched old posts for an answer but didn't come across anything that resolved my issue.
I pulled in a csv with the time a trip started in the mdy h:mm:ss format, but it is currently recognized as a character. I've tried to use
mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00"))
as well as
as.Date(parse_date_time(dc_biketrips$started_at, c(mdy_hms)))
to no avail.
Does anyone have any suggestions for how I could fix this?
UPDATE: I also tried
date <- mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00"))
str(date)
but this also did not work.
[image: attempt to use date <-mdy_hms(C("11/1/2020 0:05:00"etc]
[image of csv]

The first of your two options works:
library(lubridate)
date <-mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00"))
str(date)
# POSIXct[1:3], format: "2020-11-01 00:05:00" "2020-11-01 07:29:00" "2020-11-01 14:04:00"
How were your data "pulled in"?
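If the column came from read.csv, a rough sketch of the whole round trip might look like the following. The CSV text and the ride_id column are made up for illustration; only started_at and dc_biketrips come from the question. Note also that R is case-sensitive, so the vector constructor has to be lowercase c(); C() is a different function.

```r
library(lubridate)

# Hypothetical stand-in for the real file: a small CSV held in a string.
csv_text <- "ride_id,started_at
1,11/1/2020 0:05:00
2,11/1/2020 7:29:00
3,11/1/2020 14:04:00"

# stringsAsFactors = FALSE keeps started_at as character rather than factor.
dc_biketrips <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Parse the whole column in one call; mdy_hms() returns POSIXct in UTC.
dc_biketrips$started_at <- mdy_hms(dc_biketrips$started_at)

str(dc_biketrips$started_at)

# Count values that failed to parse (these become NA).
sum(is.na(dc_biketrips$started_at))
```

If the last line reports anything other than 0, looking at the raw strings for those rows is probably the quickest way to see what differs from the expected format.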

One option would be to use as.POSIXct:
started_at <- c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00")
as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS")
#> [1] "2020-11-01 00:05:00 CET" "2020-11-01 07:29:00 CET"
#> [3] "2020-11-01 14:04:00 CET"
EDIT
library(lubridate)
library(dplyr)
started_at <- c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00")
Both as.POSIXct and lubridate::mdy_hms return an object of class "POSIXct" "POSIXt"
class(as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS"))
#> [1] "POSIXct" "POSIXt"
class(mdy_hms(started_at))
#> [1] "POSIXct" "POSIXt"
Not sure what you expect. When I run your code everything works fine except that we end up with 0 obs after filtering for week < 15 as all the dates in the example data are from week 44:
dc_biketrips <- data.frame(
started_at
)
dc_biketrips <- dc_biketrips %>%
mutate(started_at = as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS"),
interval60 = floor_date(started_at, unit = "hour"),
interval15 = floor_date(started_at, unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
dc_biketrips
#> started_at interval60 interval15 week dotw
#> 1 2020-11-01 00:05:00 2020-11-01 00:00:00 2020-11-01 00:00:00 44 So
#> 2 2020-11-01 07:29:00 2020-11-01 07:00:00 2020-11-01 07:15:00 44 So
#> 3 2020-11-01 14:04:00 2020-11-01 14:00:00 2020-11-01 14:00:00 44 So
dc_biketrips %>%
filter(week < 15)
#> [1] started_at interval60 interval15 week dotw
#> <0 rows> (or 0-length row.names)

Related

Lubridate hour() does not function with times derived from parse_date_time()

I do not understand why a time derived from the function parse_date_time is not usable by another function in lubridate.
This produces a df that has the dates with am/pm parsed correctly.
dt2 <- data.frame('date_time' = c("11/24/19 06:00:00 PM",
"11/25/19 12:00:00 AM",
"11/25/19 06:00:00 AM",
"11/25/19 12:00:00 PM",
"11/25/19 06:00:00 PM",
"11/26/19 12:00:00 AM"),
'date' = c(1:6), 'time' = c(1:6)) %>%
mutate(date_time = parse_date_time(date_time, orders = "mdy IMS %p"),
date = date(date_time),
time = strftime(date_time,"%H:%M:%S", tz = "UTC"))
When I try to extract the hour from the time column I get errors:
dt2 <- dt2 %>% mutate(hour_from_hour = hour(time))
Error: Problem with mutate() column hour_from_hour.
i hour_from_hour = hour(time).
x character string is not in a standard unambiguous format
But when I use the original variable "date_time" it works fine.
dt2 <- dt2 %>% mutate(hour_from_date_time = hour(date_time))
My data sets have variable headers (some are in date time, some are already parsed). It would be nice if I could use hour() on the time column.
R doesn't have a native way to handle times that aren't associated with a day. But you can use a package like hms. For example:
library(tidyverse)
library(lubridate)
library(hms)
dt2 <- data.frame('date_time' = c("11/24/19 06:00:00 PM",
"11/25/19 12:00:00 AM",
"11/25/19 06:00:00 AM",
"11/25/19 12:00:00 PM",
"11/25/19 06:00:00 PM",
"11/26/19 12:00:00 AM"),
'date' = c(1:6), 'time' = c(1:6)) %>%
mutate(date_time = parse_date_time(date_time, orders = "mdy IMS %p"),
date = date(date_time),
time = as_hms(date_time),
hour = hour(time))
But to be honest, it's probably better to keep the date_time column and use hour directly on it.
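For what it's worth, hour(time) fails in the question because time is a plain character string such as "18:00:00", and the default conversion to a date-time cannot interpret a string with no date in it. A minimal base R sketch, assuming the strings always have the HH:MM:SS layout produced by strftime in the question, is to give strptime an explicit format:

```r
# Time-of-day strings as produced by strftime(date_time, "%H:%M:%S").
time <- c("18:00:00", "00:00:00", "06:00:00", "12:00:00")

# strptime() fills in today's date; only the $hour field is used here.
hour_from_time <- strptime(time, format = "%H:%M:%S", tz = "UTC")$hour
hour_from_time
```

This only recovers the hour; the date that strptime fills in is arbitrary and should not be used for anything.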
If I understood your question correctly this code answers it. It first extracts the two digits for the hour as a character string and then converts them to an integer. The code assumes leading zeros and no leading spaces. The regular expression needs to be edited if cases with different formatting are to be handled. The solution is rather simple once one finds which functions to use, but it is not trivial, I think.
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stringr)
dt2 <- data.frame('date_time' = c("11/24/19 06:00:00 PM",
"11/25/19 12:00:00 AM",
"11/25/19 06:00:00 AM",
"11/25/19 12:00:00 PM",
"11/25/19 06:00:00 PM",
"11/26/19 12:00:00 AM"),
'date' = c(1:6), 'time' = c(1:6)) %>%
mutate(date_time = parse_date_time(date_time, orders = "mdy IMS %p"),
date = date(date_time),
time = strftime(date_time,"%H:%M:%S", tz = "UTC"))
# hour is of mode character, assuming that TZ is always UTC
dt2 <- dt2 %>% mutate(hour_from_hour = as.integer(str_extract(time, "^[0-2][0-9]")),
hour_from_date_time = hour(date_time))
identical(dt2$hour_from_hour, dt2$hour_from_date_time)
#> [1] TRUE
dt2
#> date_time date time hour_from_hour hour_from_date_time
#> 1 2019-11-24 18:00:00 2019-11-24 18:00:00 18 18
#> 2 2019-11-25 00:00:00 2019-11-25 00:00:00 0 0
#> 3 2019-11-25 06:00:00 2019-11-25 06:00:00 6 6
#> 4 2019-11-25 12:00:00 2019-11-25 12:00:00 12 12
#> 5 2019-11-25 18:00:00 2019-11-25 18:00:00 18 18
#> 6 2019-11-26 00:00:00 2019-11-26 00:00:00 0 0
Created on 2021-12-21 by the reprex package (v2.0.1)

How to round datetime to nearest time of day, preferably vectorized?

Say I have a POSIXct vector like
timestamps = seq(as.POSIXct("2021-01-23"), as.POSIXct("2021-01-24"), length.out = 6)
I would like to round these times up to the nearest hour of the day in a vector:
hours_of_day = c(6, 14, 20)
i.e., the following result:
timestamps result
1 2021-01-23 00:00:00 2021-01-23 02:00:00
2 2021-01-23 04:48:00 2021-01-23 14:00:00
3 2021-01-23 09:36:00 2021-01-23 14:00:00
4 2021-01-23 14:24:00 2021-01-23 20:00:00
5 2021-01-23 19:12:00 2021-01-23 20:00:00
6 2021-01-24 00:00:00 2021-01-24 02:00:00
Is there a vectorized solution to this (or otherwise fast)? I have a few million timestamps and need to apply it for several hours_of_day.
One way to simplify this problem is to (1) find the next hours_of_day for each lubridate::hour(timestamps) and then (2) result = lubridate::floor_date(timestamps) + next_hour_of_day * 3600. But how to do step 1 vectorized?
Convert to POSIXlt with as.POSIXlt, which lets you extract hours, minutes and seconds and compute decimal hours. In an lapply/sapply combination, first check which entries of the hours-of-day vector are at least as large as each decimal hour, then use which.max to pick the first such entry. Now create the new date-time using ISOdate, and add one day wherever the result is earlier than the original time.
timestamps <- as.POSIXlt(timestamps)
h <- hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600),
`<=`, hours_of_day), which.max)]
r <- with(timestamps, ISOdate(1900 + year, mon + 1, mday, h,
tz=attr(timestamps, "tzone")[[1]]))
r[r < timestamps] <- r[r < timestamps] + 86400
Result
r
# [1] "2021-01-23 06:00:00 CET" "2021-01-23 06:00:00 CET"
# [3] "2021-01-23 14:00:00 CET" "2021-01-23 20:00:00 CET"
# [5] "2021-01-23 20:00:00 CET" "2021-01-24 06:00:00 CET"
# [7] "2021-01-25 06:00:00 CET" "2021-01-27 20:00:00 CET"
data.frame(timestamps, r)
# timestamps r
# 1 2021-01-23 00:00:00 2021-01-23 06:00:00
# 2 2021-01-23 04:48:00 2021-01-23 06:00:00
# 3 2021-01-23 09:36:00 2021-01-23 14:00:00
# 4 2021-01-23 14:24:00 2021-01-23 20:00:00
# 5 2021-01-23 19:12:00 2021-01-23 20:00:00
# 6 2021-01-24 00:00:00 2021-01-24 06:00:00
# 7 2021-01-24 23:59:00 2021-01-25 06:00:00
# 8 2021-01-27 20:00:00 2021-01-27 20:00:00
Note: I've added "2021-01-24 23:59:00 CET" to timestamps to demonstrate the date change.
Benchmark
Tested on a length 1.4e6 vector.
# Unit: seconds
# expr min lq mean median uq max neval cld
# POSIX() 32.96197 33.06495 33.32104 33.16793 33.50057 33.83321 3 a
# lubridate() 47.36412 47.57762 47.75280 47.79113 47.94715 48.10316 3 b
Data:
timestamps <- structure(c(1611356400, 1611373680, 1611390960, 1611408240, 1611425520,
1611442800, 1611529140, 1611774000), class = c("POSIXct", "POSIXt"
))
hours_of_day <- c(6, 14, 20)
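Since the question asks for something vectorized, here is a sketch of the same rounding-up idea without lapply/sapply, using base R's findInterval. The function name ceiling_to_hours is made up, the time zone is fixed to UTC as an assumption, and hours_of_day has to be sorted ascending. Because it adds plain seconds to midnight, it will be off by an hour on days with a daylight-saving change in a local time zone.

```r
# Sample timestamps, with an explicit time zone (an assumption).
timestamps <- as.POSIXct(c("2021-01-23 00:00:00", "2021-01-23 04:48:00",
                           "2021-01-23 09:36:00", "2021-01-23 14:24:00",
                           "2021-01-23 19:12:00", "2021-01-24 23:59:00"),
                         tz = "UTC")
hours_of_day <- c(6, 14, 20)  # must be sorted ascending

# Hypothetical helper: round each timestamp up to the next hour of day.
ceiling_to_hours <- function(x, hours) {
  lt <- as.POSIXlt(x)
  dec_hour <- lt$hour + lt$min / 60 + lt$sec / 3600
  # Index of the first entry in `hours` that is >= dec_hour.
  # left.open = TRUE makes an exact match (e.g. 14:00) map to itself.
  idx <- findInterval(dec_hour, hours, left.open = TRUE) + 1L
  # Past the last hour of the day: wrap to the first hour of the next day.
  wrap <- idx > length(hours)
  idx[wrap] <- 1L
  midnight <- as.POSIXct(trunc(x, units = "days"))
  midnight + hours[idx] * 3600 + wrap * 86400
}

result <- ceiling_to_hours(timestamps, hours_of_day)
data.frame(timestamps, result)
```

I have not benchmarked this against the versions above, so treat any speed advantage as a guess until you time it on your own data.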
I would extract the hour component, use cut to bin it, and assign the binned hours back to the original:
hours_of_day = c(2, 14, 20)
library(lubridate)
library(magrittr) ## just for the pipe
new_hours = timestamps %>%
hour %>%
cut(breaks = c(0, hours_of_day), labels = hours_of_day, include.lowest = TRUE) %>%
as.character() %>%
as.integer()
result = floor_date(timestamps, "hour")
hour(result) = new_hours
result
# [1] "2021-01-23 02:00:00 EST" "2021-01-23 14:00:00 EST" "2021-01-23 14:00:00 EST"
# [4] "2021-01-23 14:00:00 EST" "2021-01-23 20:00:00 EST" "2021-01-24 02:00:00 EST"
Building on the approach by @jay.sf, I made a function for floor as well while adding support for NA values.
floor_date_to = function(timestamps, hours_of_day) {
# Handle NA with a temporary filler so code below doesn't break
na_timestamps = is.na(timestamps)
timestamps[na_timestamps] = as.POSIXct("9999-12-31")
# Proceed as usual
timestamps = as.POSIXlt(timestamps)
hours_of_day = rev(hours_of_day) # floor-specific: because which.max returns the first index by default
nearest_hour = hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600), `<`, hours_of_day), function(x) which.max(-x))] # floor-specific: negative which.max()
rounded = with(timestamps, ISOdate(1900 + year, mon + 1, mday, nearest_hour, tz = attr(timestamps, "tzone")[1]))
rounded[rounded > timestamps] = rounded[rounded > timestamps] - 86400 # floor: use minus
rounded[na_timestamps] = NA # Overwrite with NA again
return(rounded)
}

From character to dates-time in R using decimal hours

I'm trying to convert characters to dates in R.
My data has the following format:
df <- data.frame(Date = c("23Jul2019 11:51:09 AM","23Jul2019 11:53:09 AM","19Jul2019 2:30:06 PM","01Aug2019 3:00:17 PM"))
Based on the solution found here:
Convert character to Date in R
I could use
> as.Date(df$Date, "%d/%b/%Y %I:%M:%S %p")
[1] NA NA NA NA
%I is for decimal hour (12h) and %p is for locale-specific AM/PM (https://www.stat.berkeley.edu/~s133/dates.html), but for some reason it returns NAs.
My goal is to sort the rows of a data frame by date and time once the dates in character format have been converted to Dates in R.
What is the issue with the code I'm using?
This should solve it
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
df <- data.frame(Date = c("23Jul2019 11:51:09 AM","23Jul2019 11:53:09 AM","19Jul2019 2:30:06 PM","01Aug2019 3:00:17 PM"))
df %>%
mutate(r_date_time = Date %>% dmy_hms)
#> Date r_date_time
#> 1 23Jul2019 11:51:09 AM 2019-07-23 11:51:09
#> 2 23Jul2019 11:53:09 AM 2019-07-23 11:53:09
#> 3 19Jul2019 2:30:06 PM 2019-07-19 14:30:06
#> 4 01Aug2019 3:00:17 PM 2019-08-01 15:00:17
dmy_hms(df$Date)
#> [1] "2019-07-23 11:51:09 UTC" "2019-07-23 11:53:09 UTC"
#> [3] "2019-07-19 14:30:06 UTC" "2019-08-01 15:00:17 UTC"
Created on 2020-01-14 by the reprex package (v0.3.0)
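As for what was wrong with the original call: the format string "%d/%b/%Y ..." expects slashes between day, month and year, but the data has none, which would explain the NAs. as.Date would also drop the time of day even if the format matched. A base R sketch that keeps the time and then sorts, assuming an English locale (both %b and %p are locale-dependent) and UTC as the time zone:

```r
df <- data.frame(Date = c("23Jul2019 11:51:09 AM", "23Jul2019 11:53:09 AM",
                          "19Jul2019 2:30:06 PM", "01Aug2019 3:00:17 PM"),
                 stringsAsFactors = FALSE)

# No "/" between day, month and year, to match the input strings.
# %b and %p are locale-dependent; an English locale is assumed here.
df$r_date_time <- as.POSIXct(df$Date, format = "%d%b%Y %I:%M:%S %p", tz = "UTC")

# Sort rows chronologically by the parsed column.
df_sorted <- df[order(df$r_date_time), ]
df_sorted
```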

lubridate: inconsistent behavior with timezones

Consider the following example
library(lubridate)
library(tidyverse)
> hour(ymd_hms('2008-01-04 00:00:00'))
[1] 0
Now,
dataframe <- data_frame(time = c(ymd_hms('2008-01-04 00:00:00'),
ymd_hms('2008-01-04 00:01:00'),
ymd_hms('2008-01-04 00:02:00'),
ymd_hms('2008-01-04 00:03:00')),
value = c(1,2,3,4))
mutate(dataframe,hour = strftime(time, format="%H:%M:%S"),
hour2 = hour(time))
# A tibble: 4 × 4
time value hour hour2
<dttm> <dbl> <chr> <int>
1 2008-01-03 19:00:00 1 19:00:00 19
2 2008-01-03 19:01:00 2 19:01:00 19
3 2008-01-03 19:02:00 3 19:02:00 19
4 2008-01-03 19:03:00 4 19:03:00 19
What is going on here? Why are the dates converted into some local time which I don't even know?
This is not an issue with lubridate, but with the way POSIXct values are combined into a vector.
You have
> ymd_hms('2008-01-04 00:01:00')
[1] "2008-01-04 00:01:00 UTC"
But when combining into a vector you get
> c(ymd_hms('2008-01-04 00:01:00'), ymd_hms('2008-01-04 00:01:00'))
[1] "2008-01-03 19:01:00 EST" "2008-01-03 19:01:00 EST"
The reason is that the tzone attribute gets lost when combining POSIXct values (see c.POSIXct).
> attributes(ymd_hms('2008-01-04 00:01:00'))
$tzone
[1] "UTC"
$class
[1] "POSIXct" "POSIXt"
but
> attributes(c(ymd_hms('2008-01-04 00:01:00')))
$class
[1] "POSIXct" "POSIXt"
What you can use instead is
> ymd_hms(c('2008-01-04 00:01:00', '2008-01-04 00:01:00'))
[1] "2008-01-04 00:01:00 UTC" "2008-01-04 00:01:00 UTC"
which will use the default tz = "UTC" for all arguments.
You also need to pass tz = "UTC" into strftime because its default is your current time zone (unlike ymd_hms which defaults to UTC).
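Putting both points together, a small sketch of the fixed version could look like this. As far as I know, newer versions of R (4.1.0 and later) keep the tzone attribute in c() when all inputs share it, so the original symptom may not reproduce there; parsing the whole character vector in one call should behave the same way on either side of that change.

```r
library(lubridate)

# Parse all strings in one call so the result carries tzone = "UTC".
time <- ymd_hms(c("2008-01-04 00:00:00", "2008-01-04 00:01:00",
                  "2008-01-04 00:02:00", "2008-01-04 00:03:00"))

attr(time, "tzone")

# hour() reads the time zone from the object itself.
hour(time)

# strftime() defaults to the session time zone, so pass tz explicitly.
strftime(time, format = "%H:%M:%S", tz = "UTC")
```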

Text process using R

I am quite new to programming and to R.
My data-set includes date-time variables as following:
2007/11/0103
2007/11/0104
2007/11/0105
2007/11/0106
I need an operation that counts from the left up to character number 10, inserts a space, copies the last two characters, and then adds :00, for all columns.
Expected results:
2007/11/01 03:00
2007/11/01 04:00
2007/11/01 05:00
2007/11/01 06:00
If you want to actually turn your data into a "POSIXlt" "POSIXt" class in R (so you could subtract/add days, minutes and etc from/to it) you could do
# Your data
temp <- c("2007/11/0103", "2007/11/0104", "2007/11/0105", "2007/11/0106")
temp2 <- strptime(temp, "%Y/%m/%d%H")
## [1] "2007-11-01 03:00:00 IST" "2007-11-01 04:00:00 IST" "2007-11-01 05:00:00 IST" "2007-11-01 06:00:00 IST"
You could then extract hours for example
temp2$hour
## [1] 3 4 5 6
Add hours
temp2 + 3600
## [1] "2007-11-01 04:00:00 IST" "2007-11-01 05:00:00 IST" "2007-11-01 06:00:00 IST" "2007-11-01 07:00:00 IST"
And so on. If you just want the format you mentioned in your question (which is just a character string), you can also do
format(strptime(temp, "%Y/%m/%d%H"), format = "%Y/%m/%d %H:%M")
#[1] "2007/11/01 03:00" "2007/11/01 04:00" "2007/11/01 05:00" "2007/11/01 06:00"
Try
library(lubridate)
dat <- read.table(text="2007/11/0103
2007/11/0104
2007/11/0105
2007/11/0106",header=F,stringsAsFactors=F)
dat$V1 <- format(ymd_h(dat$V1),"%Y/%m/%d %H:%M")
dat
# V1
# 1 2007/11/01 03:00
# 2 2007/11/01 04:00
# 3 2007/11/01 05:00
# 4 2007/11/01 06:00
Suppose your dates are a vector named dates
library(stringr)
paste0(paste(str_sub(dates, end=10), str_sub(dates, 11)), ":00")
paste and substr are your friends here. Type ? before either to see the documentation
my.parser <- function(a){
paste0(substr(a, 1, 10), ' ', substr(a, 11, 12), ':00') # paste0 is like paste but does not add whitespace
}
a<- '2007/11/0103'
my.parser(a) # = "2007/11/01 03:00"
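Another option along the same lines is a single regular expression with base R's sub. This is only a sketch and assumes every string is exactly 12 characters long, as in the question; strings that don't match the pattern are returned unchanged.

```r
dates <- c("2007/11/0103", "2007/11/0104", "2007/11/0105", "2007/11/0106")

# Group 1: the first 10 characters (the date); group 2: the last two (the hour).
out <- sub("^(.{10})(.{2})$", "\\1 \\2:00", dates)
out
```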

Resources