I am trying to combine dates and times. They come from a file that, when imported, looks like this:
library(tidyverse)
library(lubridate)
bookings <- structure(list(booking_date = structure(c(1549670400, 1550275200,
1550880000, 1551484800, 1552089600, 1552694400), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), start_time = structure(c(-2209043700,
-2209043700, -2209043700, -2209043700, -2209043700, -2209043700
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Which looks like this:
# A tibble: 6 x 2
booking_date start_time
<dttm> <dttm>
1 2019-02-09 00:00:00 1899-12-31 08:45:00
2 2019-02-16 00:00:00 1899-12-31 08:45:00
3 2019-02-23 00:00:00 1899-12-31 08:45:00
4 2019-03-02 00:00:00 1899-12-31 08:45:00
5 2019-03-09 00:00:00 1899-12-31 08:45:00
6 2019-03-16 00:00:00 1899-12-31 08:45:00
Obviously the date in the start_time column is wrong. It should be combined with the booking date, so that the first row should read 2019-02-09 08:45:00.
What would the best way of doing that be? I have tried this (based on this other answer), which doesn't really work in my situation.
bookings %>%
select(booking_date, start_time) %>%
mutate(time_2 = as.character(start_time)) %>%
mutate(time_3 = str_sub(time_2, -8, -1)) %>%
mutate(booking_start = dmy(paste(booking_date, time_3)))
Thanks.
If you want start_time to take its date from booking_date, a base R approach is to paste the date part of booking_date to the time part of start_time and convert the result to POSIXct.
# tz = "UTC" keeps the result in the same time zone as the input columns
bookings$start_time <- as.POSIXct(paste(as.Date(bookings$booking_date),
format(bookings$start_time, "%T")), tz = "UTC")
bookings
# A tibble: 6 x 2
# booking_date start_time
# <dttm> <dttm>
#1 2019-02-09 00:00:00 2019-02-09 08:45:00
#2 2019-02-16 00:00:00 2019-02-16 08:45:00
#3 2019-02-23 00:00:00 2019-02-23 08:45:00
#4 2019-03-02 00:00:00 2019-03-02 08:45:00
#5 2019-03-09 00:00:00 2019-03-09 08:45:00
#6 2019-03-16 00:00:00 2019-03-16 08:45:00
If you want to use it in pipes you can do
library(dplyr)
bookings %>%
mutate(start_time = as.POSIXct(paste(as.Date(booking_date),
format(start_time, "%T")), tz = "UTC"))
We can also do this with lubridate::date.
date() <- lets you set the date component of a date/time object:
# Set the date component of start_time to be the date component of booking_date
date(bookings$start_time) <- bookings$booking_date
bookings
# A tibble: 6 x 2
booking_date start_time
<dttm> <dttm>
1 2019-02-09 00:00:00 2019-02-09 08:45:00
2 2019-02-16 00:00:00 2019-02-16 08:45:00
3 2019-02-23 00:00:00 2019-02-23 08:45:00
4 2019-03-02 00:00:00 2019-03-02 08:45:00
5 2019-03-09 00:00:00 2019-03-09 08:45:00
6 2019-03-16 00:00:00 2019-03-16 08:45:00
Because it uses assignment (<-), this first method can't be used inside a pipe. What does work in a pipe is the update.POSIXt method (see ?DateTimeUpdate), which updates individual components of a date-time, though you have to specify each component (year, month, day) explicitly:
library(lubridate)
bookings %>%
mutate(date_time = update(start_time,
year = year(booking_date),
month = month(booking_date),
day = day(booking_date)))
booking_date start_time date_time
<dttm> <dttm> <dttm>
1 2019-02-09 00:00:00 1899-12-31 08:45:00 2019-02-09 08:45:00
2 2019-02-16 00:00:00 1899-12-31 08:45:00 2019-02-16 08:45:00
3 2019-02-23 00:00:00 1899-12-31 08:45:00 2019-02-23 08:45:00
4 2019-03-02 00:00:00 1899-12-31 08:45:00 2019-03-02 08:45:00
5 2019-03-09 00:00:00 1899-12-31 08:45:00 2019-03-09 08:45:00
6 2019-03-16 00:00:00 1899-12-31 08:45:00 2019-03-16 08:45:00
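As an alternative that also works inside mutate() without naming each component, the time of day can be added to the date arithmetically. This is only a sketch in base R, and it assumes both columns are in UTC and that booking_date is always at midnight, as in the example data. The names secs_of_day and booking_start are made up for illustration.

```r
# Rebuild the first two rows of the example data (UTC, as in the question).
booking_date <- as.POSIXct(c("2019-02-09", "2019-02-16"), tz = "UTC")
start_time   <- as.POSIXct(rep("1899-12-31 08:45:00", 2), tz = "UTC")

# Seconds since midnight. In R, %% returns a non-negative remainder for a
# positive divisor, so this also works for timestamps before 1970.
secs_of_day <- as.numeric(start_time) %% 86400

# Adding a number to a POSIXct shifts it by that many seconds.
booking_start <- booking_date + secs_of_day

format(booking_start, "%Y-%m-%d %H:%M:%S")
# "2019-02-09 08:45:00" "2019-02-16 08:45:00"
```

In a pipe this would read mutate(booking_start = booking_date + as.numeric(start_time) %% 86400).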
Related
I have a data.table that looks like this:
library(dplyr)
library(data.table)
dt <- data.table(ID=c("A001","A002","A003","A004"),start_time=c('2019-06-18 05:18:00','2020-03-04 05:59:00',
'2019-05-10 19:00:00','2020-01-06 22:42:00'),end_time=c('2019-06-18 08:41:00','2020-03-04 06:04:00',
'2019-05-10 19:08:00','2020-01-07 03:10:00'))
ID start_time end_time duration
1: A001 2019-06-18 05:18:00 2019-06-18 08:41:00 203 mins
2: A002 2020-03-04 05:59:00 2020-03-04 06:04:00 5 mins
3: A003 2019-05-10 19:00:00 2019-05-10 19:08:00 8 mins
4: A004 2020-01-06 22:42:00 2020-01-07 03:10:00 268 mins
Duration was simply calculated as
dt$start_time <- as.POSIXct(dt$start_time, tz='UTC')
dt$end_time <- as.POSIXct(dt$end_time, tz='UTC')
dt <- dt %>% mutate(duration = (end_time-start_time))
I need to duplicate rows whose duration extends past the end of the hour in which start_time falls (records that span more than one clock hour). For each duplicated row I need to change the start time (to the beginning of the hour), the end time (to the end of the hour, or the original end time if it's the last row, i.e. the last viewing hour), and the duration accordingly, so that the final output looks like:
dt_expected <- data.table(ID=c("A001","A001","A001","A001","A002","A002","A003","A004","A004","A004","A004","A004","A004"),
start_time=c('2019-06-18 05:18:00','2019-06-18 06:00:00','2019-06-18 07:00:00','2019-06-18 08:00:00', '2020-03-04 05:59:00', '2020-03-04 06:00:00', '2019-05-10 19:00:00',
'2020-01-06 22:42:00', '2020-01-06 23:00:00','2020-01-07 00:00:00','2020-01-07 01:00:00','2020-01-07 02:00:00','2020-01-07 03:00:00'),
end_time=c('2019-06-18 05:59:00','2019-06-18 06:59:00','2019-06-18 07:59:00','2019-06-18 08:41:00','2020-03-04 05:59:00','2020-03-04 06:04:00', '2019-05-10 19:08:00', '2020-01-06 22:59:00','2020-01-06 23:59:00','2020-01-07 00:59:00','2020-01-07 01:59:00', '2020-01-07 02:59:00','2020-01-07 03:10:00'),
duration = c(42,60,60,41,1,4,8,18,60,60,60,60,10))
Note that the record for ID A002 should also be duplicated, as its duration spans two different hours.
ID start_time end_time duration
1: A001 2019-06-18 05:18:00 2019-06-18 05:59:00 42
2: A001 2019-06-18 06:00:00 2019-06-18 06:59:00 60
3: A001 2019-06-18 07:00:00 2019-06-18 07:59:00 60
4: A001 2019-06-18 08:00:00 2019-06-18 08:41:00 41
5: A002 2020-03-04 05:59:00 2020-03-04 05:59:00 1
6: A002 2020-03-04 06:00:00 2020-03-04 06:04:00 4
7: A003 2019-05-10 19:00:00 2019-05-10 19:08:00 8
8: A004 2020-01-06 22:42:00 2020-01-06 22:59:00 18
9: A004 2020-01-06 23:00:00 2020-01-06 23:59:00 60
10: A004 2020-01-07 00:00:00 2020-01-07 00:59:00 60
11: A004 2020-01-07 01:00:00 2020-01-07 01:59:00 60
12: A004 2020-01-07 02:00:00 2020-01-07 02:59:00 60
13: A004 2020-01-07 03:00:00 2020-01-07 03:10:00 10
I think this is pretty close to what you're looking for.
This creates one row per hour for each record, using map from purrr to build an hourly sequence of start times.
Then, for each ID, it determines start_time and end_time using pmin.
First, end_time is the minimum of that row's original end_time and its start_time rounded up to the next hour with ceiling_date. For example, the first row for A001 gets an end_time of 6:00, which is the ceiling_date of 5:18 to the nearest hour and is earlier than the original end_time of 8:41. For the last row for A001, the end_time is 8:41, which is earlier than the ceiling_date time of 9:00.
Then start_time is the minimum of the previous row's end_time and that row's start_time. For example, the second row of A001 gets 6:00, the end_time of the row above, which is earlier than the 6:18 from the sequence generated by map.
Note that one row has 0 minutes for duration - the time fell right on the hour (19:00:00). These could be filtered out.
library(purrr)
library(dplyr)
library(tidyr)
library(lubridate)
dt %>%
rowwise() %>%
mutate(start_time = map(start_time, ~seq.POSIXt(., ceiling_date(end_time, "hour"), by = "hour"))) %>%
unnest(start_time) %>%
group_by(ID) %>%
mutate(end_time = pmin(ceiling_date(start_time, unit = "hour"), end_time),
start_time = pmin(floor_date(lag(end_time, default = first(end_time)), unit = "hour"), start_time),
duration = difftime(end_time, start_time, units = "mins"))
Output
ID start_time end_time duration
<chr> <dttm> <dttm> <drtn>
1 A001 2019-06-18 05:18:00 2019-06-18 06:00:00 42 mins
2 A001 2019-06-18 06:00:00 2019-06-18 07:00:00 60 mins
3 A001 2019-06-18 07:00:00 2019-06-18 08:00:00 60 mins
4 A001 2019-06-18 08:00:00 2019-06-18 08:41:00 41 mins
5 A002 2020-03-04 05:59:00 2020-03-04 06:00:00 1 mins
6 A002 2020-03-04 06:00:00 2020-03-04 06:04:00 4 mins
7 A003 2019-05-10 19:00:00 2019-05-10 19:00:00 0 mins
8 A003 2019-05-10 19:00:00 2019-05-10 19:08:00 8 mins
9 A004 2020-01-06 22:42:00 2020-01-06 23:00:00 18 mins
10 A004 2020-01-06 23:00:00 2020-01-07 00:00:00 60 mins
11 A004 2020-01-07 00:00:00 2020-01-07 01:00:00 60 mins
12 A004 2020-01-07 01:00:00 2020-01-07 02:00:00 60 mins
13 A004 2020-01-07 02:00:00 2020-01-07 03:00:00 60 mins
14 A004 2020-01-07 03:00:00 2020-01-07 03:10:00 10 mins
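The zero-minute row can also be avoided at the source by cutting only at hour boundaries that fall strictly between the start and the end. The following is a rough base R sketch of that idea for a single record, not a drop-in replacement for the pipeline above. split_by_hour is a hypothetical helper name, and the timestamps are assumed to be POSIXct in UTC.

```r
# Hypothetical helper: split one [start, end] interval at full-hour boundaries.
split_by_hour <- function(start, end) {
  # First hour boundary strictly after the start.
  first_cut <- as.POSIXct(trunc(start, "hours")) + 3600
  # Hourly boundaries up to the end (none if the interval stays within one hour).
  cuts <- if (first_cut < end) seq(first_cut, end, by = "hour") else start[0]
  # Drop a boundary that coincides with the end to avoid a zero-length piece.
  cuts <- cuts[cuts < end]
  breaks <- c(start, cuts, end)
  data.frame(start_time = head(breaks, -1), end_time = tail(breaks, -1))
}

# Record A001 from the question.
a001 <- split_by_hour(as.POSIXct("2019-06-18 05:18:00", tz = "UTC"),
                      as.POSIXct("2019-06-18 08:41:00", tz = "UTC"))
a001$duration <- as.numeric(difftime(a001$end_time, a001$start_time, units = "mins"))
a001$duration
# 42 60 60 41

# Record A003 starts exactly on the hour and yields a single 8-minute row.
a003 <- split_by_hour(as.POSIXct("2019-05-10 19:00:00", tz = "UTC"),
                      as.POSIXct("2019-05-10 19:08:00", tz = "UTC"))
nrow(a003)
# 1
```

To apply it to the whole table, the helper could be called once per row (for example with Map) and the pieces bound together with the ID.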
I imported some data from Excel that has separate columns for "Date" and "Time". When I imported the "Time" column, every value came back with the date 1899-12-31 attached, for example 1899-12-31 19:00:00.
I would like to create a new column that would combine the date from the "Date" column and time from the "Time" column so I can do some calculations.
# A tibble: 207 x 2
DoS ToS
<dttm> <dttm>
1 2018-01-27 00:00:00 1899-12-31 19:00:00
2 2018-02-07 00:00:00 1899-12-31 15:45:00
3 2018-02-13 00:00:00 1899-12-31 23:00:00
4 2018-02-15 00:00:00 1899-12-31 13:45:00
5 2018-02-16 00:00:00 1899-12-31 10:00:00
6 2018-02-19 00:00:00 1899-12-31 15:00:00
7 2018-02-20 00:00:00 1899-12-31 15:05:00
8 2018-02-21 00:00:00 1899-12-31 15:00:00
> dput(head(sample, 10))
structure(list(DoS = structure(c(1517011200, 1517961600, 1518480000,
1518652800, 1518739200, 1518998400, 1519084800, 1519171200, 1519257600,
1519862400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ToS = structure(c(-2209006800, -2209018500, -2208992400,
-2209025700, -2209039200, -2209021200, -2209020900, -2209021200,
-2209033800, -2209005000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
Is there some way I can extract the time values and paste them onto the Date column?
Using base R, we can extract date from DoS and time from ToS and combine them together.
transform(sample, Datetime = as.POSIXct(paste(as.Date(DoS), format(ToS, "%T")), tz = "UTC"))
# DoS ToS Datetime
#1 2018-01-27 1899-12-31 19:00:00 2018-01-27 19:00:00
#2 2018-02-07 1899-12-31 15:45:00 2018-02-07 15:45:00
#3 2018-02-13 1899-12-31 23:00:00 2018-02-13 23:00:00
#4 2018-02-15 1899-12-31 13:45:00 2018-02-15 13:45:00
#5 2018-02-16 1899-12-31 10:00:00 2018-02-16 10:00:00
#6 2018-02-19 1899-12-31 15:00:00 2018-02-19 15:00:00
#7 2018-02-20 1899-12-31 15:05:00 2018-02-20 15:05:00
#8 2018-02-21 1899-12-31 15:00:00 2018-02-21 15:00:00
#9 2018-02-22 1899-12-31 11:30:00 2018-02-22 11:30:00
#10 2018-03-01 1899-12-31 19:30:00 2018-03-01 19:30:00
Create two columns in R: one holding a 2019 date and the other holding a time slot, from 9:00 AM to 8:00 PM at one-hour intervals. So in total for a date we should have 11 columns. For example (below)
I am not sure what column type you want, so there are several options below :-)
Here is my solution:
library(lubridate)
library(tidyverse)
start <- ymd_hms("2019-05-01 09:00:00")
end <- start + hm("11:00")
tibble(timestamp = seq.POSIXt(start, end, by = 3600)) %>%
mutate(day = date(timestamp),
time = format(timestamp, format = "%H:%M:%S")) %>%
select(day, time, timestamp)
day time timestamp
<date> <chr> <dttm>
1 2019-05-01 09:00:00 2019-05-01 09:00:00
2 2019-05-01 10:00:00 2019-05-01 10:00:00
3 2019-05-01 11:00:00 2019-05-01 11:00:00
4 2019-05-01 12:00:00 2019-05-01 12:00:00
5 2019-05-01 13:00:00 2019-05-01 13:00:00
6 2019-05-01 14:00:00 2019-05-01 14:00:00
7 2019-05-01 15:00:00 2019-05-01 15:00:00
8 2019-05-01 16:00:00 2019-05-01 16:00:00
9 2019-05-01 17:00:00 2019-05-01 17:00:00
10 2019-05-01 18:00:00 2019-05-01 18:00:00
11 2019-05-01 19:00:00 2019-05-01 19:00:00
12 2019-05-01 20:00:00 2019-05-01 20:00:00
Regards
Paweł
A random date range:
df <- data.frame(
date = seq.Date(Sys.Date() - 6, Sys.Date(), 1)
)
df <- merge(df,expand.grid(date = df$date, time = 9:20))
df <- df[order(df$date, df$time), ]
df$time <- sprintf("%02i:00", df$time)
I have the start of a process, its end, and the process duration.
process_start process_end hourly_process_duration
2019-01-01 00:00:00 2019-01-01 12:00:00 12
2019-01-01 12:00:00 2019-01-01 13:00:00 1
NA NA 11
NA NA 15
2019-01-02 15:00:00 2019-01-02 18:00:00 3
I always have hourly_process_duration. Processes are continuous - when one process ends the next one begins.
I need to replace NA correctly. Like in the example:
process_start process_end hourly_process_duration
2019-01-01 00:00:00 2019-01-01 12:00:00 12
2019-01-01 12:00:00 2019-01-01 13:00:00 1
2019-01-01 13:00:00 2019-01-02 00:00:00 11
2019-01-02 00:00:00 2019-01-02 15:00:00 15
2019-01-02 15:00:00 2019-01-02 18:00:00 3
Here is one option to fill the missing date time
library(dplyr)
library(lubridate)
df1 %>%
mutate(process_start = coalesce(process_start, lag(process_end)),
process_end = coalesce(process_end, lead(process_start))) %>%
mutate_at(vars(process_start, process_end), ymd_hms) %>%
mutate_at(vars(process_start, process_end),
list(~ replace(., is.na(.), floor_date(.[which(is.na(.))+1], "day"))))
# process_start process_end hourly_process_duration
#1 2019-01-01 00:00:00 2019-01-01 12:00:00 12
#2 2019-01-01 12:00:00 2019-01-01 13:00:00 1
#3 2019-01-01 13:00:00 2019-01-02 00:00:00 11
#4 2019-01-02 00:00:00 2019-01-02 15:00:00 15
#5 2019-01-02 15:00:00 2019-01-02 18:00:00 3
data
df1 <- structure(list(process_start = c("2019-01-01 00:00:00",
"2019-01-01 12:00:00",
NA, NA, "2019-01-02 15:00:00"), process_end = c("2019-01-01 12:00:00",
"2019-01-01 13:00:00", NA, NA, "2019-01-02 18:00:00"),
hourly_process_duration = c(12L,
1L, 11L, 15L, 3L)), class = "data.frame", row.names = c(NA, -5L
))
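Since hourly_process_duration is always present and the processes are continuous, the missing times can also be derived from the durations themselves. This is only a sketch in base R. It assumes the first process_start is known, that there are no gaps between processes, and that the times are in UTC. The names first_start, ends and starts are made up for illustration.

```r
df1 <- data.frame(
  process_start = c("2019-01-01 00:00:00", "2019-01-01 12:00:00", NA, NA,
                    "2019-01-02 15:00:00"),
  process_end = c("2019-01-01 12:00:00", "2019-01-01 13:00:00", NA, NA,
                  "2019-01-02 18:00:00"),
  hourly_process_duration = c(12L, 1L, 11L, 15L, 3L),
  stringsAsFactors = FALSE
)

# Assumption: the very first start time is known.
first_start <- as.POSIXct(df1$process_start[1], tz = "UTC")

# Each process ends at the first start plus the running total of hours.
ends <- first_start + cumsum(df1$hourly_process_duration) * 3600
# Each process starts its own duration before it ends.
starts <- ends - df1$hourly_process_duration * 3600

df1$process_start <- starts
df1$process_end <- ends

format(df1$process_start[3:4], "%Y-%m-%d %H:%M:%S")
# "2019-01-01 13:00:00" "2019-01-02 00:00:00"
```

Unlike the floor_date approach, this does not rely on the gap happening to fall on a day boundary, but it overwrites the known rows too, so it only makes sense when the recorded durations are trusted.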
I have a dataframe where I split the datetime column into date and time (two columns). However, when I group by time, I get duplicate time values. To investigate, I ran table() on the time column, and it showed the duplicates too. This is a sample of it:
> table(df$time)
00:00:00 00:00:00 00:15:00 00:15:00 00:30:00 00:30:00
2211 1047 2211 1047 2211 1047
As you can see, after the split one copy of each "unique" value kept a " " inside. Is there an easy way to solve this?
PS: The datatype of the time column is character.
EDIT: Code added
df$datetime <- as.character.Date(df$datetime)
x <- colsplit(df$datetime, ' ', names = c('Date','Time'))
df <- cbind(df, x)
There are a number of approaches. One of them is to use appropriate functions to extract Dates and Times from Datetime column:
df <- data.frame(datetime = seq(
from=as.POSIXct("2018-5-15 0:00", tz="UTC"),
to=as.POSIXct("2018-5-16 24:00", tz="UTC"),
by="30 min") )
head(df$datetime)
#[1] "2018-05-15 00:00:00 UTC" "2018-05-15 00:30:00 UTC" "2018-05-15 01:00:00 UTC" "2018-05-15 01:30:00 UTC"
#[5] "2018-05-15 02:00:00 UTC" "2018-05-15 02:30:00 UTC"
df$Date <- as.Date(df$datetime)
df$Time <- format(df$datetime,"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 01:30:00 2018-05-15 01:30:00
# 5 2018-05-15 02:00:00 2018-05-15 02:00:00
# 6 2018-05-15 02:30:00 2018-05-15 02:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 01:30:00 02:00:00 02:30:00 03:00:00 03:30:00 04:00:00 04:30:00 05:00:00 05:30:00
#3 2 2 2 2 2 2 2 2 2 2 2
#06:00:00 06:30:00 07:00:00 07:30:00 08:00:00 08:30:00 09:00:00 09:30:00 10:00:00 10:30:00 11:00:00 11:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#12:00:00 12:30:00 13:00:00 13:30:00 14:00:00 14:30:00 15:00:00 15:30:00 16:00:00 16:30:00 17:00:00 17:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#18:00:00 18:30:00 19:00:00 19:30:00 20:00:00 20:30:00 21:00:00 21:30:00 22:00:00 22:30:00 23:00:00 23:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#If the data were given as character strings and contain extra spaces the above approach will still work
df <- data.frame(datetime=c("2018-05-15 00:00:00","2018-05-15 00:30:00",
"2018-05-15 01:00:00", "2018-05-15 02:00:00",
"2018-05-15 00:00:00","2018-05-15 00:30:00"),
stringsAsFactors=FALSE)
df$Date <- as.Date(df$datetime)
df$Time <- format(as.POSIXct(df$datetime, tz="UTC"),"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 02:00:00 2018-05-15 02:00:00
# 5 2018-05-15 00:00:00 2018-05-15 00:00:00
# 6 2018-05-15 00:30:00 2018-05-15 00:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 02:00:00
# 2 2 1 1
reshape2::colsplit accepts regular expressions, so you could split on '\s+' which matches 1 or more whitespace characters.
You can find out more about regular expressions in R using ?base::regex. The syntax is generally constant between languages, so you can use pretty much any regex tutorial. Take a look at https://regex101.com/. This site evaluates your regular expressions in real time and shows you exactly what each part is matching. It is extremely helpful!
Keep in mind that in R, as compared to most other languages, you must double the number of backslashes \. So \s (to match 1 whitespace character) must be written as \\s in R.
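To illustrate the doubled backslash, here is a small sketch that splits on runs of whitespace. It uses base R's strsplit rather than reshape2::colsplit so that it runs without extra packages, and the sample strings in x are invented for the example.

```r
# Invented sample: a double space in the second value, a trailing space in the third.
x <- c("2018-05-15 00:00:00", "2018-05-15  00:00:00", "2018-05-15 00:15:00 ")

# trimws() removes leading/trailing whitespace; "\\s+" is the regex \s+,
# i.e. one or more whitespace characters.
parts <- strsplit(trimws(x), "\\s+")

date_part <- vapply(parts, `[`, character(1), 1)
time_part <- vapply(parts, `[`, character(1), 2)

time_part
# "00:00:00" "00:00:00" "00:15:00"
```

With the stray spaces gone, table(time_part) counts each time only once per distinct value.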