I am having trouble converting a time range stored in a column into a readable date-time format for R. How would I go about converting this?
[1] "05:30P -08:00P" "07:00A -09:35A" "08:00A -10:30A" "08:55P -11:00P" "06:00P -06:30P"
c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
If we want to convert to a Datetime, one option is to split at the - into two columns and then use as.POSIXct to do the conversion:
library(stringr)
library(dplyr)
library(tidyr)
str_replace_all(str1, "([AP])", "\\1M") %>%
tibble(str1 = .) %>%
separate(str1, into = c('start', 'end'), sep="\\s*-") %>%
mutate(across(c(start, end), ~ as.POSIXct(., format = '%I:%M %p')))
# A tibble: 5 x 2
# start end
# <dttm> <dttm>
#1 2020-08-19 17:30:00 2020-08-19 20:00:00
#2 2020-08-19 07:00:00 2020-08-19 09:35:00
#3 2020-08-19 08:00:00 2020-08-19 10:30:00
#4 2020-08-19 20:55:00 2020-08-19 23:00:00
#5 2020-08-19 18:00:00 2020-08-19 18:30:00
Or using lubridate
library(lubridate)
str_replace_all(str1, "([AP])", "\\1M") %>%
tibble(str1 = .) %>%
separate(str1, into = c('start', 'end'), sep="\\s*-") %>%
mutate(across(c(start, end), ~ parse_date_time(., 'IMp')))
data
str1 <- c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
A base R attempt, using strcapture to separate each timestamp into two parts:
dr <- c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
tms <- strcapture(r"((\d+:\d+[AP])[- ]+(\d+:\d+[AP]))", dr, proto=list(start="",end=""))
tms[] <- lapply(tms, function(x) as.POSIXct(paste0(x, "M"), format="%I:%M%p", tz="UTC"))
# start end
#1 2020-08-20 17:30:00 2020-08-20 20:00:00
#2 2020-08-20 07:00:00 2020-08-20 09:35:00
#3 2020-08-20 08:00:00 2020-08-20 10:30:00
#4 2020-08-20 20:55:00 2020-08-20 23:00:00
#5 2020-08-20 18:00:00 2020-08-20 18:30:00
Related
I have a start and end date for individuals, and I need to estimate whether the time passed from start to end is within 2 days or 3 or more days. These dates are assigned to record IDs. How can I filter the records that ended within 2 days of the start date, and the ones that ended 3 or more days later?
Record_id <- c("2245","6728","5122","9287")
Start <- c("2021-01-13 CST" ,"2021-01-21 CST" ,"2021-01-17 CST","2021-01-13 CST")
End <- c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST","2021-01-25 15:00:00 CST")
I tried using
elapsed.time <- DF$start %--% DF$End
time.duration <- as.duration(elapsed.time)
but I am getting an error because the End date contains an hour component. Thank you.
Here's a dplyr pipe that will include both constraints (2 and 3 days):
df %>%
mutate(across(Start:End, as.POSIXct)) %>%
mutate(d = difftime(End, Start, units = "days")) %>%
filter(!between(difftime(End, Start, units = "days"), 2, 3))
# # A tibble: 4 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750000 days
# 2 6728 2021-01-21 00:00:00 2021-01-22 16:00:00 1.666667 days
# 3 5122 2021-01-17 00:00:00 2021-01-22 13:00:00 5.541667 days
# 4 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625000 days
I included mutate(d = ...) so that we can see what the actual differences are. If you were looking to remove those rows instead (i.e., keep only the 2-3 day ones), then use filter(between(..)) (no !).
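For completeness, a minimal sketch of that inverse filter, using the same df (with this sample data it returns zero rows, as noted next):
df %>%
  mutate(across(Start:End, as.POSIXct)) %>%
  mutate(d = difftime(End, Start, units = "days")) %>%
  filter(between(difftime(End, Start, units = "days"), 2, 3))  # keep only the 2-3 day rows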
In the case of the data you provided, all observations are less than 2 or more than 3 days. I'll expand this range so that we can see it in effect:
df %>%
mutate(across(Start:End, as.POSIXct)) %>%
mutate(d = difftime(End, Start, units = "days")) %>%
filter(!between(difftime(End, Start, units = "days"), 1, 6))
# # A tibble: 2 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750 days
# 2 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625 days
Data
df <- structure(list(Record_id = c("2245", "6728", "5122", "9287"), Start = c("2021-01-13 CST", "2021-01-21 CST", "2021-01-17 CST", "2021-01-13 CST"), End = c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST", "2021-01-25 15:00:00 CST")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
I just converted the characters to date-times with lubridate and then subtracted the dates. What you get back are days. I then filter for the records that ended within 2 days.
library(dplyr)
library(lubridate)

Record_id <- c("2245", "6728", "5122", "9287")
Start <- c("2021-01-13 CST", "2021-01-21 CST", "2021-01-17 CST", "2021-01-13 CST")
End <- c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST", "2021-01-25 15:00:00 CST")
df <- tibble(x = Record_id, y = Start, z = End)

df %>%
  mutate_at(vars(y:z), ~ as_datetime(.)) %>%  # parse both columns as date-times
  mutate(diff = as.numeric(z - y)) %>%        # elapsed time in days
  filter(diff <= 2)
I'm setting up a large dataset for time-series analysis. The data has a start date-time and an end date-time.
The end time was entered as 24:00:00, which I have now converted to 00:00:00. I want to move every end time that finishes at 00:00:00 forward by one day.
# Current database
id <- c("m1","m1","m1","m2","m2","m2","m3","m4","m4")
x <- c("2020-01-03 10:00:00","2020-01-03 16:00:00","2020-01-03 19:20:00",
       "2020-01-05 10:00:00","2020-01-05 15:20:00","2020-01-05 20:50:00",
       "2020-01-06 06:30:00","2020-01-08 06:30:00","2020-01-08 07:50:00")
start <- strptime(x, "%Y-%m-%d %H:%M:%S")
y <- c("2020-01-03 16:00:00","2020-01-03 19:20:00","2020-01-03 00:00:00",
       "2020-01-05 15:20:00","2020-01-05 20:50:00","2020-01-05 00:00:00",
       "2020-01-06 07:40:00","2020-01-08 07:50:00","2020-01-08 08:55:00")
end <- strptime(y, "%Y-%m-%d %H:%M:%S")
mydata <- data.frame(id, start, end)
# Output
id2 <- c("m1","m1","m1","m2","m2","m2","m3","m4","m4")
x2 <- c("2020-01-03 10:00:00","2020-01-03 16:00:00","2020-01-03 19:20:00",
        "2020-01-05 10:00:00","2020-01-05 15:20:00","2020-01-05 20:50:00",
        "2020-01-06 06:30:00","2020-01-08 06:30:00","2020-01-08 07:50:00")
start2 <- strptime(x2, "%Y-%m-%d %H:%M:%S")
y2 <- c("2020-01-03 16:00:00","2020-01-03 19:20:00","2020-01-04 00:00:00",
        "2020-01-05 15:20:00","2020-01-05 20:50:00","2020-01-06 00:00:00",
        "2020-01-06 07:40:00","2020-01-08 07:50:00","2020-01-08 08:55:00")
end2 <- strptime(y2, "%Y-%m-%d %H:%M:%S")
mydata2 <- data.frame(id2, start2, end2)
I expect the output for rows 3 and 6 to have the date moved forward by one day. Is it the if() function I need, or is there a simpler way?
if seems pretty straightforward. ifelse is vectorized:
library(lubridate)
# push ends that fall exactly at midnight forward by one day
mydata$end2 <- as_datetime(ifelse(format(mydata$end, "%H:%M:%S") == "00:00:00",
                                  mydata$end + days(1), mydata$end), tz = Sys.timezone())
mydata
# id start end end2
# 1 m1 2020-01-03 10:00:00 2020-01-03 16:00:00 2020-01-03 16:00:00
# 2 m1 2020-01-03 16:00:00 2020-01-03 19:20:00 2020-01-03 19:20:00
# 3 m1 2020-01-03 19:20:00 2020-01-03 00:00:00 2020-01-04 00:00:00
# 4 m2 2020-01-05 10:00:00 2020-01-05 15:20:00 2020-01-05 15:20:00
# 5 m2 2020-01-05 15:20:00 2020-01-05 20:50:00 2020-01-05 20:50:00
# 6 m2 2020-01-05 20:50:00 2020-01-05 00:00:00 2020-01-06 00:00:00
# 7 m3 2020-01-06 06:30:00 2020-01-06 07:40:00 2020-01-06 07:40:00
# 8 m4 2020-01-08 06:30:00 2020-01-08 07:50:00 2020-01-08 07:50:00
# 9 m4 2020-01-08 07:50:00 2020-01-08 08:55:00 2020-01-08 08:55:00
As you asked for a "simpler way": the lubridate package does this automatically for you (even for times over 24:00:00). If you are not familiar with it, check out the cheatsheet on the RStudio website.
Date-times ending with 00:00:00 will stay on the same day, while date-times ending with 24:00:00 will leap one day forward. Some examples:
library(lubridate)
ymd_hms("2019-07-30 00:00:00")
[1] "2019-07-30 UTC"
ymd_hms("2019-07-30 24:00:00")
[1] "2019-07-31 UTC"
ymd_hms("2019-07-30 24:01:05")
[1] "2019-07-31 00:01:05 UTC"
I really recommend using this package, as it makes working with date-times much less of a hassle. There is a small trade-off of performance for consistency, but I think it is not an issue in most cases.
If the data is in POSIXct format, then adding 86400 (the number of seconds in a day) is equivalent to adding a day. Instead of using an if statement, you can vectorize it:
library(lubridate)
my_hours <- rep(0, nrow(mydata))
my_hours[which(hour(mydata$end) == 0)] <- 86400  # a full day in seconds wherever the end time is midnight
mydata$end <- mydata$end + my_hours
mydata$end == mydata2$end2
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
I have a raw data frame that looks like this:
test
id class time
1 1 start 2019-06-20 00:00:00
2 1 end 2019-06-20 00:05:00
3 1 start 2019-06-20 00:10:00
4 1 end 2019-06-20 00:15:00
5 2 end 2019-06-20 00:20:00
6 2 start 2019-06-20 00:25:00
7 2 end 2019-06-20 00:30:00
8 2 start 2019-06-20 00:35:00
9 3 end 2019-06-20 00:40:00
10 3 start 2019-06-20 00:45:00
11 3 end 2019-06-20 00:50:00
12 3 start 2019-06-20 00:55:00
My goal is to map the values to an output table, for each id, only where a start is immediately followed by an end (when ordered by time). The output would therefore look like:
output
id start end
1 1 2019-06-20 00:00:00 2019-06-20 00:05:00
2 1 2019-06-20 00:10:00 2019-06-20 00:15:00
3 2 2019-06-20 00:25:00 2019-06-20 00:30:00
4 3 2019-06-20 00:45:00 2019-06-20 00:50:00
I have tried with the dplyr package, but
test %>% group_by(id) %>% arrange(time) %>% starts_with("start")
Error in starts_with(., "start") : is_string(match) is not TRUE
starts_with always throws an error. I would like to avoid writing a for loop because I am sure this can be handled with a few chained operations. Any ideas for a workaround in dplyr or data.table?
One possible approach:
test[, {
  si <- which(class == "start" & shift(class, -1L) == "end")  # start rows immediately followed by an end
  .(start = time[si], end = time[si + 1L])
}, by = .(id)]
output:
   id               start                 end
1:  1 2019-06-20 00:00:00 2019-06-20 00:05:00
2:  1 2019-06-20 00:10:00 2019-06-20 00:15:00
3:  2 2019-06-20 00:25:00 2019-06-20 00:30:00
4:  3 2019-06-20 00:45:00 2019-06-20 00:50:00
data:
library(data.table)
test <- fread("id,class,time
1,start,2019-06-20 00:00:00
1,end,2019-06-20 00:05:00
1,start,2019-06-20 00:10:00
1,end,2019-06-20 00:15:00
2,end,2019-06-20 00:20:00
2,start,2019-06-20 00:25:00
2,end,2019-06-20 00:30:00
2,start,2019-06-20 00:35:00
3,end,2019-06-20 00:40:00
3,start,2019-06-20 00:45:00
3,end,2019-06-20 00:50:00
3,start,2019-06-20 00:55:00")
I usually use cumsum() in these cases:
library(dplyr)
library(tidyr)  # for spread()

test %>%
  group_by(id) %>%
  arrange(time, .by_group = TRUE) %>%          # .by_group sorts within each id
  mutate(flag = cumsum(class == "start")) %>%  # flag increments at every "start"
  group_by(id, flag) %>%
  filter(n() == 2L) %>%                        # keep only complete start/end pairs
  ungroup() %>%
  spread(class, time) %>%
  select(-flag)
Using dplyr and tidyr, we can first filter the rows which follow the "start"/"end" pattern, create groups of 2 rows, and spread to wide format.
library(dplyr)
library(tidyr)
test %>%
group_by(id) %>%
filter(class == "start" & lead(class) == "end" |
class == "end" & lag(class) == "start") %>%
group_by(group = gl(n()/2, 2)) %>%
spread(class, time) %>%
ungroup() %>%
select(-group) %>%
select(id, start, end)
# id start end
# <int> <dttm> <dttm>
#1 1 2019-06-20 00:00:00 2019-06-20 00:05:00
#2 1 2019-06-20 00:10:00 2019-06-20 00:15:00
#3 2 2019-06-20 00:25:00 2019-06-20 00:30:00
#4 3 2019-06-20 00:45:00 2019-06-20 00:50:00
You can keep each start row plus the end immediately after it (if any), then use dcast to switch from long to wide form:
test[,
if (.N >= 2) head(.SD, 2)
, by=.(g = rleid(id, cumsum(class=="start"))), .SDcols=names(test)][,
dcast(.SD, id + g ~ factor(class, levels=c("start", "end")), value.var="time")
]
id g start end
1: 1 1 2019-06-20 00:00:00 2019-06-20 00:05:00
2: 1 2 2019-06-20 00:10:00 2019-06-20 00:15:00
3: 2 4 2019-06-20 00:25:00 2019-06-20 00:30:00
4: 3 7 2019-06-20 00:45:00 2019-06-20 00:50:00
rleid and cumsum are used to find the sequences, and factor is needed to tell dcast the column order.
Side note: This is essentially the same as #cheetahfly's answer (I didn't realize it when I posted): since the cumsum is increasing, it is sufficient to group by id + cumsum, and there's no need to use rleid (which is for tracking runs of values). The only difference is that my approach would keep a run like start, end, end, while the other answer would filter it out with the n() == 2 check.
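For reference, a minimal sketch of that simplified grouping, using the same test data.table as above (the helper column name g is just illustrative):
library(data.table)
test[, g := cumsum(class == "start")]                    # running count of "start" rows
res <- test[, if (.N >= 2) head(.SD, 2), by = .(id, g)]  # first start/end pair in each run
dcast(res, id + g ~ factor(class, levels = c("start", "end")), value.var = "time")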
Given the dataframe below:
class timestamp
1 A 2019-02-14 15:00:29
2 A 2019-01-27 17:59:53
3 A 2019-01-27 18:00:00
4 B 2019-02-02 18:00:00
5 C 2019-03-08 16:00:37
Observations 2 and 3 point to the same event. How do I remove rows belonging to the same class if another timestamp within 2 minutes already exists?
Desired output:
class timestamp
1 A 2019-02-14 15:00:00
2 A 2019-01-27 18:00:00
3 B 2019-02-02 18:00:00
4 C 2019-03-08 16:00:00
round(, c("mins")) can be used to get rid of the seconds component, but if two timestamps of the same event are far enough apart they can be rounded to different minutes, still leaving distinct timestamps.
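For example, a small illustration of that problem with made-up values:
round(as.POSIXct("2019-01-27 17:58:40"), "mins")  # rounds up to 17:59:00
round(as.POSIXct("2019-01-27 18:00:00"), "mins")  # stays at 18:00:00 -- still a different minute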
EDIT
I think I over-complicated the problem in my first attempt. What would work for your case is to round the times to 2-minute intervals, which we can do using round_date from lubridate.
library(lubridate)
library(dplyr)
df %>%
mutate(timestamp = round_date(as.POSIXct(timestamp), unit = "2 minutes")) %>%
group_by(class) %>%
filter(!duplicated(timestamp))
# class timestamp
# <chr> <dttm>
#1 A 2019-02-14 15:00:00
#2 A 2019-01-27 18:00:00
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:00
Original Attempt
We can first convert the timestamp to a POSIXct object, then arrange the rows by class and timestamp, use cut to divide them into "2 mins" intervals, and then remove the duplicates.
library(dplyr)
df %>%
mutate(timestamp = as.POSIXct(timestamp)) %>%
arrange(class, timestamp) %>%
group_by(class) %>%
filter(!duplicated(as.numeric(cut(timestamp, breaks = "2 mins")), fromLast = TRUE))
# class timestamp
# <chr> <dttm>
#1 A 2019-01-27 18:00:00
#2 A 2019-02-14 15:00:29
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:37
Here, I haven't changed or rounded the timestamp column and have kept it as is, but it would be simple to round it if you use cut inside mutate (sketched below). Also, if you want to keep the first entry, such as 2019-01-27 17:59:53, then remove the fromLast = TRUE argument.
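A minimal sketch of that cut-inside-mutate variant, using the same df (the as.character step is only there to re-parse the factor labels that cut returns):
df %>%
  mutate(timestamp = as.POSIXct(timestamp)) %>%
  arrange(class, timestamp) %>%
  group_by(class) %>%
  mutate(timestamp = as.POSIXct(as.character(cut(timestamp, breaks = "2 mins")))) %>%  # snap to the start of each 2-minute bin
  filter(!duplicated(timestamp, fromLast = TRUE)) %>%
  ungroup()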
I'm trying to apply the anytime() function from the anytime package in a dplyr chain to all columns ending with "Date".
However, I'm getting this error:
Error: Unsupported Type
when I use
invoicePayment <- head(raw.InvoicePayment) %>%
mutate_at(ends_with("Date"), funs(anytime))
but it's fine when I use
invoicePayment <- head(raw.InvoicePayment) %>%
select(ends_with("Date")) %>%
mutate_at(ends_with("Date"), funs(anytime))
Any help is appreciated,
Thanks,
We may need to wrap the column selection with vars():
library(anytime)
library(dplyr)
df1 %>%
mutate_at(vars(ends_with("Date")), anytime)
# col1 col2_Date col3_Date
#1 1 2017-06-07 05:30:00 2017-06-07 05:30:00
#2 2 2017-06-08 05:30:00 2017-06-06 05:30:00
#3 3 2017-06-09 05:30:00 2017-06-05 05:30:00
#4 4 2017-06-10 05:30:00 2017-06-04 05:30:00
#5 5 2017-06-11 05:30:00 2017-06-03 05:30:00
data
df1 <- data.frame(col1 = 1:5, col2_Date = Sys.Date() + 0:4, col3_Date = Sys.Date() - 0:4)
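Note that in more recent dplyr versions mutate_at() is superseded by across(); a minimal sketch of the equivalent call with the same df1:
df1 %>%
  mutate(across(ends_with("Date"), anytime))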